1 Preamble

This is a small excerpt of the data analysis I conducted for my dissertation project, using data I collected experimentally online. The aim is to demonstrate my skills in

  • R Programming
  • Statistical Data Analysis
  • Data Transformation
  • Data Visualization
  • Effective Communication of Complex Concepts and Insights from Data

I am keeping the use of technical linguistics terms to a minimum so that readers without a background in linguistics can also follow.

2 Introduction

My dissertation revolves around stress in complex words such as identifiable. Clearly this word consists of the root identify plus the suffix -able. For this reason linguists call a word like identifiable a complex word.

When stressing such a word a speaker generally has two options: they may use the stress of the root word (in this case idéntify) when assigning stress to the complex word. This would mean they say idéntifiable. Another option would be to disregard the stress of the root word and just take identifiable as one unit without seeing identify in it. This may result in a pronunciation such as identifíable, for instance. This pronunciation clearly shows no relationship to the stress of the root word idéntify. If you observe speakers of English, you will find that some speakers say idéntifiable (establishing a stress relationship to the root word) while others say identifíable (making no reference to the stress of the root word). The goal of my dissertation project was to find out where this stress variation in complex words comes from and why some speakers use the stress of the root word and others do not.

In order to answer this question, I needed to learn many things about the speakers who participated in my study. One thing I needed to know was how they tended to process complex words consisting of a root and a suffix just like identifiable. I was interested in finding out whether individual speakers processed these words differently so that I could ultimately explore the effect this may have on the way they stress morphologically complex words. In this demonstration I will focus on showing the results of the analysis of the priming task.

3 Method

I set up a masked-priming experiment with a lexical decision task to gain insight into this. In this experimental design speakers would see a so-called forward mask of hashtags (#) and then very briefly (50ms) see a prime word (for example celebration) (hence “masked priming”). Afterwards they saw a target word (for example celebrate) which would stay on the screen longer. It was then their task to indicate whether they thought this target word was a real word of the English language or not (hence “lexical decision task”). The idea behind this design is that speakers who tend to process their complex words in two parts (root + affix) already see and activate the word celebrate in their minds when I show them the complex word celebration. When I show them only the word celebrate afterwards, these speakers should be able to make a decision about its word status (i.e. react to this stimulus) faster because celebration already co-activated celebrate for them. Speakers who do not process their complex words in two parts, on the other hand, and only activate celebration in their minds when I show them celebration without also activating celebrate should react more slowly to celebrate than the former group of speakers.

In order to test for the presence of these processing effects I had to present speakers with test stimuli, distraction stimuli, and control stimuli:

  • Morphologically Related Stimuli: celebration - celebrate. The root word of celebration is celebrate. The two words are related via the form and meaning of this root word. That is why linguists call them morphologically related words. These are the test stimuli.

  • Orthographically Related Stimuli: intensity - interfere. Orthography refers to spelling. As you can see these two words are spelled somewhat similarly so we will call them orthographically related here. But they are clearly not morphologically related like celebration and celebrate. These are distraction/control stimuli.

  • Unrelated Stimuli: attributable - irritate. These two words clearly have nothing to do with each other at all. They are neither morphologically nor orthographically related. These are control stimuli.

I measured speakers’ reaction times to these three different types of stimuli. In total the experiment presented speakers with 60 different word pairs. I expect that speakers who break complex words down as they process them should react fastest to the morphologically related stimuli, second fastest to the orthographically related stimuli, and slowest to the unrelated stimuli. Speakers who do not break down the complex words to process them likely also will not show any major differences in their reaction times to the three different stimulus groups.

4 Preparing the Analysis

4.1 Setting Up and Loading the Data

I like to create a clean slate at the start of all my scripts, so we will first clear any plots, the console, and the environment.

graphics.off() #clear plots

cat("\014") #clear console
rm(list=ls()) #clear environment

Let’s first load the tidyverse meta-package, which provides the dplyr, stringr, and ggplot2 functions used throughout this script, and set our working directory. Please adjust the setwd() call for running the code on your machine and specify the file path where you saved the data frames “MetaData.csv” and “MorphSens.csv”.

library(tidyverse) #load dplyr, stringr, ggplot2, and friends
setwd("C:/Users/Tammy/Coding Portfolio/R_Project")

I also want to use custom color palettes for plotting later so I will specify my color palettes here.

TwoColors <- c("#7BCCC4","#0868AC")
ThreeColors <- c("#084081","#2B8CBE","#7BCCC4")

We will need two data frames: MetaData.csv and MorphSens.csv.

Let’s load them both and assign them to objects for efficient handling.

Meta <- read.csv("MetaData.csv", 
                 header = TRUE, 
                 sep = ",", 
                 stringsAsFactors = TRUE)

Morph <- read.csv("MorphSens.csv", 
                  header = TRUE, 
                  sep = ",",
                  stringsAsFactors = TRUE)


4.2 Inspecting the Data Frames

Meta contains all the socio-demographic data about my speakers. Morph contains the data collected using the masked priming lexical decision task described above.

Now we can inspect the two data frames using the head() function. Because both data frames are quite big, I am going to restrict the output of head() to just one line for each data frame. I am also going to use the dim() function to get the number of rows and columns in each data frame.

dim(Meta)
## [1] 156  66
head(Meta,1)
##       Participant ParticipantShort RecSession Comments   Status DeviceType
## 1 Participant_001            P_001     234733          accepted    desktop
##   AudioEquip VSTScore VSTCorrRespNum ARTScore Age CountryOfBirth
## 1 JustDevice    13600             68        2  26 United Kingdom
##   CurrentCountryOfResidence CurrentCountryLabvanced    Nationality
## 1            United Kingdom                 England United Kingdom
##                                                                                                 CurrentUKAreaResidence
## 1 South East, England (Berkshire, Buckinghamshire, and Oxfordshire, Surrey, Sussex, Kent, Hampshire and Isle of Wight)
##   CurrentUKAreaResSHORT CurrentLivingLabvanced CurrentRegionLabvanced
## 1                  SE-E     Lancaster, England             North_West
##   CurrentRegionSimpLabv CurrentlyLivesInUKLabvanced GrowUpCountryLabv
## 1                 North                         Yes           England
##   GrowUpDetailLabv GrowUpRegionLabv GrowUpRegionSimpLabv GrewUpInUKLabvanced
## 1    England, Kent       South_East                South                 Yes
##   EmploymentStatus EmploymentStatusScored HouseholdIncomeGBP PersonalIncomeGBP
## 1            Other                      0    10,000 - 15,999   10,000 - 19,999
##        CurrentEducationProgram CurrentEducationProgramScored StudentStatus
## 1 Doctorate degree (PhD/other)                             5           Yes
##   HighestEduLabvanced HighestEduLabvancedScored HighestEduObtainedLabvanced
## 1        Master-Level                         3                        2019
##   YearsOfSchoolingLabvanced FirstLang FirstLangLabvanced
## 1                        11   English            English
##   FirstLangLabvancedDetail FluentLangs FluentLangsBinary NumberFluentLangs
## 1                  English     English      only_English                 1
##   ForeignLangsLabv NumberForeignLangsLabv ForeignLangsYrsStudLabv
## 1             none                      0                       0
##   ForeignLangsHrsProdLabv ForeignLangsHrsPercepLabv LangStudLabv LangProfLabv
## 1                       0                         0           no           no
##   GenderIdentity    Sex   Handedness HearingDifficulties LiteracyDifficulties
## 1         Female Female Right-handed                  No                  Yes
##   Vision FictionReadingDetailLabv FictionReadingScored
## 1    Yes         1-3 days a month                    4
##                NewspaperDetailLabv NewspaperNumberLabv NewspWklyMinsLabv
## 1 TheSun_DailyMail_EveningStandard                   3        60-90 mins
##   NewspWklyMinsScored NonFicTimeDetailLabv NonFicTimeScored MusicWklyHrsLabv  X
## 1                   3     1-3 days a month                5                1 NA
dim(Morph)
## [1] 9360   36
head(Morph,1)
##       Participant RecSession   Status DeviceType TrialNr TrialID LabvRT
## 1 Participant_001     234733 accepted    desktop       4       1   2145
##   ParticipantResponse    Prime Target NeighborsPrime NumNeighborsPrime
## 1                   A marmable  GWOAP   not_relevant                 0
##   NeighborsTarget NumNeighborsTarget FamilySizePrime FamilySizeTarget
## 1    not_relevant                  0               0                0
##   LevenshteinDistPT_caseSensitive LevenshteinDistPT_caseInsensitive
## 1                               8                                 7
##   ParticipantRespScored LexicalDecision LexicalDecisionYN TargetStatus
## 1               correct               a                 n        nonce
##   Morphological Orthographical Condition Control BNCPrimeFreq BNCTargetFreq
## 1       morph_n        ortho_n      mnon   con_n            0             0
##   BNC14PrimeFreq BNC14TargetFreq Suffix PrimeLength TargetLength
## 1              0               0   able           8            5
##   CommentParticipant TimeDelayOffset_MilSecs TimeMeasureSTD_inMilSecs
## 1                                        0.6                      0.9

As you can see we have quite a large number of variables in each data frame and Morph has a large number of observations at 9360. Meta is smaller at only 156 observations. Not all of the variables shown here will be relevant for our analysis. I will explain each variable in more detail before presenting analysis results that rely on it. We also have categorical variables (see Meta$Status) as well as numerical variables (see Morph$LabvRT) in each of our two data frames.

We will merge the data frames so that we can analyze the socio-demographic data and the experimental data together effectively and see how the speakers’ socio-demographic characteristics relate to their performance on the experimental task. I will use the inner_join() function from the dplyr package and assign the result of the join to the object MS. Every participant has a unique ID assigned to them in the variable Participant that is consistent across all data frames in my dissertation project. So we will use this variable to join the data frames.

MS <- inner_join(Morph, Meta, "Participant")
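As a side note, a minimal toy example (hypothetical data of my own, not the dissertation data) shows how inner_join() matches rows on the key column, and why columns that exist in both data frames but are not part of the key come out with suffixes:

```r
library(dplyr)

#two tiny frames sharing the key "Participant" and a non-key column "Status"
left  <- data.frame(Participant = c("P1", "P2"), Status = c("accepted", "accepted"))
right <- data.frame(Participant = c("P1", "P2"), Status = c("accepted", "rejected"))

joined <- inner_join(left, right, by = "Participant")
colnames(joined) #"Participant" "Status.x" "Status.y"
```

The .x/.y suffixes mark which source frame each duplicate column came from, which is exactly the situation we will have to clean up next.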

4.3 Cleaning the Data

Let’s check the column names of the merged data frame MS.

colnames(MS)
##   [1] "Participant"                       "RecSession.x"                     
##   [3] "Status.x"                          "DeviceType.x"                     
##   [5] "TrialNr"                           "TrialID"                          
##   [7] "LabvRT"                            "ParticipantResponse"              
##   [9] "Prime"                             "Target"                           
##  [11] "NeighborsPrime"                    "NumNeighborsPrime"                
##  [13] "NeighborsTarget"                   "NumNeighborsTarget"               
##  [15] "FamilySizePrime"                   "FamilySizeTarget"                 
##  [17] "LevenshteinDistPT_caseSensitive"   "LevenshteinDistPT_caseInsensitive"
##  [19] "ParticipantRespScored"             "LexicalDecision"                  
##  [21] "LexicalDecisionYN"                 "TargetStatus"                     
##  [23] "Morphological"                     "Orthographical"                   
##  [25] "Condition"                         "Control"                          
##  [27] "BNCPrimeFreq"                      "BNCTargetFreq"                    
##  [29] "BNC14PrimeFreq"                    "BNC14TargetFreq"                  
##  [31] "Suffix"                            "PrimeLength"                      
##  [33] "TargetLength"                      "CommentParticipant"               
##  [35] "TimeDelayOffset_MilSecs"           "TimeMeasureSTD_inMilSecs"         
##  [37] "ParticipantShort"                  "RecSession.y"                     
##  [39] "Comments"                          "Status.y"                         
##  [41] "DeviceType.y"                      "AudioEquip"                       
##  [43] "VSTScore"                          "VSTCorrRespNum"                   
##  [45] "ARTScore"                          "Age"                              
##  [47] "CountryOfBirth"                    "CurrentCountryOfResidence"        
##  [49] "CurrentCountryLabvanced"           "Nationality"                      
##  [51] "CurrentUKAreaResidence"            "CurrentUKAreaResSHORT"            
##  [53] "CurrentLivingLabvanced"            "CurrentRegionLabvanced"           
##  [55] "CurrentRegionSimpLabv"             "CurrentlyLivesInUKLabvanced"      
##  [57] "GrowUpCountryLabv"                 "GrowUpDetailLabv"                 
##  [59] "GrowUpRegionLabv"                  "GrowUpRegionSimpLabv"             
##  [61] "GrewUpInUKLabvanced"               "EmploymentStatus"                 
##  [63] "EmploymentStatusScored"            "HouseholdIncomeGBP"               
##  [65] "PersonalIncomeGBP"                 "CurrentEducationProgram"          
##  [67] "CurrentEducationProgramScored"     "StudentStatus"                    
##  [69] "HighestEduLabvanced"               "HighestEduLabvancedScored"        
##  [71] "HighestEduObtainedLabvanced"       "YearsOfSchoolingLabvanced"        
##  [73] "FirstLang"                         "FirstLangLabvanced"               
##  [75] "FirstLangLabvancedDetail"          "FluentLangs"                      
##  [77] "FluentLangsBinary"                 "NumberFluentLangs"                
##  [79] "ForeignLangsLabv"                  "NumberForeignLangsLabv"           
##  [81] "ForeignLangsYrsStudLabv"           "ForeignLangsHrsProdLabv"          
##  [83] "ForeignLangsHrsPercepLabv"         "LangStudLabv"                     
##  [85] "LangProfLabv"                      "GenderIdentity"                   
##  [87] "Sex"                               "Handedness"                       
##  [89] "HearingDifficulties"               "LiteracyDifficulties"             
##  [91] "Vision"                            "FictionReadingDetailLabv"         
##  [93] "FictionReadingScored"              "NewspaperDetailLabv"              
##  [95] "NewspaperNumberLabv"               "NewspWklyMinsLabv"                
##  [97] "NewspWklyMinsScored"               "NonFicTimeDetailLabv"             
##  [99] "NonFicTimeScored"                  "MusicWklyHrsLabv"                 
## [101] "X"

The output of colnames() shows that we now have some duplicate variables ending in either .x or .y, which the join created because both data frames contained these columns. Let’s get rid of the duplicates.

MS <- MS %>% 
  rename_at(
    vars(ends_with(".x")), #find all variables that end in .x
    ~str_replace(., "\\..$","") #replace .x with nothing
  ) %>% 
  select_at( 
    vars(-ends_with(".y")) #find all variables that end in .y and remove them
  )

We can also get rid of the variable X. It is an empty column (note that it already appeared at the end of the head() output for Meta above) and contains no useful information.

summary(MS$X)
##    Mode    NA's 
## logical    9360
MS$X <- NULL

I will check my work one more time to make sure everything is really the way I want it.

colnames(MS)
##  [1] "Participant"                       "RecSession"                       
##  [3] "Status"                            "DeviceType"                       
##  [5] "TrialNr"                           "TrialID"                          
##  [7] "LabvRT"                            "ParticipantResponse"              
##  [9] "Prime"                             "Target"                           
## [11] "NeighborsPrime"                    "NumNeighborsPrime"                
## [13] "NeighborsTarget"                   "NumNeighborsTarget"               
## [15] "FamilySizePrime"                   "FamilySizeTarget"                 
## [17] "LevenshteinDistPT_caseSensitive"   "LevenshteinDistPT_caseInsensitive"
## [19] "ParticipantRespScored"             "LexicalDecision"                  
## [21] "LexicalDecisionYN"                 "TargetStatus"                     
## [23] "Morphological"                     "Orthographical"                   
## [25] "Condition"                         "Control"                          
## [27] "BNCPrimeFreq"                      "BNCTargetFreq"                    
## [29] "BNC14PrimeFreq"                    "BNC14TargetFreq"                  
## [31] "Suffix"                            "PrimeLength"                      
## [33] "TargetLength"                      "CommentParticipant"               
## [35] "TimeDelayOffset_MilSecs"           "TimeMeasureSTD_inMilSecs"         
## [37] "ParticipantShort"                  "Comments"                         
## [39] "AudioEquip"                        "VSTScore"                         
## [41] "VSTCorrRespNum"                    "ARTScore"                         
## [43] "Age"                               "CountryOfBirth"                   
## [45] "CurrentCountryOfResidence"         "CurrentCountryLabvanced"          
## [47] "Nationality"                       "CurrentUKAreaResidence"           
## [49] "CurrentUKAreaResSHORT"             "CurrentLivingLabvanced"           
## [51] "CurrentRegionLabvanced"            "CurrentRegionSimpLabv"            
## [53] "CurrentlyLivesInUKLabvanced"       "GrowUpCountryLabv"                
## [55] "GrowUpDetailLabv"                  "GrowUpRegionLabv"                 
## [57] "GrowUpRegionSimpLabv"              "GrewUpInUKLabvanced"              
## [59] "EmploymentStatus"                  "EmploymentStatusScored"           
## [61] "HouseholdIncomeGBP"                "PersonalIncomeGBP"                
## [63] "CurrentEducationProgram"           "CurrentEducationProgramScored"    
## [65] "StudentStatus"                     "HighestEduLabvanced"              
## [67] "HighestEduLabvancedScored"         "HighestEduObtainedLabvanced"      
## [69] "YearsOfSchoolingLabvanced"         "FirstLang"                        
## [71] "FirstLangLabvanced"                "FirstLangLabvancedDetail"         
## [73] "FluentLangs"                       "FluentLangsBinary"                
## [75] "NumberFluentLangs"                 "ForeignLangsLabv"                 
## [77] "NumberForeignLangsLabv"            "ForeignLangsYrsStudLabv"          
## [79] "ForeignLangsHrsProdLabv"           "ForeignLangsHrsPercepLabv"        
## [81] "LangStudLabv"                      "LangProfLabv"                     
## [83] "GenderIdentity"                    "Sex"                              
## [85] "Handedness"                        "HearingDifficulties"              
## [87] "LiteracyDifficulties"              "Vision"                           
## [89] "FictionReadingDetailLabv"          "FictionReadingScored"             
## [91] "NewspaperDetailLabv"               "NewspaperNumberLabv"              
## [93] "NewspWklyMinsLabv"                 "NewspWklyMinsScored"              
## [95] "NonFicTimeDetailLabv"              "NonFicTimeScored"                 
## [97] "MusicWklyHrsLabv"

Good. No more duplicates. No more unnecessary variables.

This was only a minor issue we had to clean up. Now let’s turn to the other issues in the data, some of which are more complex.

A small number of speakers participated in my experiment twice due to technical issues. Thanks to the time codes provided by the experimental software I was able to distinguish their first tries from their second tries in the data. Let’s exclude the first incomplete tries and include only the complete second tries of these speakers. This information is encoded in the variable Status and complete attempts are marked accepted here.

summary(MS$Status)
## accepted rejected 
##     9180      180
MS1 <- MS %>%
  filter(Status == "accepted") %>% #get all obs with Status "accepted"
  droplevels() #drop empty levels

summary(MS1$Status)
## accepted 
##     9180

One of the most important variables we have in this data frame is LabvRT. It encodes the participants’ reaction times to the presented stimuli. As is typical of reaction time data, this variable will need to be checked and cleaned thoroughly.

First, we need to subtract 1550 ms from each reaction time measurement because this was the time it took for the crucial stimulus to show up on the screen in the first place. This is specific to my experiment and due to the allowances of the experimental software.

summary(MS1$LabvRT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      15    2248    2378    2465    2573   11678
MS1$RT <- (MS1$LabvRT - 1550)

summary(MS1$RT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1535.0   698.0   827.5   915.4  1023.0 10128.0

We have some negative reaction times now because some participants clicked through this task. Let’s exclude these negative reaction times right away. They are not useful for us.

MS1 <- MS1 %>%
  filter(RT > 0) %>% #get all obs with RT greater than 0
  droplevels() #drop empty levels

summary(MS1$RT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   359.0   700.0   829.0   929.7  1025.0 10128.0

Since this was an online experiment, each participant did the experiment on their own device. This can create issues for reaction time measurement because some devices may be slower than others. The experimental software I used (Labvanced) provides metrics for assessing the accuracy of the measured reaction times. I will clean the reaction times further according to Labvanced’s recommendations.

The median of the variable TimeMeasureSTD_inMilSecs should be 12 ms according to Labvanced. They recommend excluding all values that are more than 2-3 times this median. I am applying the stricter end of that range and excluding all values above 24 ms (two times the recommended median).

summary(MS1$TimeMeasureSTD_inMilSecs)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0      2.0      3.4   1513.5      7.1 223212.3
MS2 <- MS1 %>%
  filter(TimeMeasureSTD_inMilSecs <= 24) %>% #get all values less than or equal to 24
  droplevels() #drop empty levels

summary(MS2$TimeMeasureSTD_inMilSecs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.800   3.000   3.764   4.500  17.500

The second relevant metric for making sure the reaction time measurement was accurate is TimeDelayOffset_MilSecs. Here Labvanced recommends the median should be 20 ms at most. All values in the present data are below this value so we are good to go here.

summary(MS2$TimeDelayOffset_MilSecs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.80    1.10    1.49    1.70    8.50

Next, since this was a lexical decision task (Is this a real word in English? Yes or no?) we also had to include some made-up words such as cetenise, gloach, and tront. This means participants could be either right or wrong in their judgments of the stimuli. We are only interested in their correct responses.

MS3 <- MS2 %>%
  filter(ParticipantRespScored == "correct") %>% #get only correct obs
  droplevels() #drop empty levels

summary(MS3$ParticipantRespScored)
## correct 
##    6915

And we are also only interested in their reaction times for the real words not the invented words.

table(MS3$TargetStatus)
## 
## nonce  real 
##  3447  3468
MS4 <- MS3 %>%
  filter(TargetStatus == "real") %>% #get only real targets
  droplevels() #drop empty levels

summary(MS4$TargetStatus)
## real 
## 3468

Fewer speakers than I expected (only three) participated on mobile devices.

MS4 %>%
  filter(DeviceType == "mobile") %>% #get only mobile participants
  distinct(Participant) #show me the unique Participant IDs for this subset
##       Participant
## 1 Participant_003
## 2 Participant_030
## 3 Participant_093

A further problem is that they were much slower than the desktop participants.

MS4 %>%
  ggplot(aes(x=DeviceType, 
              y=RT, 
              fill=DeviceType)) +
  geom_boxplot() +
  scale_fill_manual(values = TwoColors) +
  geom_jitter(color="black", 
              size=0.2, 
              alpha=0.9) +
  theme_classic() +
  theme(
    legend.position="none",
    text = element_text(size = 18)) +
  ggtitle("RT by Device Type") +
  xlab("")

With just three participants it’s difficult to reliably determine why they are so much slower. For instance, it’s not like they are particularly old.

MS4 %>%
  filter(Participant %in% c("Participant_003", 
                            "Participant_030",
                            "Participant_093")) %>% #get these participants
  summarise( #make me a summary for these participants with
  Min = min(Age), #the minimum age
  Q1 = quantile(Age, 0.25), #the age at Q1
  Median = median(Age), #the median age
  Mean = round(mean(Age),2), #the mean age
  Q3 = quantile(Age, 0.75), #the age at Q3
  Max = max(Age) #the maximum age
)
##   Min Q1 Median  Mean Q3 Max
## 1  28 28     33 34.67 43  43
#compare that to the age across all participants
summary(MS4$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   21.00   26.00   30.62   37.00   77.00

So I will exclude the mobile participants here just to make sure there are no confounding factors.

MS5 <- MS4 %>%
  filter(DeviceType != "mobile") %>%
  droplevels()

We can also exclude the control items where the prime and target are only orthographically related. These are all observations where the variable Condition is mnoy (short for: morphologically related no, orthographically related yes). Before we exclude them though, note that we almost got the reaction time ranking we expected in the beginning across the different priming conditions. mnon is the unrelated condition (attributable - irritate) and myoy is the morphologically related condition (celebration - celebrate).

MS5 %>%
  group_by(Condition) %>% #group the data by the values of Condition
  summarise(mean = mean(RT)) %>% #calculate the mean RT for each condition group
  arrange(desc(mean)) %>% #show me the result in descending order
  ungroup() #undo the grouping
## # A tibble: 3 × 2
##   Condition  mean
##   <fct>     <dbl>
## 1 mnoy       901.
## 2 mnon       877.
## 3 myoy       851.
MS6 <- MS5 %>%
  filter(Condition != "mnoy") %>% #get only those observations where Condition is not mnoy
  droplevels() #drop empty levels

#checking my work
summary(MS6$Condition)
## mnon myoy 
## 1109 1147

Now we can perform the final clean-up of our most important variable: the reaction time RT. Let’s first have a look at how the reaction times are distributed.

#let's have a look at just the numbers with summary()
summary(MS6$RT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   385.0   671.0   781.0   863.6   952.0  5071.0
#and let's also visualize the distribution
ggplot(MS6, aes(x = RT)) +
  geom_density(adjust = 1.5,
               alpha = 0.7,
               fill = "#084081") +
  theme_classic() +
  theme(text = element_text(size = 18)) +
  xlab("RT")

The output of the summary function indicates that we do not have any implausibly short RTs in the data (these would be RTs below 300 ms according to most sources). This means we do not need to do any trimming at the distribution’s left-hand edge. For the long right tail there are several different options for trimming reaction times. Since this is a comparatively small data frame, I chose to cut at the 95th percentile, which allows me to keep a little more data than some of the other methods using the IQR or SD would have left me with.

#finding the 95th quantile
Q95RT <- quantile(MS6$RT, 0.95)
Q95RT
##    95% 
## 1453.5
MSfinal <- MS6 %>%
  filter(RT < Q95RT) %>% #get only those RTs that are less than the value of the 95th quantile
  droplevels() #drop the empty levels

#checking my work
summary(MSfinal$RT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   385.0   665.0   769.0   810.4   918.5  1453.0
#how much data was lost?
data_before <- nrow(MS6) #save size of MS6 in data before trimming
data_after <- nrow(MSfinal) #save size of MSfinal in data after trimming

#the percentage of data we have left is the quotient of data_after divided by data_before
percentage_left <- (data_after/data_before)

round( #round the result of
  (1 - percentage_left)*100, #subtracting the data left from the whole (1) and multiplying by 100
  0) #round to zero decimal places
## [1] 5
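For comparison, here is a minimal sketch (on simulated toy data, using base R only; none of this is part of the actual pipeline) of the IQR- and SD-based trimming rules mentioned above:

```r
#toy sketch of two common alternative RT trimming rules (simulated data)
set.seed(1)
toyRT <- c(rgamma(200, shape = 6, rate = 0.008), 4000, 5000) #skewed RTs plus two outliers

#(a) IQR rule: keep values within 1.5 * IQR of the quartiles
q <- quantile(toyRT, c(0.25, 0.75))
iqr_kept <- toyRT[toyRT > q[1] - 1.5 * IQR(toyRT) &
                    toyRT < q[2] + 1.5 * IQR(toyRT)]

#(b) SD rule: keep values within 2.5 standard deviations of the mean
sd_kept <- toyRT[abs(toyRT - mean(toyRT)) < 2.5 * sd(toyRT)]

#compare the share of data each rule keeps
length(iqr_kept) / length(toyRT)
length(sd_kept) / length(toyRT)
```

Which rule keeps more data depends on the shape of the distribution; on strongly skewed reaction times the symmetric rules in particular can cut deep into the right tail.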

Let’s visualize the distribution of the reaction times now after outlier trimming. For comparison you can also see the distribution before trimming again in the second tab.

4.3.1 After Trimming

ggplot(MSfinal,
       aes(x = RT)) +
  geom_density(adjust = 1.5,
               alpha = 0.7,
               fill = "#084081") +
  theme_classic() +
  theme(text = element_text(size = 18)) +
  scale_x_continuous(limits = c(300,1500)) +
  xlab("RT")

The visualization shows that the data still has a clear positive skew (it is skewed to the right as evidenced by the longer right tail). It is important to keep this in mind when fitting a regression model to the data later. Regression models usually assume the dependent variable (in our case: the reaction time) is normally distributed (i.e. follows a Gaussian distribution). The distribution above is not Gaussian but more consistent with a so-called Gamma distribution. Luckily, we can tell the model that the data has this distribution later. Another option would be to transform the data. However, this could make the reaction time data less intuitive to interpret or shrink/distort the precise individual differences I am interested in. I will not transform the data for this reason.
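To preview the modelling idea, here is a minimal sketch (on simulated toy data; the formula and variable names are my own illustration, not the actual dissertation model, which would also need per-participant random effects):

```r
#toy sketch: Gamma regression with a log link on simulated, right-skewed RTs
set.seed(2)
toy <- data.frame(
  Condition = factor(rep(c("myoy", "mnon"), each = 100)),
  RT = c(rgamma(100, shape = 8, rate = 0.011),  #related pairs: faster on average
         rgamma(100, shape = 8, rate = 0.009))  #unrelated pairs: slower on average
)

#family = Gamma(link = "log") tells the model to expect the positive skew
#instead of assuming a normally distributed dependent variable
m <- glm(RT ~ Condition, data = toy, family = Gamma(link = "log"))
coef(summary(m))
```

The same family argument carries over to mixed-effects extensions such as glmer() from the lme4 package.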

4.3.2 Before Trimming

ggplot(MS6, aes(x = RT)) +
  geom_density(adjust = 1.5,
               alpha = 0.7,
               fill = "#084081") +
  theme_classic() +
  theme(text = element_text(size = 18)) +
  xlab("RT")

4.4 Creating a New Variable

For some analyses I will need to know if the speaker is a native speaker of English or not. The variable FirstLang holds this information for each speaker.

summary(MS$FirstLang)
##     Afrikaans       Bengali     Bulgarian    Cantonese          Czech 
##            60            60            60            60           120 
##       English English+Other      Estonian         Farsi         Greek 
##          7380           540            60            60            60 
## Hindi_Bengali     Hungarian    Indonesian       Italian        Polish 
##            60            60            60            60           240 
##    Portuguese      Romanian       Russian       Spanish         Tamil 
##            60            60           120            60            60 
##       Turkish 
##            60
nlevels(MS$FirstLang)
## [1] 21

As we can see from the output of applying the summary() and nlevels() functions to the variable, it has 21 levels. This is a level of detail that is not necessary for my analysis. So I will create a new variable that encodes the native language information for each speaker in binary format (Native_Speaker vs. Non-Native_Speaker). I will call the new variable FirstLangBinary.

MSfinal$FirstLangBinary <- case_when( #assign to FirstLangBinary all obs where
  MSfinal$FirstLang == "English" | #FirstLang is English OR
    MSfinal$FirstLang == "English+Other" ~ "Native_Speaker", #English+Other, rename these to "Native_Speaker"
  TRUE ~ "Non-Native_Speaker") #rename all others to "Non-Native_Speaker"

MSfinal$FirstLangBinary <- as.factor(MSfinal$FirstLangBinary) #convert the result to a factor

In the code above I used case_when() to evaluate for each value in FirstLang whether it is “English” or “English+Other”. This information corresponds to people who grew up speaking only English as their native language and bilingual English speakers. For all other values I want the code to assign “Non-Native_Speaker”. Finally, I also want to make sure the output of the case_when() coding is read as a factor in R. Let’s check my work with summary().

summary(MSfinal$FirstLangBinary)
##     Native_Speaker Non-Native_Speaker 
##               1866                277

One final step remains: for easier plotting and comparisons I want to reorder the factor levels of the Condition variable.

#before reordering
levels(MSfinal$Condition)
## [1] "mnon" "myoy"
#reordering the factor levels
MSfinal$Condition <- factor(MSfinal$Condition, levels = c("myoy","mnon"))

#checking my work after reordering
levels(MSfinal$Condition)
## [1] "myoy" "mnon"

Success! Now we can start with the analysis.

5 Analysis

5.1 Reaction Time & Priming Condition

Let’s first have a general look at how speakers reacted to the two priming conditions of interest: morphologically related pairs and unrelated pairs.

MSfinal %>%
  ggplot( aes(x=Condition, y=RT, fill=Condition)) +
  geom_violin() +
  scale_fill_manual(values = TwoColors) +
  theme_classic() +
  theme(
    legend.position="none",
    text = element_text(size = 16)) +
  ggtitle("Reaction Time By Priming Condition (N = 2143)") +
  scale_x_discrete(labels = c("related","unrelated")) +
  xlab("")

This violin plot indicates that on average, across all speakers, reaction times for morphologically related stimuli (think celebration - celebrate) were indeed shorter than reaction times for completely unrelated stimulus pairs (think attributable - irritate). The advantage of using a violin plot over a boxplot here is that the violin plot conveniently shows the full distribution of the reaction times in each category. Boxplots cannot provide this level of detail, which is why I added jitter dots to the first boxplot above to counteract this. But I am not interested in group averages: I am interested in the behavior of individual participants and how they react differently to related and unrelated prime-target pairs.

5.2 Participants’ Individual Priming Effects

Let’s first get the mean reaction time for each condition for each participant.

#first for the unrelated condition (mnon)
means_mnon <- MSfinal %>%
  select(Participant, Condition,RT) %>% #extract Participant, Condition, and RT columns
  filter(Condition == "mnon") %>% #get only those obs with Condition = mnon
  group_by(Participant) %>% #group the output by participant
  mutate(Part_MeanRT_mnon = mean(RT)) %>% #create the column Part_MeanRT_mnon, fill with mean of RT
  distinct(Participant, Part_MeanRT_mnon) %>% #get rid of duplicate participant and mean RT rows
  ungroup() #undo the grouping

#checking the result
summary(means_mnon$Part_MeanRT_mnon)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   544.1   746.5   812.8   839.8   921.9  1364.0
#then for the related condition (myoy)
means_myoy <- MSfinal %>%
  select(Participant, Condition,RT) %>% #extract Participant, Condition, and RT columns
  filter(Condition == "myoy") %>% #get only those obs with Condition = myoy
  group_by(Participant) %>% #group the output by participant
  mutate(Part_MeanRT_myoy = mean(RT)) %>% #create the column Part_MeanRT_myoy, fill with mean of RTs
  distinct(Participant, Part_MeanRT_myoy) %>% #get rid of duplicate participant and mean RT rows
  ungroup() #undo the grouping

#checking the result
summary(means_myoy$Part_MeanRT_myoy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   560.1   691.6   789.0   805.4   898.3  1273.0
#now join the participant means back into MSfinal
MSfinal <- MSfinal %>%
  left_join(means_mnon, by = "Participant") %>%
  left_join(means_myoy, by = "Participant")
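As a side note, the same participant means could also be computed in a single pass with summarise() and tidyr::pivot_wider(). A sketch on toy data (assuming dplyr and tidyr are available; toy and per_part are made-up names, not objects from my pipeline):

```r
library(dplyr)
library(tidyr)

# Toy stand-in for MSfinal with the three relevant columns
toy <- tibble(
  Participant = rep(c("P1", "P2"), each = 4),
  Condition   = rep(c("mnon", "myoy"), times = 4),
  RT          = c(800, 700, 820, 690, 750, 760, 770, 740)
)

# One pass: per-participant mean RT per condition, then widen to one row
# per participant with one column per condition
per_part <- toy %>%
  group_by(Participant, Condition) %>%
  summarise(meanRT = mean(RT), .groups = "drop") %>%
  pivot_wider(names_from = Condition,
              values_from = meanRT,
              names_prefix = "Part_MeanRT_")

per_part
```

This avoids the two helper data frames and the subsequent joins entirely.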

Now I can calculate a priming effect for each participant as the difference between the mean reaction time in the unrelated condition and in the related condition (MeanRT(mnon) - MeanRT(myoy)).

MSfinal$PrimingEffect <- (MSfinal$Part_MeanRT_mnon - MSfinal$Part_MeanRT_myoy)

#checking my work
summary(MSfinal$PrimingEffect)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -188.972   -9.811   33.200   34.382   89.100  203.100
#there are positive and negative priming effects. I want to encode this property in a new variable

MSfinal <- MSfinal %>%
  mutate(PolarityPriming = case_when( #create variable PolarityPriming that is
    PrimingEffect > 0 ~ "positive", #positive for PrimingEffects greater than zero
    PrimingEffect < 0 ~ "negative") %>% #negative for PrimingEffects less than zero
      factor()) #convert the output to a factor

#checking my work
summary(MSfinal$PolarityPriming)
## negative positive 
##      642     1501
with(MSfinal, tapply(PrimingEffect, PolarityPriming, summary))
## $negative
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -188.97  -79.75  -37.36  -50.73  -18.87   -2.50 
## 
## $positive
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.5694  27.1429  58.2000  70.7851 104.8000 203.1000

We have a lot of range in the priming effects! Negative priming effects mean that a speaker reacted faster to the unrelated condition than to the related condition. Positive priming effects mean the speaker reacted faster to the morphologically related condition than to the unrelated condition. Let’s visualize the priming effect for each participant so we can really get an idea of the variation.

MSfinal %>%
  distinct(Participant, .keep_all = TRUE) %>% #get only unique participant rows
  sample_n(50, replace = FALSE) %>% #draw a random sample of 50 from these
  ggplot(aes(x = Participant,
             y = PrimingEffect,
             fill = PolarityPriming)) +
  geom_col(position = "identity",
           linewidth = 0.25) +
  scale_fill_manual(values = c("#bf1029", #red for negative values
                               "#3f8f29"), #green for positive calues
                    guide = "none") + #no legend
  theme_classic() +
  theme(axis.text.x = element_blank(), #no text on the ticks for x-axis
        text = element_text(size = 18)) +
  ylab("Priming Effect")

As you can see, we have a lot of variation indeed. On the one hand, speakers differed in which condition they reacted to faster: red bars represent cases where a speaker on average reacted faster to the unrelated condition, green bars represent cases where a speaker on average reacted faster to the morphologically related condition. On the other hand, the magnitude of the difference in reaction times between the two conditions also varies quite a lot across participants, as indicated by the different heights of the bars in either direction.

#how many speakers in total reacted faster to the unrelated condition than the related condition (i.e. have a negative priming effect)?

MSfinal %>%
  distinct(Participant, .keep_all = TRUE) %>%
  filter(PolarityPriming == "negative") %>%
  count()
##    n
## 1 38
#38 speakers have a negative priming effect

nlevels(MSfinal$Participant)
## [1] 126
#out of our total of 126 speakers

38/nlevels(MSfinal$Participant)
## [1] 0.3015873
#that is around 30% of our speakers

5.3 Examining Potential Confounding Factors

Just so we can be more sure that it is really the priming condition at work here and not some other characteristic of our speakers, let’s look at the relationship between priming effect and age as well as between priming effect and the native speaker status of the participants.

5.3.1 Age

#visualizing the correlation
MSfinal %>%
  distinct(Participant, .keep_all = TRUE) %>%
  ggplot(aes(x = Age,
             y = PrimingEffect)) +
  geom_point() +
  geom_smooth(method=lm , color="#0868AC", fill="gray", se=TRUE) +
  theme_classic() +
  theme(text = element_text(size=18))
## `geom_smooth()` using formula = 'y ~ x'

#testing the correlation

#making helper df to be able to use Base R
MSfinal_unique <- MSfinal %>%
  distinct(Participant, .keep_all = TRUE)

#correlation testing using kendall because we have tied ranks (see plot)
cor.test(MSfinal_unique$Age,
         MSfinal_unique$PrimingEffect, 
         method = "kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  MSfinal_unique$Age and MSfinal_unique$PrimingEffect
## z = -1.041, p-value = 0.2979
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##         tau 
## -0.06418185
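To illustrate why the tied ranks matter: with ties, cor.test() cannot compute an exact p-value for Kendall’s tau and falls back to a normal approximation, but the rank-based tau itself remains well defined. A toy sketch with made-up values (the warning about the exact p-value is suppressed for readability):

```r
# Toy predictor with tied values (like repeated ages among participants)
x <- c(20, 20, 25, 25, 30, 30, 35, 40)
y <- c(5, 12, -3, 8, 15, 2, 20, 18)

# cor.test() handles the ties via a normal approximation,
# which is exactly why Kendall is usable here
res <- suppressWarnings(cor.test(x, y, method = "kendall"))
res$estimate # tau
res$p.value
```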

The scatter plot shows us that there is only a very weak, negative relationship between age and priming effect. The Kendall’s tau estimate matches this magnitude and direction, and the large p-value indicates that there is no significant correlation between priming effect and speaker age. Next we turn to native speaker status.

5.3.2 Native Speaker Status

#visualizing the relationship
MSfinal %>%
  distinct(Participant, .keep_all = TRUE) %>%
  ggplot(aes(x = FirstLangBinary,
             y = PrimingEffect,
             fill = FirstLangBinary)) +
  geom_boxplot() +
  scale_fill_manual(values=TwoColors)  +
  scale_x_discrete(labels = c("Native Speakers",
                              "Non-Native Speakers")) +
  geom_jitter(color="black", 
              size=0.4, 
              alpha=0.9) +
  stat_summary(fun = mean, 
               geom = "point", 
               shape = 20, 
               size = 5, 
               color="black", 
               fill="black") +
  theme_classic() +
  theme(
    legend.position="none",
    text = element_text(size = 18)) +
  ggtitle("") +
  xlab("")

#testing for difference in means (again using the helper df MSfinal_unique)
wilcox.test(
  MSfinal_unique[MSfinal_unique$FirstLangBinary == "Native_Speaker",]$PrimingEffect,
  MSfinal_unique[MSfinal_unique$FirstLangBinary == "Non-Native_Speaker",]$PrimingEffect, alternative = "greater", conf.int = TRUE)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  MSfinal_unique[MSfinal_unique$FirstLangBinary == "Native_Speaker", ]$PrimingEffect and MSfinal_unique[MSfinal_unique$FirstLangBinary == "Non-Native_Speaker", ]$PrimingEffect
## W = 1130, p-value = 0.1361
## alternative hypothesis: true location shift is greater than 0
## 95 percent confidence interval:
##  -11.8095      Inf
## sample estimates:
## difference in location 
##                20.7029

Looking at the boxplot above we can see that there is only a small difference between native and non-native speakers’ mean (large black dots) and median priming effects (horizontal lines in boxes). The Wilcoxon rank sum test I applied returned a non-significant p-value. The small difference we see is not statistically significant. Note, though, that we have very little data in the non-native speaker group compared to the native speaker group.

5.4 More Individuality: Priming & Vocabulary Size

As I explained above, exploring the individual linguistic processing of my speakers was crucial to my dissertation project. Besides finding out how speakers process complex words, I also tested the size of their vocabulary using a standardized vocabulary size test (for more information see this link: https://www.wgtn.ac.nz/lals/resources/paul-nations-resources/vocabulary-tests). On this test speakers could achieve a score between zero and 20,000. The score expresses how many word families the test assumes the speakers know. So a score of 20,000 does not mean the speaker knows 20,000 words but that they know at least 20,000 word families, i.e. significantly more than 20,000 individual words.

It is reasonable to assume that speakers should be able to react faster to words that they know. And the greater someone’s vocabulary is, the more words they know and the greater their advantage should be in the priming task that tested their processing of complex words. We will now explore the relationship between morphological processing and vocabulary size in my data.

Let’s start by looking at the relationship between the reaction time in the priming task and the score from the vocabulary size test for each priming condition in the priming task.

MSfinal %>%
  ggplot(aes(x = VSTScore,
             y = RT,
             colour = Condition)) +
  geom_point(size = 1, 
             color = "black") + 
  geom_smooth(method="lm") + 
  labs(x = "VSTScore", 
       y = "RT") + 
  theme_classic() +
  theme(text = element_text(size= 18), 
        legend.position = "bottom",
        legend.text = element_text(size = 16),
        legend.title = element_text(size = 16)) +
  scale_color_discrete(name = "Condition", 
                       labels = c("morphologically related", 
                                  "unrelated"))

We can see, as evidenced by the negative slopes of both regression lines in the plot, that speakers with larger vocabularies were generally able to react faster in the priming task regardless of the priming condition. However, the fact that the red regression line for related prime-target pairs has a steeper negative slope than the blue regression line for unrelated prime-target pairs also tells us that a larger vocabulary made speakers’ reactions even faster when they were asked to react to a morphologically related prime-target pair. So we see an interaction between vocabulary size and priming condition here.

The overall correlation between vocabulary size and reaction time in the priming experiment is also weak but statistically significant and negative, meaning the greater the vocabulary size, the lower (i.e. faster) the reaction time.

cor.test(
  MSfinal$VSTScore,
  MSfinal$RT, 
  method = "kendall",
  alternative = "less")
## 
##  Kendall's rank correlation tau
## 
## data:  MSfinal$VSTScore and MSfinal$RT
## z = -8.0912, p-value = 2.955e-16
## alternative hypothesis: true tau is less than 0
## sample estimates:
##       tau 
## -0.118749

I also tested the correlation between the priming effect and vocabulary size and found a weak, positive, and statistically significant correlation. This means speakers with larger vocabularies tended to have positive priming effects (i.e. they reacted faster to related prime-target pairs than to unrelated ones), and these positive priming effects also tended to be larger. This supports the analysis above, which also showed that speakers with greater vocabularies react faster to morphologically related prime-target pairs than to unrelated pairs.

#visualizing the relationship
MSfinal_unique %>%
  ggplot(aes(x =  VSTScore,
             y = PrimingEffect)) +
  geom_point() +
  geom_smooth(method=lm , 
              color="#0868AC", 
              fill="gray", 
              se=TRUE) +
  theme_classic() +
  theme(text = element_text(size = 18))
## `geom_smooth()` using formula = 'y ~ x'

#testing the correlation
cor.test(MSfinal_unique$VSTScore,
         MSfinal_unique$PrimingEffect, 
         method = "kendall", 
         alternative = "greater")
## 
##  Kendall's rank correlation tau
## 
## data:  MSfinal_unique$VSTScore and MSfinal_unique$PrimingEffect
## z = 1.6886, p-value = 0.04565
## alternative hypothesis: true tau is greater than 0
## sample estimates:
##       tau 
## 0.1030031

5.5 Cluster Analysis

I wanted to explore my data further using different methods to validate the results from above. Since I am interested in the individual behavior of speakers I do not want an analysis that shows me how the whole group of speakers behaves together. I want an analysis that can identify sub-groups that behave alike within the large group of my speakers. For this reason I decided to do a cluster analysis using hierarchical clustering. We will need functions from the packages stats, cluster, and factoextra for the cluster analysis.

First we need to make a data frame containing the data to be clustered. I want to cluster participants based on their priming effect, so I just need a simplified data frame that only contains Participant and PrimingEffect.

#extracting the desired columns as a table
cluster_data <- MSfinal %>%
  select(Participant, PrimingEffect) %>%
  distinct(Participant, .keep_all = TRUE)

#checking the result
head(cluster_data)
##       Participant PrimingEffect
## 1 Participant_001      -6.62500
## 2 Participant_002      15.70000
## 3 Participant_004      51.85000
## 4 Participant_005     -27.48611
## 5 Participant_006      42.77778
## 6 Participant_007     -15.97500
#for the next step the row names of the table need to be the participant IDs
rownames(cluster_data) <- as.character(cluster_data$Participant)
head(cluster_data)
##                     Participant PrimingEffect
## Participant_001 Participant_001      -6.62500
## Participant_002 Participant_002      15.70000
## Participant_004 Participant_004      51.85000
## Participant_005 Participant_005     -27.48611
## Participant_006 Participant_006      42.77778
## Participant_007 Participant_007     -15.97500
#and we no longer need the Participant column now so we can set it to NULL
cluster_data$Participant <- NULL

Now we can create a distance matrix to see how different speakers’ priming effect values are from each other. Distance matrices may be calculated in several different ways. I tried Euclidean, Manhattan, Maximum and Canberra and finally settled on Euclidean because it yielded the most interpretable solution.

#creating the Euclidean distance matrix
euclid_dist <- dist(cluster_data, 
                    method = "euclidean")

#checking the result
summary(euclid_dist)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0119  33.7111  72.2571  84.1391 120.7685 392.0722

Great. We can move on to the actual clustering. Just as there are different methods for calculating the distance matrix, there are different cluster agglomeration methods as well. I tried ward.D2, complete, single, and average clustering. ward.D2 and complete clustering both yielded the best results. I will show how I clustered the data using the complete method here.
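Such a comparison of agglomeration methods can also be run systematically in a loop over average silhouette widths. Here is a sketch on toy one-dimensional data (assuming the cluster package is installed; effects, d, and avg_sil are illustrative names, not objects from my pipeline):

```r
library(cluster) # for silhouette()

# Toy one-dimensional "priming effect" values with two well-separated groups
set.seed(42)
effects <- c(rnorm(20, mean = -60, sd = 15), rnorm(20, mean = 80, sd = 15))
d <- dist(scale(effects), method = "euclidean")

# Average silhouette width at k = 2 for each agglomeration method
methods <- c("ward.D2", "complete", "single", "average")
avg_sil <- sapply(methods, function(m) {
  hc <- hclust(d, method = m)
  summary(silhouette(cutree(hc, k = 2), d))$avg.width
})
avg_sil
```

On real data the methods can disagree, and the one with the highest silhouette width (and the most interpretable dendrogram) is a reasonable choice.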

#clustering
complete_hc <- cluster_data %>%
  scale() %>%                    #scale the data
  dist(method = "euclidean") %>% #compute dissimilarity matrix
  hclust(method = "complete")     #compute hierarchical clustering

#what is the optimal number of clusters?
#we will use silhouette width to find out
cluster_test <- sapply(2:8, #for scenarios 2 through 8 clusters
                       function(x) 
                         summary( #give me a summary
                           silhouette( #of the result of applying silhouette() to
                             cutree(complete_hc, k = x), euclid_dist))$avg.width) 
#the result of applying cutree() to the complete_hc data for x = 2:8, select only the column avg.width from the summary output 

#checking the result
cluster_test
## [1] 0.5009442 0.5659903 0.5549514 0.5573451 0.5458586 0.5562159 0.5388214
#silhouette width ranges from -1 (poor) to 1 (excellent)
#values below 0.2 are already considered too low for effective clustering
#all of our values are well above 0.2 so that is encouraging

#we will find the maximum silhouette width
max(cluster_test)
## [1] 0.5659903
#the maximum silhouette width is the second item in cluster_test
#since cluster_test outputs the result for trying 2 to 8 clusters
#this corresponds to a solution with 3 clusters
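Instead of eyeballing the maximum, the optimal k can also be recovered programmatically: which.max() returns the position within the tested range 2:8, so adding 1 yields the number of clusters. A sketch with a made-up vector shaped like cluster_test:

```r
# Hypothetical average silhouette widths for k = 2:8 (same shape as cluster_test)
sil_widths <- c(0.50, 0.57, 0.55, 0.56, 0.55, 0.56, 0.54)

# which.max() gives the position within 2:8, so add 1 to recover k
best_k <- which.max(sil_widths) + 1
best_k # 3
```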

In the next step I will visualize the clustering solution.

#we will use the `fviz_dend()` function from `factoextra` to visualize
fviz_dend(complete_hc, k = 3, #make 3 clusters
          cex = 0.5, #set label size
          k_colors = c("#084081","#2B8CBE","#7BCCC4"),
          color_labels_by_k = TRUE, #color labels by groups
          rect = TRUE, #add rectangle around groups
          main = ""
)

The cluster analysis has identified three groups: most participants fall into the large cluster in the center. Then there are two smaller clusters: the dark blue cluster on the very lefthand side is the smallest, and the greenish-blue cluster on the very righthand side is the larger of the two.

Let’s connect the clustering information back to the original data frame (MSfinal) so we can figure out exactly how many speakers are in which cluster and what priming effect patterns are predominant in each cluster.

#let's get a list of which participant is in which cluster
cluster_membership_c <- cutree(complete_hc, k = 3)
head(cluster_membership_c)
## Participant_001 Participant_002 Participant_004 Participant_005 Participant_006 
##               1               1               1               2               1 
## Participant_007 
##               2
#let's turn this information into a data frame
clustmember_c.df <- as.data.frame(cluster_membership_c)

#the data frame needs a column with participant names
#currently the participant IDs are only available
#as the row names of the data frame
clustmember_c.df$Participant <- rownames(clustmember_c.df)
head(clustmember_c.df)
##                 cluster_membership_c     Participant
## Participant_001                    1 Participant_001
## Participant_002                    1 Participant_002
## Participant_004                    1 Participant_004
## Participant_005                    2 Participant_005
## Participant_006                    1 Participant_006
## Participant_007                    2 Participant_007
#now we do not need the row names anymore
rownames(clustmember_c.df) <- NULL

#now we can join this information back into MSfinal
MSfinal <- MSfinal %>%
  left_join(clustmember_c.df, by = "Participant")

#checking the result
summary(MSfinal$cluster_membership_c)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.888   3.000   3.000

Now that MSfinal has been enriched with the cluster membership information we can run some more analyses.

#how many speakers are in each cluster group?
MSfinal %>% 
  group_by(cluster_membership_c) %>% 
  summarise(n = length(unique(Participant))) %>%
  ungroup() #undo the grouping
## # A tibble: 3 × 2
##   cluster_membership_c     n
##                  <int> <int>
## 1                    1    54
## 2                    2    31
## 3                    3    41
#from this we can conclude that cluster 1 is the central cluster, cluster 2 is the very lefthand cluster, and cluster 3 is the very righthand cluster in the dendrogram above.

What does `PrimingEffect` look like for each cluster group?

tapply(MSfinal$PrimingEffect, 
       (MSfinal$cluster_membership_c), 
       summary)
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -9.811   8.300  25.250  26.326  42.663  66.022 
## 
## $`2`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -188.97  -89.92  -51.03  -61.21  -32.43  -15.97 
## 
## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   75.56   89.10  105.08  116.93  129.11  203.10

The summary per cluster group indicates that most speakers in cluster 1 (the largest cluster) have a positive priming effect. So most speakers in this group reacted faster to the morphologically related condition than to the unrelated condition. Those speakers in this group that do have a negative priming effect only have a very small one (a difference of around 10 ms at most). The second cluster group is solely made up of speakers who very clearly show a negative priming effect, meaning they usually reacted (much) faster to the unrelated condition than to the related condition. So this group of speakers is markedly different from the majority group we can see in the first cluster. The last group, cluster 3, also behaves differently from the majority of the other speakers: these are the speakers who always reacted faster to the morphologically related priming condition and usually did so by a substantially larger margin than most other speakers in the experiment. We could summarize the behavior of these three groups as follows:

  • Cluster Group 1 - No Boost/Weak Boost: These speakers get no or only a rather weak reaction time advantage from related prime-target pairs.

  • Cluster Group 2 - No Boost: These speakers get no reaction time advantage whatsoever from related prime-target pairs.

  • Cluster Group 3 - Boost: For these speakers a related prime-target pair clearly and substantially speeds up their reaction.

We see this pattern confirmed in the boxplot below, which shows the priming effect values per cluster group. There are no overlaps at all between the three groups, which underlines the good performance of the clustering above.

MSfinal %>%
  ggplot(aes(x = factor(cluster_membership_c), 
             y = PrimingEffect, 
             fill = factor(cluster_membership_c))) +
  geom_boxplot() +
  scale_fill_manual(values = ThreeColors)  +
  geom_jitter(color="black", size=0.4, alpha=0.9) +
  stat_summary(fun = mean, 
               geom ="point", 
               shape =20, 
               size =5, 
               color ="black", 
               fill ="black") +
  theme_classic() +
  theme(
    legend.position="none",
    text = element_text(size = 20)) +
  scale_x_discrete(labels = c("1" = "No Boost/Weak Boost",
                              "2" = "No Boost",
                              "3" = "Boost")) +
  xlab("")

5.6 Generalized Linear Mixed-Effects Model (GLMM)

Now that we have a good overview of how the most important variables in the data set behave, it’s time to put them to the final test all together: a generalized linear mixed-effects model. I will not include the cluster groups or the priming effect in the model because the dependent variable in the GLMM will be the reaction time measured in the priming task. Both the priming effect and the cluster groups were derived from the reaction time, so including them in the model would not be wise.

The fixed-effect independent variables of the model will be:

  • Priming Condition (unrelated vs. related)
  • Vocabulary Size Test Score (scaled and centered around the mean)
  • First Lang Binary (native vs. non-native)

And the random effects structure will consist of

  • Random intercepts per Participant Correlated with Random Slopes for the Effect of Priming Condition Per Participant

We do not have a scaled and centered version of the vocabulary size test in the data frame yet so let’s create that real quick.

MSfinal$VSTScore_sc <- scale(MSfinal$VSTScore, center = TRUE, scale = TRUE)
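As a quick sanity check, scale() should leave the new variable with mean 0 and standard deviation 1. A base-R sketch on toy scores (scores is a made-up stand-in for MSfinal$VSTScore):

```r
# Toy vocabulary scores, standing in for MSfinal$VSTScore
scores <- c(9800, 12400, 15000, 11200, 17600)

# scale() returns a one-column matrix: centered and divided by the SD
scores_sc <- scale(scores, center = TRUE, scale = TRUE)

mean(scores_sc)          # effectively 0 (up to floating point error)
sd(as.vector(scores_sc)) # 1
```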

Now we are all set and can specify our model formula. Recall from Section 3.2 that our dependent variable (the reaction time) is not normally distributed but right-skewed. This matches a gamma distribution. If we use the glmer() function, though, we can specify the family argument as Gamma so that the model knows not to expect a normal distribution and can handle our data the way it is.

#specifying the model
MS_glmer1 = glmer(RT ~ Condition + VSTScore_sc + FirstLangBinary + (1+ Condition | Participant), data = MSfinal, family=Gamma(link="identity"))

#and inspecting its output
summary(MS_glmer1)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: Gamma  ( identity )
## Formula: RT ~ Condition + VSTScore_sc + FirstLangBinary + (1 + Condition |  
##     Participant)
##    Data: MSfinal
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##   27577.3   27622.6  -13780.6   27561.3      2135 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1855 -0.6817 -0.2087  0.5003  4.2340 
## 
## Random effects:
##  Groups      Name          Variance  Std.Dev. Corr 
##  Participant (Intercept)   6.288e+03 79.2970       
##              Conditionmnon 2.689e+03 51.8552  -0.27
##  Residual                  3.777e-02  0.1943       
## Number of obs: 2143, groups:  Participant, 126
## 
## Fixed effects:
##                                   Estimate Std. Error t value Pr(>|z|)    
## (Intercept)                        806.384      8.631  93.425  < 2e-16 ***
## Conditionmnon                       35.567      7.251   4.905 9.33e-07 ***
## VSTScore_sc                        -36.802      8.395  -4.384 1.17e-05 ***
## FirstLangBinaryNon-Native_Speaker   70.198     10.822   6.487 8.78e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Cndtnm VSTSc_
## Conditinmnn -0.056              
## VSTScore_sc  0.057  0.089       
## FrstLBN-N_S -0.133 -0.074 -0.073

All of our independent variables emerge as significant. Let’s test this more rigorously using the drop1() function, which compares all possible models that you could build by dropping one of the terms currently included in the model.

#any unnecessary predictors in MS_glmer1?
drop1_MS_glmer1 <- drop1(MS_glmer1, test="Chisq")

#present the result sorted by p-value
drop1_MS_glmer1[order(drop1_MS_glmer1[,"Pr(Chi)"]),]
## Single term deletions
## 
## Model:
## RT ~ Condition + VSTScore_sc + FirstLangBinary + (1 + Condition | 
##     Participant)
##                 npar   AIC     LRT   Pr(Chi)    
## Condition          1 27588 12.2105 0.0004752 ***
## VSTScore_sc        1 27581  5.6942 0.0170215 *  
## FirstLangBinary    1 27578  2.6850 0.1013002    
## <none>               27577                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#FirstLangBinary can go

The result of applying drop1() shows us that FirstLangBinary is an unnecessary predictor: the high p-value associated with it tells us that a model without FirstLangBinary is very likely just as good as the full model, meaning the predictor might as well not be in the model at all. So let’s update our original model accordingly and get the summary of the new model right away.
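As a side note, the formula does not have to be retyped: stats::update() can drop a term from an existing fit. Here is a toy lm() sketch of the idiom (made-up data; the same pattern works for merMod objects returned by glmer()):

```r
# Toy data with two predictors
set.seed(7)
dat <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))

full    <- lm(y ~ x1 + x2, data = dat)
reduced <- update(full, . ~ . - x2) # refit the same model without x2

formula(reduced)
# y ~ x1
```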

#let's specify an updated version of the model
MS_glmer1a = glmer(RT ~ Condition + VSTScore_sc + (1+ Condition | Participant), data = MSfinal, family=Gamma(link="identity"))

#testing the original model against the new model with a likelihood ratio test via anova()
anova(MS_glmer1, MS_glmer1a, test = "Chisq")
## Data: MSfinal
## Models:
## MS_glmer1a: RT ~ Condition + VSTScore_sc + (1 + Condition | Participant)
## MS_glmer1: RT ~ Condition + VSTScore_sc + FirstLangBinary + (1 + Condition | Participant)
##            npar   AIC   BIC logLik -2*log(L) Chisq Df Pr(>Chisq)
## MS_glmer1a    7 27578 27618 -13782     27564                    
## MS_glmer1     8 27577 27623 -13781     27561 2.685  1     0.1013
#again, model with FirstLangBinary not significantly better than model without it

#any unnecessary predictors in MS_glmer1a?
drop1_MS_glmer1a <- drop1(MS_glmer1a, test="Chisq")

#present the result sorted by p-value
drop1_MS_glmer1a[order(drop1_MS_glmer1a[,"Pr(Chi)"]),]
## Single term deletions
## 
## Model:
## RT ~ Condition + VSTScore_sc + (1 + Condition | Participant)
##             npar   AIC     LRT  Pr(Chi)    
## Condition      1 27588 12.0969 0.000505 ***
## VSTScore_sc    1 27585  9.0342 0.002650 ** 
## <none>           27578                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#no, all remaining predictors significant

Now that we know that the updated model is better than the original model and that all remaining predictors are necessary for a good model fit, we will take a closer look at the model’s output.

summary(MS_glmer1a)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: Gamma  ( identity )
## Formula: RT ~ Condition + VSTScore_sc + (1 + Condition | Participant)
##    Data: MSfinal
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##   27578.0   27617.6  -13782.0   27564.0      2136 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1936 -0.6762 -0.2095  0.4937  4.2414 
## 
## Random effects:
##  Groups      Name          Variance  Std.Dev. Corr 
##  Participant (Intercept)   6.354e+03 79.7115       
##              Conditionmnon 2.685e+03 51.8203  -0.27
##  Residual                  3.777e-02  0.1944       
## Number of obs: 2143, groups:  Participant, 126
## 
## Fixed effects:
##               Estimate Std. Error t value Pr(>|z|)    
## (Intercept)    816.093      8.340  97.849  < 2e-16 ***
## Conditionmnon   35.583      7.041   5.054 4.33e-07 ***
## VSTScore_sc    -44.659      7.932  -5.630 1.80e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Cndtnm
## Conditinmnn -0.122       
## VSTScore_sc  0.026 -0.067

Let’s first address the fixed effects. The intercept is 816, meaning that with all other predictors at their reference levels, that is

  • Condition = myoy, i.e. related prime-target pairs, and
  • Vocabulary Size Test Score = zero, which in this case means “average” because, remember, this variable has been centered,

the predicted reaction time is 816 ms.

The coefficient for the value mnon of Condition (so: unrelated prime-target pairs) is positive. This means that, compared to the reference level of related prime-target pairs (myoy), the unrelated priming condition increases the baseline reaction time indicated by the intercept by about 36 ms on average, i.e. it makes speakers react more slowly. Vocabulary size (VSTScore_sc), on the other hand, has a negative coefficient. For every one-unit increase in vocabulary size (meaning as vocabulary size becomes greater than average), reaction time decreases by around 45 ms.
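To make this interpretation concrete, the model’s predictions can be reconstructed by hand from the fixed-effect estimates in the summary above. This is only an illustrative sketch; the variable names are mine, not part of the analysis.

```r
#reconstructing predicted reaction times from the fixed-effect estimates
intercept <- 816.093   #reference level: related pairs (myoy), average vocabulary
b_mnon    <- 35.583    #slope for the unrelated condition (mnon)
b_vst     <- -44.659   #slope per unit of centered vocabulary size

rt_related   <- intercept               #predicted RT, related pairs: ~816 ms
rt_unrelated <- intercept + b_mnon      #predicted RT, unrelated pairs: ~852 ms
#unrelated pairs for a speaker one unit above average vocabulary: ~807 ms
rt_unrelated_largevocab <- intercept + b_mnon + b_vst * 1
```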

You can see the two fixed effects plotted below. In the first plot we can see that reaction time increased for unrelated prime-target pairs (mnon) compared to related prime-target pairs (myoy).

plot(Effect("Condition", MS_glmer1a), main = "")

This second plot visualizes the effect of the centered vocabulary size on reaction time and shows that the further above average (that is above zero) a speaker’s vocabulary size is, the faster their reaction time becomes. Conversely, speakers with below average vocabulary sizes react more slowly the further their vocabulary is below average.

plot(Effect("VSTScore_sc", MS_glmer1a), main = "", xlab = "Vocabulary Size")

Let’s also have a look at the model’s random effects.

plot_model(MS_glmer1a, type = "re")

We find quite a bit of structure here. In the right panel we can see the individual intercept adjustments for all participants. Clearly, roughly half of the participants were generally slower than average (blue lines and dots, positive intercept adjustment), while the other half were faster than average, as evidenced by their negative intercept adjustments (red lines and dots).
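The “roughly half” impression can be checked numerically. This short sketch (not part of the original code) counts positive vs. negative intercept adjustments:

```r
#extract the per-participant intercept adjustments and count their signs
intercept_adjustments <- ranef(MS_glmer1a)$Participant[["(Intercept)"]]
table(intercept_adjustments > 0)
```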

The model also gives us individual slope adjustments for the effect of the unrelated mnon priming condition (left panel). Both the polarity (positive vs. negative) and the magnitude of these slope adjustments vary across participants. To take a closer look at this, I will calculate the final adjusted slope for each participant (the baseline slope of 35.583 from the fixed effects plus that participant’s individual slope adjustment) and then divide the results into groups based on how they compare to the overall slope for the unrelated condition in the model.

Let’s start by putting the random effects in a data frame and transforming it.

#extracting the random effects to an object
rancoefs_MS1a <- ranef(MS_glmer1a)

#turning the object into a data frame
rancoefs_MS1a_df <- as.data.frame(rancoefs_MS1a)
head(rancoefs_MS1a_df)
##        grpvar        term             grp    condval   condsd
## 1 Participant (Intercept) Participant_001  -77.71567 39.68280
## 2 Participant (Intercept) Participant_002  -60.16436 34.26855
## 3 Participant (Intercept) Participant_004 -142.71119 30.49334
## 4 Participant (Intercept) Participant_005  106.68063 44.86770
## 5 Participant (Intercept) Participant_006  -35.39117 38.08683
## 6 Participant (Intercept) Participant_007 -238.19989 29.98914
#we only need term, grp, and condval for our calculations
rancoefs_MS1a_df <- rancoefs_MS1a_df %>%
  select(term, grp, condval)

#checking my work
head(rancoefs_MS1a_df)
##          term             grp    condval
## 1 (Intercept) Participant_001  -77.71567
## 2 (Intercept) Participant_002  -60.16436
## 3 (Intercept) Participant_004 -142.71119
## 4 (Intercept) Participant_005  106.68063
## 5 (Intercept) Participant_006  -35.39117
## 6 (Intercept) Participant_007 -238.19989
#pivoting the df to wide format instead of long
rancoefs_MS1a_df <- rancoefs_MS1a_df %>% 
  pivot_wider(names_from = term, values_from = condval)

#checking my work
head(rancoefs_MS1a_df)
## # A tibble: 6 × 3
##   grp             `(Intercept)` Conditionmnon
##   <fct>                   <dbl>         <dbl>
## 1 Participant_001         -77.7        -15.3 
## 2 Participant_002         -60.2         -9.40
## 3 Participant_004        -143.           7.46
## 4 Participant_005         107.         -11.8 
## 5 Participant_006         -35.4          1.76
## 6 Participant_007        -238.         -26.3

Now we can create a new column that records the result of adding each participant’s individual Conditionmnon adjustment to the baseline slope of 35.583 from the fixed effects.

#creating new column with the individually adjusted slope
#(35.583 is the fixed-effect slope for Conditionmnon from the summary above)
rancoefs_MS1a_df <- rancoefs_MS1a_df %>%
  mutate(indiv_adjusted_slope_mnon = Conditionmnon + 35.583)

And let’s visualize the results.

rancoefs_MS1a_df %>%
ggplot(aes(
  x = fct_rev(fct_reorder(grp, indiv_adjusted_slope_mnon)), 
  y = indiv_adjusted_slope_mnon)) +
  geom_point(color = "#0868AC", size = 2) + 
  geom_segment(aes(
    x=fct_rev(fct_reorder(grp, indiv_adjusted_slope_mnon)),
    xend=fct_rev(fct_reorder(grp, indiv_adjusted_slope_mnon)), 
    y = 0, 
    yend = indiv_adjusted_slope_mnon),
                   color = "#0868AC",
                   linewidth = 1) +
  labs(x = "", y = "") +
  geom_hline(yintercept = 35.583,
             col = "red",
             linetype = "dashed",
             linewidth = 0.8) +
  annotate("text", 
           x = 110, 
           y = 45, 
           label = "35.583 Average Slope \nfrom Model") +
  ggtitle("Individually Adjusted Slopes \nFor Unrelated Condition Across All Speakers (N = 126)") +
  theme_classic() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text.x = element_blank(),
        text = element_text(size = 12))

The random effects align with the previous analyses and the results of the cluster analysis. The graph above shows that for roughly half of the speakers the final, individually adjusted slope is no greater than the average slope for the effect of the unrelated priming condition mnon in the model. I interpret this to mean that these speakers were slowed down by the unrelated condition, but not much more than is usual in the model. Strikingly, there are also six speakers on the very right-hand side of the graph whose adjusted slopes are negative: these are the speakers whose reaction time was actually decreased by the unrelated priming condition. The second major group in the graph are those speakers whose adjusted slope is well above the average slope in the model. These are the speakers who were substantially slowed down by the unrelated prime-target pairs.
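The grouping described above can also be made explicit in the data frame. The following is a sketch; the cut-offs and group labels are my own, chosen for illustration rather than taken from the original analysis.

```r
#classify each speaker's adjusted slope relative to the model's
#average slope for the unrelated condition (illustrative cut-offs)
rancoefs_MS1a_df <- rancoefs_MS1a_df %>%
  mutate(slope_group = case_when(
    indiv_adjusted_slope_mnon < 0       ~ "negative (faster when unrelated)",
    indiv_adjusted_slope_mnon <= 35.583 ~ "at or below average slope",
    TRUE                                ~ "above average slope"
  ))

#counting speakers per group
table(rancoefs_MS1a_df$slope_group)
```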

6 Summary & Conclusion

The results of this experiment show that individual speakers process morphologically complex words such as identifiable, celebration, or discriminatory differently. At the group level it is true that morphologically related prime-target pairs are reacted to faster, which indicates that they are processed faster. However, this is not true for every speaker. Some speakers do not show a faster reaction time when they are primed with a morphologically complex word (celebration) that already contains the target word (celebrate). From this I conclude that these speakers also do not tend to split morphologically complex words up into their component parts when they encounter them. Other speakers do show a faster reaction time with related prime-target pairs, though. These speakers likely do split up the complex words they encounter into their individual components.

These two different ways of processing morphologically complex words could potentially have consequences for the way speakers stress these words when pronouncing them. We could hypothesize that those speakers who show reaction-time facilitation with related prime-target pairs in this experiment process morphologically complex words in parts. When pronouncing such a word, these speakers would be aware of the base word hidden in the complex word and also take its stress into account. In other words: these speakers see the idéntify in identifiable and therefore say idéntifiable. Conversely, those speakers who do not react faster to the morphologically related prime-target pairs in this task likely do not process complex words in parts. They do not see the identify in identifiable right away and therefore likely say identifíable. I tested these hypotheses in my dissertation.