This tutorial provides an overview of different plotting techniques and statistical methods that can be used to perform a productivity analysis of linguistic constructions. First, it is shown how a dataset consisting of individual attestations (= rows) and variables (= columns) can give rise to different types of plots and analyses. In part 2, a dataset containing productivity measurements (= columns) per construction (= rows) is used as a starting point for further statistical analysis.
The first step is to import a CSV file with the toy dataset. The data that we are going to use here contain four variables:
data<-read.table(file="toy_dataset.csv", header=TRUE, sep=";", comment.char="", fill=TRUE, quote="", row.names=NULL, stringsAsFactors=FALSE, strip.white=TRUE, encoding = "UTF-8", blank.lines.skip = TRUE)
head(data)
Note that the first row of the CSV file is read as a header and that the strings (e.g. type_1, type_2, etc.) are read as character strings, not as factors.
In this first step, we will use three packages:
.libPaths("C:/R/library") # this line tells R where packages are stored on your PC: you might have to adjust this path
library('dplyr') ; library("ggplot2") ; library('zipfR')
We transform the data such that we obtain the frequency of each type per construction, as exemplified by the dataframe below.
tokens<-select(data, TYPE, CONSTRUCTION) # the relevant variables are selected
type_dataframe<-as.data.frame(table(tokens))
type_dataframe<-filter(type_dataframe, Freq!=0) # zero frequencies are filtered out of the dataframe
cx<-as.character(unique(type_dataframe[,"CONSTRUCTION"]))
(head(type_dataframe))
In the code below, a tfl object is created for each construction, which can be used to visualize a "Type-Frequency List" with the zipfR package.
list_dataframes<-list() # first, we have to set up an empty list
for (i in 1:length(cx)) { # loop
a<-filter(type_dataframe, CONSTRUCTION == cx[i]) # per construction
f<-a[,"Freq"]
type_freq<-tfl(f, k=1:length(f)) # f = vector with frequency of each type & k = integer vector of type IDs
list_dataframes[[i]]<-type_freq} # each construction and its frequency distribution is a different component of the list
head(list_dataframes[[3]])
k f
3 3 105
4 4 45
5 5 7
1 1 1
2 2 1
6 6 1
The Type-Frequency List ranks the different types according to decreasing frequency. Note that the symbol "N" stands for the number of tokens in the sample of the construction and "V" indicates the number of different types in the sample of that particular construction.
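Both quantities can be queried directly with the zipfR accessor functions (a quick check; the same accessors are used again for the summary table further below):
N(list_dataframes[[3]]) # number of tokens in the sample of the third construction
V(list_dataframes[[3]]) # number of different types in that sample
Now that we have all the information, we can plot the Type-Frequency List.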
type_freq<-plot(list_dataframes[[1]], list_dataframes[[2]], list_dataframes[[3]], list_dataframes[[4]], bw=TRUE, col=c("#e41a1c", "#377eb8", "#4daf4a", "#984ea3") , legend=cx, xlim=c(0,50), ylim=c(0,120), type="b")
Note that you can also use the information in the tfl object to create your own custom-made plot. In the plot below, types of similar frequency are clustered together and share the same colour. The horizontal dashed line corresponds to the mean frequency of the 10 most frequent types.
frequency<-sort(list_dataframes[[3]]$f, decreasing = TRUE) # sort frequencies of the types
dist_freq<-dist(frequency, method = "euclidean") # calculate Euclidean distance
clust_dist_freq <- hclust(dist_freq, method = "ward.D2") # hierarchical clustering of the types based on the distances between the frequencies
clusters<-cutree(clust_dist_freq, h=5) # cut the tree at a certain height h to determine a certain number of clusters
type_freq_2<-data.frame(seq_along(frequency), as.integer(frequency), as.factor(clusters)) # dataframe with information to be plotted (one row per type, ranked by frequency)
colnames(type_freq_2)<-c("rank", "frequency", "cluster") # change column names of the dataframe
mean_top_10 <- mean(frequency[1:10]) # calculate the mean of the top 10 frequencies
ggplot(data=type_freq_2, aes(x=rank, y=frequency, group=cluster)) + geom_point(aes(color = cluster), size = 3) + scale_color_manual(values=c("#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00", "#add8e6", "#a65628", "#000000")) + theme_classic() + guides(color = FALSE) + ggtitle("Type-Frequency List (Zipf ranking)") + theme(plot.title = element_text(hjust = 0.5)) + geom_hline(yintercept=mean_top_10, linetype = "dashed") + annotate(geom = "text", 90,mean_top_10+3,label = "mean of freq top 10 types") # points are colored according to cluster
Next, we create a Frequency spectrum plot based on the information of the tfl object.
list_dataframes_spec<-list()
for (i in 1:length(cx)) {
freq_spec<-tfl2spc(list_dataframes[[i]]) # the function tfl2spc from the zipfR package converts a type-frequency list into a frequency spectrum
list_dataframes_spec[[i]]<-freq_spec}
The Frequency spectrum plot below shows the frequency for V1 (hapax legomena), V2 (dis legomena) etc., per construction.
freq_spec<-plot(list_dataframes_spec[[1]], list_dataframes_spec[[2]], list_dataframes_spec[[3]], list_dataframes_spec[[4]], bw=TRUE, barcol=c("#e41a1c", "#377eb8", "#4daf4a", "#984ea3"), legend=cx, m.max=5)
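The values underlying this plot can also be inspected numerically: applying the zipfR accessor Vm to a frequency spectrum returns the number of types with a given sample frequency (a quick check):
Vm(list_dataframes_spec[[3]], 1) # V1: number of hapax legomena of the third construction
Vm(list_dataframes_spec[[3]], 2) # V2: number of dis legomena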
In this section, we construct an empirical vocabulary growth curve. The first chunk determines, for each construction, the number of different types observed after each successive token, up to the full sample size.
list_empirical_vgc<-list()
type_counts<-vector() # renamed from "c" to avoid masking the base function c()
for (i in 1:length(cx)) { # loop
a<-filter(tokens, CONSTRUCTION == cx[i]) # per construction
for (j in 1:length(a[,1])) { # inner loop over the tokens of this construction
b<-a$TYPE[1:j]
type_counts<-append(type_counts, length(unique(b))) # the number of types after j tokens (max of j = sample size for that construction) is determined
}
d<-c(1:length(a[,1]))
e<-vgc(N=d, V = type_counts) # vocabulary growth curve function from the zipfR package
#N = integer vector of sample sizes N for which vocabulary growth data is available
#V = vector of corresponding type frequencies
list_empirical_vgc[[i]]<-e
type_counts<-vector() # reset for the next construction
}
The next code chunk does the same for the hapax frequency.
list_empirical_vgc_V1<-list()
z<-vector()
for (i in 1:length(cx)) {
a<-filter(tokens, CONSTRUCTION == cx[i]) # per construction
for (j in 1:length(a[,1])) {
b1<-a$TYPE[1:j]
b2<-as.data.frame(table(b1))
b3<-filter(b2, Freq==1) # only types that occur once are taken into account
b4<-nrow(b3) # the number of hapaxes is counted
z<-append(z, b4)
}
d<-c(1:length(a[,1]))
e<-vgc(N=d, V = list_empirical_vgc[[i]]$V, Vm=data.frame(z))
# the information about the type frequencies (V) is extracted from "list_empirical_vgc" established in the chunk above
# Vm adds the hapax frequencies
list_empirical_vgc_V1[[i]]<-e
z<-vector()
}
Finally, we draw the empirical vocabulary growth curve for cx_1 and cx_4. The thinner lines correspond to the evolution of the hapax frequency.
empirical_vgc_with_V1<-plot(list_empirical_vgc_V1[[1]], list_empirical_vgc_V1[[4]], col=c("#e41a1c", "#377eb8"), legend=cx[c(1,4)], xlim=c(0,400), main = "Empirical Vocabulary Growth V1", add.m=1) # "add.m=1" adds the hapax frequencies
In addition to type frequency and hapax frequency, the hapax-type ratio is another productivity measure. Just as we traced the evolution of the type frequency and the hapax frequency across the sample, we can now do the same for the hapax-type ratio (together with two other ratio measures), using the ggplot2 package.
hapax_type_ratio_evolution <- list_empirical_vgc_V1[[2]]$V1 / list_empirical_vgc_V1[[2]]$V
hapax_token_evolution <- list_empirical_vgc_V1[[2]]$V1 / list_empirical_vgc_V1[[2]]$N
type_token_evolution <- list_empirical_vgc_V1[[2]]$V / list_empirical_vgc_V1[[2]]$N
n_tokens<-length(hapax_type_ratio_evolution) # sample size of this construction (400 in the toy dataset)
N<-rep(c(1:n_tokens), 3) # create a dataframe with three ratio measures
ratio_measures<-rep(c("hapax_type_ratio", "hapax_token_ratio", "type_token_ratio"), each= n_tokens)
ratio_overview<- c(hapax_type_ratio_evolution, hapax_token_evolution, type_token_evolution)
ratio_data_frame<-data.frame(N, ratio_measures, ratio_overview)
head(ratio_data_frame)
ggplot(data = ratio_data_frame, aes(x=N, y= ratio_overview)) + geom_point(shape=1, aes(colour=ratio_measures)) + theme_minimal() + xlab("sample size") + ylab("ratio values") + scale_color_manual(name="Ratios", labels=c("hapax token ratio","hapax type ratio","type token ratio"), values=c("#e41a1c", "#377eb8", "#4daf4a"))
The plot below displays the same information, but adds a smoother (cf. method = "loess"). The dashed horizontal line indicates the minimum value of the hapax-type ratio (HTR).
ggplot(data = ratio_data_frame, aes(x=N, y= ratio_overview)) + geom_smooth(size = 1.5, method = "loess", se = FALSE, aes(colour=ratio_measures)) + theme_minimal() + xlab("sample size") + ylab("ratio values") + scale_color_manual(name="Ratios", labels=c("hapax token ratio","hapax type ratio","type token ratio"), values=c("#e41a1c", "#377eb8", "#4daf4a")) + geom_hline(yintercept=min(hapax_type_ratio_evolution), linetype = "dashed")
This section demonstrates how to generate a dataframe with different productivity measures per construction.
summary_dataframe<-data.frame()
for (i in 1:length(cx)) {
summary_dataframe[i,1]<-cx[i]
summary_dataframe[i,2]<-N(list_dataframes[[i]]) # tokens
summary_dataframe[i,3]<-V(list_dataframes[[i]]) # types
summary_dataframe[i,4]<-Vm(list_dataframes[[i]],1) # hapax
summary_dataframe[i,5]<-Vm(list_dataframes[[i]],2) # dis legomena
summary_dataframe[i,6]<-summary_dataframe[i,3] / summary_dataframe[i,2] # type_token_ratio
summary_dataframe[i,7]<-summary_dataframe[i,4] / summary_dataframe[i,2] # hapax_token_ratio
summary_dataframe[i,8]<-summary_dataframe[i,4] / summary_dataframe[i,3] # hapax_type_ratio
summary_dataframe[i,9]<-summary_dataframe[i,5] / summary_dataframe[i,3] # dis_type_ratio
summary_dataframe[i,10]<-attr(list_dataframes[[i]], "f.max") # token_freq of type with highest token_freq
}
colnames(summary_dataframe)<-c("CONSTRUCTION", "token_freq_sample", "type_freq", "hapax_freq", "dis_legomena_freq", "type_token_ratio", "hapax_token_ratio", "hapax_type_ratio", "dis_type_ratio", "highest_token_freq_within_types")
summary_dataframe
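As an aside, the frequency columns of this table can also be obtained more compactly by applying the zipfR accessors across the list of tfl objects (a sketch; the ratio columns then follow from the same divisions as above):
summary_alt<-data.frame(CONSTRUCTION = cx,
token_freq_sample = sapply(list_dataframes, N), # tokens per construction
type_freq = sapply(list_dataframes, V), # types per construction
hapax_freq = sapply(list_dataframes, Vm, 1)) # hapaxes per construction
summary_alt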
It can be interesting to evaluate productivity measurements per category (morpho-syntactic, semantic, etc.). The code chunk below computes the token, type, and hapax frequencies for each combination of construction and category.
# token freq per CAT
first_selection<-select(data, CONSTRUCTION, CAT)
dataframe_per_cat<-as.data.frame(table(first_selection))
dataframe_per_cat$CX_CAT <- paste(dataframe_per_cat$CONSTRUCTION, dataframe_per_cat$CAT, sep = "+") # add variable CONSTRUCTION_CAT
dataframe_per_cat <-dataframe_per_cat[,c(-1:-2)]
dataframe_per_cat<-filter(dataframe_per_cat, Freq!=0)
colnames(dataframe_per_cat)<-c("token_freq","CONSTRUCTION_CAT")
# type freq per CAT
second_selection<-select(data, CONSTRUCTION, CAT, TYPE)
dataframe_per_cat_2<-as.data.frame(table(second_selection))
dataframe_per_cat_2$CX_CAT <- paste(dataframe_per_cat_2$CONSTRUCTION, dataframe_per_cat_2$CAT, sep = "+") # add variable CONSTRUCTION_CAT
dataframe_per_cat_2 <-dataframe_per_cat_2[,c(-1:-2)]
dataframe_per_cat_2<-filter(dataframe_per_cat_2, Freq!=0)
type_count<-as.data.frame(table(dataframe_per_cat_2$CX_CAT))
type_count$Var1<-as.character(type_count$Var1)
colnames(type_count)<-c("CONSTRUCTION_CAT", "type_freq")
# hapax freq per CAT
dataframe_per_cat_3<-filter(dataframe_per_cat_2, Freq==1)
hapax_count<-as.data.frame(table(dataframe_per_cat_3$CX_CAT))
colnames(hapax_count)<-c("CONSTRUCTION_CAT", "hapax_freq")
# add a zero hapax count for those combinations CONSTRUCTION-CAT that have no hapaxes, only types with freq > 1
if(length(hapax_count$hapax_freq) != length(type_count$type_freq)) {
missing_values<-setdiff(type_count$CONSTRUCTION_CAT, hapax_count$CONSTRUCTION_CAT)
extra <-data.frame(missing_values, rep(0,length(missing_values)))
colnames(extra)<-c("CONSTRUCTION_CAT","hapax_freq")
hapax_count <- rbind(hapax_count, extra)}
# merge different dataframes
overview_data<-merge(dataframe_per_cat, type_count, by ="CONSTRUCTION_CAT", all=TRUE)
overview_data<-merge(overview_data, hapax_count, by="CONSTRUCTION_CAT", all=TRUE)
overview_data$CONSTRUCTION<-unlist(lapply(strsplit(overview_data$CONSTRUCTION_CAT, "+", fixed=T), function (x) {x[1]})) # add again the variable CONSTRUCTION on the basis of CONSTRUCTION_CAT
head(overview_data)
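As an aside, the same three counts per combination of construction and category can be computed in a single dplyr pipeline (a sketch, assuming the column names used above; the repair step for zero hapax counts then becomes unnecessary, because every group is summarised directly):
overview_alt<-data %>% group_by(CONSTRUCTION, CAT) %>% summarise(token_freq = n(), type_freq = n_distinct(TYPE), hapax_freq = sum(table(TYPE) == 1), .groups = "drop") # hapax_freq counts the types that occur exactly once within the group
head(overview_alt)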
This information can now be visualized in the graph below. The x-axis corresponds to the within-category type frequency, whereas the y-axis corresponds to the within-category hapax frequency. The size of the dots represents the within-category token frequency.
ggplot(data = overview_data, aes(x=type_freq, y= hapax_freq, group = CONSTRUCTION)) + labs(x = "type frequency", y = "hapax frequency", title = "Productivity measures within categories", size = "token frequency") + geom_point(aes(size = token_freq, color = CONSTRUCTION), position = "jitter", alpha = 1/2) + theme_minimal() + scale_color_manual(name="construction", labels=c("cx 1","cx 2","cx 3", "cx 4"), values=c("#e41a1c", "#377eb8", "#4daf4a", "#984ea3")) # for geom_point, both "size" and "color" are within the scope of "aes" (aesthetic mappings)
Finally, we can construct an LNRE model to make predictions beyond the attested sample size.
lnre_1<-lnre("fzm", list_dataframes_spec[[3]], exact=TRUE) # fzm = finite Zipf-Mandelbrot
summary(lnre_1)
finite Zipf-Mandelbrot LNRE model.
Parameters:
Shape: alpha = 0.7302146
Lower cutoff: A = 1.388456e-19
Upper cutoff: B = 0.819483
[ Normalization: C = 0.284674 ]
Population size: S = 2.295597e+13
Sampling method: Poisson, with exact calculations.
Parameters estimated from sample of size N = 400:
Goodness-of-fit (multivariate chi-squared test):
vgc_lnre <- lnre.vgc(lnre_1, (1:1000), variances = TRUE) # lnre.vgc computes expected vocabulary growth curves according to an LNRE model
fzm.spc <- lnre.spc(lnre_1, N(lnre_1)) # lnre.spc computes the expected frequency spectrum of a LNRE model at specified sample size N
The aspects coloured in red in the two plots below correspond to what the model predicts.
plot(list_dataframes_spec[[3]], fzm.spc, legend=c("observed", "fZM"), main = "Frequency Spectrum")
plot(list_empirical_vgc[[3]], vgc_lnre, N0=N(lnre_1), legend=c("observed", "fZM"), xlim= c(0,1000), main = "Vocabulary Growth")
As illustrated by the plots below, the predicted vocabulary growth changes slightly if we base the predictions on only the first 300 tokens of the original 400-token sample.
data_2<-select(data, TYPE, CONSTRUCTION)
reduced<-filter(data_2, CONSTRUCTION == "cx_3")
reduced<-reduced[1:300,]
reduced_2<-as.data.frame(table(reduced))
reduced_2<-filter(reduced_2, Freq!=0)
reduced_3<-reduced_2[,"Freq"]
reduced_4<-tfl(reduced_3, k=1:length(reduced_3))
reduced_5<-tfl2spc(reduced_4)
type_counts<-vector()
for (j in 1:length(reduced[,1])) {
b<-reduced$TYPE[1:j]
type_counts<-append(type_counts, length(unique(b)))}
d<-c(1:length(reduced[,1]))
reduced_6<-vgc(N=d, V = type_counts)
lnre_2<-lnre("fzm", reduced_5, exact=TRUE)
vgc_lnre_2 <- lnre.vgc(lnre_2, (1:1000), variances = TRUE)
plot(vgc_lnre, vgc_lnre_2, legend=c("fZM_400", "fZM_300"), xlim= c(0,1000), main = "Vocabulary Growth")
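The divergence can also be quantified numerically with EV, the zipfR function that returns the expected vocabulary size under an LNRE model at a given sample size (a quick check):
EV(lnre_1, 1000) # expected number of types at N = 1000, model estimated on 400 tokens
EV(lnre_2, 1000) # expected number of types at N = 1000, model estimated on 300 tokens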
For this reason, we have to be careful about prediction accuracy far beyond the attested sample size.
This second part aims to analyze a somewhat larger set of constructions, accompanied by a series of productivity measurements.
(data_bis<-read.table(file="toy_dataset_productivity_measures.csv", header=TRUE, sep=";", comment.char="", fill=TRUE, na.strings = "character", quote="", row.names=NULL, stringsAsFactors=FALSE, strip.white=TRUE, encoding = "UTF-8", blank.lines.skip = TRUE))
In addition to more traditional productivity measures, a series of measurements that aim to describe the summit of the frequency distribution are also added: for example, the frequency of the most frequent type (token_freq_max), the mean frequency of the ten most frequent types (avg_summit_10), and the difference between the frequencies at the first and the second rank (diff_r1_r2, with a normalized variant diff_r1_r2_norm).
We load some R packages:
.libPaths("C:/R/library")
library('FactoMineR', quietly = TRUE) ; library('factoextra', quietly = TRUE) ; library("corrplot", quietly = TRUE) ; library("PerformanceAnalytics", quietly = TRUE) ; library("MASS", quietly = TRUE) ; library("effects", quietly = TRUE) ; library("vcd", quietly = TRUE) ; library("car", quietly = TRUE) ; library("performance", quietly = TRUE) ; library("DHARMa", quietly = TRUE) ; library("partykit", quietly = TRUE) ; library("rpart", quietly = TRUE) ; library("rpart.plot", quietly = TRUE)
Before we tackle the PCA, let's take a look at the correlations between the productivity measurements:
correlations<-round(cor(data_bis[,-1:-3]),2)
corrplot(correlations, type="upper", order="hclust", tl.col="black", tl.srt=45)
chart.Correlation(data_bis[,-1:-3], histogram=F, pch=19)
Principal Component Analysis (henceforth PCA) is a dimension reduction technique that can be applied by means of the R package FactoMineR. Note that, by default, the variables are standardized (cf. scale.unit = TRUE), prior to the transformation of the variables into principal components.
data_prod<-data_bis
row.names(data_prod)<-data_prod[,1]
data_prod<-data_prod[,c(-1:-3)]
PCA_prod<-PCA(data_prod, graph = FALSE, scale.unit = TRUE)
summary(PCA_prod)
Call:
PCA(X = data_prod, scale.unit = TRUE, graph = FALSE)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 Dim.8 Dim.9 Dim.10
Variance 6.099 3.002 0.558 0.287 0.027 0.016 0.008 0.003 0.000 0.000
% of var. 60.987 30.015 5.577 2.875 0.274 0.156 0.082 0.034 0.000 0.000
Cumulative % of var. 60.987 91.002 96.579 99.454 99.727 99.883 99.965 100.000 100.000 100.000
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
cx_1 | 1.672 | -1.319 2.194 0.622 | -0.195 0.097 0.014 | 0.884 10.787 0.280 |
cx_2 | 2.645 | -2.518 7.997 0.907 | -0.147 0.056 0.003 | 0.494 3.362 0.035 |
cx_3 | 2.991 | -2.377 7.126 0.631 | 1.408 5.082 0.222 | -1.082 16.157 0.131 |
cx_4 | 3.192 | -2.447 7.555 0.588 | 1.512 5.856 0.224 | 0.693 6.630 0.047 |
cx_5 | 3.378 | 2.525 8.041 0.559 | -2.137 11.699 0.400 | -0.590 4.803 0.031 |
cx_6 | 3.055 | 1.800 4.087 0.347 | -2.435 15.195 0.635 | -0.045 0.027 0.000 |
cx_7 | 3.057 | -0.528 0.351 0.030 | -2.666 18.220 0.761 | 1.352 25.203 0.196 |
cx_8 | 2.326 | -2.153 5.844 0.857 | 0.178 0.082 0.006 | -0.694 6.643 0.089 |
cx_9 | 4.712 | 4.181 22.050 0.787 | 2.114 11.451 0.201 | 0.178 0.436 0.001 |
cx_10 | 4.231 | 4.126 21.475 0.951 | -0.328 0.276 0.006 | -0.583 4.690 0.019 |
Variables
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
type_freq | -0.927 14.083 0.859 | 0.371 4.587 0.138 | -0.017 0.051 0.000 |
hapax_freq | -0.879 12.661 0.772 | 0.466 7.228 0.217 | -0.077 1.072 0.006 |
dis_legomena_freq | -0.916 13.768 0.840 | -0.070 0.165 0.005 | 0.342 21.012 0.117 |
hapax_type_ratio | -0.657 7.077 0.432 | 0.717 17.150 0.515 | -0.010 0.017 0.000 |
dis_type_ratio | -0.023 0.009 0.001 | -0.865 24.953 0.749 | 0.487 42.440 0.237 |
token_freq_max | 0.935 14.326 0.874 | 0.330 3.625 0.109 | 0.056 0.572 0.003 |
avg_summit_10 | 0.989 16.036 0.978 | -0.103 0.354 0.011 | -0.067 0.809 0.005 |
diff_max_avg_10 | 0.868 12.342 0.753 | 0.461 7.067 0.212 | 0.096 1.646 0.009 |
diff_r1_r2 | 0.762 9.531 0.581 | 0.590 11.602 0.348 | 0.165 4.881 0.027 |
diff_r1_r2_norm | 0.101 0.167 0.010 | 0.836 23.269 0.698 | 0.392 27.500 0.153 |
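As a sanity check on the standardization, the eigenvalues can be compared with those obtained via base R's prcomp (a sketch: with scale.unit = TRUE, both approaches diagonalize the correlation matrix, so the leading eigenvalues should agree):
manual_pca<-prcomp(data_prod, center = TRUE, scale. = TRUE)
manual_pca$sdev[1]^2 # variance of the first principal component
PCA_prod$eig[1, 1] # first eigenvalue reported by FactoMineR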
As illustrated by the following scree plot, the first two dimensions capture most of the original variance in the data.
screeplot<-barplot(PCA_prod$eig[,2], names = 1:nrow(PCA_prod$eig), xlab = "components", ylab = "Percentage of explained variance")
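The factoextra package loaded above offers a ready-made alternative for this scree plot:
fviz_screeplot(PCA_prod, addlabels = TRUE) # percentages of explained variance printed on the bars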
The fact that the first two dimensions capture most of the variability is also reflected in the PCA coordinates of the variables: the coordinates on the first dimensions show a broader IQR than those on the subsequent dimensions:
var <- get_pca_var(PCA_prod)
boxplot(var$coord)
The PCA enables you to inspect the most important dimensions of variability from two different perspectives, namely the "variables" (i.e. the productivity measures) and the "individuals" (i.e. the constructions). In the graph below, the productivity measures are represented with arrows in the "correlation circle".
Variables whose arrows do not reach the correlation circle are represented with lower quality. The coordinates at the tip of each arrow in the two-dimensional plane correspond to the correlations of that variable with the dimension on the x-axis and the dimension on the y-axis, respectively.
(variables_map<-plot(PCA_prod, choix = "var", cex = 0.6))
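This correspondence can be verified directly (a quick check: with scale.unit = TRUE, a variable's coordinate on a dimension equals its correlation with the scores of the individuals on that dimension):
cor(data_prod$type_freq, PCA_prod$ind$coord[, 1]) # correlation of type_freq with dimension 1
PCA_prod$var$coord["type_freq", 1] # coordinate of type_freq on dimension 1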
The dimdesc function is a convenient way to establish the correlation of each variable with each PCA dimension, together with a p-value.
cat("\n Dimension 1 \n") ; dimdesc(PCA_prod)$Dim.1$quanti
Dimension 1
correlation p.value
avg_summit_10 0.9889440 1.809998e-10
token_freq_max 0.9347225 2.841739e-06
diff_max_avg_10 0.8675979 1.216489e-04
diff_r1_r2 0.7624002 2.444631e-03
hapax_type_ratio -0.6569517 1.470595e-02
hapax_freq -0.8787080 7.681050e-05
dis_legomena_freq -0.9163397 1.072986e-05
type_freq -0.9267635 5.267762e-06
cat("\n Dimension 2 \n ") ; dimdesc(PCA_prod)$Dim.2$quanti
Dimension 2
correlation p.value
diff_r1_r2_norm 0.8357135 0.0003737159
hapax_type_ratio 0.7174699 0.0057633984
diff_r1_r2 0.5901266 0.0337440216
dis_type_ratio -0.8654216 0.0001324813
Next to the graph of the variables, we can also construct a graph of individuals, in which the constructions are positioned with respect to the two most important PCA dimensions:
plot(PCA_prod, cex = 0.8, col.ind = "black", choix = c("ind"))
The plot can be further enhanced by colouring the constructions according to the variable TYPE_CX. The barycentres of the two types of constructions are also added to the plot and confidence ellipses are drawn around the barycentres.
data_prod_2<-data_bis
row.names(data_prod_2)<-data_prod_2[,1]
data_prod_2<-data_prod_2[,c(-1,-3)]
PCA_prod_2<-PCA(data_prod_2, graph = FALSE, scale.unit = TRUE, quali.sup = 1)
plotellipses(PCA_prod_2, cex = 0.8, col.ind = "black", choix = c("ind"), habillage = 1)
It is also possible to cluster the individuals within the two-dimensional PCA space:
data_prod_3<-data_bis
row.names(data_prod_3)<-data_prod_3[,1]
data_prod_3<-data_prod_3[,c(-1:-3)]
PCA_prod_3<-PCA(data_prod_3, graph = FALSE, scale.unit = T, ncp = 2)
res.hcpc <- HCPC(PCA_prod_3, nb.clust = -1, graph = FALSE, method = "ward", metric = "euclidean") # min = 2, max = 10
plot(res.hcpc)
plot(res.hcpc, choice ="map")
Finally, we can try to combine the "variable view" and the "individual view" into one unified biplot:
fviz_pca_biplot(PCA_prod)
Another potentially interesting way to gain insight into the data is to construct a tree regression model. The following tree demonstrates that the difference between the first and second rank (cf. diff_r1_r2) only makes a difference in predicting the hapax frequency for constructions with a large value for avg_summit_10.
tree_model_1<-rpart(hapax_freq ~ avg_summit_10 + diff_r1_r2, data = data_bis, method = "poisson", control = rpart.control(minsplit = 4))
printcp(tree_model_1)
Rates regression tree:
rpart(formula = hapax_freq ~ avg_summit_10 + diff_r1_r2, data = data_bis,
method = "poisson", control = rpart.control(minsplit = 4))
Variables actually used in tree construction:
[1] avg_summit_10 diff_r1_r2
Root node error: 761.82/13 = 58.602
n= 13
CP nsplit rel error xerror xstd
1 0.792650 0 1.000000 1.19896 0.29301
2 0.132581 1 0.207350 0.71054 0.28216
3 0.022071 2 0.074769 0.54311 0.27728
4 0.013246 3 0.052698 0.40769 0.17530
5 0.010000 4 0.039452 0.37189 0.15473
prp(tree_model_1, varlen = 7)
plot(as.party(tree_model_1))
Contrary to the previous tree, this new model selects only the variable avg_summit_10 to split the data.
tree_model_2<-glmtree(hapax_freq ~ avg_summit_10 * diff_r1_r2, data = data_bis, epsilon =1, alpha = 0.2, minsize = 1, family = "poisson")
plot(tree_model_2)
In this last part, we attempt to establish a negative binomial regression model, suited for count data. As outcome variable, the productivity measurement hapax frequency is chosen. The explanatory variables are avg_summit_10 and diff_r1_r2. Of course, this model is only based on a limited number of constructions: ideally, we would have more data points to base the estimation on.
mean(data_bis$hapax_freq)
[1] 169.0769
var(data_bis$hapax_freq)
[1] 9264.91
Clearly, the variance of the variable "hapax frequency" far exceeds the mean, which makes a Poisson model less appropriate.
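A quick way to corroborate this is to fit the analogous Poisson GLM and compare its residual deviance to the residual degrees of freedom (a sketch; a ratio well above 1 points to overdispersion):
poisGLM<-glm(hapax_freq ~ avg_summit_10 + diff_r1_r2, data = data_bis, family = poisson)
deviance(poisGLM) / df.residual(poisGLM) # values well above 1 indicate overdispersion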
Let's now construct the model. Since the sample sizes are equal, we do not include an "offset" term (see the sketch after the model summary for how one could be added with unequal sample sizes).
nbGLM <- glm.nb(hapax_freq ~ avg_summit_10 + diff_r1_r2, data=data_bis)
summary(nbGLM)
Call:
glm.nb(formula = hapax_freq ~ avg_summit_10 + diff_r1_r2, data = data_bis,
init.theta = 218.4165792, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3703 -0.4386 0.2280 0.6017 1.6944
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.005223 0.051408 116.81 < 2e-16 ***
avg_summit_10 -0.103583 0.005803 -17.85 < 2e-16 ***
diff_r1_r2 0.017756 0.002203 8.06 7.6e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(218.4166) family taken to be 1)
Null deviance: 465.070 on 12 degrees of freedom
Residual deviance: 13.619 on 10 degrees of freedom
AIC: 116.27
Number of Fisher Scoring iterations: 1
Theta: 218
Std. Err.: 226
2 x log-likelihood: -108.273
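For completeness, here is a hypothetical sketch of how an exposure offset could be added if the sample sizes differed per construction (the column token_freq_sample is not part of data_bis and only serves as an illustration):
# nbGLM_offset<-glm.nb(hapax_freq ~ avg_summit_10 + diff_r1_r2 + offset(log(token_freq_sample)), data = data_bis) # hypothetical: token_freq_sample would hold the token count per construction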
Comparison with alternative models:
nbGLM.null <- glm.nb(hapax_freq ~ 1, data=data_bis)
nbGLM.2 <- glm.nb(hapax_freq ~ avg_summit_10, data=data_bis)
nbGLM.4 <- glm.nb(hapax_freq ~ avg_summit_10 * diff_r1_r2, data=data_bis)
anova(nbGLM.null, nbGLM.2, nbGLM, nbGLM.4, test="Chisq")
Likelihood ratio tests of Negative Binomial Models
Response: hapax_freq
Model theta Resid. df 2 x log-lik. Test df LR stat. Pr(Chi)
1 1 2.360924 12 -155.2185
2 avg_summit_10 12.467794 11 -133.8432 1 vs 2 1 21.375309 3.776029e-06
3 avg_summit_10 + diff_r1_r2 218.416579 10 -108.2730 2 vs 3 1 25.570218 4.265728e-07
4 avg_summit_10 * diff_r1_r2 312.734505 9 -106.9983 3 vs 4 1 1.274685 2.588895e-01
The rootogram below indicates to what extent the expected counts correspond to the observed counts:
fitted_hapax_freq<-fitted(nbGLM)
observed_hapax_freq<-data_bis$hapax_freq
rootogram(observed_hapax_freq, fitted_hapax_freq)
In order to interpret the model, we draw an effect plot. The first plot seems to give us unrealistic predictions for the lower values of avg_summit_10:
plot(Effect(focal.predictors = c("avg_summit_10", "diff_r1_r2"), mod = nbGLM, x.var = "avg_summit_10", xlevels=list(avg_summit_10=c(0:30), diff_r1_r2=c(5,30,50))), type="response", lines=list(multiline=TRUE, col=c("blue", "black", "green")), confint=list(style="bars"), ylab="Predicted hapax frequency", grid=TRUE)
However, when we focus on the larger values of avg_summit_10, it again becomes apparent that higher values of diff_r1_r2 lead to a higher predicted hapax frequency, as we already observed in section 2 above. It must be stressed that this claim is based on a relatively limited dataset.
plot(Effect(focal.predictors = c("avg_summit_10", "diff_r1_r2"), mod = nbGLM, x.var = "avg_summit_10", xlevels=list(avg_summit_10=c(15:30), diff_r1_r2=c(5,30,50))), type="response", lines=list(multiline=TRUE, col=c("blue", "black", "green")), confint=list(style="bars"), ylab="Predicted hapax frequency", grid=TRUE)
The plot above can also be visualized as follows (but the message remains the same):
plot(Effect(focal.predictors = c("avg_summit_10", "diff_r1_r2"), mod = nbGLM, x.var = "diff_r1_r2", xlevels=list(avg_summit_10=c(20,30), diff_r1_r2=c(1:60))), type="response", lines=list(multiline=TRUE, col=c("blue", "black")), confint=list(style="bars"), ylab="Predicted hapax frequency", grid=TRUE)
In this last section, we evaluate the model by means of some model diagnostics. First, we inspect the residuals:
plot(rstandard(nbGLM, type="pearson") ~ fitted(nbGLM), xlab = "fitted values", ylab="Standardized Pearson Residuals")
abline(h=0)
testDispersion(nbGLM)
DHARMa nonparametric dispersion test via sd of residuals fitted vs. simulated
data: simulationOutput
ratioObsSim = 0.97167, p-value = 0.928
alternative hypothesis: two.sided
simulationOutput<-simulateResiduals(fittedModel = nbGLM, plot = F)
plot(simulationOutput, quantreg = F)
We also check for influential data points:
influencePlot(nbGLM)
The following code checks for multicollinearity, heteroscedasticity and autocorrelation:
check_collinearity(nbGLM)
# Check for Multicollinearity

Low Correlation

 Parameter  VIF Increased SE
 avg_summit_10 2.11 1.45
 diff_r1_r2 2.11 1.45
check_heteroscedasticity(nbGLM)
OK: Error variance appears to be homoscedastic (p = 0.565).
check_autocorrelation(nbGLM)
Warning: Autocorrelated residuals detected (p = 0.000).
To conclude, we cite the R packages used in this tutorial:
print(citation('dplyr'), bibtex = FALSE)
To cite package ‘dplyr’ in publications use:
Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2020). dplyr: A Grammar of Data Manipulation. R package version 1.0.0.
https://CRAN.R-project.org/package=dplyr
print(citation('zipfR'), bibtex = FALSE)
To cite package 'zipfR' in publications use:
Evert, Stefan and Baroni, Marco (2007). "zipfR: Word frequency distributions in R." In Proceedings of the 45th Annual Meeting of the
Association for Computational Linguistics, Posters and Demonstrations Sessions, pages 29-32, Prague, Czech Republic. (R package version
0.6-66 of 2019-09-30)
Thanks for crediting our work! We hope that you enjoy using the package and would like to hear your comments, criticism and suggestions for
future enhancements.
print(citation("ggplot2"), bibtex = FALSE)
To cite ggplot2 in publications, please use:
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
print(citation('FactoMineR'), bibtex = FALSE)
To cite FactoMineR in publications use:
Sebastien Le, Julie Josse, Francois Husson (2008). FactoMineR: An R Package for Multivariate Analysis. Journal of Statistical Software,
25(1), 1-18. 10.18637/jss.v025.i01
print(citation('factoextra'), bibtex = FALSE)
To cite package ‘factoextra’ in publications use:
Alboukadel Kassambara and Fabian Mundt (2020). factoextra: Extract and Visualize the Results of Multivariate Data Analyses. R package
version 1.0.7. https://CRAN.R-project.org/package=factoextra
print(citation("corrplot"), bibtex = FALSE)
To cite corrplot in publications use:
Taiyun Wei and Viliam Simko (2017). R package "corrplot": Visualization of a Correlation Matrix (Version 0.84). Available from
https://github.com/taiyun/corrplot
print(citation("PerformanceAnalytics"), bibtex = FALSE)
To cite package ‘PerformanceAnalytics’ in publications use:
Brian G. Peterson and Peter Carl (2020). PerformanceAnalytics: Econometric Tools for Performance and Risk Analysis. R package version
2.0.4. https://CRAN.R-project.org/package=PerformanceAnalytics
print(citation("MASS"), bibtex = FALSE)
To cite the MASS package in publications use:
Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
print(citation("effects"), bibtex = FALSE)
To cite effects in publications use:
John Fox and Sanford Weisberg (2019). An R Companion to Applied Regression, 3rd Edition. Thousand Oaks, CA <http://tinyurl.com/carbook>
For predictor effects or partial residuals also cite:
John Fox, Sanford Weisberg (2018). Visualizing Fit and Lack of Fit in Complex Regression Models with Predictor Effect Plots and Partial
Residuals. Journal of Statistical Software, 87(9), 1-27. URL https://www.jstatsoft.org/v087/i09.
For generalized linear models also cite:
John Fox (2003). Effect Displays in R for Generalised Linear Models. Journal of Statistical Software, 8(15), 1-27. URL
http://www.jstatsoft.org/v08/i15/.
For usage in multinomial and proportional-odds logit models also cite:
John Fox, Jangman Hong (2009). Effect Displays in R for Multinomial and Proportional-Odds Logit Models: Extensions to the effects Package.
Journal of Statistical Software, 32(1), 1-24. URL http://www.jstatsoft.org/v32/i01/.
print(citation("vcd"), bibtex = FALSE)
To cite package vcd in publications use:
David Meyer, Achim Zeileis, and Kurt Hornik (2020). vcd: Visualizing Categorical Data. R package version 1.4-7.
To cite the strucplot framework (e.g., functions mosaic(), sieve(), assoc(), strucplot(), structable(), pairs.table(), cotabplot(),
doubledecker()), additionally use:
David Meyer, Achim Zeileis, and Kurt Hornik (2006). The Strucplot Framework: Visualizing Multi-Way Contingency Tables with vcd. Journal of
Statistical Software, 17(3), 1-48. URL http://www.jstatsoft.org/v17/i03/
If you use the residual-based shadings (in mosaic() or assoc()), please cite:
Achim Zeileis, David Meyer, and Kurt Hornik (2007). Residual-based Shadings for Visualizing (Conditional) Independence. Journal of
Computational and Graphical Statistics, 16(3), 507-525.
print(citation("car"), bibtex = FALSE)
To cite the car package in publications use:
John Fox and Sanford Weisberg (2019). An {R} Companion to Applied Regression, Third Edition. Thousand Oaks CA: Sage. URL:
https://socialsciences.mcmaster.ca/jfox/Books/Companion/
print(citation("performance"), bibtex = FALSE)
To cite package ‘performance’ in publications use:
Daniel Lüdecke, Dominique Makowski, Philip Waggoner and Indrajeet Patil (2020). performance: Assessment of Regression Models Performance. R
package version 0.4.7. https://CRAN.R-project.org/package=performance
print(citation("DHARMa"), bibtex = FALSE)
To cite package ‘DHARMa’ in publications use:
Florian Hartig (2020). DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models. R package version 0.3.2.0.
https://CRAN.R-project.org/package=DHARMa
print(citation("partykit"), bibtex = FALSE)
To cite partykit in publications use:
Torsten Hothorn, Achim Zeileis (2015). partykit: A Modular Toolkit for Recursive Partytioning in R. Journal of Machine Learning Research,
16, 3905-3909. URL http://jmlr.org/papers/v16/hothorn15a.html
If ctree() is used additionally cite:
Torsten Hothorn, Kurt Hornik and Achim Zeileis (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of
Computational and Graphical Statistics, 15(3), 651--674, DOI: 10.1198/106186006X133933
If mob() is used additionally cite:
Achim Zeileis, Torsten Hothorn and Kurt Hornik (2008). Model-Based Recursive Partitioning. Journal of Computational and Graphical
Statistics, 17(2), 492--514, DOI: 10.1198/106186008X319331
print(citation("rpart"), bibtex = FALSE)
To cite package ‘rpart’ in publications use:
Terry Therneau and Beth Atkinson (2019). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-15.
https://CRAN.R-project.org/package=rpart
print(citation("rpart.plot"), bibtex = FALSE)
To cite package ‘rpart.plot’ in publications use:
Stephen Milborrow (2019). rpart.plot: Plot 'rpart' Models: An Enhanced Version of 'plot.rpart'. R package version 3.0.8.
https://CRAN.R-project.org/package=rpart.plot
ATTENTION: This citation information has been auto-generated from the package DESCRIPTION file and may need manual editing, see
‘help("citation")’.
sessionInfo()
R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=Dutch_Belgium.1252 LC_CTYPE=Dutch_Belgium.1252 LC_MONETARY=Dutch_Belgium.1252 LC_NUMERIC=C LC_TIME=Dutch_Belgium.1252
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] rpart.plot_3.0.8 rpart_4.1-15 partykit_1.2-8 mvtnorm_1.1-1 libcoin_1.0-5
[6] DHARMa_0.3.2.0 performance_0.4.7 car_3.0-8 vcd_1.4-7 effects_4.1-4
[11] carData_3.0-4 MASS_7.3-51.6 PerformanceAnalytics_2.0.4 xts_0.12-0 zoo_1.8-8
[16] corrplot_0.84 factoextra_1.0.7 FactoMineR_2.3 zipfR_0.6-66 ggplot2_3.3.2
[21] dplyr_1.0.0
loaded via a namespace (and not attached):
[1] nlme_3.1-148 insight_0.8.5 tools_4.0.1 backports_1.1.7 R6_2.4.1 DBI_1.1.0 mgcv_1.8-31
[8] colorspace_1.4-1 nnet_7.3-14 withr_2.2.0 tidyselect_1.1.0 curl_4.3 compiler_4.0.1 flashClust_1.01-2
[15] labeling_0.3 bayestestR_0.7.0 scales_1.1.1 lmtest_0.9-37 quadprog_1.5-8 digest_0.6.25 foreign_0.8-80
[22] minqa_1.2.4 rio_0.5.16 pkgconfig_2.0.3 lme4_1.1-23 rlang_0.4.6 readxl_1.3.1 rstudioapi_0.11
[29] farver_2.0.3 generics_0.0.2 zip_2.0.4 magrittr_1.5 Formula_1.2-3 leaps_3.1 Matrix_1.2-18
[36] Rcpp_1.0.4.6 munsell_0.5.0 abind_1.4-5 lifecycle_0.2.0 scatterplot3d_0.3-41 stringi_1.4.6 yaml_2.2.1
[43] inum_1.0-1 ggrepel_0.8.2 forcats_0.5.0 crayon_1.3.4 lattice_0.20-41 haven_2.3.1 splines_4.0.1
[50] hms_0.5.3 knitr_1.29 pillar_1.4.4 ggpubr_0.3.0 boot_1.3-25 estimability_1.3 ggsignif_0.6.0
[57] codetools_0.2-16 glue_1.4.1 mitools_2.4 data.table_1.12.8 vctrs_0.3.1 nloptr_1.2.2.1 foreach_1.5.0
[64] cellranger_1.1.0 gtable_0.3.0 purrr_0.3.4 tidyr_1.1.0 xfun_0.15 openxlsx_4.1.5 broom_0.5.6
[71] survey_4.0 rstatix_0.6.0 survival_3.1-12 tibble_3.0.1 iterators_1.0.12 ellipse_0.4.2 cluster_2.1.0
[78] statmod_1.4.34 ellipsis_0.3.1 gap_1.2.2