Correlation coefficient and correlation test in R

Antoine Soetewey 2020-05-28 15 minute read

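Introduction

Correlations between variables play an important role in a descriptive analysis. A correlation measures the relationship between two variables, that is, how they are linked to each other. In this sense, a correlation tells us which variables evolve in the same direction, which ones evolve in opposite directions, and which ones are unrelated.

In this article, I show how to compute correlation coefficients, how to perform correlation tests and how to visualize relationships between variables in R.

Correlation is usually computed on two quantitative variables. See the Chi-square test of independence if you need to study the relationship between two qualitative variables.

Data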

In this article, we use the mtcars dataset (loaded by default in R):

# display first 5 observations
head(mtcars, 5)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

The variables vs and am are categorical variables, so they are removed for this article:

# remove vs and am variables
library(tidyverse)
dat <- mtcars %>%
  select(-vs, -am)

# display 5 first obs. of new dataset
head(dat, 5)
##                    mpg cyl disp  hp drat    wt  qsec gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02    3    2

Correlation coefficient

Between two variables

The correlation between 2 variables is found with the cor() function. Suppose we want to compute the correlation between horsepower (hp) and miles per gallon (mpg):

# Pearson correlation between 2 variables
cor(dat$hp, dat$mpg)
## [1] -0.7761684

Note that the correlation between variables x and y is equal to the correlation between variables y and x, so the order of the variables in the cor() function does not matter.
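To make the definition concrete, here is a minimal sketch that recomputes the Pearson correlation by hand, as the covariance divided by the product of the standard deviations, and checks the symmetry mentioned above. The object names (x, y, r_manual, r_xy, r_yx) are only illustrative.

```r
# Sketch: Pearson's r computed by hand for hp and mpg (built-in mtcars data)
x <- mtcars$hp
y <- mtcars$mpg

# covariance divided by the product of the two standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# cor() gives the same value, whatever the order of the arguments
r_xy <- cor(x, y)
r_yx <- cor(y, x)

c(manual = r_manual, cor_xy = r_xy, cor_yx = r_yx)
```

All three values should coincide with the -0.776 reported above.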

The Pearson correlation is computed by default with the cor() function. If you want to compute the Spearman correlation, add the argument method = "spearman" to the cor() function:

# Spearman correlation between 2 variables
cor(dat$hp, dat$mpg,
  method = "spearman"
)
## [1] -0.8946646

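There are several correlation methods (run ?cor for more information about the different methods available in the cor() function):

  • Pearson correlation is often used for quantitative continuous variables that have a linear relationship
  • Spearman correlation (which is actually similar to Pearson but based on the ranked values of each variable rather than on the raw data) is often used to evaluate relationships involving qualitative ordinal variables, or quantitative variables if the link is only partially linear
  • Kendall correlation, which is computed from the number of concordant and discordant pairs, is often used for qualitative ordinal variables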
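The link between the Spearman and Pearson methods can be checked directly. The sketch below, with illustrative object names, computes the Spearman correlation with cor() and compares it to a Pearson correlation computed on the ranks. It also shows the call for the Kendall method.

```r
# Sketch: Spearman is Pearson applied to the ranks of each variable
x <- mtcars$hp
y <- mtcars$mpg

spearman <- cor(x, y, method = "spearman")
pearson_on_ranks <- cor(rank(x), rank(y), method = "pearson")

# Kendall correlation, based on concordant and discordant pairs
kendall <- cor(x, y, method = "kendall")

c(spearman = spearman, pearson_on_ranks = pearson_on_ranks, kendall = kendall)
```

The first two values should be identical. The Kendall coefficient is on a different scale, so it is not directly comparable in magnitude to the other two.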

Correlation matrix: correlations for all variables

Suppose now that we want to compute correlations for several pairs of variables. We can easily do so for all possible pairs of variables in the dataset, again with the cor() function:

# correlation for all variables
round(cor(dat),
  digits = 2 # rounded to 2 decimals
)
##        mpg   cyl  disp    hp  drat    wt  qsec  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00 -0.21 -0.66
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66  0.27  1.00

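This correlation matrix gives an overview of the correlations for all combinations of two variables.

Interpretation of a correlation coefficient

First of all, a correlation ranges from -1 to 1.

On the one hand, a negative correlation implies that the two variables under consideration vary in opposite directions, that is, if one variable increases the other decreases and vice versa. On the other hand, a positive correlation implies that the two variables under consideration vary in the same direction, that is, if one variable increases the other increases, and if one decreases the other decreases as well. Last but not least, a correlation close to 0 indicates that there is no linear relationship between the two variables. Strictly speaking this is weaker than independence, since a nonlinear relationship can still exist.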
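A correlation close to 0 only rules out a linear link. As a minimal sketch with made-up data (x and y below are not part of mtcars), a variable that is completely determined by another one can still have a Pearson correlation of essentially 0 when the relationship is symmetric and nonlinear:

```r
# Sketch: y depends entirely on x, yet the linear correlation is (numerically) 0
x <- seq(-3, 3, length.out = 101) # values symmetric around 0
y <- x^2                          # perfect, but nonlinear, relationship

r <- cor(x, y)
round(r, 10)
```

This is one reason why looking at a scatterplot alongside the coefficient is usually a good idea.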

As an illustration, the Pearson correlation between horsepower (hp) and miles per gallon (mpg) found above is -0.78, meaning that the 2 variables vary in opposite directions. This makes sense: cars with more horsepower tend to consume more fuel (and thus have a lower mileage per gallon). In contrast, from the correlation matrix we see that the correlation between miles per gallon (mpg) and the time to drive 1/4 of a mile (qsec) is 0.42, meaning that fast cars (low qsec) tend to have a worse mileage per gallon (low mpg). This again makes sense, as fast cars tend to consume more fuel.

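However, a correlation matrix is not easy to interpret, especially when the dataset is composed of many variables. In the following sections, we present some alternatives to the correlation matrix.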
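Before moving to plots, one lightweight option is to reshape the matrix into a list of pairs sorted by the strength of the correlation. This is only a sketch in base R; the names cors, idx and pairs_df are illustrative, and dat is rebuilt here so that the block runs on its own.

```r
# Sketch: turn the correlation matrix into a sorted table of variable pairs
dat <- mtcars[, setdiff(names(mtcars), c("vs", "am"))]
cors <- cor(dat)

# positions of the cells above the diagonal (each pair appears once)
idx <- which(upper.tri(cors), arr.ind = TRUE)

pairs_df <- data.frame(
  var1 = rownames(cors)[idx[, 1]],
  var2 = colnames(cors)[idx[, 2]],
  r = cors[idx]
)

# strongest correlations (in absolute value) first
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
head(pairs_df, 5)
```

With 9 variables this gives 36 pairs, and the pair at the top should be cyl and disp, in line with the 0.90 shown in the matrix above.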

Visualizations

A scatterplot for 2 variables

A good way to visualize a correlation between 2 variables is to draw a scatterplot of the two variables of interest. Suppose we want to examine the relationship between horsepower (hp) and miles per gallon (mpg):

# scatterplot
library(ggplot2)

ggplot(dat) +
  aes(x = hp, y = mpg) +
  geom_point(colour = "#0c4c8a") +
  theme_minimal()

If you are unfamiliar with the {ggplot2} package, you can draw the scatterplot using the plot() function from R base graphics:

plot(dat$hp, dat$mpg)

Or use the {esquisse} addin to easily draw plots using the {ggplot2} package.

Scatterplots for several pairs of variables

Suppose that instead of visualizing the relationship between only 2 variables, we want to visualize the relationship for several pairs of variables. This is possible thanks to the pairs() function. For this illustration, we focus only on miles per gallon (mpg), horsepower (hp) and weight (wt):

# multiple scatterplots
pairs(dat[, c(1, 4, 6)])

The figure indicates that weight (wt) and horsepower (hp) are positively correlated, whereas miles per gallon (mpg) seems to be negatively correlated with horsepower (hp) and weight (wt).
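If you would like the coefficients next to the scatterplots, pairs() accepts custom panel functions. The sketch below is one possible way to do it; panel_cor is a hypothetical helper written for this illustration, not a function shipped with R. It also selects the columns by name, which is less fragile than selecting them by position.

```r
# Sketch: scatterplots in the lower panels, correlation coefficients in the upper ones
panel_cor <- function(x, y, ...) {
  # temporarily switch to a 0-1 coordinate system to center the text
  old <- par(usr = c(0, 1, 0, 1))
  on.exit(par(old))
  text(0.5, 0.5, sprintf("r = %.2f", cor(x, y)))
}

pairs(mtcars[, c("mpg", "hp", "wt")], upper.panel = panel_cor)
```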

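Another simple correlation matrix

This version of the correlation matrix presents the correlation coefficients in a slightly more readable way, namely by coloring the coefficients based on their sign. Applied to our dataset, we have: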

# improved correlation matrix
library(corrplot)

corrplot(cor(dat),
  method = "number",
  type = "upper" # show only upper side
)

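Correlation test

For 2 variables

Unlike a correlation matrix, which indicates the correlation coefficients between pairs of variables, a correlation test is used to test whether the correlation (denoted \(\rho\)) between 2 variables is significantly different from 0 or not.

Indeed, a correlation coefficient different from 0 does not mean that the correlation is significantly different from 0. This needs to be tested with a correlation test. The null and alternative hypotheses of the correlation test are as follows:

  • \(H_0\): \(\rho = 0\)
  • \(H_1\): \(\rho \ne 0\)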

Suppose that we want to test whether the rear axle ratio (drat) is correlated with the time to drive a quarter of a mile (qsec):

# Pearson correlation test
test <- cor.test(dat$drat, dat$qsec)
test
## 
##  Pearson's product-moment correlation
## 
## data:  dat$drat and dat$qsec
## t = 0.50164, df = 30, p-value = 0.6196
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.265947  0.426340
## sample estimates:
##        cor 
## 0.09120476

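The \(p\)-value of the correlation test between these 2 variables is 0.62. At the 5% significance level, we do not reject the null hypothesis of no correlation. We therefore conclude that we do not reject the hypothesis that there is no linear relationship between the 2 variables.

This test shows that even if the correlation coefficient is different from 0 (the correlation is 0.09), it is actually not significantly different from 0.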
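For readers curious about where these numbers come from, the sketch below recomputes the test statistic, the \(p\)-value and the 95% confidence interval by hand for the Pearson case: the statistic is \(t = r\sqrt{n-2}/\sqrt{1-r^2}\) with \(n-2\) degrees of freedom, and the interval relies on the Fisher z-transformation. Object names are illustrative.

```r
# Sketch: the Pearson correlation test for drat and qsec, step by step
x <- mtcars$drat
y <- mtcars$qsec
n <- length(x)
r <- cor(x, y)

# t statistic and two-sided p-value with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via the Fisher z-transformation
z <- atanh(r)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

# compare with the built-in test
test <- cor.test(x, y)
c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2])
```

The values should match the t, p-value and confidence interval printed by cor.test() above.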

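Note that the \(p\)-value of a correlation test is based on the correlation coefficient and the sample size. The larger the sample size and the more extreme the correlation (closer to -1 or 1), the more likely the null hypothesis of no correlation will be rejected. With a small sample size, it is thus possible to obtain a relatively large correlation (based on the correlation coefficient), but still find a correlation not significantly different from 0 (based on the correlation test). For this reason, it is recommended to always perform a correlation test before interpreting a correlation coefficient, in order to avoid flawed conclusions.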
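The role of the sample size can be illustrated with a small sketch. The helper p_from_r() below is hypothetical (written for this illustration only) and applies the t statistic of the Pearson test to a given correlation and sample size. The correlation of 0.3 and the sample sizes are arbitrary choices.

```r
# Sketch: the same correlation leads to different p-values depending on n
p_from_r <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2) # two-sided p-value
}

n_values <- c(10, 30, 100)
p_values <- p_from_r(r = 0.3, n = n_values)

data.frame(n = n_values, r = 0.3, p_value = round(p_values, 4))
```

With 10 observations a correlation of 0.3 is far from significant at the 5% level, whereas with 100 observations the same coefficient is significant.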

A nice and easy way to report results of a correlation test in R is with the report() function from the {report} package:

# install.packages("remotes")
# remotes::install_github("easystats/report") # You only need to do that once
library("report") # Load the package every time you start R
report(test)
## Effect sizes were labelled following Funder's (2019) recommendations.
## 
## The Pearson's product-moment correlation between dat$drat and dat$qsec is positive, not significant and very small (r = 0.09, 95% CI [-0.27, 0.43], t(30) = 0.50, p = 0.620)

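As you can see, the function interprets the test (together with the correlation coefficient and the \(p\)-value) for you.

Note that the report() function can be used for other analyses. See more examples in the package’s documentation. See also more tips and tricks in R if you find this one useful.

For several pairs of variables

Similar to the correlation matrix used to compute correlations for several pairs of variables, the rcorr() function (from the {Hmisc} package) allows us to compute the \(p\)-values of the correlation test for several pairs of variables at once. Applied to our dataset, we have: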

# correlation tests for whole dataset
library(Hmisc)
res <- rcorr(as.matrix(dat)) # rcorr() accepts matrices only

# display p-values (rounded to 3 decimals)
round(res$P, 3)
##        mpg   cyl  disp    hp  drat    wt  qsec  gear  carb
## mpg     NA 0.000 0.000 0.000 0.000 0.000 0.017 0.005 0.001
## cyl  0.000    NA 0.000 0.000 0.000 0.000 0.000 0.004 0.002
## disp 0.000 0.000    NA 0.000 0.000 0.000 0.013 0.001 0.025
## hp   0.000 0.000 0.000    NA 0.010 0.000 0.000 0.493 0.000
## drat 0.000 0.000 0.000 0.010    NA 0.000 0.620 0.000 0.621
## wt   0.000 0.000 0.000 0.000 0.000    NA 0.339 0.000 0.015
## qsec 0.017 0.000 0.013 0.000 0.620 0.339    NA 0.243 0.000
## gear 0.005 0.004 0.001 0.493 0.000 0.000 0.243    NA 0.129
## carb 0.001 0.002 0.025 0.000 0.621 0.015 0.000 0.129    NA

Only correlations with \(p\)-values smaller than the significance level (usually \(\alpha = 0.05\)) should be interpreted.
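One way to apply this rule in practice is to blank out the coefficients whose \(p\)-value is above the significance level. The sketch below does so in base R with cor.test() rather than rcorr(), so that it runs without extra packages; the names r_mat, p_mat and r_signif are illustrative.

```r
# Sketch: keep only the correlations that are significant at the 5% level
dat <- mtcars[, setdiff(names(mtcars), c("vs", "am"))]
r_mat <- cor(dat)

# matrix of unadjusted p-values (diagonal left as NA)
k <- ncol(dat)
p_mat <- matrix(NA_real_, k, k, dimnames = dimnames(r_mat))
for (i in seq_len(k - 1)) {
  for (j in (i + 1):k) {
    p_mat[i, j] <- p_mat[j, i] <- cor.test(dat[[i]], dat[[j]])$p.value
  }
}

# replace non-significant coefficients by NA
r_signif <- ifelse(p_mat < 0.05, r_mat, NA)
round(r_signif, 2)
```

Pairs such as drat and qsec, or hp and gear, should appear as NA, in line with the \(p\)-values of 0.620 and 0.493 in the table above.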

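Combination of correlation coefficients and correlation tests

Now that we have covered the concepts of correlation coefficients and correlation tests, let's see if we can combine these two concepts.

The correlation function from the easystats {correlation} package allows us to combine correlation coefficients and correlation tests in a single table (thanks to Krzysiektr for pointing it out to me):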

library(correlation)

correlation::correlation(dat,
  include_factors = TRUE, method = "auto"
)
## Parameter1 | Parameter2 |     r |         95% CI | t(30) |      p |                               Method | n_Obs
## ----------------------------------------------------------------------------------------------------------------
## mpg        |        cyl | -0.85 | [-0.93, -0.72] | -8.92 | < .001 | Pearson's product-moment correlation |    32
## mpg        |       disp | -0.85 | [-0.92, -0.71] | -8.75 | < .001 | Pearson's product-moment correlation |    32
## mpg        |         hp | -0.78 | [-0.89, -0.59] | -6.74 | < .001 | Pearson's product-moment correlation |    32
## mpg        |       drat |  0.68 | [ 0.44,  0.83] |  5.10 | < .001 | Pearson's product-moment correlation |    32
## mpg        |         wt | -0.87 | [-0.93, -0.74] | -9.56 | < .001 | Pearson's product-moment correlation |    32
## mpg        |       qsec |  0.42 | [ 0.08,  0.67] |  2.53 | 0.137  | Pearson's product-moment correlation |    32
## mpg        |       gear |  0.48 | [ 0.16,  0.71] |  3.00 | 0.065  | Pearson's product-moment correlation |    32
## mpg        |       carb | -0.55 | [-0.75, -0.25] | -3.62 | 0.016  | Pearson's product-moment correlation |    32
## cyl        |       disp |  0.90 | [ 0.81,  0.95] | 11.45 | < .001 | Pearson's product-moment correlation |    32
## cyl        |         hp |  0.83 | [ 0.68,  0.92] |  8.23 | < .001 | Pearson's product-moment correlation |    32
## cyl        |       drat | -0.70 | [-0.84, -0.46] | -5.37 | < .001 | Pearson's product-moment correlation |    32
## cyl        |         wt |  0.78 | [ 0.60,  0.89] |  6.88 | < .001 | Pearson's product-moment correlation |    32
## cyl        |       qsec | -0.59 | [-0.78, -0.31] | -4.02 | 0.007  | Pearson's product-moment correlation |    32
## cyl        |       gear | -0.49 | [-0.72, -0.17] | -3.10 | 0.054  | Pearson's product-moment correlation |    32
## cyl        |       carb |  0.53 | [ 0.22,  0.74] |  3.40 | 0.027  | Pearson's product-moment correlation |    32
## disp       |         hp |  0.79 | [ 0.61,  0.89] |  7.08 | < .001 | Pearson's product-moment correlation |    32
## disp       |       drat | -0.71 | [-0.85, -0.48] | -5.53 | < .001 | Pearson's product-moment correlation |    32
## disp       |         wt |  0.89 | [ 0.78,  0.94] | 10.58 | < .001 | Pearson's product-moment correlation |    32
## disp       |       qsec | -0.43 | [-0.68, -0.10] | -2.64 | 0.131  | Pearson's product-moment correlation |    32
## disp       |       gear | -0.56 | [-0.76, -0.26] | -3.66 | 0.015  | Pearson's product-moment correlation |    32
## disp       |       carb |  0.39 | [ 0.05,  0.65] |  2.35 | 0.177  | Pearson's product-moment correlation |    32
## hp         |       drat | -0.45 | [-0.69, -0.12] | -2.75 | 0.110  | Pearson's product-moment correlation |    32
## hp         |         wt |  0.66 | [ 0.40,  0.82] |  4.80 | < .001 | Pearson's product-moment correlation |    32
## hp         |       qsec | -0.71 | [-0.85, -0.48] | -5.49 | < .001 | Pearson's product-moment correlation |    32
## hp         |       gear | -0.13 | [-0.45,  0.23] | -0.69 | > .999 | Pearson's product-moment correlation |    32
## hp         |       carb |  0.75 | [ 0.54,  0.87] |  6.21 | < .001 | Pearson's product-moment correlation |    32
## drat       |         wt | -0.71 | [-0.85, -0.48] | -5.56 | < .001 | Pearson's product-moment correlation |    32
## drat       |       qsec |  0.09 | [-0.27,  0.43] |  0.50 | > .999 | Pearson's product-moment correlation |    32
## drat       |       gear |  0.70 | [ 0.46,  0.84] |  5.36 | < .001 | Pearson's product-moment correlation |    32
## drat       |       carb | -0.09 | [-0.43,  0.27] | -0.50 | > .999 | Pearson's product-moment correlation |    32
## wt         |       qsec | -0.17 | [-0.49,  0.19] | -0.97 | > .999 | Pearson's product-moment correlation |    32
## wt         |       gear | -0.58 | [-0.77, -0.29] | -3.93 | 0.008  | Pearson's product-moment correlation |    32
## wt         |       carb |  0.43 | [ 0.09,  0.68] |  2.59 | 0.132  | Pearson's product-moment correlation |    32
## qsec       |       gear | -0.21 | [-0.52,  0.15] | -1.19 | > .999 | Pearson's product-moment correlation |    32
## qsec       |       carb | -0.66 | [-0.82, -0.40] | -4.76 | < .001 | Pearson's product-moment correlation |    32
## gear       |       carb |  0.27 | [-0.08,  0.57] |  1.56 | 0.774  | Pearson's product-moment correlation |    32
## 
## p-value adjustment method: Holm (1979)

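As you can see, it gives, among other useful information, the correlation coefficients (column r) and the result of the correlation test (column 95% CI for the confidence interval or p for the \(p\)-value) for all pairs of variables. This table is very useful and informative, but let's see if it is possible to combine the concepts of correlation coefficients and correlation tests in one single visualization: a visualization that would be easy to read and interpret.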
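One detail of the table above is worth a closer look: its footer mentions a Holm adjustment, which is why some \(p\)-values are larger than the unadjusted ones obtained earlier with rcorr() (for instance 0.137 instead of 0.017 for mpg and qsec). The sketch below reproduces this kind of adjustment in base R with p.adjust(), assuming the correction is applied across the 36 unique pairs. The object names are illustrative.

```r
# Sketch: unadjusted vs Holm-adjusted p-values for all pairs of variables
dat <- mtcars[, setdiff(names(mtcars), c("vs", "am"))]

# all unique pairs of variable names (2 x 36 character matrix)
pairs_idx <- combn(names(dat), 2)

p_raw <- apply(pairs_idx, 2, function(v) {
  cor.test(dat[[v[1]]], dat[[v[2]]])$p.value
})
p_holm <- p.adjust(p_raw, method = "holm")

res <- data.frame(
  var1 = pairs_idx[1, ],
  var2 = pairs_idx[2, ],
  p_raw = round(p_raw, 3),
  p_holm = round(p_holm, 3)
)
res[res$var1 == "mpg", ]
```

Adjusted \(p\)-values are never smaller than the raw ones, so fewer correlations are declared significant once the number of tests is taken into account.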

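Ideally, we would like to have a concise overview of the correlations between all possible pairs of variables present in a dataset, with a clear distinction for the correlations that are significantly different from 0.

The figure below, known as a correlogram and adapted from the corrplot() function, does precisely this: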

corrplot2 <- function(data,
                      method = "pearson",
                      sig.level = 0.05,
                      order = "original",
                      diag = FALSE,
                      type = "upper",
                      tl.srt = 90,
                      number.font = 1,
                      number.cex = 1,
                      mar = c(0, 0, 0, 0)) {
  library(corrplot)
  # keep only complete rows so that cor() and cor.test() use the same observations
  data <- data[complete.cases(data), ]
  mat <- cor(data, method = method)
  cor.mtest <- function(mat, method) {
    mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat <- matrix(NA, n, n)
    diag(p.mat) <- 0
    for (i in 1:(n - 1)) {
      for (j in (i + 1):n) {
        tmp <- cor.test(mat[, i], mat[, j], method = method)
        p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
      }
    }
    colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
    p.mat
  }
  p.mat <- cor.mtest(data, method = method)
  col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
  corrplot(mat,
    method = "color", col = col(200), number.font = number.font,
    mar = mar, number.cex = number.cex,
    type = type, order = order,
    addCoef.col = "black", # add correlation coefficient
    tl.col = "black", tl.srt = tl.srt, # rotation of text labels
    # combine with significance level
    p.mat = p.mat, sig.level = sig.level, insig = "blank",
    # hide correlation coefficients on the diagonal
    diag = diag
  )
}

corrplot2(
  data = dat,
  method = "pearson",
  sig.level = 0.05,
  order = "original",
  diag = FALSE,
  type = "upper",
  tl.srt = 75
)

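The correlogram shows the correlation coefficients for all pairs of variables (with more intense colors for more extreme correlations), and correlations not significantly different from 0 are represented by a white box.

To learn more about this plot and the code used, I invite you to read the article entitled “Correlogram in R: how to highlight the most correlated variables in a dataset”.

Thanks for reading. I hope this article helped you to compute correlations and perform correlation tests in R.

As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion.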


