Descriptive statistics in R

Antoine Soetewey 2020-01-22 31 minute read

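Introduction

This article explains how to compute the main descriptive statistics in R and how to present them graphically. To learn more about the reasoning behind each descriptive statistic, how to compute it by hand and how to interpret it, read the article "Descriptive statistics by hand".

To briefly recap what was said in that article, descriptive statistics (in the broad sense of the term) is the branch of statistics that aims at summarizing, describing and presenting a series of values or a dataset. Descriptive statistics is often the first step and an important part of any statistical analysis. It allows you to check the quality of the data and helps you "understand" the data by giving a clear overview of it. If well presented, descriptive statistics is already a good starting point for further analyses. Many measures exist to summarize a dataset. They are divided into two types:

  1. location measures and
  2. dispersion measures

Location measures give an understanding of the central tendency of the data, whereas dispersion measures give an understanding of the spread of the data. In this article, we focus only on the implementation in R of the most common descriptive statistics and their visualizations (when deemed appropriate). See the article mentioned above, or online resources, for more information about the purpose and usage of each measure.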

Data

We use the dataset iris throughout the article. This dataset ships with R by default, so you only need to load it by running iris:

dat <- iris # load the iris dataset and rename it dat

Below is a preview of this dataset and its structure:

head(dat) # first 6 observations
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str(dat) # structure of dataset
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

The dataset contains 150 observations and 5 variables, representing the length and width of the sepal and petal and the species of 150 flowers. Length and width of the sepal and petal are numeric variables and the species is a factor with 3 levels (indicated by num and Factor w/ 3 levels after the name of the variables). See the different variable types in R if you need a refresher.

Regarding plots, we present the default graphs and the graphs from the well-known {ggplot2} package. Graphs from the {ggplot2} package usually look better, but they require more advanced coding skills (see the article "Graphics in R with ggplot2" to learn more). If you need to publish or share your graphs, I suggest using {ggplot2} if you can, otherwise the default graphics will do the job.

Tip: I recently discovered the ggplot2 builder from the {esquisse} addins. See how you can easily draw graphs from the {ggplot2} package without having to code it yourself.

All plots displayed in this article can be customized. For instance, it is possible to edit the title, x and y-axis labels, color, etc. However, customizing plots is beyond the scope of this article, so all plots are presented without any customization. Interested readers will find numerous resources online.

Minimum and maximum

The minimum and maximum can be found thanks to the min() and max() functions:

min(dat$Sepal.Length)
## [1] 4.3
max(dat$Sepal.Length)
## [1] 7.9

Alternatively the range() function:

rng <- range(dat$Sepal.Length)
rng
## [1] 4.3 7.9

gives you the minimum and maximum directly. Note that the output of the range() function is actually a vector containing the minimum and maximum (in that order). This means you can access the minimum with:

rng[1] # rng = name of the object specified above
## [1] 4.3

and the maximum with:

rng[2]
## [1] 7.9

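This reminds us that, in R, there are often several ways to arrive at the same result. The method that uses the shortest piece of code is usually preferred, as shorter code is less prone to coding errors and more readable.

Range

The range can then be easily computed by subtracting the minimum from the maximum: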

max(dat$Sepal.Length) - min(dat$Sepal.Length)
## [1] 3.6

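To my knowledge, there is no default function to compute the range as a single number. However, if you are familiar with writing functions in R, you can create your own function to compute the range: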

range2 <- function(x) {
  range <- max(x) - min(x)
  return(range)
}

range2(dat$Sepal.Length)
## [1] 3.6

which is equivalent to \(max - min\) presented above.
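As a shorter alternative to a custom function, the two values returned by range() can be passed to diff(), which subtracts the first from the second. This is only a minimal sketch; the name rng_width is chosen here for illustration.

```r
dat <- iris

# range() returns c(min, max); diff() computes max - min from that pair
rng_width <- diff(range(dat$Sepal.Length))
rng_width
```

The result matches the value obtained with max() - min() above.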

Mean

The mean can be computed with the mean() function:

mean(dat$Sepal.Length)
## [1] 5.843333

Tips:

  • if there is at least one missing value in your dataset, use mean(dat$Sepal.Length, na.rm = TRUE) to compute the mean with the NA excluded. This argument can be used for most functions presented in this article, not only the mean
  • for a truncated mean, use mean(dat$Sepal.Length, trim = 0.10) and change the trim argument to your needs
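The iris dataset has no missing values, so the effect of these two arguments is easier to see on a small made-up vector. The vector x below is purely illustrative and not part of the article's data.

```r
# Toy vector with one missing value and one extreme value (illustrative only)
x <- c(2, 4, 6, 8, NA, 100)

m_default <- mean(x)                            # NA, because of the missing value
m_narm    <- mean(x, na.rm = TRUE)              # mean of 2, 4, 6, 8, 100
m_trim    <- mean(x, na.rm = TRUE, trim = 0.2)  # drops 20% of values at each end

m_default
m_narm
m_trim
```

With 5 non-missing values and trim = 0.2, one value is removed from each end, so the truncated mean is the mean of 4, 6 and 8.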

Median

The median can be computed thanks to the median() function:

median(dat$Sepal.Length)
## [1] 5.8

or with the quantile() function:

quantile(dat$Sepal.Length, 0.5)
## 50% 
## 5.8

since the quantile of order 0.5 (\(Q_{0.5}\)) corresponds to the median.

First and third quartile

Like the median, the first and third quartiles can be computed thanks to the quantile() function by setting the second argument to 0.25 or 0.75:

quantile(dat$Sepal.Length, 0.25) # first quartile
## 25% 
## 5.1
quantile(dat$Sepal.Length, 0.75) # third quartile
## 75% 
## 6.4

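As you may have seen, the results above are slightly different from the results you would have found by computing the first and third quartiles by hand. This is normal: there are many methods to compute them (the quantile() function actually implements 9 of them through its type argument, type 7 being the default). However, the methods presented here and in the article "Descriptive statistics by hand" are the easiest and most "standard" ones. Furthermore, results do not change dramatically between the two methods.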
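To see how the choice of method can change a quartile, the type argument of quantile() can be varied on a small vector. The sketch below compares the default (type 7) with type 6 on the integers 1 to 10; it is only an illustration and does not claim that either type is the method used in the by-hand article.

```r
x <- 1:10

q1_type7 <- quantile(x, 0.25)            # default method (type = 7)
q1_type6 <- quantile(x, 0.25, type = 6)  # another documented method

q1_type7
q1_type6
```

Both are valid first quartiles of the same data; they simply interpolate between the 2nd and 3rd smallest values in different ways.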

Other quantiles

As you have guessed, any quantile can also be computed with the quantile() function. For instance, the \(4^{th}\) decile or the \(98^{th}\) percentile:

quantile(dat$Sepal.Length, 0.4) # 4th decile
## 40% 
## 5.6
quantile(dat$Sepal.Length, 0.98) # 98th percentile
## 98% 
## 7.7

Interquartile range

The interquartile range (i.e., the difference between the first and third quartile) can be computed with the IQR() function:

IQR(dat$Sepal.Length)
## [1] 1.3

or alternatively with the quantile() function again:

quantile(dat$Sepal.Length, 0.75) - quantile(dat$Sepal.Length, 0.25)
## 75% 
## 1.3

As mentioned earlier, when possible it is usually recommended to use the shortest piece of code to arrive at the result. For this reason, the IQR() function is preferred to compute the interquartile range.

Standard deviation and variance

The standard deviation and the variance are computed with the sd() and var() functions:

sd(dat$Sepal.Length) # standard deviation
## [1] 0.8280661
var(dat$Sepal.Length) # variance
## [1] 0.6856935

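Remember from the article "Descriptive statistics by hand" that the standard deviation and the variance differ depending on whether we compute them for a sample or a population (see the difference between sample and population). In R, the standard deviation and the variance are computed as if the data represent a sample (so the denominator is \(n - 1\), where \(n\) is the number of observations). To my knowledge, there is no function by default in R that computes the standard deviation or variance for a population.

Tip: to compute the standard deviation (or variance) of multiple variables at the same time, use lapply() with the appropriate statistics as second argument: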
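If the population versions are needed, one possible workaround is to rescale the sample variance by \((n - 1) / n\). The sketch below does this on a small made-up vector; the names pop_var and pop_sd are hypothetical and chosen for this example only.

```r
# Toy data (illustrative only)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# var() divides by n - 1, so multiply by (n - 1) / n to divide by n instead
pop_var <- var(x) * (n - 1) / n
pop_sd  <- sqrt(pop_var)

pop_var
pop_sd
```

For large samples the difference between the two versions becomes negligible.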

lapply(dat[, 1:4], sd)
## $Sepal.Length
## [1] 0.8280661
## 
## $Sepal.Width
## [1] 0.4358663
## 
## $Petal.Length
## [1] 1.765298
## 
## $Petal.Width
## [1] 0.7622377

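The command dat[, 1:4] selects the variables 1 to 4 because the fifth variable is a qualitative variable and the standard deviation cannot be computed on this type of variable. See a recap of the different data types in R if needed.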
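When the positions of the numeric columns are not known in advance, they can be detected instead of hard-coded. The following is a minimal sketch using is.numeric() together with sapply(), which returns a named vector rather than a list; the names num_cols and sds are illustrative.

```r
dat <- iris

# Logical vector flagging the numeric columns
num_cols <- sapply(dat, is.numeric)

# Standard deviation of every numeric column, returned as a named vector
sds <- sapply(dat[, num_cols], sd)
sds
```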

Summary

You can compute the minimum, \(1^{st}\) quartile, median, mean, \(3^{rd}\) quartile and the maximum for all numeric variables of a dataset at once using summary():

summary(dat)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Tip: if you need these descriptive statistics by group use the by() function:

by(dat, dat$Species, summary)
## dat$Species: setosa
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200  
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600  
##        Species  
##  setosa    :50  
##  versicolor: 0  
##  virginica : 0  
##                 
##                 
##                 
## ------------------------------------------------------------ 
## dat$Species: versicolor
##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
##  Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
##  1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
##  Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
##  Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
##  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
##  Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800                  
## ------------------------------------------------------------ 
## dat$Species: virginica
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400  
##  1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800  
##  Median :6.500   Median :3.000   Median :5.550   Median :2.000  
##  Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026  
##  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300  
##  Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    : 0  
##  versicolor: 0  
##  virginica :50  
##                 
##                 
## 

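where the arguments are the name of the dataset, the grouping variable and the summary function. Follow this order, or specify the names of the arguments if you do not follow it.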
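When only one statistic per group is needed, the base R function aggregate() is another option and returns a compact data frame. This is a sketch of that alternative, not something by() requires; the name means_by_species is illustrative.

```r
dat <- iris

# Mean sepal length for each species, using the formula interface
means_by_species <- aggregate(Sepal.Length ~ Species, data = dat, FUN = mean)
means_by_species
```

The means agree with the Mean rows of the by() output above.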

If you need more descriptive statistics, use stat.desc() from the package {pastecs}:

library(pastecs)
stat.desc(dat)
##              Sepal.Length  Sepal.Width Petal.Length  Petal.Width Species
## nbr.val      150.00000000 150.00000000  150.0000000 150.00000000      NA
## nbr.null       0.00000000   0.00000000    0.0000000   0.00000000      NA
## nbr.na         0.00000000   0.00000000    0.0000000   0.00000000      NA
## min            4.30000000   2.00000000    1.0000000   0.10000000      NA
## max            7.90000000   4.40000000    6.9000000   2.50000000      NA
## range          3.60000000   2.40000000    5.9000000   2.40000000      NA
## sum          876.50000000 458.60000000  563.7000000 179.90000000      NA
## median         5.80000000   3.00000000    4.3500000   1.30000000      NA
## mean           5.84333333   3.05733333    3.7580000   1.19933333      NA
## SE.mean        0.06761132   0.03558833    0.1441360   0.06223645      NA
## CI.mean.0.95   0.13360085   0.07032302    0.2848146   0.12298004      NA
## var            0.68569351   0.18997942    3.1162779   0.58100626      NA
## std.dev        0.82806613   0.43586628    1.7652982   0.76223767      NA
## coef.var       0.14171126   0.14256420    0.4697441   0.63555114      NA

You can have even more statistics (i.e., skewness, kurtosis and normality test) by adding the argument norm = TRUE in the previous function. Note that the variable Species is not numeric, so descriptive statistics cannot be computed for this variable and NA are displayed.

Coefficient of variation

The coefficient of variation can be found with stat.desc() (see the line coef.var in the table above) or by computing it manually (remember that the coefficient of variation is the standard deviation divided by the mean):

sd(dat$Sepal.Length) / mean(dat$Sepal.Length)
## [1] 0.1417113

Mode

To my knowledge there is no function to find the mode of a variable. However, we can easily find it thanks to the functions table() and sort():

tab <- table(dat$Sepal.Length) # number of occurrences for each unique value
sort(tab, decreasing = TRUE) # sort highest to lowest
## 
##   5 5.1 6.3 5.7 6.7 5.5 5.8 6.4 4.9 5.4 5.6   6 6.1 4.8 6.5 4.6 5.2 6.2 6.9 7.7 
##  10   9   9   8   8   7   7   7   6   6   6   6   6   5   5   4   4   4   4   4 
## 4.4 5.9 6.8 7.2 4.7 6.6 4.3 4.5 5.3   7 7.1 7.3 7.4 7.6 7.9 
##   3   3   3   3   2   2   1   1   1   1   1   1   1   1   1

table() gives the number of occurrences for each unique value, then sort() with the argument decreasing = TRUE displays the number of occurrences from highest to lowest. The mode of the variable Sepal.Length is thus 5. This code to find the mode can also be applied to qualitative variables such as Species:

sort(table(dat$Species), decreasing = TRUE)
## 
##     setosa versicolor  virginica 
##         50         50         50

or:

summary(dat$Species)
##     setosa versicolor  virginica 
##         50         50         50
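The table() logic above can be wrapped in a small helper that returns the most frequent value(s) directly. The function name find_mode is hypothetical; it returns the values as character strings and returns several values in case of ties.

```r
# Hypothetical helper: most frequent value(s) of a vector, as character
find_mode <- function(x) {
  tab <- table(x)              # occurrences of each unique value
  names(tab)[tab == max(tab)]  # keep every value reaching the maximum count
}

find_mode(iris$Sepal.Length)
find_mode(iris$Species)
```

For Species, the three levels are returned because they all appear 50 times.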

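Correlation

Another descriptive statistic is the correlation coefficient. Correlation measures the linear relationship between two variables.

Computing correlation in R requires a detailed explanation, so I wrote a separate article covering correlation and correlation tests.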
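As a quick taste only, the base function cor() returns the Pearson correlation coefficient between two numeric vectors by default. The sketch below applies it to two of the iris variables; interpretation and testing are left to the dedicated article.

```r
dat <- iris

# Pearson correlation (the default method of cor())
r <- cor(dat$Sepal.Length, dat$Petal.Length)
r
```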

Contingency table

table() introduced above can also be used on two qualitative variables to create a contingency table. The dataset iris has only one qualitative variable so we create a new qualitative variable just for this example. We create the variable size which corresponds to small if the length of the sepal is smaller than the median of all flowers, big otherwise:

dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length),
  "small", "big"
)

Here is a recap of the occurrences by size:

table(dat$size)
## 
##   big small 
##    77    73

We now create a contingency table of the two variables Species and size with the table() function:

table(dat$Species, dat$size)
##             
##              big small
##   setosa       1    49
##   versicolor  29    21
##   virginica   47     3

or with the xtabs() function:

xtabs(~ dat$Species + dat$size)
##             dat$size
## dat$Species  big small
##   setosa       1    49
##   versicolor  29    21
##   virginica   47     3

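The contingency table gives the number of cases in each subgroup. For instance, there is only one big setosa flower, while there are 49 small setosa flowers in the dataset.

To go further, we can see from the table that setosa flowers seem to be smaller in size than virginica flowers. In order to check whether size is significantly associated with species, we could perform a Chi-square test of independence since both variables are categorical variables. See how to do this test by hand and in R.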
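As a minimal sketch of that test, the contingency table can be passed to the base function chisq.test(). The block recreates the size variable so that it runs on its own; the object name test is illustrative.

```r
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length),
  "small", "big"
)

# Chi-square test of independence between species and size
test <- chisq.test(table(dat$Species, dat$size))

test$statistic # Chi-square statistic
test$parameter # degrees of freedom
test$p.value   # p-value
```

A very small p-value suggests rejecting the null hypothesis that species and size are independent.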

Note that Species is in rows and size in columns because we specified Species and then size in table(). Change the order if you want to switch the two variables.

Instead of having the frequencies (i.e., the number of cases) you can also have the relative frequencies (i.e., proportions) in each subgroup by placing the table() function inside the prop.table() function:

prop.table(table(dat$Species, dat$size))
##             
##                      big       small
##   setosa     0.006666667 0.326666667
##   versicolor 0.193333333 0.140000000
##   virginica  0.313333333 0.020000000

Note that you can also compute the percentages by row or by column by adding a second argument to the prop.table() function: 1 for row, or 2 for column:

# percentages by row:
round(prop.table(table(dat$Species, dat$size), 1), 2) # round to 2 digits with round()
##             
##               big small
##   setosa     0.02  0.98
##   versicolor 0.58  0.42
##   virginica  0.94  0.06
# percentages by column:
round(prop.table(table(dat$Species, dat$size), 2), 2) # round to 2 digits with round()
##             
##               big small
##   setosa     0.01  0.67
##   versicolor 0.38  0.29
##   virginica  0.61  0.04
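Row and column totals can be appended to a contingency table with the base function addmargins(). The sketch below recreates the size variable so that it runs on its own; the names tab and tab_totals are illustrative.

```r
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length),
  "small", "big"
)

tab <- table(dat$Species, dat$size)

# Adds a "Sum" row and a "Sum" column to the table of counts
tab_totals <- addmargins(tab)
tab_totals
```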

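See the section on advanced descriptive statistics for more advanced contingency tables.

Mosaic plot

A mosaic plot allows you to visualize a contingency table of two qualitative variables: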

mosaicplot(table(dat$Species, dat$size),
  color = TRUE,
  xlab = "Species", # label for x-axis
  ylab = "Size" # label for y-axis
)

The mosaic plot shows that, for our sample, the proportion of big and small flowers is clearly different between the three species. In particular, the virginica species is the biggest, and the setosa species is the smallest of the three species (in terms of sepal length since the variable size is based on the variable Sepal.Length).

For your information, a mosaic plot can also be done via the mosaic() function from the {vcd} package:

library(vcd)

mosaic(~ Species + size,
  data = dat,
  direction = c("v", "h")
)

Barplot

Barplots can only be done on qualitative variables (see the difference with a quantitative variable here). A barplot is a tool to visualize the distribution of a qualitative variable. We draw a barplot of the qualitative variable size:

barplot(table(dat$size)) # table() is mandatory

You can also draw a barplot of the relative frequencies instead of the frequencies by adding prop.table() as we did earlier:

barplot(prop.table(table(dat$size)))

In {ggplot2}:

library(ggplot2) # needed each time you open RStudio
# The package ggplot2 must be installed first

ggplot(dat) +
  aes(x = size) +
  geom_bar()

Histogram

A histogram gives an idea about the distribution of a quantitative variable. The idea is to break the range of values into intervals and count how many observations fall into each interval. Histograms are a bit similar to barplots, but histograms are used for quantitative variables whereas barplots are used for qualitative variables. To draw a histogram in R, use hist():

hist(dat$Sepal.Length)

Add the argument breaks = inside the hist() function if you want to change the number of bins. A rule of thumb (the square-root rule) is that the number of bins should be the rounded value of the square root of the number of observations. The dataset includes 150 observations so in this case the number of bins can be set to 12. Note that Sturges' rule, which hist() uses by default, is a different rule based on \(\log_2(n) + 1\).
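The two rules of thumb for the number of bins can be compared directly. The sketch below computes the square-root rule by hand and Sturges' rule with nclass.Sturges() from {grDevices}, which is loaded by default; the variable names are illustrative.

```r
x <- iris$Sepal.Length
n <- length(x)

bins_sqrt    <- round(sqrt(n))   # square-root rule
bins_sturges <- nclass.Sturges(x) # Sturges' rule: ceiling(log2(n) + 1)

bins_sqrt
bins_sturges
```

Keep in mind that when breaks is given as a single number, hist() treats it as a suggestion only and picks "pretty" breakpoints, so the number of bins actually drawn may differ.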

In {ggplot2}:

ggplot(dat) +
  aes(x = Sepal.Length) +
  geom_histogram()

By default, the number of bins is 30. You can change this value with geom_histogram(bins = 12) for instance.

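Boxplot

Boxplots are really useful in descriptive statistics and are often underused (mostly because they are not well understood by the public). A boxplot graphically represents the distribution of a quantitative variable by visually displaying five common location summaries (minimum, median, first/third quartiles and maximum) and any observation that was classified as a suspected outlier using the interquartile range (IQR) criterion. The IQR criterion means that all observations above \(Q_{0.75} + 1.5 \cdot IQR\) or below \(Q_{0.25} - 1.5 \cdot IQR\) (where \(Q_{0.25}\) and \(Q_{0.75}\) correspond to the first and third quartile respectively) are considered as potential outliers by R. The minimum and maximum in the boxplot are represented without these suspected outliers.

Seeing all this information on the same plot helps to get a good first overview of the dispersion and the location of the data. Before drawing a boxplot of our data, see below a graph explaining the information present on a boxplot:

How to interpret a boxplot? Source: LFSAB1105

Now an example with our dataset: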

boxplot(dat$Sepal.Length)
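The IQR criterion can also be applied by hand to list the suspected outliers. The helper iqr_outliers() below is a hypothetical name written for this sketch; it uses quantile() for the quartiles, whereas boxplot() relies on hinges computed by boxplot.stats(), so the two may differ slightly on some datasets.

```r
# Hypothetical helper: values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  x[!is.na(x) & (x < lower | x > upper)]
}

out_length <- iqr_outliers(iris$Sepal.Length) # no suspected outlier
out_width  <- iqr_outliers(iris$Sepal.Width)  # a few suspected outliers

out_length
out_width
```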

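Boxplots are even more informative when presented side by side to compare and contrast the distributions of two or more groups. For instance, we compare the length of the sepal across the different species: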

boxplot(dat$Sepal.Length ~ dat$Species)

In {ggplot2}:

ggplot(dat) +
  aes(x = Species, y = Sepal.Length) +
  geom_boxplot()

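Dotplot

A dotplot is more or less similar to a boxplot, except that observations are represented as points and there are no summary statistics presented on the plot: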

library(lattice)

dotplot(dat$Sepal.Length ~ dat$Species)

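Scatterplot

Scatterplots allow you to check whether there is a potential link between two quantitative variables. For this reason, scatterplots are often used to visualize a potential correlation between two variables. For instance, when drawing a scatterplot of the length of the sepal and the length of the petal: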

plot(dat$Sepal.Length, dat$Petal.Length)

There seems to be a positive association between the two variables.

In {ggplot2}:

ggplot(dat) +
  aes(x = Sepal.Length, y = Petal.Length) +
  geom_point()

Like boxplots, scatterplots are even more informative when the points are differentiated according to a factor, in this case the species:

ggplot(dat) +
  aes(x = Sepal.Length, y = Petal.Length, colour = Species) +
  geom_point() +
  scale_color_hue()

Line plot

Line plots, particularly useful in time series or finance, can be created by adding the type = "l" argument in the plot() function:

plot(dat$Sepal.Length,
  type = "l"
) # "l" for line

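QQ-plot

For a single variable

In order to check the normality assumption of a variable (normality means that the data follow a normal distribution, also known as a Gaussian distribution), we usually use histograms and/or QQ-plots.1 See the article discussing the normal distribution and how to evaluate the normality assumption in R if you need a refresher on that subject. Histograms have been presented earlier, so here is how to draw a QQ-plot: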

# Draw points on the qq-plot:
qqnorm(dat$Sepal.Length)
# Draw the reference line:
qqline(dat$Sepal.Length)

Or a QQ-plot with confidence bands with the qqPlot() function from the {car} package:

library(car) # package must be installed first
qqPlot(dat$Sepal.Length)

## [1] 132 118

If points are close to the reference line (sometimes referred to as Henry's line) and within the confidence bands, the normality assumption can be considered as met. The bigger the deviation between the points and the reference line and the more they lie outside the confidence bands, the less likely it is that the normality condition is met. The variable Sepal.Length does not seem to follow a normal distribution because several points lie outside the confidence bands. When facing a non-normal distribution, the first step is usually to apply the logarithm transformation on the data and recheck to see whether the log-transformed data are normally distributed. Applying the logarithm transformation can be done with the log() function.
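A minimal sketch of that recheck with base R functions only: the variable is log-transformed and the QQ-plot is redrawn on the transformed values. The name log_len is illustrative, and no claim is made here about whether the transformed data end up looking normal.

```r
dat <- iris

# Natural logarithm of the variable (all values are strictly positive)
log_len <- log(dat$Sepal.Length)

# QQ-plot of the log-transformed data, with its reference line
qqnorm(log_len)
qqline(log_len)
```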

In {ggpubr}:

library(ggpubr)
ggqqplot(dat$Sepal.Length)

By groups

For some statistical tests, the normality assumption is required in all groups. One solution is to draw a QQ-plot for each group by manually splitting the dataset into different groups and then drawing a QQ-plot for each subset of the data (with the methods shown above). Another (easier) solution is to draw a QQ-plot for each group automatically with the argument groups = in the function qqPlot() from the {car} package:

qqPlot(dat$Sepal.Length, groups = dat$size)
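For comparison, the manual approach can be sketched with base R only: split() separates the variable by group and a loop draws one QQ-plot per group. The block recreates the size variable so that it runs on its own; the name groups is illustrative.

```r
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length),
  "small", "big"
)

# One numeric vector per level of size
groups <- split(dat$Sepal.Length, dat$size)

# Draw the QQ-plots side by side, then restore the graphical parameters
op <- par(mfrow = c(1, length(groups)))
for (g in names(groups)) {
  qqnorm(groups[[g]], main = paste("size =", g))
  qqline(groups[[g]])
}
par(op)
```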

In {ggplot2}:

qplot(
  sample = Sepal.Length, data = dat,
  col = size, shape = size
)

It is also possible to differentiate groups by only shape or color. For this, remove one of the arguments col or shape in the qplot() function above.

Density plot

A density plot is a smoothed version of the histogram and is used in the same way, that is, to represent the distribution of a numeric variable. The functions plot() and density() are used together to draw a density plot:

plot(density(dat$Sepal.Length))

In {ggplot2}:

ggplot(dat) +
  aes(x = Sepal.Length) +
  geom_density()

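Correlation plot

The last type of descriptive plot is the correlation plot, also called a correlogram. This type of graph is more complex than the ones presented above, so it is detailed in a separate article. See how to draw a correlogram to highlight the most correlated variables in a dataset.

Advanced descriptive statistics

We covered the main functions to compute the most common and basic descriptive statistics. There are, however, many more functions and packages to perform more advanced descriptive statistics in R. In this section, I present some of them with applications to our dataset.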

{summarytools} package

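One package for descriptive statistics I often use for my projects in R is the {summarytools} package. The package is centered around 4 functions:

  1. freq() for frequency tables
  2. ctable() for cross-tabulations
  3. descr() for descriptive statistics
  4. dfSummary() for data frame summaries

A combination of these 4 functions is usually more than enough for most descriptive analyses. Moreover, the package has been built with R Markdown in mind, meaning that outputs render well in HTML reports. For non-English speakers, built-in translations exist for French, Portuguese, Spanish, Russian and Turkish.

I illustrate each of the 4 functions in the following sections. The outputs that follow display much better in R Markdown reports, but in this article I limit myself to the raw outputs as the goal is to show how the functions work, not how to make them render well. See the setup settings in the vignette of the package if you want to print the outputs in a nice way in R Markdown.2

Frequency tables with freq()

The freq() function produces frequency tables with frequencies, proportions, as well as missing data information.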

library(summarytools)
freq(dat$Species)
## Frequencies  
## dat$Species  
## Type: Factor  
## 
##                    Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ---------------- ------ --------- -------------- --------- --------------
##           setosa     50     33.33          33.33     33.33          33.33
##       versicolor     50     33.33          66.67     33.33          66.67
##        virginica     50     33.33         100.00     33.33         100.00
##             <NA>      0                               0.00         100.00
##            Total    150    100.00         100.00    100.00         100.00

If you do not need information about missing values, add the report.nas = FALSE argument:

freq(dat$Species,
  report.nas = FALSE # remove NA information
)
## Frequencies  
## dat$Species  
## Type: Factor  
## 
##                    Freq        %   % Cum.
## ---------------- ------ -------- --------
##           setosa     50    33.33    33.33
##       versicolor     50    33.33    66.67
##        virginica     50    33.33   100.00
##            Total    150   100.00   100.00

For a minimalist output with only counts and proportions:

freq(dat$Species,
  report.nas = FALSE, # remove NA information
  totals = FALSE, # remove totals
  cumul = FALSE, # remove cumuls
  headings = FALSE # remove headings
)
## 
##                    Freq       %
## ---------------- ------ -------
##           setosa     50   33.33
##       versicolor     50   33.33
##        virginica     50   33.33

Cross-tabulations with ctable()

The ctable() function produces cross-tabulations (also known as contingency tables) for pairs of categorical variables. Using the two categorical variables in our dataset:

ctable(
  x = dat$Species,
  y = dat$size
)
## Cross-Tabulation, Row Proportions  
## Species * size  
## Data Frame: dat  
## 
## ------------ ------ ------------ ------------ --------------
##                size          big        small          Total
##      Species                                                
##       setosa           1 ( 2.0%)   49 (98.0%)    50 (100.0%)
##   versicolor          29 (58.0%)   21 (42.0%)    50 (100.0%)
##    virginica          47 (94.0%)    3 ( 6.0%)    50 (100.0%)
##        Total          77 (51.3%)   73 (48.7%)   150 (100.0%)
## ------------ ------ ------------ ------------ --------------

Row proportions are shown by default. To display column or total proportions, add the prop = "c" or prop = "t" arguments, respectively:

ctable(
  x = dat$Species,
  y = dat$size,
  prop = "t" # total proportions
)
## Cross-Tabulation, Total Proportions  
## Species * size  
## Data Frame: dat  
## 
## ------------ ------ ------------ ------------ --------------
##                size          big        small          Total
##      Species                                                
##       setosa           1 ( 0.7%)   49 (32.7%)    50 ( 33.3%)
##   versicolor          29 (19.3%)   21 (14.0%)    50 ( 33.3%)
##    virginica          47 (31.3%)    3 ( 2.0%)    50 ( 33.3%)
##        Total          77 (51.3%)   73 (48.7%)   150 (100.0%)
## ------------ ------ ------------ ------------ --------------

To remove proportions altogether, add the argument prop = "n". Furthermore, to display only the bare minimum, add the totals = FALSE and headings = FALSE arguments:

ctable(
  x = dat$Species,
  y = dat$size,
  prop = "n", # remove proportions
  totals = FALSE, # remove totals
  headings = FALSE # remove headings
)
## 
## ------------ ------ ----- -------
##                size   big   small
##      Species                     
##       setosa            1      49
##   versicolor           29      21
##    virginica           47       3
## ------------ ------ ----- -------

This is equivalent to table(dat$Species, dat$size) and xtabs(~ dat$Species + dat$size) performed in the section on contingency tables.

To display the results of the Chi-square test of independence, add the chisq = TRUE argument:3

ctable(
  x = dat$Species,
  y = dat$size,
  chisq = TRUE, # display results of Chi-square test of independence
  headings = FALSE # remove headings
)
## 
## ------------ ------ ------------ ------------ --------------
##                size          big        small          Total
##      Species                                                
##       setosa           1 ( 2.0%)   49 (98.0%)    50 (100.0%)
##   versicolor          29 (58.0%)   21 (42.0%)    50 (100.0%)
##    virginica          47 (94.0%)    3 ( 6.0%)    50 (100.0%)
##        Total          77 (51.3%)   73 (48.7%)   150 (100.0%)
## ------------ ------ ------------ ------------ --------------
## 
## ----------------------------
##  Chi.squared   df   p.value 
## ------------- ---- ---------
##     86.03      2       0    
## ----------------------------

The p-value is close to 0 so we reject the null hypothesis of independence between the two variables. In our context, this indicates that species and size are dependent and that there is a significant relationship between the two variables.

It is also possible to create a contingency table for each level of a third categorical variable thanks to the combination of the stby() and ctable() functions. There are only 2 categorical variables in our dataset, so let's use the tobacco dataset which has 4 categorical variables (i.e., gender, age group, smoker, diseased). For this example, we would like to create a contingency table of the variables smoker and diseased, and this for each gender:

stby(list(
  x = tobacco$smoker, # smoker and diseased
  y = tobacco$diseased
),
INDICES = tobacco$gender, # for each gender
FUN = ctable # ctable for cross-tabulation
)
## Cross-Tabulation, Row Proportions  
## smoker * diseased  
## Data Frame: tobacco  
## Group: gender = F  
## 
## -------- ---------- ------------- ------------- --------------
##            diseased           Yes            No          Total
##   smoker                                                      
##      Yes               62 (42.2%)    85 (57.8%)   147 (100.0%)
##       No               49 (14.3%)   293 (85.7%)   342 (100.0%)
##    Total              111 (22.7%)   378 (77.3%)   489 (100.0%)
## -------- ---------- ------------- ------------- --------------
## 
## Group: gender = M  
## 
## -------- ---------- ------------- ------------- --------------
##            diseased           Yes            No          Total
##   smoker                                                      
##      Yes               63 (44.1%)    80 (55.9%)   143 (100.0%)
##       No               47 (13.6%)   299 (86.4%)   346 (100.0%)
##    Total              110 (22.5%)   379 (77.5%)   489 (100.0%)
## -------- ---------- ------------- ------------- --------------

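Descriptive statistics with descr()

The descr() function produces descriptive (univariate) statistics with common central tendency statistics and measures of dispersion. (See the difference between a measure of central tendency and dispersion if you need a reminder.)

The main advantage of this function is that it accepts single vectors as well as data frames. If a data frame is provided, all non-numerical columns are ignored so you do not have to remove them yourself before running the function.

The descr() function allows you to display: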

  • only a selection of descriptive statistics of your choice, with the stats = c("mean", "sd") argument for mean and standard deviation for example
  • the minimum, first quartile, median, third quartile and maximum, with stats = "fivenum"
  • the most common descriptive statistics (mean, standard deviation, minimum, median, maximum, number and percentage of valid observations), with stats = "common":
descr(dat,
  headings = FALSE, # remove headings
  stats = "common" # most common descriptive statistics
)
## 
##                   Petal.Length   Petal.Width   Sepal.Length   Sepal.Width
## --------------- -------------- ------------- -------------- -------------
##            Mean           3.76          1.20           5.84          3.06
##         Std.Dev           1.77          0.76           0.83          0.44
##             Min           1.00          0.10           4.30          2.00
##          Median           4.35          1.30           5.80          3.00
##             Max           6.90          2.50           7.90          4.40
##         N.Valid         150.00        150.00         150.00        150.00
##       Pct.Valid         100.00        100.00         100.00        100.00

Tip: if you have a large number of variables, add the transpose = TRUE argument for a better display.

In order to compute these descriptive statistics by group (e.g., Species in our dataset), use the descr() function in combination with the stby() function:

stby(
  data = dat,
  INDICES = dat$Species, # by Species
  FUN = descr, # descriptive statistics
  stats = "common" # most common descr. stats
)
## Descriptive Statistics  
## dat  
## Group: Species = setosa  
## N: 50  
## 
##                   Petal.Length   Petal.Width   Sepal.Length   Sepal.Width
## --------------- -------------- ------------- -------------- -------------
##            Mean           1.46          0.25           5.01          3.43
##         Std.Dev           0.17          0.11           0.35          0.38
##             Min           1.00          0.10           4.30          2.30
##          Median           1.50          0.20           5.00          3.40
##             Max           1.90          0.60           5.80          4.40
##         N.Valid          50.00         50.00          50.00         50.00
##       Pct.Valid         100.00        100.00         100.00        100.00
## 
## Group: Species = versicolor  
## N: 50  
## 
##                   Petal.Length   Petal.Width   Sepal.Length   Sepal.Width
## --------------- -------------- ------------- -------------- -------------
##            Mean           4.26          1.33           5.94          2.77
##         Std.Dev           0.47          0.20           0.52          0.31
##             Min           3.00          1.00           4.90          2.00
##          Median           4.35          1.30           5.90          2.80
##             Max           5.10          1.80           7.00          3.40
##         N.Valid          50.00         50.00          50.00         50.00
##       Pct.Valid         100.00        100.00         100.00        100.00
## 
## Group: Species = virginica  
## N: 50  
## 
##                   Petal.Length   Petal.Width   Sepal.Length   Sepal.Width
## --------------- -------------- ------------- -------------- -------------
##            Mean           5.55          2.03           6.59          2.97
##         Std.Dev           0.55          0.27           0.64          0.32
##             Min           4.50          1.40           4.90          2.20
##          Median           5.55          2.00           6.50          3.00
##             Max           6.90          2.50           7.90          3.80
##         N.Valid          50.00         50.00          50.00         50.00
##       Pct.Valid         100.00        100.00         100.00        100.00

Data frame summaries with dfSummary()

The dfSummary() function generates a summary table with statistics, frequencies and graphs for all variables in a dataset. The information shown depends on the type of the variables (character, factor, numeric, date) and also varies according to the number of distinct values.

dfSummary(dat)
## Data Frame Summary  
## dat  
## Dimensions: 150 x 6  
## Duplicates: 1  
## 
## ----------------------------------------------------------------------------------------------------------------------
## No   Variable        Stats / Values           Freqs (% of Valid)   Graph                            Valid    Missing  
## ---- --------------- ------------------------ -------------------- -------------------------------- -------- ---------
## 1    Sepal.Length    Mean (sd) : 5.8 (0.8)    35 distinct values     . . : :                        150      0        
##      [numeric]       min < med < max:                                : : : :                        (100%)   (0%)     
##                      4.3 < 5.8 < 7.9                                 : : : : :                                        
##                      IQR (CV) : 1.3 (0.1)                            : : : : :                                        
##                                                                    : : : : : : : :                                    
## 
## 2    Sepal.Width     Mean (sd) : 3.1 (0.4)    23 distinct values           :                        150      0        
##      [numeric]       min < med < max:                                      :                        (100%)   (0%)     
##                      2 < 3 < 4.4                                         . :                                          
##                      IQR (CV) : 0.5 (0.1)                              : : : :                                        
##                                                                    . . : : : : : :                                    
## 
## 3    Petal.Length    Mean (sd) : 3.8 (1.8)    43 distinct values   :                                150      0        
##      [numeric]       min < med < max:                              :         . :                    (100%)   (0%)     
##                      1 < 4.3 < 6.9                                 :         : : .                                    
##                      IQR (CV) : 3.5 (0.5)                          : :       : : : .                                  
##                                                                    : :   . : : : : : .                                
## 
## 4    Petal.Width     Mean (sd) : 1.2 (0.8)    22 distinct values   :                                150      0        
##      [numeric]       min < med < max:                              :                                (100%)   (0%)     
##                      0.1 < 1.3 < 2.5                               :       . .   :                                    
##                      IQR (CV) : 1.5 (0.6)                          :       : :   :   .                                
##                                                                    : :   : : : . : : :                                
## 
## 5    Species         1. setosa                50 (33.3%)           IIIIII                           150      0        
##      [factor]        2. versicolor            50 (33.3%)           IIIIII                           (100%)   (0%)     
##                      3. virginica             50 (33.3%)           IIIIII                                             
## 
## 6    size            1. big                   77 (51.3%)           IIIIIIIIII                       150      0        
##      [character]     2. small                 73 (48.7%)           IIIIIIIII                        (100%)   (0%)     
## ----------------------------------------------------------------------------------------------------------------------
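Several figures in this table can be cross-checked with base R. The following is a minimal sketch, not a description of how dfSummary() computes them internally. It assumes only the built-in iris data (the extra size column is left out, since it does not affect these particular numbers).

```r
dat <- iris # built-in dataset, without the extra size column

# Number of fully duplicated rows (reported as "Duplicates")
n_dup <- sum(duplicated(dat))

# Number of distinct values of a numeric variable
n_distinct <- length(unique(dat$Sepal.Length))

# Interquartile range and coefficient of variation (sd divided by mean)
iqr_sl <- IQR(dat$Sepal.Length)
cv_sl <- sd(dat$Sepal.Length) / mean(dat$Sepal.Length)

c(duplicates = n_dup, distinct = n_distinct, IQR = iqr_sl, CV = round(cv_sl, 1))
```

The values should agree with the Sepal.Length row and the header of the table above (1 duplicate, 35 distinct values, IQR of 1.3 and CV of 0.1).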

describeBy() from the {psych} package

The describeBy() function from the {psych} package reports several summary statistics (number of valid cases, mean, standard deviation, median, trimmed mean, median absolute deviation from the median (mad), minimum, maximum, range, skewness, kurtosis and standard error) by a grouping variable.

library(psych)
describeBy(
  dat,
  dat$Species # grouping variable
)
## 
##  Descriptive statistics by group 
## group: setosa
##              vars  n mean   sd median trimmed  mad min  max range skew kurtosis
## Sepal.Length    1 50 5.01 0.35    5.0    5.00 0.30 4.3  5.8   1.5 0.11    -0.45
## Sepal.Width     2 50 3.43 0.38    3.4    3.42 0.37 2.3  4.4   2.1 0.04     0.60
## Petal.Length    3 50 1.46 0.17    1.5    1.46 0.15 1.0  1.9   0.9 0.10     0.65
## Petal.Width     4 50 0.25 0.11    0.2    0.24 0.00 0.1  0.6   0.5 1.18     1.26
## Species*        5 50 1.00 0.00    1.0    1.00 0.00 1.0  1.0   0.0  NaN      NaN
## size*           6 50  NaN   NA     NA     NaN   NA Inf -Inf  -Inf   NA       NA
##                se
## Sepal.Length 0.05
## Sepal.Width  0.05
## Petal.Length 0.02
## Petal.Width  0.01
## Species*     0.00
## size*          NA
## ------------------------------------------------------------ 
## group: versicolor
##              vars  n mean   sd median trimmed  mad min  max range  skew
## Sepal.Length    1 50 5.94 0.52   5.90    5.94 0.52 4.9  7.0   2.1  0.10
## Sepal.Width     2 50 2.77 0.31   2.80    2.78 0.30 2.0  3.4   1.4 -0.34
## Petal.Length    3 50 4.26 0.47   4.35    4.29 0.52 3.0  5.1   2.1 -0.57
## Petal.Width     4 50 1.33 0.20   1.30    1.32 0.22 1.0  1.8   0.8 -0.03
## Species*        5 50 2.00 0.00   2.00    2.00 0.00 2.0  2.0   0.0   NaN
## size*           6 50  NaN   NA     NA     NaN   NA Inf -Inf  -Inf    NA
##              kurtosis   se
## Sepal.Length    -0.69 0.07
## Sepal.Width     -0.55 0.04
## Petal.Length    -0.19 0.07
## Petal.Width     -0.59 0.03
## Species*          NaN 0.00
## size*              NA   NA
## ------------------------------------------------------------ 
## group: virginica
##              vars  n mean   sd median trimmed  mad min  max range  skew
## Sepal.Length    1 50 6.59 0.64   6.50    6.57 0.59 4.9  7.9   3.0  0.11
## Sepal.Width     2 50 2.97 0.32   3.00    2.96 0.30 2.2  3.8   1.6  0.34
## Petal.Length    3 50 5.55 0.55   5.55    5.51 0.67 4.5  6.9   2.4  0.52
## Petal.Width     4 50 2.03 0.27   2.00    2.03 0.30 1.4  2.5   1.1 -0.12
## Species*        5 50 3.00 0.00   3.00    3.00 0.00 3.0  3.0   0.0   NaN
## size*           6 50  NaN   NA     NA     NaN   NA Inf -Inf  -Inf    NA
##              kurtosis   se
## Sepal.Length    -0.20 0.09
## Sepal.Width      0.38 0.05
## Petal.Length    -0.37 0.08
## Petal.Width     -0.75 0.04
## Species*          NaN 0.00
## size*              NA   NA
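To make the less familiar columns more concrete, here is a minimal base R sketch that recomputes the trimmed mean, the mad and the standard error for Sepal.Length in the setosa group. It assumes that 10% of the observations are trimmed from each end, which matches the output above but is not stated in this article, and that mad uses the default scaling constant of R's mad() function.

```r
# Sepal length of the setosa flowers only
x <- iris$Sepal.Length[iris$Species == "setosa"]

trimmed <- mean(x, trim = 0.1)     # trimmed mean (assumption: 10% cut from each end)
mad_x   <- mad(x)                  # median absolute deviation, scaled by 1.4826 by default
se      <- sd(x) / sqrt(length(x)) # standard error of the mean

round(c(trimmed = trimmed, mad = mad_x, se = se), 2)
```

Rounded to two decimals, these values should match the trimmed, mad and se entries of the setosa block (5.00, 0.30 and 0.05).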

aggregate() function

The aggregate() function splits the data into subsets and then computes summary statistics for each subset. For instance, to compute the mean of the variables Sepal.Length and Sepal.Width by Species and size:

aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species + size,
  data = dat,
  mean
)
##      Species  size Sepal.Length Sepal.Width
## 1     setosa   big     5.800000    4.000000
## 2 versicolor   big     6.282759    2.868966
## 3  virginica   big     6.663830    2.997872
## 4     setosa small     4.989796    3.416327
## 5 versicolor small     5.457143    2.633333
## 6  virginica small     5.400000    2.600000
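The function passed to aggregate() does not have to return a single number. As a hedged illustration (one possible approach, using only the built-in iris data and grouping by Species alone), the sketch below returns both the mean and the standard deviation through an anonymous function:

```r
dat <- iris

# Mean and standard deviation of two variables, by species
res <- aggregate(
  cbind(Sepal.Length, Sepal.Width) ~ Species,
  data = dat,
  FUN = function(x) c(mean = mean(x), sd = sd(x))
)
res
```

In this case each result column is itself a matrix with one column per statistic, so a single statistic is extracted with, for example, res$Sepal.Length[, "mean"].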

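Thanks for reading. I hope this article helped you to do descriptive statistics in R. If you would like to do the same by hand or understand what these statistics represent, I invite you to read the article "Descriptive statistics by hand".

As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion.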


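  1. Normality tests such as the Shapiro-Wilk or Kolmogorov-Smirnov test can also be used to check whether the data follow a normal distribution. In practice, however, normality tests are often considered too strict: with large sample sizes, even a small deviation from normality may lead to rejecting the normality condition. For this reason, the normality condition is usually verified with a combination of visual inspection (histograms and QQ-plots) and a formal test (e.g., the Shapiro-Wilk test).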

  2. Note that the plain.ascii and style arguments are needed for this package. In our examples, these arguments are set in the options of each chunk, so they are not visible.

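  3. Note that it is also possible to compute odds ratios and risk ratios. See the vignette of the package for more information, as these ratios are beyond the scope of this article.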


