Data manipulation in R

Antoine Soetewey 2019-12-24 37 minute read

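Introduction

Not all data frames are as clean and tidy as you would expect. Therefore, after importing your dataset into RStudio, most of the time you will need to prepare it before performing any statistical analyses. Data manipulation can even take longer than the actual analyses when the quality of the data is poor.

Data manipulation includes a broad range of tools and techniques. We present here the manipulations that you will most likely need in R. If you find that other data manipulations are essential, do not hesitate to let me know (as a comment at the end of this article) so I can add them.

In this article, we show the main functions to manipulate data in R. We first illustrate these functions on vectors, factors and lists. Then we illustrate the main functions to manipulate data frames and dates/times in R.

Vectors

Concatenation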

We can concatenate (i.e., combine) numbers or strings with c():

c(2, 4, -1)
## [1]  2  4 -1
c(1, 5 / 6, 2^3, -0.05)
## [1]  1.0000000  0.8333333  8.0000000 -0.0500000

Note that by default R displays 7 significant digits. You can modify it with options(digits = 2) (two significant digits).

It is also possible to create a succession of consecutive integers:

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
# is the same as
c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
##  [1]  1  2  3  4  5  6  7  8  9 10
# or
c(1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10

seq() and rep()

seq() allows you to create a vector defined by a sequence. You can either choose the increment:

seq(from = 2, to = 5, by = 0.5)
## [1] 2.0 2.5 3.0 3.5 4.0 4.5 5.0

or its length:

seq(from = 2, to = 5, length.out = 7)
## [1] 2.0 2.5 3.0 3.5 4.0 4.5 5.0

On the other hand, rep() creates a vector which is the repetition of numbers or strings:

rep(1, times = 3)
## [1] 1 1 1
rep(c("A", "B", "C"), times = c(3, 1, 2))
## [1] "A" "A" "A" "B" "C" "C"

You can also create a vector which is the repetition of numbers and strings:

rep(c("A", 2, "C"), times = c(3, 1, 2))
## [1] "A" "A" "A" "2" "C" "C"

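but in this case the number 2 will also be considered as a string (and not as a numeric) since there is at least one string in the vector.

Assignment

There are three ways to assign an object in R: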

  1. <-
  2. =
  3. assign()
# 1st method
x <- c(2.1, 5, -4, 1, 5)
x
## [1]  2.1  5.0 -4.0  1.0  5.0
# 2nd method
x2 = c(2.1, 5, -4, 1, 5)
x2
## [1]  2.1  5.0 -4.0  1.0  5.0
# 3rd method (less common)
assign("x3", c(2.1, 5, -4, 1, 5))
x3
## [1]  2.1  5.0 -4.0  1.0  5.0

You can also assign a vector to another vector, for example:

y <- c(x, 10, 1 / 4)
y
## [1]  2.10  5.00 -4.00  1.00  5.00 10.00  0.25

Elements of a vector

We can select one or several elements of a vector by specifying their position between square brackets:

# select one element
x[3]
## [1] -4
# select more than one element with c()
x[c(1, 3, 4)]
## [1]  2.1 -4.0  1.0

Note that in R the numbering of the indices starts at 1 (and not at 0 like in many other programming languages) so x[1] gives the first element of the vector x.

We can also use booleans (i.e., TRUE or FALSE) to select some elements of a vector. This method selects only the elements corresponding to TRUE:

x[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
## [1]  2.1 -4.0  1.0

Or we can give the elements to withdraw:

x[-c(2, 4)]
## [1]  2.1 -4.0  5.0
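
The logical selection shown above is most useful when the TRUE/FALSE vector comes from a condition rather than being typed by hand. The following is a small sketch of that idea (the object v is illustrative and the comparison operator > is presented later in this article):

```r
v <- c(2.1, 5, -4, 1, 5)

# keep only the elements satisfying a condition
v[v > 2]
## [1] 2.1 5.0 5.0

# which() returns the positions of the TRUE values
which(v > 2)
## [1] 1 2 5
```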

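Type and length

The main types of a vector are numeric, logical and character. For more details on each type, see the different data types in R.

class() gives the vector type: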

x <- c(2.1, 5, -4, 1, 5, 0)
class(x)
## [1] "numeric"
y <- c(x, "Hello")
class(y)
## [1] "character"

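As you can see above, the class of a vector will be numeric only if all of its elements are numeric. As soon as one element is a character, the class of the vector will be character.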

z <- c(TRUE, FALSE, FALSE)
class(z)
## [1] "logical"

length() gives the length of a vector:

length(x)
## [1] 6

So to select the last element of a vector (in a dynamic way), we can use a combination of length() and []:

x[length(x)]
## [1] 0

Finding the vector type

We can find the type of a vector with the family of is.type functions:

is.numeric(x)
## [1] TRUE
is.logical(x)
## [1] FALSE
is.character(x)
## [1] FALSE

Or in a more generic way with the is() function:

is(x)
## [1] "numeric" "vector"

Modification of type and length

We can change the type of a vector with the as.numeric(), as.logical() and as.character() functions:

x_character <- as.character(x)
x_character
## [1] "2.1" "5"   "-4"  "1"   "5"   "0"
is.character(x_character)
## [1] TRUE
x_logical <- as.logical(x)
x_logical
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
is.logical(x_logical)
## [1] TRUE

It is also possible to change its length:

length(x) <- 4
x
## [1]  2.1  5.0 -4.0  1.0

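As you can see, the first elements of the vector are conserved, while all others are removed. In this case, the first 4 are kept because we specified a length of 4.

Numerical operators

The basic numerical operators such as +, -, *, / and ^ can be applied to vectors: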

x <- c(2.1, 5, -4, 1)
y <- c(0, -7, 1, 1 / 4)

x + y
## [1]  2.10 -2.00 -3.00  1.25
x * y
## [1]   0.00 -35.00  -4.00   0.25
x^y
## [1]  1.00e+00  1.28e-05 -4.00e+00  1.00e+00
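
The operations above work element by element on two vectors of the same length. When the lengths differ, R recycles the shorter vector. A minimal sketch of this behaviour (the object a is illustrative and not part of the examples above):

```r
a <- c(1, 2, 3, 4)

a + c(10, 20) # c(10, 20) is recycled as c(10, 20, 10, 20)
## [1] 11 22 13 24

a * 2 # a single number is recycled for every element
## [1] 2 4 6 8
```

R gives a warning when the length of the longer vector is not a multiple of the length of the shorter one.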

It is also possible to compute the minimum, maximum, sum, product, cumulative sum and cumulative product of a vector:

min(x)
## [1] -4
max(x)
## [1] 5
sum(x)
## [1] 4.1
prod(x)
## [1] -42
cumsum(x)
## [1] 2.1 7.1 3.1 4.1
cumprod(x)
## [1]   2.1  10.5 -42.0 -42.0

The following mathematical operations can be applied too:

  • sqrt() (square root)
  • cos() (cosine)
  • sin() (sine)
  • tan() (tangent)
  • log() (logarithm)
  • log10() (base 10 logarithm)
  • exp() (exponential)
  • abs() (absolute value)
cos(x)
## [1] -0.5048461  0.2836622 -0.6536436  0.5403023
exp(x)
## [1]   8.16616991 148.41315910   0.01831564   2.71828183

If you need to round a number, you can use the round(), floor() and ceiling() functions:

round(cos(x), digits = 3) # 3 decimals
## [1] -0.505  0.284 -0.654  0.540
floor(cos(x)) # largest integer not greater than x
## [1] -1  0 -1  0
ceiling(cos(x)) # smallest integer not less than x
## [1] 0 1 0 1

Logical operators

The most common logical operators in R are:

  • Negation: !
  • Comparisons: <, <=, >=, >, == (equality), != (difference)
  • And: &
  • Or: |
x
## [1]  2.1  5.0 -4.0  1.0
x <= c(1, 6, 3, 4)
## [1] FALSE  TRUE  TRUE  TRUE
x <= 1
## [1] FALSE FALSE  TRUE  TRUE
(x == 1 | x > 4)
## [1] FALSE  TRUE FALSE  TRUE
!(x == 1 | x > 4)
## [1]  TRUE FALSE  TRUE FALSE

all() and any()

As the names suggest, all() returns TRUE if the condition is met for all elements, whereas any() returns TRUE if the condition is met for at least one element of a vector:

x
## [1]  2.1  5.0 -4.0  1.0
x <= 1
## [1] FALSE FALSE  TRUE  TRUE
all(x <= 1)
## [1] FALSE
any(x <= 1)
## [1] TRUE

Operations on vectors of strings

You can paste at least two vectors together:

code <- paste(c("BE", "BE", "FR", "EN", "BE"), 1:5, sep = "/")
code
## [1] "BE/1" "BE/2" "FR/3" "EN/4" "BE/5"

The argument sep stands for separator and allows you to specify the character(s) or symbol(s) used to separate each character string.

If you do not want to specify a separator, you can use sep = "" or the paste0() function:

paste(c("BE", "BE", "FR", "EN", "BE"), 1:5, sep = "")
## [1] "BE1" "BE2" "FR3" "EN4" "BE5"
paste0(c("BE", "BE", "FR", "EN", "BE"), 1:5)
## [1] "BE1" "BE2" "FR3" "EN4" "BE5"

To find the positions of the elements containing a given string, use the grep() function:

grep("BE", code)
## [1] 1 2 5

To extract a character string based on the beginning and the end positions, we can use the substr() function:

substr(code,
  start = 1,
  stop = 3
) # extract characters 1 to 3
## [1] "BE/" "BE/" "FR/" "EN/" "BE/"

Replace a character string by another one if it exists in the vector by using the sub() function:

sub(
  pattern = "BE", # find BE
  replacement = "BEL", # replace it with BEL
  code
)
## [1] "BEL/1" "BEL/2" "FR/3"  "EN/4"  "BEL/5"
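
Note that sub() replaces only the first match inside each string, whereas gsub() replaces all matches. A short sketch of the difference (the object fruit is illustrative):

```r
fruit <- "banana"

sub(pattern = "a", replacement = "A", fruit) # first match only
## [1] "bAnana"

gsub(pattern = "a", replacement = "A", fruit) # all matches
## [1] "bAnAnA"
```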

Split a character string based on a specific symbol with the strsplit() function:

strsplit(c("Rafael Nadal", "Roger Federer", "Novak Djokovic"),
  split = " "
)
## [[1]]
## [1] "Rafael" "Nadal" 
## 
## [[2]]
## [1] "Roger"   "Federer"
## 
## [[3]]
## [1] "Novak"    "Djokovic"
strsplit(code,
  split = "/"
)
## [[1]]
## [1] "BE" "1" 
## 
## [[2]]
## [1] "BE" "2" 
## 
## [[3]]
## [1] "FR" "3" 
## 
## [[4]]
## [1] "EN" "4" 
## 
## [[5]]
## [1] "BE" "5"

To transform a character vector to uppercase and lowercase:

toupper(c("Rafael Nadal", "Roger Federer", "Novak Djokovic"))
## [1] "RAFAEL NADAL"   "ROGER FEDERER"  "NOVAK DJOKOVIC"
tolower(c("Rafael Nadal", "Roger Federer", "Novak Djokovic"))
## [1] "rafael nadal"   "roger federer"  "novak djokovic"

Orders and vectors

We can sort the elements of a vector from smallest to largest, or from largest to smallest:

x <- c(2.1, 5, -4, 1, 1)
sort(x) # smallest to largest
## [1] -4.0  1.0  1.0  2.1  5.0
sort(x, decreasing = TRUE) # largest to smallest
## [1]  5.0  2.1  1.0  1.0 -4.0

order() gives the permutation to apply to the vector in order to sort its elements:

order(x)
## [1] 3 4 5 1 2

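As you can see, the third element of the vector is the smallest and the second element is the largest. This is indicated by the 3 at the beginning of the output, and the 2 at the end of the output.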

Like sort() the decreasing = TRUE argument can also be added:

order(x, decreasing = TRUE)
## [1] 2 1 4 5 3

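In this case, the 2 in the output indicates that the second element of the vector is the largest, while the 3 indicates that the third element is the smallest.

rank() gives the ranks of the elements: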

rank(x)
## [1] 4.0 5.0 1.0 2.5 2.5

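The last two elements of the vector have a rank of 2.5 because they are equal and they share ranks 2 and 3, that is, they come after the first rank but before the fourth rank.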
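
To make the link between these functions more concrete, here is a small sketch (the object s is illustrative): indexing a vector by its order() gives the same result as sort(), and the ties.method argument of rank() controls how equal values are ranked.

```r
s <- c(2.1, 5, -4, 1, 1)

s[order(s)] # indexing with order() sorts the vector
## [1] -4.0  1.0  1.0  2.1  5.0

identical(s[order(s)], sort(s))
## [1] TRUE

rank(s, ties.method = "min") # ties share the smallest rank instead of the average
## [1] 4 5 1 2 2
```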

We can also reverse the elements (from the last one to the first one):

x
## [1]  2.1  5.0 -4.0  1.0  1.0
rev(x)
## [1]  1.0  1.0 -4.0  5.0  2.1

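Factors

Factors in R are vectors with a list of levels, also referred to as categories. Factors are useful for qualitative data such as gender, civil status, eye color, etc.

Creating factors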

We create factors with the factor() function (do not forget the c()):

f1 <- factor(c("T1", "T3", "T1", "T2"))
f1
## [1] T1 T3 T1 T2
## Levels: T1 T2 T3

We can of course create a factor from an existing vector:

v <- c(1, 1, 0, 1, 0)
v2 <- factor(v,
  levels = c(0, 1),
  labels = c("bad", "good")
)
v2
## [1] good good bad  good bad 
## Levels: bad good

We can also specify that the levels are ordered by adding the ordered = TRUE argument:

v2 <- factor(v,
  levels = c(0, 1),
  labels = c("bad", "good"),
  ordered = TRUE
)
v2
## [1] good good bad  good bad 
## Levels: bad < good

Note that the order of the levels follows the order specified in the levels argument, and the labels argument must list the labels in that same order.

Properties

To know the names of the levels:

levels(f1)
## [1] "T1" "T2" "T3"

For the number of levels:

nlevels(f1)
## [1] 3

In R, the first level is always the reference level. This reference level can be modified with relevel():

relevel(f1, ref = "T3")
## [1] T1 T3 T1 T2
## Levels: T3 T1 T2

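You see that "T3" is now the first and thus the reference level. Changing the reference level has an impact on the order in which the levels are displayed or treated in statistical analyses. Compare, for instance, boxplots with different reference levels.

Handling

To know the frequencies for each level: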

table(f1)
## f1
## T1 T2 T3 
##  2  1  1
# or
summary(f1)
## T1 T2 T3 
##  2  1  1

Note that the relative frequencies (i.e., the proportions) can be found with the combination of prop.table() and table() or summary():

prop.table(table(f1))
## f1
##   T1   T2   T3 
## 0.50 0.25 0.25
# or
prop.table(summary(f1))
##   T1   T2   T3 
## 0.50 0.25 0.25

Remember that a factor is coded in R as a numeric vector even though it looks like a character one. We can transform a factor into its numerical equivalent with the as.numeric() function:

f1
## [1] T1 T3 T1 T2
## Levels: T1 T2 T3
as.numeric(f1)
## [1] 1 3 1 2
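
A consequence worth illustrating: when the levels of a factor are themselves numbers, as.numeric() returns the internal codes of the levels and not the original values. A minimal sketch (the object f_num is illustrative), converting to character first to recover the values:

```r
f_num <- factor(c("10", "20", "10", "30"))

as.numeric(f_num) # internal level codes, not the original values
## [1] 1 2 1 3

as.numeric(as.character(f_num)) # original values
## [1] 10 20 10 30
```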

And a numeric vector can be transformed into a factor with the as.factor() or factor() function:

num <- 1:4
fac <- as.factor(num)
fac
## [1] 1 2 3 4
## Levels: 1 2 3 4
fac2 <- factor(num)
fac2
## [1] 1 2 3 4
## Levels: 1 2 3 4

The advantage of factor() is that it is possible to specify a name for each level:

fac2 <- factor(num,
  labels = c("bad", "neutral", "good", "very good")
)
fac2
## [1] bad       neutral   good      very good
## Levels: bad neutral good very good

Lists

A list is a vector whose elements can be of different natures: a vector, a list, a factor, numeric or character, etc.

Creating lists

The function list() allows you to create lists:

tahiti <- list(
  plane = c("Airbus", "Boeing"),
  departure = c("Brussels", "Milan", "Paris"),
  duration = c(15, 11, 14)
)
tahiti
## $plane
## [1] "Airbus" "Boeing"
## 
## $departure
## [1] "Brussels" "Milan"    "Paris"   
## 
## $duration
## [1] 15 11 14

Handling

There are several ways to extract elements from a list:

tahiti$departure
## [1] "Brussels" "Milan"    "Paris"
# or
tahiti$de
## [1] "Brussels" "Milan"    "Paris"
# or
tahiti[[2]]
## [1] "Brussels" "Milan"    "Paris"
# or
tahiti[["departure"]]
## [1] "Brussels" "Milan"    "Paris"
tahiti[[2]][c(1, 2)]
## [1] "Brussels" "Milan"
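
The examples above use double square brackets. As a small sketch of why (the list trip is illustrative), single brackets return a list containing the element, while double brackets return the element itself, which is what most functions expect:

```r
trip <- list(
  plane = c("Airbus", "Boeing"),
  duration = c(15, 11, 14)
)

class(trip["duration"]) # single brackets return a (sub)list
## [1] "list"

class(trip[["duration"]]) # double brackets return the element itself
## [1] "numeric"

mean(trip[["duration"]])
## [1] 13.33333
```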

To transform a list into a vector:

v <- unlist(tahiti)
v
##     plane1     plane2 departure1 departure2 departure3  duration1  duration2 
##   "Airbus"   "Boeing" "Brussels"    "Milan"    "Paris"       "15"       "11" 
##  duration3 
##       "14"
is.vector(v)
## [1] TRUE

Getting the details on an object

attributes() gives the names of the elements (it can be used on every R object):

attributes(tahiti)
## $names
## [1] "plane"     "departure" "duration"

str() gives a short description about the elements (it can also be used on every R object):

str(tahiti)
## List of 3
##  $ plane    : chr [1:2] "Airbus" "Boeing"
##  $ departure: chr [1:3] "Brussels" "Milan" "Paris"
##  $ duration : num [1:3] 15 11 14

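Data frames

Every imported file in R is a data frame (at least if you do not use a package to import your data in R). A data frame is a mix of a list and a matrix: it has the shape of a matrix but the columns can have different classes.

Remember that the gold standard for a data frame is that:

  • columns represent variables
  • lines correspond to observations and
  • each value must have its own cell

Structure of a data frame. Source: R for Data Science by Hadley Wickham & Garrett Grolemund

In this article, we use the data frame cars to illustrate the main data manipulation techniques. Note that the data frame is available by default in R (so you do not need to import it) and I use the generic name dat as the name of the data frame throughout the article (see here why I always use a generic name instead of more specific names).

Below the entire data frame: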

dat <- cars # rename the cars data frame with a generic name
dat # display the entire data frame
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17
## 11    11   28
## 12    12   14
## 13    12   20
## 14    12   24
## 15    12   28
## 16    13   26
## 17    13   34
## 18    13   34
## 19    13   46
## 20    14   26
## 21    14   36
## 22    14   60
## 23    14   80
## 24    15   20
## 25    15   26
## 26    15   54
## 27    16   32
## 28    16   40
## 29    17   32
## 30    17   40
## 31    17   50
## 32    18   42
## 33    18   56
## 34    18   76
## 35    18   84
## 36    19   36
## 37    19   46
## 38    19   68
## 39    20   32
## 40    20   48
## 41    20   52
## 42    20   56
## 43    20   64
## 44    22   66
## 45    23   54
## 46    24   70
## 47    24   92
## 48    24   93
## 49    24  120
## 50    25   85

This data frame has 50 observations with 2 variables (speed and distance).

You can check the number of observations and variables with nrow() and ncol() respectively, or both at the same time with dim():

nrow(dat) # number of rows/observations
## [1] 50
ncol(dat) # number of columns/variables
## [1] 2
dim(dat) # dimension: number of rows and number of columns
## [1] 50  2

Row and column names

Before manipulating a data frame, it is interesting to know the row and column names:

dimnames(dat)
## [[1]]
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
## [31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
## [46] "46" "47" "48" "49" "50"
## 
## [[2]]
## [1] "speed" "dist"

To know only the column names:

names(dat)
## [1] "speed" "dist"
# or
colnames(dat)
## [1] "speed" "dist"

And to know only the row names:

rownames(dat)
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
## [31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
## [46] "46" "47" "48" "49" "50"

Subset a data frame

First or last observations

  • To keep only the first 10 observations:
head(dat, n = 10)
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17
  • To keep only the last 5 observations:
tail(dat, n = 5)
##    speed dist
## 46    24   70
## 47    24   92
## 48    24   93
## 49    24  120
## 50    25   85

Random sample of observations

  • To draw a sample of 4 observations without replacement:
library(dplyr)
sample_n(dat, 4, replace = FALSE)
##   speed dist
## 1    14   26
## 2    18   76
## 3    19   46
## 4    16   32

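Based on row or column numbers

If you know what observation(s) or column(s) you want to keep, you can use the row or column number(s) to subset your data frame. We illustrate this with several examples:

  • keep all the variables for the \(3^{rd}\) observation:
dat[3, ]
  • keep the \(2^{nd}\) variable for all observations:
dat[, 2]
  • you can mix the two above methods to keep only the \(2^{nd}\) variable of the \(3^{rd}\) observation:
dat[3, 2]
## [1] 4
  • keep several observations; for example observations \(1\) to \(5\), the \(10^{th}\) and the \(15^{th}\) observation for all variables: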
dat[c(1:5, 10, 15), ] # do not forget c()
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 10    11   17
## 15    12   28
  • remove observations 5 to 45:
dat[-c(5:45), ]
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 46    24   70
## 47    24   92
## 48    24   93
## 49    24  120
## 50    25   85
  • tip: to keep only the last observation, use nrow() instead of the row number:
dat[nrow(dat), ] # nrow() gives the number of rows
##    speed dist
## 50    25   85

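This way, no matter the number of observations, you will always select the last one. This technique of using a piece of code instead of a specific value is a way to avoid "hard coding". Hard coding is generally not recommended (unless you want to specify a parameter that you are sure will never change) because if your data frame changes, you will need to manually edit your code.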

As you probably figured out by now, you can select observations and/or variables of a dataset by running dataset_name[row_number, column_number]. When the row (column) number is left empty, the entire row (column) is selected.
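
One detail of this bracket notation is worth a small sketch (the object d is illustrative): when a single column is selected, R simplifies the result to a vector by default. Adding drop = FALSE keeps a data frame with one column, which can matter if the next step of your code expects a data frame.

```r
d <- cars # built-in data frame, as above

class(d[, 2]) # a single column is simplified to a vector
## [1] "numeric"

class(d[, 2, drop = FALSE]) # drop = FALSE keeps a data frame
## [1] "data.frame"

dim(d[, 2, drop = FALSE])
## [1] 50  1
```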

Note that all examples presented above also work for matrices:

mat <- matrix(c(-1, 2, 0, 3), ncol = 2, nrow = 2)
mat
##      [,1] [,2]
## [1,]   -1    0
## [2,]    2    3
mat[1, 2]
## [1] 0

Based on variable names

To select one variable of the dataset based on its name rather than on its column number, use dataset_name$variable_name:

dat$speed
##  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
## [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25

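Accessing variables inside a data frame with this second method is strongly recommended compared to the first if you intend to modify the structure of your database. Indeed, if a column is added or removed in a data frame, the numbering will change. Therefore, variables are generally referred to by their name rather than by their position (column number). In addition, it is easier to understand and interpret code in which the name of the variable is written (another reason to give variables a concise but clear name). There is only one reason why I would still use the column number: if the variable names are expected to change while the structure of the data frame does not change.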

To select variables, it is also possible to use the select() command from the powerful dplyr package (for compactness only the first 6 observations are displayed thanks to the head() command):

head(select(dat, speed))
##   speed
## 1     4
## 2     4
## 3     7
## 4     7
## 5     8
## 6     9

This is equivalent to removing the distance variable:

head(select(dat, -dist))
##   speed
## 1     4
## 2     4
## 3     7
## 4     7
## 5     8
## 6     9

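Based on one or multiple criteria

Instead of subsetting a data frame based on row/column numbers or variable names, you can also subset it based on one or multiple criteria:

  • keep only observations with speed larger than 20. The first argument refers to the name of the data frame, while the second argument refers to the subset criteria: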
subset(dat, dat$speed > 20)
##    speed dist
## 44    22   66
## 45    23   54
## 46    24   70
## 47    24   92
## 48    24   93
## 49    24  120
## 50    25   85
  • keep only observations with distance smaller than or equal to 50 and speed equal to 10. Note the == (and not =) for the equal criteria:
subset(dat, dat$dist <= 50 & dat$speed == 10)
##   speed dist
## 7    10   18
## 8    10   26
## 9    10   34
  • use | to keep only observations with distance smaller than 20 or speed equal to 10:
subset(dat, dat$dist < 20 | dat$speed == 10)
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17
## 12    12   14
  • to filter out some observations, use !=. For instance, to keep observations with speed not equal to 24 and distance not equal to 120 (for compactness only the last 6 observations are displayed thanks to the tail() command):
tail(subset(dat, dat$speed != 24 & dat$dist != 120))
##    speed dist
## 41    20   52
## 42    20   56
## 43    20   64
## 44    22   66
## 45    23   54
## 50    25   85

Note that it is also possible to subset a data frame with split():

split(dat, dat$factor_variable)

The above code will split your data frame into a list of several data frames, one for each level of the factor variable.
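
Since factor_variable above is only a placeholder, here is a runnable sketch on the cars data frame. The grouping variable speed_group and its cut-off of 15 are assumptions made for the illustration only:

```r
d <- cars

# hypothetical grouping variable: "fast" if speed > 15, "slow" otherwise
d$speed_group <- factor(ifelse(d$speed > 15, "fast", "slow"))

groups <- split(d, d$speed_group)

names(groups) # one element per level of the factor
## [1] "fast" "slow"

sapply(groups, nrow) # number of observations in each piece
## fast slow 
##   24   26
```

Each piece is itself a data frame and can be accessed with groups$fast or groups[["slow"]].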

Create a new variable

Often a data frame can be enhanced by creating new variables based on other variables from the initial data frame, or simply by adding a new variable manually.

In this example, we create two new variables; one being the speed times the distance (which we call speed_dist) and the other being a categorization of the speed (which we call speed_cat). We then display the first 6 observations of this new data frame with the 4 variables:

# create new variable speed_dist
dat$speed_dist <- dat$speed * dat$dist

# create new variable speed_cat
# with ifelse(): if dat$speed > 7, then speed_cat is "high speed", otherwise it is "low speed"
dat$speed_cat <- factor(ifelse(dat$speed > 7,
  "high speed", "low speed"
))

# display first 6 observations
head(dat) # 6 is the default in head()
##   speed dist speed_dist  speed_cat
## 1     4    2          8  low speed
## 2     4   10         40  low speed
## 3     7    4         28  low speed
## 4     7   22        154  low speed
## 5     8   16        128 high speed
## 6     9   10         90 high speed

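Note that in programming, a character string is generally surrounded by quotes (e.g., "character string") and R is not an exception.

Transform a continuous variable into a categorical variable

To transform a continuous variable into a categorical variable (also known as qualitative variable):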

dat$speed_quali <- cut(dat$speed,
  breaks = c(0, 12, 15, 19, 26), # cut points
  right = FALSE # closed on the left, open on the right
)

dat[c(1:2, 23:24, 49:50), ] # display some observations
##    speed dist speed_dist  speed_cat speed_quali
## 1      4    2          8  low speed      [0,12)
## 2      4   10         40  low speed      [0,12)
## 23    14   80       1120 high speed     [12,15)
## 24    15   20        300 high speed     [15,19)
## 49    24  120       2880 high speed     [19,26)
## 50    25   85       2125 high speed     [19,26)

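This transformation is often done on age, when the age (a continuous variable) is transformed into a qualitative variable representing different age groups.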
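
As a sketch of this age example (the ages, cut points and group names below are hypothetical), the labels argument of cut() lets you name the groups instead of displaying the intervals:

```r
age <- c(5, 17, 18, 42, 70) # hypothetical ages

age_group <- cut(age,
  breaks = c(0, 18, 65, Inf), # cut points
  labels = c("minor", "adult", "senior"),
  right = FALSE # intervals are [0,18), [18,65) and [65,Inf)
)

age_group
## [1] minor  minor  adult  adult  senior
## Levels: minor adult senior
```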

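Sum and mean in rows

In surveys with Likert scales (used in psychology, among others), it is often the case that we need to compute a score for each respondent based on multiple questions. The score is usually the mean or the sum of all the questions of interest.

This can be done with rowMeans() and rowSums(). For instance, let's compute the mean and the sum of the variables speed, dist and speed_dist (variables must be numeric of course as a sum and a mean cannot be computed on qualitative variables!) for each row and store them under the variables mean_score and total_score: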

dat$mean_score <- rowMeans(dat[, 1:3]) # variables speed, dist and speed_dist correspond to variables 1 to 3
dat$total_score <- rowSums(dat[, 1:3])

head(dat)
##   speed dist speed_dist  speed_cat speed_quali mean_score total_score
## 1     4    2          8  low speed      [0,12)   4.666667          14
## 2     4   10         40  low speed      [0,12)  18.000000          54
## 3     7    4         28  low speed      [0,12)  13.000000          39
## 4     7   22        154  low speed      [0,12)  61.000000         183
## 5     8   16        128 high speed      [0,12)  50.666667         152
## 6     9   10         90 high speed      [0,12)  36.333333         109

Sum and mean in columns

It is also possible to compute the mean and sum by column with colMeans() and colSums():

colMeans(dat[, 1:3])
##      speed       dist speed_dist 
##      15.40      42.98     769.64
colSums(dat[, 1:3])
##      speed       dist speed_dist 
##        770       2149      38482

This is equivalent to:

mean(dat$speed)
## [1] 15.4
sum(dat$speed)
## [1] 770

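but it allows you to do it for several variables at a time.

Categorical variables and labels management

For categorical variables, it is a good practice to use the factor format and to name the different levels of the variables.

  • for this example, let's create another new variable called dist_cat based on the distance and then change its format from numeric to factor (while also specifying the labels of the levels):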
# create new variable dist_cat
dat$dist_cat <- ifelse(dat$dist < 15,
  1, 2
)

# change from numeric to factor and specify the labels
dat$dist_cat <- factor(dat$dist_cat,
  levels = c(1, 2),
  labels = c("small distance", "big distance") # follow the order of the levels
)

head(dat)
##   speed dist speed_dist  speed_cat speed_quali mean_score total_score
## 1     4    2          8  low speed      [0,12)   4.666667          14
## 2     4   10         40  low speed      [0,12)  18.000000          54
## 3     7    4         28  low speed      [0,12)  13.000000          39
## 4     7   22        154  low speed      [0,12)  61.000000         183
## 5     8   16        128 high speed      [0,12)  50.666667         152
## 6     9   10         90 high speed      [0,12)  36.333333         109
##         dist_cat
## 1 small distance
## 2 small distance
## 3 small distance
## 4   big distance
## 5   big distance
## 6 small distance
  • to check the format of a variable:
class(dat$dist_cat)
## [1] "factor"
# or
str(dat$dist_cat)
##  Factor w/ 2 levels "small distance",..: 1 1 1 2 2 1 2 2 2 2 ...

This will be sufficient if you need to format only a limited number of variables. However, if you need to do it for a large number of categorical variables, it quickly becomes time consuming to write the same code many times. As you can imagine, it is possible to format many variables without having to write the entire code for each variable one by one by using the within() command:

dat <- within(dat, {
  speed_cat <- factor(speed_cat, labels = c(
    "high speed",
    "low speed"
  ))
  dist_cat <- factor(dist_cat, labels = c(
    "small distance",
    "big distance"
  ))
})

head(dat)
##   speed dist speed_dist  speed_cat speed_quali mean_score total_score
## 1     4    2          8  low speed      [0,12)   4.666667          14
## 2     4   10         40  low speed      [0,12)  18.000000          54
## 3     7    4         28  low speed      [0,12)  13.000000          39
## 4     7   22        154  low speed      [0,12)  61.000000         183
## 5     8   16        128 high speed      [0,12)  50.666667         152
## 6     9   10         90 high speed      [0,12)  36.333333         109
##         dist_cat
## 1 small distance
## 2 small distance
## 3 small distance
## 4   big distance
## 5   big distance
## 6 small distance
str(dat)
## 'data.frame':    50 obs. of  8 variables:
##  $ speed      : num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist       : num  2 10 4 22 16 10 18 26 34 17 ...
##  $ speed_dist : num  8 40 28 154 128 90 180 260 340 187 ...
##  $ speed_cat  : Factor w/ 2 levels "high speed","low speed": 2 2 2 2 1 1 1 1 1 1 ...
##  $ speed_quali: Factor w/ 4 levels "[0,12)","[12,15)",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ mean_score : num  4.67 18 13 61 50.67 ...
##  $ total_score: num  14 54 39 183 152 109 208 296 384 215 ...
##  $ dist_cat   : Factor w/ 2 levels "small distance",..: 1 1 1 2 2 1 2 2 2 2 ...

Alternatively, if you want to transform several numeric variables into categorical variables without changing the labels, it is best to use the transform() function. We illustrate this function with the mpg data frame from the {ggplot2} package:

library(ggplot2)
mpg <- transform(mpg,
  cyl = factor(cyl),
  drv = factor(drv),
  fl = factor(fl),
  class = factor(class)
)

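Recode categorical variables

If you are not happy with the current labels, you can recode the labels of a categorical variable. In this example, we change the labels as follows:

  • "small distance" becomes "short distance"
  • "big distance" becomes "large distance"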
dat$dist_cat <- recode(dat$dist_cat,
  "small distance" = "short distance",
  "big distance" = "large distance"
)

head(dat)
##   speed dist speed_dist  speed_cat speed_quali mean_score total_score
## 1     4    2          8  low speed      [0,12)   4.666667          14
## 2     4   10         40  low speed      [0,12)  18.000000          54
## 3     7    4         28  low speed      [0,12)  13.000000          39
## 4     7   22        154  low speed      [0,12)  61.000000         183
## 5     8   16        128 high speed      [0,12)  50.666667         152
## 6     9   10         90 high speed      [0,12)  36.333333         109
##         dist_cat
## 1 short distance
## 2 short distance
## 3 short distance
## 4 large distance
## 5 large distance
## 6 short distance

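Change reference level

For some analyses, you might want to change the order of the levels. For example, if you are analyzing data about a control group and a treatment group, you may want to set the control group as the reference group. By default, levels are ordered alphabetically, or by their numeric value if the variable was changed from numeric to factor.

  • to check the current order of the levels (the first level being the reference):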
levels(dat$dist_cat)
## [1] "short distance" "large distance"

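In this case, "short distance" being the first level, it is the reference level. It is the first level because it was initially set with a value equal to 1 when creating the variable.

  • to change the reference level: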
dat$dist_cat <- relevel(dat$dist_cat, ref = "large distance")

levels(dat$dist_cat)
## [1] "large distance" "short distance"

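"Large distance" is now the first and thus the reference level.

Rename variable names

To rename variable names as follows:

  • dist \(\rightarrow\) distance
  • speed_dist \(\rightarrow\) speed_distance
  • dist_cat \(\rightarrow\) distance_cat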

use the rename() command from the dplyr package:

dat <- rename(dat,
  distance = dist,
  speed_distance = speed_dist,
  distance_cat = dist_cat
)

names(dat) # display variable names
## [1] "speed"          "distance"       "speed_distance" "speed_cat"     
## [5] "speed_quali"    "mean_score"     "total_score"    "distance_cat"

Create a data frame manually

Although most analyses are performed on an imported data frame, it is also possible to create a data frame directly in R:

# Create the data frame named dat with 2 variables
dat <- data.frame(
  "variable1" = c(6, 12, NA, 3), # presence of 1 missing value (NA)
  "variable2" = c(3, 7, 9, 1)
)

# Print the data frame
dat
##   variable1 variable2
## 1         6         3
## 2        12         7
## 3        NA         9
## 4         3         1

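Merging two data frames

By default, the merge is done on the common variables (variables that have the same name). However, if they do not have the same name, it is still possible to merge the two data frames by specifying their names: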

dat1 <- data.frame(
  person = c(1:4),
  treatment = c("T1", "T2")
)

dat1
##   person treatment
## 1      1        T1
## 2      2        T2
## 3      3        T1
## 4      4        T2
dat2 <- data.frame(
  patient = c(1:4),
  age = c(56, 23, 32, 19),
  gender = c("M", "F", "F", "M")
)

dat2
##   patient age gender
## 1       1  56      M
## 2       2  23      F
## 3       3  32      F
## 4       4  19      M

We want to merge the two data frames by the subject number, but this number is referred to as person in the first data frame and patient in the second data frame, so we need to indicate it:

merge(
  x = dat1, y = dat2,
  by.x = "person", by.y = "patient",
  all = TRUE
)
##   person treatment age gender
## 1      1        T1  56      M
## 2      2        T2  23      F
## 3      3        T1  32      F
## 4      4        T2  19      M
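
In the example above, every subject appears in both data frames, so the effect of all = TRUE is not visible. The following sketch uses two small hypothetical data frames (left and right, sharing an id variable) in which some subjects are missing from one side:

```r
left <- data.frame(id = c(1, 2, 3), treatment = c("T1", "T2", "T1"))
right <- data.frame(id = c(2, 3, 4), age = c(23, 32, 19))

merge(left, right, by = "id") # default: only ids present in both data frames
##   id treatment age
## 1  2        T2  23
## 2  3        T1  32

merge(left, right, by = "id", all = TRUE) # keep all ids, fill the gaps with NA
##   id treatment age
## 1  1        T1  NA
## 2  2        T2  23
## 3  3        T1  32
## 4  4      <NA>  19
```

The arguments all.x = TRUE and all.y = TRUE keep all rows of only the first or only the second data frame, respectively.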

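Add new observations from another data frame

In order to add new observations from another data frame, the two data frames need to have the same column names (but they can be in a different order):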

dat1
##   person treatment
## 1      1        T1
## 2      2        T2
## 3      3        T1
## 4      4        T2
dat3 <- data.frame(
  person = 5:8,
  treatment = c("T3")
)

dat3
##   person treatment
## 1      5        T3
## 2      6        T3
## 3      7        T3
## 4      8        T3
rbind(dat1, dat3) # r stands for row, so we bind data frames by row
##   person treatment
## 1      1        T1
## 2      2        T2
## 3      3        T1
## 4      4        T2
## 5      5        T3
## 6      6        T3
## 7      7        T3
## 8      8        T3

As you can see, data for persons 5 to 8 have been added at the end of the data frame dat1 (because dat1 comes before dat3 in the rbind() function).

Add new variables from another data frame

It is also possible to add new variables to a data frame with the cbind() function. Unlike rbind(), column names do not have to be the same since they are added next to each other:

dat2
##   patient age gender
## 1       1  56      M
## 2       2  23      F
## 3       3  32      F
## 4       4  19      M
dat3
##   person treatment
## 1      5        T3
## 2      6        T3
## 3      7        T3
## 4      8        T3
cbind(dat2, dat3) # c stands for column, so we bind data frames by column
##   patient age gender person treatment
## 1       1  56      M      5        T3
## 2       2  23      F      6        T3
## 3       3  32      F      7        T3
## 4       4  19      M      8        T3

If you want to add only a specific variable from another data frame:

dat_cbind <- cbind(dat2, dat3$treatment)

dat_cbind
##   patient age gender dat3$treatment
## 1       1  56      M             T3
## 2       2  23      F             T3
## 3       3  32      F             T3
## 4       4  19      M             T3
names(dat_cbind)[4] <- "treatment"

dat_cbind
##   patient age gender treatment
## 1       1  56      M        T3
## 2       2  23      F        T3
## 3       3  32      F        T3
## 4       4  19      M        T3

or more simply with the data.frame() function:

data.frame(dat2,
  treatment = dat3$treatment
)
##   patient age gender treatment
## 1       1  56      M        T3
## 2       2  23      F        T3
## 3       3  32      F        T3
## 4       4  19      M        T3

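Missing values

Missing values (represented by NA in R, for "Not Available") are often problematic for many analyses because many computations including a missing value have a missing value as a result.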

For instance, the mean of a series or variable with at least one NA will give a NA as a result. The data frame dat created in the previous section is used for this example:

dat
##   variable1 variable2
## 1         6         3
## 2        12         7
## 3        NA         9
## 4         3         1
mean(dat$variable1)
## [1] NA

The na.omit() function avoids the NA result, acting as if there were no missing value:

mean(na.omit(dat$variable1))
## [1] 7

Moreover, most basic functions include an argument to deal with missing values:

mean(dat$variable1, na.rm = TRUE)
## [1] 7

is.na() indicates if an element is a missing value or not:

is.na(dat)
##      variable1 variable2
## [1,]     FALSE     FALSE
## [2,]     FALSE     FALSE
## [3,]      TRUE     FALSE
## [4,]     FALSE     FALSE

Note that "NA" as a string is not considered as a missing value:

y <- c("NA", "2")

is.na(y)
## [1] FALSE FALSE

To check whether there is at least one missing value in a vector or a data frame:

anyNA(dat$variable2) # check for NA in variable2
## [1] FALSE
anyNA(dat) # check for NA in the whole data frame
## [1] TRUE
# or
any(is.na(dat)) # check for NA in the whole data frame
## [1] TRUE
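
Beyond knowing whether there is a missing value, it is often useful to know how many there are and where. A minimal sketch, which rebuilds the same small data frame under the illustrative name na_demo:

```r
na_demo <- data.frame(
  variable1 = c(6, 12, NA, 3),
  variable2 = c(3, 7, 9, 1)
)

colSums(is.na(na_demo)) # number of missing values per variable
## variable1 variable2 
##         1         0

which(is.na(na_demo$variable1)) # position of the missing value
## [1] 3
```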

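Nonetheless, data frames with NAs are still problematic for some types of analysis. Several alternatives exist to remove or impute missing values.

Remove NAs

A simple solution is to remove all observations (i.e., rows) containing at least one missing value. This is done by keeping only the complete cases: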

dat_complete <- dat[complete.cases(dat), ]
dat_complete
##   variable1 variable2
## 1         6         3
## 2        12         7
## 4         3         1

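Be careful before removing observations with missing values, especially if missing values are not "missing at random". The fact that it is possible (and easy) to remove them does not mean that you should do it in all cases. This is, however, beyond the scope of the present article.

Impute NAs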

Instead of removing observations with at least one NA, it is possible to impute them, that is, replace them by some values such as the median or the mode of the variable. This can be done easily with the command impute() from the package imputeMissings:

library(imputeMissings)

dat_imputed <- impute(dat) # default method is median/mode
dat_imputed
##   variable1 variable2
## 1         6         3
## 2        12         7
## 3         6         9
## 4         3         1

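When the median/mode method is used (the default), character vectors and factors are imputed with the mode. Numeric and integer vectors are imputed with the median. Again, use imputations carefully. Other packages offer more advanced imputation techniques. However, we keep it simple and straightforward for this article as advanced imputations are beyond the scope of introductory data manipulation in R.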
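
If you prefer not to install a package, a median imputation of a numeric variable can also be sketched in base R. This is only an illustration of the idea (the names imp_demo and med are illustrative), not the internal code of the impute() function:

```r
imp_demo <- data.frame(
  variable1 = c(6, 12, NA, 3),
  variable2 = c(3, 7, 9, 1)
)

# median of the observed (non-missing) values of variable1
med <- median(imp_demo$variable1, na.rm = TRUE)

# replace the missing values of variable1 by this median
imp_demo$variable1[is.na(imp_demo$variable1)] <- med

imp_demo
##   variable1 variable2
## 1         6         3
## 2        12         7
## 3         6         9
## 4         3         1
```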

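Scale

Scaling (also referred to as standardizing) a variable is often used before a Principal Component Analysis (PCA)[1] when variables of a data frame have different units. Remember that scaling a variable means computing the mean and the standard deviation of that variable. Then each value (so each row) of that variable is "scaled" by subtracting the mean and dividing by the standard deviation of that variable. Formally:

\[z = \frac{x - \bar{x}}{s}\]

where \(\bar{x}\) and \(s\) are the mean and the standard deviation of the variable, respectively.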

To scale one or more variables in R use scale():

dat_scaled <- scale(dat_imputed)

head(dat_scaled)
##       variable1  variable2
## [1,] -0.1986799 -0.5477226
## [2,]  1.3907590  0.5477226
## [3,] -0.1986799  1.0954451
## [4,] -0.9933993 -1.0954451
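
To check the formula by hand, the following sketch applies it to the values of variable1 after imputation (the objects values and z_manual are illustrative) and compares the result with scale():

```r
values <- c(6, 12, 6, 3) # variable1 after imputation

# subtract the mean and divide by the standard deviation
z_manual <- (values - mean(values)) / sd(values)

round(z_manual, 4)
## [1] -0.1987  1.3908 -0.1987 -0.9934

all.equal(z_manual, as.numeric(scale(values)))
## [1] TRUE
```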

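Dates and times

Dates

In R the default date format follows the rules of the ISO 8601 international standard, which expresses a day as "2001-02-13" (yyyy-mm-dd).[2]

Dates can be defined by a character string or a number. For example, October 1st, 2016: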

as.Date("01/10/16", format = "%d/%m/%y")
## [1] "2016-10-01"
as.Date(274, origin = "2016-01-01") # there are 274 days between the origin and October 1st, 2016
## [1] "2016-10-01"
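
Once stored as dates, these objects can be used in computations. A short sketch with two illustrative objects, start and end:

```r
start <- as.Date("2016-01-01")
end <- as.Date("2016-10-01")

as.numeric(end - start) # number of days between the two dates
## [1] 274

format(end, "%d/%m/%Y") # display the date in another format
## [1] "01/10/2016"

seq(start, by = "month", length.out = 3) # sequence of dates
## [1] "2016-01-01" "2016-02-01" "2016-03-01"
```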

Times

Example of date and time vector:

dates <- c("02/27/92", "02/27/99", "01/14/92")
times <- c("23:03:20", "22:29:56", "01:03:30")

x <- paste(dates, times)
x
## [1] "02/27/92 23:03:20" "02/27/99 22:29:56" "01/14/92 01:03:30"
strptime(x,
  format = "%m/%d/%y %H:%M:%S"
)
## [1] "1992-02-27 23:03:20 CET" "1999-02-27 22:29:56 CET"
## [3] "1992-01-14 01:03:30 CET"

Find more information on how to express a date and time format with help(strptime).

Extraction from dates

We can extract:

  • weekdays
  • months
  • quarters
y <- strptime(x,
  format = "%m/%d/%y %H:%M:%S"
)

y
## [1] "1992-02-27 23:03:20 CET" "1999-02-27 22:29:56 CET"
## [3] "1992-01-14 01:03:30 CET"
weekdays(y, abbreviate = FALSE)
## [1] "Thursday" "Saturday" "Tuesday"
months(y, abbreviate = FALSE)
## [1] "February" "February" "January"
quarters(y, abbreviate = FALSE)
## [1] "Q1" "Q1" "Q1"
format(y, "%Y") # 4-digit year
## [1] "1992" "1999" "1992"
format(y, "%y") # 2-digit year
## [1] "92" "99" "92"

Exporting and saving

If a copy-paste is not sufficient, you can save an object in R format with save():

save(dat, file = "dat.Rdata")
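
An object saved this way can be restored later with load(). The sketch below uses a temporary file and a small illustrative data frame (obj and path are not part of the original example) to show the round trip:

```r
obj <- data.frame(variable1 = c(6, 12), variable2 = c(3, 7))
path <- tempfile(fileext = ".Rdata") # temporary file, for illustration only

save(obj, file = path)
rm(obj) # remove the object from the workspace

load(path) # restores the object under its original name
obj
##   variable1 variable2
## 1         6         3
## 2        12         7
```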

or using write.table(), write.csv() or write.xlsx():

# in a text format
write.table(dat, "dat.txt", row.names = FALSE, sep = "\t", quote = FALSE)

# in csv
write.csv(dat, file = "dat.csv", row.names = FALSE, quote = FALSE)

# in excel
# install.packages("openxlsx")
library(openxlsx)
write.xlsx(dat, file = "dat.xlsx")

If you need to send every result into a file instead of the console:

sink("filename")

(Don’t forget to stop it with sink().)

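Finding help

You can always find some help about:

  • a function: ?function or help(function)
  • a package: help(package = packagename)
  • a concept: help.search("concept") or apropos("concept")

Otherwise, Google is your best friend!

Thanks for reading. I hope this article helped you to manipulate your data in RStudio. Now that you know how to import a data frame into R and how to manipulate it, the next step would probably be to learn how to perform descriptive statistics in R. If you are looking for more advanced statistical analyses using R, see all articles about R.

As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion.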


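  1. Principal Component Analysis (PCA) is a useful technique for exploratory data analysis, allowing a better visualization of the variation present in a data frame with a large number of variables. When there are many variables, the data cannot easily be illustrated in their raw format. To counter this, the PCA takes a data frame with many variables and simplifies it by transforming the original variables into a smaller number of "principal components". The first dimension contains the most variance in the data frame and so on, and the dimensions are uncorrelated. Note that PCA is done on quantitative variables.↩︎

  2. For your information, note that the date format is not the same in every software! Excel, for instance, uses a different format.↩︎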


