Feng Li
School of Statistics and Mathematics
Central University of Finance and Economics
seq(1,100)
seq(100,1)
rep(10,5)
rep(c(1, 2, 3), 5)
Numerical vectors
Logical vectors
Characters
Length of a vector
Vector calculations
sqrt(), log()
sin(), cos(), tan()
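As a minimal sketch of the vector types and calculations listed above (the variable names and values are made up for illustration):

```r
# Hypothetical example vectors; names and values are illustrative only.
x <- c(1.5, 2, 4)              # numeric vector
b <- c(TRUE, FALSE, TRUE)      # logical vector
s <- c("a", "b", "c")          # character vector

n <- length(x)                 # number of elements in x

# Arithmetic and math functions are applied element by element.
y <- x * 2 + 1                 # 4 5 9
r <- sqrt(c(1, 4, 9))          # 1 2 3
z <- sin(0) + cos(0)           # 1

# A logical vector can be used to pick elements of another vector.
picked <- x[b]                 # 1.5 4.0
```

Operations on vectors of equal length work element by element, so no explicit loop is needed here.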
Matrix multiplications: %*%
Matrix inverse: solve()
Transpose of a matrix: t()
Element-wise operation with a matrix.
Combine two or more matrices: rbind(), cbind()
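The matrix operations above can be sketched as follows. The matrices A and B are small hypothetical examples chosen so that A is invertible:

```r
# Hypothetical 2 x 2 matrices for illustration.
A <- matrix(c(2, 0, 1, 3), nrow = 2)   # filled column-wise: rows are (2, 1) and (0, 3)
B <- diag(2)                           # 2 x 2 identity matrix

AB   <- A %*% B        # matrix product; equals A because B is the identity
Ainv <- solve(A)       # inverse of A
At   <- t(A)           # transpose of A
AA   <- A * A          # element-wise product, not a matrix product

rows <- rbind(A, B)    # stack by rows: 4 x 2
cols <- cbind(A, B)    # bind by columns: 2 x 4
```

Note the difference between `A %*% A` (matrix product) and `A * A` (each element squared).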
An array is a higher-dimensional generalization of a matrix.
A matrix is a special case of an array when the dimension is two.
A vector is a special case of an array when there is no dimension (in R the dimension is usually dropped in this situation).
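A short sketch of how arrays, matrices and vectors relate through the dim attribute (the object names are illustrative):

```r
a <- array(1:24, dim = c(2, 3, 4))   # 3-dimensional array
da <- dim(a)                         # 2 3 4

m <- matrix(1:6, nrow = 2)
m_is_array <- is.array(m)            # TRUE: a matrix is a two-dimensional array

v <- 1:6
dv <- dim(v)                         # NULL: a plain vector carries no dim attribute

col1      <- m[, 1]                  # selecting one column drops to a plain vector
col1_kept <- m[, 1, drop = FALSE]    # drop = FALSE keeps a 2 x 1 matrix
```

Using `drop = FALSE` is one way to avoid the dimension being dropped when a single row or column is selected.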
A list is a special data structure for cases that a matrix cannot handle.
Data lengths are not the same.
Data types are not the same.
Nested data structure within a list.
Create a list: list()
Extract elements of a list: [[]] or $
Delete an element within a list: assign NULL to that element.
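The list operations above can be sketched with a small hypothetical list (the element names and values are made up):

```r
# Elements of different types and lengths, including a nested list.
info <- list(name    = "Ann",
             scores  = c(90, 85, 77),
             address = list(city = "Beijing", zip = "100081"))

sc   <- info[["scores"]]     # extract by name with [[ ]]
nm   <- info$name            # same idea with $
city <- info$address$city    # access inside the nested list

info$scores <- NULL          # assigning NULL deletes the element
remaining <- names(info)     # "name" "address"
```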
data.frame(): tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software.
In most cases, operations on a data frame are similar to matrix operations.
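A minimal sketch of a data frame behaving partly like a matrix and partly like a list. The data are hypothetical:

```r
# Columns have the same length but different types.
df <- data.frame(id    = 1:3,
                 name  = c("a", "b", "c"),
                 score = c(90, 85, 77))

d    <- dim(df)               # rows and columns, as with a matrix
one  <- df[2, "score"]        # matrix-style indexing by row and column
col  <- df$score              # list-style access to a single column
high <- df[df$score > 80, ]   # logical row selection
lst  <- is.list(df)           # TRUE: a data frame is a list of equal-length columns
```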
See also the dplyr package.
Written by Hadley Wickham of RStudio.
Everything dplyr does could already be done with base R, but it greatly simplifies existing functionality in R.
It provides a "grammar" (in particular, verbs) for data manipulation and for operating on data frames.
The dplyr functions are very fast, as many key operations are coded in C++.
What type of data structure would you choose in the following situations?
Data are of the same length but different types.
Data are not of the same length.
Hierarchical structure of the data.
dplyr package
dplyr Grammar
Some of the key "verbs" provided by the dplyr package are
select: return a subset of the columns of a data frame, using a flexible notation
filter: extract a subset of rows from a data frame based on logical conditions
arrange: reorder rows of a data frame
rename: rename variables in a data frame
mutate: add new variables/columns or transform existing variables
summarise / summarize: generate summary statistics of different variables in the data frame, possibly within strata
%>%: the "pipe" operator is used to connect multiple verb actions together into a pipeline
dplyr Function Properties
All of the functions have a few common characteristics. In particular,
The first argument is a data frame.
The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator (just use the column names).
The return result of a function is a new data frame.
Data frames must be properly formatted and annotated for this to all be useful. In particular, the data must be tidy. In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation.
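Because every verb takes a data frame as its first argument and returns a data frame, the verbs can be chained with the pipe. The sketch below uses a small hypothetical data frame named air (not the chicago data) and shows the same computation written as nested calls and as a pipeline:

```r
library(dplyr)

# Hypothetical toy data for illustration only.
air <- data.frame(city = c("a", "a", "b", "b"),
                  temp = c(30, 35, 50, 55),
                  pm   = c(10, 20, 30, 40))

# Nested calls: read from the inside out.
res1 <- summarise(group_by(filter(air, temp > 30), city), pm_mean = mean(pm))

# The same steps with the pipe: read from top to bottom.
res2 <- air %>%
  filter(temp > 30) %>%
  group_by(city) %>%
  summarise(pm_mean = mean(pm))
```

Both forms give one row per city. The piped version avoids naming intermediate results and keeps the steps in the order they are applied.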
Installing the dplyr package
install.packages("dplyr")
After installing the package it is important that you load it into your R session with the library() function.
library(dplyr)
select()
chicago <- readRDS("data/chicago.rds")
dim(chicago)
str(chicago)
'data.frame': 6940 obs. of 8 variables:
 $ city      : chr  "chic" "chic" "chic" "chic" ...
 $ tmpd      : num  31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
 $ dptp      : num  31.5 29.9 27.4 28.6 28.9 ...
 $ date      : Date, format: "1987-01-01" "1987-01-02" ...
 $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...
 $ pm10tmean2: num  34 NA 34.2 47 NA ...
 $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...
 $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...
names(chicago)[1:3]
subset <- select(chicago, city:dptp)
head(subset)
| | city | tmpd | dptp |
---|---|---|---|
| | <chr> | <dbl> | <dbl> |
1 | chic | 31.5 | 31.500 |
2 | chic | 33.0 | 29.875 |
3 | chic | 33.0 | 27.375 |
4 | chic | 29.0 | 28.625 |
5 | chic | 32.0 | 28.875 |
6 | chic | 40.0 | 35.125 |
select(chicago, -(city:dptp))
| | date | pm25tmean2 | pm10tmean2 | o3tmean2 | no2tmean2 |
---|---|---|---|---|---|
| | <date> | <dbl> | <dbl> | <dbl> | <dbl> |
1 | 1987-01-01 | NA | 34.00000 | 4.250000 | 19.98810 |
2 | 1987-01-02 | NA | NA | 3.304348 | 23.19099 |
3 | 1987-01-03 | NA | 34.16667 | 3.333333 | 23.81548 |
4 | 1987-01-04 | NA | 47.00000 | 4.375000 | 30.43452 |
5 | 1987-01-05 | NA | NA | 4.750000 | 30.33333 |
6 | 1987-01-06 | NA | 48.00000 | 5.833333 | 25.77233 |
7 | 1987-01-07 | NA | 41.00000 | 9.291667 | 20.58171 |
8 | 1987-01-08 | NA | 36.00000 | 11.291667 | 17.03723 |
9 | 1987-01-09 | NA | 33.28571 | 4.500000 | 23.38889 |
10 | 1987-01-10 | NA | NA | 4.958333 | 19.54167 |
11 | 1987-01-11 | NA | 22.00000 | 17.541667 | 13.70139 |
12 | 1987-01-12 | NA | 26.00000 | 8.000000 | 33.02083 |
13 | 1987-01-13 | NA | 53.00000 | 4.958333 | 38.06142 |
14 | 1987-01-14 | NA | 43.00000 | 4.208333 | 32.19444 |
15 | 1987-01-15 | NA | 28.83333 | 4.458333 | 18.87131 |
16 | 1987-01-16 | NA | 19.00000 | 7.916667 | 19.46667 |
17 | 1987-01-17 | NA | NA | 5.833333 | 20.70833 |
18 | 1987-01-18 | NA | 39.00000 | 6.375000 | 21.03333 |
19 | 1987-01-19 | NA | 32.00000 | 14.875000 | 17.17409 |
20 | 1987-01-20 | NA | 38.00000 | 7.250000 | 21.61021 |
21 | 1987-01-21 | NA | 32.85714 | 8.913043 | 24.52083 |
22 | 1987-01-22 | NA | 52.00000 | 10.500000 | 16.98798 |
23 | 1987-01-23 | NA | 55.00000 | 14.625000 | 14.66250 |
24 | 1987-01-24 | NA | 38.00000 | 10.083333 | 18.69167 |
25 | 1987-01-25 | NA | NA | 6.666667 | 26.30417 |
26 | 1987-01-26 | NA | 71.00000 | 4.583333 | 32.42143 |
27 | 1987-01-27 | NA | 39.33333 | 6.000000 | 30.69306 |
28 | 1987-01-28 | NA | 47.00000 | 6.875000 | 29.12943 |
29 | 1987-01-29 | NA | 35.00000 | 2.916667 | 28.14529 |
30 | 1987-01-30 | NA | 59.00000 | 8.791667 | 19.79861 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
6911 | 2005-12-02 | NA | 19.50 | 9.156250 | 23.29167 |
6912 | 2005-12-03 | 13.34286 | 20.00 | 10.333333 | 25.19444 |
6913 | 2005-12-04 | 15.30000 | 15.50 | 13.177083 | 21.70833 |
6914 | 2005-12-05 | NA | 30.00 | 6.447917 | 28.38889 |
6915 | 2005-12-06 | 24.61667 | 33.00 | 4.701540 | 29.08333 |
6916 | 2005-12-07 | 37.80000 | 39.00 | 3.916214 | 34.30952 |
6917 | 2005-12-08 | 24.30000 | 31.00 | 5.995265 | 34.22222 |
6918 | 2005-12-09 | 25.45000 | 22.00 | 5.958333 | 31.41667 |
6919 | 2005-12-10 | 18.20000 | 30.00 | 9.135417 | 28.70833 |
6920 | 2005-12-11 | 10.60000 | 14.00 | 11.333333 | 22.55556 |
6921 | 2005-12-12 | 19.22500 | 28.75 | 5.031250 | 39.74621 |
6922 | 2005-12-13 | 26.50000 | 21.00 | 6.628623 | 29.56944 |
6923 | 2005-12-14 | 26.90000 | 16.00 | 3.802083 | 30.63384 |
6924 | 2005-12-15 | 14.40000 | 16.50 | 4.895833 | 25.43056 |
6925 | 2005-12-16 | 11.00000 | 22.00 | 11.166667 | 16.87500 |
6926 | 2005-12-17 | 13.80000 | 20.00 | 8.593750 | 20.73611 |
6927 | 2005-12-18 | 12.20000 | 17.50 | 13.552083 | 19.11111 |
6928 | 2005-12-19 | 21.15000 | 21.00 | 8.058877 | 31.79167 |
6929 | 2005-12-20 | 25.75000 | 32.00 | 3.849185 | 32.89773 |
6930 | 2005-12-21 | 37.92857 | 59.50 | 3.663949 | 34.86111 |
6931 | 2005-12-22 | 36.65000 | 42.50 | 5.385417 | 33.73026 |
6932 | 2005-12-23 | 32.90000 | 34.50 | 6.906250 | 29.08333 |
6933 | 2005-12-24 | 30.77143 | 25.20 | 1.770833 | 31.98611 |
6934 | 2005-12-25 | 6.70000 | 8.00 | 14.354167 | 13.79167 |
6935 | 2005-12-26 | 8.40000 | 8.50 | 14.041667 | 16.81944 |
6936 | 2005-12-27 | 23.56000 | 27.00 | 4.468750 | 23.50000 |
6937 | 2005-12-28 | 17.75000 | 27.50 | 3.260417 | 19.28563 |
6938 | 2005-12-29 | 7.45000 | 23.50 | 6.794837 | 19.97222 |
6939 | 2005-12-30 | 15.05714 | 19.20 | 3.034420 | 22.80556 |
6940 | 2005-12-31 | 15.00000 | 23.50 | 2.531250 | 13.25000 |
subset <- select(chicago, ends_with("2"))
str(subset)
'data.frame': 6940 obs. of 4 variables:
 $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...
 $ pm10tmean2: num  34 NA 34.2 47 NA ...
 $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...
 $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...
subset <- select(chicago, starts_with("d"))
str(subset)
'data.frame': 6940 obs. of 2 variables:
 $ dptp: num  31.5 29.9 27.4 28.6 28.9 ...
 $ date: Date, format: "1987-01-01" "1987-01-02" ...
filter()
The filter() function is used to extract subsets of rows from a data frame.
chic.f <- filter(chicago, pm25tmean2 > 30)
str(chic.f)
summary(chic.f$pm25tmean2)
'data.frame': 194 obs. of 8 variables:
 $ city      : chr  "chic" "chic" "chic" "chic" ...
 $ tmpd      : num  23 28 55 59 57 57 75 61 73 78 ...
 $ dptp      : num  21.9 25.8 51.3 53.7 52 56 65.8 59 60.3 67.1 ...
 $ date      : Date, format: "1998-01-17" "1998-01-23" ...
 $ pm25tmean2: num  38.1 34 39.4 35.4 33.3 ...
 $ pm10tmean2: num  32.5 38.7 34 28.5 35 ...
 $ o3tmean2  : num  3.18 1.75 10.79 14.3 20.66 ...
 $ no2tmean2 : num  25.3 29.4 25.3 31.4 26.8 ...
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  30.05   32.12   35.04   36.63   39.53   61.50
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
select(chic.f, date, tmpd, pm25tmean2)
date | tmpd | pm25tmean2 |
---|---|---|
<date> | <dbl> | <dbl> |
1998-08-23 | 81 | 39.60000 |
1998-09-06 | 81 | 31.50000 |
2001-07-20 | 82 | 32.30000 |
2001-08-01 | 84 | 43.70000 |
2001-08-08 | 85 | 38.83750 |
2001-08-09 | 84 | 38.20000 |
2002-06-20 | 82 | 33.00000 |
2002-06-23 | 82 | 42.50000 |
2002-07-08 | 81 | 33.10000 |
2002-07-18 | 82 | 38.85000 |
2003-06-25 | 82 | 33.90000 |
2003-07-04 | 84 | 32.90000 |
2005-06-24 | 86 | 31.85714 |
2005-06-27 | 82 | 51.53750 |
2005-06-28 | 85 | 31.20000 |
2005-07-17 | 84 | 32.70000 |
2005-08-03 | 84 | 37.90000 |
arrange()
The arrange() function is used to reorder rows of a data frame according to one of the variables/columns.
chicago <- arrange(chicago, date)
head(select(chicago, date, pm25tmean2), 3)
tail(select(chicago, date, pm25tmean2), 3)
| | date | pm25tmean2 |
---|---|---|
| | <date> | <dbl> |
1 | 1987-01-01 | NA |
2 | 1987-01-02 | NA |
3 | 1987-01-03 | NA |
| | date | pm25tmean2 |
---|---|---|
| | <date> | <dbl> |
6938 | 2005-12-29 | 7.45000 |
6939 | 2005-12-30 | 15.05714 |
6940 | 2005-12-31 | 15.00000 |
Rows can also be arranged in descending order by using the desc() operator. Looking at the first three and last three rows shows the dates in descending order.
chicago2 <- arrange(chicago, desc(date))
head(select(chicago2, date, pm25tmean2), 3)
tail(select(chicago2, date, pm25tmean2), 3)
| | date | pm25tmean2 |
---|---|---|
| | <date> | <dbl> |
1 | 2005-12-31 | 15.00000 |
2 | 2005-12-30 | 15.05714 |
3 | 2005-12-29 | 7.45000 |
| | date | pm25tmean2 |
---|---|---|
| | <date> | <dbl> |
6938 | 1987-01-03 | NA |
6939 | 1987-01-02 | NA |
6940 | 1987-01-01 | NA |
rename()
Renaming a variable in a data frame in R is surprisingly hard to do! The rename() function is designed to make this process easier.
Here are the first five variables in the chicago data frame. Now we rename the awkward variable names.
head(chicago[, 1:5], 3)
chicago3 <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)
head(chicago3[, 1:5], 3) # with new variable name
| | city | tmpd | dptp | date | pm25tmean2 |
---|---|---|---|---|---|
| | <chr> | <dbl> | <dbl> | <date> | <dbl> |
1 | chic | 31.5 | 31.500 | 1987-01-01 | NA |
2 | chic | 33.0 | 29.875 | 1987-01-02 | NA |
3 | chic | 33.0 | 27.375 | 1987-01-03 | NA |
| | city | tmpd | dewpoint | date | pm25 |
---|---|---|---|---|---|
| | <chr> | <dbl> | <dbl> | <date> | <dbl> |
1 | chic | 31.5 | 31.500 | 1987-01-01 | NA |
2 | chic | 33.0 | 29.875 | 1987-01-02 | NA |
3 | chic | 33.0 | 27.375 | 1987-01-03 | NA |
mutate()
The mutate() function exists to compute transformations of variables in a data frame.
chicago4 <- mutate(chicago3, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))
head(chicago4)
| | city | tmpd | dewpoint | date | pm25 | pm10tmean2 | o3tmean2 | no2tmean2 | pm25detrend |
---|---|---|---|---|---|---|---|---|---|
| | <chr> | <dbl> | <dbl> | <date> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
1 | chic | 31.5 | 31.500 | 1987-01-01 | NA | 34.00000 | 4.250000 | 19.98810 | NA |
2 | chic | 33.0 | 29.875 | 1987-01-02 | NA | NA | 3.304348 | 23.19099 | NA |
3 | chic | 33.0 | 27.375 | 1987-01-03 | NA | 34.16667 | 3.333333 | 23.81548 | NA |
4 | chic | 29.0 | 28.625 | 1987-01-04 | NA | 47.00000 | 4.375000 | 30.43452 | NA |
5 | chic | 32.0 | 28.875 | 1987-01-05 | NA | NA | 4.750000 | 30.33333 | NA |
6 | chic | 40.0 | 35.125 | 1987-01-06 | NA | 48.00000 | 5.833333 | 25.77233 | NA |
group_by()
The group_by() function is used to generate summary statistics from the data frame within strata defined by a variable. For example, in this air pollution dataset, you might want to know what the average annual level of PM2.5 is.
First, we can create a year variable using as.POSIXlt().
Now we can create a separate data frame that splits the original data frame by year.
chicago5 <- mutate(chicago3, year = as.POSIXlt(date)$year + 1900)
years <- group_by(chicago5, year)
summarize(years, pm25 = mean(pm25, na.rm = TRUE),
o3 = max(o3tmean2, na.rm = TRUE),
no2 = median(no2tmean2, na.rm = TRUE))
year | pm25 | o3 | no2 |
---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> |
1987 | NaN | 62.96966 | 23.49369 |
1988 | NaN | 61.67708 | 24.52296 |
1989 | NaN | 59.72727 | 26.14062 |
1990 | NaN | 52.22917 | 22.59583 |
1991 | NaN | 63.10417 | 21.38194 |
1992 | NaN | 50.82870 | 24.78921 |
1993 | NaN | 44.30093 | 25.76993 |
1994 | NaN | 52.17844 | 28.47500 |
1995 | NaN | 66.58750 | 27.26042 |
1996 | NaN | 58.39583 | 26.38715 |
1997 | NaN | 56.54167 | 25.48143 |
1998 | 18.26467 | 50.66250 | 24.58649 |
1999 | 18.49646 | 57.48864 | 24.66667 |
2000 | 16.93806 | 55.76103 | 23.46082 |
2001 | 16.92632 | 51.81984 | 25.06522 |
2002 | 15.27335 | 54.88043 | 22.73750 |
2003 | 15.23183 | 56.16608 | 24.62500 |
2004 | 14.62864 | 44.48240 | 23.39130 |
2005 | 16.18556 | 58.84126 | 22.62387 |
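The NaN entries in the pm25 column for 1987 to 1997 are what mean() returns when every value in a group is missing and na.rm = TRUE removes them all. The sketch below shows this on a hypothetical vector, together with an alternative way of extracting the year that gives the same result as the as.POSIXlt() approach used above:

```r
# Nothing is left after removing the NAs, so the mean is NaN.
m_all_na <- mean(c(NA_real_, NA_real_), na.rm = TRUE)

# Two equivalent ways to get the calendar year from a Date (illustrative dates).
d  <- as.Date(c("1987-01-01", "2005-12-31"))
y1 <- as.POSIXlt(d)$year + 1900        # $year counts years since 1900
y2 <- as.integer(format(d, "%Y"))      # format the year as text, then convert
```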
In a slightly more complicated example, we might want to know what the average levels of ozone (o3) and nitrogen dioxide (no2) are within quintiles of pm25. A slicker way to do this would be through a regression model, but we can actually do this quickly with group_by() and summarize().
First, we can create a categorical variable of pm25 divided into quintiles.
Now we can group the data frame by the pm25.quint variable.
Finally, we can compute the mean of o3 and no2 within quintiles of pm25.
qq <- quantile(chicago3$pm25, seq(0, 1, 0.2), na.rm = TRUE)
chicago6 <- mutate(chicago3, pm25.quint = cut(pm25, qq))
quint <- group_by(chicago6, pm25.quint)
summarize(quint, o3 = mean(o3tmean2, na.rm = TRUE),
no2 = mean(no2tmean2, na.rm = TRUE))
pm25.quint | o3 | no2 |
---|---|---|
<fct> | <dbl> | <dbl> |
(1.7,8.7] | 21.66401 | 17.99129 |
(8.7,12.4] | 20.38248 | 22.13004 |
(12.4,16.7] | 20.66160 | 24.35708 |
(16.7,22.6] | 19.88122 | 27.27132 |
(22.6,61.5] | 20.31775 | 29.64427 |
NA | 18.79044 | 25.77585 |
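One detail worth checking in the quintile step: by default cut() builds intervals that are open on the left, so a value exactly equal to the smallest break is not assigned to any interval. With breaks taken from quantile(), the minimum of pm25 would therefore likely land in the NA group along with the missing values. A minimal sketch on a hypothetical vector x, assuming the default quantile type:

```r
x  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)   # illustrative data
qq <- quantile(x, seq(0, 1, 0.2))        # breaks at the 0%, 20%, ..., 100% quantiles

g_default <- cut(x, qq)                         # the minimum falls outside the first interval
g_closed  <- cut(x, qq, include.lowest = TRUE)  # first interval is closed on the left

n_na_default <- sum(is.na(g_default))    # one value (the minimum) is unassigned
n_na_closed  <- sum(is.na(g_closed))     # every value is assigned
```

Passing include.lowest = TRUE is one way to keep the smallest observation in the first quintile.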
The dplyr package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of group_by() and summarize().
dplyr can work with other data frame "backends" such as SQL databases. There is an SQL interface for relational databases via the DBI package.
dplyr can be integrated with the data.table package for large fast tables.
The dplyr package is a handy way to both simplify and speed up your data frame management code. It's rare that you get such a combination at the same time!