R Built-in Data Structures

Feng Li

School of Statistics and Mathematics

Central University of Finance and Economics

feng.li@cufe.edu.cn

https://feng.li/statcomp


Generate a sequence

Sequences

Generate a sequence: seq()
Repeat a vector: rep()

In [2]:

seq(1,100)

In [3]:

seq(100,1)

In [5]:

rep(10,5)

In [6]:

rep(c(1, 2, 3), 5)
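
Both functions take further arguments that are often useful. The following is a minimal sketch using only base R; the variable names are illustrative and do not appear elsewhere in these notes.

```r
# seq() accepts a step size (by) or a target length (length.out)
evens <- seq(2, 10, by = 2)          # 2 4 6 8 10
grid  <- seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

# rep() can repeat the whole vector (times) or each element (each)
whole <- rep(c(1, 2, 3), times = 2)  # 1 2 3 1 2 3
elem  <- rep(c(1, 2, 3), each = 2)   # 1 1 2 2 3 3

print(evens)
print(grid)
print(whole)
print(elem)
```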

Vectors

Numerical vectors
Logical vectors
Characters
Length of a vector
Vector calculations
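
A short sketch of the items above, assuming nothing beyond base R. The values and variable names are made up for illustration.

```r
x     <- c(1.5, 2, 3.5)        # numerical vector
big   <- x > 1.8               # logical vector: FALSE TRUE TRUE
name  <- c("a", "b", "c")      # character vector

n     <- length(x)             # 3
y     <- x * 2 + 1             # element-wise arithmetic: 4 5 8
n_big <- sum(big)              # TRUE counts as 1, so this is 2
lab   <- paste(name, x)        # "a 1.5" "b 2" "c 3.5"

print(n)
print(y)
print(n_big)
print(lab)
```

Arithmetic on vectors is applied element by element, so no explicit loop is needed.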

Mathematical functions

sqrt(), log()
sin(), cos(), tan()
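
These functions are vectorized, and log() uses the natural logarithm unless a base is given. A small sketch with illustrative inputs:

```r
x <- c(1, 4, 9)

roots  <- sqrt(x)              # applied to every element: 1 2 3
ln_e2  <- log(exp(2))          # natural logarithm by default: 2
log10_ <- log(100, base = 10)  # a different base can be requested: 2

# Trigonometric functions expect angles in radians
s <- sin(pi / 2)               # 1
k <- cos(0)                    # 1
tg <- tan(pi / 4)              # 1, up to floating-point error

print(roots)
print(c(ln_e2, log10_, s, k, tg))
```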

Vectors and matrices

Matrices

Create a matrix: matrix()
Dimension of a matrix: dim()
Number of elements in a matrix: length()
Extract elements from a matrix.
Replace elements with new entries.
Create special matrices: diagonal matrix, identity matrix, zero matrix...

Matrix multiplication: %*%
Matrix inverse: solve()
Transpose of a matrix: t()
Element-wise operation with a matrix.
Combine two or more matrices: rbind(), cbind()
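
The operations listed above can be sketched on a small 2 × 2 example. This uses base R only; the matrix values are arbitrary and chosen so that the matrix is invertible.

```r
A <- matrix(1:4, nrow = 2)   # filled column by column: rows are (1, 3) and (2, 4)
d <- dim(A)                  # 2 2
n <- length(A)               # 4 elements in total

a12  <- A[1, 2]              # single element: 3
row2 <- A[2, ]               # second row: 2 4

B <- A
B[1, 1] <- 10                # replace one entry in a copy of A

I2 <- diag(2)                # 2 x 2 identity matrix
Z  <- matrix(0, 2, 2)        # 2 x 2 zero matrix
D  <- diag(c(1, 2))          # diagonal matrix with 1 and 2 on the diagonal

P     <- A %*% I2            # matrix product (equals A)
E     <- A * A               # element-wise product (squares each entry)
A_inv <- solve(A)            # inverse; det(A) = 1*4 - 3*2 = -2, so it exists
At    <- t(A)                # transpose

R <- rbind(A, I2)            # stack rows: 4 x 2
C <- cbind(A, I2)            # stack columns: 2 x 4

print(A %*% A_inv)           # identity matrix, up to rounding error
```

Note the difference between %*% (matrix product) and * (element-wise product); mixing them up is a common source of silent errors.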

Array

An array generalizes a matrix to more than two dimensions.
A matrix is a special case of an array when the dimension is two.
A vector can be viewed as a one-dimensional array. In R a plain vector carries no dimension attribute, and the dimension is usually dropped when a subset of an array reduces to one dimension.
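
A brief illustration with a 2 × 3 × 4 array, using base R only. The sizes and names are arbitrary choices for the example.

```r
arr <- array(1:24, dim = c(2, 3, 4))
d3  <- dim(arr)           # 2 3 4

last <- arr[2, 3, 4]      # last element: 24

m <- arr[, , 1]           # fixing the third index leaves a 2 x 3 matrix
v <- arr[1, , 1]          # fixing two indices drops to a plain vector: 1 3 5

print(dim(m))             # 2 3
print(dim(v))             # NULL, because the dimension has been dropped
print(is.array(m))        # TRUE: a matrix is a two-dimensional array
```

Passing drop = FALSE when subsetting, as in arr[1, , 1, drop = FALSE], keeps the result as an array.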

List

A flexible data structure for cases that a matrix cannot handle:
- Elements have different lengths.
- Elements have different types.
- Elements are themselves nested structures (for example, a list within a list).
Create a list: list()
Extract elements of a list: [[]] or $
Delete an element of a list: assign NULL to that element.
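
A minimal sketch of these operations. The list contents are hypothetical and serve only to show mixed types, mixed lengths, and nesting.

```r
person <- list(
  name    = "Ann",                                  # character, length 1
  scores  = c(90, 85, 77),                          # numeric, length 3
  address = list(city = "Beijing", zip = "100081")  # nested list
)

nm     <- person$name             # extract by name with $
second <- person[["scores"]][2]   # extract with [[ ]], then index: 85
city   <- person$address$city     # reach into the nested list

sub <- person["name"]             # single brackets return a list, not the element

person$scores <- NULL             # delete the element
print(names(person))              # "name" "address"
```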

Data frame

data.frame(): tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software.
In most cases, operations on a data frame are similar to matrix operations.
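
The following sketch shows the matrix-like and list-like behavior side by side. The column names and values are invented for the example.

```r
df <- data.frame(
  id    = 1:3,
  name  = c("a", "b", "c"),
  score = c(90.5, 85, 77),
  stringsAsFactors = FALSE      # keep name as character on older R versions
)

shape <- dim(df)                # matrix-like: 3 rows, 3 columns
s2    <- df[2, "score"]         # matrix-like indexing: 85
col   <- df$score               # list-like access to one column

high <- df[df$score > 80, ]     # logical row selection keeps rows 1 and 2
df$passed <- df$score >= 80     # add a new column

print(sapply(df, class))        # each column has its own type
```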

See also the dplyr package.
- written by Hadley Wickham of RStudio
- everything dplyr does could already be done with base R, but it greatly simplifies existing functionality in R.
- it provides a "grammar" (in particular, verbs) for data manipulation and for operating on data frames.
- the dplyr functions are very fast, as many key operations are coded in C++.

Discussion

What type of data structure would you choose in each of the following situations?

Data are of the same length but different types.
Data are not of the same length.
Hierarchical structure of the data.

Managing data frames with the `dplyr` package

`dplyr` Grammar

Some of the key "verbs" provided by the dplyr package are:

select: return a subset of the columns of a data frame, using a flexible notation
filter: extract a subset of rows from a data frame based on logical conditions
arrange: reorder rows of a data frame
rename: rename variables in a data frame
mutate: add new variables/columns or transform existing variables
summarise / summarize: generate summary statistics of different variables in the data frame, possibly within strata
%>%: the "pipe" operator is used to connect multiple verb actions together into a pipeline
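
As a rough illustration of how the verbs chain together with the pipe, here is a sketch on R's built-in airquality data set (not the Chicago data used in the rest of these notes). The names result, mean_ozone and max_temp are chosen for the example.

```r
library(dplyr)

result <- airquality %>%
  filter(!is.na(Ozone)) %>%               # keep rows with an ozone reading
  select(Month, Ozone, Temp) %>%          # keep three columns
  group_by(Month) %>%                     # one group per month
  summarize(mean_ozone = mean(Ozone),     # one summary row per group
            max_temp   = max(Temp)) %>%
  arrange(desc(mean_ozone))               # highest mean ozone first

print(result)
```

Each step receives the data frame produced by the previous step as its first argument, so no intermediate variables are needed.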

Common `dplyr` Function Properties

All of the functions have a few common characteristics. In particular,

The first argument is a data frame.
The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator (just use the column names).
The return result of a function is a new data frame.
Data frames must be properly formatted and annotated for this to all be useful. In particular, the data must be tidy. In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation.

Installing the dplyr package
```
  install.packages("dplyr")
```
After installing the package it is important that you load it into your R session with the library() function.

In [9]:

library(dplyr)

`select()`

We will use a dataset containing air pollution and temperature data for the city of Chicago in the U.S.

In [40]:

chicago <- readRDS("data/chicago.rds")
dim(chicago)
str(chicago)

6940
8

'data.frame':	6940 obs. of  8 variables:
 $ city      : chr  "chic" "chic" "chic" "chic" ...
 $ tmpd      : num  31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
 $ dptp      : num  31.5 29.9 27.4 28.6 28.9 ...
 $ date      : Date, format: "1987-01-01" "1987-01-02" ...
 $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...
 $ pm10tmean2: num  34 NA 34.2 47 NA ...
 $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...
 $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...

Sometimes you may want to use only a couple of variables out of many.

In [14]:

names(chicago)[1:3]
subset <- select(chicago, city:dptp)
head(subset)

'city'
'tmpd'
'dptp'

A data.frame: 6 × 3
	city	tmpd	dptp
	<chr>	<dbl>	<dbl>
1	chic	31.5	31.500
2	chic	33.0	29.875
3	chic	33.0	27.375
4	chic	29.0	28.625
5	chic	32.0	28.875
6	chic	40.0	35.125

Sometimes you may want to drop some variables that are not useful.

In [16]:

select(chicago, -(city:dptp))

A data.frame: 6940 × 5
	date	pm25tmean2	pm10tmean2	o3tmean2	no2tmean2
	<date>	<dbl>	<dbl>	<dbl>	<dbl>
1	1987-01-01	NA	34.00000	4.250000	19.98810
2	1987-01-02	NA	NA	3.304348	23.19099
3	1987-01-03	NA	34.16667	3.333333	23.81548
4	1987-01-04	NA	47.00000	4.375000	30.43452
5	1987-01-05	NA	NA	4.750000	30.33333
6	1987-01-06	NA	48.00000	5.833333	25.77233
7	1987-01-07	NA	41.00000	9.291667	20.58171
8	1987-01-08	NA	36.00000	11.291667	17.03723
9	1987-01-09	NA	33.28571	4.500000	23.38889
10	1987-01-10	NA	NA	4.958333	19.54167
11	1987-01-11	NA	22.00000	17.541667	13.70139
12	1987-01-12	NA	26.00000	8.000000	33.02083
13	1987-01-13	NA	53.00000	4.958333	38.06142
14	1987-01-14	NA	43.00000	4.208333	32.19444
15	1987-01-15	NA	28.83333	4.458333	18.87131
16	1987-01-16	NA	19.00000	7.916667	19.46667
17	1987-01-17	NA	NA	5.833333	20.70833
18	1987-01-18	NA	39.00000	6.375000	21.03333
19	1987-01-19	NA	32.00000	14.875000	17.17409
20	1987-01-20	NA	38.00000	7.250000	21.61021
21	1987-01-21	NA	32.85714	8.913043	24.52083
22	1987-01-22	NA	52.00000	10.500000	16.98798
23	1987-01-23	NA	55.00000	14.625000	14.66250
24	1987-01-24	NA	38.00000	10.083333	18.69167
25	1987-01-25	NA	NA	6.666667	26.30417
26	1987-01-26	NA	71.00000	4.583333	32.42143
27	1987-01-27	NA	39.33333	6.000000	30.69306
28	1987-01-28	NA	47.00000	6.875000	29.12943
29	1987-01-29	NA	35.00000	2.916667	28.14529
30	1987-01-30	NA	59.00000	8.791667	19.79861
⋮	⋮	⋮	⋮	⋮	⋮
6911	2005-12-02	NA	19.50	9.156250	23.29167
6912	2005-12-03	13.34286	20.00	10.333333	25.19444
6913	2005-12-04	15.30000	15.50	13.177083	21.70833
6914	2005-12-05	NA	30.00	6.447917	28.38889
6915	2005-12-06	24.61667	33.00	4.701540	29.08333
6916	2005-12-07	37.80000	39.00	3.916214	34.30952
6917	2005-12-08	24.30000	31.00	5.995265	34.22222
6918	2005-12-09	25.45000	22.00	5.958333	31.41667
6919	2005-12-10	18.20000	30.00	9.135417	28.70833
6920	2005-12-11	10.60000	14.00	11.333333	22.55556
6921	2005-12-12	19.22500	28.75	5.031250	39.74621
6922	2005-12-13	26.50000	21.00	6.628623	29.56944
6923	2005-12-14	26.90000	16.00	3.802083	30.63384
6924	2005-12-15	14.40000	16.50	4.895833	25.43056
6925	2005-12-16	11.00000	22.00	11.166667	16.87500
6926	2005-12-17	13.80000	20.00	8.593750	20.73611
6927	2005-12-18	12.20000	17.50	13.552083	19.11111
6928	2005-12-19	21.15000	21.00	8.058877	31.79167
6929	2005-12-20	25.75000	32.00	3.849185	32.89773
6930	2005-12-21	37.92857	59.50	3.663949	34.86111
6931	2005-12-22	36.65000	42.50	5.385417	33.73026
6932	2005-12-23	32.90000	34.50	6.906250	29.08333
6933	2005-12-24	30.77143	25.20	1.770833	31.98611
6934	2005-12-25	6.70000	8.00	14.354167	13.79167
6935	2005-12-26	8.40000	8.50	14.041667	16.81944
6936	2005-12-27	23.56000	27.00	4.468750	23.50000
6937	2005-12-28	17.75000	27.50	3.260417	19.28563
6938	2005-12-29	7.45000	23.50	6.794837	19.97222
6939	2005-12-30	15.05714	19.20	3.034420	22.80556
6940	2005-12-31	15.00000	23.50	2.531250	13.25000

If we wanted to keep every variable that ends with a "2", we could do

In [18]:

subset <- select(chicago, ends_with("2"))
str(subset)

'data.frame':	6940 obs. of  4 variables:
 $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...
 $ pm10tmean2: num  34 NA 34.2 47 NA ...
 $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...
 $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...

Or if we wanted to keep every variable that starts with a "d", we could do

In [20]:

subset <- select(chicago, starts_with("d"))
str(subset)

'data.frame':	6940 obs. of  2 variables:
 $ dptp: num  31.5 29.9 27.4 28.6 28.9 ...
 $ date: Date, format: "1987-01-01" "1987-01-02" ...

`filter()`

The filter() function is used to extract subsets of rows from a data frame.

In [21]:

chic.f <- filter(chicago, pm25tmean2 > 30)
str(chic.f)
summary(chic.f$pm25tmean2)

'data.frame':	194 obs. of  8 variables:
 $ city      : chr  "chic" "chic" "chic" "chic" ...
 $ tmpd      : num  23 28 55 59 57 57 75 61 73 78 ...
 $ dptp      : num  21.9 25.8 51.3 53.7 52 56 65.8 59 60.3 67.1 ...
 $ date      : Date, format: "1998-01-17" "1998-01-23" ...
 $ pm25tmean2: num  38.1 34 39.4 35.4 33.3 ...
 $ pm10tmean2: num  32.5 38.7 34 28.5 35 ...
 $ o3tmean2  : num  3.18 1.75 10.79 14.3 20.66 ...
 $ no2tmean2 : num  25.3 29.4 25.3 31.4 26.8 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  30.05   32.12   35.04   36.63   39.53   61.50

We could for example extract the rows where PM2.5 is greater than 30 and temperature is greater than 80 degrees Fahrenheit.

In [22]:

chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
select(chic.f, date, tmpd, pm25tmean2)

A data.frame: 17 × 3
date	tmpd	pm25tmean2
<date>	<dbl>	<dbl>
1998-08-23	81	39.60000
1998-09-06	81	31.50000
2001-07-20	82	32.30000
2001-08-01	84	43.70000
2001-08-08	85	38.83750
2001-08-09	84	38.20000
2002-06-20	82	33.00000
2002-06-23	82	42.50000
2002-07-08	81	33.10000
2002-07-18	82	38.85000
2003-06-25	82	33.90000
2003-07-04	84	32.90000
2005-06-24	86	31.85714
2005-06-27	82	51.53750
2005-06-28	85	31.20000
2005-07-17	84	32.70000
2005-08-03	84	37.90000
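
One detail worth knowing: filter() keeps only rows where the condition is TRUE, so rows where it evaluates to NA are dropped. That is why the many NA values of pm25tmean2 do not appear in the results above. A small sketch on a made-up data frame (toy and its values are hypothetical):

```r
library(dplyr)

toy <- data.frame(
  day  = 1:5,
  pm25 = c(35, NA, 12, 41, NA),
  tmpd = c(85, 90, 70, 60, 82)
)

high      <- filter(toy, pm25 > 30)                # days 1 and 4; NA rows are dropped
high_hot  <- filter(toy, pm25 > 30, tmpd > 80)     # comma-separated conditions are combined with AND: day 1
high_or_na <- filter(toy, is.na(pm25) | pm25 > 30) # keep missing values explicitly: days 1, 2, 4, 5

print(high)
print(high_hot)
print(high_or_na)
```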

`arrange()`

The arrange() function is used to reorder rows of a data frame according to one or more of the variables/columns.

Here we can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation.

In [27]:

chicago <- arrange(chicago, date)
head(select(chicago, date, pm25tmean2), 3)
tail(select(chicago, date, pm25tmean2), 3)

A data.frame: 3 × 2
	date	pm25tmean2
	<date>	<dbl>
1	1987-01-01	NA
2	1987-01-02	NA
3	1987-01-03	NA

A data.frame: 3 × 2
	date	pm25tmean2
	<date>	<dbl>
6938	2005-12-29	7.45000
6939	2005-12-30	15.05714
6940	2005-12-31	15.00000

Rows can also be sorted in descending order of a column by using the special desc() operator. Looking at the first three and last three rows shows the dates in descending order.

In [36]:

chicago2 <- arrange(chicago, desc(date))
head(select(chicago2, date, pm25tmean2), 3)
tail(select(chicago2, date, pm25tmean2), 3)

A data.frame: 3 × 2
	date	pm25tmean2
	<date>	<dbl>
1	2005-12-31	15.00000
2	2005-12-30	15.05714
3	2005-12-29	7.45000

A data.frame: 3 × 2
	date	pm25tmean2
	<date>	<dbl>
6938	1987-01-03	NA
6939	1987-01-02	NA
6940	1987-01-01	NA

`rename()`

Renaming a variable in a data frame in R is surprisingly hard to do! The rename() function is designed to make this process easier.

Here you can see the names of the first five variables in the chicago data frame. Now we rename the awkward variable names.

In [37]:

head(chicago[, 1:5], 3)
chicago3 <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)
head(chicago3[, 1:5], 3) # with new variable name

A data.frame: 3 × 5
	city	tmpd	dptp	date	pm25tmean2
	<chr>	<dbl>	<dbl>	<date>	<dbl>
1	chic	31.5	31.500	1987-01-01	NA
2	chic	33.0	29.875	1987-01-02	NA
3	chic	33.0	27.375	1987-01-03	NA

A data.frame: 3 × 5
	city	tmpd	dewpoint	date	pm25
	<chr>	<dbl>	<dbl>	<date>	<dbl>
1	chic	31.5	31.500	1987-01-01	NA
2	chic	33.0	29.875	1987-01-02	NA
3	chic	33.0	27.375	1987-01-03	NA

`mutate()`

The mutate() function exists to compute transformations of variables in a data frame.

For example, with air pollution data, we often want to detrend the data by subtracting the mean from the data (strictly speaking, this centers the series around zero).

In [42]:

chicago4 <- mutate(chicago3, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))
head(chicago4)

A data.frame: 6 × 9
	city	tmpd	dewpoint	date	pm25	pm10tmean2	o3tmean2	no2tmean2	pm25detrend
	<chr>	<dbl>	<dbl>	<date>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	chic	31.5	31.500	1987-01-01	NA	34.00000	4.250000	19.98810	NA
2	chic	33.0	29.875	1987-01-02	NA	NA	3.304348	23.19099	NA
3	chic	33.0	27.375	1987-01-03	NA	34.16667	3.333333	23.81548	NA
4	chic	29.0	28.625	1987-01-04	NA	47.00000	4.375000	30.43452	NA
5	chic	32.0	28.875	1987-01-05	NA	NA	4.750000	30.33333	NA
6	chic	40.0	35.125	1987-01-06	NA	48.00000	5.833333	25.77233	NA

`group_by()`

The group_by() function is used to generate summary statistics from the data frame within strata defined by a variable. For example, in this air pollution dataset, you might want to know what the average annual level of PM2.5 is.

First, we can create a year variable using as.POSIXlt().
Now we can create a separate data frame that splits the original data frame by year.

In [45]:

chicago5 <- mutate(chicago3, year = as.POSIXlt(date)$year + 1900)
years <- group_by(chicago5, year)
summarize(years, pm25 = mean(pm25, na.rm = TRUE),
          o3 = max(o3tmean2, na.rm = TRUE),
          no2 = median(no2tmean2, na.rm = TRUE))

A tibble: 19 × 4
year	pm25	o3	no2
<dbl>	<dbl>	<dbl>	<dbl>
1987	NaN	62.96966	23.49369
1988	NaN	61.67708	24.52296
1989	NaN	59.72727	26.14062
1990	NaN	52.22917	22.59583
1991	NaN	63.10417	21.38194
1992	NaN	50.82870	24.78921
1993	NaN	44.30093	25.76993
1994	NaN	52.17844	28.47500
1995	NaN	66.58750	27.26042
1996	NaN	58.39583	26.38715
1997	NaN	56.54167	25.48143
1998	18.26467	50.66250	24.58649
1999	18.49646	57.48864	24.66667
2000	16.93806	55.76103	23.46082
2001	16.92632	51.81984	25.06522
2002	15.27335	54.88043	22.73750
2003	15.23183	56.16608	24.62500
2004	14.62864	44.48240	23.39130
2005	16.18556	58.84126	22.62387
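
The year extraction used above can be checked in isolation. as.POSIXlt() stores the year as an offset from 1900, which is why 1900 is added back. The sketch below also shows format() as one alternative; the dates are simply the first and last dates of the data set.

```r
d <- as.Date(c("1987-01-01", "2005-12-31"))

offset <- as.POSIXlt(d)$year            # years since 1900: 87 105
y1     <- as.POSIXlt(d)$year + 1900     # 1987 2005
y2     <- as.integer(format(d, "%Y"))   # same result via formatting

print(offset)
print(y1)
print(y2)
```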

In a slightly more complicated example, we might want to know what are the average levels of ozone (o3) and nitrogen dioxide (no2) within quintiles of pm25. A slicker way to do this would be through a regression model, but we can actually do this quickly with group_by() and summarize().

First, we can create a categorical variable of pm25 divided into quintiles.
Now we can group the data frame by the pm25.quint variable.
Finally, we can compute the mean of o3 and no2 within quintiles of pm25.

In [46]:

qq <- quantile(chicago3$pm25, seq(0, 1, 0.2), na.rm = TRUE)
chicago6 <- mutate(chicago3, pm25.quint = cut(pm25, qq))
quint <- group_by(chicago6, pm25.quint)
summarize(quint, o3 = mean(o3tmean2, na.rm = TRUE),
          no2 = mean(no2tmean2, na.rm = TRUE))

A tibble: 6 × 3
pm25.quint	o3	no2
<fct>	<dbl>	<dbl>
(1.7,8.7]	21.66401	17.99129
(8.7,12.4]	20.38248	22.13004
(12.4,16.7]	20.66160	24.35708
(16.7,22.6]	19.88122	27.27132
(22.6,61.5]	20.31775	29.64427
NA	18.79044	25.77585
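
A note on the NA group in the table above: by default cut() builds intervals that are open on the left, so an observation exactly equal to the lowest break (here the minimum of pm25) is not assigned to any interval and ends up as NA together with the genuinely missing values. The argument include.lowest = TRUE closes the first interval. A sketch on made-up numbers:

```r
x  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
qq <- quantile(x, seq(0, 1, 0.2))

open_left <- cut(x, qq)                         # the minimum falls outside (qq[1], qq[2]]
closed    <- cut(x, qq, include.lowest = TRUE)  # first interval becomes [qq[1], qq[2]]

print(sum(is.na(open_left)))   # 1: the minimum value was not binned
print(sum(is.na(closed)))      # 0
print(table(closed))           # two observations in each quintile
```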

Summary

The dplyr package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of group_by() and summarize().

dplyr can work with other data frame "backends" such as SQL databases. There is an SQL interface for relational databases via the DBI package.
dplyr can be integrated with the data.table package for large, fast tables.
The dplyr package is a handy way to both simplify and speed up your data frame management code. It's rare that you get such a combination at the same time!