# R Builtin Data Structures


Feng Li

School of Statistics and Mathematics

Central University of Finance and Economics

[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)

[https://feng.li/statcomp](https://feng.li/statcomp)

_>>> Link to Python version_ [1](https://feng.li/files/python/P02-Python-Data-Structures/L02.1-Python-Builtin-Data-Structures.slides.html), [2](https://feng.li/files/python/P02-Python-Data-Structures/L02.2-Data-Wrangling-with-Pandas.slides.html), [3](https://feng.li/files/python/P02-Python-Data-Structures/L02.3-Manipulating-DataFrames-with-Pandas.slides.html)

## Generate a sequence

### Sequences

-   Generate a sequence: `seq()`

-   Repeat a vector: `rep()`

In [2]:
seq(1,100)

In [3]:
seq(100,1)

In [5]:
rep(10,5)

In [6]:
rep(c(1, 2, 3), 5)
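
The calls above use only the default arguments. As a small sketch (the variable names are illustrative, not part of the original slides), `seq()` also accepts a step size or a target length, and `rep()` can repeat either the whole vector or each element:

```r
# Step size and output length can be given explicitly
s1 <- seq(0, 1, by = 0.25)         # 0.00 0.25 0.50 0.75 1.00
s2 <- seq(1, 10, length.out = 4)   # 1 4 7 10
s3 <- 1:5                          # shorthand for seq(1, 5)

# rep() can repeat the whole vector or each element in turn
r1 <- rep(c(1, 2, 3), times = 2)   # 1 2 3 1 2 3
r2 <- rep(c(1, 2, 3), each = 2)    # 1 1 2 2 3 3
```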

### Vectors

-   Numerical vectors

-   Logical vectors

-   Characters

-   Length of a vector

-   Vector calculations
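
A minimal sketch of the vector types listed above, using made-up values:

```r
x <- c(1.5, 2, 3.5)            # numeric vector
b <- c(TRUE, FALSE, TRUE)      # logical vector
s <- c("a", "bb", "ccc")       # character vector

n <- length(x)                 # number of elements: 3
k <- nchar(s)                  # characters per string: 1 2 3

# Arithmetic is applied element by element
y <- x * 2 + 1                 # 4 5 8

# A logical vector can be used to pick elements
z <- x[b]                      # 1.5 3.5
total <- sum(b)                # TRUE counts as 1, so this is 2
```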

### Mathematical functions

-   `sqrt(), log()`

-   `sin(), cos(), tan()`
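
These functions are vectorized, so they apply to every element of a vector. A short sketch with illustrative inputs (note that `log()` is the natural logarithm unless `base` is given, and the trigonometric functions expect radians):

```r
x <- c(1, 4, 9)

r  <- sqrt(x)               # 1 2 3
l1 <- log(exp(2))           # natural logarithm by default: 2
l2 <- log(100, base = 10)   # logarithm with an explicit base: 2

s  <- sin(pi / 2)           # 1
co <- cos(0)                # 1
ta <- tan(pi / 4)           # approximately 1, up to floating-point rounding
```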

## Vectors and matrices

### Matrices

-   Create a matrix: `matrix()`

-   Dimension of a matrix: `dim()`

-   How many elements in a matrix: `length()`

-   Extract elements from a matrix.

-   Replace elements with new entries.

-   Create special matrices: diagonal matrix, identity matrix, zero
    matrix, ...

-   Matrix multiplications: `%*%`

-   Matrix inverse: `solve()`

-   Transpose of a matrix: `t()`

-   Element-wise operation with a matrix.

-   Combine two or more matrices: `rbind(), cbind()`
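
The operations in this list can be sketched on a small 2 x 2 example. The matrix `A` and the other names below are illustrative only:

```r
A <- matrix(c(2, 1, 0, 3), nrow = 2)   # filled column by column
d <- dim(A)                            # 2 2
n <- length(A)                         # 4 elements in total

a12 <- A[1, 2]                         # row 1, column 2: 0
col1 <- A[, 1]                         # first column: 2 1

B <- A
B[2, 2] <- 10                          # replace a single entry

I2 <- diag(2)                          # 2 x 2 identity matrix
D  <- diag(c(1, 2))                    # diagonal matrix
Z  <- matrix(0, nrow = 2, ncol = 2)    # zero matrix

A_inv <- solve(A)                      # matrix inverse
P  <- A %*% A_inv                      # matrix product: identity up to rounding
E  <- A * A                            # element-wise product
At <- t(A)                             # transpose

R <- rbind(A, I2)                      # stack rows: 4 x 2
C <- cbind(A, I2)                      # stack columns: 2 x 4
```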

### Array

-   An array generalizes a matrix to more than two dimensions.

-   A matrix is a special case of an array when the dimension is two.

-   A vector can be viewed as a special array with no dimension (in R the
    dimension is usually dropped in this situation).
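
A brief sketch of these relationships, using an illustrative 2 x 3 x 4 array:

```r
a <- array(1:24, dim = c(2, 3, 4))   # filled in column-major order
d <- dim(a)                          # 2 3 4
e <- a[2, 3, 1]                      # 6

m <- a[, , 1]                        # fixing one index leaves a 2 x 3 matrix
is_mat <- is.matrix(m)               # TRUE

v <- a[1, , 1]                       # fixing two indices drops to a plain vector
dv <- dim(v)                         # NULL: the dimension has been dropped

k <- a[1, , 1, drop = FALSE]         # drop = FALSE keeps all three dimensions
dk <- dim(k)                         # 1 3 1
```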

### List

-   A list handles data that a matrix cannot:

    -   Elements of different lengths.

    -   Elements of different types.

    -   Nested data structures within a list.

-   Create a list: `list()`

-   Extract elements of a list: `[[]]` or `$`

-   Delete an element within a list: assign `NULL` to that element.
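
As a sketch, the list below mixes types, lengths, and a nested list. The object `person` and its contents are invented for illustration:

```r
person <- list(
  name    = "Ann",                                  # character, length 1
  scores  = c(90, 85, 77),                          # numeric, length 3
  address = list(city = "Beijing", zip = "100081")  # nested list
)

nm     <- person$name             # extract by name with $
second <- person[["scores"]][2]   # extract with [[ ]], then index: 85
city   <- person$address$city     # reach into the nested list

sub <- person["name"]             # single brackets return a list, not the element

person$scores <- NULL             # delete the element
remaining <- names(person)        # "name" "address"
```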

### Data frame

-   `data.frame()`: tightly coupled collections of variables which share
    many of the properties of matrices and of lists, used as the
    fundamental data structure by most of R's modeling software.

-   In most cases, operating on a data frame is similar to operating on
    a matrix.

-   See also `dplyr` package.

    -   written by Hadley Wickham of RStudio

    -   everything dplyr does could already be done with base R, but it
        greatly simplifies existing functionality in R.

    -   it provides a "grammar" (in particular, verbs) for data
        manipulation and for operating on data frames.

    -   the dplyr functions are very fast, as many key operations are
        coded in C++.
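
Before turning to `dplyr`, here is a minimal base R sketch showing how a data frame behaves both like a matrix and like a list. The data frame `df` is a made-up example:

```r
df <- data.frame(
  id    = 1:3,
  name  = c("a", "b", "c"),
  score = c(9.5, 7, 8),
  stringsAsFactors = FALSE
)

d <- dim(df)                  # 3 rows, 3 columns, as for a matrix
s2 <- df[2, "score"]          # matrix-style indexing: 7
nm <- df$name                 # list-style extraction of one column
sc <- df[["score"]]           # same idea with [[ ]]

high <- df[df$score > 7.5, ]  # logical row selection keeps rows 1 and 3
types <- sapply(df, class)    # each column keeps its own type
```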

### Discussion

What type of data structure would you choose in each of the following
situations?

-   Data are of the same length but different types.

-   Data are not of the same length.

-   Hierarchical structure of the data.

## Managing data frames with the `dplyr` package


### `dplyr` Grammar

Some of the key "verbs" provided by the `dplyr` package are

* `select`: return a subset of the columns of a data frame, using a flexible notation
* `filter`: extract a subset of rows from a data frame based on logical conditions
* `arrange`: reorder rows of a data frame
* `rename`: rename variables in a data frame
* `mutate`: add new variables/columns or transform existing variables
* `summarise` / `summarize`: generate summary statistics of different
  variables in the data frame, possibly within strata
* `%>%`: the "pipe" operator is used to connect multiple verb actions together into a pipeline
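
To see how the verbs chain together, here is a minimal sketch on a hypothetical toy data frame (`air`, with invented values). It assumes the `dplyr` package is already installed:

```r
library(dplyr)

# Hypothetical toy data, not the dataset used later in these notes
air <- data.frame(
  city = c("A", "A", "B", "B", "B"),
  temp = c(30, 35, 28, 40, 33),
  pm25 = c(12, NA, 20, 35, 8)
)

result <- air %>%
  filter(!is.na(pm25)) %>%              # keep rows with an observed PM2.5 value
  mutate(hot = temp > 32) %>%           # add a logical column
  group_by(city) %>%                    # define the strata
  summarise(mean_pm25 = mean(pm25),     # one row per city
            n_hot = sum(hot)) %>%
  arrange(desc(mean_pm25))              # highest mean first

result
```

Each step receives the data frame produced by the previous step as its first argument, which is why no intermediate variables are needed.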


### Common `dplyr` Function Properties

All of the functions have a few common characteristics. In particular,

- The first argument is a data frame.

- The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the `$` operator (just use the column names).

- Each function returns a new data frame; the input data frame is not modified.

- Data frames must be properly formatted and annotated for this to all be useful. In particular, the data must be [tidy](http://www.jstatsoft.org/v59/i10/paper). In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation.

- Install the `dplyr` package from CRAN:

        install.packages("dplyr")

- After installing the package it is important that you load it into your R session with the `library()` function.

In [9]:
library(dplyr)
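
The shared properties listed above can be checked on a tiny made-up data frame. This is only a sketch; `df` and `out` are illustrative names:

```r
library(dplyr)

df <- data.frame(x = c(3, 1, 2), y = c("c", "a", "b"))

# First argument is the data frame; columns are named directly, without df$
out <- filter(df, x > 1)

is_df  <- is.data.frame(out)   # the result is again a data frame
n_in   <- nrow(df)             # the input is left untouched: still 3 rows
n_out  <- nrow(out)            # 2 rows satisfy the condition

# The base R equivalent needs the data frame name repeated
base_out <- df[df$x > 1, ]
```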

### `select()`

- We will use a dataset containing air pollution and temperature data for the [city of Chicago](http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip) in the U.S.

In [40]:
chicago <- readRDS("data/chicago.rds")
dim(chicago)
str(chicago)

'data.frame':	6940 obs. of  8 variables:
 $ city      : chr  "chic" "chic" "chic" "chic" ...
 $ tmpd      : num  31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
 $ dptp      : num  31.5 29.9 27.4 28.6 28.9 ...
 $ date      : Date, format: "1987-01-01" "1987-01-02" ...
 $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...
 $ pm10tmean2: num  34 NA 34.2 47 NA ...
 $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...
 $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...


- Sometimes you may want to use only a couple of variables out of many.

In [14]:
names(chicago)[1:3]
subset <- select(chicago, city:dptp)
head(subset)

| | city | tmpd | dptp |
|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` |
| 1 | chic | 31.5 | 31.500 |
| 2 | chic | 33.0 | 29.875 |
| 3 | chic | 33.0 | 27.375 |
| 4 | chic | 29.0 | 28.625 |
| 5 | chic | 32.0 | 28.875 |
| 6 | chic | 40.0 | 35.125 |


- Sometimes you may want to drop some variables that are not useful.


In [16]:
select(chicago, -(city:dptp))

| | date | pm25tmean2 | pm10tmean2 | o3tmean2 | no2tmean2 |
|---|---|---|---|---|---|
| | `<date>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1987-01-01 | NA | 34.00000 | 4.250000 | 19.98810 |
| 2 | 1987-01-02 | NA | NA | 3.304348 | 23.19099 |
| 3 | 1987-01-03 | NA | 34.16667 | 3.333333 | 23.81548 |
| 4 | 1987-01-04 | NA | 47.00000 | 4.375000 | 30.43452 |
| 5 | 1987-01-05 | NA | NA | 4.750000 | 30.33333 |
| 6 | 1987-01-06 | NA | 48.00000 | 5.833333 | 25.77233 |
| 7 | 1987-01-07 | NA | 41.00000 | 9.291667 | 20.58171 |
| 8 | 1987-01-08 | NA | 36.00000 | 11.291667 | 17.03723 |
| 9 | 1987-01-09 | NA | 33.28571 | 4.500000 | 23.38889 |
| 10 | 1987-01-10 | NA | NA | 4.958333 | 19.54167 |


- If we wanted to keep every variable that ends with a "2", we could do

In [18]:
subset <- select(chicago, ends_with("2"))
str(subset)

'data.frame':	6940 obs. of  4 variables:
 $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...
 $ pm10tmean2: num  34 NA 34.2 47 NA ...
 $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...
 $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...


- Or, if we wanted to keep every variable that starts with a "d", we could do

In [20]:
subset <- select(chicago, starts_with("d"))
str(subset)

'data.frame':	6940 obs. of  2 variables:
 $ dptp: num  31.5 29.9 27.4 28.6 28.9 ...
 $ date: Date, format: "1987-01-01" "1987-01-02" ...
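
`ends_with()` and `starts_with()` belong to a family of selection helpers. The sketch below tries a few others on a hypothetical one-row data frame (`toy`) whose column names merely imitate those of `chicago`. It assumes a reasonably recent `dplyr` (1.0.0 or later) for `all_of()`:

```r
library(dplyr)

# Hypothetical one-row stand-in with similar column names
toy <- data.frame(city = "chic", tmpd = 31.5, dptp = 31.5,
                  pm25tmean2 = NA_real_, pm10tmean2 = 34, o3tmean2 = 4.25)

n1 <- names(select(toy, contains("tmean")))    # names containing a substring
n2 <- names(select(toy, matches("^pm")))       # names matching a regular expression
n3 <- names(select(toy, tmpd, everything()))   # move tmpd to the front

wanted <- c("city", "tmpd")                    # column names stored in a vector
n4 <- names(select(toy, all_of(wanted)))
```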


### `filter()`

- The `filter()` function is used to extract subsets of rows from a data frame.


In [21]:
chic.f <- filter(chicago, pm25tmean2 > 30)
str(chic.f)
summary(chic.f$pm25tmean2)

'data.frame':	194 obs. of  8 variables:
 $ city      : chr  "chic" "chic" "chic" "chic" ...
 $ tmpd      : num  23 28 55 59 57 57 75 61 73 78 ...
 $ dptp      : num  21.9 25.8 51.3 53.7 52 56 65.8 59 60.3 67.1 ...
 $ date      : Date, format: "1998-01-17" "1998-01-23" ...
 $ pm25tmean2: num  38.1 34 39.4 35.4 33.3 ...
 $ pm10tmean2: num  32.5 38.7 34 28.5 35 ...
 $ o3tmean2  : num  3.18 1.75 10.79 14.3 20.66 ...
 $ no2tmean2 : num  25.3 29.4 25.3 31.4 26.8 ...


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  30.05   32.12   35.04   36.63   39.53   61.50 

- We could for example extract the rows where PM2.5 is greater than 30 *and* temperature is greater than 80 degrees Fahrenheit.

In [22]:
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
select(chic.f, date, tmpd, pm25tmean2)

| date | tmpd | pm25tmean2 |
|---|---|---|
| `<date>` | `<dbl>` | `<dbl>` |
| 1998-08-23 | 81 | 39.6 |
| 1998-09-06 | 81 | 31.5 |
| 2001-07-20 | 82 | 32.3 |
| 2001-08-01 | 84 | 43.7 |
| 2001-08-08 | 85 | 38.8375 |
| 2001-08-09 | 84 | 38.2 |
| 2002-06-20 | 82 | 33.0 |
| 2002-06-23 | 82 | 42.5 |
| 2002-07-08 | 81 | 33.1 |
| 2002-07-18 | 82 | 38.85 |
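
One detail worth knowing is how `filter()` treats missing values: a row is kept only when the condition evaluates to `TRUE`, so rows where it is `NA` are dropped. A minimal sketch on a hypothetical data frame (`toy`, invented values):

```r
library(dplyr)

toy <- data.frame(tmpd = c(85, 70, 90, 82),
                  pm25 = c(35, 40, NA, 10))

f1 <- filter(toy, pm25 > 30)                  # the NA row is dropped silently
f2 <- filter(toy, pm25 > 30, tmpd > 80)       # conditions separated by commas are combined with AND
f3 <- filter(toy, pm25 > 30 | is.na(pm25))    # keep missing values explicitly
f4 <- filter(toy, is.na(pm25))                # only the rows with missing PM2.5
```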


### `arrange()`

The `arrange()` function is used to reorder rows of a data frame according to the values of one or more variables/columns.

- Here we can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation.

In [27]:
chicago <- arrange(chicago, date)
head(select(chicago, date, pm25tmean2), 3)
tail(select(chicago, date, pm25tmean2), 3)

| | date | pm25tmean2 |
|---|---|---|
| | `<date>` | `<dbl>` |
| 1 | 1987-01-01 | NA |
| 2 | 1987-01-02 | NA |
| 3 | 1987-01-03 | NA |

| | date | pm25tmean2 |
|---|---|---|
| | `<date>` | `<dbl>` |
| 6938 | 2005-12-29 | 7.45 |
| 6939 | 2005-12-30 | 15.05714 |
| 6940 | 2005-12-31 | 15.0 |


- Rows can be arranged in descending order too by wrapping the column in the `desc()` helper. Looking at the first three and last three rows shows the dates in descending order.

In [36]:
chicago2 <- arrange(chicago, desc(date))
head(select(chicago2, date, pm25tmean2), 3)
tail(select(chicago2, date, pm25tmean2), 3)

| | date | pm25tmean2 |
|---|---|---|
| | `<date>` | `<dbl>` |
| 1 | 2005-12-31 | 15.0 |
| 2 | 2005-12-30 | 15.05714 |
| 3 | 2005-12-29 | 7.45 |

| | date | pm25tmean2 |
|---|---|---|
| | `<date>` | `<dbl>` |
| 6938 | 1987-01-03 | NA |
| 6939 | 1987-01-02 | NA |
| 6940 | 1987-01-01 | NA |
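
`arrange()` also accepts several sort keys, with later keys breaking ties in earlier ones. The following sketch uses a hypothetical data frame (`toy`) and also shows that missing values are placed last, even inside `desc()`:

```r
library(dplyr)

toy <- data.frame(year = c(2001, 2000, 2001, 2000),
                  pm25 = c(10, NA, 30, 20))

# Sort by year ascending, then by pm25 descending within each year
sorted <- arrange(toy, year, desc(pm25))
sorted
```

In this example the row with the missing `pm25` value ends up at the end of the year 2000 block rather than at the top.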


### `rename()`

Renaming a variable in a data frame in R is surprisingly hard to do! The `rename()` function is designed to make this process easier.

- Below are the first five variables in the `chicago` data frame, before and after renaming the awkward `dptp` and `pm25tmean2` columns. The syntax inside `rename()` is `new_name = old_name`.


In [37]:
head(chicago[, 1:5], 3)
chicago3 <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)
head(chicago3[, 1:5], 3) # with new variable name

| | city | tmpd | dptp | date | pm25tmean2 |
|---|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` | `<date>` | `<dbl>` |
| 1 | chic | 31.5 | 31.500 | 1987-01-01 | NA |
| 2 | chic | 33.0 | 29.875 | 1987-01-02 | NA |
| 3 | chic | 33.0 | 27.375 | 1987-01-03 | NA |

| | city | tmpd | dewpoint | date | pm25 |
|---|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` | `<date>` | `<dbl>` |
| 1 | chic | 31.5 | 31.500 | 1987-01-01 | NA |
| 2 | chic | 33.0 | 29.875 | 1987-01-02 | NA |
| 3 | chic | 33.0 | 27.375 | 1987-01-03 | NA |
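
For comparison, here is a sketch of the same renaming done with base R on a hypothetical one-row data frame (`toy`). It illustrates why `rename()` is more convenient: base R needs one name-matching assignment per column.

```r
library(dplyr)

toy <- data.frame(dptp = 31.5, pm25tmean2 = NA_real_)

# dplyr: new_name = old_name, several columns in one call
renamed <- rename(toy, dewpoint = dptp, pm25 = pm25tmean2)

# Base R: locate each old name and overwrite it
base <- toy
names(base)[names(base) == "dptp"] <- "dewpoint"
names(base)[names(base) == "pm25tmean2"] <- "pm25"

same <- identical(names(renamed), names(base))   # both give "dewpoint" "pm25"
```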


### `mutate()`

The `mutate()` function exists to compute transformations of variables in a data frame.

- For example, with air pollution data, we often want to *detrend* the data by subtracting the mean from the data.

In [42]:
chicago4 <- mutate(chicago3, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))
head(chicago4)

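| | city | tmpd | dewpoint | date | pm25 | pm10tmean2 | o3tmean2 | no2tmean2 | pm25detrend |
|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` | `<date>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | chic | 31.5 | 31.500 | 1987-01-01 | NA | 34.00000 | 4.250000 | 19.98810 | NA |
| 2 | chic | 33.0 | 29.875 | 1987-01-02 | NA | NA | 3.304348 | 23.19099 | NA |
| 3 | chic | 33.0 | 27.375 | 1987-01-03 | NA | 34.16667 | 3.333333 | 23.81548 | NA |
| 4 | chic | 29.0 | 28.625 | 1987-01-04 | NA | 47.00000 | 4.375000 | 30.43452 | NA |
| 5 | chic | 32.0 | 28.875 | 1987-01-05 | NA | NA | 4.750000 | 30.33333 | NA |
| 6 | chic | 40.0 | 35.125 | 1987-01-06 | NA | 48.00000 | 5.833333 | 25.77233 | NA |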
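
Because the first rows of `chicago` have no PM2.5 data, the new column is all `NA` in the output above. The effect of `mutate()` is easier to see on a small hypothetical data frame (`toy`, invented values). The sketch also shows that several columns can be created in one call and that later ones may use earlier ones:

```r
library(dplyr)

toy <- data.frame(tmpd = c(32, 50, 68),    # temperature in degrees Fahrenheit
                  pm25 = c(10, NA, 20))

out <- mutate(toy,
  tmpc = (tmpd - 32) * 5 / 9,                       # convert to Celsius: 0 10 20
  pm25detrend = pm25 - mean(pm25, na.rm = TRUE),    # subtract the mean (15): -5 NA 5
  warm = tmpc > 5                                   # uses the column created just above
)
out
```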


### `group_by()`

The `group_by()` function is used to generate summary statistics from the data frame within strata defined by a variable. For example, in this air pollution dataset, you might want to know what the average annual level of PM2.5 is.

- First, we can create a `year` variable using `as.POSIXlt()`.

- Now we can create a separate data frame that splits the original data frame by year.

In [45]:
chicago5 <- mutate(chicago3, year = as.POSIXlt(date)$year + 1900)
years <- group_by(chicago5, year)
summarize(years, pm25 = mean(pm25, na.rm = TRUE),
          o3 = max(o3tmean2, na.rm = TRUE),
          no2 = median(no2tmean2, na.rm = TRUE))

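| year | pm25 | o3 | no2 |
|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1987 | NaN | 62.96966 | 23.49369 |
| 1988 | NaN | 61.67708 | 24.52296 |
| 1989 | NaN | 59.72727 | 26.14062 |
| 1990 | NaN | 52.22917 | 22.59583 |
| 1991 | NaN | 63.10417 | 21.38194 |
| 1992 | NaN | 50.82870 | 24.78921 |
| 1993 | NaN | 44.30093 | 25.76993 |
| 1994 | NaN | 52.17844 | 28.47500 |
| 1995 | NaN | 66.58750 | 27.26042 |
| 1996 | NaN | 58.39583 | 26.38715 |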
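
The same year-by-year summary can be written as a single pipeline. The sketch below uses a hypothetical four-row data frame (`toy`) and extracts the year with `format()` instead of `as.POSIXlt()`; both approaches are shown only as alternatives, not as the notes' prescribed method. Counting the non-missing values per group makes it clear when a mean rests on few or no observations:

```r
library(dplyr)

toy <- data.frame(
  date = as.Date(c("2000-01-15", "2000-07-01", "2001-03-10", "2001-11-20")),
  pm25 = c(10, 20, NA, 30)
)

by_year <- toy %>%
  mutate(year = as.integer(format(date, "%Y"))) %>%   # extract the calendar year
  group_by(year) %>%
  summarize(n_days    = n(),                          # rows in each group
            n_obs     = sum(!is.na(pm25)),            # non-missing PM2.5 values
            pm25_mean = mean(pm25, na.rm = TRUE))
by_year
```

The summary column is given a new name (`pm25_mean`) so that later expressions in the same `summarize()` call cannot accidentally pick up the summarized value instead of the original column.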


In a slightly more complicated example, we might want to know what are the average levels of ozone (`o3`) and nitrogen dioxide (`no2`) within quintiles of `pm25`. A slicker way to do this would be through a regression model, but we can actually do this quickly with `group_by()` and `summarize()`.

- First, we can create a categorical variable of `pm25` divided into quintiles.

- Now we can group the data frame by the `pm25.quint` variable.

- Finally, we can compute the mean of `o3` and `no2` within quintiles of `pm25`.

In [46]:
qq <- quantile(chicago3$pm25, seq(0, 1, 0.2), na.rm = TRUE)
chicago6 <- mutate(chicago3, pm25.quint = cut(pm25, qq))
quint <- group_by(chicago6, pm25.quint)
summarize(quint, o3 = mean(o3tmean2, na.rm = TRUE),
          no2 = mean(no2tmean2, na.rm = TRUE))

| pm25.quint | o3 | no2 |
|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` |
| (1.7,8.7] | 21.66401 | 17.99129 |
| (8.7,12.4] | 20.38248 | 22.13004 |
| (12.4,16.7] | 20.66160 | 24.35708 |
| (16.7,22.6] | 19.88122 | 27.27132 |
| (22.6,61.5] | 20.31775 | 29.64427 |
| NA | 18.79044 | 25.77585 |
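
The last row collects the observations whose quintile is `NA`, which mainly means days with no PM2.5 measurement. One further subtlety of `cut()` is that its intervals are open on the left by default, so a value equal to the smallest break is also assigned `NA`. A small base R sketch with made-up numbers shows the effect and the `include.lowest = TRUE` option:

```r
x  <- 1:10
qq <- quantile(x, seq(0, 1, 0.2))            # breaks at the 0%, 20%, ..., 100% quantiles

g1 <- cut(x, qq)                             # intervals of the form (a, b]
n_na_default <- sum(is.na(g1))               # 1: the minimum value falls outside (a, b]

g2 <- cut(x, qq, include.lowest = TRUE)      # first interval becomes [a, b]
n_na_closed <- sum(is.na(g2))                # 0: every value is assigned a group
```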


### Summary

The `dplyr` package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of `group_by()` and `summarize()`.

* `dplyr` can work with other data frame "backends" such as SQL databases. There is an SQL interface for relational databases via the DBI package.

* `dplyr` can be integrated with the `data.table` package for large fast tables.

* The `dplyr` package is a handy way to both simplify and speed up your data frame management code. It's rare that you get such a combination at the same time!
