R Built-in Data Structures

Feng Li

School of Statistics and Mathematics

Central University of Finance and Economics

feng.li@cufe.edu.cn

https://feng.li/statcomp


Generate a sequence

Sequences

  • Generate a sequence: seq()

  • Repeat a vector: rep()

In [2]:
seq(1,100)
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
  58. 58
  59. 59
  60. 60
  61. 61
  62. 62
  63. 63
  64. 64
  65. 65
  66. 66
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73
  74. 74
  75. 75
  76. 76
  77. 77
  78. 78
  79. 79
  80. 80
  81. 81
  82. 82
  83. 83
  84. 84
  85. 85
  86. 86
  87. 87
  88. 88
  89. 89
  90. 90
  91. 91
  92. 92
  93. 93
  94. 94
  95. 95
  96. 96
  97. 97
  98. 98
  99. 99
  100. 100
In [3]:
seq(100,1)
  1. 100
  2. 99
  3. 98
  4. 97
  5. 96
  6. 95
  7. 94
  8. 93
  9. 92
  10. 91
  11. 90
  12. 89
  13. 88
  14. 87
  15. 86
  16. 85
  17. 84
  18. 83
  19. 82
  20. 81
  21. 80
  22. 79
  23. 78
  24. 77
  25. 76
  26. 75
  27. 74
  28. 73
  29. 72
  30. 71
  31. 70
  32. 69
  33. 68
  34. 67
  35. 66
  36. 65
  37. 64
  38. 63
  39. 62
  40. 61
  41. 60
  42. 59
  43. 58
  44. 57
  45. 56
  46. 55
  47. 54
  48. 53
  49. 52
  50. 51
  51. 50
  52. 49
  53. 48
  54. 47
  55. 46
  56. 45
  57. 44
  58. 43
  59. 42
  60. 41
  61. 40
  62. 39
  63. 38
  64. 37
  65. 36
  66. 35
  67. 34
  68. 33
  69. 32
  70. 31
  71. 30
  72. 29
  73. 28
  74. 27
  75. 26
  76. 25
  77. 24
  78. 23
  79. 22
  80. 21
  81. 20
  82. 19
  83. 18
  84. 17
  85. 16
  86. 15
  87. 14
  88. 13
  89. 12
  90. 11
  91. 10
  92. 9
  93. 8
  94. 7
  95. 6
  96. 5
  97. 4
  98. 3
  99. 2
  100. 1
In [5]:
rep(10,5)
  1. 10
  2. 10
  3. 10
  4. 10
  5. 10
In [6]:
rep(c(1, 2, 3), 5)
  1. 1
  2. 2
  3. 3
  4. 1
  5. 2
  6. 3
  7. 1
  8. 2
  9. 3
  10. 1
  11. 2
  12. 3
  13. 1
  14. 2
  15. 3
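
Beyond the two-argument calls shown above, seq() and rep() accept a few named arguments that come up often. The following is a minimal sketch; the variable names are only for illustration.

```r
# Step size and output length can be set explicitly.
evens <- seq(2, 20, by = 2)            # 2, 4, ..., 20
grid  <- seq(0, 1, length.out = 5)     # 0.00 0.25 0.50 0.75 1.00

# rep() can repeat each element in place, or the whole vector.
each_twice  <- rep(c(1, 2, 3), each = 2)    # 1 1 2 2 3 3
whole_twice <- rep(c(1, 2, 3), times = 2)   # 1 2 3 1 2 3

# seq_len() builds 1..n and returns an empty vector when n = 0,
# whereas seq(1, 0) counts down and returns 1 0.
idx <- seq_len(0)

print(evens)
print(grid)
print(each_twice)
print(whole_twice)
print(length(idx))
```

seq_len() is usually the safer choice inside loops, because a length of zero gives an empty sequence instead of a descending one.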

Vectors

  • Numerical vectors

  • Logical vectors

  • Characters

  • Length of a vector

  • Vector calculations
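
The items above can be sketched in a few lines of base R. The values and variable names are made up for illustration.

```r
x <- c(1.5, 2, 3.5)         # numeric vector
b <- x > 1.8                # logical vector: FALSE TRUE TRUE
s <- c("a", "b", "c")       # character vector

n <- length(x)              # 3
y <- x * 2 + 1              # element-wise arithmetic: 4 5 8
n_true <- sum(b)            # TRUE counts as 1, so this is 2
picked <- x[b]              # logical indexing keeps 2.0 and 3.5

# A shorter vector is recycled to match the longer one.
r <- c(1, 2, 3, 4) + c(10, 20)   # 11 22 13 24

print(y)
print(picked)
print(r)
```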

Mathematical functions

  • sqrt(), log()

  • sin(),cos(), tan()
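
These functions are vectorized: they apply to every element of a vector at once. A small sketch with illustrative inputs:

```r
x <- c(1, 4, 9)

roots <- sqrt(x)                 # 1 2 3
ln    <- log(exp(2))             # natural logarithm by default: 2
lg    <- log(100, base = 10)     # logarithm with another base: 2

# Trigonometric functions take angles in radians.
s  <- sin(pi / 2)                # 1
co <- cos(0)                     # 1
tg <- tan(pi / 4)                # 1, up to floating-point rounding

print(roots)
print(c(ln, lg, s, co, tg))
```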

Vectors and matrices

Matrices

  • Create a matrix: matrix()

  • Dimension of a matrix: dim()

  • How many elements in a matrix: length()

  • Extract elements from a matrix.

  • Replace elements with new entries.

  • Create special matrices: diagonal matrix, identity matrix, zero matrix...

  • Matrix multiplications: %*%

  • Matrix inverse: solve()

  • Transpose of a matrix: t()

  • Element-wise operation with a matrix.

  • Combine two or more matrices: rbind(), cbind()
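
The matrix operations listed above can be tried on a small example. This is a sketch with made-up entries, not part of the original slides.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # filled column by column
print(dim(A))                          # 2 2
print(length(A))                       # 4 elements in total

a12 <- A[1, 2]                # extract row 1, column 2
A[2, 2] <- 5                  # replace a single entry

I2 <- diag(2)                 # 2 x 2 identity matrix
Z  <- matrix(0, 2, 2)         # 2 x 2 zero matrix
D  <- diag(c(1, 2))           # diagonal matrix with 1 and 2 on the diagonal

A_inv <- solve(A)             # matrix inverse
prod_check <- A %*% A_inv     # identity matrix, up to rounding
At <- t(A)                    # transpose
A_elem <- A * A               # element-wise product, not A %*% A

stacked <- rbind(A, Z)        # 4 x 2: rows of Z below rows of A
side    <- cbind(A, I2)       # 2 x 4: columns of I2 next to A

print(round(prod_check, 10))
```

Note that `*` multiplies entry by entry, while `%*%` is the matrix product; mixing the two up is a common source of silent errors.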

Array

  • An array generalizes a matrix to more than two dimensions.

  • A matrix is a special case of an array when the dimension is two.

  • A vector is a special case of an array when there is no dimension (in R the dimension is usually dropped in this situation).
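
A brief sketch of how arrays, matrices, and vectors relate, including the dropping of dimensions mentioned above. The array contents are arbitrary.

```r
arr <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 layers
print(dim(arr))                        # 2 3 4

last <- arr[2, 3, 4]                   # the final element, 24

m <- arr[, , 1]                        # one layer is a 2 x 3 matrix
print(is.matrix(m))                    # TRUE

v <- arr[1, , 1]                       # dimensions are dropped: plain vector 1 3 5
print(dim(v))                          # NULL

kept <- arr[1, , 1, drop = FALSE]      # drop = FALSE keeps the dimensions
print(dim(kept))                       # 1 3 1
```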

List

  • A list is a data structure for cases that a matrix cannot handle:

    • Data lengths are not the same.

    • Data types are not the same.

    • Nested data structure within a list.

  • Create a list: list()

  • Extract elements of a list: [[]] or $

  • Delete an element within a list: assign NULL to that element.
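
A minimal sketch of the list operations above. The list `person` and its fields are hypothetical and only serve as an illustration.

```r
# Elements differ in type and length, and one element is itself a list.
person <- list(
  name    = "Ann",
  scores  = c(90, 85, 77),
  address = list(city = "Beijing", zip = "100081")
)

nm     <- person$name               # extract by name with $
second <- person[["scores"]][2]     # extract with [[ ]], then index: 85
city   <- person$address$city       # reach into the nested list

# Single brackets return a sub-list; double brackets return the element itself.
print(class(person["name"]))        # "list"
print(class(person[["name"]]))      # "character"

person$scores <- NULL               # delete an element
print(names(person))                # "name" "address"
```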

Data frame

  • data.frame(): tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software.

  • In most cases, operations on a data frame are similar to matrix operations.

  • See also dplyr package.

    • written by Hadley Wickham of RStudio

    • everything dplyr does could already be done with base R, but it greatly simplifies existing functionality in R.

    • it provides a "grammar" (in particular, verbs) for data manipulation and for operating on data frames.

    • the dplyr functions are very fast, as many key operations are coded in C++.
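
Before turning to dplyr, the matrix-like and list-like behaviour of a data frame can be seen in base R. This is a sketch on a small made-up data frame; the column names are illustrative.

```r
df <- data.frame(
  id    = 1:4,
  group = c("a", "b", "a", "b"),
  value = c(2.5, 3.0, 4.5, 1.0)
)

print(dim(df))                  # 4 3, as for a matrix
cell <- df[2, 3]                # matrix-style indexing: row 2, column 3
vals <- df$value                # list-style extraction of one column
grp  <- df[["group"]]           # the same with [[ ]]

# Logical row selection combined with column selection by name.
big <- df[df$value > 2, c("id", "value")]
print(big)
```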

Discussion

What type of data structure would you choose in each of the following situations?

  • Data are of the same length but different types.

  • Data are not of the same length.

  • Hierarchical structure of the data.

Managing data frames with the dplyr package

dplyr Grammar

Some of the key "verbs" provided by the dplyr package are

  • select: return a subset of the columns of a data frame, using a flexible notation
  • filter: extract a subset of rows from a data frame based on logical conditions
  • arrange: reorder rows of a data frame
  • rename: rename variables in a data frame
  • mutate: add new variables/columns or transform existing variables
  • summarise / summarize: generate summary statistics of different variables in the data frame, possibly within strata
  • %>%: the "pipe" operator is used to connect multiple verb actions together into a pipeline
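
A short sketch of how the verbs chain together with the pipe. It assumes dplyr is installed, and it uses R's built-in airquality data set (not the Chicago data used in these notes) so that it can run on its own; the column names TempC, mean_ozone, and max_tempC are made up for the illustration.

```r
library(dplyr)

result <- airquality %>%
  filter(!is.na(Ozone)) %>%                      # keep rows with an ozone reading
  select(Month, Ozone, Temp) %>%                 # keep three columns
  mutate(TempC = (Temp - 32) * 5 / 9) %>%        # add temperature in Celsius
  group_by(Month) %>%                            # one stratum per month
  summarize(mean_ozone = mean(Ozone),
            max_tempC  = max(TempC)) %>%         # one summary row per month
  arrange(desc(mean_ozone))                      # highest mean ozone first

print(result)
```

Each step receives the data frame produced by the previous step as its first argument, which is why the verbs compose without intermediate variables.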

Common dplyr Function Properties

All of the functions have a few common characteristics. In particular,

  • The first argument is a data frame.

  • The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator (just use the column names).

  • The return result of a function is a new data frame.

  • Data frames must be properly formatted and annotated for this to all be useful. In particular, the data must be tidy. In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation.

  • Installing the dplyr package

      install.packages("dplyr")
  • After installing the package it is important that you load it into your R session with the library() function.

In [9]:
library(dplyr)

select()

  • We will use a dataset containing air pollution and temperature data for the city of Chicago in the U.S.
In [40]:
chicago <- readRDS("data/chicago.rds")
dim(chicago)
str(chicago)
  1. 6940
  2. 8
'data.frame':	6940 obs. of  8 variables:
 $ city      : chr  "chic" "chic" "chic" "chic" ...
 $ tmpd      : num  31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
 $ dptp      : num  31.5 29.9 27.4 28.6 28.9 ...
 $ date      : Date, format: "1987-01-01" "1987-01-02" ...
 $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...
 $ pm10tmean2: num  34 NA 34.2 47 NA ...
 $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...
 $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...
  • Sometimes you may want to use only a couple of variables out of many.
In [14]:
names(chicago)[1:3]
subset <- select(chicago, city:dptp)
head(subset)
  1. 'city'
  2. 'tmpd'
  3. 'dptp'
A data.frame: 6 × 3
   city   tmpd     dptp
   <chr>  <dbl>   <dbl>
1  chic    31.5  31.500
2  chic    33.0  29.875
3  chic    33.0  27.375
4  chic    29.0  28.625
5  chic    32.0  28.875
6  chic    40.0  35.125
  • Sometimes you may want to drop some variables that are not useful.
In [16]:
select(chicago, -(city:dptp))
A data.frame: 6940 × 5
      date        pm25tmean2  pm10tmean2   o3tmean2  no2tmean2
      <date>           <dbl>       <dbl>      <dbl>      <dbl>
   1  1987-01-01          NA    34.00000   4.250000   19.98810
   2  1987-01-02          NA          NA   3.304348   23.19099
   3  1987-01-03          NA    34.16667   3.333333   23.81548
   4  1987-01-04          NA    47.00000   4.375000   30.43452
   5  1987-01-05          NA          NA   4.750000   30.33333
   6  1987-01-06          NA    48.00000   5.833333   25.77233
   7  1987-01-07          NA    41.00000   9.291667   20.58171
   8  1987-01-08          NA    36.00000  11.291667   17.03723
   9  1987-01-09          NA    33.28571   4.500000   23.38889
  10  1987-01-10          NA          NA   4.958333   19.54167
  11  1987-01-11          NA    22.00000  17.541667   13.70139
  12  1987-01-12          NA    26.00000   8.000000   33.02083
  13  1987-01-13          NA    53.00000   4.958333   38.06142
  14  1987-01-14          NA    43.00000   4.208333   32.19444
  15  1987-01-15          NA    28.83333   4.458333   18.87131
  16  1987-01-16          NA    19.00000   7.916667   19.46667
  17  1987-01-17          NA          NA   5.833333   20.70833
  18  1987-01-18          NA    39.00000   6.375000   21.03333
  19  1987-01-19          NA    32.00000  14.875000   17.17409
  20  1987-01-20          NA    38.00000   7.250000   21.61021
  21  1987-01-21          NA    32.85714   8.913043   24.52083
  22  1987-01-22          NA    52.00000  10.500000   16.98798
  23  1987-01-23          NA    55.00000  14.625000   14.66250
  24  1987-01-24          NA    38.00000  10.083333   18.69167
  25  1987-01-25          NA          NA   6.666667   26.30417
  26  1987-01-26          NA    71.00000   4.583333   32.42143
  27  1987-01-27          NA    39.33333   6.000000   30.69306
  28  1987-01-28          NA    47.00000   6.875000   29.12943
  29  1987-01-29          NA    35.00000   2.916667   28.14529
  30  1987-01-30          NA    59.00000   8.791667   19.79861
   ⋮  ⋮                    ⋮           ⋮          ⋮          ⋮
6911  2005-12-02          NA       19.50   9.156250   23.29167
6912  2005-12-03    13.34286       20.00  10.333333   25.19444
6913  2005-12-04    15.30000       15.50  13.177083   21.70833
6914  2005-12-05          NA       30.00   6.447917   28.38889
6915  2005-12-06    24.61667       33.00   4.701540   29.08333
6916  2005-12-07    37.80000       39.00   3.916214   34.30952
6917  2005-12-08    24.30000       31.00   5.995265   34.22222
6918  2005-12-09    25.45000       22.00   5.958333   31.41667
6919  2005-12-10    18.20000       30.00   9.135417   28.70833
6920  2005-12-11    10.60000       14.00  11.333333   22.55556
6921  2005-12-12    19.22500       28.75   5.031250   39.74621
6922  2005-12-13    26.50000       21.00   6.628623   29.56944
6923  2005-12-14    26.90000       16.00   3.802083   30.63384
6924  2005-12-15    14.40000       16.50   4.895833   25.43056
6925  2005-12-16    11.00000       22.00  11.166667   16.87500
6926  2005-12-17    13.80000       20.00   8.593750   20.73611
6927  2005-12-18    12.20000       17.50  13.552083   19.11111
6928  2005-12-19    21.15000       21.00   8.058877   31.79167
6929  2005-12-20    25.75000       32.00   3.849185   32.89773
6930  2005-12-21    37.92857       59.50   3.663949   34.86111
6931  2005-12-22    36.65000       42.50   5.385417   33.73026
6932  2005-12-23    32.90000       34.50   6.906250   29.08333
6933  2005-12-24    30.77143       25.20   1.770833   31.98611
6934  2005-12-25     6.70000        8.00  14.354167   13.79167
6935  2005-12-26     8.40000        8.50  14.041667   16.81944
6936  2005-12-27    23.56000       27.00   4.468750   23.50000
6937  2005-12-28    17.75000       27.50   3.260417   19.28563
6938  2005-12-29     7.45000       23.50   6.794837   19.97222
6939  2005-12-30    15.05714       19.20   3.034420   22.80556
6940  2005-12-31    15.00000       23.50   2.531250   13.25000
  • If we wanted to keep every variable that ends with a "2", we could do
In [18]:
subset <- select(chicago, ends_with("2"))
str(subset)
'data.frame':	6940 obs. of  4 variables:
 $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...
 $ pm10tmean2: num  34 NA 34.2 47 NA ...
 $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...
 $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...
  • Or if we wanted to keep every variable that starts with a "d", we could do
In [20]:
subset <- select(chicago, starts_with("d"))
str(subset)
'data.frame':	6940 obs. of  2 variables:
 $ dptp: num  31.5 29.9 27.4 28.6 28.9 ...
 $ date: Date, format: "1987-01-01" "1987-01-02" ...

filter()

  • The filter() function is used to extract subsets of rows from a data frame.
In [21]:
chic.f <- filter(chicago, pm25tmean2 > 30)
str(chic.f)
summary(chic.f$pm25tmean2)
'data.frame':	194 obs. of  8 variables:
 $ city      : chr  "chic" "chic" "chic" "chic" ...
 $ tmpd      : num  23 28 55 59 57 57 75 61 73 78 ...
 $ dptp      : num  21.9 25.8 51.3 53.7 52 56 65.8 59 60.3 67.1 ...
 $ date      : Date, format: "1998-01-17" "1998-01-23" ...
 $ pm25tmean2: num  38.1 34 39.4 35.4 33.3 ...
 $ pm10tmean2: num  32.5 38.7 34 28.5 35 ...
 $ o3tmean2  : num  3.18 1.75 10.79 14.3 20.66 ...
 $ no2tmean2 : num  25.3 29.4 25.3 31.4 26.8 ...
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  30.05   32.12   35.04   36.63   39.53   61.50 
  • We could for example extract the rows where PM2.5 is greater than 30 and temperature is greater than 80 degrees Fahrenheit.
In [22]:
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
select(chic.f, date, tmpd, pm25tmean2)
A data.frame: 17 × 3
date         tmpd  pm25tmean2
<date>      <dbl>       <dbl>
1998-08-23     81    39.60000
1998-09-06     81    31.50000
2001-07-20     82    32.30000
2001-08-01     84    43.70000
2001-08-08     85    38.83750
2001-08-09     84    38.20000
2002-06-20     82    33.00000
2002-06-23     82    42.50000
2002-07-08     81    33.10000
2002-07-18     82    38.85000
2003-06-25     82    33.90000
2003-07-04     84    32.90000
2005-06-24     86    31.85714
2005-06-27     82    51.53750
2005-06-28     85    31.20000
2005-07-17     84    32.70000
2005-08-03     84    37.90000

arrange()

The arrange() function is used to reorder rows of a data frame according to one of the variables/columns.

  • Here we can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation.
In [27]:
chicago <- arrange(chicago, date)
head(select(chicago, date, pm25tmean2), 3)
tail(select(chicago, date, pm25tmean2), 3)
A data.frame: 3 × 2
      date        pm25tmean2
      <date>           <dbl>
   1  1987-01-01          NA
   2  1987-01-02          NA
   3  1987-01-03          NA
A data.frame: 3 × 2
      date        pm25tmean2
      <date>           <dbl>
6938  2005-12-29     7.45000
6939  2005-12-30    15.05714
6940  2005-12-31    15.00000
  • Rows can also be sorted in descending order by using the special desc() operator. Looking at the first three and last three rows shows the dates in descending order.
In [36]:
chicago2 <- arrange(chicago, desc(date))
head(select(chicago2, date, pm25tmean2), 3)
tail(select(chicago2, date, pm25tmean2), 3)
A data.frame: 3 × 2
      date        pm25tmean2
      <date>           <dbl>
   1  2005-12-31    15.00000
   2  2005-12-30    15.05714
   3  2005-12-29     7.45000
A data.frame: 3 × 2
      date        pm25tmean2
      <date>           <dbl>
6938  1987-01-03          NA
6939  1987-01-02          NA
6940  1987-01-01          NA

rename()

Renaming a variable in a data frame in R is surprisingly hard to do! The rename() function is designed to make this process easier.

  • Here you can see the names of the first five variables in the chicago data frame. Now we rename the awkward variable names.
In [37]:
head(chicago[, 1:5], 3)
chicago3 <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)
head(chicago3[, 1:5], 3) # with new variable name
A data.frame: 3 × 5
   city   tmpd     dptp  date        pm25tmean2
   <chr>  <dbl>   <dbl>  <date>           <dbl>
1  chic    31.5  31.500  1987-01-01          NA
2  chic    33.0  29.875  1987-01-02          NA
3  chic    33.0  27.375  1987-01-03          NA
A data.frame: 3 × 5
   city   tmpd  dewpoint  date         pm25
   <chr>  <dbl>    <dbl>  <date>      <dbl>
1  chic    31.5   31.500  1987-01-01     NA
2  chic    33.0   29.875  1987-01-02     NA
3  chic    33.0   27.375  1987-01-03     NA

mutate()

The mutate() function exists to compute transformations of variables in a data frame.

  • For example, with air pollution data, we often want to detrend the data by subtracting the mean from the data.
In [42]:
chicago4 <- mutate(chicago3, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))
head(chicago4)
A data.frame: 6 × 9
   city   tmpd  dewpoint  date         pm25  pm10tmean2  o3tmean2  no2tmean2  pm25detrend
   <chr>  <dbl>    <dbl>  <date>      <dbl>       <dbl>     <dbl>      <dbl>        <dbl>
1  chic    31.5   31.500  1987-01-01     NA    34.00000  4.250000   19.98810           NA
2  chic    33.0   29.875  1987-01-02     NA          NA  3.304348   23.19099           NA
3  chic    33.0   27.375  1987-01-03     NA    34.16667  3.333333   23.81548           NA
4  chic    29.0   28.625  1987-01-04     NA    47.00000  4.375000   30.43452           NA
5  chic    32.0   28.875  1987-01-05     NA          NA  4.750000   30.33333           NA
6  chic    40.0   35.125  1987-01-06     NA    48.00000  5.833333   25.77233           NA
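
The role of na.rm = TRUE in the mutate() call, and the reason the first rows of pm25detrend are NA, can be seen on a toy vector in base R. The values below are hypothetical.

```r
pm <- c(NA, 10, 14, 12)              # hypothetical PM2.5 readings, one missing

plain  <- mean(pm)                   # NA: one missing value poisons the mean
centre <- mean(pm, na.rm = TRUE)     # 12: mean of the observed values only

detrended <- pm - centre             # NA -2 2 0: missing inputs stay missing
print(detrended)
print(mean(detrended, na.rm = TRUE)) # 0: the centred values average to zero
```

Without na.rm = TRUE the subtracted mean would itself be NA, and the whole new column would be missing.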

group_by()

The group_by() function is used to generate summary statistics from the data frame within strata defined by a variable. For example, in this air pollution dataset, you might want to know what the average annual level of PM2.5 is.

  • First, we can create a year variable using as.POSIXlt().

  • Now we can create a separate data frame that splits the original data frame by year.

In [45]:
chicago5 <- mutate(chicago3, year = as.POSIXlt(date)$year + 1900)
years <- group_by(chicago5, year)
summarize(years, pm25 = mean(pm25, na.rm = TRUE),
          o3 = max(o3tmean2, na.rm = TRUE),
          no2 = median(no2tmean2, na.rm = TRUE))
A tibble: 19 × 4
 year      pm25        o3       no2
<dbl>     <dbl>     <dbl>     <dbl>
 1987       NaN  62.96966  23.49369
 1988       NaN  61.67708  24.52296
 1989       NaN  59.72727  26.14062
 1990       NaN  52.22917  22.59583
 1991       NaN  63.10417  21.38194
 1992       NaN  50.82870  24.78921
 1993       NaN  44.30093  25.76993
 1994       NaN  52.17844  28.47500
 1995       NaN  66.58750  27.26042
 1996       NaN  58.39583  26.38715
 1997       NaN  56.54167  25.48143
 1998  18.26467  50.66250  24.58649
 1999  18.49646  57.48864  24.66667
 2000  16.93806  55.76103  23.46082
 2001  16.92632  51.81984  25.06522
 2002  15.27335  54.88043  22.73750
 2003  15.23183  56.16608  24.62500
 2004  14.62864  44.48240  23.39130
 2005  16.18556  58.84126  22.62387
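
Two details of the cell above can be checked in isolation with base R: why 1900 is added to the year, and why years with no PM2.5 readings show NaN. The dates are taken from the first and last rows of the data; the vector `all_missing` is a made-up example.

```r
d  <- as.Date(c("1987-01-01", "2005-12-31"))
lt <- as.POSIXlt(d)

print(lt$year)                         # 87 105: years counted from 1900
yr <- lt$year + 1900                   # 1987 2005

# An alternative that formats the date and converts the result to integer.
yr_alt <- as.integer(format(d, "%Y"))  # 1987 2005

# With na.rm = TRUE, a group in which every value is missing has nothing
# left to average, so mean() returns NaN.
all_missing <- c(NA_real_, NA_real_)
m <- mean(all_missing, na.rm = TRUE)
print(m)                               # NaN
```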

In a slightly more complicated example, we might want to know what are the average levels of ozone (o3) and nitrogen dioxide (no2) within quintiles of pm25. A slicker way to do this would be through a regression model, but we can actually do this quickly with group_by() and summarize().

  • First, we can create a categorical variable of pm25 divided into quintiles.

  • Now we can group the data frame by the pm25.quint variable.

  • Finally, we can compute the mean of o3 and no2 within quintiles of pm25.

In [46]:
qq <- quantile(chicago3$pm25, seq(0, 1, 0.2), na.rm = TRUE)
chicago6 <- mutate(chicago3, pm25.quint = cut(pm25, qq))
quint <- group_by(chicago6, pm25.quint)
summarize(quint, o3 = mean(o3tmean2, na.rm = TRUE),
          no2 = mean(no2tmean2, na.rm = TRUE))
A tibble: 6 × 3
pm25.quint         o3       no2
<fct>           <dbl>     <dbl>
(1.7,8.7]    21.66401  17.99129
(8.7,12.4]   20.38248  22.13004
(12.4,16.7]  20.66160  24.35708
(16.7,22.6]  19.88122  27.27132
(22.6,61.5]  20.31775  29.64427
NA           18.79044  25.77585
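
One subtlety of cut() is worth a sketch: by default the intervals are open on the left, so the smallest value, which equals the lowest break, falls outside the first interval and becomes NA. In the cell above this would affect the day with the minimum pm25 in addition to the days where pm25 is missing. The toy vector below is hypothetical; include.lowest = TRUE is a standard argument of cut().

```r
x  <- 1:10                                   # hypothetical data
qq <- quantile(x, seq(0, 1, 0.2))            # breaks at the quintiles

q_default <- cut(x, qq)                      # intervals of the form (a, b]
print(sum(is.na(q_default)))                 # 1: the minimum is left out

q_closed <- cut(x, qq, include.lowest = TRUE)  # first interval becomes [a, b]
print(sum(is.na(q_closed)))                  # 0
print(table(q_closed))                       # two observations per quintile
```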

Summary

The dplyr package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of group_by() and summarize().

  • dplyr can work with other data frame "backends" such as SQL databases. There is an SQL interface for relational databases via the DBI package.

  • dplyr can be integrated with the data.table package for large, fast tables.

  • The dplyr package is a handy way to both simplify and speed up your data frame management code. It's rare that you get such a combination at the same time!