{
"cells": [
{
"cell_type": "markdown",
"id": "c57b2ce5",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# R Built-in Data Structures\n",
"\n",
"\n",
"Feng Li\n",
"\n",
"School of Statistics and Mathematics\n",
"\n",
"Central University of Finance and Economics\n",
"\n",
"[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)\n",
"\n",
"[https://feng.li/statcomp](https://feng.li/statcomp)\n",
"\n",
"_>>> Link to Python version_ [1](https://feng.li/files/python/P02-Python-Data-Structures/L02.1-Python-Builtin-Data-Structures.slides.html), [2](https://feng.li/files/python/P02-Python-Data-Structures/L02.2-Data-Wrangling-with-Pandas.slides.html), [3](https://feng.li/files/python/P02-Python-Data-Structures/L02.3-Manipulating-DataFrames-with-Pandas.slides.html)"
]
},
{
"cell_type": "markdown",
"id": "a0af6f40",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Generate a sequence\n",
"\n",
"### Sequences\n",
"\n",
"- Generate a sequence: `seq()`\n",
"\n",
"- Repeat a vector: `rep()`"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "776c00b1",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
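"<ol>\n",
"<li>1</li><li>2</li><li>3</li><li>4</li><li>5</li><li>6</li><li>7</li><li>8</li><li>9</li><li>10</li>\n",
"<li>11</li><li>12</li><li>13</li><li>14</li><li>15</li><li>16</li><li>17</li><li>18</li><li>19</li><li>20</li>\n",
"<li>21</li><li>22</li><li>23</li><li>24</li><li>25</li><li>26</li><li>27</li><li>28</li><li>29</li><li>30</li>\n",
"<li>31</li><li>32</li><li>33</li><li>34</li><li>35</li><li>36</li><li>37</li><li>38</li><li>39</li><li>40</li>\n",
"<li>41</li><li>42</li><li>43</li><li>44</li><li>45</li><li>46</li><li>47</li><li>48</li><li>49</li><li>50</li>\n",
"<li>51</li><li>52</li><li>53</li><li>54</li><li>55</li><li>56</li><li>57</li><li>58</li><li>59</li><li>60</li>\n",
"<li>61</li><li>62</li><li>63</li><li>64</li><li>65</li><li>66</li><li>67</li><li>68</li><li>69</li><li>70</li>\n",
"<li>71</li><li>72</li><li>73</li><li>74</li><li>75</li><li>76</li><li>77</li><li>78</li><li>79</li><li>80</li>\n",
"<li>81</li><li>82</li><li>83</li><li>84</li><li>85</li><li>86</li><li>87</li><li>88</li><li>89</li><li>90</li>\n",
"<li>91</li><li>92</li><li>93</li><li>94</li><li>95</li><li>96</li><li>97</li><li>98</li><li>99</li><li>100</li>\n",
"</ol>\n"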
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 1\n",
"\\item 2\n",
"\\item 3\n",
"\\item 4\n",
"\\item 5\n",
"\\item 6\n",
"\\item 7\n",
"\\item 8\n",
"\\item 9\n",
"\\item 10\n",
"\\item 11\n",
"\\item 12\n",
"\\item 13\n",
"\\item 14\n",
"\\item 15\n",
"\\item 16\n",
"\\item 17\n",
"\\item 18\n",
"\\item 19\n",
"\\item 20\n",
"\\item 21\n",
"\\item 22\n",
"\\item 23\n",
"\\item 24\n",
"\\item 25\n",
"\\item 26\n",
"\\item 27\n",
"\\item 28\n",
"\\item 29\n",
"\\item 30\n",
"\\item 31\n",
"\\item 32\n",
"\\item 33\n",
"\\item 34\n",
"\\item 35\n",
"\\item 36\n",
"\\item 37\n",
"\\item 38\n",
"\\item 39\n",
"\\item 40\n",
"\\item 41\n",
"\\item 42\n",
"\\item 43\n",
"\\item 44\n",
"\\item 45\n",
"\\item 46\n",
"\\item 47\n",
"\\item 48\n",
"\\item 49\n",
"\\item 50\n",
"\\item 51\n",
"\\item 52\n",
"\\item 53\n",
"\\item 54\n",
"\\item 55\n",
"\\item 56\n",
"\\item 57\n",
"\\item 58\n",
"\\item 59\n",
"\\item 60\n",
"\\item 61\n",
"\\item 62\n",
"\\item 63\n",
"\\item 64\n",
"\\item 65\n",
"\\item 66\n",
"\\item 67\n",
"\\item 68\n",
"\\item 69\n",
"\\item 70\n",
"\\item 71\n",
"\\item 72\n",
"\\item 73\n",
"\\item 74\n",
"\\item 75\n",
"\\item 76\n",
"\\item 77\n",
"\\item 78\n",
"\\item 79\n",
"\\item 80\n",
"\\item 81\n",
"\\item 82\n",
"\\item 83\n",
"\\item 84\n",
"\\item 85\n",
"\\item 86\n",
"\\item 87\n",
"\\item 88\n",
"\\item 89\n",
"\\item 90\n",
"\\item 91\n",
"\\item 92\n",
"\\item 93\n",
"\\item 94\n",
"\\item 95\n",
"\\item 96\n",
"\\item 97\n",
"\\item 98\n",
"\\item 99\n",
"\\item 100\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 1\n",
"2. 2\n",
"3. 3\n",
"4. 4\n",
"5. 5\n",
"6. 6\n",
"7. 7\n",
"8. 8\n",
"9. 9\n",
"10. 10\n",
"11. 11\n",
"12. 12\n",
"13. 13\n",
"14. 14\n",
"15. 15\n",
"16. 16\n",
"17. 17\n",
"18. 18\n",
"19. 19\n",
"20. 20\n",
"21. 21\n",
"22. 22\n",
"23. 23\n",
"24. 24\n",
"25. 25\n",
"26. 26\n",
"27. 27\n",
"28. 28\n",
"29. 29\n",
"30. 30\n",
"31. 31\n",
"32. 32\n",
"33. 33\n",
"34. 34\n",
"35. 35\n",
"36. 36\n",
"37. 37\n",
"38. 38\n",
"39. 39\n",
"40. 40\n",
"41. 41\n",
"42. 42\n",
"43. 43\n",
"44. 44\n",
"45. 45\n",
"46. 46\n",
"47. 47\n",
"48. 48\n",
"49. 49\n",
"50. 50\n",
"51. 51\n",
"52. 52\n",
"53. 53\n",
"54. 54\n",
"55. 55\n",
"56. 56\n",
"57. 57\n",
"58. 58\n",
"59. 59\n",
"60. 60\n",
"61. 61\n",
"62. 62\n",
"63. 63\n",
"64. 64\n",
"65. 65\n",
"66. 66\n",
"67. 67\n",
"68. 68\n",
"69. 69\n",
"70. 70\n",
"71. 71\n",
"72. 72\n",
"73. 73\n",
"74. 74\n",
"75. 75\n",
"76. 76\n",
"77. 77\n",
"78. 78\n",
"79. 79\n",
"80. 80\n",
"81. 81\n",
"82. 82\n",
"83. 83\n",
"84. 84\n",
"85. 85\n",
"86. 86\n",
"87. 87\n",
"88. 88\n",
"89. 89\n",
"90. 90\n",
"91. 91\n",
"92. 92\n",
"93. 93\n",
"94. 94\n",
"95. 95\n",
"96. 96\n",
"97. 97\n",
"98. 98\n",
"99. 99\n",
"100. 100\n",
"\n",
"\n"
],
"text/plain": [
" [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18\n",
" [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36\n",
" [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54\n",
" [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72\n",
" [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90\n",
" [91] 91 92 93 94 95 96 97 98 99 100"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"seq(1,100)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "3f52bc96",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
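"<ol>\n",
"<li>100</li><li>99</li><li>98</li><li>97</li><li>96</li><li>95</li><li>94</li><li>93</li><li>92</li><li>91</li>\n",
"<li>90</li><li>89</li><li>88</li><li>87</li><li>86</li><li>85</li><li>84</li><li>83</li><li>82</li><li>81</li>\n",
"<li>80</li><li>79</li><li>78</li><li>77</li><li>76</li><li>75</li><li>74</li><li>73</li><li>72</li><li>71</li>\n",
"<li>70</li><li>69</li><li>68</li><li>67</li><li>66</li><li>65</li><li>64</li><li>63</li><li>62</li><li>61</li>\n",
"<li>60</li><li>59</li><li>58</li><li>57</li><li>56</li><li>55</li><li>54</li><li>53</li><li>52</li><li>51</li>\n",
"<li>50</li><li>49</li><li>48</li><li>47</li><li>46</li><li>45</li><li>44</li><li>43</li><li>42</li><li>41</li>\n",
"<li>40</li><li>39</li><li>38</li><li>37</li><li>36</li><li>35</li><li>34</li><li>33</li><li>32</li><li>31</li>\n",
"<li>30</li><li>29</li><li>28</li><li>27</li><li>26</li><li>25</li><li>24</li><li>23</li><li>22</li><li>21</li>\n",
"<li>20</li><li>19</li><li>18</li><li>17</li><li>16</li><li>15</li><li>14</li><li>13</li><li>12</li><li>11</li>\n",
"<li>10</li><li>9</li><li>8</li><li>7</li><li>6</li><li>5</li><li>4</li><li>3</li><li>2</li><li>1</li>\n",
"</ol>\n"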
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 100\n",
"\\item 99\n",
"\\item 98\n",
"\\item 97\n",
"\\item 96\n",
"\\item 95\n",
"\\item 94\n",
"\\item 93\n",
"\\item 92\n",
"\\item 91\n",
"\\item 90\n",
"\\item 89\n",
"\\item 88\n",
"\\item 87\n",
"\\item 86\n",
"\\item 85\n",
"\\item 84\n",
"\\item 83\n",
"\\item 82\n",
"\\item 81\n",
"\\item 80\n",
"\\item 79\n",
"\\item 78\n",
"\\item 77\n",
"\\item 76\n",
"\\item 75\n",
"\\item 74\n",
"\\item 73\n",
"\\item 72\n",
"\\item 71\n",
"\\item 70\n",
"\\item 69\n",
"\\item 68\n",
"\\item 67\n",
"\\item 66\n",
"\\item 65\n",
"\\item 64\n",
"\\item 63\n",
"\\item 62\n",
"\\item 61\n",
"\\item 60\n",
"\\item 59\n",
"\\item 58\n",
"\\item 57\n",
"\\item 56\n",
"\\item 55\n",
"\\item 54\n",
"\\item 53\n",
"\\item 52\n",
"\\item 51\n",
"\\item 50\n",
"\\item 49\n",
"\\item 48\n",
"\\item 47\n",
"\\item 46\n",
"\\item 45\n",
"\\item 44\n",
"\\item 43\n",
"\\item 42\n",
"\\item 41\n",
"\\item 40\n",
"\\item 39\n",
"\\item 38\n",
"\\item 37\n",
"\\item 36\n",
"\\item 35\n",
"\\item 34\n",
"\\item 33\n",
"\\item 32\n",
"\\item 31\n",
"\\item 30\n",
"\\item 29\n",
"\\item 28\n",
"\\item 27\n",
"\\item 26\n",
"\\item 25\n",
"\\item 24\n",
"\\item 23\n",
"\\item 22\n",
"\\item 21\n",
"\\item 20\n",
"\\item 19\n",
"\\item 18\n",
"\\item 17\n",
"\\item 16\n",
"\\item 15\n",
"\\item 14\n",
"\\item 13\n",
"\\item 12\n",
"\\item 11\n",
"\\item 10\n",
"\\item 9\n",
"\\item 8\n",
"\\item 7\n",
"\\item 6\n",
"\\item 5\n",
"\\item 4\n",
"\\item 3\n",
"\\item 2\n",
"\\item 1\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 100\n",
"2. 99\n",
"3. 98\n",
"4. 97\n",
"5. 96\n",
"6. 95\n",
"7. 94\n",
"8. 93\n",
"9. 92\n",
"10. 91\n",
"11. 90\n",
"12. 89\n",
"13. 88\n",
"14. 87\n",
"15. 86\n",
"16. 85\n",
"17. 84\n",
"18. 83\n",
"19. 82\n",
"20. 81\n",
"21. 80\n",
"22. 79\n",
"23. 78\n",
"24. 77\n",
"25. 76\n",
"26. 75\n",
"27. 74\n",
"28. 73\n",
"29. 72\n",
"30. 71\n",
"31. 70\n",
"32. 69\n",
"33. 68\n",
"34. 67\n",
"35. 66\n",
"36. 65\n",
"37. 64\n",
"38. 63\n",
"39. 62\n",
"40. 61\n",
"41. 60\n",
"42. 59\n",
"43. 58\n",
"44. 57\n",
"45. 56\n",
"46. 55\n",
"47. 54\n",
"48. 53\n",
"49. 52\n",
"50. 51\n",
"51. 50\n",
"52. 49\n",
"53. 48\n",
"54. 47\n",
"55. 46\n",
"56. 45\n",
"57. 44\n",
"58. 43\n",
"59. 42\n",
"60. 41\n",
"61. 40\n",
"62. 39\n",
"63. 38\n",
"64. 37\n",
"65. 36\n",
"66. 35\n",
"67. 34\n",
"68. 33\n",
"69. 32\n",
"70. 31\n",
"71. 30\n",
"72. 29\n",
"73. 28\n",
"74. 27\n",
"75. 26\n",
"76. 25\n",
"77. 24\n",
"78. 23\n",
"79. 22\n",
"80. 21\n",
"81. 20\n",
"82. 19\n",
"83. 18\n",
"84. 17\n",
"85. 16\n",
"86. 15\n",
"87. 14\n",
"88. 13\n",
"89. 12\n",
"90. 11\n",
"91. 10\n",
"92. 9\n",
"93. 8\n",
"94. 7\n",
"95. 6\n",
"96. 5\n",
"97. 4\n",
"98. 3\n",
"99. 2\n",
"100. 1\n",
"\n",
"\n"
],
"text/plain": [
" [1] 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83\n",
" [19] 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65\n",
" [37] 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47\n",
" [55] 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29\n",
" [73] 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11\n",
" [91] 10 9 8 7 6 5 4 3 2 1"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"seq(100,1)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "daacea4d",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol>\n",
"<li>10</li><li>10</li><li>10</li><li>10</li><li>10</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 10\n",
"\\item 10\n",
"\\item 10\n",
"\\item 10\n",
"\\item 10\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 10\n",
"2. 10\n",
"3. 10\n",
"4. 10\n",
"5. 10\n",
"\n",
"\n"
],
"text/plain": [
"[1] 10 10 10 10 10"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"rep(10,5)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "5423c9cb",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
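"<ol>\n",
"<li>1</li><li>2</li><li>3</li><li>1</li><li>2</li><li>3</li><li>1</li><li>2</li><li>3</li><li>1</li><li>2</li><li>3</li><li>1</li><li>2</li><li>3</li>\n",
"</ol>\n"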
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 1\n",
"\\item 2\n",
"\\item 3\n",
"\\item 1\n",
"\\item 2\n",
"\\item 3\n",
"\\item 1\n",
"\\item 2\n",
"\\item 3\n",
"\\item 1\n",
"\\item 2\n",
"\\item 3\n",
"\\item 1\n",
"\\item 2\n",
"\\item 3\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 1\n",
"2. 2\n",
"3. 3\n",
"4. 1\n",
"5. 2\n",
"6. 3\n",
"7. 1\n",
"8. 2\n",
"9. 3\n",
"10. 1\n",
"11. 2\n",
"12. 3\n",
"13. 1\n",
"14. 2\n",
"15. 3\n",
"\n",
"\n"
],
"text/plain": [
" [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"rep(c(1, 2, 3), 5)"
]
},
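{
"cell_type": "markdown",
"id": "b7e1a001",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Both functions accept more arguments than the calls above use. A minimal sketch with the base R arguments `by`, `length.out`, `each` and `times`; the variable names are illustrative only.\n",
"\n",
"```r\n",
"# Step through a range with a fixed increment\n",
"odd <- seq(1, 9, by = 2)             # 1 3 5 7 9\n",
"\n",
"# Ask for a fixed number of equally spaced points instead\n",
"grid <- seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00\n",
"\n",
"# Repeat each element before moving on to the next one\n",
"blocks <- rep(c(1, 2, 3), each = 2)  # 1 1 2 2 3 3\n",
"\n",
"# Repeat the elements a different number of times\n",
"uneven <- rep(c(\"a\", \"b\"), times = c(2, 3))  # \"a\" \"a\" \"b\" \"b\" \"b\"\n",
"```\n",
"\n",
"Note the difference between `rep(x, 5)`, which cycles through the whole vector, and `each`, which repeats element by element."
]
},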
{
"cell_type": "markdown",
"id": "3a8613a0",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Vectors\n",
"\n",
"- Numerical vectors\n",
"\n",
"- Logical vectors\n",
"\n",
"- Character vectors\n",
"\n",
"- Length of a vector\n",
"\n",
"- Vector calculations"
]
},
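{
"cell_type": "markdown",
"id": "b7e1a002",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The items above can be tried out in a few lines. This is only a sketch with made-up values; `x`, `flags` and `words` are illustrative names.\n",
"\n",
"```r\n",
"x <- c(1.5, 2, 3.5)           # numeric vector\n",
"flags <- x > 1.8              # logical vector: FALSE TRUE TRUE\n",
"words <- c(\"a\", \"bc\", \"def\")  # character vector\n",
"\n",
"length(x)       # 3\n",
"x + 1           # element-wise: 2.5 3.0 4.5\n",
"x * c(2, 2, 2)  # element-wise product: 3 4 7\n",
"sum(flags)      # TRUE counts as 1, so this is 2\n",
"nchar(words)    # 1 2 3\n",
"```\n",
"\n",
"Arithmetic on vectors works element by element, so no explicit loop is needed."
]
},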
{
"cell_type": "markdown",
"id": "3ed2c522",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Mathematical functions\n",
"\n",
"- `sqrt()`, `log()`\n",
"\n",
"- `sin()`, `cos()`, `tan()`"
]
},
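{
"cell_type": "markdown",
"id": "b7e1a003",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"These functions are vectorized, so they apply to every element of a vector at once. A short sketch with arbitrary inputs:\n",
"\n",
"```r\n",
"x <- c(1, 4, 9)\n",
"roots <- sqrt(x)      # 1 2 3\n",
"\n",
"log(exp(2))           # 2; log() is the natural logarithm by default\n",
"log(100, base = 10)   # 2; the base can be changed\n",
"\n",
"sin(0)                # 0\n",
"cos(pi)               # -1\n",
"tan(pi / 4)           # approximately 1, up to floating-point error\n",
"```\n",
"\n",
"The trigonometric functions expect angles in radians."
]
},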
{
"cell_type": "markdown",
"id": "7943100a",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Vectors and matrices\n",
"\n",
"### Matrices\n",
"\n",
"- Create a matrix: `matrix()`\n",
"\n",
"- Dimension of a matrix: `dim()`\n",
"\n",
"- Number of elements in a matrix: `length()`\n",
"\n",
"- Extract elements from a matrix.\n",
"\n",
"- Replace elements with new entries.\n",
"\n",
"- Create special matrices: diagonal matrix, identity matrix, zero\n",
" matrix, etc."
]
},
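{
"cell_type": "markdown",
"id": "b7e1a004",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A minimal sketch of the matrix basics listed above, using a small made-up matrix `A`:\n",
"\n",
"```r\n",
"A <- matrix(1:6, nrow = 2, ncol = 3)  # filled column by column\n",
"dim(A)         # 2 3\n",
"length(A)      # 6\n",
"\n",
"A[1, 2]        # element in row 1, column 2: 3\n",
"A[, 3]         # third column as a vector: 5 6\n",
"A[2, 1] <- 0L  # replace a single entry\n",
"\n",
"I3 <- diag(3)                      # 3 x 3 identity matrix\n",
"D <- diag(c(1, 2))                 # diagonal matrix with 1 and 2 on the diagonal\n",
"Z <- matrix(0, nrow = 2, ncol = 2) # zero matrix\n",
"```\n",
"\n",
"Pass `byrow = TRUE` to `matrix()` if the data should be filled row by row instead."
]
},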
{
"cell_type": "markdown",
"id": "d471849b",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Matrix multiplication: `%*%`\n",
"\n",
"- Matrix inverse: `solve()`\n",
"\n",
"- Transpose of a matrix: `t()`\n",
"\n",
"- Element-wise operations on a matrix.\n",
"\n",
"- Combine two or more matrices: `rbind()`, `cbind()`"
]
},
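{
"cell_type": "markdown",
"id": "b7e1a005",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The operations above can be sketched on a small invertible matrix. The matrices `A` and `B` are made up for illustration.\n",
"\n",
"```r\n",
"A <- matrix(c(2, 0, 1, 3), nrow = 2)  # columns are (2, 0) and (1, 3)\n",
"B <- diag(2)                          # 2 x 2 identity\n",
"\n",
"A %*% B            # matrix product; equals A here\n",
"A * A              # element-wise product, not a matrix product\n",
"t(A)               # transpose\n",
"\n",
"A_inv <- solve(A)  # inverse; A must be square and non-singular\n",
"A %*% A_inv        # identity matrix, up to rounding error\n",
"\n",
"tall <- rbind(A, B)  # stack rows: a 4 x 2 matrix\n",
"wide <- cbind(A, B)  # bind columns: a 2 x 4 matrix\n",
"```\n",
"\n",
"A common mistake is to write `A * B` when `A %*% B` is intended; the two give different results in general."
]
},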
{
"cell_type": "markdown",
"id": "75033dc2",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Array\n",
"\n",
"- An array generalizes a matrix to more than two dimensions.\n",
"\n",
"- A matrix is a special case of an array when the dimension is two.\n",
"\n",
"- A vector is a special case of an array when there is no dimension (in R the\n",
" dimension is usually dropped in this situation)."
]
},
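{
"cell_type": "markdown",
"id": "b7e1a006",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A short sketch of these relationships, using the base R function `array()` on made-up data:\n",
"\n",
"```r\n",
"a <- array(1:24, dim = c(2, 3, 4))  # 2 rows, 3 columns, 4 layers\n",
"dim(a)          # 2 3 4\n",
"a[2, 3, 4]      # last element: 24\n",
"\n",
"dim(a[, , 1])   # a single layer is a 2 x 3 matrix: 2 3\n",
"\n",
"v <- 1:6\n",
"dim(v)          # NULL: a plain vector has no dim attribute\n",
"dim(a[1, 1, ])  # NULL: the result is dropped to a vector\n",
"```\n",
"\n",
"Use `drop = FALSE` when subsetting if the dimensions should be kept."
]
},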
{
"cell_type": "markdown",
"id": "f824b151",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### List\n",
"\n",
"- A list handles data that a matrix cannot:\n",
"\n",
" - Elements of different lengths.\n",
"\n",
" - Elements of different types.\n",
"\n",
" - Nested data structures within a list.\n",
"\n",
"- Create a list: `list()`\n",
"\n",
"- Extract elements of a list: `[[]]` or `$`\n",
"\n",
"- Delete an element within a list: assign `NULL` to that element."
]
},
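{
"cell_type": "markdown",
"id": "b7e1a007",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A minimal sketch of these list operations. The list `person` and all of its values are made up for illustration.\n",
"\n",
"```r\n",
"person <- list(\n",
"  name = \"Ada\",                     # character, length 1\n",
"  scores = c(90, 85, 77),           # numeric, length 3\n",
"  address = list(city = \"Beijing\")  # a nested list\n",
")\n",
"\n",
"person$name            # \"Ada\"\n",
"person[[\"scores\"]][2]  # 85\n",
"person$address$city    # \"Beijing\"\n",
"\n",
"person$scores <- NULL  # delete the element\n",
"names(person)          # \"name\" \"address\"\n",
"```\n",
"\n",
"Single brackets, as in `person[\"name\"]`, return a list of length one rather than the element itself."
]
},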
{
"cell_type": "markdown",
"id": "c5a0ff2a",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Data frame\n",
"\n",
"- `data.frame()`: tightly coupled collections of variables which share\n",
" many of the properties of matrices and of lists, used as the\n",
" fundamental data structure by most of R's modeling software.\n",
"\n",
"- In most cases, operations on a data frame are similar to matrix\n",
" operations."
]
},
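{
"cell_type": "markdown",
"id": "b7e1a008",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A small sketch showing how a data frame behaves partly like a matrix and partly like a list. The data frame `df` is made up for illustration.\n",
"\n",
"```r\n",
"df <- data.frame(\n",
"  id = 1:3,\n",
"  group = c(\"a\", \"b\", \"a\"),\n",
"  value = c(2.5, 3.1, 4.7)\n",
")\n",
"\n",
"dim(df)                      # 3 3\n",
"df[2, \"value\"]               # matrix-style indexing: 3.1\n",
"df$group                     # list-style access to one column\n",
"large <- df[df$value > 3, ]  # rows where value exceeds 3\n",
"sapply(df, class)            # each column keeps its own type\n",
"```\n",
"\n",
"Unlike a matrix, the columns of a data frame may have different types, as long as they share the same length."
]
},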
{
"cell_type": "markdown",
"id": "55ff6fdb",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- See also `dplyr` package.\n",
"\n",
" - written by Hadley Wickham of RStudio\n",
"\n",
" - everything dplyr does could already be done with base R, but it\n",
" greatly simplifies existing functionality in R.\n",
"\n",
" - it provides a \\\"grammar\\\" (in particular, verbs) for data\n",
" manipulation and for operating on data frames.\n",
"\n",
" - the dplyr functions are very fast, as many key operations are\n",
" coded in C++."
]
},
{
"cell_type": "markdown",
"id": "67c9468d",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Discussion\n",
"\n",
"What type of data structure would you choose in each of the following\n",
"situations?\n",
"\n",
"- Data are of the same length but different types.\n",
"\n",
"- Data are not of the same length.\n",
"\n",
"- Hierarchical structure of the data."
]
},
{
"cell_type": "markdown",
"id": "651e3e65",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Managing data frames with the `dplyr` package\n",
"\n",
"\n",
"### `dplyr` Grammar\n",
"\n",
"Some of the key \"verbs\" provided by the `dplyr` package are\n",
"\n",
"* `select`: return a subset of the columns of a data frame, using a flexible notation\n",
"* `filter`: extract a subset of rows from a data frame based on logical conditions\n",
"* `arrange`: reorder rows of a data frame\n",
"* `rename`: rename variables in a data frame\n",
"* `mutate`: add new variables/columns or transform existing variables\n",
"* `summarise` / `summarize`: generate summary statistics of different\n",
" variables in the data frame, possibly within strata\n",
"* `%>%`: the \"pipe\" operator is used to connect multiple verb actions together into a pipeline\n"
]
},
{
"cell_type": "markdown",
"id": "f30cd3a0",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Common `dplyr` Function Properties\n",
"\n",
"All of the functions have a few common characteristics. In particular,\n",
"\n",
"- The first argument is a data frame.\n",
"\n",
"- The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the `$` operator (just use the column names).\n",
"\n",
"- The return value of each function is a new data frame.\n",
"\n",
"- Data frames must be properly formatted and annotated for this to all be useful. In particular, the data must be [tidy](http://www.jstatsoft.org/v59/i10/paper). In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation."
]
},
{
"cell_type": "markdown",
"id": "4e004ab4",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Installing the `dplyr` package \n",
" \n",
" install.packages(\"dplyr\")\n",
"\n",
"- After installing the package it is important that you load it into your R session with the `library()` function."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "47c53ff8",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"library(dplyr)"
]
},
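{
"cell_type": "markdown",
"id": "b7e1a009",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Before turning to the individual verbs, here is a rough sketch of how they chain together with `%>%`. The data frame `air` and its values are made up, and `temp` is assumed to be in degrees Fahrenheit.\n",
"\n",
"```r\n",
"library(dplyr)\n",
"\n",
"# A small made-up data frame, one observation per row\n",
"air <- data.frame(\n",
"  city = c(\"chic\", \"chic\", \"ny\", \"ny\"),\n",
"  temp = c(31.5, 40.0, 35.0, 28.0),\n",
"  pm10 = c(34, 48, 20, NA)\n",
")\n",
"\n",
"result <- air %>%\n",
"  filter(!is.na(pm10)) %>%                  # keep rows with an observed pm10\n",
"  mutate(temp_c = (temp - 32) * 5 / 9) %>%  # Fahrenheit to Celsius (assumed unit)\n",
"  select(city, temp_c, pm10) %>%            # keep three columns\n",
"  arrange(desc(pm10))                       # largest pm10 first\n",
"\n",
"result\n",
"```\n",
"\n",
"Each verb takes a data frame as its first argument and returns a new one, and columns are referred to by bare name without `$`."
]
},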
{
"cell_type": "markdown",
"id": "4dcfd723",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### `select()`\n",
"\n",
"- We will use a dataset containing air pollution and temperature data for the [city of Chicago](http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip) in the U.S."
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "34c35340",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol>\n",
"<li>6940</li><li>8</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 6940\n",
"\\item 8\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 6940\n",
"2. 8\n",
"\n",
"\n"
],
"text/plain": [
"[1] 6940 8"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"'data.frame':\t6940 obs. of 8 variables:\n",
" $ city : chr \"chic\" \"chic\" \"chic\" \"chic\" ...\n",
" $ tmpd : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n",
" $ dptp : num 31.5 29.9 27.4 28.6 28.9 ...\n",
" $ date : Date, format: \"1987-01-01\" \"1987-01-02\" ...\n",
" $ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...\n",
" $ pm10tmean2: num 34 NA 34.2 47 NA ...\n",
" $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...\n",
" $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...\n"
]
}
],
"source": [
"chicago <- readRDS(\"data/chicago.rds\")\n",
"dim(chicago)\n",
"str(chicago)"
]
},
{
"cell_type": "markdown",
"id": "292e3983",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Sometimes you may want to use only a couple of variables out of many."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "59939174",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol>\n",
"<li>'city'</li><li>'tmpd'</li><li>'dptp'</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 'city'\n",
"\\item 'tmpd'\n",
"\\item 'dptp'\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 'city'\n",
"2. 'tmpd'\n",
"3. 'dptp'\n",
"\n",
"\n"
],
"text/plain": [
"[1] \"city\" \"tmpd\" \"dptp\""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
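"<table>\n",
"<caption>A data.frame: 6 × 3</caption>\n",
"<thead>\n",
"\t<tr><th></th><th>city</th><th>tmpd</th><th>dptp</th></tr>\n",
"\t<tr><th></th><th>&lt;chr&gt;</th><th>&lt;dbl&gt;</th><th>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><th>1</th><td>chic</td><td>31.5</td><td>31.500</td></tr>\n",
"\t<tr><th>2</th><td>chic</td><td>33.0</td><td>29.875</td></tr>\n",
"\t<tr><th>3</th><td>chic</td><td>33.0</td><td>27.375</td></tr>\n",
"\t<tr><th>4</th><td>chic</td><td>29.0</td><td>28.625</td></tr>\n",
"\t<tr><th>5</th><td>chic</td><td>32.0</td><td>28.875</td></tr>\n",
"\t<tr><th>6</th><td>chic</td><td>40.0</td><td>35.125</td></tr>\n",
"</tbody>\n",
"</table>\n"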
],
"text/latex": [
"A data.frame: 6 × 3\n",
"\\begin{tabular}{r|lll}\n",
" & city & tmpd & dptp\\\\\n",
" & <chr> & <dbl> & <dbl>\\\\\n",
"\\hline\n",
"\t1 & chic & 31.5 & 31.500\\\\\n",
"\t2 & chic & 33.0 & 29.875\\\\\n",
"\t3 & chic & 33.0 & 27.375\\\\\n",
"\t4 & chic & 29.0 & 28.625\\\\\n",
"\t5 & chic & 32.0 & 28.875\\\\\n",
"\t6 & chic & 40.0 & 35.125\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A data.frame: 6 × 3\n",
"\n",
"| | city <chr> | tmpd <dbl> | dptp <dbl> |\n",
"|---|---|---|---|\n",
"| 1 | chic | 31.5 | 31.500 |\n",
"| 2 | chic | 33.0 | 29.875 |\n",
"| 3 | chic | 33.0 | 27.375 |\n",
"| 4 | chic | 29.0 | 28.625 |\n",
"| 5 | chic | 32.0 | 28.875 |\n",
"| 6 | chic | 40.0 | 35.125 |\n",
"\n"
],
"text/plain": [
" city tmpd dptp \n",
"1 chic 31.5 31.500\n",
"2 chic 33.0 29.875\n",
"3 chic 33.0 27.375\n",
"4 chic 29.0 28.625\n",
"5 chic 32.0 28.875\n",
"6 chic 40.0 35.125"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"names(chicago)[1:3]\n",
"subset <- select(chicago, city:dptp)\n",
"head(subset)"
]
},
{
"cell_type": "markdown",
"id": "1e153a64",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Sometimes you may want to drop some variables that are not useful.\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "af2e4c75",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/latex": [
"A data.frame: 6940 × 5\n",
"\\begin{tabular}{r|lllll}\n",
" & date & pm25tmean2 & pm10tmean2 & o3tmean2 & no2tmean2\\\\\n",
" & <date> & <dbl> & <dbl> & <dbl> & <dbl>\\\\\n",
"\\hline\n",
"\t1 & 1987-01-01 & NA & 34.00000 & 4.250000 & 19.98810\\\\\n",
"\t2 & 1987-01-02 & NA & NA & 3.304348 & 23.19099\\\\\n",
"\t3 & 1987-01-03 & NA & 34.16667 & 3.333333 & 23.81548\\\\\n",
"\t4 & 1987-01-04 & NA & 47.00000 & 4.375000 & 30.43452\\\\\n",
"\t5 & 1987-01-05 & NA & NA & 4.750000 & 30.33333\\\\\n",
"\t6 & 1987-01-06 & NA & 48.00000 & 5.833333 & 25.77233\\\\\n",
"\t7 & 1987-01-07 & NA & 41.00000 & 9.291667 & 20.58171\\\\\n",
"\t8 & 1987-01-08 & NA & 36.00000 & 11.291667 & 17.03723\\\\\n",
"\t9 & 1987-01-09 & NA & 33.28571 & 4.500000 & 23.38889\\\\\n",
"\t10 & 1987-01-10 & NA & NA & 4.958333 & 19.54167\\\\\n",
"\t11 & 1987-01-11 & NA & 22.00000 & 17.541667 & 13.70139\\\\\n",
"\t12 & 1987-01-12 & NA & 26.00000 & 8.000000 & 33.02083\\\\\n",
"\t13 & 1987-01-13 & NA & 53.00000 & 4.958333 & 38.06142\\\\\n",
"\t14 & 1987-01-14 & NA & 43.00000 & 4.208333 & 32.19444\\\\\n",
"\t15 & 1987-01-15 & NA & 28.83333 & 4.458333 & 18.87131\\\\\n",
"\t16 & 1987-01-16 & NA & 19.00000 & 7.916667 & 19.46667\\\\\n",
"\t17 & 1987-01-17 & NA & NA & 5.833333 & 20.70833\\\\\n",
"\t18 & 1987-01-18 & NA & 39.00000 & 6.375000 & 21.03333\\\\\n",
"\t19 & 1987-01-19 & NA & 32.00000 & 14.875000 & 17.17409\\\\\n",
"\t20 & 1987-01-20 & NA & 38.00000 & 7.250000 & 21.61021\\\\\n",
"\t21 & 1987-01-21 & NA & 32.85714 & 8.913043 & 24.52083\\\\\n",
"\t22 & 1987-01-22 & NA & 52.00000 & 10.500000 & 16.98798\\\\\n",
"\t23 & 1987-01-23 & NA & 55.00000 & 14.625000 & 14.66250\\\\\n",
"\t24 & 1987-01-24 & NA & 38.00000 & 10.083333 & 18.69167\\\\\n",
"\t25 & 1987-01-25 & NA & NA & 6.666667 & 26.30417\\\\\n",
"\t26 & 1987-01-26 & NA & 71.00000 & 4.583333 & 32.42143\\\\\n",
"\t27 & 1987-01-27 & NA & 39.33333 & 6.000000 & 30.69306\\\\\n",
"\t28 & 1987-01-28 & NA & 47.00000 & 6.875000 & 29.12943\\\\\n",
"\t29 & 1987-01-29 & NA & 35.00000 & 2.916667 & 28.14529\\\\\n",
"\t30 & 1987-01-30 & NA & 59.00000 & 8.791667 & 19.79861\\\\\n",
"\t⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮\\\\\n",
"\t6911 & 2005-12-02 & NA & 19.50 & 9.156250 & 23.29167\\\\\n",
"\t6912 & 2005-12-03 & 13.34286 & 20.00 & 10.333333 & 25.19444\\\\\n",
"\t6913 & 2005-12-04 & 15.30000 & 15.50 & 13.177083 & 21.70833\\\\\n",
"\t6914 & 2005-12-05 & NA & 30.00 & 6.447917 & 28.38889\\\\\n",
"\t6915 & 2005-12-06 & 24.61667 & 33.00 & 4.701540 & 29.08333\\\\\n",
"\t6916 & 2005-12-07 & 37.80000 & 39.00 & 3.916214 & 34.30952\\\\\n",
"\t6917 & 2005-12-08 & 24.30000 & 31.00 & 5.995265 & 34.22222\\\\\n",
"\t6918 & 2005-12-09 & 25.45000 & 22.00 & 5.958333 & 31.41667\\\\\n",
"\t6919 & 2005-12-10 & 18.20000 & 30.00 & 9.135417 & 28.70833\\\\\n",
"\t6920 & 2005-12-11 & 10.60000 & 14.00 & 11.333333 & 22.55556\\\\\n",
"\t6921 & 2005-12-12 & 19.22500 & 28.75 & 5.031250 & 39.74621\\\\\n",
"\t6922 & 2005-12-13 & 26.50000 & 21.00 & 6.628623 & 29.56944\\\\\n",
"\t6923 & 2005-12-14 & 26.90000 & 16.00 & 3.802083 & 30.63384\\\\\n",
"\t6924 & 2005-12-15 & 14.40000 & 16.50 & 4.895833 & 25.43056\\\\\n",
"\t6925 & 2005-12-16 & 11.00000 & 22.00 & 11.166667 & 16.87500\\\\\n",
"\t6926 & 2005-12-17 & 13.80000 & 20.00 & 8.593750 & 20.73611\\\\\n",
"\t6927 & 2005-12-18 & 12.20000 & 17.50 & 13.552083 & 19.11111\\\\\n",
"\t6928 & 2005-12-19 & 21.15000 & 21.00 & 8.058877 & 31.79167\\\\\n",
"\t6929 & 2005-12-20 & 25.75000 & 32.00 & 3.849185 & 32.89773\\\\\n",
"\t6930 & 2005-12-21 & 37.92857 & 59.50 & 3.663949 & 34.86111\\\\\n",
"\t6931 & 2005-12-22 & 36.65000 & 42.50 & 5.385417 & 33.73026\\\\\n",
"\t6932 & 2005-12-23 & 32.90000 & 34.50 & 6.906250 & 29.08333\\\\\n",
"\t6933 & 2005-12-24 & 30.77143 & 25.20 & 1.770833 & 31.98611\\\\\n",
"\t6934 & 2005-12-25 & 6.70000 & 8.00 & 14.354167 & 13.79167\\\\\n",
"\t6935 & 2005-12-26 & 8.40000 & 8.50 & 14.041667 & 16.81944\\\\\n",
"\t6936 & 2005-12-27 & 23.56000 & 27.00 & 4.468750 & 23.50000\\\\\n",
"\t6937 & 2005-12-28 & 17.75000 & 27.50 & 3.260417 & 19.28563\\\\\n",
"\t6938 & 2005-12-29 & 7.45000 & 23.50 & 6.794837 & 19.97222\\\\\n",
"\t6939 & 2005-12-30 & 15.05714 & 19.20 & 3.034420 & 22.80556\\\\\n",
"\t6940 & 2005-12-31 & 15.00000 & 23.50 & 2.531250 & 13.25000\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A data.frame: 6940 × 5\n",
"\n",
"| | date <date> | pm25tmean2 <dbl> | pm10tmean2 <dbl> | o3tmean2 <dbl> | no2tmean2 <dbl> |\n",
"|---|---|---|---|---|---|\n",
"| 1 | 1987-01-01 | NA | 34.00000 | 4.250000 | 19.98810 |\n",
"| 2 | 1987-01-02 | NA | NA | 3.304348 | 23.19099 |\n",
"| 3 | 1987-01-03 | NA | 34.16667 | 3.333333 | 23.81548 |\n",
"| 4 | 1987-01-04 | NA | 47.00000 | 4.375000 | 30.43452 |\n",
"| 5 | 1987-01-05 | NA | NA | 4.750000 | 30.33333 |\n",
"| 6 | 1987-01-06 | NA | 48.00000 | 5.833333 | 25.77233 |\n",
"| 7 | 1987-01-07 | NA | 41.00000 | 9.291667 | 20.58171 |\n",
"| 8 | 1987-01-08 | NA | 36.00000 | 11.291667 | 17.03723 |\n",
"| 9 | 1987-01-09 | NA | 33.28571 | 4.500000 | 23.38889 |\n",
"| 10 | 1987-01-10 | NA | NA | 4.958333 | 19.54167 |\n",
"| 11 | 1987-01-11 | NA | 22.00000 | 17.541667 | 13.70139 |\n",
"| 12 | 1987-01-12 | NA | 26.00000 | 8.000000 | 33.02083 |\n",
"| 13 | 1987-01-13 | NA | 53.00000 | 4.958333 | 38.06142 |\n",
"| 14 | 1987-01-14 | NA | 43.00000 | 4.208333 | 32.19444 |\n",
"| 15 | 1987-01-15 | NA | 28.83333 | 4.458333 | 18.87131 |\n",
"| 16 | 1987-01-16 | NA | 19.00000 | 7.916667 | 19.46667 |\n",
"| 17 | 1987-01-17 | NA | NA | 5.833333 | 20.70833 |\n",
"| 18 | 1987-01-18 | NA | 39.00000 | 6.375000 | 21.03333 |\n",
"| 19 | 1987-01-19 | NA | 32.00000 | 14.875000 | 17.17409 |\n",
"| 20 | 1987-01-20 | NA | 38.00000 | 7.250000 | 21.61021 |\n",
"| 21 | 1987-01-21 | NA | 32.85714 | 8.913043 | 24.52083 |\n",
"| 22 | 1987-01-22 | NA | 52.00000 | 10.500000 | 16.98798 |\n",
"| 23 | 1987-01-23 | NA | 55.00000 | 14.625000 | 14.66250 |\n",
"| 24 | 1987-01-24 | NA | 38.00000 | 10.083333 | 18.69167 |\n",
"| 25 | 1987-01-25 | NA | NA | 6.666667 | 26.30417 |\n",
"| 26 | 1987-01-26 | NA | 71.00000 | 4.583333 | 32.42143 |\n",
"| 27 | 1987-01-27 | NA | 39.33333 | 6.000000 | 30.69306 |\n",
"| 28 | 1987-01-28 | NA | 47.00000 | 6.875000 | 29.12943 |\n",
"| 29 | 1987-01-29 | NA | 35.00000 | 2.916667 | 28.14529 |\n",
"| 30 | 1987-01-30 | NA | 59.00000 | 8.791667 | 19.79861 |\n",
"| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |\n",
"| 6911 | 2005-12-02 | NA | 19.50 | 9.156250 | 23.29167 |\n",
"| 6912 | 2005-12-03 | 13.34286 | 20.00 | 10.333333 | 25.19444 |\n",
"| 6913 | 2005-12-04 | 15.30000 | 15.50 | 13.177083 | 21.70833 |\n",
"| 6914 | 2005-12-05 | NA | 30.00 | 6.447917 | 28.38889 |\n",
"| 6915 | 2005-12-06 | 24.61667 | 33.00 | 4.701540 | 29.08333 |\n",
"| 6916 | 2005-12-07 | 37.80000 | 39.00 | 3.916214 | 34.30952 |\n",
"| 6917 | 2005-12-08 | 24.30000 | 31.00 | 5.995265 | 34.22222 |\n",
"| 6918 | 2005-12-09 | 25.45000 | 22.00 | 5.958333 | 31.41667 |\n",
"| 6919 | 2005-12-10 | 18.20000 | 30.00 | 9.135417 | 28.70833 |\n",
"| 6920 | 2005-12-11 | 10.60000 | 14.00 | 11.333333 | 22.55556 |\n",
"| 6921 | 2005-12-12 | 19.22500 | 28.75 | 5.031250 | 39.74621 |\n",
"| 6922 | 2005-12-13 | 26.50000 | 21.00 | 6.628623 | 29.56944 |\n",
"| 6923 | 2005-12-14 | 26.90000 | 16.00 | 3.802083 | 30.63384 |\n",
"| 6924 | 2005-12-15 | 14.40000 | 16.50 | 4.895833 | 25.43056 |\n",
"| 6925 | 2005-12-16 | 11.00000 | 22.00 | 11.166667 | 16.87500 |\n",
"| 6926 | 2005-12-17 | 13.80000 | 20.00 | 8.593750 | 20.73611 |\n",
"| 6927 | 2005-12-18 | 12.20000 | 17.50 | 13.552083 | 19.11111 |\n",
"| 6928 | 2005-12-19 | 21.15000 | 21.00 | 8.058877 | 31.79167 |\n",
"| 6929 | 2005-12-20 | 25.75000 | 32.00 | 3.849185 | 32.89773 |\n",
"| 6930 | 2005-12-21 | 37.92857 | 59.50 | 3.663949 | 34.86111 |\n",
"| 6931 | 2005-12-22 | 36.65000 | 42.50 | 5.385417 | 33.73026 |\n",
"| 6932 | 2005-12-23 | 32.90000 | 34.50 | 6.906250 | 29.08333 |\n",
"| 6933 | 2005-12-24 | 30.77143 | 25.20 | 1.770833 | 31.98611 |\n",
"| 6934 | 2005-12-25 | 6.70000 | 8.00 | 14.354167 | 13.79167 |\n",
"| 6935 | 2005-12-26 | 8.40000 | 8.50 | 14.041667 | 16.81944 |\n",
"| 6936 | 2005-12-27 | 23.56000 | 27.00 | 4.468750 | 23.50000 |\n",
"| 6937 | 2005-12-28 | 17.75000 | 27.50 | 3.260417 | 19.28563 |\n",
"| 6938 | 2005-12-29 | 7.45000 | 23.50 | 6.794837 | 19.97222 |\n",
"| 6939 | 2005-12-30 | 15.05714 | 19.20 | 3.034420 | 22.80556 |\n",
"| 6940 | 2005-12-31 | 15.00000 | 23.50 | 2.531250 | 13.25000 |\n",
"\n"
],
"text/plain": [
" date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2\n",
"1 1987-01-01 NA 34.00000 4.250000 19.98810 \n",
"2 1987-01-02 NA NA 3.304348 23.19099 \n",
"3 1987-01-03 NA 34.16667 3.333333 23.81548 \n",
"4 1987-01-04 NA 47.00000 4.375000 30.43452 \n",
"5 1987-01-05 NA NA 4.750000 30.33333 \n",
"6 1987-01-06 NA 48.00000 5.833333 25.77233 \n",
"7 1987-01-07 NA 41.00000 9.291667 20.58171 \n",
"8 1987-01-08 NA 36.00000 11.291667 17.03723 \n",
"9 1987-01-09 NA 33.28571 4.500000 23.38889 \n",
"10 1987-01-10 NA NA 4.958333 19.54167 \n",
"11 1987-01-11 NA 22.00000 17.541667 13.70139 \n",
"12 1987-01-12 NA 26.00000 8.000000 33.02083 \n",
"13 1987-01-13 NA 53.00000 4.958333 38.06142 \n",
"14 1987-01-14 NA 43.00000 4.208333 32.19444 \n",
"15 1987-01-15 NA 28.83333 4.458333 18.87131 \n",
"16 1987-01-16 NA 19.00000 7.916667 19.46667 \n",
"17 1987-01-17 NA NA 5.833333 20.70833 \n",
"18 1987-01-18 NA 39.00000 6.375000 21.03333 \n",
"19 1987-01-19 NA 32.00000 14.875000 17.17409 \n",
"20 1987-01-20 NA 38.00000 7.250000 21.61021 \n",
"21 1987-01-21 NA 32.85714 8.913043 24.52083 \n",
"22 1987-01-22 NA 52.00000 10.500000 16.98798 \n",
"23 1987-01-23 NA 55.00000 14.625000 14.66250 \n",
"24 1987-01-24 NA 38.00000 10.083333 18.69167 \n",
"25 1987-01-25 NA NA 6.666667 26.30417 \n",
"26 1987-01-26 NA 71.00000 4.583333 32.42143 \n",
"27 1987-01-27 NA 39.33333 6.000000 30.69306 \n",
"28 1987-01-28 NA 47.00000 6.875000 29.12943 \n",
"29 1987-01-29 NA 35.00000 2.916667 28.14529 \n",
"30 1987-01-30 NA 59.00000 8.791667 19.79861 \n",
"⋮ ⋮ ⋮ ⋮ ⋮ ⋮ \n",
"6911 2005-12-02 NA 19.50 9.156250 23.29167 \n",
"6912 2005-12-03 13.34286 20.00 10.333333 25.19444 \n",
"6913 2005-12-04 15.30000 15.50 13.177083 21.70833 \n",
"6914 2005-12-05 NA 30.00 6.447917 28.38889 \n",
"6915 2005-12-06 24.61667 33.00 4.701540 29.08333 \n",
"6916 2005-12-07 37.80000 39.00 3.916214 34.30952 \n",
"6917 2005-12-08 24.30000 31.00 5.995265 34.22222 \n",
"6918 2005-12-09 25.45000 22.00 5.958333 31.41667 \n",
"6919 2005-12-10 18.20000 30.00 9.135417 28.70833 \n",
"6920 2005-12-11 10.60000 14.00 11.333333 22.55556 \n",
"6921 2005-12-12 19.22500 28.75 5.031250 39.74621 \n",
"6922 2005-12-13 26.50000 21.00 6.628623 29.56944 \n",
"6923 2005-12-14 26.90000 16.00 3.802083 30.63384 \n",
"6924 2005-12-15 14.40000 16.50 4.895833 25.43056 \n",
"6925 2005-12-16 11.00000 22.00 11.166667 16.87500 \n",
"6926 2005-12-17 13.80000 20.00 8.593750 20.73611 \n",
"6927 2005-12-18 12.20000 17.50 13.552083 19.11111 \n",
"6928 2005-12-19 21.15000 21.00 8.058877 31.79167 \n",
"6929 2005-12-20 25.75000 32.00 3.849185 32.89773 \n",
"6930 2005-12-21 37.92857 59.50 3.663949 34.86111 \n",
"6931 2005-12-22 36.65000 42.50 5.385417 33.73026 \n",
"6932 2005-12-23 32.90000 34.50 6.906250 29.08333 \n",
"6933 2005-12-24 30.77143 25.20 1.770833 31.98611 \n",
"6934 2005-12-25 6.70000 8.00 14.354167 13.79167 \n",
"6935 2005-12-26 8.40000 8.50 14.041667 16.81944 \n",
"6936 2005-12-27 23.56000 27.00 4.468750 23.50000 \n",
"6937 2005-12-28 17.75000 27.50 3.260417 19.28563 \n",
"6938 2005-12-29 7.45000 23.50 6.794837 19.97222 \n",
"6939 2005-12-30 15.05714 19.20 3.034420 22.80556 \n",
"6940 2005-12-31 15.00000 23.50 2.531250 13.25000 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"select(chicago, -(city:dptp))"
]
},
{
"cell_type": "markdown",
"id": "10e5c2c9",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- To keep every variable whose name ends with \"2\", we can use `ends_with()`:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "e6c24fc3",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"'data.frame':\t6940 obs. of 4 variables:\n",
" $ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...\n",
" $ pm10tmean2: num 34 NA 34.2 47 NA ...\n",
" $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...\n",
" $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...\n"
]
}
],
"source": [
"subset <- select(chicago, ends_with(\"2\"))\n",
"str(subset)"
]
},
{
"cell_type": "markdown",
"id": "6c5f3342",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Or, to keep every variable whose name starts with \"d\", we can use `starts_with()`:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "1921b790",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"'data.frame':\t6940 obs. of 2 variables:\n",
" $ dptp: num 31.5 29.9 27.4 28.6 28.9 ...\n",
" $ date: Date, format: \"1987-01-01\" \"1987-01-02\" ...\n"
]
}
],
"source": [
"subset <- select(chicago, starts_with(\"d\"))\n",
"str(subset)"
]
},
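{
"cell_type": "markdown",
"id": "sketch-select-helpers-md",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- The selection helpers can be tried on a small data frame. The sketch below uses a hypothetical `toy` data frame whose column names only mimic the `chicago` data; `contains()` is a further helper that was not used above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-select-helpers-code",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"library(dplyr)\n",
"\n",
"# Hypothetical toy data; the column names mimic the chicago data frame\n",
"toy <- data.frame(\n",
"  date = as.Date(\"2005-01-01\") + 0:2,\n",
"  dptp = c(31.5, 29.9, 27.4),\n",
"  pm10tmean2 = c(34, NA, 34.2),\n",
"  o3tmean2 = c(4.25, 3.30, 3.33)\n",
")\n",
"\n",
"names(select(toy, ends_with(\"2\")))    # names ending in \"2\"\n",
"names(select(toy, starts_with(\"d\")))  # names starting with \"d\"\n",
"names(select(toy, contains(\"mean\")))  # names containing \"mean\""
]
},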
{
"cell_type": "markdown",
"id": "435eaf63",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### `filter()`\n",
"\n",
"- The `filter()` function extracts the subset of rows of a data frame that satisfy a logical condition. Below we keep the days on which PM2.5 is greater than 30.\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "84a28bc2",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"'data.frame':\t194 obs. of 8 variables:\n",
" $ city : chr \"chic\" \"chic\" \"chic\" \"chic\" ...\n",
" $ tmpd : num 23 28 55 59 57 57 75 61 73 78 ...\n",
" $ dptp : num 21.9 25.8 51.3 53.7 52 56 65.8 59 60.3 67.1 ...\n",
" $ date : Date, format: \"1998-01-17\" \"1998-01-23\" ...\n",
" $ pm25tmean2: num 38.1 34 39.4 35.4 33.3 ...\n",
" $ pm10tmean2: num 32.5 38.7 34 28.5 35 ...\n",
" $ o3tmean2 : num 3.18 1.75 10.79 14.3 20.66 ...\n",
" $ no2tmean2 : num 25.3 29.4 25.3 31.4 26.8 ...\n"
]
},
{
"data": {
"text/plain": [
" Min. 1st Qu. Median Mean 3rd Qu. Max. \n",
" 30.05 32.12 35.04 36.63 39.53 61.50 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"chic.f <- filter(chicago, pm25tmean2 > 30)\n",
"str(chic.f)\n",
"summary(chic.f$pm25tmean2)"
]
},
{
"cell_type": "markdown",
"id": "f767ec08",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Conditions can be combined with logical operators. For example, we can extract the rows where PM2.5 is greater than 30 *and* temperature is greater than 80 degrees Fahrenheit."
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "b4b3ff2a",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
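"<table class=\"dataframe\">\n",
"<caption>A data.frame: 17 × 3</caption>\n",
"<thead>\n",
"\t<tr><th scope=col>date</th><th scope=col>tmpd</th><th scope=col>pm25tmean2</th></tr>\n",
"\t<tr><th scope=col>&lt;date&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><td>1998-08-23</td><td>81</td><td>39.60000</td></tr>\n",
"\t<tr><td>1998-09-06</td><td>81</td><td>31.50000</td></tr>\n",
"\t<tr><td>2001-07-20</td><td>82</td><td>32.30000</td></tr>\n",
"\t<tr><td>2001-08-01</td><td>84</td><td>43.70000</td></tr>\n",
"\t<tr><td>2001-08-08</td><td>85</td><td>38.83750</td></tr>\n",
"\t<tr><td>2001-08-09</td><td>84</td><td>38.20000</td></tr>\n",
"\t<tr><td>2002-06-20</td><td>82</td><td>33.00000</td></tr>\n",
"\t<tr><td>2002-06-23</td><td>82</td><td>42.50000</td></tr>\n",
"\t<tr><td>2002-07-08</td><td>81</td><td>33.10000</td></tr>\n",
"\t<tr><td>2002-07-18</td><td>82</td><td>38.85000</td></tr>\n",
"\t<tr><td>2003-06-25</td><td>82</td><td>33.90000</td></tr>\n",
"\t<tr><td>2003-07-04</td><td>84</td><td>32.90000</td></tr>\n",
"\t<tr><td>2005-06-24</td><td>86</td><td>31.85714</td></tr>\n",
"\t<tr><td>2005-06-27</td><td>82</td><td>51.53750</td></tr>\n",
"\t<tr><td>2005-06-28</td><td>85</td><td>31.20000</td></tr>\n",
"\t<tr><td>2005-07-17</td><td>84</td><td>32.70000</td></tr>\n",
"\t<tr><td>2005-08-03</td><td>84</td><td>37.90000</td></tr>\n",
"</tbody>\n",
"</table>\n"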
],
"text/latex": [
"A data.frame: 17 × 3\n",
"\\begin{tabular}{lll}\n",
" date & tmpd & pm25tmean2\\\\\n",
" & & \\\\\n",
"\\hline\n",
"\t 1998-08-23 & 81 & 39.60000\\\\\n",
"\t 1998-09-06 & 81 & 31.50000\\\\\n",
"\t 2001-07-20 & 82 & 32.30000\\\\\n",
"\t 2001-08-01 & 84 & 43.70000\\\\\n",
"\t 2001-08-08 & 85 & 38.83750\\\\\n",
"\t 2001-08-09 & 84 & 38.20000\\\\\n",
"\t 2002-06-20 & 82 & 33.00000\\\\\n",
"\t 2002-06-23 & 82 & 42.50000\\\\\n",
"\t 2002-07-08 & 81 & 33.10000\\\\\n",
"\t 2002-07-18 & 82 & 38.85000\\\\\n",
"\t 2003-06-25 & 82 & 33.90000\\\\\n",
"\t 2003-07-04 & 84 & 32.90000\\\\\n",
"\t 2005-06-24 & 86 & 31.85714\\\\\n",
"\t 2005-06-27 & 82 & 51.53750\\\\\n",
"\t 2005-06-28 & 85 & 31.20000\\\\\n",
"\t 2005-07-17 & 84 & 32.70000\\\\\n",
"\t 2005-08-03 & 84 & 37.90000\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A data.frame: 17 × 3\n",
"\n",
"| date <date> | tmpd <dbl> | pm25tmean2 <dbl> |\n",
"|---|---|---|\n",
"| 1998-08-23 | 81 | 39.60000 |\n",
"| 1998-09-06 | 81 | 31.50000 |\n",
"| 2001-07-20 | 82 | 32.30000 |\n",
"| 2001-08-01 | 84 | 43.70000 |\n",
"| 2001-08-08 | 85 | 38.83750 |\n",
"| 2001-08-09 | 84 | 38.20000 |\n",
"| 2002-06-20 | 82 | 33.00000 |\n",
"| 2002-06-23 | 82 | 42.50000 |\n",
"| 2002-07-08 | 81 | 33.10000 |\n",
"| 2002-07-18 | 82 | 38.85000 |\n",
"| 2003-06-25 | 82 | 33.90000 |\n",
"| 2003-07-04 | 84 | 32.90000 |\n",
"| 2005-06-24 | 86 | 31.85714 |\n",
"| 2005-06-27 | 82 | 51.53750 |\n",
"| 2005-06-28 | 85 | 31.20000 |\n",
"| 2005-07-17 | 84 | 32.70000 |\n",
"| 2005-08-03 | 84 | 37.90000 |\n",
"\n"
],
"text/plain": [
" date tmpd pm25tmean2\n",
"1 1998-08-23 81 39.60000 \n",
"2 1998-09-06 81 31.50000 \n",
"3 2001-07-20 82 32.30000 \n",
"4 2001-08-01 84 43.70000 \n",
"5 2001-08-08 85 38.83750 \n",
"6 2001-08-09 84 38.20000 \n",
"7 2002-06-20 82 33.00000 \n",
"8 2002-06-23 82 42.50000 \n",
"9 2002-07-08 81 33.10000 \n",
"10 2002-07-18 82 38.85000 \n",
"11 2003-06-25 82 33.90000 \n",
"12 2003-07-04 84 32.90000 \n",
"13 2005-06-24 86 31.85714 \n",
"14 2005-06-27 82 51.53750 \n",
"15 2005-06-28 85 31.20000 \n",
"16 2005-07-17 84 32.70000 \n",
"17 2005-08-03 84 37.90000 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)\n",
"select(chic.f, date, tmpd, pm25tmean2)"
]
},
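{
"cell_type": "markdown",
"id": "sketch-filter-conditions-md",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- A minimal sketch of how `filter()` combines conditions, using a hypothetical `toy` data frame rather than the `chicago` data. Rows for which the condition evaluates to `NA` are dropped, which is why days with missing PM2.5 never appear in the filtered results."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-filter-conditions-code",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"library(dplyr)\n",
"\n",
"# Hypothetical toy data: temperature and PM2.5 with one missing value\n",
"toy <- data.frame(\n",
"  tmpd = c(75, 82, 85, 90),\n",
"  pm25 = c(35, 28, NA, 40)\n",
")\n",
"\n",
"filter(toy, pm25 > 30 & tmpd > 80)  # both conditions must hold\n",
"filter(toy, pm25 > 30, tmpd > 80)   # comma-separated conditions are also combined with AND\n",
"filter(toy, pm25 > 30 | tmpd > 80)  # at least one condition must hold"
]
},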
{
"cell_type": "markdown",
"id": "b37953e2",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### `arrange()`\n",
"\n",
"The `arrange()` function is used to reorder the rows of a data frame according to the values of one or more variables/columns.\n",
"\n",
"- Here we can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "1a734fd9",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
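"<table class=\"dataframe\">\n",
"<caption>A data.frame: 3 × 2</caption>\n",
"<thead>\n",
"\t<tr><th></th><th scope=col>date</th><th scope=col>pm25tmean2</th></tr>\n",
"\t<tr><th></th><th scope=col>&lt;date&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><th scope=row>1</th><td>1987-01-01</td><td>NA</td></tr>\n",
"\t<tr><th scope=row>2</th><td>1987-01-02</td><td>NA</td></tr>\n",
"\t<tr><th scope=row>3</th><td>1987-01-03</td><td>NA</td></tr>\n",
"</tbody>\n",
"</table>\n"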
],
"text/latex": [
"A data.frame: 3 × 2\n",
"\\begin{tabular}{r|ll}\n",
" & date & pm25tmean2\\\\\n",
" & & \\\\\n",
"\\hline\n",
"\t1 & 1987-01-01 & NA\\\\\n",
"\t2 & 1987-01-02 & NA\\\\\n",
"\t3 & 1987-01-03 & NA\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A data.frame: 3 × 2\n",
"\n",
"| | date <date> | pm25tmean2 <dbl> |\n",
"|---|---|---|\n",
"| 1 | 1987-01-01 | NA |\n",
"| 2 | 1987-01-02 | NA |\n",
"| 3 | 1987-01-03 | NA |\n",
"\n"
],
"text/plain": [
" date pm25tmean2\n",
"1 1987-01-01 NA \n",
"2 1987-01-02 NA \n",
"3 1987-01-03 NA "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
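"<table class=\"dataframe\">\n",
"<caption>A data.frame: 3 × 2</caption>\n",
"<thead>\n",
"\t<tr><th></th><th scope=col>date</th><th scope=col>pm25tmean2</th></tr>\n",
"\t<tr><th></th><th scope=col>&lt;date&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><th scope=row>6938</th><td>2005-12-29</td><td> 7.45000</td></tr>\n",
"\t<tr><th scope=row>6939</th><td>2005-12-30</td><td>15.05714</td></tr>\n",
"\t<tr><th scope=row>6940</th><td>2005-12-31</td><td>15.00000</td></tr>\n",
"</tbody>\n",
"</table>\n"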
],
"text/latex": [
"A data.frame: 3 × 2\n",
"\\begin{tabular}{r|ll}\n",
" & date & pm25tmean2\\\\\n",
" & & \\\\\n",
"\\hline\n",
"\t6938 & 2005-12-29 & 7.45000\\\\\n",
"\t6939 & 2005-12-30 & 15.05714\\\\\n",
"\t6940 & 2005-12-31 & 15.00000\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A data.frame: 3 × 2\n",
"\n",
"| | date <date> | pm25tmean2 <dbl> |\n",
"|---|---|---|\n",
"| 6938 | 2005-12-29 | 7.45000 |\n",
"| 6939 | 2005-12-30 | 15.05714 |\n",
"| 6940 | 2005-12-31 | 15.00000 |\n",
"\n"
],
"text/plain": [
" date pm25tmean2\n",
"6938 2005-12-29 7.45000 \n",
"6939 2005-12-30 15.05714 \n",
"6940 2005-12-31 15.00000 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"chicago <- arrange(chicago, date)\n",
"head(select(chicago, date, pm25tmean2), 3)\n",
"tail(select(chicago, date, pm25tmean2), 3)"
]
},
{
"cell_type": "markdown",
"id": "dffd7ce9",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Rows can also be arranged in descending order of a column by using the `desc()` helper function. Looking at the first three and last three rows shows the dates in descending order."
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "131db5b6",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
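"<table class=\"dataframe\">\n",
"<caption>A data.frame: 3 × 2</caption>\n",
"<thead>\n",
"\t<tr><th></th><th scope=col>date</th><th scope=col>pm25tmean2</th></tr>\n",
"\t<tr><th></th><th scope=col>&lt;date&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><th scope=row>1</th><td>2005-12-31</td><td>15.00000</td></tr>\n",
"\t<tr><th scope=row>2</th><td>2005-12-30</td><td>15.05714</td></tr>\n",
"\t<tr><th scope=row>3</th><td>2005-12-29</td><td> 7.45000</td></tr>\n",
"</tbody>\n",
"</table>\n"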
],
"text/latex": [
"A data.frame: 3 × 2\n",
"\\begin{tabular}{r|ll}\n",
" & date & pm25tmean2\\\\\n",
" & & \\\\\n",
"\\hline\n",
"\t1 & 2005-12-31 & 15.00000\\\\\n",
"\t2 & 2005-12-30 & 15.05714\\\\\n",
"\t3 & 2005-12-29 & 7.45000\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A data.frame: 3 × 2\n",
"\n",
"| | date <date> | pm25tmean2 <dbl> |\n",
"|---|---|---|\n",
"| 1 | 2005-12-31 | 15.00000 |\n",
"| 2 | 2005-12-30 | 15.05714 |\n",
"| 3 | 2005-12-29 | 7.45000 |\n",
"\n"
],
"text/plain": [
" date pm25tmean2\n",
"1 2005-12-31 15.00000 \n",
"2 2005-12-30 15.05714 \n",
"3 2005-12-29 7.45000 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
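"<table class=\"dataframe\">\n",
"<caption>A data.frame: 3 × 2</caption>\n",
"<thead>\n",
"\t<tr><th></th><th scope=col>date</th><th scope=col>pm25tmean2</th></tr>\n",
"\t<tr><th></th><th scope=col>&lt;date&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><th scope=row>6938</th><td>1987-01-03</td><td>NA</td></tr>\n",
"\t<tr><th scope=row>6939</th><td>1987-01-02</td><td>NA</td></tr>\n",
"\t<tr><th scope=row>6940</th><td>1987-01-01</td><td>NA</td></tr>\n",
"</tbody>\n",
"</table>\n"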
],
"text/latex": [
"A data.frame: 3 × 2\n",
"\\begin{tabular}{r|ll}\n",
" & date & pm25tmean2\\\\\n",
" & & \\\\\n",
"\\hline\n",
"\t6938 & 1987-01-03 & NA\\\\\n",
"\t6939 & 1987-01-02 & NA\\\\\n",
"\t6940 & 1987-01-01 & NA\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A data.frame: 3 × 2\n",
"\n",
"| | date <date> | pm25tmean2 <dbl> |\n",
"|---|---|---|\n",
"| 6938 | 1987-01-03 | NA |\n",
"| 6939 | 1987-01-02 | NA |\n",
"| 6940 | 1987-01-01 | NA |\n",
"\n"
],
"text/plain": [
" date pm25tmean2\n",
"6938 1987-01-03 NA \n",
"6939 1987-01-02 NA \n",
"6940 1987-01-01 NA "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"chicago2 <- arrange(chicago, desc(date))\n",
"head(select(chicago2, date, pm25tmean2), 3)\n",
"tail(select(chicago2, date, pm25tmean2), 3)"
]
},
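{
"cell_type": "markdown",
"id": "sketch-arrange-multi-md",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- `arrange()` also accepts several sorting keys. As a small sketch on a hypothetical `toy` data frame (not the `chicago` data), we sort by year in ascending order and, within each year, by PM2.5 in descending order. Missing values are placed at the end even when `desc()` is used."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-arrange-multi-code",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"library(dplyr)\n",
"\n",
"# Hypothetical toy data with two years and one missing PM2.5 value\n",
"toy <- data.frame(\n",
"  year = c(2005, 2004, 2005, 2004),\n",
"  pm25 = c(15, 7, 30, NA)\n",
")\n",
"\n",
"# First key: year (ascending); second key: pm25 (descending within year)\n",
"arrange(toy, year, desc(pm25))"
]
},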
{
"cell_type": "markdown",
"id": "7fc22718",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### `rename()`\n",
"\n",
"Renaming a variable in a data frame in base R is surprisingly awkward. The `rename()` function is designed to make this process easier, with the syntax `new_name = old_name`.\n",
"\n",
"- Below we first show the first five variables of the `chicago` data frame, and then rename the awkward names `dptp` and `pm25tmean2` to `dewpoint` and `pm25`.\n"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "387eaecb",
"metadata": {
"scrolled": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
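"<table class=\"dataframe\">\n",
"<caption>A data.frame: 3 × 5</caption>\n",
"<thead>\n",
"\t<tr><th></th><th scope=col>city</th><th scope=col>tmpd</th><th scope=col>dptp</th><th scope=col>date</th><th scope=col>pm25tmean2</th></tr>\n",
"\t<tr><th></th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;date&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><th scope=row>1</th><td>chic</td><td>31.5</td><td>31.500</td><td>1987-01-01</td><td>NA</td></tr>\n",
"\t<tr><th scope=row>2</th><td>chic</td><td>33.0</td><td>29.875</td><td>1987-01-02</td><td>NA</td></tr>\n",
"\t<tr><th scope=row>3</th><td>chic</td><td>33.0</td><td>27.375</td><td>1987-01-03</td><td>NA</td></tr>\n",
"</tbody>\n",
"</table>\n"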
],
"text/latex": [
"A data.frame: 3 × 5\n",
"\\begin{tabular}{r|lllll}\n",
" & city & tmpd & dptp & date & pm25tmean2\\\\\n",
" & & & & & \\\\\n",
"\\hline\n",
"\t1 & chic & 31.5 & 31.500 & 1987-01-01 & NA\\\\\n",
"\t2 & chic & 33.0 & 29.875 & 1987-01-02 & NA\\\\\n",
"\t3 & chic & 33.0 & 27.375 & 1987-01-03 & NA\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A data.frame: 3 × 5\n",
"\n",
"| | city <chr> | tmpd <dbl> | dptp <dbl> | date <date> | pm25tmean2 <dbl> |\n",
"|---|---|---|---|---|---|\n",
"| 1 | chic | 31.5 | 31.500 | 1987-01-01 | NA |\n",
"| 2 | chic | 33.0 | 29.875 | 1987-01-02 | NA |\n",
"| 3 | chic | 33.0 | 27.375 | 1987-01-03 | NA |\n",
"\n"
],
"text/plain": [
" city tmpd dptp date pm25tmean2\n",
"1 chic 31.5 31.500 1987-01-01 NA \n",
"2 chic 33.0 29.875 1987-01-02 NA \n",
"3 chic 33.0 27.375 1987-01-03 NA "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
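"<table class=\"dataframe\">\n",
"<caption>A data.frame: 3 × 5</caption>\n",
"<thead>\n",
"\t<tr><th></th><th scope=col>city</th><th scope=col>tmpd</th><th scope=col>dewpoint</th><th scope=col>date</th><th scope=col>pm25</th></tr>\n",
"\t<tr><th></th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;date&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><th scope=row>1</th><td>chic</td><td>31.5</td><td>31.500</td><td>1987-01-01</td><td>NA</td></tr>\n",
"\t<tr><th scope=row>2</th><td>chic</td><td>33.0</td><td>29.875</td><td>1987-01-02</td><td>NA</td></tr>\n",
"\t<tr><th scope=row>3</th><td>chic</td><td>33.0</td><td>27.375</td><td>1987-01-03</td><td>NA</td></tr>\n",
"</tbody>\n",
"</table>\n"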
],
"text/latex": [
"A data.frame: 3 × 5\n",
"\\begin{tabular}{r|lllll}\n",
" & city & tmpd & dewpoint & date & pm25\\\\\n",
" & & & & & \\\\\n",
"\\hline\n",
"\t1 & chic & 31.5 & 31.500 & 1987-01-01 & NA\\\\\n",
"\t2 & chic & 33.0 & 29.875 & 1987-01-02 & NA\\\\\n",
"\t3 & chic & 33.0 & 27.375 & 1987-01-03 & NA\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A data.frame: 3 × 5\n",
"\n",
"| | city <chr> | tmpd <dbl> | dewpoint <dbl> | date <date> | pm25 <dbl> |\n",
"|---|---|---|---|---|---|\n",
"| 1 | chic | 31.5 | 31.500 | 1987-01-01 | NA |\n",
"| 2 | chic | 33.0 | 29.875 | 1987-01-02 | NA |\n",
"| 3 | chic | 33.0 | 27.375 | 1987-01-03 | NA |\n",
"\n"
],
"text/plain": [
" city tmpd dewpoint date pm25\n",
"1 chic 31.5 31.500 1987-01-01 NA \n",
"2 chic 33.0 29.875 1987-01-02 NA \n",
"3 chic 33.0 27.375 1987-01-03 NA "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"head(chicago[, 1:5], 3)\n",
"chicago3 <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)\n",
"head(chicago3[, 1:5], 3) # with new variable name"
]
},
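{
"cell_type": "markdown",
"id": "sketch-rename-compare-md",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- To see what `rename()` saves us, the sketch below renames columns of a hypothetical `toy` data frame twice: once with base R indexing on `names()`, and once with `rename()`. The object names `base_way` and `dplyr_way` are only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-rename-compare-code",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"library(dplyr)\n",
"\n",
"# Hypothetical toy data with the same awkward names as in chicago\n",
"toy <- data.frame(dptp = c(31.5, 29.9), pm25tmean2 = c(NA, 18.3))\n",
"\n",
"# Base R: locate each old name and overwrite it\n",
"base_way <- toy\n",
"names(base_way)[names(base_way) == \"dptp\"] <- \"dewpoint\"\n",
"names(base_way)[names(base_way) == \"pm25tmean2\"] <- \"pm25\"\n",
"\n",
"# dplyr: new_name = old_name, several renames in one call\n",
"dplyr_way <- rename(toy, dewpoint = dptp, pm25 = pm25tmean2)\n",
"\n",
"identical(names(base_way), names(dplyr_way))"
]
},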
{
"cell_type": "markdown",
"id": "fb427d38",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### `mutate()`\n",
"\n",
"The `mutate()` function exists to compute transformations of variables in a data frame.\n",
"\n",
"- For example, with air pollution data, we often want to *detrend* a series. The simplest version subtracts the mean of the series, computed while ignoring missing values."
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "5713cbfe",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
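"<table class=\"dataframe\">\n",
"<caption>A data.frame: 6 × 9</caption>\n",
"<thead>\n",
"\t<tr><th></th><th scope=col>city</th><th scope=col>tmpd</th><th scope=col>dewpoint</th><th scope=col>date</th><th scope=col>pm25</th><th scope=col>pm10tmean2</th><th scope=col>o3tmean2</th><th scope=col>no2tmean2</th><th scope=col>pm25detrend</th></tr>\n",
"\t<tr><th></th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;date&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><th scope=row>1</th><td>chic</td><td>31.5</td><td>31.500</td><td>1987-01-01</td><td>NA</td><td>34.00000</td><td>4.250000</td><td>19.98810</td><td>NA</td></tr>\n",
"\t<tr><th scope=row>2</th><td>chic</td><td>33.0</td><td>29.875</td><td>1987-01-02</td><td>NA</td><td>      NA</td><td>3.304348</td><td>23.19099</td><td>NA</td></tr>\n",
"\t<tr><th scope=row>3</th><td>chic</td><td>33.0</td><td>27.375</td><td>1987-01-03</td><td>NA</td><td>34.16667</td><td>3.333333</td><td>23.81548</td><td>NA</td></tr>\n",
"\t<tr><th scope=row>4</th><td>chic</td><td>29.0</td><td>28.625</td><td>1987-01-04</td><td>NA</td><td>47.00000</td><td>4.375000</td><td>30.43452</td><td>NA</td></tr>\n",
"\t<tr><th scope=row>5</th><td>chic</td><td>32.0</td><td>28.875</td><td>1987-01-05</td><td>NA</td><td>      NA</td><td>4.750000</td><td>30.33333</td><td>NA</td></tr>\n",
"\t<tr><th scope=row>6</th><td>chic</td><td>40.0</td><td>35.125</td><td>1987-01-06</td><td>NA</td><td>48.00000</td><td>5.833333</td><td>25.77233</td><td>NA</td></tr>\n",
"</tbody>\n",
"</table>\n"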
],
"text/latex": [
"A data.frame: 6 × 9\n",
"\\begin{tabular}{r|lllllllll}\n",
" & city & tmpd & dewpoint & date & pm25 & pm10tmean2 & o3tmean2 & no2tmean2 & pm25detrend\\\\\n",
" & & & & & & & & & \\\\\n",
"\\hline\n",
"\t1 & chic & 31.5 & 31.500 & 1987-01-01 & NA & 34.00000 & 4.250000 & 19.98810 & NA\\\\\n",
"\t2 & chic & 33.0 & 29.875 & 1987-01-02 & NA & NA & 3.304348 & 23.19099 & NA\\\\\n",
"\t3 & chic & 33.0 & 27.375 & 1987-01-03 & NA & 34.16667 & 3.333333 & 23.81548 & NA\\\\\n",
"\t4 & chic & 29.0 & 28.625 & 1987-01-04 & NA & 47.00000 & 4.375000 & 30.43452 & NA\\\\\n",
"\t5 & chic & 32.0 & 28.875 & 1987-01-05 & NA & NA & 4.750000 & 30.33333 & NA\\\\\n",
"\t6 & chic & 40.0 & 35.125 & 1987-01-06 & NA & 48.00000 & 5.833333 & 25.77233 & NA\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A data.frame: 6 × 9\n",
"\n",
"| | city <chr> | tmpd <dbl> | dewpoint <dbl> | date <date> | pm25 <dbl> | pm10tmean2 <dbl> | o3tmean2 <dbl> | no2tmean2 <dbl> | pm25detrend <dbl> |\n",
"|---|---|---|---|---|---|---|---|---|---|\n",
"| 1 | chic | 31.5 | 31.500 | 1987-01-01 | NA | 34.00000 | 4.250000 | 19.98810 | NA |\n",
"| 2 | chic | 33.0 | 29.875 | 1987-01-02 | NA | NA | 3.304348 | 23.19099 | NA |\n",
"| 3 | chic | 33.0 | 27.375 | 1987-01-03 | NA | 34.16667 | 3.333333 | 23.81548 | NA |\n",
"| 4 | chic | 29.0 | 28.625 | 1987-01-04 | NA | 47.00000 | 4.375000 | 30.43452 | NA |\n",
"| 5 | chic | 32.0 | 28.875 | 1987-01-05 | NA | NA | 4.750000 | 30.33333 | NA |\n",
"| 6 | chic | 40.0 | 35.125 | 1987-01-06 | NA | 48.00000 | 5.833333 | 25.77233 | NA |\n",
"\n"
],
"text/plain": [
" city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2 pm25detrend\n",
"1 chic 31.5 31.500 1987-01-01 NA 34.00000 4.250000 19.98810 NA \n",
"2 chic 33.0 29.875 1987-01-02 NA NA 3.304348 23.19099 NA \n",
"3 chic 33.0 27.375 1987-01-03 NA 34.16667 3.333333 23.81548 NA \n",
"4 chic 29.0 28.625 1987-01-04 NA 47.00000 4.375000 30.43452 NA \n",
"5 chic 32.0 28.875 1987-01-05 NA NA 4.750000 30.33333 NA \n",
"6 chic 40.0 35.125 1987-01-06 NA 48.00000 5.833333 25.77233 NA "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"chicago4 <- mutate(chicago3, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))\n",
"head(chicago4)"
]
},
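{
"cell_type": "markdown",
"id": "sketch-mutate-chain-md",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- The first rows shown above are all `NA`, so the effect of the transformation is hard to see. A minimal sketch on a hypothetical `toy` data frame makes it visible, and also shows that a column created inside `mutate()` can be used by later arguments of the same call. The column `above_mean` is an illustrative name, not part of the original data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-mutate-chain-code",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"library(dplyr)\n",
"\n",
"# Hypothetical toy data: the mean of the non-missing values is 20\n",
"toy <- data.frame(pm25 = c(10, NA, 20, 30))\n",
"\n",
"mutate(toy,\n",
"       pm25detrend = pm25 - mean(pm25, na.rm = TRUE),  # centre around the mean\n",
"       above_mean = pm25detrend > 0)                   # uses the column just created"
]
},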
{
"cell_type": "markdown",
"id": "1af51119",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### `group_by()`\n",
"\n",
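"The `group_by()` function splits a data frame into strata defined by a variable, so that `summarize()` can then compute summary statistics within each stratum. For example, in this air pollution dataset, you might want to know what the average annual level of PM2.5 is.\n",
"\n",
"- First, we can create a `year` variable using `as.POSIXlt()`.\n",
"\n",
"- Now we can create a separate data frame that splits the original data frame by year.\n",
"\n",
"- Finally, `summarize()` computes the yearly mean of PM2.5, the yearly maximum of ozone and the yearly median of nitrogen dioxide. Years without any PM2.5 measurement (1987 to 1997) give `NaN`."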
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "0090cff3",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
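"<table class=\"dataframe\">\n",
"<caption>A tibble: 19 × 4</caption>\n",
"<thead>\n",
"\t<tr><th scope=col>year</th><th scope=col>pm25</th><th scope=col>o3</th><th scope=col>no2</th></tr>\n",
"\t<tr><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><td>1987</td><td>     NaN</td><td>62.96966</td><td>23.49369</td></tr>\n",
"\t<tr><td>1988</td><td>     NaN</td><td>61.67708</td><td>24.52296</td></tr>\n",
"\t<tr><td>1989</td><td>     NaN</td><td>59.72727</td><td>26.14062</td></tr>\n",
"\t<tr><td>1990</td><td>     NaN</td><td>52.22917</td><td>22.59583</td></tr>\n",
"\t<tr><td>1991</td><td>     NaN</td><td>63.10417</td><td>21.38194</td></tr>\n",
"\t<tr><td>1992</td><td>     NaN</td><td>50.82870</td><td>24.78921</td></tr>\n",
"\t<tr><td>1993</td><td>     NaN</td><td>44.30093</td><td>25.76993</td></tr>\n",
"\t<tr><td>1994</td><td>     NaN</td><td>52.17844</td><td>28.47500</td></tr>\n",
"\t<tr><td>1995</td><td>     NaN</td><td>66.58750</td><td>27.26042</td></tr>\n",
"\t<tr><td>1996</td><td>     NaN</td><td>58.39583</td><td>26.38715</td></tr>\n",
"\t<tr><td>1997</td><td>     NaN</td><td>56.54167</td><td>25.48143</td></tr>\n",
"\t<tr><td>1998</td><td>18.26467</td><td>50.66250</td><td>24.58649</td></tr>\n",
"\t<tr><td>1999</td><td>18.49646</td><td>57.48864</td><td>24.66667</td></tr>\n",
"\t<tr><td>2000</td><td>16.93806</td><td>55.76103</td><td>23.46082</td></tr>\n",
"\t<tr><td>2001</td><td>16.92632</td><td>51.81984</td><td>25.06522</td></tr>\n",
"\t<tr><td>2002</td><td>15.27335</td><td>54.88043</td><td>22.73750</td></tr>\n",
"\t<tr><td>2003</td><td>15.23183</td><td>56.16608</td><td>24.62500</td></tr>\n",
"\t<tr><td>2004</td><td>14.62864</td><td>44.48240</td><td>23.39130</td></tr>\n",
"\t<tr><td>2005</td><td>16.18556</td><td>58.84126</td><td>22.62387</td></tr>\n",
"</tbody>\n",
"</table>\n"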
],
"text/latex": [
"A tibble: 19 × 4\n",
"\\begin{tabular}{llll}\n",
" year & pm25 & o3 & no2\\\\\n",
" & & & \\\\\n",
"\\hline\n",
"\t 1987 & NaN & 62.96966 & 23.49369\\\\\n",
"\t 1988 & NaN & 61.67708 & 24.52296\\\\\n",
"\t 1989 & NaN & 59.72727 & 26.14062\\\\\n",
"\t 1990 & NaN & 52.22917 & 22.59583\\\\\n",
"\t 1991 & NaN & 63.10417 & 21.38194\\\\\n",
"\t 1992 & NaN & 50.82870 & 24.78921\\\\\n",
"\t 1993 & NaN & 44.30093 & 25.76993\\\\\n",
"\t 1994 & NaN & 52.17844 & 28.47500\\\\\n",
"\t 1995 & NaN & 66.58750 & 27.26042\\\\\n",
"\t 1996 & NaN & 58.39583 & 26.38715\\\\\n",
"\t 1997 & NaN & 56.54167 & 25.48143\\\\\n",
"\t 1998 & 18.26467 & 50.66250 & 24.58649\\\\\n",
"\t 1999 & 18.49646 & 57.48864 & 24.66667\\\\\n",
"\t 2000 & 16.93806 & 55.76103 & 23.46082\\\\\n",
"\t 2001 & 16.92632 & 51.81984 & 25.06522\\\\\n",
"\t 2002 & 15.27335 & 54.88043 & 22.73750\\\\\n",
"\t 2003 & 15.23183 & 56.16608 & 24.62500\\\\\n",
"\t 2004 & 14.62864 & 44.48240 & 23.39130\\\\\n",
"\t 2005 & 16.18556 & 58.84126 & 22.62387\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A tibble: 19 × 4\n",
"\n",
"| year <dbl> | pm25 <dbl> | o3 <dbl> | no2 <dbl> |\n",
"|---|---|---|---|\n",
"| 1987 | NaN | 62.96966 | 23.49369 |\n",
"| 1988 | NaN | 61.67708 | 24.52296 |\n",
"| 1989 | NaN | 59.72727 | 26.14062 |\n",
"| 1990 | NaN | 52.22917 | 22.59583 |\n",
"| 1991 | NaN | 63.10417 | 21.38194 |\n",
"| 1992 | NaN | 50.82870 | 24.78921 |\n",
"| 1993 | NaN | 44.30093 | 25.76993 |\n",
"| 1994 | NaN | 52.17844 | 28.47500 |\n",
"| 1995 | NaN | 66.58750 | 27.26042 |\n",
"| 1996 | NaN | 58.39583 | 26.38715 |\n",
"| 1997 | NaN | 56.54167 | 25.48143 |\n",
"| 1998 | 18.26467 | 50.66250 | 24.58649 |\n",
"| 1999 | 18.49646 | 57.48864 | 24.66667 |\n",
"| 2000 | 16.93806 | 55.76103 | 23.46082 |\n",
"| 2001 | 16.92632 | 51.81984 | 25.06522 |\n",
"| 2002 | 15.27335 | 54.88043 | 22.73750 |\n",
"| 2003 | 15.23183 | 56.16608 | 24.62500 |\n",
"| 2004 | 14.62864 | 44.48240 | 23.39130 |\n",
"| 2005 | 16.18556 | 58.84126 | 22.62387 |\n",
"\n"
],
"text/plain": [
" year pm25 o3 no2 \n",
"1 1987 NaN 62.96966 23.49369\n",
"2 1988 NaN 61.67708 24.52296\n",
"3 1989 NaN 59.72727 26.14062\n",
"4 1990 NaN 52.22917 22.59583\n",
"5 1991 NaN 63.10417 21.38194\n",
"6 1992 NaN 50.82870 24.78921\n",
"7 1993 NaN 44.30093 25.76993\n",
"8 1994 NaN 52.17844 28.47500\n",
"9 1995 NaN 66.58750 27.26042\n",
"10 1996 NaN 58.39583 26.38715\n",
"11 1997 NaN 56.54167 25.48143\n",
"12 1998 18.26467 50.66250 24.58649\n",
"13 1999 18.49646 57.48864 24.66667\n",
"14 2000 16.93806 55.76103 23.46082\n",
"15 2001 16.92632 51.81984 25.06522\n",
"16 2002 15.27335 54.88043 22.73750\n",
"17 2003 15.23183 56.16608 24.62500\n",
"18 2004 14.62864 44.48240 23.39130\n",
"19 2005 16.18556 58.84126 22.62387"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"chicago5 <- mutate(chicago3, year = as.POSIXlt(date)$year + 1900)\n",
"years <- group_by(chicago5, year)\n",
"summarize(years, pm25 = mean(pm25, na.rm = TRUE),\n",
" o3 = max(o3tmean2, na.rm = TRUE),\n",
" no2 = median(no2tmean2, na.rm = TRUE))"
]
},
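{
"cell_type": "markdown",
"id": "sketch-groupby-nan-md",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- The `NaN` entries in the table above come from years in which every PM2.5 value is missing. A minimal sketch on a hypothetical `toy` data frame reproduces this behaviour and counts the non-missing observations per year; `pm25_mean` and `n_obs` are illustrative names."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-groupby-nan-code",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"library(dplyr)\n",
"\n",
"# Hypothetical toy data: 2004 has no PM2.5 measurements at all\n",
"toy <- data.frame(\n",
"  date = as.Date(c(\"2004-06-01\", \"2004-07-01\", \"2005-06-01\", \"2005-07-01\")),\n",
"  pm25 = c(NA, NA, 12, 18)\n",
")\n",
"\n",
"# Same year extraction as above: POSIXlt stores years since 1900\n",
"toy <- mutate(toy, year = as.POSIXlt(date)$year + 1900)\n",
"\n",
"# mean() over zero non-missing values gives NaN; n_obs shows why\n",
"summarize(group_by(toy, year),\n",
"          pm25_mean = mean(pm25, na.rm = TRUE),\n",
"          n_obs = sum(!is.na(pm25)))"
]
},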
{
"cell_type": "markdown",
"id": "a63cc665",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"In a slightly more complicated example, we might want to know the average levels of ozone (`o3`) and nitrogen dioxide (`no2`) within quintiles of `pm25`. A slicker way to do this would be through a regression model, but we can actually do this quickly with `group_by()` and `summarize()`.\n",
"\n",
"- First, we can create a categorical variable of `pm25` divided into quintiles.\n",
"\n",
"- Now we can group the data frame by the `pm25.quint` variable.\n",
"\n",
"- Finally, we can compute the mean of `o3` and `no2` within quintiles of `pm25`. The `NA` row in the result collects the days that were not assigned to any quintile, such as days with a missing `pm25` value."
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "5423b372",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
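"<table class=\"dataframe\">\n",
"<caption>A tibble: 6 × 3</caption>\n",
"<thead>\n",
"\t<tr><th scope=col>pm25.quint</th><th scope=col>o3</th><th scope=col>no2</th></tr>\n",
"\t<tr><th scope=col>&lt;fct&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><td>(1.7,8.7]  </td><td>21.66401</td><td>17.99129</td></tr>\n",
"\t<tr><td>(8.7,12.4] </td><td>20.38248</td><td>22.13004</td></tr>\n",
"\t<tr><td>(12.4,16.7]</td><td>20.66160</td><td>24.35708</td></tr>\n",
"\t<tr><td>(16.7,22.6]</td><td>19.88122</td><td>27.27132</td></tr>\n",
"\t<tr><td>(22.6,61.5]</td><td>20.31775</td><td>29.64427</td></tr>\n",
"\t<tr><td>NA         </td><td>18.79044</td><td>25.77585</td></tr>\n",
"</tbody>\n",
"</table>\n"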
],
"text/latex": [
"A tibble: 6 × 3\n",
"\\begin{tabular}{lll}\n",
" pm25.quint & o3 & no2\\\\\n",
" & & \\\\\n",
"\\hline\n",
"\t (1.7,8.7{]} & 21.66401 & 17.99129\\\\\n",
"\t (8.7,12.4{]} & 20.38248 & 22.13004\\\\\n",
"\t (12.4,16.7{]} & 20.66160 & 24.35708\\\\\n",
"\t (16.7,22.6{]} & 19.88122 & 27.27132\\\\\n",
"\t (22.6,61.5{]} & 20.31775 & 29.64427\\\\\n",
"\t NA & 18.79044 & 25.77585\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A tibble: 6 × 3\n",
"\n",
"| pm25.quint <fct> | o3 <dbl> | no2 <dbl> |\n",
"|---|---|---|\n",
"| (1.7,8.7] | 21.66401 | 17.99129 |\n",
"| (8.7,12.4] | 20.38248 | 22.13004 |\n",
"| (12.4,16.7] | 20.66160 | 24.35708 |\n",
"| (16.7,22.6] | 19.88122 | 27.27132 |\n",
"| (22.6,61.5] | 20.31775 | 29.64427 |\n",
"| NA | 18.79044 | 25.77585 |\n",
"\n"
],
"text/plain": [
" pm25.quint o3 no2 \n",
"1 (1.7,8.7] 21.66401 17.99129\n",
"2 (8.7,12.4] 20.38248 22.13004\n",
"3 (12.4,16.7] 20.66160 24.35708\n",
"4 (16.7,22.6] 19.88122 27.27132\n",
"5 (22.6,61.5] 20.31775 29.64427\n",
"6 NA 18.79044 25.77585"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"qq <- quantile(chicago3$pm25, seq(0, 1, 0.2), na.rm = TRUE)\n",
"chicago6 <- mutate(chicago3, pm25.quint = cut(pm25, qq))\n",
"quint <- group_by(chicago6, pm25.quint)\n",
"summarize(quint, o3 = mean(o3tmean2, na.rm = TRUE),\n",
" no2 = mean(no2tmean2, na.rm = TRUE))"
]
},
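{
"cell_type": "markdown",
"id": "sketch-cut-lowest-md",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- One detail of `cut()` is worth checking when the breaks come from `quantile()`: by default the intervals are open on the left, so the smallest observation equals the lowest break and is coded as `NA`. The sketch below illustrates this on a hypothetical vector `x` using base R only; whether to set `include.lowest = TRUE` in the analysis above is a judgment call and would change the label of the first interval."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-cut-lowest-code",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Hypothetical data: ten values plus one missing value\n",
"x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, NA)\n",
"qq <- quantile(x, seq(0, 1, 0.2), na.rm = TRUE)\n",
"\n",
"# Default: the minimum (x = 1) falls outside (1, 2.8] and becomes NA\n",
"table(cut(x, qq), useNA = \"ifany\")\n",
"\n",
"# include.lowest = TRUE closes the first interval on the left\n",
"table(cut(x, qq, include.lowest = TRUE), useNA = \"ifany\")"
]
},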
{
"cell_type": "markdown",
"id": "e9d43f2a",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Summary\n",
"\n",
"The `dplyr` package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of `group_by()` and `summarize()`.\n",
"\n",
"* `dplyr` can work with other data frame \"backends\" such as SQL databases. There is an SQL interface for relational databases via the DBI package.\n",
"\n",
"* `dplyr` can be integrated with the `data.table` package for large, fast tables.\n",
"\n",
"* The `dplyr` package is a handy way to both simplify and speed up your data frame management code. It's rare that you get such a combination at the same time!\n"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "4.3.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}