{ "cells": [ { "cell_type": "markdown", "id": "c57b2ce5", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# R Builtin Data Structures\n", "\n", "\n", "Feng Li\n", "\n", "School of Statistics and Mathematics\n", "\n", "Central University of Finance and Economics\n", "\n", "[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)\n", "\n", "[https://feng.li/statcomp](https://feng.li/statcomp)\n", "\n", "_>>> Link to Python version_ [1](https://feng.li/files/python/P02-Python-Data-Structures/L02.1-Python-Builtin-Data-Structures.slides.html), [2](https://feng.li/files/python/P02-Python-Data-Structures/L02.2-Data-Wrangling-with-Pandas.slides.html), [3](https://feng.li/files/python/P02-Python-Data-Structures/L02.3-Manipulating-DataFrames-with-Pandas.slides.html)" ] }, { "cell_type": "markdown", "id": "a0af6f40", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Generate a sequence\n", "\n", "### Sequences\n", "\n", "- Generate a sequence: seq()\n", "\n", "- Repeat a vector: rep()" ] }, { "cell_type": "code", "execution_count": 2, "id": "776c00b1", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 1\n", "\\item 2\n", "\\item 3\n", "\\item 4\n", "\\item 5\n", "\\item 6\n", "\\item 7\n", "\\item 8\n", "\\item 9\n", "\\item 10\n", "\\item 11\n", "\\item 12\n", "\\item 13\n", "\\item 14\n", "\\item 15\n", "\\item 16\n", "\\item 17\n", "\\item 18\n", "\\item 19\n", "\\item 20\n", "\\item 21\n", "\\item 22\n", "\\item 23\n", "\\item 24\n", "\\item 25\n", "\\item 26\n", "\\item 27\n", "\\item 28\n", "\\item 29\n", "\\item 30\n", "\\item 31\n", "\\item 32\n", "\\item 33\n", "\\item 34\n", "\\item 35\n", "\\item 36\n", "\\item 37\n", "\\item 38\n", "\\item 39\n", "\\item 40\n", "\\item 41\n", "\\item 42\n", "\\item 43\n", "\\item 44\n", "\\item 45\n", "\\item 46\n", "\\item 47\n", "\\item 48\n", "\\item 49\n", "\\item 50\n", "\\item 51\n", "\\item 52\n", "\\item 53\n", "\\item 54\n", "\\item 55\n", "\\item 56\n", "\\item 57\n", "\\item 58\n", "\\item 59\n", "\\item 60\n", "\\item 61\n", "\\item 62\n", "\\item 63\n", "\\item 64\n", "\\item 65\n", "\\item 66\n", "\\item 67\n", "\\item 68\n", "\\item 69\n", "\\item 70\n", "\\item 71\n", "\\item 72\n", "\\item 73\n", "\\item 74\n", "\\item 75\n", "\\item 76\n", "\\item 77\n", "\\item 78\n", "\\item 79\n", "\\item 80\n", "\\item 81\n", "\\item 82\n", "\\item 83\n", "\\item 84\n", "\\item 85\n", "\\item 86\n", "\\item 87\n", "\\item 88\n", "\\item 89\n", "\\item 90\n", "\\item 91\n", "\\item 92\n", "\\item 93\n", "\\item 94\n", "\\item 95\n", "\\item 96\n", "\\item 97\n", "\\item 98\n", "\\item 99\n", "\\item 100\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 1\n", "2. 2\n", "3. 3\n", "4. 4\n", "5. 5\n", "6. 6\n", "7. 7\n", "8. 8\n", "9. 9\n", "10. 10\n", "11. 11\n", "12. 12\n", "13. 13\n", "14. 14\n", "15. 15\n", "16. 16\n", "17. 17\n", "18. 18\n", "19. 19\n", "20. 20\n", "21. 21\n", "22. 22\n", "23. 23\n", "24. 24\n", "25. 25\n", "26. 26\n", "27. 27\n", "28. 28\n", "29. 29\n", "30. 30\n", "31. 31\n", "32. 32\n", "33. 33\n", "34. 34\n", "35. 35\n", "36. 
36\n", "37. 37\n", "38. 38\n", "39. 39\n", "40. 40\n", "41. 41\n", "42. 42\n", "43. 43\n", "44. 44\n", "45. 45\n", "46. 46\n", "47. 47\n", "48. 48\n", "49. 49\n", "50. 50\n", "51. 51\n", "52. 52\n", "53. 53\n", "54. 54\n", "55. 55\n", "56. 56\n", "57. 57\n", "58. 58\n", "59. 59\n", "60. 60\n", "61. 61\n", "62. 62\n", "63. 63\n", "64. 64\n", "65. 65\n", "66. 66\n", "67. 67\n", "68. 68\n", "69. 69\n", "70. 70\n", "71. 71\n", "72. 72\n", "73. 73\n", "74. 74\n", "75. 75\n", "76. 76\n", "77. 77\n", "78. 78\n", "79. 79\n", "80. 80\n", "81. 81\n", "82. 82\n", "83. 83\n", "84. 84\n", "85. 85\n", "86. 86\n", "87. 87\n", "88. 88\n", "89. 89\n", "90. 90\n", "91. 91\n", "92. 92\n", "93. 93\n", "94. 94\n", "95. 95\n", "96. 96\n", "97. 97\n", "98. 98\n", "99. 99\n", "100. 100\n", "\n", "\n" ], "text/plain": [ " [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18\n", " [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36\n", " [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54\n", " [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72\n", " [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90\n", " [91] 91 92 93 94 95 96 97 98 99 100" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "seq(1,100)" ] }, { "cell_type": "code", "execution_count": 3, "id": "3f52bc96", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 100\n", "\\item 99\n", "\\item 98\n", "\\item 97\n", "\\item 96\n", "\\item 95\n", "\\item 94\n", "\\item 93\n", "\\item 92\n", "\\item 91\n", "\\item 90\n", "\\item 89\n", "\\item 88\n", "\\item 87\n", "\\item 86\n", "\\item 85\n", "\\item 84\n", "\\item 83\n", "\\item 82\n", "\\item 81\n", "\\item 80\n", "\\item 79\n", "\\item 78\n", "\\item 77\n", "\\item 76\n", "\\item 75\n", "\\item 74\n", "\\item 73\n", "\\item 72\n", "\\item 71\n", "\\item 70\n", "\\item 69\n", "\\item 68\n", "\\item 67\n", "\\item 66\n", "\\item 65\n", "\\item 64\n", "\\item 63\n", "\\item 62\n", "\\item 61\n", "\\item 60\n", "\\item 59\n", "\\item 58\n", "\\item 57\n", "\\item 56\n", "\\item 55\n", "\\item 54\n", "\\item 53\n", "\\item 52\n", "\\item 51\n", "\\item 50\n", "\\item 49\n", "\\item 48\n", "\\item 47\n", "\\item 46\n", "\\item 45\n", "\\item 44\n", "\\item 43\n", "\\item 42\n", "\\item 41\n", "\\item 40\n", "\\item 39\n", "\\item 38\n", "\\item 37\n", "\\item 36\n", "\\item 35\n", "\\item 34\n", "\\item 33\n", "\\item 32\n", "\\item 31\n", "\\item 30\n", "\\item 29\n", "\\item 28\n", "\\item 27\n", "\\item 26\n", "\\item 25\n", "\\item 24\n", "\\item 23\n", "\\item 22\n", "\\item 21\n", "\\item 20\n", "\\item 19\n", "\\item 18\n", "\\item 17\n", "\\item 16\n", "\\item 15\n", "\\item 14\n", "\\item 13\n", "\\item 12\n", "\\item 11\n", "\\item 10\n", "\\item 9\n", "\\item 8\n", "\\item 7\n", "\\item 6\n", "\\item 5\n", "\\item 4\n", "\\item 3\n", "\\item 2\n", "\\item 1\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 100\n", "2. 99\n", "3. 98\n", "4. 97\n", "5. 96\n", "6. 95\n", "7. 94\n", "8. 93\n", "9. 92\n", "10. 91\n", "11. 90\n", "12. 89\n", "13. 88\n", "14. 87\n", "15. 86\n", "16. 85\n", "17. 84\n", "18. 83\n", "19. 82\n", "20. 81\n", "21. 80\n", "22. 79\n", "23. 78\n", "24. 77\n", "25. 76\n", "26. 75\n", "27. 74\n", "28. 73\n", "29. 72\n", "30. 71\n", "31. 70\n", "32. 69\n", "33. 68\n", "34. 67\n", "35. 
66\n", "36. 65\n", "37. 64\n", "38. 63\n", "39. 62\n", "40. 61\n", "41. 60\n", "42. 59\n", "43. 58\n", "44. 57\n", "45. 56\n", "46. 55\n", "47. 54\n", "48. 53\n", "49. 52\n", "50. 51\n", "51. 50\n", "52. 49\n", "53. 48\n", "54. 47\n", "55. 46\n", "56. 45\n", "57. 44\n", "58. 43\n", "59. 42\n", "60. 41\n", "61. 40\n", "62. 39\n", "63. 38\n", "64. 37\n", "65. 36\n", "66. 35\n", "67. 34\n", "68. 33\n", "69. 32\n", "70. 31\n", "71. 30\n", "72. 29\n", "73. 28\n", "74. 27\n", "75. 26\n", "76. 25\n", "77. 24\n", "78. 23\n", "79. 22\n", "80. 21\n", "81. 20\n", "82. 19\n", "83. 18\n", "84. 17\n", "85. 16\n", "86. 15\n", "87. 14\n", "88. 13\n", "89. 12\n", "90. 11\n", "91. 10\n", "92. 9\n", "93. 8\n", "94. 7\n", "95. 6\n", "96. 5\n", "97. 4\n", "98. 3\n", "99. 2\n", "100. 1\n", "\n", "\n" ], "text/plain": [ " [1] 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83\n", " [19] 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65\n", " [37] 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47\n", " [55] 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29\n", " [73] 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11\n", " [91] 10 9 8 7 6 5 4 3 2 1" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "seq(100,1)" ] }, { "cell_type": "code", "execution_count": 5, "id": "daacea4d", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 10\n", "\\item 10\n", "\\item 10\n", "\\item 10\n", "\\item 10\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 10\n", "2. 10\n", "3. 10\n", "4. 10\n", "5. 10\n", "\n", "\n" ], "text/plain": [ "[1] 10 10 10 10 10" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "rep(10,5)" ] }, { "cell_type": "code", "execution_count": 6, "id": "5423c9cb", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 1\n", "\\item 2\n", "\\item 3\n", "\\item 1\n", "\\item 2\n", "\\item 3\n", "\\item 1\n", "\\item 2\n", "\\item 3\n", "\\item 1\n", "\\item 2\n", "\\item 3\n", "\\item 1\n", "\\item 2\n", "\\item 3\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 1\n", "2. 2\n", "3. 3\n", "4. 1\n", "5. 2\n", "6. 3\n", "7. 1\n", "8. 2\n", "9. 3\n", "10. 1\n", "11. 2\n", "12. 3\n", "13. 1\n", "14. 2\n", "15. 3\n", "\n", "\n" ], "text/plain": [ " [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "rep(c(1, 2, 3), 5)" ] }, { "cell_type": "markdown", "id": "3a8613a0", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Vectors\n", "\n", "- Numerical vectors\n", "\n", "- Logical vectors\n", "\n", "- Character vectors\n", "\n", "- Length of a vector\n", "\n", "- Vector calculations" ] }, { "cell_type": "markdown", "id": "3ed2c522", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Mathematical functions\n", "\n", "- sqrt(), log()\n", "\n", "- sin(), cos(), tan()" ] }, { "cell_type": "markdown", "id": "7943100a", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Vectors and matrices\n", "\n", "### Matrices\n", "\n", "- Create a matrix: matrix()\n", "\n", "- Dimension of a matrix: dim()\n", "\n", "- Number of elements in a matrix: length()\n", "\n", "- Extract elements from a matrix.\n", "\n", "- Replace elements with new entries.\n", "\n", "- Create special matrices: diagonal matrix, identity matrix, zero\n", "  matrix, ..."
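, "\n", "\n",
"A minimal sketch of the operations listed above, using a small hypothetical matrix `m` (the names `m`, `I3` and `Z` and all values are illustrative assumptions, not part of the original slides):\n",
"\n",
"```r\n",
"# Hypothetical 2 x 3 example matrix; R fills matrices column by column\n",
"m <- matrix(1:6, nrow = 2, ncol = 3)\n",
"dim(m)          # number of rows and columns\n",
"length(m)       # total number of elements\n",
"m[1, 2]         # extract the element in row 1, column 2\n",
"m[, 3]          # extract the whole third column\n",
"m[2, 1] <- 0L   # replace a single entry\n",
"I3 <- diag(3)                      # 3 x 3 identity matrix\n",
"D  <- diag(c(1, 2, 3))             # diagonal matrix with a given diagonal\n",
"Z  <- matrix(0, nrow = 2, ncol = 2) # 2 x 2 zero matrix\n",
"```\n",
"\n",
"Leaving an index empty, as in `m[, 3]`, selects every row (or column) along that dimension."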
] }, { "cell_type": "markdown", "id": "d471849b", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Matrix multiplication: %*%\n", "\n", "- Matrix inverse: solve()\n", "\n", "- Transpose of a matrix: t()\n", "\n", "- Element-wise operations on a matrix.\n", "\n", "- Combine two or more matrices: rbind(), cbind()" ] }, { "cell_type": "markdown", "id": "75033dc2", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Array\n", "\n", "- An array generalizes a matrix to more than two dimensions.\n", "\n", "- A matrix is the special case of an array with two dimensions.\n", "\n", "- A vector can be seen as an array without a dimension attribute (in R the\n", "  dimension is usually dropped in this situation)." ] }, { "cell_type": "markdown", "id": "f824b151", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### List\n", "\n", "- A list handles data that a matrix cannot:\n", "\n", "    - Elements of different lengths.\n", "\n", "    - Elements of different types.\n", "\n", "    - Nested data structures within a list.\n", "\n", "- Create a list: list()\n", "\n", "- Extract elements of a list: [[]] or $\n", "\n", "- Delete an element of a list: assign NULL to that element." ] }, { "cell_type": "markdown", "id": "c5a0ff2a", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Data frame\n", "\n", "- data.frame(): tightly coupled collections of variables which share\n", "  many of the properties of matrices and of lists, used as the\n", "  fundamental data structure by most of R's modeling software.\n", "\n", "- In most cases, operations on a data frame are similar to matrix\n", "  operations."
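, "\n", "\n",
"A minimal sketch of how a data frame combines list-like and matrix-like access. The data frame `df` and its columns `name`, `score` and `passed` are hypothetical, chosen only for illustration:\n",
"\n",
"```r\n",
"# Hypothetical data: the column names and values are illustrative only\n",
"df <- data.frame(\n",
"  name   = c('a', 'b', 'c'),      # character column\n",
"  score  = c(90, 85, 70),         # numeric column\n",
"  passed = c(TRUE, TRUE, FALSE)   # logical column\n",
")\n",
"dim(df)           # rows and columns, as for a matrix\n",
"df$score          # extract a column like a list element\n",
"df[2, 'score']    # matrix-style indexing by row and column\n",
"df[df$passed, ]   # keep the rows where passed is TRUE\n",
"```\n",
"\n",
"Every column has the same length, but each column may hold a different type, which a matrix does not allow."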
] }, { "cell_type": "markdown", "id": "55ff6fdb", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- See also the dplyr package.\n", "\n", "    - Written by Hadley Wickham of RStudio.\n", "\n", "    - Everything dplyr does can already be done with base R, but dplyr\n", "      greatly simplifies the existing functionality.\n", "\n", "    - It provides a \"grammar\" (in particular, verbs) for data\n", "      manipulation and for operating on data frames.\n", "\n", "    - The dplyr functions are very fast, as many key operations are\n", "      coded in C++." ] }, { "cell_type": "markdown", "id": "67c9468d", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Discussion\n", "\n", "What type of data structure would you choose in the following\n", "situations?\n", "\n", "- The data are of the same length but of different types.\n", "\n", "- The data are not of the same length.\n", "\n", "- The data have a hierarchical structure." ] }, { "cell_type": "markdown", "id": "651e3e65", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Managing data frames with the dplyr package\n", "\n", "\n", "### dplyr Grammar\n", "\n", "Some of the key \"verbs\" provided by the dplyr package are\n", "\n", "* select: return a subset of the columns of a data frame, using a flexible notation\n", "* filter: extract a subset of rows from a data frame based on logical conditions\n", "* arrange: reorder rows of a data frame\n", "* rename: rename variables in a data frame\n", "* mutate: add new variables/columns or transform existing variables\n", "* summarise / summarize: generate summary statistics of different\n", "  variables in the data frame, possibly within strata\n", "* %>%: the \"pipe\" operator is used to connect multiple verb actions together into a pipeline\n" ] }, { "cell_type": "markdown", "id": "f30cd3a0", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Common dplyr Function Properties\n", "\n", "All of the functions have a few 
common characteristics. In particular,\n", "\n", "- The first argument is a data frame.\n", "\n", "- The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator (just use the column names).\n", "\n", "- Each function returns a new data frame.\n", "\n", "- Data frames must be properly formatted and annotated for all of this to be useful. In particular, the data must be [tidy](http://www.jstatsoft.org/v59/i10/paper). In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation." ] }, { "cell_type": "markdown", "id": "4e004ab4", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Install the dplyr package:\n", "\n", "      install.packages(\"dplyr\")\n", "\n", "- After installing the package, load it into your R session with the library() function." ] }, { "cell_type": "code", "execution_count": 9, "id": "47c53ff8", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "library(dplyr)" ] }, { "cell_type": "markdown", "id": "4dcfd723", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### select()\n", "\n", "- We will use a dataset containing air pollution and temperature data for the [city of Chicago](http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip) in the U.S." ] }, { "cell_type": "code", "execution_count": 40, "id": "34c35340", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 6940\n", "\\item 8\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 6940\n", "2. 8\n", "\n", "\n" ], "text/plain": [ "[1] 6940    8" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "'data.frame':\t6940 obs. of  8 variables:\n", " $ city      : chr  \"chic\" \"chic\" \"chic\" \"chic\" ...\n", " $ tmpd      : num  31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n", " $ dptp      : num  31.5 29.9 27.4 28.6 28.9 ...\n", " $ date      : Date, format: \"1987-01-01\" \"1987-01-02\" ...\n", " $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...\n", " $ pm10tmean2: num  34 NA 34.2 47 NA ...\n", " $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...\n", " $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...\n" ] } ], "source": [ "chicago <- readRDS(\"data/chicago.rds\")\n", "dim(chicago)\n", "str(chicago)" ] }, { "cell_type": "markdown", "id": "292e3983", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Sometimes you may want to use only a couple of variables out of many." ] }, { "cell_type": "code", "execution_count": 14, "id": "59939174", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 'city'\n", "\\item 'tmpd'\n", "\\item 'dptp'\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 'city'\n", "2. 'tmpd'\n", "3. 'dptp'\n", "\n", "\n" ], "text/plain": [ "[1] \"city\" \"tmpd\" \"dptp\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
\n" ], "text/latex": [ "A data.frame: 6 × 3\n", "\\begin{tabular}{r|lll}\n", " & city & tmpd & dptp\\\\\n", " & & & \\\\\n", "\\hline\n", "\t1 & chic & 31.5 & 31.500\\\\\n", "\t2 & chic & 33.0 & 29.875\\\\\n", "\t3 & chic & 33.0 & 27.375\\\\\n", "\t4 & chic & 29.0 & 28.625\\\\\n", "\t5 & chic & 32.0 & 28.875\\\\\n", "\t6 & chic & 40.0 & 35.125\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 6 × 3\n", "\n", "| | city <chr> | tmpd <dbl> | dptp <dbl> |\n", "|---|---|---|---|\n", "| 1 | chic | 31.5 | 31.500 |\n", "| 2 | chic | 33.0 | 29.875 |\n", "| 3 | chic | 33.0 | 27.375 |\n", "| 4 | chic | 29.0 | 28.625 |\n", "| 5 | chic | 32.0 | 28.875 |\n", "| 6 | chic | 40.0 | 35.125 |\n", "\n" ], "text/plain": [ " city tmpd dptp \n", "1 chic 31.5 31.500\n", "2 chic 33.0 29.875\n", "3 chic 33.0 27.375\n", "4 chic 29.0 28.625\n", "5 chic 32.0 28.875\n", "6 chic 40.0 35.125" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "names(chicago)[1:3]\n", "subset <- select(chicago, city:dptp)\n", "head(subset)" ] }, { "cell_type": "markdown", "id": "1e153a64", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Sometimes you may want to drop some variables that are not useful.\n" ] }, { "cell_type": "code", "execution_count": 16, "id": "af2e4c75", "metadata": { "scrolled": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
\n" ], "text/latex": [ "A data.frame: 6940 × 5\n", "\\begin{tabular}{r|lllll}\n", " & date & pm25tmean2 & pm10tmean2 & o3tmean2 & no2tmean2\\\\\n", " & & & & & \\\\\n", "\\hline\n", "\t1 & 1987-01-01 & NA & 34.00000 & 4.250000 & 19.98810\\\\\n", "\t2 & 1987-01-02 & NA & NA & 3.304348 & 23.19099\\\\\n", "\t3 & 1987-01-03 & NA & 34.16667 & 3.333333 & 23.81548\\\\\n", "\t4 & 1987-01-04 & NA & 47.00000 & 4.375000 & 30.43452\\\\\n", "\t5 & 1987-01-05 & NA & NA & 4.750000 & 30.33333\\\\\n", "\t6 & 1987-01-06 & NA & 48.00000 & 5.833333 & 25.77233\\\\\n", "\t7 & 1987-01-07 & NA & 41.00000 & 9.291667 & 20.58171\\\\\n", "\t8 & 1987-01-08 & NA & 36.00000 & 11.291667 & 17.03723\\\\\n", "\t9 & 1987-01-09 & NA & 33.28571 & 4.500000 & 23.38889\\\\\n", "\t10 & 1987-01-10 & NA & NA & 4.958333 & 19.54167\\\\\n", "\t11 & 1987-01-11 & NA & 22.00000 & 17.541667 & 13.70139\\\\\n", "\t12 & 1987-01-12 & NA & 26.00000 & 8.000000 & 33.02083\\\\\n", "\t13 & 1987-01-13 & NA & 53.00000 & 4.958333 & 38.06142\\\\\n", "\t14 & 1987-01-14 & NA & 43.00000 & 4.208333 & 32.19444\\\\\n", "\t15 & 1987-01-15 & NA & 28.83333 & 4.458333 & 18.87131\\\\\n", "\t16 & 1987-01-16 & NA & 19.00000 & 7.916667 & 19.46667\\\\\n", "\t17 & 1987-01-17 & NA & NA & 5.833333 & 20.70833\\\\\n", "\t18 & 1987-01-18 & NA & 39.00000 & 6.375000 & 21.03333\\\\\n", "\t19 & 1987-01-19 & NA & 32.00000 & 14.875000 & 17.17409\\\\\n", "\t20 & 1987-01-20 & NA & 38.00000 & 7.250000 & 21.61021\\\\\n", "\t21 & 1987-01-21 & NA & 32.85714 & 8.913043 & 24.52083\\\\\n", "\t22 & 1987-01-22 & NA & 52.00000 & 10.500000 & 16.98798\\\\\n", "\t23 & 1987-01-23 & NA & 55.00000 & 14.625000 & 14.66250\\\\\n", "\t24 & 1987-01-24 & NA & 38.00000 & 10.083333 & 18.69167\\\\\n", "\t25 & 1987-01-25 & NA & NA & 6.666667 & 26.30417\\\\\n", "\t26 & 1987-01-26 & NA & 71.00000 & 4.583333 & 32.42143\\\\\n", "\t27 & 1987-01-27 & NA & 39.33333 & 6.000000 & 30.69306\\\\\n", "\t28 & 1987-01-28 & NA & 47.00000 & 6.875000 & 29.12943\\\\\n", "\t29 & 1987-01-29 & NA & 
35.00000 & 2.916667 & 28.14529\\\\\n", "\t30 & 1987-01-30 & NA & 59.00000 & 8.791667 & 19.79861\\\\\n", "\t⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮\\\\\n", "\t6911 & 2005-12-02 & NA & 19.50 & 9.156250 & 23.29167\\\\\n", "\t6912 & 2005-12-03 & 13.34286 & 20.00 & 10.333333 & 25.19444\\\\\n", "\t6913 & 2005-12-04 & 15.30000 & 15.50 & 13.177083 & 21.70833\\\\\n", "\t6914 & 2005-12-05 & NA & 30.00 & 6.447917 & 28.38889\\\\\n", "\t6915 & 2005-12-06 & 24.61667 & 33.00 & 4.701540 & 29.08333\\\\\n", "\t6916 & 2005-12-07 & 37.80000 & 39.00 & 3.916214 & 34.30952\\\\\n", "\t6917 & 2005-12-08 & 24.30000 & 31.00 & 5.995265 & 34.22222\\\\\n", "\t6918 & 2005-12-09 & 25.45000 & 22.00 & 5.958333 & 31.41667\\\\\n", "\t6919 & 2005-12-10 & 18.20000 & 30.00 & 9.135417 & 28.70833\\\\\n", "\t6920 & 2005-12-11 & 10.60000 & 14.00 & 11.333333 & 22.55556\\\\\n", "\t6921 & 2005-12-12 & 19.22500 & 28.75 & 5.031250 & 39.74621\\\\\n", "\t6922 & 2005-12-13 & 26.50000 & 21.00 & 6.628623 & 29.56944\\\\\n", "\t6923 & 2005-12-14 & 26.90000 & 16.00 & 3.802083 & 30.63384\\\\\n", "\t6924 & 2005-12-15 & 14.40000 & 16.50 & 4.895833 & 25.43056\\\\\n", "\t6925 & 2005-12-16 & 11.00000 & 22.00 & 11.166667 & 16.87500\\\\\n", "\t6926 & 2005-12-17 & 13.80000 & 20.00 & 8.593750 & 20.73611\\\\\n", "\t6927 & 2005-12-18 & 12.20000 & 17.50 & 13.552083 & 19.11111\\\\\n", "\t6928 & 2005-12-19 & 21.15000 & 21.00 & 8.058877 & 31.79167\\\\\n", "\t6929 & 2005-12-20 & 25.75000 & 32.00 & 3.849185 & 32.89773\\\\\n", "\t6930 & 2005-12-21 & 37.92857 & 59.50 & 3.663949 & 34.86111\\\\\n", "\t6931 & 2005-12-22 & 36.65000 & 42.50 & 5.385417 & 33.73026\\\\\n", "\t6932 & 2005-12-23 & 32.90000 & 34.50 & 6.906250 & 29.08333\\\\\n", "\t6933 & 2005-12-24 & 30.77143 & 25.20 & 1.770833 & 31.98611\\\\\n", "\t6934 & 2005-12-25 & 6.70000 & 8.00 & 14.354167 & 13.79167\\\\\n", "\t6935 & 2005-12-26 & 8.40000 & 8.50 & 14.041667 & 16.81944\\\\\n", "\t6936 & 2005-12-27 & 23.56000 & 27.00 & 4.468750 & 23.50000\\\\\n", "\t6937 & 2005-12-28 & 17.75000 & 27.50 & 3.260417 
& 19.28563\\\\\n", "\t6938 & 2005-12-29 & 7.45000 & 23.50 & 6.794837 & 19.97222\\\\\n", "\t6939 & 2005-12-30 & 15.05714 & 19.20 & 3.034420 & 22.80556\\\\\n", "\t6940 & 2005-12-31 & 15.00000 & 23.50 & 2.531250 & 13.25000\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 6940 × 5\n", "\n", "| | date <date> | pm25tmean2 <dbl> | pm10tmean2 <dbl> | o3tmean2 <dbl> | no2tmean2 <dbl> |\n", "|---|---|---|---|---|---|\n", "| 1 | 1987-01-01 | NA | 34.00000 | 4.250000 | 19.98810 |\n", "| 2 | 1987-01-02 | NA | NA | 3.304348 | 23.19099 |\n", "| 3 | 1987-01-03 | NA | 34.16667 | 3.333333 | 23.81548 |\n", "| 4 | 1987-01-04 | NA | 47.00000 | 4.375000 | 30.43452 |\n", "| 5 | 1987-01-05 | NA | NA | 4.750000 | 30.33333 |\n", "| 6 | 1987-01-06 | NA | 48.00000 | 5.833333 | 25.77233 |\n", "| 7 | 1987-01-07 | NA | 41.00000 | 9.291667 | 20.58171 |\n", "| 8 | 1987-01-08 | NA | 36.00000 | 11.291667 | 17.03723 |\n", "| 9 | 1987-01-09 | NA | 33.28571 | 4.500000 | 23.38889 |\n", "| 10 | 1987-01-10 | NA | NA | 4.958333 | 19.54167 |\n", "| 11 | 1987-01-11 | NA | 22.00000 | 17.541667 | 13.70139 |\n", "| 12 | 1987-01-12 | NA | 26.00000 | 8.000000 | 33.02083 |\n", "| 13 | 1987-01-13 | NA | 53.00000 | 4.958333 | 38.06142 |\n", "| 14 | 1987-01-14 | NA | 43.00000 | 4.208333 | 32.19444 |\n", "| 15 | 1987-01-15 | NA | 28.83333 | 4.458333 | 18.87131 |\n", "| 16 | 1987-01-16 | NA | 19.00000 | 7.916667 | 19.46667 |\n", "| 17 | 1987-01-17 | NA | NA | 5.833333 | 20.70833 |\n", "| 18 | 1987-01-18 | NA | 39.00000 | 6.375000 | 21.03333 |\n", "| 19 | 1987-01-19 | NA | 32.00000 | 14.875000 | 17.17409 |\n", "| 20 | 1987-01-20 | NA | 38.00000 | 7.250000 | 21.61021 |\n", "| 21 | 1987-01-21 | NA | 32.85714 | 8.913043 | 24.52083 |\n", "| 22 | 1987-01-22 | NA | 52.00000 | 10.500000 | 16.98798 |\n", "| 23 | 1987-01-23 | NA | 55.00000 | 14.625000 | 14.66250 |\n", "| 24 | 1987-01-24 | NA | 38.00000 | 10.083333 | 18.69167 |\n", "| 25 | 1987-01-25 | NA | NA | 6.666667 | 26.30417 |\n", "| 26 | 1987-01-26 | 
NA | 71.00000 | 4.583333 | 32.42143 |\n", "| 27 | 1987-01-27 | NA | 39.33333 | 6.000000 | 30.69306 |\n", "| 28 | 1987-01-28 | NA | 47.00000 | 6.875000 | 29.12943 |\n", "| 29 | 1987-01-29 | NA | 35.00000 | 2.916667 | 28.14529 |\n", "| 30 | 1987-01-30 | NA | 59.00000 | 8.791667 | 19.79861 |\n", "| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |\n", "| 6911 | 2005-12-02 | NA | 19.50 | 9.156250 | 23.29167 |\n", "| 6912 | 2005-12-03 | 13.34286 | 20.00 | 10.333333 | 25.19444 |\n", "| 6913 | 2005-12-04 | 15.30000 | 15.50 | 13.177083 | 21.70833 |\n", "| 6914 | 2005-12-05 | NA | 30.00 | 6.447917 | 28.38889 |\n", "| 6915 | 2005-12-06 | 24.61667 | 33.00 | 4.701540 | 29.08333 |\n", "| 6916 | 2005-12-07 | 37.80000 | 39.00 | 3.916214 | 34.30952 |\n", "| 6917 | 2005-12-08 | 24.30000 | 31.00 | 5.995265 | 34.22222 |\n", "| 6918 | 2005-12-09 | 25.45000 | 22.00 | 5.958333 | 31.41667 |\n", "| 6919 | 2005-12-10 | 18.20000 | 30.00 | 9.135417 | 28.70833 |\n", "| 6920 | 2005-12-11 | 10.60000 | 14.00 | 11.333333 | 22.55556 |\n", "| 6921 | 2005-12-12 | 19.22500 | 28.75 | 5.031250 | 39.74621 |\n", "| 6922 | 2005-12-13 | 26.50000 | 21.00 | 6.628623 | 29.56944 |\n", "| 6923 | 2005-12-14 | 26.90000 | 16.00 | 3.802083 | 30.63384 |\n", "| 6924 | 2005-12-15 | 14.40000 | 16.50 | 4.895833 | 25.43056 |\n", "| 6925 | 2005-12-16 | 11.00000 | 22.00 | 11.166667 | 16.87500 |\n", "| 6926 | 2005-12-17 | 13.80000 | 20.00 | 8.593750 | 20.73611 |\n", "| 6927 | 2005-12-18 | 12.20000 | 17.50 | 13.552083 | 19.11111 |\n", "| 6928 | 2005-12-19 | 21.15000 | 21.00 | 8.058877 | 31.79167 |\n", "| 6929 | 2005-12-20 | 25.75000 | 32.00 | 3.849185 | 32.89773 |\n", "| 6930 | 2005-12-21 | 37.92857 | 59.50 | 3.663949 | 34.86111 |\n", "| 6931 | 2005-12-22 | 36.65000 | 42.50 | 5.385417 | 33.73026 |\n", "| 6932 | 2005-12-23 | 32.90000 | 34.50 | 6.906250 | 29.08333 |\n", "| 6933 | 2005-12-24 | 30.77143 | 25.20 | 1.770833 | 31.98611 |\n", "| 6934 | 2005-12-25 | 6.70000 | 8.00 | 14.354167 | 13.79167 |\n", "| 6935 | 2005-12-26 | 8.40000 | 8.50 | 14.041667 
| 16.81944 |\n", "| 6936 | 2005-12-27 | 23.56000 | 27.00 | 4.468750 | 23.50000 |\n", "| 6937 | 2005-12-28 | 17.75000 | 27.50 | 3.260417 | 19.28563 |\n", "| 6938 | 2005-12-29 | 7.45000 | 23.50 | 6.794837 | 19.97222 |\n", "| 6939 | 2005-12-30 | 15.05714 | 19.20 | 3.034420 | 22.80556 |\n", "| 6940 | 2005-12-31 | 15.00000 | 23.50 | 2.531250 | 13.25000 |\n", "\n" ], "text/plain": [ " date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2\n", "1 1987-01-01 NA 34.00000 4.250000 19.98810 \n", "2 1987-01-02 NA NA 3.304348 23.19099 \n", "3 1987-01-03 NA 34.16667 3.333333 23.81548 \n", "4 1987-01-04 NA 47.00000 4.375000 30.43452 \n", "5 1987-01-05 NA NA 4.750000 30.33333 \n", "6 1987-01-06 NA 48.00000 5.833333 25.77233 \n", "7 1987-01-07 NA 41.00000 9.291667 20.58171 \n", "8 1987-01-08 NA 36.00000 11.291667 17.03723 \n", "9 1987-01-09 NA 33.28571 4.500000 23.38889 \n", "10 1987-01-10 NA NA 4.958333 19.54167 \n", "11 1987-01-11 NA 22.00000 17.541667 13.70139 \n", "12 1987-01-12 NA 26.00000 8.000000 33.02083 \n", "13 1987-01-13 NA 53.00000 4.958333 38.06142 \n", "14 1987-01-14 NA 43.00000 4.208333 32.19444 \n", "15 1987-01-15 NA 28.83333 4.458333 18.87131 \n", "16 1987-01-16 NA 19.00000 7.916667 19.46667 \n", "17 1987-01-17 NA NA 5.833333 20.70833 \n", "18 1987-01-18 NA 39.00000 6.375000 21.03333 \n", "19 1987-01-19 NA 32.00000 14.875000 17.17409 \n", "20 1987-01-20 NA 38.00000 7.250000 21.61021 \n", "21 1987-01-21 NA 32.85714 8.913043 24.52083 \n", "22 1987-01-22 NA 52.00000 10.500000 16.98798 \n", "23 1987-01-23 NA 55.00000 14.625000 14.66250 \n", "24 1987-01-24 NA 38.00000 10.083333 18.69167 \n", "25 1987-01-25 NA NA 6.666667 26.30417 \n", "26 1987-01-26 NA 71.00000 4.583333 32.42143 \n", "27 1987-01-27 NA 39.33333 6.000000 30.69306 \n", "28 1987-01-28 NA 47.00000 6.875000 29.12943 \n", "29 1987-01-29 NA 35.00000 2.916667 28.14529 \n", "30 1987-01-30 NA 59.00000 8.791667 19.79861 \n", "⋮ ⋮ ⋮ ⋮ ⋮ ⋮ \n", "6911 2005-12-02 NA 19.50 9.156250 23.29167 \n", "6912 2005-12-03 13.34286 20.00 
10.333333 25.19444 \n", "6913 2005-12-04 15.30000 15.50 13.177083 21.70833 \n", "6914 2005-12-05 NA 30.00 6.447917 28.38889 \n", "6915 2005-12-06 24.61667 33.00 4.701540 29.08333 \n", "6916 2005-12-07 37.80000 39.00 3.916214 34.30952 \n", "6917 2005-12-08 24.30000 31.00 5.995265 34.22222 \n", "6918 2005-12-09 25.45000 22.00 5.958333 31.41667 \n", "6919 2005-12-10 18.20000 30.00 9.135417 28.70833 \n", "6920 2005-12-11 10.60000 14.00 11.333333 22.55556 \n", "6921 2005-12-12 19.22500 28.75 5.031250 39.74621 \n", "6922 2005-12-13 26.50000 21.00 6.628623 29.56944 \n", "6923 2005-12-14 26.90000 16.00 3.802083 30.63384 \n", "6924 2005-12-15 14.40000 16.50 4.895833 25.43056 \n", "6925 2005-12-16 11.00000 22.00 11.166667 16.87500 \n", "6926 2005-12-17 13.80000 20.00 8.593750 20.73611 \n", "6927 2005-12-18 12.20000 17.50 13.552083 19.11111 \n", "6928 2005-12-19 21.15000 21.00 8.058877 31.79167 \n", "6929 2005-12-20 25.75000 32.00 3.849185 32.89773 \n", "6930 2005-12-21 37.92857 59.50 3.663949 34.86111 \n", "6931 2005-12-22 36.65000 42.50 5.385417 33.73026 \n", "6932 2005-12-23 32.90000 34.50 6.906250 29.08333 \n", "6933 2005-12-24 30.77143 25.20 1.770833 31.98611 \n", "6934 2005-12-25 6.70000 8.00 14.354167 13.79167 \n", "6935 2005-12-26 8.40000 8.50 14.041667 16.81944 \n", "6936 2005-12-27 23.56000 27.00 4.468750 23.50000 \n", "6937 2005-12-28 17.75000 27.50 3.260417 19.28563 \n", "6938 2005-12-29 7.45000 23.50 6.794837 19.97222 \n", "6939 2005-12-30 15.05714 19.20 3.034420 22.80556 \n", "6940 2005-12-31 15.00000 23.50 2.531250 13.25000 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "select(chicago, -(city:dptp))" ] }, { "cell_type": "markdown", "id": "10e5c2c9", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- If we wanted to keep every variable that ends with a \"2\", we could do" ] }, { "cell_type": "code", "execution_count": 18, "id": "e6c24fc3", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { 
"name": "stdout", "output_type": "stream", "text": [ "'data.frame':\t6940 obs. of 4 variables:\n", " $ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...\n", " $ pm10tmean2: num 34 NA 34.2 47 NA ...\n", " $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...\n", " $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...\n" ] } ], "source": [ "subset <- select(chicago, ends_with(\"2\"))\n", "str(subset)" ] }, { "cell_type": "markdown", "id": "6c5f3342", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Or if we wanted to keep every variable that starts with a \"d\", we could do" ] }, { "cell_type": "code", "execution_count": 20, "id": "1921b790", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'data.frame':\t6940 obs. of 2 variables:\n", " $ dptp: num 31.5 29.9 27.4 28.6 28.9 ...\n", " $ date: Date, format: \"1987-01-01\" \"1987-01-02\" ...\n" ] } ], "source": [ "subset <- select(chicago, starts_with(\"d\"))\n", "str(subset)" ] }, { "cell_type": "markdown", "id": "435eaf63", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### filter()\n", "\n", "- The filter() function is used to extract subsets of rows from a data frame.\n" ] }, { "cell_type": "code", "execution_count": 21, "id": "84a28bc2", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'data.frame':\t194 obs. of 8 variables:\n", " $ city : chr \"chic\" \"chic\" \"chic\" \"chic\" ...\n", " $ tmpd : num 23 28 55 59 57 57 75 61 73 78 ...\n", " $ dptp : num 21.9 25.8 51.3 53.7 52 56 65.8 59 60.3 67.1 ...\n", " $ date : Date, format: \"1998-01-17\" \"1998-01-23\" ...\n", " $ pm25tmean2: num 38.1 34 39.4 35.4 33.3 ...\n", " $ pm10tmean2: num 32.5 38.7 34 28.5 35 ...\n", " $ o3tmean2 : num 3.18 1.75 10.79 14.3 20.66 ...\n", " $ no2tmean2 : num 25.3 29.4 25.3 31.4 26.8 ...\n" ] }, { "data": { "text/plain": [ " Min. 1st Qu. Median Mean 3rd Qu. Max. 
\n", " 30.05 32.12 35.04 36.63 39.53 61.50 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "chic.f <- filter(chicago, pm25tmean2 > 30)\n", "str(chic.f)\n", "summary(chic.f$pm25tmean2)" ] }, { "cell_type": "markdown", "id": "f767ec08", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- We could for example extract the rows where PM2.5 is greater than 30 *and* temperature is greater than 80 degrees Fahrenheit." ] }, { "cell_type": "code", "execution_count": 22, "id": "b4b3ff2a", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", " A data.frame: 17 × 3 datetmpdpm25tmean2 <date><dbl><dbl> 1998-08-238139.60000 1998-09-068131.50000 2001-07-208232.30000 2001-08-018443.70000 2001-08-088538.83750 2001-08-098438.20000 2002-06-208233.00000 2002-06-238242.50000 2002-07-088133.10000 2002-07-188238.85000 2003-06-258233.90000 2003-07-048432.90000 2005-06-248631.85714 2005-06-278251.53750 2005-06-288531.20000 2005-07-178432.70000 2005-08-038437.90000 \n" ], "text/latex": [ "A data.frame: 17 × 3\n", "\\begin{tabular}{lll}\n", " date & tmpd & pm25tmean2\\\\\n", " & & \\\\\n", "\\hline\n", "\t 1998-08-23 & 81 & 39.60000\\\\\n", "\t 1998-09-06 & 81 & 31.50000\\\\\n", "\t 2001-07-20 & 82 & 32.30000\\\\\n", "\t 2001-08-01 & 84 & 43.70000\\\\\n", "\t 2001-08-08 & 85 & 38.83750\\\\\n", "\t 2001-08-09 & 84 & 38.20000\\\\\n", "\t 2002-06-20 & 82 & 33.00000\\\\\n", "\t 2002-06-23 & 82 & 42.50000\\\\\n", "\t 2002-07-08 & 81 & 33.10000\\\\\n", "\t 2002-07-18 & 82 & 38.85000\\\\\n", "\t 2003-06-25 & 82 & 33.90000\\\\\n", "\t 2003-07-04 & 84 & 32.90000\\\\\n", "\t 2005-06-24 & 86 & 31.85714\\\\\n", "\t 2005-06-27 & 82 & 51.53750\\\\\n", "\t 2005-06-28 & 85 & 31.20000\\\\\n", "\t 2005-07-17 & 84 & 32.70000\\\\\n", "\t 
2005-08-03 & 84 & 37.90000\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 17 × 3\n", "\n", "| date <date> | tmpd <dbl> | pm25tmean2 <dbl> |\n", "|---|---|---|\n", "| 1998-08-23 | 81 | 39.60000 |\n", "| 1998-09-06 | 81 | 31.50000 |\n", "| 2001-07-20 | 82 | 32.30000 |\n", "| 2001-08-01 | 84 | 43.70000 |\n", "| 2001-08-08 | 85 | 38.83750 |\n", "| 2001-08-09 | 84 | 38.20000 |\n", "| 2002-06-20 | 82 | 33.00000 |\n", "| 2002-06-23 | 82 | 42.50000 |\n", "| 2002-07-08 | 81 | 33.10000 |\n", "| 2002-07-18 | 82 | 38.85000 |\n", "| 2003-06-25 | 82 | 33.90000 |\n", "| 2003-07-04 | 84 | 32.90000 |\n", "| 2005-06-24 | 86 | 31.85714 |\n", "| 2005-06-27 | 82 | 51.53750 |\n", "| 2005-06-28 | 85 | 31.20000 |\n", "| 2005-07-17 | 84 | 32.70000 |\n", "| 2005-08-03 | 84 | 37.90000 |\n", "\n" ], "text/plain": [ " date tmpd pm25tmean2\n", "1 1998-08-23 81 39.60000 \n", "2 1998-09-06 81 31.50000 \n", "3 2001-07-20 82 32.30000 \n", "4 2001-08-01 84 43.70000 \n", "5 2001-08-08 85 38.83750 \n", "6 2001-08-09 84 38.20000 \n", "7 2002-06-20 82 33.00000 \n", "8 2002-06-23 82 42.50000 \n", "9 2002-07-08 81 33.10000 \n", "10 2002-07-18 82 38.85000 \n", "11 2003-06-25 82 33.90000 \n", "12 2003-07-04 84 32.90000 \n", "13 2005-06-24 86 31.85714 \n", "14 2005-06-27 82 51.53750 \n", "15 2005-06-28 85 31.20000 \n", "16 2005-07-17 84 32.70000 \n", "17 2005-08-03 84 37.90000 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)\n", "select(chic.f, date, tmpd, pm25tmean2)" ] }, { "cell_type": "markdown", "id": "b37953e2", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### arrange()\n", "\n", "The arrange() function is used to reorder rows of a data frame according to one of the variables/columns.\n", "\n", "- Here we can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation." 
] }, { "cell_type": "code", "execution_count": 27, "id": "1a734fd9", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", " A data.frame: 3 × 2 datepm25tmean2 <date><dbl> 11987-01-01NA 21987-01-02NA 31987-01-03NA \n" ], "text/latex": [ "A data.frame: 3 × 2\n", "\\begin{tabular}{r|ll}\n", " & date & pm25tmean2\\\\\n", " & & \\\\\n", "\\hline\n", "\t1 & 1987-01-01 & NA\\\\\n", "\t2 & 1987-01-02 & NA\\\\\n", "\t3 & 1987-01-03 & NA\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 3 × 2\n", "\n", "| | date <date> | pm25tmean2 <dbl> |\n", "|---|---|---|\n", "| 1 | 1987-01-01 | NA |\n", "| 2 | 1987-01-02 | NA |\n", "| 3 | 1987-01-03 | NA |\n", "\n" ], "text/plain": [ " date pm25tmean2\n", "1 1987-01-01 NA \n", "2 1987-01-02 NA \n", "3 1987-01-03 NA " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", " A data.frame: 3 × 2 datepm25tmean2 <date><dbl> 69382005-12-29 7.45000 69392005-12-3015.05714 69402005-12-3115.00000 \n" ], "text/latex": [ "A data.frame: 3 × 2\n", "\\begin{tabular}{r|ll}\n", " & date & pm25tmean2\\\\\n", " & & \\\\\n", "\\hline\n", "\t6938 & 2005-12-29 & 7.45000\\\\\n", "\t6939 & 2005-12-30 & 15.05714\\\\\n", "\t6940 & 2005-12-31 & 15.00000\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 3 × 2\n", "\n", "| | date <date> | pm25tmean2 <dbl> |\n", "|---|---|---|\n", "| 6938 | 2005-12-29 | 7.45000 |\n", "| 6939 | 2005-12-30 | 15.05714 |\n", "| 6940 | 2005-12-31 | 15.00000 |\n", "\n" ], "text/plain": [ " date pm25tmean2\n", "6938 2005-12-29 7.45000 \n", "6939 2005-12-30 15.05714 \n", "6940 2005-12-31 15.00000 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "chicago <- arrange(chicago, date)\n", "head(select(chicago, date, pm25tmean2), 3)\n", "tail(select(chicago, date, 
pm25tmean2), 3)" ] }, { "cell_type": "markdown", "id": "dffd7ce9", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Rows can also be sorted in descending order of a column by using the special desc() function. Looking at the first three and last three rows shows the dates in descending order." ] }, { "cell_type": "code", "execution_count": 36, "id": "131db5b6", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", " A data.frame: 3 × 2 datepm25tmean2 <date><dbl> 12005-12-3115.00000 22005-12-3015.05714 32005-12-29 7.45000 \n" ], "text/latex": [ "A data.frame: 3 × 2\n", "\\begin{tabular}{r|ll}\n", " & date & pm25tmean2\\\\\n", " & & \\\\\n", "\\hline\n", "\t1 & 2005-12-31 & 15.00000\\\\\n", "\t2 & 2005-12-30 & 15.05714\\\\\n", "\t3 & 2005-12-29 & 7.45000\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 3 × 2\n", "\n", "| | date <date> | pm25tmean2 <dbl> |\n", "|---|---|---|\n", "| 1 | 2005-12-31 | 15.00000 |\n", "| 2 | 2005-12-30 | 15.05714 |\n", "| 3 | 2005-12-29 | 7.45000 |\n", "\n" ], "text/plain": [ " date pm25tmean2\n", "1 2005-12-31 15.00000 \n", "2 2005-12-30 15.05714 \n", "3 2005-12-29 7.45000 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", " A data.frame: 3 × 2 datepm25tmean2 <date><dbl> 69381987-01-03NA 69391987-01-02NA 69401987-01-01NA \n" ], "text/latex": [ "A data.frame: 3 × 2\n", "\\begin{tabular}{r|ll}\n", " & date & pm25tmean2\\\\\n", " & & \\\\\n", "\\hline\n", "\t6938 & 1987-01-03 & NA\\\\\n", "\t6939 & 1987-01-02 & NA\\\\\n", "\t6940 & 1987-01-01 & NA\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 3 × 2\n", "\n", "| | date <date> | pm25tmean2 <dbl> |\n", "|---|---|---|\n", "| 6938 | 1987-01-03 | NA |\n", "| 6939 | 1987-01-02 | NA |\n", "| 6940 | 1987-01-01 | NA 
|\n", "\n" ], "text/plain": [ " date pm25tmean2\n", "6938 1987-01-03 NA \n", "6939 1987-01-02 NA \n", "6940 1987-01-01 NA " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "chicago2 <- arrange(chicago, desc(date))\n", "head(select(chicago2, date, pm25tmean2), 3)\n", "tail(select(chicago2, date, pm25tmean2), 3)" ] }, { "cell_type": "markdown", "id": "7fc22718", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### rename()\n", "\n", "Renaming a variable in a data frame in R is surprisingly hard to do! The rename() function is designed to make this process easier.\n", "\n", "- Here you can see the names of the first five variables in the chicago data frame. Now we rename the awkward variable names.\n" ] }, { "cell_type": "code", "execution_count": 37, "id": "387eaecb", "metadata": { "scrolled": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", " A data.frame: 3 × 5 citytmpddptpdatepm25tmean2 <chr><dbl><dbl><date><dbl> 1chic31.531.5001987-01-01NA 2chic33.029.8751987-01-02NA 3chic33.027.3751987-01-03NA \n" ], "text/latex": [ "A data.frame: 3 × 5\n", "\\begin{tabular}{r|lllll}\n", " & city & tmpd & dptp & date & pm25tmean2\\\\\n", " & & & & & \\\\\n", "\\hline\n", "\t1 & chic & 31.5 & 31.500 & 1987-01-01 & NA\\\\\n", "\t2 & chic & 33.0 & 29.875 & 1987-01-02 & NA\\\\\n", "\t3 & chic & 33.0 & 27.375 & 1987-01-03 & NA\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 3 × 5\n", "\n", "| | city <chr> | tmpd <dbl> | dptp <dbl> | date <date> | pm25tmean2 <dbl> |\n", "|---|---|---|---|---|---|\n", "| 1 | chic | 31.5 | 31.500 | 1987-01-01 | NA |\n", "| 2 | chic | 33.0 | 29.875 | 1987-01-02 | NA |\n", "| 3 | chic | 33.0 | 27.375 | 1987-01-03 | NA |\n", "\n" ], "text/plain": [ " city tmpd dptp date pm25tmean2\n", "1 chic 31.5 31.500 1987-01-01 NA \n", "2 chic 33.0 29.875 1987-01-02 NA \n", "3 chic 33.0 
27.375 1987-01-03 NA " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", " A data.frame: 3 × 5 citytmpddewpointdatepm25 <chr><dbl><dbl><date><dbl> 1chic31.531.5001987-01-01NA 2chic33.029.8751987-01-02NA 3chic33.027.3751987-01-03NA \n" ], "text/latex": [ "A data.frame: 3 × 5\n", "\\begin{tabular}{r|lllll}\n", " & city & tmpd & dewpoint & date & pm25\\\\\n", " & & & & & \\\\\n", "\\hline\n", "\t1 & chic & 31.5 & 31.500 & 1987-01-01 & NA\\\\\n", "\t2 & chic & 33.0 & 29.875 & 1987-01-02 & NA\\\\\n", "\t3 & chic & 33.0 & 27.375 & 1987-01-03 & NA\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 3 × 5\n", "\n", "| | city <chr> | tmpd <dbl> | dewpoint <dbl> | date <date> | pm25 <dbl> |\n", "|---|---|---|---|---|---|\n", "| 1 | chic | 31.5 | 31.500 | 1987-01-01 | NA |\n", "| 2 | chic | 33.0 | 29.875 | 1987-01-02 | NA |\n", "| 3 | chic | 33.0 | 27.375 | 1987-01-03 | NA |\n", "\n" ], "text/plain": [ " city tmpd dewpoint date pm25\n", "1 chic 31.5 31.500 1987-01-01 NA \n", "2 chic 33.0 29.875 1987-01-02 NA \n", "3 chic 33.0 27.375 1987-01-03 NA " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "head(chicago[, 1:5], 3)\n", "chicago3 <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)\n", "head(chicago3[, 1:5], 3) # with new variable name" ] }, { "cell_type": "markdown", "id": "fb427d38", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### mutate()\n", "\n", "The mutate() function exists to compute transformations of variables in a data frame.\n", "\n", "- For example, with air pollution data, we often want to *detrend* the data by subtracting the mean from the data." 
] }, { "cell_type": "code", "execution_count": 42, "id": "5713cbfe", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", " A data.frame: 6 × 9 citytmpddewpointdatepm25pm10tmean2o3tmean2no2tmean2pm25detrend <chr><dbl><dbl><date><dbl><dbl><dbl><dbl><dbl> 1chic31.531.5001987-01-01NA34.000004.25000019.98810NA 2chic33.029.8751987-01-02NA NA3.30434823.19099NA 3chic33.027.3751987-01-03NA34.166673.33333323.81548NA 4chic29.028.6251987-01-04NA47.000004.37500030.43452NA 5chic32.028.8751987-01-05NA NA4.75000030.33333NA 6chic40.035.1251987-01-06NA48.000005.83333325.77233NA \n" ], "text/latex": [ "A data.frame: 6 × 9\n", "\\begin{tabular}{r|lllllllll}\n", " & city & tmpd & dewpoint & date & pm25 & pm10tmean2 & o3tmean2 & no2tmean2 & pm25detrend\\\\\n", " & & & & & & & & & \\\\\n", "\\hline\n", "\t1 & chic & 31.5 & 31.500 & 1987-01-01 & NA & 34.00000 & 4.250000 & 19.98810 & NA\\\\\n", "\t2 & chic & 33.0 & 29.875 & 1987-01-02 & NA & NA & 3.304348 & 23.19099 & NA\\\\\n", "\t3 & chic & 33.0 & 27.375 & 1987-01-03 & NA & 34.16667 & 3.333333 & 23.81548 & NA\\\\\n", "\t4 & chic & 29.0 & 28.625 & 1987-01-04 & NA & 47.00000 & 4.375000 & 30.43452 & NA\\\\\n", "\t5 & chic & 32.0 & 28.875 & 1987-01-05 & NA & NA & 4.750000 & 30.33333 & NA\\\\\n", "\t6 & chic & 40.0 & 35.125 & 1987-01-06 & NA & 48.00000 & 5.833333 & 25.77233 & NA\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 6 × 9\n", "\n", "| | city <chr> | tmpd <dbl> | dewpoint <dbl> | date <date> | pm25 <dbl> | pm10tmean2 <dbl> | o3tmean2 <dbl> | no2tmean2 <dbl> | pm25detrend <dbl> |\n", "|---|---|---|---|---|---|---|---|---|---|\n", "| 1 | chic | 31.5 | 31.500 | 1987-01-01 | NA | 34.00000 | 4.250000 | 19.98810 | NA |\n", "| 2 | chic | 33.0 | 29.875 | 1987-01-02 | NA | NA | 3.304348 | 23.19099 | NA |\n", "| 3 | chic | 33.0 | 27.375 | 1987-01-03 | NA | 34.16667 | 3.333333 | 
23.81548 | NA |\n", "| 4 | chic | 29.0 | 28.625 | 1987-01-04 | NA | 47.00000 | 4.375000 | 30.43452 | NA |\n", "| 5 | chic | 32.0 | 28.875 | 1987-01-05 | NA | NA | 4.750000 | 30.33333 | NA |\n", "| 6 | chic | 40.0 | 35.125 | 1987-01-06 | NA | 48.00000 | 5.833333 | 25.77233 | NA |\n", "\n" ], "text/plain": [ " city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2 pm25detrend\n", "1 chic 31.5 31.500 1987-01-01 NA 34.00000 4.250000 19.98810 NA \n", "2 chic 33.0 29.875 1987-01-02 NA NA 3.304348 23.19099 NA \n", "3 chic 33.0 27.375 1987-01-03 NA 34.16667 3.333333 23.81548 NA \n", "4 chic 29.0 28.625 1987-01-04 NA 47.00000 4.375000 30.43452 NA \n", "5 chic 32.0 28.875 1987-01-05 NA NA 4.750000 30.33333 NA \n", "6 chic 40.0 35.125 1987-01-06 NA 48.00000 5.833333 25.77233 NA " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "chicago4 <- mutate(chicago3, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))\n", "head(chicago4)" ] }, { "cell_type": "markdown", "id": "1af51119", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### group_by()\n", "\n", "The group_by() function is used to generate summary statistics from the data frame within strata defined by a variable. For example, in this air pollution dataset, you might want to know what the average annual level of PM2.5 is.\n", "\n", "- First, we can create a year variable using as.POSIXlt().\n", "\n", "- Now we can create a separate data frame that splits the original data frame by year." 
] }, { "cell_type": "code", "execution_count": 45, "id": "0090cff3", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", " A tibble: 19 × 4 yearpm25o3no2 <dbl><dbl><dbl><dbl> 1987 NaN62.9696623.49369 1988 NaN61.6770824.52296 1989 NaN59.7272726.14062 1990 NaN52.2291722.59583 1991 NaN63.1041721.38194 1992 NaN50.8287024.78921 1993 NaN44.3009325.76993 1994 NaN52.1784428.47500 1995 NaN66.5875027.26042 1996 NaN58.3958326.38715 1997 NaN56.5416725.48143 199818.2646750.6625024.58649 199918.4964657.4886424.66667 200016.9380655.7610323.46082 200116.9263251.8198425.06522 200215.2733554.8804322.73750 200315.2318356.1660824.62500 200414.6286444.4824023.39130 200516.1855658.8412622.62387 \n" ], "text/latex": [ "A tibble: 19 × 4\n", "\\begin{tabular}{llll}\n", " year & pm25 & o3 & no2\\\\\n", " & & & \\\\\n", "\\hline\n", "\t 1987 & NaN & 62.96966 & 23.49369\\\\\n", "\t 1988 & NaN & 61.67708 & 24.52296\\\\\n", "\t 1989 & NaN & 59.72727 & 26.14062\\\\\n", "\t 1990 & NaN & 52.22917 & 22.59583\\\\\n", "\t 1991 & NaN & 63.10417 & 21.38194\\\\\n", "\t 1992 & NaN & 50.82870 & 24.78921\\\\\n", "\t 1993 & NaN & 44.30093 & 25.76993\\\\\n", "\t 1994 & NaN & 52.17844 & 28.47500\\\\\n", "\t 1995 & NaN & 66.58750 & 27.26042\\\\\n", "\t 1996 & NaN & 58.39583 & 26.38715\\\\\n", "\t 1997 & NaN & 56.54167 & 25.48143\\\\\n", "\t 1998 & 18.26467 & 50.66250 & 24.58649\\\\\n", "\t 1999 & 18.49646 & 57.48864 & 24.66667\\\\\n", "\t 2000 & 16.93806 & 55.76103 & 23.46082\\\\\n", "\t 2001 & 16.92632 & 51.81984 & 25.06522\\\\\n", "\t 2002 & 15.27335 & 54.88043 & 22.73750\\\\\n", "\t 2003 & 15.23183 & 56.16608 & 24.62500\\\\\n", "\t 2004 & 14.62864 & 44.48240 & 23.39130\\\\\n", "\t 2005 & 16.18556 & 58.84126 & 22.62387\\\\\n", "\\end{tabular}\n" ], 
"text/markdown": [ "\n", "A tibble: 19 × 4\n", "\n", "| year <dbl> | pm25 <dbl> | o3 <dbl> | no2 <dbl> |\n", "|---|---|---|---|\n", "| 1987 | NaN | 62.96966 | 23.49369 |\n", "| 1988 | NaN | 61.67708 | 24.52296 |\n", "| 1989 | NaN | 59.72727 | 26.14062 |\n", "| 1990 | NaN | 52.22917 | 22.59583 |\n", "| 1991 | NaN | 63.10417 | 21.38194 |\n", "| 1992 | NaN | 50.82870 | 24.78921 |\n", "| 1993 | NaN | 44.30093 | 25.76993 |\n", "| 1994 | NaN | 52.17844 | 28.47500 |\n", "| 1995 | NaN | 66.58750 | 27.26042 |\n", "| 1996 | NaN | 58.39583 | 26.38715 |\n", "| 1997 | NaN | 56.54167 | 25.48143 |\n", "| 1998 | 18.26467 | 50.66250 | 24.58649 |\n", "| 1999 | 18.49646 | 57.48864 | 24.66667 |\n", "| 2000 | 16.93806 | 55.76103 | 23.46082 |\n", "| 2001 | 16.92632 | 51.81984 | 25.06522 |\n", "| 2002 | 15.27335 | 54.88043 | 22.73750 |\n", "| 2003 | 15.23183 | 56.16608 | 24.62500 |\n", "| 2004 | 14.62864 | 44.48240 | 23.39130 |\n", "| 2005 | 16.18556 | 58.84126 | 22.62387 |\n", "\n" ], "text/plain": [ " year pm25 o3 no2 \n", "1 1987 NaN 62.96966 23.49369\n", "2 1988 NaN 61.67708 24.52296\n", "3 1989 NaN 59.72727 26.14062\n", "4 1990 NaN 52.22917 22.59583\n", "5 1991 NaN 63.10417 21.38194\n", "6 1992 NaN 50.82870 24.78921\n", "7 1993 NaN 44.30093 25.76993\n", "8 1994 NaN 52.17844 28.47500\n", "9 1995 NaN 66.58750 27.26042\n", "10 1996 NaN 58.39583 26.38715\n", "11 1997 NaN 56.54167 25.48143\n", "12 1998 18.26467 50.66250 24.58649\n", "13 1999 18.49646 57.48864 24.66667\n", "14 2000 16.93806 55.76103 23.46082\n", "15 2001 16.92632 51.81984 25.06522\n", "16 2002 15.27335 54.88043 22.73750\n", "17 2003 15.23183 56.16608 24.62500\n", "18 2004 14.62864 44.48240 23.39130\n", "19 2005 16.18556 58.84126 22.62387" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "chicago5 <- mutate(chicago3, year = as.POSIXlt(date)$year + 1900)\n", "years <- group_by(chicago5, year)\n", "summarize(years, pm25 = mean(pm25, na.rm = TRUE),\n", " o3 = max(o3tmean2, na.rm = TRUE),\n", " no2 = 
median(no2tmean2, na.rm = TRUE))" ] }, { "cell_type": "markdown", "id": "a63cc665", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "In a slightly more complicated example, we might want to know the average levels of ozone (o3) and nitrogen dioxide (no2) within quintiles of pm25. A slicker way to do this would be through a regression model, but we can do this quickly with group_by() and summarize().\n", "\n", "- First, we can create a categorical variable of pm25 divided into quintiles.\n", "\n", "- Now we can group the data frame by the pm25.quint variable.\n", "\n", "- Finally, we can compute the mean of o3 and no2 within quintiles of pm25." ] }, { "cell_type": "code", "execution_count": 46, "id": "5423b372", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
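<table>\n", "<caption>A tibble: 6 × 3</caption>\n", "<thead>\n",
"\t<tr><th>pm25.quint</th><th>o3</th><th>no2</th></tr>\n",
"\t<tr><th>&lt;fct&gt;</th><th>&lt;dbl&gt;</th><th>&lt;dbl&gt;</th></tr>\n",
"</thead>\n", "<tbody>\n",
"\t<tr><td>(1.7,8.7]</td><td>21.66401</td><td>17.99129</td></tr>\n",
"\t<tr><td>(8.7,12.4]</td><td>20.38248</td><td>22.13004</td></tr>\n",
"\t<tr><td>(12.4,16.7]</td><td>20.66160</td><td>24.35708</td></tr>\n",
"\t<tr><td>(16.7,22.6]</td><td>19.88122</td><td>27.27132</td></tr>\n",
"\t<tr><td>(22.6,61.5]</td><td>20.31775</td><td>29.64427</td></tr>\n",
"\t<tr><td>NA</td><td>18.79044</td><td>25.77585</td></tr>\n",
"</tbody>\n",
"</table>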
\n" ], "text/latex": [ "A tibble: 6 × 3\n", "\\begin{tabular}{lll}\n", " pm25.quint & o3 & no2\\\\\n", " & & \\\\\n", "\\hline\n", "\t (1.7,8.7{]} & 21.66401 & 17.99129\\\\\n", "\t (8.7,12.4{]} & 20.38248 & 22.13004\\\\\n", "\t (12.4,16.7{]} & 20.66160 & 24.35708\\\\\n", "\t (16.7,22.6{]} & 19.88122 & 27.27132\\\\\n", "\t (22.6,61.5{]} & 20.31775 & 29.64427\\\\\n", "\t NA & 18.79044 & 25.77585\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 6 × 3\n", "\n", "| pm25.quint <fct> | o3 <dbl> | no2 <dbl> |\n", "|---|---|---|\n", "| (1.7,8.7] | 21.66401 | 17.99129 |\n", "| (8.7,12.4] | 20.38248 | 22.13004 |\n", "| (12.4,16.7] | 20.66160 | 24.35708 |\n", "| (16.7,22.6] | 19.88122 | 27.27132 |\n", "| (22.6,61.5] | 20.31775 | 29.64427 |\n", "| NA | 18.79044 | 25.77585 |\n", "\n" ], "text/plain": [ " pm25.quint o3 no2 \n", "1 (1.7,8.7] 21.66401 17.99129\n", "2 (8.7,12.4] 20.38248 22.13004\n", "3 (12.4,16.7] 20.66160 24.35708\n", "4 (16.7,22.6] 19.88122 27.27132\n", "5 (22.6,61.5] 20.31775 29.64427\n", "6 NA 18.79044 25.77585" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "qq <- quantile(chicago3$pm25, seq(0, 1, 0.2), na.rm = TRUE)\n", "chicago6 <- mutate(chicago3, pm25.quint = cut(pm25, qq))\n", "quint <- group_by(chicago6, pm25.quint)\n", "summarize(quint, o3 = mean(o3tmean2, na.rm = TRUE),\n", " no2 = mean(no2tmean2, na.rm = TRUE))" ] }, { "cell_type": "markdown", "id": "e9d43f2a", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Summary\n", "\n", "The dplyr package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of group_by() and summarize().\n", "\n", "* dplyr can work with other data frame \"backends\" such as SQL databases. 
There is an SQL interface for relational databases via the DBI package.\n", "\n", "* dplyr can be integrated with the data.table package for large, fast tables.\n", "\n", "* The dplyr package is a handy way to both simplify and speed up your data frame management code. It's rare that you get such a combination at the same time!\n" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.3.2" } }, "nbformat": 4, "nbformat_minor": 5 }