{ "cells": [ { "cell_type": "markdown", "id": "b3cf3c09", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Manipulating DataFrames with Pandas\n", "\n", "Feng Li\n", "\n", "School of Statistics and Mathematics\n", "\n", "Central University of Finance and Economics\n", "\n", "[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)\n", "\n", "[https://feng.li/python](https://feng.li/python)" ] }, { "cell_type": "markdown", "id": "05cd1b5b", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Importing data\n", "\n", "- A key, but often under-appreciated, step in data analysis is importing the data that we wish to analyze. \n", "\n", "- Pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a `DataFrame` object. These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported." ] }, { "cell_type": "markdown", "id": "69e28db5", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Read extermal data\n", "\n", "- There are several other data formats that can be imported into Python and converted into DataFrames, with the help of buitl-in or third-party libraries. \n", "\n", "- These include \n", " - CSV: `read_csv()`\n", " - Excel: `read_excel()`\n", " - JSON: `read_json()`\n", " - Parquet Format: `read_parquet()`\n", " - Stata: `read_stata()`\n", " - ...\n", " \n", "These are beyond the scope of this tutorial, but are covered in https://pandas.pydata.org/docs/user_guide/io.html ." ] }, { "cell_type": "markdown", "id": "99f2153c", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's start with some more bacteria data, stored in csv format." ] }, { "cell_type": "code", "execution_count": 1, "id": "a2ea256e", "metadata": { "scrolled": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Taxon,Patient,Tissue,Stool\r\n", "Firmicutes,1,632,305\r\n", "Firmicutes,2,136,4182\r\n", "Firmicutes,3,1174,703\r\n", "Firmicutes,4,408,3946\r\n", "Firmicutes,5,831,8605\r\n", "Firmicutes,6,693,50\r\n", "Firmicutes,7,718,717\r\n", "Firmicutes,8,173,33\r\n", "Firmicutes,9,228,80\r\n" ] } ], "source": [ "! head data/microbiome.csv" ] }, { "cell_type": "markdown", "id": "919d65a6", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- This table can be read into a DataFrame using `read_csv`:" ] }, { "cell_type": "code", "execution_count": 2, "id": "d5fd0757", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TaxonPatientTissueStool
0Firmicutes1632305
1Firmicutes21364182
2Firmicutes31174703
3Firmicutes44083946
4Firmicutes58318605
...............
70Other112036
71Other123926
72Other132825
73Other141222
74Other1530532
\n", "

75 rows × 4 columns

\n", "
" ], "text/plain": [ " Taxon Patient Tissue Stool\n", "0 Firmicutes 1 632 305\n", "1 Firmicutes 2 136 4182\n", "2 Firmicutes 3 1174 703\n", "3 Firmicutes 4 408 3946\n", "4 Firmicutes 5 831 8605\n", ".. ... ... ... ...\n", "70 Other 11 203 6\n", "71 Other 12 392 6\n", "72 Other 13 28 25\n", "73 Other 14 12 22\n", "74 Other 15 305 32\n", "\n", "[75 rows x 4 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "mb = pd.read_csv(\"data/microbiome.csv\")\n", "mb" ] }, { "cell_type": "markdown", "id": "7999d005", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Notice that `read_csv` automatically considered the first row in the file to be a header row. We can override default behavior by customizing some the arguments, like `header`, `names` or `index_col`." ] }, { "cell_type": "code", "execution_count": 3, "id": "4e5ba16d", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
0TaxonPatientTissueStool
1Firmicutes1632305
2Firmicutes21364182
3Firmicutes31174703
4Firmicutes44083946
...............
71Other112036
72Other123926
73Other132825
74Other141222
75Other1530532
\n", "

76 rows × 4 columns

\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 Taxon Patient Tissue Stool\n", "1 Firmicutes 1 632 305\n", "2 Firmicutes 2 136 4182\n", "3 Firmicutes 3 1174 703\n", "4 Firmicutes 4 408 3946\n", ".. ... ... ... ...\n", "71 Other 11 203 6\n", "72 Other 12 392 6\n", "73 Other 13 28 25\n", "74 Other 14 12 22\n", "75 Other 15 305 32\n", "\n", "[76 rows x 4 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\"data/microbiome.csv\", header=None)" ] }, { "cell_type": "markdown", "id": "d9c383f6", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- For a more useful index, we can specify the first two columns, which together provide a unique index to the data." ] }, { "cell_type": "code", "execution_count": 4, "id": "aecf71d6", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TissueStool
TaxonPatient
Firmicutes1632305
21364182
31174703
44083946
58318605
............
Other112036
123926
132825
141222
1530532
\n", "

75 rows × 2 columns

\n", "
" ], "text/plain": [ " Tissue Stool\n", "Taxon Patient \n", "Firmicutes 1 632 305\n", " 2 136 4182\n", " 3 1174 703\n", " 4 408 3946\n", " 5 831 8605\n", "... ... ...\n", "Other 11 203 6\n", " 12 392 6\n", " 13 28 25\n", " 14 12 22\n", " 15 305 32\n", "\n", "[75 rows x 2 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mb = pd.read_csv(\"data/microbiome.csv\", index_col=['Taxon','Patient'])\n", "mb" ] }, { "cell_type": "markdown", "id": "b871f14b", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- This is called a *hierarchical* index, which we will revisit later in the tutorial." ] }, { "cell_type": "markdown", "id": "1a44f9a6", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- If we have sections of data that we do not wish to import (for example, known bad data), we can populate the `skiprows` argument. This is useful for large dataset." ] }, { "cell_type": "code", "execution_count": 5, "id": "94ba2549", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TaxonPatientTissueStool
0Firmicutes1632305
1Firmicutes21364182
2Firmicutes58318605
3Firmicutes7718717
4Firmicutes817333
...............
67Other112036
68Other123926
69Other132825
70Other141222
71Other1530532
\n", "

72 rows × 4 columns

\n", "
" ], "text/plain": [ " Taxon Patient Tissue Stool\n", "0 Firmicutes 1 632 305\n", "1 Firmicutes 2 136 4182\n", "2 Firmicutes 5 831 8605\n", "3 Firmicutes 7 718 717\n", "4 Firmicutes 8 173 33\n", ".. ... ... ... ...\n", "67 Other 11 203 6\n", "68 Other 12 392 6\n", "69 Other 13 28 25\n", "70 Other 14 12 22\n", "71 Other 15 305 32\n", "\n", "[72 rows x 4 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\"data/microbiome.csv\", skiprows=[3,4,6])" ] }, { "cell_type": "markdown", "id": "33b0d12a", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Conversely, if we only want to import a small number of rows from, say, a very large data file we can use `nrows` to retrive the first `nrows`." ] }, { "cell_type": "code", "execution_count": 6, "id": "57b50886", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TaxonPatientTissueStool
0Firmicutes1632305
1Firmicutes21364182
2Firmicutes31174703
3Firmicutes44083946
\n", "
" ], "text/plain": [ " Taxon Patient Tissue Stool\n", "0 Firmicutes 1 632 305\n", "1 Firmicutes 2 136 4182\n", "2 Firmicutes 3 1174 703\n", "3 Firmicutes 4 408 3946" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\"data/microbiome.csv\", nrows=4)" ] }, { "cell_type": "markdown", "id": "c04905d5", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Alternately, if we want to process our data in reasonable chunks, the `chunksize` argument will return an iterable object that can be employed in a data processing loop. For example, our microbiome data are organized by bacterial phylum, with 15 patients represented in each:" ] }, { "cell_type": "code", "execution_count": 7, "id": "bbe00897", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_chunks = pd.read_csv(\"data/microbiome.csv\", chunksize=15)\n", "data_chunks" ] }, { "cell_type": "markdown", "id": "b5de7a79", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Missing values\n", "\n", "- Most real-world data is incomplete, with values missing due to incomplete observation, data entry or transcription error, or other reasons. Pandas will automatically recognize and parse common missing data indicators, including `NA` and `NULL`." ] }, { "cell_type": "code", "execution_count": 8, "id": "67a1927b", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Taxon,Patient,Tissue,Stool\r\n", "Firmicutes,1,632,305\r\n", "Firmicutes,2,136,4182\r\n", "Firmicutes,3,,703\r\n", "Firmicutes,4,408,3946\r\n", "Firmicutes,5,831,8605\r\n", "Firmicutes,6,693,50\r\n", "Firmicutes,7,718,717\r\n", "Firmicutes,8,173,33\r\n", "Firmicutes,9,228,NA\r\n" ] } ], "source": [ "!head data/microbiome_missing.csv" ] }, { "cell_type": "code", "execution_count": 9, "id": "8634773b", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TaxonPatientTissueStool
0Firmicutes1632305.0
1Firmicutes21364182.0
2Firmicutes3NaN703.0
3Firmicutes44083946.0
4Firmicutes58318605.0
5Firmicutes669350.0
6Firmicutes7718717.0
7Firmicutes817333.0
8Firmicutes9228NaN
9Firmicutes101623196.0
10Firmicutes11372-99999.0
11Firmicutes1242554361.0
12Firmicutes131071667.0
13Firmicutes14?223.0
14Firmicutes152812377.0
15Proteobacteria116383886.0
16Proteobacteria224691821.0
17Proteobacteria3839661.0
18Proteobacteria4441418.0
19Proteobacteria51204483.0
\n", "
" ], "text/plain": [ " Taxon Patient Tissue Stool\n", "0 Firmicutes 1 632 305.0\n", "1 Firmicutes 2 136 4182.0\n", "2 Firmicutes 3 NaN 703.0\n", "3 Firmicutes 4 408 3946.0\n", "4 Firmicutes 5 831 8605.0\n", "5 Firmicutes 6 693 50.0\n", "6 Firmicutes 7 718 717.0\n", "7 Firmicutes 8 173 33.0\n", "8 Firmicutes 9 228 NaN\n", "9 Firmicutes 10 162 3196.0\n", "10 Firmicutes 11 372 -99999.0\n", "11 Firmicutes 12 4255 4361.0\n", "12 Firmicutes 13 107 1667.0\n", "13 Firmicutes 14 ? 223.0\n", "14 Firmicutes 15 281 2377.0\n", "15 Proteobacteria 1 1638 3886.0\n", "16 Proteobacteria 2 2469 1821.0\n", "17 Proteobacteria 3 839 661.0\n", "18 Proteobacteria 4 4414 18.0\n", "19 Proteobacteria 5 12044 83.0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "pd.read_csv(\"data/microbiome_missing.csv\").head(20)" ] }, { "cell_type": "markdown", "id": "fdb640c0", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Above, Pandas recognized `NA` and an empty field as missing data." ] }, { "cell_type": "code", "execution_count": 10, "id": "729dcfcf", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TaxonPatientTissueStool
0FalseFalseFalseFalse
1FalseFalseFalseFalse
2FalseFalseTrueFalse
3FalseFalseFalseFalse
4FalseFalseFalseFalse
5FalseFalseFalseFalse
6FalseFalseFalseFalse
7FalseFalseFalseFalse
8FalseFalseFalseTrue
9FalseFalseFalseFalse
10FalseFalseFalseFalse
11FalseFalseFalseFalse
12FalseFalseFalseFalse
13FalseFalseFalseFalse
14FalseFalseFalseFalse
15FalseFalseFalseFalse
16FalseFalseFalseFalse
17FalseFalseFalseFalse
18FalseFalseFalseFalse
19FalseFalseFalseFalse
\n", "
" ], "text/plain": [ " Taxon Patient Tissue Stool\n", "0 False False False False\n", "1 False False False False\n", "2 False False True False\n", "3 False False False False\n", "4 False False False False\n", "5 False False False False\n", "6 False False False False\n", "7 False False False False\n", "8 False False False True\n", "9 False False False False\n", "10 False False False False\n", "11 False False False False\n", "12 False False False False\n", "13 False False False False\n", "14 False False False False\n", "15 False False False False\n", "16 False False False False\n", "17 False False False False\n", "18 False False False False\n", "19 False False False False" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.isnull(pd.read_csv(\"data/microbiome_missing.csv\")).head(20)" ] }, { "cell_type": "markdown", "id": "eb7044f3", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Unfortunately, there will sometimes be inconsistency with the conventions for missing data. In this example, there is a question mark \"?\" and a large negative number where there should have been a positive integer. We can specify additional symbols with the `na_values` argument:\n", " " ] }, { "cell_type": "code", "execution_count": 11, "id": "e12e3438", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TaxonPatientTissueStool
0Firmicutes1632.0305.0
1Firmicutes2136.04182.0
2Firmicutes3NaN703.0
3Firmicutes4408.03946.0
4Firmicutes5831.08605.0
5Firmicutes6693.050.0
6Firmicutes7718.0717.0
7Firmicutes8173.033.0
8Firmicutes9228.0NaN
9Firmicutes10162.03196.0
10Firmicutes11372.0NaN
11Firmicutes124255.04361.0
12Firmicutes13107.01667.0
13Firmicutes14NaN223.0
14Firmicutes15281.02377.0
15Proteobacteria11638.03886.0
16Proteobacteria22469.01821.0
17Proteobacteria3839.0661.0
18Proteobacteria44414.018.0
19Proteobacteria512044.083.0
\n", "
" ], "text/plain": [ " Taxon Patient Tissue Stool\n", "0 Firmicutes 1 632.0 305.0\n", "1 Firmicutes 2 136.0 4182.0\n", "2 Firmicutes 3 NaN 703.0\n", "3 Firmicutes 4 408.0 3946.0\n", "4 Firmicutes 5 831.0 8605.0\n", "5 Firmicutes 6 693.0 50.0\n", "6 Firmicutes 7 718.0 717.0\n", "7 Firmicutes 8 173.0 33.0\n", "8 Firmicutes 9 228.0 NaN\n", "9 Firmicutes 10 162.0 3196.0\n", "10 Firmicutes 11 372.0 NaN\n", "11 Firmicutes 12 4255.0 4361.0\n", "12 Firmicutes 13 107.0 1667.0\n", "13 Firmicutes 14 NaN 223.0\n", "14 Firmicutes 15 281.0 2377.0\n", "15 Proteobacteria 1 1638.0 3886.0\n", "16 Proteobacteria 2 2469.0 1821.0\n", "17 Proteobacteria 3 839.0 661.0\n", "18 Proteobacteria 4 4414.0 18.0\n", "19 Proteobacteria 5 12044.0 83.0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\"data/microbiome_missing.csv\", na_values=['?', -99999]).head(20)" ] }, { "cell_type": "markdown", "id": "0e212c61", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "These can be specified on a column-wise basis using an appropriate dict as the argument for `na_values`." ] }, { "cell_type": "markdown", "id": "17109474", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Manipulating indices\n", "\n", "**Reindexing** allows users to manipulate the data labels in a DataFrame. It forces a DataFrame to conform to the new index, and optionally, fill in missing data if requested. \n", "\n", "For some variety, we will leave our digestive tract bacteria behind and employ some baseball data." ] }, { "cell_type": "markdown", "id": "3280a3bd", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Specify an unique index" ] }, { "cell_type": "code", "execution_count": 12, "id": "756910fd", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
id
88641womacto0120062CHNNL19506141...2.01.01.044.00.00.03.00.00.0
88643schilcu0120061BOSAL312010...0.00.00.001.00.00.00.00.00.0
88645myersmi0120061NYAAL620000...0.00.00.000.00.00.00.00.00.0
88649helliri0120061MILNL203000...0.00.00.002.00.00.00.00.00.0
88650johnsra0520061NYAAL336010...0.00.00.004.00.00.00.00.00.0
..................................................................
89525benitar0120072FLONL340000...0.00.00.000.00.00.00.00.00.0
89526benitar0120071SFNNL190000...0.00.00.000.00.00.00.00.00.0
89530ausmubr0120071HOUNL117349388216...25.06.01.03774.03.06.04.01.011.0
89533aloumo0120071NYNNL873285111219...49.03.00.02730.05.02.00.03.013.0
89534alomasa0220071NYNNL822131...0.00.00.003.00.00.00.00.00.0
\n", "

100 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b ... rbi \\\n", "id ... \n", "88641 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 \n", "88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 \n", "88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 \n", "88649 helliri01 2006 1 MIL NL 20 3 0 0 0 ... 0.0 \n", "88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... 0.0 \n", "... ... ... ... ... .. ... ... .. ... ... ... ... \n", "89525 benitar01 2007 2 FLO NL 34 0 0 0 0 ... 0.0 \n", "89526 benitar01 2007 1 SFN NL 19 0 0 0 0 ... 0.0 \n", "89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 \n", "89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 \n", "89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 \n", "\n", " sb cs bb so ibb hbp sh sf gidp \n", "id \n", "88641 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n", "88643 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 \n", "88645 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "88649 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n", "88650 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0 \n", "... ... ... .. ... ... ... ... ... ... \n", "89525 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "89526 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "89530 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0 \n", "89533 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0 \n", "89534 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[100 rows x 22 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball = pd.read_csv(\"data/baseball.csv\", index_col='id')\n", "baseball" ] }, { "cell_type": "markdown", "id": "b86fe868", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Notice that we specified the `id` column as the index, since it appears to be an unique identifier. We could try to create a unique index ourselves by combining `player` and `year`:" ] }, { "cell_type": "code", "execution_count": 13, "id": "2aa60839", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
womacto012006womacto0120062CHNNL19506141...2.01.01.044.00.00.03.00.00.0
schilcu012006schilcu0120061BOSAL312010...0.00.00.001.00.00.00.00.00.0
myersmi012006myersmi0120061NYAAL620000...0.00.00.000.00.00.00.00.00.0
helliri012006helliri0120061MILNL203000...0.00.00.002.00.00.00.00.00.0
johnsra052006johnsra0520061NYAAL336010...0.00.00.004.00.00.00.00.00.0
..................................................................
benitar012007benitar0120072FLONL340000...0.00.00.000.00.00.00.00.00.0
benitar012007benitar0120071SFNNL190000...0.00.00.000.00.00.00.00.00.0
ausmubr012007ausmubr0120071HOUNL117349388216...25.06.01.03774.03.06.04.01.011.0
aloumo012007aloumo0120071NYNNL873285111219...49.03.00.02730.05.02.00.03.013.0
alomasa022007alomasa0220071NYNNL822131...0.00.00.003.00.00.00.00.00.0
\n", "

100 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b ... \\\n", "womacto012006 womacto01 2006 2 CHN NL 19 50 6 14 1 ... \n", "schilcu012006 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... \n", "myersmi012006 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... \n", "helliri012006 helliri01 2006 1 MIL NL 20 3 0 0 0 ... \n", "johnsra052006 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... \n", "... ... ... ... ... .. ... ... .. ... ... ... \n", "benitar012007 benitar01 2007 2 FLO NL 34 0 0 0 0 ... \n", "benitar012007 benitar01 2007 1 SFN NL 19 0 0 0 0 ... \n", "ausmubr012007 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... \n", "aloumo012007 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... \n", "alomasa022007 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... \n", "\n", " rbi sb cs bb so ibb hbp sh sf gidp \n", "womacto012006 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n", "schilcu012006 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 \n", "myersmi012006 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "helliri012006 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n", "johnsra052006 0.0 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0 \n", "... ... ... ... .. ... ... ... ... ... ... \n", "benitar012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "benitar012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "ausmubr012007 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0 \n", "aloumo012007 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0 \n", "alomasa022007 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[100 rows x 22 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "player_id = baseball.player + baseball.year.astype(str)\n", "baseball_newind = baseball.copy()\n", "baseball_newind.index = player_id\n", "baseball_newind" ] }, { "cell_type": "code", "execution_count": 14, "id": "7151a142", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_newind.index.is_unique" ] }, { "cell_type": "markdown", "id": "eafc2580", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- So, indices need not be unique. Our choice is not unique because some players change teams within years. The most important consequence of a non-unique index is that indexing by label will return multiple values for some labels:" ] }, { "cell_type": "code", "execution_count": 15, "id": "ae49125b", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
id
89534alomasa0220071NYNNL822131...0.00.00.003.00.00.00.00.00.0
89533aloumo0120071NYNNL873285111219...49.03.00.02730.05.02.00.03.013.0
89530ausmubr0120071HOUNL117349388216...25.06.01.03774.03.06.04.01.011.0
89526benitar0120071SFNNL190000...0.00.00.000.00.00.00.00.00.0
89525benitar0120072FLONL340000...0.00.00.000.00.00.00.00.00.0
..................................................................
88650johnsra0520061NYAAL336010...0.00.00.004.00.00.00.00.00.0
88649helliri0120061MILNL203000...0.00.00.002.00.00.00.00.00.0
88645myersmi0120061NYAAL620000...0.00.00.000.00.00.00.00.00.0
88643schilcu0120061BOSAL312010...0.00.00.001.00.00.00.00.00.0
88641womacto0120062CHNNL19506141...2.01.01.044.00.00.03.00.00.0
\n", "

100 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b ... rbi \\\n", "id ... \n", "89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 \n", "89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 \n", "89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 \n", "89526 benitar01 2007 1 SFN NL 19 0 0 0 0 ... 0.0 \n", "89525 benitar01 2007 2 FLO NL 34 0 0 0 0 ... 0.0 \n", "... ... ... ... ... .. ... ... .. ... ... ... ... \n", "88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... 0.0 \n", "88649 helliri01 2006 1 MIL NL 20 3 0 0 0 ... 0.0 \n", "88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 \n", "88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 \n", "88641 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 \n", "\n", " sb cs bb so ibb hbp sh sf gidp \n", "id \n", "89534 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n", "89533 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0 \n", "89530 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0 \n", "89526 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "89525 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "... ... ... .. ... ... ... ... ... ... \n", "88650 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0 \n", "88649 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n", "88645 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "88643 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 \n", "88641 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n", "\n", "[100 rows x 22 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reverse_index = baseball.index[::-1]\n", "baseball.reindex(reverse_index)" ] }, { "cell_type": "markdown", "id": "fb3fa641", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Notice that the `id` index is not sequential. Say we wanted to populate the table with every `id` value. We could specify and index that is a sequence from the first to the last `id` numbers in the database, and Pandas would fill in the missing data with `NaN` values:" ] }, { "cell_type": "code", "execution_count": 16, "id": "78d17c34", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
id
88641womacto012006.02.0CHNNL19.050.06.014.01.0...2.01.01.04.04.00.00.03.00.00.0
88642NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
88643schilcu012006.01.0BOSAL31.02.00.01.00.0...0.00.00.00.01.00.00.00.00.00.0
88644NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
88645myersmi012006.01.0NYAAL62.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
..................................................................
89529NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
89530ausmubr012007.01.0HOUNL117.0349.038.082.016.0...25.06.01.037.074.03.06.04.01.011.0
89531NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
89532NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
89533aloumo012007.01.0NYNNL87.0328.051.0112.019.0...49.03.00.027.030.05.02.00.03.013.0
\n", "

893 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b \\\n", "id \n", "88641 womacto01 2006.0 2.0 CHN NL 19.0 50.0 6.0 14.0 1.0 \n", "88642 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "88643 schilcu01 2006.0 1.0 BOS AL 31.0 2.0 0.0 1.0 0.0 \n", "88644 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "88645 myersmi01 2006.0 1.0 NYA AL 62.0 0.0 0.0 0.0 0.0 \n", "... ... ... ... ... ... ... ... ... ... ... \n", "89529 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "89530 ausmubr01 2007.0 1.0 HOU NL 117.0 349.0 38.0 82.0 16.0 \n", "89531 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "89532 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "89533 aloumo01 2007.0 1.0 NYN NL 87.0 328.0 51.0 112.0 19.0 \n", "\n", " ... rbi sb cs bb so ibb hbp sh sf gidp \n", "id ... \n", "88641 ... 2.0 1.0 1.0 4.0 4.0 0.0 0.0 3.0 0.0 0.0 \n", "88642 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "88643 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 \n", "88644 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "88645 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "... ... ... ... ... ... ... ... ... ... ... ... \n", "89529 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "89530 ... 25.0 6.0 1.0 37.0 74.0 3.0 6.0 4.0 1.0 11.0 \n", "89531 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "89532 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "89533 ... 49.0 3.0 0.0 27.0 30.0 5.0 2.0 0.0 3.0 13.0 \n", "\n", "[893 rows x 22 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "id_range = range(baseball.index.values.min(), baseball.index.values.max())\n", "baseball.reindex(id_range)" ] }, { "cell_type": "code", "execution_count": 17, "id": "eccc6c79", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
player
id
88641womacto01
88642mr.nobody
88643schilcu01
88644mr.nobody
88645myersmi01
......
89529mr.nobody
89530ausmubr01
89531mr.nobody
89532mr.nobody
89533aloumo01
\n", "

893 rows × 1 columns

\n", "
" ], "text/plain": [ " player\n", "id \n", "88641 womacto01\n", "88642 mr.nobody\n", "88643 schilcu01\n", "88644 mr.nobody\n", "88645 myersmi01\n", "... ...\n", "89529 mr.nobody\n", "89530 ausmubr01\n", "89531 mr.nobody\n", "89532 mr.nobody\n", "89533 aloumo01\n", "\n", "[893 rows x 1 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.reindex(id_range, fill_value='mr.nobody', columns=['player'])" ] }, { "cell_type": "markdown", "id": "77ea9d6d", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Index can also be sorted" ] }, { "cell_type": "code", "execution_count": 18, "id": "428064bb", "metadata": { "scrolled": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
alomasa022007alomasa0220071NYNNL822131...0.00.00.003.00.00.00.00.00.0
aloumo012007aloumo0120071NYNNL873285111219...49.03.00.02730.05.02.00.03.013.0
ausmubr012007ausmubr0120071HOUNL117349388216...25.06.01.03774.03.06.04.01.011.0
benitar012007benitar0120072FLONL340000...0.00.00.000.00.00.00.00.00.0
benitar012007benitar0120071SFNNL190000...0.00.00.000.00.00.00.00.00.0
..................................................................
wickmbo012007wickmbo0120071ATLNL470000...0.00.00.000.00.00.00.00.00.0
williwo022007williwo0220071HOUNL3359360...2.00.00.0025.00.00.05.00.01.0
witasja012007witasja0120071TBAAL30000...0.00.00.000.00.00.00.00.00.0
womacto012006womacto0120062CHNNL19506141...2.01.01.044.00.00.03.00.00.0
zaungr012007zaungr0120071TORAL110331438024...52.00.00.05155.08.02.01.06.09.0
\n", "

100 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b ... \\\n", "alomasa022007 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... \n", "aloumo012007 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... \n", "ausmubr012007 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... \n", "benitar012007 benitar01 2007 2 FLO NL 34 0 0 0 0 ... \n", "benitar012007 benitar01 2007 1 SFN NL 19 0 0 0 0 ... \n", "... ... ... ... ... .. ... ... .. ... ... ... \n", "wickmbo012007 wickmbo01 2007 1 ATL NL 47 0 0 0 0 ... \n", "williwo022007 williwo02 2007 1 HOU NL 33 59 3 6 0 ... \n", "witasja012007 witasja01 2007 1 TBA AL 3 0 0 0 0 ... \n", "womacto012006 womacto01 2006 2 CHN NL 19 50 6 14 1 ... \n", "zaungr012007 zaungr01 2007 1 TOR AL 110 331 43 80 24 ... \n", "\n", " rbi sb cs bb so ibb hbp sh sf gidp \n", "alomasa022007 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n", "aloumo012007 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0 \n", "ausmubr012007 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0 \n", "benitar012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "benitar012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "... ... ... ... .. ... ... ... ... ... ... \n", "wickmbo012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "williwo022007 2.0 0.0 0.0 0 25.0 0.0 0.0 5.0 0.0 1.0 \n", "witasja012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "womacto012006 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n", "zaungr012007 52.0 0.0 0.0 51 55.0 8.0 2.0 1.0 6.0 9.0 \n", "\n", "[100 rows x 22 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_newind.sort_index()" ] }, { "cell_type": "code", "execution_count": 19, "id": "678e30cc", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playeryearstintteamlggabrhX2b...rbisbcsbbsoibbhbpshsfgidp
zaungr012007zaungr0120071TORAL110331438024...52.00.00.05155.08.02.01.06.09.0
womacto012006womacto0120062CHNNL19506141...2.01.01.044.00.00.03.00.00.0
witasja012007witasja0120071TBAAL30000...0.00.00.000.00.00.00.00.00.0
williwo022007williwo0220071HOUNL3359360...2.00.00.0025.00.00.05.00.01.0
wickmbo012007wickmbo0120071ATLNL470000...0.00.00.000.00.00.00.00.00.0
..................................................................
benitar012007benitar0120072FLONL340000...0.00.00.000.00.00.00.00.00.0
benitar012007benitar0120071SFNNL190000...0.00.00.000.00.00.00.00.00.0
ausmubr012007ausmubr0120071HOUNL117349388216...25.06.01.03774.03.06.04.01.011.0
aloumo012007aloumo0120071NYNNL873285111219...49.03.00.02730.05.02.00.03.013.0
alomasa022007alomasa0220071NYNNL822131...0.00.00.003.00.00.00.00.00.0
\n", "

100 rows × 22 columns

\n", "
" ], "text/plain": [ " player year stint team lg g ab r h X2b ... \\\n", "zaungr012007 zaungr01 2007 1 TOR AL 110 331 43 80 24 ... \n", "womacto012006 womacto01 2006 2 CHN NL 19 50 6 14 1 ... \n", "witasja012007 witasja01 2007 1 TBA AL 3 0 0 0 0 ... \n", "williwo022007 williwo02 2007 1 HOU NL 33 59 3 6 0 ... \n", "wickmbo012007 wickmbo01 2007 1 ATL NL 47 0 0 0 0 ... \n", "... ... ... ... ... .. ... ... .. ... ... ... \n", "benitar012007 benitar01 2007 2 FLO NL 34 0 0 0 0 ... \n", "benitar012007 benitar01 2007 1 SFN NL 19 0 0 0 0 ... \n", "ausmubr012007 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... \n", "aloumo012007 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... \n", "alomasa022007 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... \n", "\n", " rbi sb cs bb so ibb hbp sh sf gidp \n", "zaungr012007 52.0 0.0 0.0 51 55.0 8.0 2.0 1.0 6.0 9.0 \n", "womacto012006 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n", "witasja012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "williwo022007 2.0 0.0 0.0 0 25.0 0.0 0.0 5.0 0.0 1.0 \n", "wickmbo012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "... ... ... ... .. ... ... ... ... ... ... \n", "benitar012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "benitar012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "ausmubr012007 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0 \n", "aloumo012007 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0 \n", "alomasa022007 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[100 rows x 22 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_newind.sort_index(ascending=False)" ] }, { "cell_type": "code", "execution_count": 20, "id": "538484a1", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X2bX3babbbcsggidphhbphr...playerrrbisbsfshsostintteamyear
womacto012006105041.0190.0140.01...womacto0162.01.00.03.04.02CHN2006
schilcu01200600200.0310.010.00...schilcu0100.00.00.00.01.01BOS2006
myersmi01200600000.0620.000.00...myersmi0100.00.00.00.00.01NYA2006
helliri01200600300.0200.000.00...helliri0100.00.00.00.02.01MIL2006
johnsra05200600600.0330.010.00...johnsra0500.00.00.00.04.01NYA2006
..................................................................
benitar01200700000.0340.000.00...benitar0100.00.00.00.00.02FLO2007
benitar01200700000.0190.000.00...benitar0100.00.00.00.00.01SFN2007
ausmubr012007163349371.011711.0826.03...ausmubr013825.06.01.04.074.01HOU2007
aloumo012007191328270.08713.01122.013...aloumo015149.03.03.00.030.01NYN2007
alomasa022007102200.080.030.00...alomasa0210.00.00.00.03.01NYN2007
\n", "

100 rows × 22 columns

\n", "
" ], "text/plain": [ " X2b X3b ab bb cs g gidp h hbp hr ... \\\n", "womacto012006 1 0 50 4 1.0 19 0.0 14 0.0 1 ... \n", "schilcu012006 0 0 2 0 0.0 31 0.0 1 0.0 0 ... \n", "myersmi012006 0 0 0 0 0.0 62 0.0 0 0.0 0 ... \n", "helliri012006 0 0 3 0 0.0 20 0.0 0 0.0 0 ... \n", "johnsra052006 0 0 6 0 0.0 33 0.0 1 0.0 0 ... \n", "... ... ... ... .. ... ... ... ... ... .. ... \n", "benitar012007 0 0 0 0 0.0 34 0.0 0 0.0 0 ... \n", "benitar012007 0 0 0 0 0.0 19 0.0 0 0.0 0 ... \n", "ausmubr012007 16 3 349 37 1.0 117 11.0 82 6.0 3 ... \n", "aloumo012007 19 1 328 27 0.0 87 13.0 112 2.0 13 ... \n", "alomasa022007 1 0 22 0 0.0 8 0.0 3 0.0 0 ... \n", "\n", " player r rbi sb sf sh so stint team year \n", "womacto012006 womacto01 6 2.0 1.0 0.0 3.0 4.0 2 CHN 2006 \n", "schilcu012006 schilcu01 0 0.0 0.0 0.0 0.0 1.0 1 BOS 2006 \n", "myersmi012006 myersmi01 0 0.0 0.0 0.0 0.0 0.0 1 NYA 2006 \n", "helliri012006 helliri01 0 0.0 0.0 0.0 0.0 2.0 1 MIL 2006 \n", "johnsra052006 johnsra05 0 0.0 0.0 0.0 0.0 4.0 1 NYA 2006 \n", "... ... .. ... ... ... ... ... ... ... ... \n", "benitar012007 benitar01 0 0.0 0.0 0.0 0.0 0.0 2 FLO 2007 \n", "benitar012007 benitar01 0 0.0 0.0 0.0 0.0 0.0 1 SFN 2007 \n", "ausmubr012007 ausmubr01 38 25.0 6.0 1.0 4.0 74.0 1 HOU 2007 \n", "aloumo012007 aloumo01 51 49.0 3.0 3.0 0.0 30.0 1 NYN 2007 \n", "alomasa022007 alomasa02 1 0.0 0.0 0.0 0.0 3.0 1 NYN 2007 \n", "\n", "[100 rows x 22 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball_newind.sort_index(axis=1)" ] }, { "cell_type": "markdown", "id": "5e8efd07", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- **Ranking** does not re-arrange data, but instead returns an index that ranks each value relative to others in the Series." ] }, { "cell_type": "code", "execution_count": 21, "id": "ff845a68", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "id\n", "88641 62.5\n", "88643 29.0\n", "88645 29.0\n", "88649 29.0\n", "88650 29.0\n", " ... \n", "89525 29.0\n", "89526 29.0\n", "89530 71.5\n", "89533 88.0\n", "89534 29.0\n", "Name: hr, Length: 100, dtype: float64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.hr.rank()" ] }, { "cell_type": "markdown", "id": "b628dbea", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Alternatively, you can break ties via one of several methods, such as by the order in which they occur in the dataset:" ] }, { "cell_type": "code", "execution_count": 22, "id": "a46f31a9", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "id\n", "88641 58.0\n", "88643 1.0\n", "88645 2.0\n", "88649 3.0\n", "88650 4.0\n", " ... \n", "89525 55.0\n", "89526 56.0\n", "89530 72.0\n", "89533 88.0\n", "89534 57.0\n", "Name: hr, Length: 100, dtype: float64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.hr.rank(method='first')" ] }, { "cell_type": "markdown", "id": "852570c6", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Calling the `DataFrame`'s `rank` method results in the ranks of all columns:" ] }, { "cell_type": "code", "execution_count": 23, "id": "8f6c74b5", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " player year stint team lg g ab r h X2b ... \\\n", "id ... \n", "88641 2.0 96.5 7.0 82.0 31.5 70.0 47.5 40.5 39.0 50.5 ... \n", "88643 37.5 96.5 57.0 88.0 81.5 55.5 73.0 81.0 63.5 78.0 ... \n", "88645 47.5 96.5 57.0 40.5 81.5 36.0 91.0 81.0 84.5 78.0 ... \n", "88649 66.0 96.5 57.0 47.0 31.5 67.5 69.0 81.0 84.5 78.0 ... \n", "88650 61.5 96.5 57.0 40.5 81.5 51.0 64.5 81.0 63.5 78.0 ... \n", "... ... ... ... ... ... ... ... ... ... ... ... \n", "89525 96.5 46.5 7.0 64.0 31.5 47.0 91.0 81.0 84.5 78.0 ... \n", "89526 96.5 46.5 57.0 13.5 31.5 70.0 91.0 81.0 84.5 78.0 ... \n", "89530 98.0 46.5 57.0 61.5 31.5 17.5 19.0 24.0 23.0 21.5 ... \n", "89533 99.0 46.5 57.0 31.5 31.5 23.0 22.0 18.5 14.0 17.5 ... \n", "89534 100.0 46.5 57.0 31.5 31.5 77.0 57.0 58.0 58.0 50.5 ... \n", "\n", " rbi sb cs bb so ibb hbp sh sf gidp \n", "id \n", "88641 51.0 24.5 17.5 44.5 59.0 66.0 65.5 16.0 70.0 76.5 \n", "88643 78.5 63.5 62.5 79.0 73.0 66.0 65.5 67.5 70.0 76.5 \n", "88645 78.5 63.5 62.5 79.0 89.0 66.0 65.5 67.5 70.0 76.5 \n", "88649 78.5 63.5 62.5 79.0 67.0 66.0 65.5 67.5 70.0 76.5 \n", "88650 78.5 63.5 62.5 79.0 59.0 66.0 65.5 67.5 70.0 76.5 \n", "... ... ... ... ... ... ... ... ... ... ... \n", "89525 78.5 63.5 62.5 79.0 89.0 66.0 65.5 67.5 70.0 76.5 \n", "89526 78.5 63.5 62.5 79.0 89.0 66.0 65.5 67.5 70.0 76.5 \n", "89530 27.0 7.0 17.5 18.5 10.0 18.0 6.5 12.0 33.5 14.0 \n", "89533 18.0 14.0 62.5 22.0 27.0 11.0 21.0 67.5 15.5 10.5 \n", "89534 78.5 63.5 62.5 79.0 63.5 66.0 65.5 67.5 70.0 76.5 \n", "\n", "[100 rows x 22 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.rank(ascending=False)" ] }, { "cell_type": "code", "execution_count": 24, "id": "4d53e292", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
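{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- As an aside, `rank` also supports other tie-breaking rules via its `method` argument. A minimal sketch (not part of the original notebook), assuming the `baseball` DataFrame loaded above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Illustrative sketch: compare tie-breaking strategies on the home-run column.\n", "# 'min' gives tied values their lowest shared rank, 'dense' never skips ranks,\n", "# and ascending=False ranks the largest values first.\n", "pd.DataFrame({'average': baseball.hr.rank(),\n", "              'min': baseball.hr.rank(method='min'),\n", "              'dense': baseball.hr.rank(method='dense'),\n", "              'first_desc': baseball.hr.rank(ascending=False, method='first')}).head()" ] },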
" ], "text/plain": [ " r h hr\n", "id \n", "88641 40.5 39.0 38.5\n", "88643 81.0 63.5 72.0\n", "88645 81.0 84.5 72.0\n", "88649 81.0 84.5 72.0\n", "88650 81.0 63.5 72.0\n", "... ... ... ...\n", "89525 81.0 84.5 72.0\n", "89526 81.0 84.5 72.0\n", "89530 24.0 23.0 29.5\n", "89533 18.5 14.0 13.0\n", "89534 58.0 58.0 72.0\n", "\n", "[100 rows x 3 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball[['r','h','hr']].rank(ascending=False)" ] }, { "cell_type": "markdown", "id": "57a9a942", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Remove rows or columns via the `drop` method:" ] }, { "cell_type": "code", "execution_count": 25, "id": "369a5cd0", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(100, 22)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.shape" ] }, { "cell_type": "code", "execution_count": 26, "id": "7cf17ce7", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " player year stint team lg g ab r h X2b ... rbi \\\n", "id ... \n", "88641 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 \n", "88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 \n", "88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 \n", "88649 helliri01 2006 1 MIL NL 20 3 0 0 0 ... 0.0 \n", "88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... 0.0 \n", "... ... ... ... ... .. ... ... .. ... ... ... ... \n", "89521 bondsba01 2007 1 SFN NL 126 340 75 94 14 ... 66.0 \n", "89523 biggicr01 2007 1 HOU NL 141 517 68 130 31 ... 50.0 \n", "89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 \n", "89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 \n", "89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 \n", "\n", " sb cs bb so ibb hbp sh sf gidp \n", "id \n", "88641 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n", "88643 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 \n", "88645 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "88649 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n", "88650 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0 \n", "... ... ... ... ... ... ... ... ... ... \n", "89521 5.0 0.0 132 54.0 43.0 3.0 0.0 2.0 13.0 \n", "89523 4.0 3.0 23 112.0 0.0 3.0 7.0 5.0 5.0 \n", "89530 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0 \n", "89533 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0 \n", "89534 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[98 rows x 22 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.drop([89525, 89526]) # does not modify the original DataFrame" ] }, { "cell_type": "code", "execution_count": 27, "id": "859860b5", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(100, 22)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.shape" ] }, { "cell_type": "code", "execution_count": 28, "id": "27a23cdd", "metadata": { "scrolled": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
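{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- A minimal sketch (added for illustration) of the keyword form of `drop`: `index=` and `columns=` are equivalent to passing labels with `axis=0` or `axis=1`, and the result must be reassigned (or `inplace=True` used) for the change to persist:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Illustrative sketch: drop rows by index label and columns by name, then keep the result.\n", "trimmed = baseball.drop(index=[89525, 89526]).drop(columns=['ibb', 'hbp'])\n", "trimmed.shape" ] },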
" ], "text/plain": [ " player year stint team lg g ab r h X2b X3b hr rbi \\\n", "id \n", "88641 womacto01 2006 2 CHN NL 19 50 6 14 1 0 1 2.0 \n", "88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 0 0 0.0 \n", "88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 0 0 0.0 \n", "88649 helliri01 2006 1 MIL NL 20 3 0 0 0 0 0 0.0 \n", "88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 0 0 0.0 \n", "... ... ... ... ... .. ... ... .. ... ... ... .. ... \n", "89525 benitar01 2007 2 FLO NL 34 0 0 0 0 0 0 0.0 \n", "89526 benitar01 2007 1 SFN NL 19 0 0 0 0 0 0 0.0 \n", "89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 3 3 25.0 \n", "89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 1 13 49.0 \n", "89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 0 0 0.0 \n", "\n", " sb cs bb so sh sf gidp \n", "id \n", "88641 1.0 1.0 4 4.0 3.0 0.0 0.0 \n", "88643 0.0 0.0 0 1.0 0.0 0.0 0.0 \n", "88645 0.0 0.0 0 0.0 0.0 0.0 0.0 \n", "88649 0.0 0.0 0 2.0 0.0 0.0 0.0 \n", "88650 0.0 0.0 0 4.0 0.0 0.0 0.0 \n", "... ... ... .. ... ... ... ... \n", "89525 0.0 0.0 0 0.0 0.0 0.0 0.0 \n", "89526 0.0 0.0 0 0.0 0.0 0.0 0.0 \n", "89530 6.0 1.0 37 74.0 4.0 1.0 11.0 \n", "89533 3.0 0.0 27 30.0 0.0 3.0 13.0 \n", "89534 0.0 0.0 0 3.0 0.0 0.0 0.0 \n", "\n", "[100 rows x 20 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baseball.drop(['ibb','hbp'], axis=1) # Pandas axis=0 indicating row, axis=1 indicating column" ] }, { "cell_type": "markdown", "id": "8dd9963f", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Join two Pandas DataFrames\n", "\n", "`DataFrame` and `Series` objects allow for several operations to take place either on a single object, or between two or more objects.\n", "\n", "For example, we can perform arithmetic on the elements of two objects, such as combining baseball statistics across years:" ] }, { "cell_type": "code", "execution_count": 29, "id": "4cc5e9dc", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "hr2006 = baseball[baseball.year==2006].xs('hr', axis=1)\n", "hr2006.index = baseball.player[baseball.year==2006]\n", "\n", "hr2007 = baseball[baseball.year==2007].xs('hr', axis=1)\n", "hr2007.index = baseball.player[baseball.year==2007]" ] }, { "cell_type": "code", "execution_count": 30, "id": "63d4c793", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "player\n", "womacto01 1\n", "schilcu01 0\n", "myersmi01 0\n", "helliri01 0\n", "johnsra05 0\n", "finlest01 6\n", "gonzalu01 15\n", "seleaa01 0\n", "dtype: int64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hr2006 = pd.Series(baseball.hr[baseball.year==2006].values, index=baseball.player[baseball.year==2006])\n", "hr2006" ] }, { "cell_type": "code", "execution_count": 31, "id": "c99cc635", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "player\n", "francju01 0\n", "francju01 1\n", "zaungr01 10\n", "witasja01 0\n", "williwo02 1\n", " ..\n", "benitar01 0\n", "benitar01 0\n", "ausmubr01 3\n", "aloumo01 13\n", "alomasa02 0\n", "Length: 92, dtype: int64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hr2007 = pd.Series(baseball.hr[baseball.year==2007].values, index=baseball.player[baseball.year==2007])\n", "hr2007" ] }, { "cell_type": "code", "execution_count": 32, "id": "8dafdacc", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "player\n", "alomasa02 
NaN\n", "aloumo01 NaN\n", "ausmubr01 NaN\n", "benitar01 NaN\n", "benitar01 NaN\n", " ..\n", "wickmbo01 NaN\n", "williwo02 NaN\n", "witasja01 NaN\n", "womacto01 NaN\n", "zaungr01 NaN\n", "Length: 94, dtype: float64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hr_total = hr2006 + hr2007 \n", "hr_total" ] }, { "cell_type": "markdown", "id": "f08cb529", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Pandas' data alignment places `NaN` values for labels that do not overlap in the two Series. In fact, there are only 6 players that occur in both years." ] }, { "cell_type": "code", "execution_count": 33, "id": "21eed4d9", "metadata": { "scrolled": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "player\n", "finlest01 7.0\n", "gonzalu01 30.0\n", "johnsra05 0.0\n", "myersmi01 0.0\n", "schilcu01 0.0\n", "seleaa01 0.0\n", "dtype: float64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hr_total[hr_total.notnull()]" ] }, { "cell_type": "markdown", "id": "af9a9d27", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Merging and joining DataFrame objects\n", "\n", "In this section, we will manipulate data collected from ocean-going vessels on the eastern seaboard. Vessel operations are monitored using the Automatic Identification System (AIS), a safety at sea navigation technology which vessels are required to maintain and that uses transponders to transmit very high frequency (VHF) radio signals containing static information including ship name, call sign, and country of origin, as well as dynamic information unique to a particular voyage such as vessel location, heading, and speed. \n", "\n", "The International Maritime Organization’s (IMO) International Convention for the Safety of Life at Sea requires functioning AIS capabilities on all vessels 300 gross tons or greater and the US Coast Guard requires AIS on nearly all vessels sailing in U.S. waters. The Coast Guard has established a national network of AIS receivers that provides coverage of nearly all U.S. waters. AIS signals are transmitted several times each minute and the network is capable of handling thousands of reports per minute and updates as often as every two seconds. Therefore, a typical voyage in our study might include the transmission of hundreds or thousands of AIS encoded signals. This provides a rich source of spatial data that includes both spatial and temporal information.\n", "\n", "For our purposes, we will use summarized data that describes the transit of a given vessel through a particular administrative area. The data includes the start and end time of the transit segment, as well as information about the speed of the vessel, how far it travelled, etc." ] }, { "cell_type": "code", "execution_count": 34, "id": "4ff49465", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
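{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Returning to the `hr2006 + hr2007` sum above: a minimal sketch (added for illustration) of `Series.add` with `fill_value`, which treats a player missing from one year as having zero home runs in that year rather than producing `NaN`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Illustrative sketch: arithmetic with fill_value avoids NaN for non-overlapping labels.\n", "hr2006.add(hr2007, fill_value=0).head()" ] },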
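{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- As an aside (a sketch, not used in the rest of this tutorial): `st_time` and `end_time` are read as plain strings above; `read_csv` can convert them to proper timestamps at import time with the `parse_dates` argument:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Illustrative sketch: parse the timestamp columns while reading the file.\n", "segments_dt = pd.read_csv('data/AIS/transit_segments.csv',\n", "                          parse_dates=['st_time', 'end_time'])\n", "segments_dt[['st_time', 'end_time']].dtypes" ] },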
" ], "text/plain": [ " mmsi name transit segment seg_length avg_sog \\\n", "0 1 Us Govt Ves 1 1 5.1 13.2 \n", "1 1 Dredge Capt Frank 1 1 13.5 18.6 \n", "2 1 Us Gov Vessel 1 1 4.3 16.2 \n", "3 1 Us Gov Vessel 2 1 9.2 15.4 \n", "4 1 Dredge Capt Frank 2 1 9.2 15.4 \n", "... ... ... ... ... ... ... \n", "262521 999999999 Triple Attraction 3 1 5.3 20.0 \n", "262522 999999999 Triple Attraction 4 1 18.7 19.2 \n", "262523 999999999 Triple Attraction 6 1 17.4 17.0 \n", "262524 999999999 Triple Attraction 7 1 31.5 14.2 \n", "262525 999999999 Triple Attraction 8 1 19.8 18.6 \n", "\n", " min_sog max_sog pdgt10 st_time end_time \n", "0 9.2 14.5 96.5 2/10/09 16:03 2/10/09 16:27 \n", "1 10.4 20.6 100.0 4/6/09 14:31 4/6/09 15:20 \n", "2 10.3 20.5 100.0 4/6/09 14:36 4/6/09 14:55 \n", "3 14.5 16.1 100.0 4/10/09 17:58 4/10/09 18:34 \n", "4 14.6 16.2 100.0 4/10/09 17:59 4/10/09 18:35 \n", "... ... ... ... ... ... \n", "262521 19.6 20.4 100.0 6/15/10 12:49 6/15/10 13:05 \n", "262522 18.4 19.9 100.0 6/15/10 21:32 6/15/10 22:29 \n", "262523 14.7 18.4 100.0 6/17/10 19:16 6/17/10 20:17 \n", "262524 13.4 15.1 100.0 6/18/10 2:52 6/18/10 5:03 \n", "262525 16.1 19.5 100.0 6/18/10 10:19 6/18/10 11:22 \n", "\n", "[262526 rows x 11 columns]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "segments = pd.read_csv(\"data/AIS/transit_segments.csv\")\n", "segments" ] }, { "cell_type": "markdown", "id": "90181955", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- In addition to the behavior of each vessel, we may want a little more information regarding the vessels themselves. In the `data/AIS` folder there is a second table that contains information about each of the ships that traveled the segments in the `segments` table." ] }, { "cell_type": "code", "execution_count": 35, "id": "827cc4e1", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " num_names names sov \\\n", "mmsi \n", "1 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n", "9 3 000000009/Raven/Shearwater N \n", "21 1 Us Gov Vessel Y \n", "74 2 Mcfaul/Sarah Bell N \n", "103 3 Ron G/Us Navy Warship 103/Us Warship 103 Y \n", "... ... ... .. \n", "919191919 1 Oi N \n", "967191190 1 Pathfinder N \n", "975318642 1 Island Express N \n", "987654321 2 Island Lookout/Island Tide N \n", "999999999 1 Triple Attraction N \n", "\n", " flag flag_type num_loas loa \\\n", "mmsi \n", "1 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n", "9 Unknown Unknown 2 50.0/62.0 \n", "21 Unknown Unknown 1 208.0 \n", "74 Unknown Unknown 1 155.0 \n", "103 Unknown Unknown 2 26.0/155.0 \n", "... ... ... ... ... \n", "919191919 Unknown Unknown 1 20.0 \n", "967191190 Unknown Unknown 1 31.0 \n", "975318642 Unknown Unknown 1 20.0 \n", "987654321 Unknown Unknown 2 22.0/23.0 \n", "999999999 Unknown Unknown 1 30.0 \n", "\n", " max_loa num_types type \n", "mmsi \n", "1 156.0 4 Dredging/MilOps/Reserved/Towing \n", "9 62.0 2 Pleasure/Tug \n", "21 208.0 1 Unknown \n", "74 155.0 1 Unknown \n", "103 155.0 2 Tanker/Unknown \n", "... ... ... ... \n", "919191919 20.0 1 Pleasure \n", "967191190 31.0 2 BigTow/Towing \n", "975318642 20.0 1 Towing \n", "987654321 23.0 2 Fishing/Towing \n", "999999999 30.0 1 Pleasure \n", "\n", "[10771 rows x 10 columns]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vessels = pd.read_csv(\"data/AIS/vessel_information.csv\", index_col='mmsi')\n", "vessels" ] }, { "cell_type": "code", "execution_count": 36, "id": "fd4a43ac", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Cargo 5622\n", "Tanker 2440\n", "Pleasure 601\n", "Tug 221\n", "Sailing 205\n", " ... \n", "AntiPol/Other 1\n", "Fishing/Law 1\n", "Cargo/Other/Towing 1\n", "Cargo/Fishing 1\n", "Fishing/Reserved/Towing 1\n", "Name: type, Length: 206, dtype: int64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vessels.type.value_counts()" ] }, { "cell_type": "markdown", "id": "55e31328", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- The challenge, however, is that several ships have travelled multiple segments, so there is not a one-to-one relationship between the rows of the two tables. The table of vessel information has a *one-to-many* relationship with the segments.\n", "\n", "- In Pandas, we can combine tables according to the value of one or more *keys* that are used to identify rows, much like an index. Using a trivial example:" ] }, { "cell_type": "code", "execution_count": 40, "id": "fdb16698", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import numpy as np\n", "df1 = pd.DataFrame(dict(id=range(4), age=np.random.randint(18, 31, size=4)))\n", "df2 = pd.DataFrame(dict(id=range(6), score=np.random.random(size=6)))" ] }, { "cell_type": "code", "execution_count": 41, "id": "22977cbb", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id age score\n", "0 0 24 0.666117\n", "1 1 23 0.249913\n", "2 2 26 0.261396\n", "3 3 30 0.651306" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2)" ] }, { "cell_type": "markdown", "id": "1d27e510", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Notice that without any information about which column to use as a key, Pandas did the right thing and used the `id` column in both tables. Unless specified otherwise, `merge` will use any common column names as keys for merging the tables.\n", "\n", "- By default, `merge` performs an **inner join** on the tables, meaning that the merged table represents an intersection of the two tables." ] }, { "cell_type": "code", "execution_count": 42, "id": "dab183d4", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id age score\n", "0 0 24.0 0.666117\n", "1 1 23.0 0.249913\n", "2 2 26.0 0.261396\n", "3 3 30.0 0.651306\n", "4 4 NaN 0.468143\n", "5 5 NaN 0.461524" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how='outer')" ] }, { "cell_type": "markdown", "id": "2d8409a8", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- The **outer join** above yields the union of the two tables, so all rows are represented, with missing values inserted as appropriate. One can also perform **left** and **right** joins to include all rows of the left or right table (*i.e.* the first or second argument to `merge`, respectively), but not necessarily the rows of the other.\n", "\n", "- Looking at the two datasets that we wish to merge:" ] }, { "cell_type": "code", "execution_count": 43, "id": "32ceffa7", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
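{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- A minimal sketch (added for illustration) of the left and right variants on the toy tables above: `how='left'` keeps every row of `df1`, while `how='right'` keeps every row of `df2`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Illustrative sketch: compare join types on the toy df1/df2 tables.\n", "left_joined = pd.merge(df1, df2, how='left')    # all ids from df1\n", "right_joined = pd.merge(df1, df2, how='right')  # all ids from df2\n", "left_joined.shape, right_joined.shape" ] },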
" ], "text/plain": [ " mmsi name transit segment seg_length avg_sog min_sog max_sog \\\n", "0 1 Us Govt Ves 1 1 5.1 13.2 9.2 14.5 \n", "\n", " pdgt10 st_time end_time \n", "0 96.5 2/10/09 16:03 2/10/09 16:27 " ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "segments.head(1)" ] }, { "cell_type": "code", "execution_count": 44, "id": "17fa25c2", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " num_names names sov \\\n", "mmsi \n", "1 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n", "\n", " flag flag_type num_loas loa \\\n", "mmsi \n", "1 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n", "\n", " max_loa num_types type \n", "mmsi \n", "1 156.0 4 Dredging/MilOps/Reserved/Towing " ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vessels.head(1)" ] }, { "cell_type": "markdown", "id": "0766d18f", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- We see that there is a `mmsi` value (a vessel identifier) in each table, but it is used as an index for the `vessels` table. In this case, we have to specify to join on the index for this table, and on the `mmsi` column for the other." ] }, { "cell_type": "code", "execution_count": 45, "id": "1c9d7b17", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "segments_merged = pd.merge(vessels, segments, left_index=True, right_on='mmsi')" ] }, { "cell_type": "code", "execution_count": 46, "id": "6a8100a4", "metadata": { "scrolled": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
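{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Because the vessel table is expected to have a *one-to-many* relationship with the segments, a sketch (optional, added for illustration) of asking `merge` to verify that expectation with the `validate` argument:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Illustrative sketch: validate='m:1' raises MergeError if any mmsi matches\n", "# more than one row of the vessels table (the segments side may repeat).\n", "pd.merge(segments, vessels, left_on='mmsi', right_index=True,\n", "         validate='m:1').shape" ] },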
" ], "text/plain": [ " num_names names sov \\\n", "0 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n", "1 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n", "2 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n", "3 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n", "4 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n", "... ... ... .. \n", "262521 1 Triple Attraction N \n", "262522 1 Triple Attraction N \n", "262523 1 Triple Attraction N \n", "262524 1 Triple Attraction N \n", "262525 1 Triple Attraction N \n", "\n", " flag flag_type num_loas loa \\\n", "0 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n", "1 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n", "2 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n", "3 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n", "4 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n", "... ... ... ... ... \n", "262521 Unknown Unknown 1 30.0 \n", "262522 Unknown Unknown 1 30.0 \n", "262523 Unknown Unknown 1 30.0 \n", "262524 Unknown Unknown 1 30.0 \n", "262525 Unknown Unknown 1 30.0 \n", "\n", " max_loa num_types type ... \\\n", "0 156.0 4 Dredging/MilOps/Reserved/Towing ... \n", "1 156.0 4 Dredging/MilOps/Reserved/Towing ... \n", "2 156.0 4 Dredging/MilOps/Reserved/Towing ... \n", "3 156.0 4 Dredging/MilOps/Reserved/Towing ... \n", "4 156.0 4 Dredging/MilOps/Reserved/Towing ... \n", "... ... ... ... ... \n", "262521 30.0 1 Pleasure ... \n", "262522 30.0 1 Pleasure ... \n", "262523 30.0 1 Pleasure ... \n", "262524 30.0 1 Pleasure ... \n", "262525 30.0 1 Pleasure ... \n", "\n", " name transit segment seg_length avg_sog min_sog \\\n", "0 Us Govt Ves 1 1 5.1 13.2 9.2 \n", "1 Dredge Capt Frank 1 1 13.5 18.6 10.4 \n", "2 Us Gov Vessel 1 1 4.3 16.2 10.3 \n", "3 Us Gov Vessel 2 1 9.2 15.4 14.5 \n", "4 Dredge Capt Frank 2 1 9.2 15.4 14.6 \n", "... ... ... ... ... ... ... \n", "262521 Triple Attraction 3 1 5.3 20.0 19.6 \n", "262522 Triple Attraction 4 1 18.7 19.2 18.4 \n", "262523 Triple Attraction 6 1 17.4 17.0 14.7 \n", "262524 Triple Attraction 7 1 31.5 14.2 13.4 \n", "262525 Triple Attraction 8 1 19.8 18.6 16.1 \n", "\n", " max_sog pdgt10 st_time end_time \n", "0 14.5 96.5 2/10/09 16:03 2/10/09 16:27 \n", "1 20.6 100.0 4/6/09 14:31 4/6/09 15:20 \n", "2 20.5 100.0 4/6/09 14:36 4/6/09 14:55 \n", "3 16.1 100.0 4/10/09 17:58 4/10/09 18:34 \n", "4 16.2 100.0 4/10/09 17:59 4/10/09 18:35 \n", "... ... ... ... ... \n", "262521 20.4 100.0 6/15/10 12:49 6/15/10 13:05 \n", "262522 19.9 100.0 6/15/10 21:32 6/15/10 22:29 \n", "262523 18.4 100.0 6/17/10 19:16 6/17/10 20:17 \n", "262524 15.1 100.0 6/18/10 2:52 6/18/10 5:03 \n", "262525 19.5 100.0 6/18/10 10:19 6/18/10 11:22 \n", "\n", "[262353 rows x 21 columns]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "segments_merged" ] }, { "cell_type": "markdown", "id": "5e9869a2", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- In this case, the default inner join is suitable; we are not interested in observations from either table that do not have corresponding entries in the other. 
\n", "\n", "- Notice that the `mmsi` field that was the index of the `vessels` table is no longer an index on the merged table.\n", "\n", "- Here, we used the `merge` function to perform the merge; we could also have used the `merge` method for either of the tables:" ] }, { "cell_type": "code", "execution_count": 47, "id": "34588c9a", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
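{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- A related option (a sketch, not used elsewhere in this notebook) is the `join` method, which joins on the other table's index and takes the joining column of the calling table via `on=`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Illustrative sketch: join the segments to the vessels table on its mmsi index.\n", "segments.join(vessels, on='mmsi', how='inner', rsuffix='_vessel').shape" ] },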
" ], "text/plain": [ " num_names names sov \\\n", "0 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n", "1 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n", "2 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n", "3 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n", "4 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n", "... ... ... .. \n", "262521 1 Triple Attraction N \n", "262522 1 Triple Attraction N \n", "262523 1 Triple Attraction N \n", "262524 1 Triple Attraction N \n", "262525 1 Triple Attraction N \n", "\n", " flag flag_type num_loas loa \\\n", "0 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n", "1 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n", "2 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n", "3 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n", "4 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n", "... ... ... ... ... \n", "262521 Unknown Unknown 1 30.0 \n", "262522 Unknown Unknown 1 30.0 \n", "262523 Unknown Unknown 1 30.0 \n", "262524 Unknown Unknown 1 30.0 \n", "262525 Unknown Unknown 1 30.0 \n", "\n", " max_loa num_types type ... \\\n", "0 156.0 4 Dredging/MilOps/Reserved/Towing ... \n", "1 156.0 4 Dredging/MilOps/Reserved/Towing ... \n", "2 156.0 4 Dredging/MilOps/Reserved/Towing ... \n", "3 156.0 4 Dredging/MilOps/Reserved/Towing ... \n", "4 156.0 4 Dredging/MilOps/Reserved/Towing ... \n", "... ... ... ... ... \n", "262521 30.0 1 Pleasure ... \n", "262522 30.0 1 Pleasure ... \n", "262523 30.0 1 Pleasure ... \n", "262524 30.0 1 Pleasure ... \n", "262525 30.0 1 Pleasure ... \n", "\n", " name transit segment seg_length avg_sog min_sog \\\n", "0 Us Govt Ves 1 1 5.1 13.2 9.2 \n", "1 Dredge Capt Frank 1 1 13.5 18.6 10.4 \n", "2 Us Gov Vessel 1 1 4.3 16.2 10.3 \n", "3 Us Gov Vessel 2 1 9.2 15.4 14.5 \n", "4 Dredge Capt Frank 2 1 9.2 15.4 14.6 \n", "... ... ... ... ... ... ... \n", "262521 Triple Attraction 3 1 5.3 20.0 19.6 \n", "262522 Triple Attraction 4 1 18.7 19.2 18.4 \n", "262523 Triple Attraction 6 1 17.4 17.0 14.7 \n", "262524 Triple Attraction 7 1 31.5 14.2 13.4 \n", "262525 Triple Attraction 8 1 19.8 18.6 16.1 \n", "\n", " max_sog pdgt10 st_time end_time \n", "0 14.5 96.5 2/10/09 16:03 2/10/09 16:27 \n", "1 20.6 100.0 4/6/09 14:31 4/6/09 15:20 \n", "2 20.5 100.0 4/6/09 14:36 4/6/09 14:55 \n", "3 16.1 100.0 4/10/09 17:58 4/10/09 18:34 \n", "4 16.2 100.0 4/10/09 17:59 4/10/09 18:35 \n", "... ... ... ... ... \n", "262521 20.4 100.0 6/15/10 12:49 6/15/10 13:05 \n", "262522 19.9 100.0 6/15/10 21:32 6/15/10 22:29 \n", "262523 18.4 100.0 6/17/10 19:16 6/17/10 20:17 \n", "262524 15.1 100.0 6/18/10 2:52 6/18/10 5:03 \n", "262525 19.5 100.0 6/18/10 10:19 6/18/10 11:22 \n", "\n", "[262353 rows x 21 columns]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vessels.merge(segments, left_index=True, right_on='mmsi')" ] }, { "cell_type": "markdown", "id": "b29c609f", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Occasionally, there will be fields with the same in both tables that we do not wish to use to join the tables; they may contain different information, despite having the same name. In this case, Pandas will by default append suffixes `_x` and `_y` to the columns to uniquely identify them." ] }, { "cell_type": "code", "execution_count": 48, "id": "98fa379d", "metadata": { "scrolled": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
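{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- A minimal sketch (added for illustration) of the `suffixes` override discussed below, assuming the conflicting `type` column created in the adjacent cell:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Illustrative sketch: replace the default _x/_y suffixes with descriptive ones.\n", "merged = pd.merge(vessels, segments, left_index=True, right_on='mmsi',\n", "                  suffixes=('_vessel', '_seg'))\n", "[c for c in merged.columns if c.startswith('type')]" ] },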
" ], "text/plain": [ " num_names names sov \\\n", "0 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n", "1 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n", "2 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n", "3 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n", "4 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n", "... ... ... .. \n", "262521 1 Triple Attraction N \n", "262522 1 Triple Attraction N \n", "262523 1 Triple Attraction N \n", "262524 1 Triple Attraction N \n", "262525 1 Triple Attraction N \n", "\n", " flag flag_type num_loas loa \\\n", "0 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n", "1 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n", "2 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n", "3 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n", "4 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n", "... ... ... ... ... \n", "262521 Unknown Unknown 1 30.0 \n", "262522 Unknown Unknown 1 30.0 \n", "262523 Unknown Unknown 1 30.0 \n", "262524 Unknown Unknown 1 30.0 \n", "262525 Unknown Unknown 1 30.0 \n", "\n", " max_loa num_types type_x ... transit \\\n", "0 156.0 4 Dredging/MilOps/Reserved/Towing ... 1 \n", "1 156.0 4 Dredging/MilOps/Reserved/Towing ... 1 \n", "2 156.0 4 Dredging/MilOps/Reserved/Towing ... 1 \n", "3 156.0 4 Dredging/MilOps/Reserved/Towing ... 2 \n", "4 156.0 4 Dredging/MilOps/Reserved/Towing ... 2 \n", "... ... ... ... ... ... \n", "262521 30.0 1 Pleasure ... 3 \n", "262522 30.0 1 Pleasure ... 4 \n", "262523 30.0 1 Pleasure ... 6 \n", "262524 30.0 1 Pleasure ... 7 \n", "262525 30.0 1 Pleasure ... 8 \n", "\n", " segment seg_length avg_sog min_sog max_sog pdgt10 st_time \\\n", "0 1 5.1 13.2 9.2 14.5 96.5 2/10/09 16:03 \n", "1 1 13.5 18.6 10.4 20.6 100.0 4/6/09 14:31 \n", "2 1 4.3 16.2 10.3 20.5 100.0 4/6/09 14:36 \n", "3 1 9.2 15.4 14.5 16.1 100.0 4/10/09 17:58 \n", "4 1 9.2 15.4 14.6 16.2 100.0 4/10/09 17:59 \n", "... ... ... ... ... ... ... ... \n", "262521 1 5.3 20.0 19.6 20.4 100.0 6/15/10 12:49 \n", "262522 1 18.7 19.2 18.4 19.9 100.0 6/15/10 21:32 \n", "262523 1 17.4 17.0 14.7 18.4 100.0 6/17/10 19:16 \n", "262524 1 31.5 14.2 13.4 15.1 100.0 6/18/10 2:52 \n", "262525 1 19.8 18.6 16.1 19.5 100.0 6/18/10 10:19 \n", "\n", " end_time type_y \n", "0 2/10/09 16:27 foo \n", "1 4/6/09 15:20 foo \n", "2 4/6/09 14:55 foo \n", "3 4/10/09 18:34 foo \n", "4 4/10/09 18:35 foo \n", "... ... ... \n", "262521 6/15/10 13:05 foo \n", "262522 6/15/10 22:29 foo \n", "262523 6/17/10 20:17 foo \n", "262524 6/18/10 5:03 foo \n", "262525 6/18/10 11:22 foo \n", "\n", "[262353 rows x 22 columns]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "segments['type'] = 'foo'\n", "pd.merge(vessels, segments, left_index=True, right_on='mmsi')" ] }, { "cell_type": "markdown", "id": "b034ddd3", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- This behavior can be overridden by specifying a `suffixes` argument, containing a list of the suffixes to be used for the columns of the left and right columns, respectively." ] }, { "cell_type": "markdown", "id": "0efb4d13", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Concatenation\n", "\n", "A common data manipulation is appending rows or columns to a dataset that already conform to the dimensions of the exsiting rows or colums, respectively. 
In NumPy, this is done either with `concatenate` or the convenience functions `c_` and `r_`:" ] }, { "cell_type": "code", "execution_count": 49, "id": "a1460278", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "array([0.42336207, 0.74807952, 0.61839076, 0.54794432, 0.06227732,\n", " 0.71618874, 0.31763132, 0.26021656, 0.22395665, 0.08499033])" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.concatenate([np.random.random(5), np.random.random(5)])" ] }, { "cell_type": "code", "execution_count": 50, "id": "6fb9b19a", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "array([0.48601962, 0.11484701, 0.93892836, 0.16884999, 0.71700162,\n", " 0.92519913, 0.26827622, 0.41866975, 0.59348726, 0.06054373])" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.r_[np.random.random(5), np.random.random(5)]" ] }, { "cell_type": "code", "execution_count": 51, "id": "52cc9244", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "array([[0.48861909, 0.23961022],\n", " [0.68685816, 0.7662155 ],\n", " [0.76304197, 0.63356894],\n", " [0.45533848, 0.36265383],\n", " [0.85205653, 0.84605096]])" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.c_[np.random.random(5), np.random.random(5)]" ] }, { "cell_type": "markdown", "id": "ef185e6f", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- This operation is also called *binding* or *stacking*. With Pandas' indexed data structures, there are additional considerations as the overlap in index values between two data structures affects how they are concatenate.\n", "\n", "- Lets import two microbiome datasets, each consisting of counts of microorganiams from a particular patient. We will use the first column of each dataset as the index." ] }, { "cell_type": "code", "execution_count": 52, "id": "85e0f589", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Looking in indexes: https://mirrors.163.com/pypi/simple/\r\n", "Requirement already satisfied: xlrd in /home/fli/.local/lib/python3.9/site-packages (2.0.1)\r\n", "Requirement already satisfied: openpyxl in /home/fli/.local/lib/python3.9/site-packages (3.0.9)\r\n", "Requirement already satisfied: et-xmlfile in /home/fli/.local/lib/python3.9/site-packages (from openpyxl) (1.0.1)\r\n" ] } ], "source": [ "# Pandas requires external modules to read Excel files\n", "! pip3 install xlrd openpyxl --user" ] }, { "cell_type": "code", "execution_count": 53, "id": "bc15a3d7", "metadata": { "scrolled": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "((272, 1), (288, 1))" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mb1 = pd.read_excel('data/microbiome/MID1.xls', 'Sheet 1', index_col=0, header=None)\n", "mb2 = pd.read_excel('data/microbiome/MID2.xls', 'Sheet 1', index_col=0, header=None)\n", "mb1.shape, mb2.shape" ] }, { "cell_type": "code", "execution_count": 54, "id": "f348aed9", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
1
0
Archaea \"Crenarchaeota\" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera7
Archaea \"Crenarchaeota\" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus2
Archaea \"Crenarchaeota\" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus3
Archaea \"Crenarchaeota\" Thermoprotei Thermoproteales Thermofilaceae Thermofilum3
Archaea \"Euryarchaeota\" \"Methanomicrobia\" Methanocellales Methanocellaceae Methanocella7
......
Bacteria \"Thermotogae\" Thermotogae Thermotogales Thermotogaceae Kosmotoga9
Bacteria \"Verrucomicrobia\" Opitutae Opitutales Opitutaceae Alterococcus1
Bacteria Cyanobacteria Cyanobacteria Chloroplast Bangiophyceae2
Bacteria Cyanobacteria Cyanobacteria Chloroplast Chlorarachniophyceae85
Bacteria Cyanobacteria Cyanobacteria Chloroplast Streptophyta1388
\n", "

272 rows × 1 columns

\n", "
" ], "text/plain": [ " 1\n", "0 \n", "Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 7\n", "Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 2\n", "Archaea \"Crenarchaeota\" Thermoprotei Sulfolobal... 3\n", "Archaea \"Crenarchaeota\" Thermoprotei Thermoprot... 3\n", "Archaea \"Euryarchaeota\" \"Methanomicrobia\" Metha... 7\n", "... ...\n", "Bacteria \"Thermotogae\" Thermotogae Thermotogale... 9\n", "Bacteria \"Verrucomicrobia\" Opitutae Opitutales ... 1\n", "Bacteria Cyanobacteria Cyanobacteria Chloropla... 2\n", "Bacteria Cyanobacteria Cyanobacteria Chloropla... 85\n", "Bacteria Cyanobacteria Cyanobacteria Chloropla... 1388\n", "\n", "[272 rows x 1 columns]" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mb1" ] }, { "cell_type": "markdown", "id": "408fb81f", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Let's give the index and columns meaningful labels:" ] }, { "cell_type": "code", "execution_count": 55, "id": "d0125a78", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "mb1.columns = mb2.columns = ['Count']" ] }, { "cell_type": "code", "execution_count": 56, "id": "47a30002", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "mb1.index.name = mb2.index.name = 'Taxon'" ] }, { "cell_type": "code", "execution_count": 57, "id": "7cf2b2ab", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Count
Taxon
Archaea \"Crenarchaeota\" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera7
Archaea \"Crenarchaeota\" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus2
Archaea \"Crenarchaeota\" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus3
Archaea \"Crenarchaeota\" Thermoprotei Thermoproteales Thermofilaceae Thermofilum3
Archaea \"Euryarchaeota\" \"Methanomicrobia\" Methanocellales Methanocellaceae Methanocella7
......
Bacteria \"Thermotogae\" Thermotogae Thermotogales Thermotogaceae Kosmotoga9
Bacteria \"Verrucomicrobia\" Opitutae Opitutales Opitutaceae Alterococcus1
Bacteria Cyanobacteria Cyanobacteria Chloroplast Bangiophyceae2
Bacteria Cyanobacteria Cyanobacteria Chloroplast Chlorarachniophyceae85
Bacteria Cyanobacteria Cyanobacteria Chloroplast Streptophyta1388
\n", "

272 rows × 1 columns

\n", "
" ], "text/plain": [ " Count\n", "Taxon \n", "Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 7\n", "Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 2\n", "Archaea \"Crenarchaeota\" Thermoprotei Sulfolobal... 3\n", "Archaea \"Crenarchaeota\" Thermoprotei Thermoprot... 3\n", "Archaea \"Euryarchaeota\" \"Methanomicrobia\" Metha... 7\n", "... ...\n", "Bacteria \"Thermotogae\" Thermotogae Thermotogale... 9\n", "Bacteria \"Verrucomicrobia\" Opitutae Opitutales ... 1\n", "Bacteria Cyanobacteria Cyanobacteria Chloropla... 2\n", "Bacteria Cyanobacteria Cyanobacteria Chloropla... 85\n", "Bacteria Cyanobacteria Cyanobacteria Chloropla... 1388\n", "\n", "[272 rows x 1 columns]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mb1" ] }, { "cell_type": "code", "execution_count": 58, "id": "483dcac0", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Index(['Archaea \"Crenarchaeota\" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera',\n", " 'Archaea \"Crenarchaeota\" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus',\n", " 'Archaea \"Crenarchaeota\" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus'],\n", " dtype='object', name='Taxon')" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mb1.index[:3]" ] }, { "cell_type": "code", "execution_count": 59, "id": "6c35dc94", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mb1.index.is_unique" ] }, { "cell_type": "markdown", "id": "8d2a5674", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- If we concatenate along `axis=0` (the default), we will obtain another data frame with the the rows concatenated:" ] }, { "cell_type": "code", "execution_count": 60, "id": "d03a160f", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(560, 1)" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([mb1, mb2], axis=0).shape" ] }, { "cell_type": "markdown", "id": "7f3bd46b", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "However, the index is no longer unique, due to overlap between the two DataFrames." ] }, { "cell_type": "code", "execution_count": 61, "id": "bc511917", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([mb1, mb2], axis=0).index.is_unique" ] }, { "cell_type": "markdown", "id": "2d4057aa", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Concatenating along `axis=1` will concatenate column-wise, but respecting the indices of the two DataFrames." ] }, { "cell_type": "code", "execution_count": 62, "id": "100de543", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(438, 2)" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([mb1, mb2], axis=1).shape" ] }, { "cell_type": "code", "execution_count": 63, "id": "a515f6bd", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CountCount
Taxon
Archaea \"Crenarchaeota\" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera7.023.0
Archaea \"Crenarchaeota\" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus2.02.0
Archaea \"Crenarchaeota\" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus3.010.0
Archaea \"Crenarchaeota\" Thermoprotei Thermoproteales Thermofilaceae Thermofilum3.09.0
Archaea \"Euryarchaeota\" \"Methanomicrobia\" Methanocellales Methanocellaceae Methanocella7.09.0
.........
Bacteria \"Proteobacteria\" Gammaproteobacteria Oceanospirillales Oceanospirillales_incertae_sedis SpongiispiraNaN1.0
Bacteria \"Proteobacteria\" Gammaproteobacteria Thiotrichales Piscirickettsiaceae HydrogenovibrioNaN9.0
Bacteria \"Proteobacteria\" Gammaproteobacteria Thiotrichales Piscirickettsiaceae SulfurivirgaNaN1.0
Bacteria \"Thermodesulfobacteria\" Thermodesulfobacteria Thermodesulfobacteriales Thermodesulfobacteriaceae ThermodesulfatatorNaN3.0
Bacteria TM7 TM7_genera_incertae_sedisNaN2.0
\n", "

438 rows × 2 columns

\n", "
" ], "text/plain": [ " Count Count\n", "Taxon \n", "Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 7.0 23.0\n", "Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 2.0 2.0\n", "Archaea \"Crenarchaeota\" Thermoprotei Sulfolobal... 3.0 10.0\n", "Archaea \"Crenarchaeota\" Thermoprotei Thermoprot... 3.0 9.0\n", "Archaea \"Euryarchaeota\" \"Methanomicrobia\" Metha... 7.0 9.0\n", "... ... ...\n", "Bacteria \"Proteobacteria\" Gammaproteobacteria O... NaN 1.0\n", "Bacteria \"Proteobacteria\" Gammaproteobacteria T... NaN 9.0\n", "Bacteria \"Proteobacteria\" Gammaproteobacteria T... NaN 1.0\n", "Bacteria \"Thermodesulfobacteria\" Thermodesulfob... NaN 3.0\n", "Bacteria TM7 TM7_genera_incertae_sedis NaN 2.0\n", "\n", "[438 rows x 2 columns]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([mb1, mb2], axis=1)" ] }, { "cell_type": "code", "execution_count": 64, "id": "5613a854", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "array([[ 7., 23.],\n", " [ 2., 2.],\n", " [ 3., 10.],\n", " [ 3., 9.],\n", " [ 7., 9.]])" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([mb1, mb2], axis=1).values[:5]" ] }, { "cell_type": "markdown", "id": "44b0adc2", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- If we are only interested in taxa that are included in both DataFrames, we can specify a `join=inner` argument." ] }, { "cell_type": "code", "execution_count": 65, "id": "dc8b4304", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CountCount
Taxon
Archaea \"Crenarchaeota\" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera723
Archaea \"Crenarchaeota\" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus22
Archaea \"Crenarchaeota\" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus310
Archaea \"Crenarchaeota\" Thermoprotei Thermoproteales Thermofilaceae Thermofilum39
Archaea \"Euryarchaeota\" \"Methanomicrobia\" Methanocellales Methanocellaceae Methanocella79
.........
Bacteria \"Thermodesulfobacteria\" Thermodesulfobacteria Thermodesulfobacteriales Thermodesulfobacteriaceae Caldimicrobium11
Bacteria \"Thermotogae\" Thermotogae Thermotogales Thermotogaceae Geotoga715
Bacteria \"Thermotogae\" Thermotogae Thermotogales Thermotogaceae Kosmotoga922
Bacteria Cyanobacteria Cyanobacteria Chloroplast Chlorarachniophyceae851
Bacteria Cyanobacteria Cyanobacteria Chloroplast Streptophyta13882
\n", "

122 rows × 2 columns

\n", "
" ], "text/plain": [ " Count Count\n", "Taxon \n", "Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 7 23\n", "Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 2 2\n", "Archaea \"Crenarchaeota\" Thermoprotei Sulfolobal... 3 10\n", "Archaea \"Crenarchaeota\" Thermoprotei Thermoprot... 3 9\n", "Archaea \"Euryarchaeota\" \"Methanomicrobia\" Metha... 7 9\n", "... ... ...\n", "Bacteria \"Thermodesulfobacteria\" Thermodesulfob... 1 1\n", "Bacteria \"Thermotogae\" Thermotogae Thermotogale... 7 15\n", "Bacteria \"Thermotogae\" Thermotogae Thermotogale... 9 22\n", "Bacteria Cyanobacteria Cyanobacteria Chloropla... 85 1\n", "Bacteria Cyanobacteria Cyanobacteria Chloropla... 1388 2\n", "\n", "[122 rows x 2 columns]" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([mb1, mb2], axis=1, join='inner')" ] }, { "cell_type": "markdown", "id": "9f463b07", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- If we wanted to use the second table to fill values absent from the first table, we could use `combine_first`." ] }, { "cell_type": "code", "execution_count": 66, "id": "eae22a1d", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Count
Taxon
Archaea \"Crenarchaeota\" Thermoprotei Acidilobales Acidilobaceae Acidilobus2
Archaea \"Crenarchaeota\" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera14
Archaea \"Crenarchaeota\" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera7
Archaea \"Crenarchaeota\" Thermoprotei Desulfurococcales Desulfurococcaceae Sulfophobococcus1
Archaea \"Crenarchaeota\" Thermoprotei Desulfurococcales Desulfurococcaceae Thermosphaera2
......
Bacteria \"Verrucomicrobia\" Opitutae Opitutales Opitutaceae Alterococcus1
Bacteria Cyanobacteria Cyanobacteria Chloroplast Bangiophyceae2
Bacteria Cyanobacteria Cyanobacteria Chloroplast Chlorarachniophyceae85
Bacteria Cyanobacteria Cyanobacteria Chloroplast Streptophyta1388
Bacteria TM7 TM7_genera_incertae_sedis2
\n", "

438 rows × 1 columns

\n", "
" ], "text/plain": [ " Count\n", "Taxon \n", "Archaea \"Crenarchaeota\" Thermoprotei Acidilobal... 2\n", "Archaea \"Crenarchaeota\" Thermoprotei Acidilobal... 14\n", "Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 7\n", "Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 1\n", "Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 2\n", "... ...\n", "Bacteria \"Verrucomicrobia\" Opitutae Opitutales ... 1\n", "Bacteria Cyanobacteria Cyanobacteria Chloropla... 2\n", "Bacteria Cyanobacteria Cyanobacteria Chloropla... 85\n", "Bacteria Cyanobacteria Cyanobacteria Chloropla... 1388\n", "Bacteria TM7 TM7_genera_incertae_sedis 2\n", "\n", "[438 rows x 1 columns]" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mb1.combine_first(mb2)" ] }, { "cell_type": "markdown", "id": "f11404c4", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Alternatively, you can pass keys to the concatenation by supplying the DataFrames (or Series) as a dict." ] }, { "cell_type": "code", "execution_count": 67, "id": "d96c0db5", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
patient1patient2
CountCount
Taxon
Archaea \"Crenarchaeota\" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera7.023.0
Archaea \"Crenarchaeota\" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus2.02.0
Archaea \"Crenarchaeota\" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus3.010.0
Archaea \"Crenarchaeota\" Thermoprotei Thermoproteales Thermofilaceae Thermofilum3.09.0
Archaea \"Euryarchaeota\" \"Methanomicrobia\" Methanocellales Methanocellaceae Methanocella7.09.0
.........
Bacteria \"Proteobacteria\" Gammaproteobacteria Oceanospirillales Oceanospirillales_incertae_sedis SpongiispiraNaN1.0
Bacteria \"Proteobacteria\" Gammaproteobacteria Thiotrichales Piscirickettsiaceae HydrogenovibrioNaN9.0
Bacteria \"Proteobacteria\" Gammaproteobacteria Thiotrichales Piscirickettsiaceae SulfurivirgaNaN1.0
Bacteria \"Thermodesulfobacteria\" Thermodesulfobacteria Thermodesulfobacteriales Thermodesulfobacteriaceae ThermodesulfatatorNaN3.0
Bacteria TM7 TM7_genera_incertae_sedisNaN2.0
\n", "

438 rows × 2 columns

\n", "
" ], "text/plain": [ " patient1 patient2\n", " Count Count\n", "Taxon \n", "Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 7.0 23.0\n", "Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 2.0 2.0\n", "Archaea \"Crenarchaeota\" Thermoprotei Sulfolobal... 3.0 10.0\n", "Archaea \"Crenarchaeota\" Thermoprotei Thermoprot... 3.0 9.0\n", "Archaea \"Euryarchaeota\" \"Methanomicrobia\" Metha... 7.0 9.0\n", "... ... ...\n", "Bacteria \"Proteobacteria\" Gammaproteobacteria O... NaN 1.0\n", "Bacteria \"Proteobacteria\" Gammaproteobacteria T... NaN 9.0\n", "Bacteria \"Proteobacteria\" Gammaproteobacteria T... NaN 1.0\n", "Bacteria \"Thermodesulfobacteria\" Thermodesulfob... NaN 3.0\n", "Bacteria TM7 TM7_genera_incertae_sedis NaN 2.0\n", "\n", "[438 rows x 2 columns]" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat(dict(patient1=mb1, patient2=mb2), axis=1)" ] }, { "cell_type": "markdown", "id": "c86b6d65", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- If you want `concat` to work like `numpy.concatanate`, you may provide the `ignore_index=True` argument." ] }, { "cell_type": "markdown", "id": "cf86c7c6", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Missing data\n", "\n", "The occurence of missing data is so prevalent that it pays to use tools like Pandas, which seamlessly integrates missing data handling so that it can be dealt with easily, and in the manner required by the analysis at hand.\n", "\n", "Missing data are represented in `Series` and `DataFrame` objects by the `NaN` floating point value. However, `None` is also treated as missing, since it is commonly used as such in other contexts (*e.g.* NumPy)." ] }, { "cell_type": "code", "execution_count": 68, "id": "d534af5b", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 -3\n", "2 None\n", "3 foobar\n", "dtype: object" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "foo = pd.Series([np.nan, -3, None, 'foobar'])\n", "foo" ] }, { "cell_type": "code", "execution_count": 69, "id": "f47f7613", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 False\n", "2 True\n", "3 False\n", "dtype: bool" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "foo.isnull()" ] }, { "cell_type": "markdown", "id": "1c83fa2b", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Missing values may be dropped or indexed out:" ] }, { "cell_type": "code", "execution_count": 74, "id": "85eebf12", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "Cyanobacteria NaN\n", "Firmicutes 632.0\n", "Proteobacteria 1638.0\n", "Actinobacteria 569.0\n", "dtype: float64" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}\n", "bacteria2 = pd.Series(bacteria_dict, index=['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria'])\n", "bacteria2\n" ] }, { "cell_type": "code", "execution_count": 75, "id": "ecfc04d7", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Firmicutes 632.0\n", "Proteobacteria 1638.0\n", "Actinobacteria 569.0\n", "dtype: float64" ] }, "execution_count": 75, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "bacteria2.dropna()" ] }, { "cell_type": "code", "execution_count": 76, "id": "9e055763", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Firmicutes 632.0\n", "Proteobacteria 1638.0\n", "Actinobacteria 569.0\n", "dtype: float64" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bacteria2[bacteria2.notnull()]" ] }, { "cell_type": "markdown", "id": "017f85a1", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- This can be overridden by passing the `how='all'` argument, which only drops a row when every field is a missing value." ] }, { "cell_type": "code", "execution_count": null, "id": "708a9a0a", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "data.dropna(how='all')" ] }, { "cell_type": "markdown", "id": "f217a7b6", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- This can be customized further by specifying how many values need to be present before a row is dropped via the `thresh` argument." ] }, { "cell_type": "code", "execution_count": null, "id": "9bc37ea9", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "data.dropna(thresh=4)" ] }, { "cell_type": "markdown", "id": "6fa5235a", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- This is typically used in time series applications, where there are repeated measurements that are incomplete for some subjects.\n", "\n", "- If we want to drop missing values column-wise instead of row-wise, we use `axis=1`." ] }, { "cell_type": "code", "execution_count": null, "id": "e98b3b29", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "data.dropna(axis=1)" ] }, { "cell_type": "markdown", "id": "8d49d8f4", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Rather than omitting missing data from an analysis, in some cases it may be suitable to fill the missing value in, either with a default value (such as zero) or a value that is either imputed or carried forward/backward from similar data points. We can do this programmatically in Pandas with the `fillna` argument." ] }, { "cell_type": "code", "execution_count": 79, "id": "a4049207", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "Cyanobacteria 0.0\n", "Firmicutes 632.0\n", "Proteobacteria 1638.0\n", "Actinobacteria 569.0\n", "dtype: float64" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bacteria2.fillna(0)" ] }, { "cell_type": "code", "execution_count": 89, "id": "8eb46c55", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valuetreatmentyear
06321.01994.0
116381.01997.0
25691.01999.0
31152.02013.0
44332.02015.0
511302.02017.0
67542.02019.0
75552.02021.0
\n", "
" ], "text/plain": [ " value treatment year\n", "0 632 1.0 1994.0\n", "1 1638 1.0 1997.0\n", "2 569 1.0 1999.0\n", "3 115 2.0 2013.0\n", "4 433 2.0 2015.0\n", "5 1130 2.0 2017.0\n", "6 754 2.0 2019.0\n", "7 555 2.0 2021.0" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],\n", " 'treatment':[1, 1, 1, None, 2, 2, 2, 2],\n", " 'year':[1994,1997,1999, None,2015,2017,2019,2021]})\n", "\n", "data.fillna({'year': 2013, 'treatment':2})" ] }, { "cell_type": "markdown", "id": "1023ad7e", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Notice that `fillna` by default returns a new object with the desired filling behavior, rather than changing the `Series` or `DataFrame` in place.\n", "\n", "- We can alter values in-place using `inplace=True`." ] }, { "cell_type": "markdown", "id": "04ddff7c", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Data aggregation and GroupBy operations\n", "\n", "One of the most powerful features of Pandas is its **GroupBy** functionality. On occasion we may want to perform operations on *groups* of observations within a dataset. For exmaple:\n", "\n", "* **aggregation**, such as computing the sum of mean of each group, which involves applying a function to each group and returning the aggregated results\n", "* **slicing** the DataFrame into groups and then doing something with the resulting slices (*e.g.* plotting)\n", "* group-wise **transformation**, such as standardization/normalization" ] }, { "cell_type": "code", "execution_count": 91, "id": "45250d06", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
patientobsweeksiteidtreatagesextwstrs
0110115000U65F32
1122115000U65F30
2134115000U65F24
3148115000U65F37
41512115000U65F39
..............................
626109109115000U57M53
627109229115000U57M38
628109489115000U57M33
6291095129115000U57M36
6301096169115000U57M51
\n", "

631 rows × 9 columns

\n", "
" ], "text/plain": [ " patient obs week site id treat age sex twstrs\n", "0 1 1 0 1 1 5000U 65 F 32\n", "1 1 2 2 1 1 5000U 65 F 30\n", "2 1 3 4 1 1 5000U 65 F 24\n", "3 1 4 8 1 1 5000U 65 F 37\n", "4 1 5 12 1 1 5000U 65 F 39\n", ".. ... ... ... ... .. ... ... .. ...\n", "626 109 1 0 9 11 5000U 57 M 53\n", "627 109 2 2 9 11 5000U 57 M 38\n", "628 109 4 8 9 11 5000U 57 M 33\n", "629 109 5 12 9 11 5000U 57 M 36\n", "630 109 6 16 9 11 5000U 57 M 51\n", "\n", "[631 rows x 9 columns]" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cdystonia = pd.read_csv(\"data/cdystonia.csv\")\n", "cdystonia" ] }, { "cell_type": "code", "execution_count": 92, "id": "f75c1ce2", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cdystonia_grouped = cdystonia.groupby(cdystonia.patient)\n", "cdystonia_grouped" ] }, { "cell_type": "markdown", "id": "98c33a39", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- However, the grouping is only an intermediate step; for example, we may want to **iterate** over each of the patient groups:" ] }, { "cell_type": "code", "execution_count": 93, "id": "08779a9c", "metadata": { "scrolled": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# for patient, group in cdystonia_grouped:\n", "# print(patient)\n", "# print(group)" ] }, { "cell_type": "markdown", "id": "5d96b068", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- A common data analysis procedure is the **split-apply-combine** operation, which groups subsets of data together, applies a function to each of the groups, then recombines them into a new data table. For example, we may want to aggregate our data with with some function.\n", "\n", "- We can aggregate in Pandas using the `aggregate` (or `agg`, for short) method:" ] }, { "cell_type": "code", "execution_count": 94, "id": "bde72bdc", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
obsweeksiteidagetwstrs
patient
13.57.01.01.065.033.000000
23.57.01.02.070.047.666667
33.57.01.03.064.030.500000
42.53.51.04.059.060.000000
53.57.01.05.076.046.166667
.....................
1053.57.09.07.079.043.666667
1063.57.09.08.043.067.666667
1073.57.09.09.050.042.000000
1083.57.09.010.039.052.333333
1093.67.69.011.057.042.200000
\n", "

109 rows × 6 columns

\n", "
" ], "text/plain": [ " obs week site id age twstrs\n", "patient \n", "1 3.5 7.0 1.0 1.0 65.0 33.000000\n", "2 3.5 7.0 1.0 2.0 70.0 47.666667\n", "3 3.5 7.0 1.0 3.0 64.0 30.500000\n", "4 2.5 3.5 1.0 4.0 59.0 60.000000\n", "5 3.5 7.0 1.0 5.0 76.0 46.166667\n", "... ... ... ... ... ... ...\n", "105 3.5 7.0 9.0 7.0 79.0 43.666667\n", "106 3.5 7.0 9.0 8.0 43.0 67.666667\n", "107 3.5 7.0 9.0 9.0 50.0 42.000000\n", "108 3.5 7.0 9.0 10.0 39.0 52.333333\n", "109 3.6 7.6 9.0 11.0 57.0 42.200000\n", "\n", "[109 rows x 6 columns]" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "cdystonia_grouped.agg(np.mean)" ] }, { "cell_type": "markdown", "id": "cb55ed92", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Notice that the `treat` and `sex` variables are not included in the aggregation. Since it does not make sense to aggregate non-string variables, these columns are simply ignored by the method.\n", "\n", "- Some aggregation functions are so common that Pandas has a convenience method for them, such as `mean`:" ] }, { "cell_type": "code", "execution_count": 95, "id": "9ec0d138", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
obsweeksiteidagetwstrs
patient
13.57.01.01.065.033.000000
23.57.01.02.070.047.666667
33.57.01.03.064.030.500000
42.53.51.04.059.060.000000
53.57.01.05.076.046.166667
.....................
1053.57.09.07.079.043.666667
1063.57.09.08.043.067.666667
1073.57.09.09.050.042.000000
1083.57.09.010.039.052.333333
1093.67.69.011.057.042.200000
\n", "

109 rows × 6 columns

\n", "
" ], "text/plain": [ " obs week site id age twstrs\n", "patient \n", "1 3.5 7.0 1.0 1.0 65.0 33.000000\n", "2 3.5 7.0 1.0 2.0 70.0 47.666667\n", "3 3.5 7.0 1.0 3.0 64.0 30.500000\n", "4 2.5 3.5 1.0 4.0 59.0 60.000000\n", "5 3.5 7.0 1.0 5.0 76.0 46.166667\n", "... ... ... ... ... ... ...\n", "105 3.5 7.0 9.0 7.0 79.0 43.666667\n", "106 3.5 7.0 9.0 8.0 43.0 67.666667\n", "107 3.5 7.0 9.0 9.0 50.0 42.000000\n", "108 3.5 7.0 9.0 10.0 39.0 52.333333\n", "109 3.6 7.6 9.0 11.0 57.0 42.200000\n", "\n", "[109 rows x 6 columns]" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cdystonia_grouped.mean()" ] }, { "cell_type": "markdown", "id": "2c1ee1ff", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- The `add_prefix` and `add_suffix` methods can be used to give the columns of the resulting table labels that reflect the transformation:" ] }, { "cell_type": "code", "execution_count": 96, "id": "d12a15bf", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
obs_meanweek_meansite_meanid_meanage_meantwstrs_mean
patient
13.57.01.01.065.033.000000
23.57.01.02.070.047.666667
33.57.01.03.064.030.500000
42.53.51.04.059.060.000000
53.57.01.05.076.046.166667
.....................
1053.57.09.07.079.043.666667
1063.57.09.08.043.067.666667
1073.57.09.09.050.042.000000
1083.57.09.010.039.052.333333
1093.67.69.011.057.042.200000
\n", "

109 rows × 6 columns

\n", "
" ], "text/plain": [ " obs_mean week_mean site_mean id_mean age_mean twstrs_mean\n", "patient \n", "1 3.5 7.0 1.0 1.0 65.0 33.000000\n", "2 3.5 7.0 1.0 2.0 70.0 47.666667\n", "3 3.5 7.0 1.0 3.0 64.0 30.500000\n", "4 2.5 3.5 1.0 4.0 59.0 60.000000\n", "5 3.5 7.0 1.0 5.0 76.0 46.166667\n", "... ... ... ... ... ... ...\n", "105 3.5 7.0 9.0 7.0 79.0 43.666667\n", "106 3.5 7.0 9.0 8.0 43.0 67.666667\n", "107 3.5 7.0 9.0 9.0 50.0 42.000000\n", "108 3.5 7.0 9.0 10.0 39.0 52.333333\n", "109 3.6 7.6 9.0 11.0 57.0 42.200000\n", "\n", "[109 rows x 6 columns]" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cdystonia_grouped.mean().add_suffix('_mean')" ] }, { "cell_type": "code", "execution_count": 97, "id": "313219c9", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "patient\n", "1 34.0\n", "2 50.5\n", "3 30.5\n", "4 61.5\n", "5 48.5\n", " ... \n", "105 45.5\n", "106 67.5\n", "107 44.0\n", "108 50.5\n", "109 38.0\n", "Name: twstrs, Length: 109, dtype: float64" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The median of the `twstrs` variable\n", "cdystonia_grouped['twstrs'].quantile(0.5)" ] }, { "cell_type": "markdown", "id": "390ffe52", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- If we wish, we can easily aggregate according to multiple keys:" ] }, { "cell_type": "code", "execution_count": 110, "id": "a0d1e443", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
patientobsidagetwstrs
weeksite
016.51.06.559.00000043.083333
219.51.07.553.92857151.857143
332.51.06.551.50000038.750000
442.51.04.559.25000048.125000
549.51.03.551.83333349.333333
660.01.08.051.86666749.400000
773.51.06.559.25000044.333333
889.01.010.057.26315838.631579
9104.01.06.055.45454552.727273
216.52.06.559.00000031.083333
219.02.07.052.92307748.769231
332.52.06.551.50000032.416667
442.52.04.559.25000039.125000
549.02.03.050.00000044.200000
660.02.08.051.86666744.066667
773.52.06.559.25000032.916667
888.52.09.558.56250029.500000
9103.72.05.756.00000041.600000
416.53.06.559.00000033.333333
219.53.07.553.92857148.785714
\n", "
" ], "text/plain": [ " patient obs id age twstrs\n", "week site \n", "0 1 6.5 1.0 6.5 59.000000 43.083333\n", " 2 19.5 1.0 7.5 53.928571 51.857143\n", " 3 32.5 1.0 6.5 51.500000 38.750000\n", " 4 42.5 1.0 4.5 59.250000 48.125000\n", " 5 49.5 1.0 3.5 51.833333 49.333333\n", " 6 60.0 1.0 8.0 51.866667 49.400000\n", " 7 73.5 1.0 6.5 59.250000 44.333333\n", " 8 89.0 1.0 10.0 57.263158 38.631579\n", " 9 104.0 1.0 6.0 55.454545 52.727273\n", "2 1 6.5 2.0 6.5 59.000000 31.083333\n", " 2 19.0 2.0 7.0 52.923077 48.769231\n", " 3 32.5 2.0 6.5 51.500000 32.416667\n", " 4 42.5 2.0 4.5 59.250000 39.125000\n", " 5 49.0 2.0 3.0 50.000000 44.200000\n", " 6 60.0 2.0 8.0 51.866667 44.066667\n", " 7 73.5 2.0 6.5 59.250000 32.916667\n", " 8 88.5 2.0 9.5 58.562500 29.500000\n", " 9 103.7 2.0 5.7 56.000000 41.600000\n", "4 1 6.5 3.0 6.5 59.000000 33.333333\n", " 2 19.5 3.0 7.5 53.928571 48.785714" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cdystonia.groupby(['week','site']).mean()" ] }, { "cell_type": "markdown", "id": "838bc9d9", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Alternately, we can **transform** the data, using a function of our choice with the `transform` method:" ] }, { "cell_type": "code", "execution_count": 107, "id": "684b5283", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
obsweektwstrs
0-1.336306-1.135550-0.181369
1-0.801784-0.811107-0.544107
2-0.267261-0.486664-1.632322
30.2672610.1622210.725476
40.8017840.8111071.088214
............
626-1.253831-1.1354671.180487
627-0.771589-0.836660-0.459078
6280.1928970.059761-1.005600
6290.6751400.657376-0.677687
6301.1573831.2549900.961878
\n", "

631 rows × 3 columns

\n", "
" ], "text/plain": [ " obs week twstrs\n", "0 -1.336306 -1.135550 -0.181369\n", "1 -0.801784 -0.811107 -0.544107\n", "2 -0.267261 -0.486664 -1.632322\n", "3 0.267261 0.162221 0.725476\n", "4 0.801784 0.811107 1.088214\n", ".. ... ... ...\n", "626 -1.253831 -1.135467 1.180487\n", "627 -0.771589 -0.836660 -0.459078\n", "628 0.192897 0.059761 -1.005600\n", "629 0.675140 0.657376 -0.677687\n", "630 1.157383 1.254990 0.961878\n", "\n", "[631 rows x 3 columns]" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cdystonia2 = cdystonia_grouped[[\"obs\", \"week\", \"twstrs\"]]\n", "normalize = lambda x: (x - x.mean())/x.std()\n", "\n", "cdystonia2.transform(normalize)" ] }, { "cell_type": "markdown", "id": "171b910c", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- It is easy to do column selection within `groupby` operations, if we are only interested split-apply-combine operations on a subset of columns:" ] }, { "cell_type": "code", "execution_count": 100, "id": "73001faa", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "patient\n", "1 33.000000\n", "2 47.666667\n", "3 30.500000\n", "4 60.000000\n", "5 46.166667\n", " ... \n", "105 43.666667\n", "106 67.666667\n", "107 42.000000\n", "108 52.333333\n", "109 42.200000\n", "Name: twstrs, Length: 109, dtype: float64" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cdystonia_grouped['twstrs'].mean()" ] }, { "cell_type": "markdown", "id": "a05c2b75", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- If you simply want to divide your DataFrame into chunks for later use, its easy to convert them into a dict so that they can be easily indexed out as needed:" ] }, { "cell_type": "code", "execution_count": 101, "id": "76648337", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
patientobsweeksiteidtreatagesextwstrs
1841014Placebo59F53
1942214Placebo59F61
2043414Placebo59F64
2144814Placebo59F62
\n", "
" ], "text/plain": [ " patient obs week site id treat age sex twstrs\n", "18 4 1 0 1 4 Placebo 59 F 53\n", "19 4 2 2 1 4 Placebo 59 F 61\n", "20 4 3 4 1 4 Placebo 59 F 64\n", "21 4 4 8 1 4 Placebo 59 F 62" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chunks = dict(list(cdystonia_grouped))\n", "chunks[4]" ] }, { "cell_type": "markdown", "id": "e19eaf3e", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- By default, `groupby` groups by row, but we can specify the `axis` argument to change this. For example, we can group our columns by type this way:" ] }, { "cell_type": "code", "execution_count": 102, "id": "28465087", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "{dtype('int64'): patient obs week site id age twstrs\n", " 0 1 1 0 1 1 65 32\n", " 1 1 2 2 1 1 65 30\n", " 2 1 3 4 1 1 65 24\n", " 3 1 4 8 1 1 65 37\n", " 4 1 5 12 1 1 65 39\n", " .. ... ... ... ... .. ... ...\n", " 626 109 1 0 9 11 57 53\n", " 627 109 2 2 9 11 57 38\n", " 628 109 4 8 9 11 57 33\n", " 629 109 5 12 9 11 57 36\n", " 630 109 6 16 9 11 57 51\n", " \n", " [631 rows x 7 columns],\n", " dtype('O'): treat sex\n", " 0 5000U F\n", " 1 5000U F\n", " 2 5000U F\n", " 3 5000U F\n", " 4 5000U F\n", " .. ... ..\n", " 626 5000U M\n", " 627 5000U M\n", " 628 5000U M\n", " 629 5000U M\n", " 630 5000U M\n", " \n", " [631 rows x 2 columns]}" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dict(list(cdystonia.groupby(cdystonia.dtypes, axis=1)))" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" }, "rise": { "auto_select": "first", "autolaunch": false, "enable_chalkboard": true, "start_slideshow_at": "selected", "theme": "black" } }, "nbformat": 4, "nbformat_minor": 5 }