{
"cells": [
{
"cell_type": "markdown",
"id": "b3cf3c09",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Manipulating DataFrames with Pandas\n",
"\n",
"Feng Li\n",
"\n",
"School of Statistics and Mathematics\n",
"\n",
"Central University of Finance and Economics\n",
"\n",
"[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)\n",
"\n",
"[https://feng.li/python](https://feng.li/python)"
]
},
{
"cell_type": "markdown",
"id": "05cd1b5b",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Importing data\n",
"\n",
"- A key, but often under-appreciated, step in data analysis is importing the data that we wish to analyze. \n",
"\n",
"- Pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a `DataFrame` object. These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported."
]
},
{
"cell_type": "markdown",
"id": "69e28db5",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Read extermal data\n",
"\n",
"- There are several other data formats that can be imported into Python and converted into DataFrames, with the help of buitl-in or third-party libraries. \n",
"\n",
"- These include \n",
" - CSV: `read_csv()`\n",
" - Excel: `read_excel()`\n",
" - JSON: `read_json()`\n",
" - Parquet Format: `read_parquet()`\n",
" - Stata: `read_stata()`\n",
" - ...\n",
" \n",
"These are beyond the scope of this tutorial, but are covered in https://pandas.pydata.org/docs/user_guide/io.html ."
]
},
{
"cell_type": "markdown",
"id": "99f2153c",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's start with some more bacteria data, stored in csv format."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a2ea256e",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Taxon,Patient,Tissue,Stool\r\n",
"Firmicutes,1,632,305\r\n",
"Firmicutes,2,136,4182\r\n",
"Firmicutes,3,1174,703\r\n",
"Firmicutes,4,408,3946\r\n",
"Firmicutes,5,831,8605\r\n",
"Firmicutes,6,693,50\r\n",
"Firmicutes,7,718,717\r\n",
"Firmicutes,8,173,33\r\n",
"Firmicutes,9,228,80\r\n"
]
}
],
"source": [
"! head data/microbiome.csv"
]
},
{
"cell_type": "markdown",
"id": "919d65a6",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- This table can be read into a DataFrame using `read_csv`:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "d5fd0757",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Taxon | \n",
" Patient | \n",
" Tissue | \n",
" Stool | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Firmicutes | \n",
" 1 | \n",
" 632 | \n",
" 305 | \n",
"
\n",
" \n",
" 1 | \n",
" Firmicutes | \n",
" 2 | \n",
" 136 | \n",
" 4182 | \n",
"
\n",
" \n",
" 2 | \n",
" Firmicutes | \n",
" 3 | \n",
" 1174 | \n",
" 703 | \n",
"
\n",
" \n",
" 3 | \n",
" Firmicutes | \n",
" 4 | \n",
" 408 | \n",
" 3946 | \n",
"
\n",
" \n",
" 4 | \n",
" Firmicutes | \n",
" 5 | \n",
" 831 | \n",
" 8605 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 70 | \n",
" Other | \n",
" 11 | \n",
" 203 | \n",
" 6 | \n",
"
\n",
" \n",
" 71 | \n",
" Other | \n",
" 12 | \n",
" 392 | \n",
" 6 | \n",
"
\n",
" \n",
" 72 | \n",
" Other | \n",
" 13 | \n",
" 28 | \n",
" 25 | \n",
"
\n",
" \n",
" 73 | \n",
" Other | \n",
" 14 | \n",
" 12 | \n",
" 22 | \n",
"
\n",
" \n",
" 74 | \n",
" Other | \n",
" 15 | \n",
" 305 | \n",
" 32 | \n",
"
\n",
" \n",
"
\n",
"
75 rows × 4 columns
\n",
"
"
],
"text/plain": [
" Taxon Patient Tissue Stool\n",
"0 Firmicutes 1 632 305\n",
"1 Firmicutes 2 136 4182\n",
"2 Firmicutes 3 1174 703\n",
"3 Firmicutes 4 408 3946\n",
"4 Firmicutes 5 831 8605\n",
".. ... ... ... ...\n",
"70 Other 11 203 6\n",
"71 Other 12 392 6\n",
"72 Other 13 28 25\n",
"73 Other 14 12 22\n",
"74 Other 15 305 32\n",
"\n",
"[75 rows x 4 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"mb = pd.read_csv(\"microbiome.csv\")\n",
"mb "
]
},
{
"cell_type": "markdown",
"id": "7999d005",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Notice that `read_csv` automatically considered the first row in the file to be a header row. We can override default behavior by customizing some the arguments, like `header`, `names` or `index_col`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "4e5ba16d",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Taxon | \n",
" Patient | \n",
" Tissue | \n",
" Stool | \n",
"
\n",
" \n",
" 1 | \n",
" Firmicutes | \n",
" 1 | \n",
" 632 | \n",
" 305 | \n",
"
\n",
" \n",
" 2 | \n",
" Firmicutes | \n",
" 2 | \n",
" 136 | \n",
" 4182 | \n",
"
\n",
" \n",
" 3 | \n",
" Firmicutes | \n",
" 3 | \n",
" 1174 | \n",
" 703 | \n",
"
\n",
" \n",
" 4 | \n",
" Firmicutes | \n",
" 4 | \n",
" 408 | \n",
" 3946 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 71 | \n",
" Other | \n",
" 11 | \n",
" 203 | \n",
" 6 | \n",
"
\n",
" \n",
" 72 | \n",
" Other | \n",
" 12 | \n",
" 392 | \n",
" 6 | \n",
"
\n",
" \n",
" 73 | \n",
" Other | \n",
" 13 | \n",
" 28 | \n",
" 25 | \n",
"
\n",
" \n",
" 74 | \n",
" Other | \n",
" 14 | \n",
" 12 | \n",
" 22 | \n",
"
\n",
" \n",
" 75 | \n",
" Other | \n",
" 15 | \n",
" 305 | \n",
" 32 | \n",
"
\n",
" \n",
"
\n",
"
76 rows × 4 columns
\n",
"
"
],
"text/plain": [
" 0 1 2 3\n",
"0 Taxon Patient Tissue Stool\n",
"1 Firmicutes 1 632 305\n",
"2 Firmicutes 2 136 4182\n",
"3 Firmicutes 3 1174 703\n",
"4 Firmicutes 4 408 3946\n",
".. ... ... ... ...\n",
"71 Other 11 203 6\n",
"72 Other 12 392 6\n",
"73 Other 13 28 25\n",
"74 Other 14 12 22\n",
"75 Other 15 305 32\n",
"\n",
"[76 rows x 4 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv(\"data/microbiome.csv\", header=None)"
]
},
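{
"cell_type": "markdown",
"id": "4d7e9a12",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- The `names` argument supplies our own column labels. A minimal sketch (the lower-case labels here are arbitrary choices), with `header=0` telling `read_csv` to replace the original header row:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f2a4c9d",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Discard the original header row and use custom column names instead\n",
"pd.read_csv(\"data/microbiome.csv\", header=0,\n",
"            names=['taxon', 'patient', 'tissue', 'stool']).head()"
]
},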
{
"cell_type": "markdown",
"id": "d9c383f6",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- For a more useful index, we can specify the first two columns, which together provide a unique index to the data."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "aecf71d6",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" Tissue | \n",
" Stool | \n",
"
\n",
" \n",
" Taxon | \n",
" Patient | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Firmicutes | \n",
" 1 | \n",
" 632 | \n",
" 305 | \n",
"
\n",
" \n",
" 2 | \n",
" 136 | \n",
" 4182 | \n",
"
\n",
" \n",
" 3 | \n",
" 1174 | \n",
" 703 | \n",
"
\n",
" \n",
" 4 | \n",
" 408 | \n",
" 3946 | \n",
"
\n",
" \n",
" 5 | \n",
" 831 | \n",
" 8605 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" Other | \n",
" 11 | \n",
" 203 | \n",
" 6 | \n",
"
\n",
" \n",
" 12 | \n",
" 392 | \n",
" 6 | \n",
"
\n",
" \n",
" 13 | \n",
" 28 | \n",
" 25 | \n",
"
\n",
" \n",
" 14 | \n",
" 12 | \n",
" 22 | \n",
"
\n",
" \n",
" 15 | \n",
" 305 | \n",
" 32 | \n",
"
\n",
" \n",
"
\n",
"
75 rows × 2 columns
\n",
"
"
],
"text/plain": [
" Tissue Stool\n",
"Taxon Patient \n",
"Firmicutes 1 632 305\n",
" 2 136 4182\n",
" 3 1174 703\n",
" 4 408 3946\n",
" 5 831 8605\n",
"... ... ...\n",
"Other 11 203 6\n",
" 12 392 6\n",
" 13 28 25\n",
" 14 12 22\n",
" 15 305 32\n",
"\n",
"[75 rows x 2 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mb = pd.read_csv(\"data/microbiome.csv\", index_col=['Taxon','Patient'])\n",
"mb"
]
},
{
"cell_type": "markdown",
"id": "b871f14b",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- This is called a *hierarchical* index, which we will revisit later in the tutorial."
]
},
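{
"cell_type": "markdown",
"id": "8b3c5f7e",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- As a quick preview (a minimal sketch reusing the `mb` DataFrame above), selecting with the outer level of a hierarchical index returns all rows for that taxon:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b8e6d1f",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Select all patients for a single taxon via the outer index level\n",
"mb.loc['Firmicutes'].head()"
]
},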
{
"cell_type": "markdown",
"id": "1a44f9a6",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- If we have sections of data that we do not wish to import (for example, known bad data), we can populate the `skiprows` argument. This is useful for large dataset."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "94ba2549",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Taxon | \n",
" Patient | \n",
" Tissue | \n",
" Stool | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Firmicutes | \n",
" 1 | \n",
" 632 | \n",
" 305 | \n",
"
\n",
" \n",
" 1 | \n",
" Firmicutes | \n",
" 2 | \n",
" 136 | \n",
" 4182 | \n",
"
\n",
" \n",
" 2 | \n",
" Firmicutes | \n",
" 5 | \n",
" 831 | \n",
" 8605 | \n",
"
\n",
" \n",
" 3 | \n",
" Firmicutes | \n",
" 7 | \n",
" 718 | \n",
" 717 | \n",
"
\n",
" \n",
" 4 | \n",
" Firmicutes | \n",
" 8 | \n",
" 173 | \n",
" 33 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 67 | \n",
" Other | \n",
" 11 | \n",
" 203 | \n",
" 6 | \n",
"
\n",
" \n",
" 68 | \n",
" Other | \n",
" 12 | \n",
" 392 | \n",
" 6 | \n",
"
\n",
" \n",
" 69 | \n",
" Other | \n",
" 13 | \n",
" 28 | \n",
" 25 | \n",
"
\n",
" \n",
" 70 | \n",
" Other | \n",
" 14 | \n",
" 12 | \n",
" 22 | \n",
"
\n",
" \n",
" 71 | \n",
" Other | \n",
" 15 | \n",
" 305 | \n",
" 32 | \n",
"
\n",
" \n",
"
\n",
"
72 rows × 4 columns
\n",
"
"
],
"text/plain": [
" Taxon Patient Tissue Stool\n",
"0 Firmicutes 1 632 305\n",
"1 Firmicutes 2 136 4182\n",
"2 Firmicutes 5 831 8605\n",
"3 Firmicutes 7 718 717\n",
"4 Firmicutes 8 173 33\n",
".. ... ... ... ...\n",
"67 Other 11 203 6\n",
"68 Other 12 392 6\n",
"69 Other 13 28 25\n",
"70 Other 14 12 22\n",
"71 Other 15 305 32\n",
"\n",
"[72 rows x 4 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv(\"data/microbiome.csv\", skiprows=[3,4,6])"
]
},
{
"cell_type": "markdown",
"id": "33b0d12a",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Conversely, if we only want to import a small number of rows from, say, a very large data file we can use `nrows` to retrive the first `nrows`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "57b50886",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Taxon | \n",
" Patient | \n",
" Tissue | \n",
" Stool | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Firmicutes | \n",
" 1 | \n",
" 632 | \n",
" 305 | \n",
"
\n",
" \n",
" 1 | \n",
" Firmicutes | \n",
" 2 | \n",
" 136 | \n",
" 4182 | \n",
"
\n",
" \n",
" 2 | \n",
" Firmicutes | \n",
" 3 | \n",
" 1174 | \n",
" 703 | \n",
"
\n",
" \n",
" 3 | \n",
" Firmicutes | \n",
" 4 | \n",
" 408 | \n",
" 3946 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Taxon Patient Tissue Stool\n",
"0 Firmicutes 1 632 305\n",
"1 Firmicutes 2 136 4182\n",
"2 Firmicutes 3 1174 703\n",
"3 Firmicutes 4 408 3946"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv(\"data/microbiome.csv\", nrows=4)"
]
},
{
"cell_type": "markdown",
"id": "c04905d5",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Alternately, if we want to process our data in reasonable chunks, the `chunksize` argument will return an iterable object that can be employed in a data processing loop. For example, our microbiome data are organized by bacterial phylum, with 15 patients represented in each:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "bbe00897",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_chunks = pd.read_csv(\"data/microbiome.csv\", chunksize=15)\n",
"data_chunks"
]
},
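{
"cell_type": "markdown",
"id": "2e6f8a1c",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- For example, a sketch that loops over the reader (assuming the layout above, where each 15-row chunk holds one phylum) and summarizes each chunk as it arrives:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d9c2b4e",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Each iteration yields a 15-row DataFrame; summarize it without loading the whole file\n",
"for chunk in pd.read_csv(\"data/microbiome.csv\", chunksize=15):\n",
"    print(chunk['Taxon'].iloc[0], chunk['Tissue'].mean())"
]
},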
{
"cell_type": "markdown",
"id": "b5de7a79",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Missing values\n",
"\n",
"- Most real-world data is incomplete, with values missing due to incomplete observation, data entry or transcription error, or other reasons. Pandas will automatically recognize and parse common missing data indicators, including `NA` and `NULL`."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "67a1927b",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Taxon,Patient,Tissue,Stool\r\n",
"Firmicutes,1,632,305\r\n",
"Firmicutes,2,136,4182\r\n",
"Firmicutes,3,,703\r\n",
"Firmicutes,4,408,3946\r\n",
"Firmicutes,5,831,8605\r\n",
"Firmicutes,6,693,50\r\n",
"Firmicutes,7,718,717\r\n",
"Firmicutes,8,173,33\r\n",
"Firmicutes,9,228,NA\r\n"
]
}
],
"source": [
"!head data/microbiome_missing.csv"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "8634773b",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Taxon | \n",
" Patient | \n",
" Tissue | \n",
" Stool | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Firmicutes | \n",
" 1 | \n",
" 632 | \n",
" 305.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Firmicutes | \n",
" 2 | \n",
" 136 | \n",
" 4182.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Firmicutes | \n",
" 3 | \n",
" NaN | \n",
" 703.0 | \n",
"
\n",
" \n",
" 3 | \n",
" Firmicutes | \n",
" 4 | \n",
" 408 | \n",
" 3946.0 | \n",
"
\n",
" \n",
" 4 | \n",
" Firmicutes | \n",
" 5 | \n",
" 831 | \n",
" 8605.0 | \n",
"
\n",
" \n",
" 5 | \n",
" Firmicutes | \n",
" 6 | \n",
" 693 | \n",
" 50.0 | \n",
"
\n",
" \n",
" 6 | \n",
" Firmicutes | \n",
" 7 | \n",
" 718 | \n",
" 717.0 | \n",
"
\n",
" \n",
" 7 | \n",
" Firmicutes | \n",
" 8 | \n",
" 173 | \n",
" 33.0 | \n",
"
\n",
" \n",
" 8 | \n",
" Firmicutes | \n",
" 9 | \n",
" 228 | \n",
" NaN | \n",
"
\n",
" \n",
" 9 | \n",
" Firmicutes | \n",
" 10 | \n",
" 162 | \n",
" 3196.0 | \n",
"
\n",
" \n",
" 10 | \n",
" Firmicutes | \n",
" 11 | \n",
" 372 | \n",
" -99999.0 | \n",
"
\n",
" \n",
" 11 | \n",
" Firmicutes | \n",
" 12 | \n",
" 4255 | \n",
" 4361.0 | \n",
"
\n",
" \n",
" 12 | \n",
" Firmicutes | \n",
" 13 | \n",
" 107 | \n",
" 1667.0 | \n",
"
\n",
" \n",
" 13 | \n",
" Firmicutes | \n",
" 14 | \n",
" ? | \n",
" 223.0 | \n",
"
\n",
" \n",
" 14 | \n",
" Firmicutes | \n",
" 15 | \n",
" 281 | \n",
" 2377.0 | \n",
"
\n",
" \n",
" 15 | \n",
" Proteobacteria | \n",
" 1 | \n",
" 1638 | \n",
" 3886.0 | \n",
"
\n",
" \n",
" 16 | \n",
" Proteobacteria | \n",
" 2 | \n",
" 2469 | \n",
" 1821.0 | \n",
"
\n",
" \n",
" 17 | \n",
" Proteobacteria | \n",
" 3 | \n",
" 839 | \n",
" 661.0 | \n",
"
\n",
" \n",
" 18 | \n",
" Proteobacteria | \n",
" 4 | \n",
" 4414 | \n",
" 18.0 | \n",
"
\n",
" \n",
" 19 | \n",
" Proteobacteria | \n",
" 5 | \n",
" 12044 | \n",
" 83.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Taxon Patient Tissue Stool\n",
"0 Firmicutes 1 632 305.0\n",
"1 Firmicutes 2 136 4182.0\n",
"2 Firmicutes 3 NaN 703.0\n",
"3 Firmicutes 4 408 3946.0\n",
"4 Firmicutes 5 831 8605.0\n",
"5 Firmicutes 6 693 50.0\n",
"6 Firmicutes 7 718 717.0\n",
"7 Firmicutes 8 173 33.0\n",
"8 Firmicutes 9 228 NaN\n",
"9 Firmicutes 10 162 3196.0\n",
"10 Firmicutes 11 372 -99999.0\n",
"11 Firmicutes 12 4255 4361.0\n",
"12 Firmicutes 13 107 1667.0\n",
"13 Firmicutes 14 ? 223.0\n",
"14 Firmicutes 15 281 2377.0\n",
"15 Proteobacteria 1 1638 3886.0\n",
"16 Proteobacteria 2 2469 1821.0\n",
"17 Proteobacteria 3 839 661.0\n",
"18 Proteobacteria 4 4414 18.0\n",
"19 Proteobacteria 5 12044 83.0"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"pd.read_csv(\"data/microbiome_missing.csv\").head(20)"
]
},
{
"cell_type": "markdown",
"id": "fdb640c0",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Above, Pandas recognized `NA` and an empty field as missing data."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "729dcfcf",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Taxon | \n",
" Patient | \n",
" Tissue | \n",
" Stool | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 2 | \n",
" False | \n",
" False | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
" 3 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 5 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 6 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 7 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 8 | \n",
" False | \n",
" False | \n",
" False | \n",
" True | \n",
"
\n",
" \n",
" 9 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 10 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 11 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 12 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 13 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 14 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 15 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 16 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 17 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 18 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 19 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Taxon Patient Tissue Stool\n",
"0 False False False False\n",
"1 False False False False\n",
"2 False False True False\n",
"3 False False False False\n",
"4 False False False False\n",
"5 False False False False\n",
"6 False False False False\n",
"7 False False False False\n",
"8 False False False True\n",
"9 False False False False\n",
"10 False False False False\n",
"11 False False False False\n",
"12 False False False False\n",
"13 False False False False\n",
"14 False False False False\n",
"15 False False False False\n",
"16 False False False False\n",
"17 False False False False\n",
"18 False False False False\n",
"19 False False False False"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.isnull(pd.read_csv(\"data/microbiome_missing.csv\")).head(20)"
]
},
{
"cell_type": "markdown",
"id": "eb7044f3",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Unfortunately, there will sometimes be inconsistency with the conventions for missing data. In this example, there is a question mark \"?\" and a large negative number where there should have been a positive integer. We can specify additional symbols with the `na_values` argument:\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "e12e3438",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Taxon | \n",
" Patient | \n",
" Tissue | \n",
" Stool | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Firmicutes | \n",
" 1 | \n",
" 632.0 | \n",
" 305.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Firmicutes | \n",
" 2 | \n",
" 136.0 | \n",
" 4182.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Firmicutes | \n",
" 3 | \n",
" NaN | \n",
" 703.0 | \n",
"
\n",
" \n",
" 3 | \n",
" Firmicutes | \n",
" 4 | \n",
" 408.0 | \n",
" 3946.0 | \n",
"
\n",
" \n",
" 4 | \n",
" Firmicutes | \n",
" 5 | \n",
" 831.0 | \n",
" 8605.0 | \n",
"
\n",
" \n",
" 5 | \n",
" Firmicutes | \n",
" 6 | \n",
" 693.0 | \n",
" 50.0 | \n",
"
\n",
" \n",
" 6 | \n",
" Firmicutes | \n",
" 7 | \n",
" 718.0 | \n",
" 717.0 | \n",
"
\n",
" \n",
" 7 | \n",
" Firmicutes | \n",
" 8 | \n",
" 173.0 | \n",
" 33.0 | \n",
"
\n",
" \n",
" 8 | \n",
" Firmicutes | \n",
" 9 | \n",
" 228.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 9 | \n",
" Firmicutes | \n",
" 10 | \n",
" 162.0 | \n",
" 3196.0 | \n",
"
\n",
" \n",
" 10 | \n",
" Firmicutes | \n",
" 11 | \n",
" 372.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 11 | \n",
" Firmicutes | \n",
" 12 | \n",
" 4255.0 | \n",
" 4361.0 | \n",
"
\n",
" \n",
" 12 | \n",
" Firmicutes | \n",
" 13 | \n",
" 107.0 | \n",
" 1667.0 | \n",
"
\n",
" \n",
" 13 | \n",
" Firmicutes | \n",
" 14 | \n",
" NaN | \n",
" 223.0 | \n",
"
\n",
" \n",
" 14 | \n",
" Firmicutes | \n",
" 15 | \n",
" 281.0 | \n",
" 2377.0 | \n",
"
\n",
" \n",
" 15 | \n",
" Proteobacteria | \n",
" 1 | \n",
" 1638.0 | \n",
" 3886.0 | \n",
"
\n",
" \n",
" 16 | \n",
" Proteobacteria | \n",
" 2 | \n",
" 2469.0 | \n",
" 1821.0 | \n",
"
\n",
" \n",
" 17 | \n",
" Proteobacteria | \n",
" 3 | \n",
" 839.0 | \n",
" 661.0 | \n",
"
\n",
" \n",
" 18 | \n",
" Proteobacteria | \n",
" 4 | \n",
" 4414.0 | \n",
" 18.0 | \n",
"
\n",
" \n",
" 19 | \n",
" Proteobacteria | \n",
" 5 | \n",
" 12044.0 | \n",
" 83.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Taxon Patient Tissue Stool\n",
"0 Firmicutes 1 632.0 305.0\n",
"1 Firmicutes 2 136.0 4182.0\n",
"2 Firmicutes 3 NaN 703.0\n",
"3 Firmicutes 4 408.0 3946.0\n",
"4 Firmicutes 5 831.0 8605.0\n",
"5 Firmicutes 6 693.0 50.0\n",
"6 Firmicutes 7 718.0 717.0\n",
"7 Firmicutes 8 173.0 33.0\n",
"8 Firmicutes 9 228.0 NaN\n",
"9 Firmicutes 10 162.0 3196.0\n",
"10 Firmicutes 11 372.0 NaN\n",
"11 Firmicutes 12 4255.0 4361.0\n",
"12 Firmicutes 13 107.0 1667.0\n",
"13 Firmicutes 14 NaN 223.0\n",
"14 Firmicutes 15 281.0 2377.0\n",
"15 Proteobacteria 1 1638.0 3886.0\n",
"16 Proteobacteria 2 2469.0 1821.0\n",
"17 Proteobacteria 3 839.0 661.0\n",
"18 Proteobacteria 4 4414.0 18.0\n",
"19 Proteobacteria 5 12044.0 83.0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv(\"data/microbiome_missing.csv\", na_values=['?', -99999]).head(20)"
]
},
{
"cell_type": "markdown",
"id": "0e212c61",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"These can be specified on a column-wise basis using an appropriate dict as the argument for `na_values`."
]
},
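{
"cell_type": "markdown",
"id": "9c4b2d7a",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- For example, a sketch of the dict form that treats \"?\" as missing only in `Tissue` and -99999 only in `Stool`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e7a5c3b",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Column-wise missing-value sentinels\n",
"pd.read_csv(\"data/microbiome_missing.csv\",\n",
"            na_values={'Tissue': ['?'], 'Stool': [-99999]}).head(20)"
]
},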
{
"cell_type": "markdown",
"id": "17109474",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Manipulating indices\n",
"\n",
"**Reindexing** allows users to manipulate the data labels in a DataFrame. It forces a DataFrame to conform to the new index, and optionally, fill in missing data if requested. \n",
"\n",
"For some variety, we will leave our digestive tract bacteria behind and employ some baseball data."
]
},
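{
"cell_type": "markdown",
"id": "6a1d3e8b",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- As a minimal sketch of the idea first (a toy Series rather than the baseball data), `reindex` conforms the data to a new index, filling any gap:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8c6f4a2d",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# 'd' is absent from the original index, so it is filled with 0.0\n",
"s = pd.Series([0.2, 0.5, 0.3], index=['a', 'b', 'c'])\n",
"s.reindex(['a', 'b', 'c', 'd'], fill_value=0.0)"
]
},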
{
"cell_type": "markdown",
"id": "3280a3bd",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Specify an unique index"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "756910fd",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" player | \n",
" year | \n",
" stint | \n",
" team | \n",
" lg | \n",
" g | \n",
" ab | \n",
" r | \n",
" h | \n",
" X2b | \n",
" ... | \n",
" rbi | \n",
" sb | \n",
" cs | \n",
" bb | \n",
" so | \n",
" ibb | \n",
" hbp | \n",
" sh | \n",
" sf | \n",
" gidp | \n",
"
\n",
" \n",
" id | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 88641 | \n",
" womacto01 | \n",
" 2006 | \n",
" 2 | \n",
" CHN | \n",
" NL | \n",
" 19 | \n",
" 50 | \n",
" 6 | \n",
" 14 | \n",
" 1 | \n",
" ... | \n",
" 2.0 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 4 | \n",
" 4.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 3.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 88643 | \n",
" schilcu01 | \n",
" 2006 | \n",
" 1 | \n",
" BOS | \n",
" AL | \n",
" 31 | \n",
" 2 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" ... | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 88645 | \n",
" myersmi01 | \n",
" 2006 | \n",
" 1 | \n",
" NYA | \n",
" AL | \n",
" 62 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" ... | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 88649 | \n",
" helliri01 | \n",
" 2006 | \n",
" 1 | \n",
" MIL | \n",
" NL | \n",
" 20 | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" ... | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 2.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 88650 | \n",
" johnsra05 | \n",
" 2006 | \n",
" 1 | \n",
" NYA | \n",
" AL | \n",
" 33 | \n",
" 6 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" ... | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 4.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 89525 | \n",
" benitar01 | \n",
" 2007 | \n",
" 2 | \n",
" FLO | \n",
" NL | \n",
" 34 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" ... | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 89526 | \n",
" benitar01 | \n",
" 2007 | \n",
" 1 | \n",
" SFN | \n",
" NL | \n",
" 19 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" ... | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 89530 | \n",
" ausmubr01 | \n",
" 2007 | \n",
" 1 | \n",
" HOU | \n",
" NL | \n",
" 117 | \n",
" 349 | \n",
" 38 | \n",
" 82 | \n",
" 16 | \n",
" ... | \n",
" 25.0 | \n",
" 6.0 | \n",
" 1.0 | \n",
" 37 | \n",
" 74.0 | \n",
" 3.0 | \n",
" 6.0 | \n",
" 4.0 | \n",
" 1.0 | \n",
" 11.0 | \n",
"
\n",
" \n",
" 89533 | \n",
" aloumo01 | \n",
" 2007 | \n",
" 1 | \n",
" NYN | \n",
" NL | \n",
" 87 | \n",
" 328 | \n",
" 51 | \n",
" 112 | \n",
" 19 | \n",
" ... | \n",
" 49.0 | \n",
" 3.0 | \n",
" 0.0 | \n",
" 27 | \n",
" 30.0 | \n",
" 5.0 | \n",
" 2.0 | \n",
" 0.0 | \n",
" 3.0 | \n",
" 13.0 | \n",
"
\n",
" \n",
" 89534 | \n",
" alomasa02 | \n",
" 2007 | \n",
" 1 | \n",
" NYN | \n",
" NL | \n",
" 8 | \n",
" 22 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" ... | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 3.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
100 rows × 22 columns
\n",
"
"
],
"text/plain": [
" player year stint team lg g ab r h X2b ... rbi \\\n",
"id ... \n",
"88641 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 \n",
"88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 \n",
"88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 \n",
"88649 helliri01 2006 1 MIL NL 20 3 0 0 0 ... 0.0 \n",
"88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... 0.0 \n",
"... ... ... ... ... .. ... ... .. ... ... ... ... \n",
"89525 benitar01 2007 2 FLO NL 34 0 0 0 0 ... 0.0 \n",
"89526 benitar01 2007 1 SFN NL 19 0 0 0 0 ... 0.0 \n",
"89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 \n",
"89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 \n",
"89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 \n",
"\n",
" sb cs bb so ibb hbp sh sf gidp \n",
"id \n",
"88641 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n",
"88643 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 \n",
"88645 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"88649 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n",
"88650 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0 \n",
"... ... ... .. ... ... ... ... ... ... \n",
"89525 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"89526 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"89530 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0 \n",
"89533 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0 \n",
"89534 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n",
"\n",
"[100 rows x 22 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baseball = pd.read_csv(\"data/baseball.csv\", index_col='id')\n",
"baseball"
]
},
{
"cell_type": "markdown",
"id": "b86fe868",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Notice that we specified the `id` column as the index, since it appears to be an unique identifier. We could try to create a unique index ourselves by combining `player` and `year`:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "2aa60839",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" player | \n",
" year | \n",
" stint | \n",
" team | \n",
" lg | \n",
" g | \n",
" ab | \n",
" r | \n",
" h | \n",
" X2b | \n",
" ... | \n",
" rbi | \n",
" sb | \n",
" cs | \n",
" bb | \n",
" so | \n",
" ibb | \n",
" hbp | \n",
" sh | \n",
" sf | \n",
" gidp | \n",
"
\n",
" \n",
" \n",
" \n",
" womacto012006 | \n",
" womacto01 | \n",
" 2006 | \n",
" 2 | \n",
" CHN | \n",
" NL | \n",
" 19 | \n",
" 50 | \n",
" 6 | \n",
" 14 | \n",
" 1 | \n",
" ... | \n",
" 2.0 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 4 | \n",
" 4.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 3.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" schilcu012006 | \n",
" schilcu01 | \n",
" 2006 | \n",
" 1 | \n",
" BOS | \n",
" AL | \n",
" 31 | \n",
" 2 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" ... | \n",
" 0.0 | \n",
""
],
"text/plain": [
" player year stint team lg g ab r h X2b ... \\\n",
"womacto012006 womacto01 2006 2 CHN NL 19 50 6 14 1 ... \n",
"schilcu012006 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... \n",
"myersmi012006 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... \n",
"helliri012006 helliri01 2006 1 MIL NL 20 3 0 0 0 ... \n",
"johnsra052006 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... \n",
"... ... ... ... ... .. ... ... .. ... ... ... \n",
"benitar012007 benitar01 2007 2 FLO NL 34 0 0 0 0 ... \n",
"benitar012007 benitar01 2007 1 SFN NL 19 0 0 0 0 ... \n",
"ausmubr012007 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... \n",
"aloumo012007 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... \n",
"alomasa022007 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... \n",
"\n",
" rbi sb cs bb so ibb hbp sh sf gidp \n",
"womacto012006 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n",
"schilcu012006 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 \n",
"myersmi012006 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"helliri012006 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n",
"johnsra052006 0.0 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0 \n",
"... ... ... ... .. ... ... ... ... ... ... \n",
"benitar012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"benitar012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"ausmubr012007 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0 \n",
"aloumo012007 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0 \n",
"alomasa022007 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n",
"\n",
"[100 rows x 22 columns]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"player_id = baseball.player + baseball.year.astype(str)\n",
"baseball_newind = baseball.copy()\n",
"baseball_newind.index = player_id\n",
"baseball_newind"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "7151a142",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baseball_newind.index.is_unique"
]
},
{
"cell_type": "markdown",
"id": "eafc2580",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- So, indices need not be unique. Our index here is not unique because some players changed teams within a single year. The most important consequence of a non-unique index is that indexing by label will return multiple rows for some labels:"
]
},
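{
"cell_type": "markdown",
"id": "3f9a1c2e",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A minimal sketch of this behaviour on a toy Series (the values and labels here are made up for illustration):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"s = pd.Series([1, 2, 3], index=['a', 'a', 'b'])\n",
"s.loc['a']  # duplicated label: returns a Series with both 'a' values\n",
"s.loc['b']  # unique label: returns the scalar 3\n",
"```"
]
},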
{
"cell_type": "code",
"execution_count": 15,
"id": "ae49125b",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
],
"text/plain": [
" player year stint team lg g ab r h X2b ... rbi \\\n",
"id ... \n",
"89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 \n",
"89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 \n",
"89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 \n",
"89526 benitar01 2007 1 SFN NL 19 0 0 0 0 ... 0.0 \n",
"89525 benitar01 2007 2 FLO NL 34 0 0 0 0 ... 0.0 \n",
"... ... ... ... ... .. ... ... .. ... ... ... ... \n",
"88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... 0.0 \n",
"88649 helliri01 2006 1 MIL NL 20 3 0 0 0 ... 0.0 \n",
"88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 \n",
"88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 \n",
"88641 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 \n",
"\n",
" sb cs bb so ibb hbp sh sf gidp \n",
"id \n",
"89534 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n",
"89533 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0 \n",
"89530 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0 \n",
"89526 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"89525 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"... ... ... .. ... ... ... ... ... ... \n",
"88650 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0 \n",
"88649 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n",
"88645 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"88643 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 \n",
"88641 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n",
"\n",
"[100 rows x 22 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reverse_index = baseball.index[::-1]\n",
"baseball.reindex(reverse_index)"
]
},
{
"cell_type": "markdown",
"id": "fb3fa641",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Notice that the `id` index is not sequential. Say we wanted to populate the table with every `id` value. We could specify an index that is a sequence spanning the `id` numbers in the database, and Pandas would fill in the missing rows with `NaN` values:"
]
},
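{
"cell_type": "markdown",
"id": "5b8d0e7a",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The same idea on a toy Series (illustrative values only): reindexing over a full integer range inserts `NaN` for the missing labels. Note that Python's `range()` excludes its endpoint, so add 1 to the maximum if the last label should be included.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"s = pd.Series([10, 30], index=[0, 2])\n",
"s.reindex(range(s.index.min(), s.index.max() + 1))  # labels 0, 1, 2; label 1 becomes NaN\n",
"```"
]
},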
{
"cell_type": "code",
"execution_count": 16,
"id": "78d17c34",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
],
"text/plain": [
" player year stint team lg g ab r h X2b \\\n",
"id \n",
"88641 womacto01 2006.0 2.0 CHN NL 19.0 50.0 6.0 14.0 1.0 \n",
"88642 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"88643 schilcu01 2006.0 1.0 BOS AL 31.0 2.0 0.0 1.0 0.0 \n",
"88644 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"88645 myersmi01 2006.0 1.0 NYA AL 62.0 0.0 0.0 0.0 0.0 \n",
"... ... ... ... ... ... ... ... ... ... ... \n",
"89529 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"89530 ausmubr01 2007.0 1.0 HOU NL 117.0 349.0 38.0 82.0 16.0 \n",
"89531 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"89532 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"89533 aloumo01 2007.0 1.0 NYN NL 87.0 328.0 51.0 112.0 19.0 \n",
"\n",
" ... rbi sb cs bb so ibb hbp sh sf gidp \n",
"id ... \n",
"88641 ... 2.0 1.0 1.0 4.0 4.0 0.0 0.0 3.0 0.0 0.0 \n",
"88642 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"88643 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 \n",
"88644 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"88645 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"... ... ... ... ... ... ... ... ... ... ... ... \n",
"89529 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"89530 ... 25.0 6.0 1.0 37.0 74.0 3.0 6.0 4.0 1.0 11.0 \n",
"89531 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"89532 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"89533 ... 49.0 3.0 0.0 27.0 30.0 5.0 2.0 0.0 3.0 13.0 \n",
"\n",
"[893 rows x 22 columns]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"id_range = range(baseball.index.values.min(), baseball.index.values.max())\n",
"baseball.reindex(id_range)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "eccc6c79",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
],
"text/plain": [
" player\n",
"id \n",
"88641 womacto01\n",
"88642 mr.nobody\n",
"88643 schilcu01\n",
"88644 mr.nobody\n",
"88645 myersmi01\n",
"... ...\n",
"89529 mr.nobody\n",
"89530 ausmubr01\n",
"89531 mr.nobody\n",
"89532 mr.nobody\n",
"89533 aloumo01\n",
"\n",
"[893 rows x 1 columns]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baseball.reindex(id_range, fill_value='mr.nobody', columns=['player'])"
]
},
{
"cell_type": "markdown",
"id": "77ea9d6d",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- The index can also be sorted:"
]
},
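{
"cell_type": "markdown",
"id": "9c4e6f1b",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"For reference, a small made-up DataFrame showing the main `sort_index` options:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({'b': [1, 2], 'a': [3, 4]}, index=['y', 'x'])\n",
"df.sort_index()                 # rows reordered to x, y\n",
"df.sort_index(ascending=False)  # rows in descending order: y, x\n",
"df.sort_index(axis=1)           # columns reordered to a, b\n",
"```"
]
},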
{
"cell_type": "code",
"execution_count": 18,
"id": "428064bb",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
],
"text/plain": [
" player year stint team lg g ab r h X2b ... \\\n",
"alomasa022007 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... \n",
"aloumo012007 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... \n",
"ausmubr012007 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... \n",
"benitar012007 benitar01 2007 2 FLO NL 34 0 0 0 0 ... \n",
"benitar012007 benitar01 2007 1 SFN NL 19 0 0 0 0 ... \n",
"... ... ... ... ... .. ... ... .. ... ... ... \n",
"wickmbo012007 wickmbo01 2007 1 ATL NL 47 0 0 0 0 ... \n",
"williwo022007 williwo02 2007 1 HOU NL 33 59 3 6 0 ... \n",
"witasja012007 witasja01 2007 1 TBA AL 3 0 0 0 0 ... \n",
"womacto012006 womacto01 2006 2 CHN NL 19 50 6 14 1 ... \n",
"zaungr012007 zaungr01 2007 1 TOR AL 110 331 43 80 24 ... \n",
"\n",
" rbi sb cs bb so ibb hbp sh sf gidp \n",
"alomasa022007 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n",
"aloumo012007 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0 \n",
"ausmubr012007 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0 \n",
"benitar012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"benitar012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"... ... ... ... .. ... ... ... ... ... ... \n",
"wickmbo012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"williwo022007 2.0 0.0 0.0 0 25.0 0.0 0.0 5.0 0.0 1.0 \n",
"witasja012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"womacto012006 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n",
"zaungr012007 52.0 0.0 0.0 51 55.0 8.0 2.0 1.0 6.0 9.0 \n",
"\n",
"[100 rows x 22 columns]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baseball_newind.sort_index()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "678e30cc",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
],
"text/plain": [
" player year stint team lg g ab r h X2b ... \\\n",
"zaungr012007 zaungr01 2007 1 TOR AL 110 331 43 80 24 ... \n",
"womacto012006 womacto01 2006 2 CHN NL 19 50 6 14 1 ... \n",
"witasja012007 witasja01 2007 1 TBA AL 3 0 0 0 0 ... \n",
"williwo022007 williwo02 2007 1 HOU NL 33 59 3 6 0 ... \n",
"wickmbo012007 wickmbo01 2007 1 ATL NL 47 0 0 0 0 ... \n",
"... ... ... ... ... .. ... ... .. ... ... ... \n",
"benitar012007 benitar01 2007 2 FLO NL 34 0 0 0 0 ... \n",
"benitar012007 benitar01 2007 1 SFN NL 19 0 0 0 0 ... \n",
"ausmubr012007 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... \n",
"aloumo012007 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... \n",
"alomasa022007 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... \n",
"\n",
" rbi sb cs bb so ibb hbp sh sf gidp \n",
"zaungr012007 52.0 0.0 0.0 51 55.0 8.0 2.0 1.0 6.0 9.0 \n",
"womacto012006 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n",
"witasja012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"williwo022007 2.0 0.0 0.0 0 25.0 0.0 0.0 5.0 0.0 1.0 \n",
"wickmbo012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"... ... ... ... .. ... ... ... ... ... ... \n",
"benitar012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"benitar012007 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"ausmubr012007 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0 \n",
"aloumo012007 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0 \n",
"alomasa022007 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n",
"\n",
"[100 rows x 22 columns]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baseball_newind.sort_index(ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "538484a1",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
],
"text/plain": [
" X2b X3b ab bb cs g gidp h hbp hr ... \\\n",
"womacto012006 1 0 50 4 1.0 19 0.0 14 0.0 1 ... \n",
"schilcu012006 0 0 2 0 0.0 31 0.0 1 0.0 0 ... \n",
"myersmi012006 0 0 0 0 0.0 62 0.0 0 0.0 0 ... \n",
"helliri012006 0 0 3 0 0.0 20 0.0 0 0.0 0 ... \n",
"johnsra052006 0 0 6 0 0.0 33 0.0 1 0.0 0 ... \n",
"... ... ... ... .. ... ... ... ... ... .. ... \n",
"benitar012007 0 0 0 0 0.0 34 0.0 0 0.0 0 ... \n",
"benitar012007 0 0 0 0 0.0 19 0.0 0 0.0 0 ... \n",
"ausmubr012007 16 3 349 37 1.0 117 11.0 82 6.0 3 ... \n",
"aloumo012007 19 1 328 27 0.0 87 13.0 112 2.0 13 ... \n",
"alomasa022007 1 0 22 0 0.0 8 0.0 3 0.0 0 ... \n",
"\n",
" player r rbi sb sf sh so stint team year \n",
"womacto012006 womacto01 6 2.0 1.0 0.0 3.0 4.0 2 CHN 2006 \n",
"schilcu012006 schilcu01 0 0.0 0.0 0.0 0.0 1.0 1 BOS 2006 \n",
"myersmi012006 myersmi01 0 0.0 0.0 0.0 0.0 0.0 1 NYA 2006 \n",
"helliri012006 helliri01 0 0.0 0.0 0.0 0.0 2.0 1 MIL 2006 \n",
"johnsra052006 johnsra05 0 0.0 0.0 0.0 0.0 4.0 1 NYA 2006 \n",
"... ... .. ... ... ... ... ... ... ... ... \n",
"benitar012007 benitar01 0 0.0 0.0 0.0 0.0 0.0 2 FLO 2007 \n",
"benitar012007 benitar01 0 0.0 0.0 0.0 0.0 0.0 1 SFN 2007 \n",
"ausmubr012007 ausmubr01 38 25.0 6.0 1.0 4.0 74.0 1 HOU 2007 \n",
"aloumo012007 aloumo01 51 49.0 3.0 3.0 0.0 30.0 1 NYN 2007 \n",
"alomasa022007 alomasa02 1 0.0 0.0 0.0 0.0 3.0 1 NYN 2007 \n",
"\n",
"[100 rows x 22 columns]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baseball_newind.sort_index(axis=1)"
]
},
{
"cell_type": "markdown",
"id": "5e8efd07",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- **Ranking** does not re-arrange the data; instead, it returns a new `Series` giving each value's rank relative to the others."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "ff845a68",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"id\n",
"88641 62.5\n",
"88643 29.0\n",
"88645 29.0\n",
"88649 29.0\n",
"88650 29.0\n",
" ... \n",
"89525 29.0\n",
"89526 29.0\n",
"89530 71.5\n",
"89533 88.0\n",
"89534 29.0\n",
"Name: hr, Length: 100, dtype: float64"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baseball.hr.rank()"
]
},
{
"cell_type": "markdown",
"id": "b628dbea",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Alternatively, you can break ties via one of several methods, such as by the order in which they occur in the dataset:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "a46f31a9",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"id\n",
"88641 58.0\n",
"88643 1.0\n",
"88645 2.0\n",
"88649 3.0\n",
"88650 4.0\n",
" ... \n",
"89525 55.0\n",
"89526 56.0\n",
"89530 72.0\n",
"89533 88.0\n",
"89534 57.0\n",
"Name: hr, Length: 100, dtype: float64"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baseball.hr.rank(method='first')"
]
},
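{
"cell_type": "markdown",
"id": "rank-ties-demo",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"As a quick illustration (a toy example, not drawn from the baseball data), the tie-breaking methods can be compared on a small `Series`:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"s = pd.Series([7, 1, 7, 3])\n",
"s.rank()                # average (default): ties share the mean rank\n",
"s.rank(method='first')  # ties ranked by order of appearance\n",
"s.rank(method='min')    # ties all receive the lowest rank in the group\n",
"```"
]
},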
{
"cell_type": "markdown",
"id": "852570c6",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Calling `rank` on a `DataFrame` computes the ranks within every column at once:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "8f6c74b5",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" player | \n",
" year | \n",
" stint | \n",
" team | \n",
" lg | \n",
" g | \n",
" ab | \n",
" r | \n",
" h | \n",
" X2b | \n",
" ... | \n",
" rbi | \n",
" sb | \n",
" cs | \n",
" bb | \n",
" so | \n",
" ibb | \n",
" hbp | \n",
" sh | \n",
" sf | \n",
" gidp | \n",
"
\n",
" \n",
" id | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 88641 | \n",
" 2.0 | \n",
" 96.5 | \n",
" 7.0 | \n",
" 82.0 | \n",
" 31.5 | \n",
" 70.0 | \n",
" 47.5 | \n",
" 40.5 | \n",
" 39.0 | \n",
" 50.5 | \n",
" ... | \n",
" 51.0 | \n",
" 24.5 | \n",
" 17.5 | \n",
" 44.5 | \n",
" 59.0 | \n",
" 66.0 | \n",
" 65.5 | \n",
" 16.0 | \n",
" 70.0 | \n",
" 76.5 | \n",
"
\n",
" \n",
" 88643 | \n",
" 37.5 | \n",
" 96.5 | \n",
" 57.0 | \n",
" 88.0 | \n",
" 81.5 | \n",
" 55.5 | \n",
" 73.0 | \n",
" 81.0 | \n",
" 63.5 | \n",
" 78.0 | \n",
" ... | \n",
" 78.5 | \n",
" 63.5 | \n",
" 62.5 | \n",
" 79.0 | \n",
" 73.0 | \n",
" 66.0 | \n",
" 65.5 | \n",
" 67.5 | \n",
" 70.0 | \n",
" 76.5 | \n",
"
\n",
" \n",
" 88645 | \n",
" 47.5 | \n",
" 96.5 | \n",
" 57.0 | \n",
" 40.5 | \n",
" 81.5 | \n",
" 36.0 | \n",
" 91.0 | \n",
" 81.0 | \n",
" 84.5 | \n",
" 78.0 | \n",
" ... | \n",
" 78.5 | \n",
" 63.5 | \n",
" 62.5 | \n",
" 79.0 | \n",
" 89.0 | \n",
" 66.0 | \n",
" 65.5 | \n",
" 67.5 | \n",
" 70.0 | \n",
" 76.5 | \n",
"
\n",
" \n",
" 88649 | \n",
" 66.0 | \n",
" 96.5 | \n",
" 57.0 | \n",
" 47.0 | \n",
" 31.5 | \n",
" 67.5 | \n",
" 69.0 | \n",
" 81.0 | \n",
" 84.5 | \n",
" 78.0 | \n",
" ... | \n",
" 78.5 | \n",
" 63.5 | \n",
" 62.5 | \n",
" 79.0 | \n",
" 67.0 | \n",
" 66.0 | \n",
" 65.5 | \n",
" 67.5 | \n",
" 70.0 | \n",
" 76.5 | \n",
"
\n",
" \n",
" 88650 | \n",
" 61.5 | \n",
" 96.5 | \n",
" 57.0 | \n",
" 40.5 | \n",
" 81.5 | \n",
" 51.0 | \n",
" 64.5 | \n",
" 81.0 | \n",
" 63.5 | \n",
" 78.0 | \n",
" ... | \n",
" 78.5 | \n",
" 63.5 | \n",
" 62.5 | \n",
" 79.0 | \n",
" 59.0 | \n",
" 66.0 | \n",
" 65.5 | \n",
" 67.5 | \n",
" 70.0 | \n",
" 76.5 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 89525 | \n",
" 96.5 | \n",
" 46.5 | \n",
" 7.0 | \n",
" 64.0 | \n",
" 31.5 | \n",
" 47.0 | \n",
" 91.0 | \n",
" 81.0 | \n",
" 84.5 | \n",
" 78.0 | \n",
" ... | \n",
" 78.5 | \n",
" 63.5 | \n",
" 62.5 | \n",
" 79.0 | \n",
" 89.0 | \n",
" 66.0 | \n",
" 65.5 | \n",
" 67.5 | \n",
" 70.0 | \n",
" 76.5 | \n",
"
\n",
" \n",
" 89526 | \n",
" 96.5 | \n",
" 46.5 | \n",
" 57.0 | \n",
" 13.5 | \n",
" 31.5 | \n",
" 70.0 | \n",
" 91.0 | \n",
" 81.0 | \n",
" 84.5 | \n",
" 78.0 | \n",
" ... | \n",
" 78.5 | \n",
" 63.5 | \n",
" 62.5 | \n",
" 79.0 | \n",
" 89.0 | \n",
" 66.0 | \n",
" 65.5 | \n",
" 67.5 | \n",
" 70.0 | \n",
" 76.5 | \n",
"
\n",
" \n",
" 89530 | \n",
" 98.0 | \n",
" 46.5 | \n",
" 57.0 | \n",
" 61.5 | \n",
" 31.5 | \n",
" 17.5 | \n",
" 19.0 | \n",
" 24.0 | \n",
" 23.0 | \n",
" 21.5 | \n",
" ... | \n",
" 27.0 | \n",
" 7.0 | \n",
" 17.5 | \n",
" 18.5 | \n",
" 10.0 | \n",
" 18.0 | \n",
" 6.5 | \n",
" 12.0 | \n",
" 33.5 | \n",
" 14.0 | \n",
"
\n",
" \n",
" 89533 | \n",
" 99.0 | \n",
" 46.5 | \n",
" 57.0 | \n",
" 31.5 | \n",
" 31.5 | \n",
" 23.0 | \n",
" 22.0 | \n",
" 18.5 | \n",
" 14.0 | \n",
" 17.5 | \n",
" ... | \n",
" 18.0 | \n",
" 14.0 | \n",
" 62.5 | \n",
" 22.0 | \n",
" 27.0 | \n",
" 11.0 | \n",
" 21.0 | \n",
" 67.5 | \n",
" 15.5 | \n",
" 10.5 | \n",
"
\n",
" \n",
" 89534 | \n",
" 100.0 | \n",
" 46.5 | \n",
" 57.0 | \n",
" 31.5 | \n",
" 31.5 | \n",
" 77.0 | \n",
" 57.0 | \n",
" 58.0 | \n",
" 58.0 | \n",
" 50.5 | \n",
" ... | \n",
" 78.5 | \n",
" 63.5 | \n",
" 62.5 | \n",
" 79.0 | \n",
" 63.5 | \n",
" 66.0 | \n",
" 65.5 | \n",
" 67.5 | \n",
" 70.0 | \n",
" 76.5 | \n",
"
\n",
" \n",
"
\n",
"
100 rows × 22 columns
\n",
"
"
],
"text/plain": [
" player year stint team lg g ab r h X2b ... \\\n",
"id ... \n",
"88641 2.0 96.5 7.0 82.0 31.5 70.0 47.5 40.5 39.0 50.5 ... \n",
"88643 37.5 96.5 57.0 88.0 81.5 55.5 73.0 81.0 63.5 78.0 ... \n",
"88645 47.5 96.5 57.0 40.5 81.5 36.0 91.0 81.0 84.5 78.0 ... \n",
"88649 66.0 96.5 57.0 47.0 31.5 67.5 69.0 81.0 84.5 78.0 ... \n",
"88650 61.5 96.5 57.0 40.5 81.5 51.0 64.5 81.0 63.5 78.0 ... \n",
"... ... ... ... ... ... ... ... ... ... ... ... \n",
"89525 96.5 46.5 7.0 64.0 31.5 47.0 91.0 81.0 84.5 78.0 ... \n",
"89526 96.5 46.5 57.0 13.5 31.5 70.0 91.0 81.0 84.5 78.0 ... \n",
"89530 98.0 46.5 57.0 61.5 31.5 17.5 19.0 24.0 23.0 21.5 ... \n",
"89533 99.0 46.5 57.0 31.5 31.5 23.0 22.0 18.5 14.0 17.5 ... \n",
"89534 100.0 46.5 57.0 31.5 31.5 77.0 57.0 58.0 58.0 50.5 ... \n",
"\n",
" rbi sb cs bb so ibb hbp sh sf gidp \n",
"id \n",
"88641 51.0 24.5 17.5 44.5 59.0 66.0 65.5 16.0 70.0 76.5 \n",
"88643 78.5 63.5 62.5 79.0 73.0 66.0 65.5 67.5 70.0 76.5 \n",
"88645 78.5 63.5 62.5 79.0 89.0 66.0 65.5 67.5 70.0 76.5 \n",
"88649 78.5 63.5 62.5 79.0 67.0 66.0 65.5 67.5 70.0 76.5 \n",
"88650 78.5 63.5 62.5 79.0 59.0 66.0 65.5 67.5 70.0 76.5 \n",
"... ... ... ... ... ... ... ... ... ... ... \n",
"89525 78.5 63.5 62.5 79.0 89.0 66.0 65.5 67.5 70.0 76.5 \n",
"89526 78.5 63.5 62.5 79.0 89.0 66.0 65.5 67.5 70.0 76.5 \n",
"89530 27.0 7.0 17.5 18.5 10.0 18.0 6.5 12.0 33.5 14.0 \n",
"89533 18.0 14.0 62.5 22.0 27.0 11.0 21.0 67.5 15.5 10.5 \n",
"89534 78.5 63.5 62.5 79.0 63.5 66.0 65.5 67.5 70.0 76.5 \n",
"\n",
"[100 rows x 22 columns]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baseball.rank(ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "4d53e292",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" r | \n",
" h | \n",
" hr | \n",
"
\n",
" \n",
" id | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 88641 | \n",
" 40.5 | \n",
" 39.0 | \n",
" 38.5 | \n",
"
\n",
" \n",
" 88643 | \n",
" 81.0 | \n",
" 63.5 | \n",
" 72.0 | \n",
"
\n",
" \n",
" 88645 | \n",
" 81.0 | \n",
" 84.5 | \n",
" 72.0 | \n",
"
\n",
" \n",
" 88649 | \n",
" 81.0 | \n",
" 84.5 | \n",
" 72.0 | \n",
"
\n",
" \n",
" 88650 | \n",
" 81.0 | \n",
" 63.5 | \n",
" 72.0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 89525 | \n",
" 81.0 | \n",
" 84.5 | \n",
" 72.0 | \n",
"
\n",
" \n",
" 89526 | \n",
" 81.0 | \n",
" 84.5 | \n",
" 72.0 | \n",
"
\n",
" \n",
" 89530 | \n",
" 24.0 | \n",
" 23.0 | \n",
" 29.5 | \n",
"
\n",
" \n",
" 89533 | \n",
" 18.5 | \n",
" 14.0 | \n",
" 13.0 | \n",
"
\n",
" \n",
" 89534 | \n",
" 58.0 | \n",
" 58.0 | \n",
" 72.0 | \n",
"
\n",
" \n",
"
\n",
"
100 rows × 3 columns
\n",
"
"
],
"text/plain": [
" r h hr\n",
"id \n",
"88641 40.5 39.0 38.5\n",
"88643 81.0 63.5 72.0\n",
"88645 81.0 84.5 72.0\n",
"88649 81.0 84.5 72.0\n",
"88650 81.0 63.5 72.0\n",
"... ... ... ...\n",
"89525 81.0 84.5 72.0\n",
"89526 81.0 84.5 72.0\n",
"89530 24.0 23.0 29.5\n",
"89533 18.5 14.0 13.0\n",
"89534 58.0 58.0 72.0\n",
"\n",
"[100 rows x 3 columns]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baseball[['r','h','hr']].rank(ascending=False)"
]
},
{
"cell_type": "markdown",
"id": "57a9a942",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Removing rows or columns with the `drop` method"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "369a5cd0",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(100, 22)"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baseball.shape"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "7cf17ce7",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" player | \n",
" year | \n",
" stint | \n",
" team | \n",
" lg | \n",
" g | \n",
" ab | \n",
" r | \n",
" h | \n",
" X2b | \n",
" ... | \n",
" rbi | \n",
" sb | \n",
" cs | \n",
" bb | \n",
" so | \n",
" ibb | \n",
" hbp | \n",
" sh | \n",
" sf | \n",
" gidp | \n",
"
\n",
" \n",
" id | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 88641 | \n",
" womacto01 | \n",
" 2006 | \n",
" 2 | \n",
" CHN | \n",
" NL | \n",
" 19 | \n",
" 50 | \n",
" 6 | \n",
" 14 | \n",
" 1 | \n",
" ... | \n",
" 2.0 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 4 | \n",
" 4.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 3.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 88643 | \n",
" schilcu01 | \n",
" 2006 | \n",
" 1 | \n",
" BOS | \n",
" AL | \n",
" 31 | \n",
" 2 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" ... | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 88645 | \n",
" myersmi01 | \n",
" 2006 | \n",
" 1 | \n",
" NYA | \n",
" AL | \n",
" 62 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" ... | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 88649 | \n",
" helliri01 | \n",
" 2006 | \n",
" 1 | \n",
" MIL | \n",
" NL | \n",
" 20 | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" ... | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 2.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 88650 | \n",
" johnsra05 | \n",
" 2006 | \n",
" 1 | \n",
" NYA | \n",
" AL | \n",
" 33 | \n",
" 6 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" ... | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 4.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 89521 | \n",
" bondsba01 | \n",
" 2007 | \n",
" 1 | \n",
" SFN | \n",
" NL | \n",
" 126 | \n",
" 340 | \n",
" 75 | \n",
" 94 | \n",
" 14 | \n",
" ... | \n",
" 66.0 | \n",
" 5.0 | \n",
" 0.0 | \n",
" 132 | \n",
" 54.0 | \n",
" 43.0 | \n",
" 3.0 | \n",
" 0.0 | \n",
" 2.0 | \n",
" 13.0 | \n",
"
\n",
" \n",
" 89523 | \n",
" biggicr01 | \n",
" 2007 | \n",
" 1 | \n",
" HOU | \n",
" NL | \n",
" 141 | \n",
" 517 | \n",
" 68 | \n",
" 130 | \n",
" 31 | \n",
" ... | \n",
" 50.0 | \n",
" 4.0 | \n",
" 3.0 | \n",
" 23 | \n",
" 112.0 | \n",
" 0.0 | \n",
" 3.0 | \n",
" 7.0 | \n",
" 5.0 | \n",
" 5.0 | \n",
"
\n",
" \n",
" 89530 | \n",
" ausmubr01 | \n",
" 2007 | \n",
" 1 | \n",
" HOU | \n",
" NL | \n",
" 117 | \n",
" 349 | \n",
" 38 | \n",
" 82 | \n",
" 16 | \n",
" ... | \n",
" 25.0 | \n",
" 6.0 | \n",
" 1.0 | \n",
" 37 | \n",
" 74.0 | \n",
" 3.0 | \n",
" 6.0 | \n",
" 4.0 | \n",
" 1.0 | \n",
" 11.0 | \n",
"
\n",
" \n",
" 89533 | \n",
" aloumo01 | \n",
" 2007 | \n",
" 1 | \n",
" NYN | \n",
" NL | \n",
" 87 | \n",
" 328 | \n",
" 51 | \n",
" 112 | \n",
" 19 | \n",
" ... | \n",
" 49.0 | \n",
" 3.0 | \n",
" 0.0 | \n",
" 27 | \n",
" 30.0 | \n",
" 5.0 | \n",
" 2.0 | \n",
" 0.0 | \n",
" 3.0 | \n",
" 13.0 | \n",
"
\n",
" \n",
" 89534 | \n",
" alomasa02 | \n",
" 2007 | \n",
" 1 | \n",
" NYN | \n",
" NL | \n",
" 8 | \n",
" 22 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" ... | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 3.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
98 rows × 22 columns
\n",
"
"
],
"text/plain": [
" player year stint team lg g ab r h X2b ... rbi \\\n",
"id ... \n",
"88641 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 \n",
"88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 \n",
"88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 \n",
"88649 helliri01 2006 1 MIL NL 20 3 0 0 0 ... 0.0 \n",
"88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... 0.0 \n",
"... ... ... ... ... .. ... ... .. ... ... ... ... \n",
"89521 bondsba01 2007 1 SFN NL 126 340 75 94 14 ... 66.0 \n",
"89523 biggicr01 2007 1 HOU NL 141 517 68 130 31 ... 50.0 \n",
"89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 \n",
"89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 \n",
"89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 \n",
"\n",
" sb cs bb so ibb hbp sh sf gidp \n",
"id \n",
"88641 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0 \n",
"88643 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 \n",
"88645 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"88649 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n",
"88650 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0 \n",
"... ... ... ... ... ... ... ... ... ... \n",
"89521 5.0 0.0 132 54.0 43.0 3.0 0.0 2.0 13.0 \n",
"89523 4.0 3.0 23 112.0 0.0 3.0 7.0 5.0 5.0 \n",
"89530 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0 \n",
"89533 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0 \n",
"89534 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n",
"\n",
"[98 rows x 22 columns]"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baseball.drop([89525, 89526]) # does not modify the original DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "859860b5",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(100, 22)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baseball.shape"
]
},
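{
"cell_type": "markdown",
"id": "drop-copy-demo",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Note that `drop` returns a new object by default, so the result must be reassigned (or `inplace=True` passed) for the change to persist. A minimal sketch on a toy DataFrame:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['x', 'y', 'z'])\n",
"df.drop('y')          # returns a copy without row 'y'; df is unchanged\n",
"df.drop('b', axis=1)  # returns a copy without column 'b'\n",
"df = df.drop('y')     # reassign to keep the change\n",
"```"
]
},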
{
"cell_type": "code",
"execution_count": 28,
"id": "27a23cdd",
"metadata": {
"scrolled": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" player | \n",
" year | \n",
" stint | \n",
" team | \n",
" lg | \n",
" g | \n",
" ab | \n",
" r | \n",
" h | \n",
" X2b | \n",
" X3b | \n",
" hr | \n",
" rbi | \n",
" sb | \n",
" cs | \n",
" bb | \n",
" so | \n",
" sh | \n",
" sf | \n",
" gidp | \n",
"
\n",
" \n",
" id | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 88641 | \n",
" womacto01 | \n",
" 2006 | \n",
" 2 | \n",
" CHN | \n",
" NL | \n",
" 19 | \n",
" 50 | \n",
" 6 | \n",
" 14 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 2.0 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 4 | \n",
" 4.0 | \n",
" 3.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 88643 | \n",
" schilcu01 | \n",
" 2006 | \n",
" 1 | \n",
" BOS | \n",
" AL | \n",
" 31 | \n",
" 2 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 88645 | \n",
" myersmi01 | \n",
" 2006 | \n",
" 1 | \n",
" NYA | \n",
" AL | \n",
" 62 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 88649 | \n",
" helliri01 | \n",
" 2006 | \n",
" 1 | \n",
" MIL | \n",
" NL | \n",
" 20 | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 2.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 88650 | \n",
" johnsra05 | \n",
" 2006 | \n",
" 1 | \n",
" NYA | \n",
" AL | \n",
" 33 | \n",
" 6 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 4.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 89525 | \n",
" benitar01 | \n",
" 2007 | \n",
" 2 | \n",
" FLO | \n",
" NL | \n",
" 34 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 89526 | \n",
" benitar01 | \n",
" 2007 | \n",
" 1 | \n",
" SFN | \n",
" NL | \n",
" 19 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 89530 | \n",
" ausmubr01 | \n",
" 2007 | \n",
" 1 | \n",
" HOU | \n",
" NL | \n",
" 117 | \n",
" 349 | \n",
" 38 | \n",
" 82 | \n",
" 16 | \n",
" 3 | \n",
" 3 | \n",
" 25.0 | \n",
" 6.0 | \n",
" 1.0 | \n",
" 37 | \n",
" 74.0 | \n",
" 4.0 | \n",
" 1.0 | \n",
" 11.0 | \n",
"
\n",
" \n",
" 89533 | \n",
" aloumo01 | \n",
" 2007 | \n",
" 1 | \n",
" NYN | \n",
" NL | \n",
" 87 | \n",
" 328 | \n",
" 51 | \n",
" 112 | \n",
" 19 | \n",
" 1 | \n",
" 13 | \n",
" 49.0 | \n",
" 3.0 | \n",
" 0.0 | \n",
" 27 | \n",
" 30.0 | \n",
" 0.0 | \n",
" 3.0 | \n",
" 13.0 | \n",
"
\n",
" \n",
" 89534 | \n",
" alomasa02 | \n",
" 2007 | \n",
" 1 | \n",
" NYN | \n",
" NL | \n",
" 8 | \n",
" 22 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 3.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
100 rows × 20 columns
\n",
"
"
],
"text/plain": [
" player year stint team lg g ab r h X2b X3b hr rbi \\\n",
"id \n",
"88641 womacto01 2006 2 CHN NL 19 50 6 14 1 0 1 2.0 \n",
"88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 0 0 0.0 \n",
"88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 0 0 0.0 \n",
"88649 helliri01 2006 1 MIL NL 20 3 0 0 0 0 0 0.0 \n",
"88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 0 0 0.0 \n",
"... ... ... ... ... .. ... ... .. ... ... ... .. ... \n",
"89525 benitar01 2007 2 FLO NL 34 0 0 0 0 0 0 0.0 \n",
"89526 benitar01 2007 1 SFN NL 19 0 0 0 0 0 0 0.0 \n",
"89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 3 3 25.0 \n",
"89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 1 13 49.0 \n",
"89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 0 0 0.0 \n",
"\n",
" sb cs bb so sh sf gidp \n",
"id \n",
"88641 1.0 1.0 4 4.0 3.0 0.0 0.0 \n",
"88643 0.0 0.0 0 1.0 0.0 0.0 0.0 \n",
"88645 0.0 0.0 0 0.0 0.0 0.0 0.0 \n",
"88649 0.0 0.0 0 2.0 0.0 0.0 0.0 \n",
"88650 0.0 0.0 0 4.0 0.0 0.0 0.0 \n",
"... ... ... .. ... ... ... ... \n",
"89525 0.0 0.0 0 0.0 0.0 0.0 0.0 \n",
"89526 0.0 0.0 0 0.0 0.0 0.0 0.0 \n",
"89530 6.0 1.0 37 74.0 4.0 1.0 11.0 \n",
"89533 3.0 0.0 27 30.0 0.0 3.0 13.0 \n",
"89534 0.0 0.0 0 3.0 0.0 0.0 0.0 \n",
"\n",
"[100 rows x 20 columns]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baseball.drop(['ibb','hbp'], axis=1) # in pandas, axis=0 refers to rows and axis=1 to columns"
]
},
{
"cell_type": "markdown",
"id": "8dd9963f",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Join two Pandas DataFrames\n",
"\n",
"`DataFrame` and `Series` objects allow for several operations to take place either on a single object, or between two or more objects.\n",
"\n",
"For example, we can perform arithmetic on the elements of two objects, such as combining baseball statistics across years:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "4cc5e9dc",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"hr2006 = baseball[baseball.year==2006].xs('hr', axis=1)\n",
"hr2006.index = baseball.player[baseball.year==2006]\n",
"\n",
"hr2007 = baseball[baseball.year==2007].xs('hr', axis=1)\n",
"hr2007.index = baseball.player[baseball.year==2007]"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "63d4c793",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"player\n",
"womacto01 1\n",
"schilcu01 0\n",
"myersmi01 0\n",
"helliri01 0\n",
"johnsra05 0\n",
"finlest01 6\n",
"gonzalu01 15\n",
"seleaa01 0\n",
"dtype: int64"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hr2006 = pd.Series(baseball.hr[baseball.year==2006].values, index=baseball.player[baseball.year==2006])\n",
"hr2006"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "c99cc635",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"player\n",
"francju01 0\n",
"francju01 1\n",
"zaungr01 10\n",
"witasja01 0\n",
"williwo02 1\n",
" ..\n",
"benitar01 0\n",
"benitar01 0\n",
"ausmubr01 3\n",
"aloumo01 13\n",
"alomasa02 0\n",
"Length: 92, dtype: int64"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hr2007 = pd.Series(baseball.hr[baseball.year==2007].values, index=baseball.player[baseball.year==2007])\n",
"hr2007"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "8dafdacc",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"player\n",
"alomasa02 NaN\n",
"aloumo01 NaN\n",
"ausmubr01 NaN\n",
"benitar01 NaN\n",
"benitar01 NaN\n",
" ..\n",
"wickmbo01 NaN\n",
"williwo02 NaN\n",
"witasja01 NaN\n",
"womacto01 NaN\n",
"zaungr01 NaN\n",
"Length: 94, dtype: float64"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hr_total = hr2006 + hr2007 \n",
"hr_total"
]
},
{
"cell_type": "markdown",
"id": "f08cb529",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Pandas' data alignment inserts `NaN` for labels that do not appear in both Series. In fact, only 6 players occur in both years."
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "21eed4d9",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"player\n",
"finlest01 7.0\n",
"gonzalu01 30.0\n",
"johnsra05 0.0\n",
"myersmi01 0.0\n",
"schilcu01 0.0\n",
"seleaa01 0.0\n",
"dtype: float64"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hr_total[hr_total.notnull()]"
]
},
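{
"cell_type": "markdown",
"id": "fill-value-demo",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"If `NaN` for non-overlapping labels is not what you want, the `add` method with `fill_value` treats missing entries as zero. A sketch on toy Series (using a few of the home-run values shown above):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"s1 = pd.Series({'finlest01': 6, 'gonzalu01': 15})\n",
"s2 = pd.Series({'gonzalu01': 15, 'aloumo01': 13})\n",
"s1.add(s2, fill_value=0)  # non-overlapping labels are kept, filled with 0\n",
"```"
]
},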
{
"cell_type": "markdown",
"id": "af9a9d27",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Merging and joining DataFrame objects\n",
"\n",
"In this section, we will manipulate data collected from ocean-going vessels on the eastern seaboard. Vessel operations are monitored using the Automatic Identification System (AIS), a safety-at-sea navigation technology that vessels are required to maintain. AIS uses transponders to transmit very high frequency (VHF) radio signals containing static information, including ship name, call sign, and country of origin, as well as dynamic information unique to a particular voyage, such as vessel location, heading, and speed.\n",
"\n",
"The International Maritime Organization’s (IMO) International Convention for the Safety of Life at Sea requires functioning AIS capabilities on all vessels of 300 gross tons or greater, and the US Coast Guard requires AIS on nearly all vessels sailing in U.S. waters. The Coast Guard has established a national network of AIS receivers that covers nearly all U.S. waters. AIS signals are transmitted several times each minute, and the network can handle thousands of reports per minute, with updates as often as every two seconds. A typical voyage in our study might therefore include the transmission of hundreds or thousands of AIS-encoded signals, providing a rich data source with both spatial and temporal information.\n",
"\n",
"For our purposes, we will use summarized data that describes the transit of a given vessel through a particular administrative area. The data includes the start and end time of the transit segment, as well as information about the speed of the vessel, how far it travelled, etc."
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "4ff49465",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mmsi | \n",
" name | \n",
" transit | \n",
" segment | \n",
" seg_length | \n",
" avg_sog | \n",
" min_sog | \n",
" max_sog | \n",
" pdgt10 | \n",
" st_time | \n",
" end_time | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Us Govt Ves | \n",
" 1 | \n",
" 1 | \n",
" 5.1 | \n",
" 13.2 | \n",
" 9.2 | \n",
" 14.5 | \n",
" 96.5 | \n",
" 2/10/09 16:03 | \n",
" 2/10/09 16:27 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" Dredge Capt Frank | \n",
" 1 | \n",
" 1 | \n",
" 13.5 | \n",
" 18.6 | \n",
" 10.4 | \n",
" 20.6 | \n",
" 100.0 | \n",
" 4/6/09 14:31 | \n",
" 4/6/09 15:20 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" Us Gov Vessel | \n",
" 1 | \n",
" 1 | \n",
" 4.3 | \n",
" 16.2 | \n",
" 10.3 | \n",
" 20.5 | \n",
" 100.0 | \n",
" 4/6/09 14:36 | \n",
" 4/6/09 14:55 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" Us Gov Vessel | \n",
" 2 | \n",
" 1 | \n",
" 9.2 | \n",
" 15.4 | \n",
" 14.5 | \n",
" 16.1 | \n",
" 100.0 | \n",
" 4/10/09 17:58 | \n",
" 4/10/09 18:34 | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" Dredge Capt Frank | \n",
" 2 | \n",
" 1 | \n",
" 9.2 | \n",
" 15.4 | \n",
" 14.6 | \n",
" 16.2 | \n",
" 100.0 | \n",
" 4/10/09 17:59 | \n",
" 4/10/09 18:35 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 262521 | \n",
" 999999999 | \n",
" Triple Attraction | \n",
" 3 | \n",
" 1 | \n",
" 5.3 | \n",
" 20.0 | \n",
" 19.6 | \n",
" 20.4 | \n",
" 100.0 | \n",
" 6/15/10 12:49 | \n",
" 6/15/10 13:05 | \n",
"
\n",
" \n",
" 262522 | \n",
" 999999999 | \n",
" Triple Attraction | \n",
" 4 | \n",
" 1 | \n",
" 18.7 | \n",
" 19.2 | \n",
" 18.4 | \n",
" 19.9 | \n",
" 100.0 | \n",
" 6/15/10 21:32 | \n",
" 6/15/10 22:29 | \n",
"
\n",
" \n",
" 262523 | \n",
" 999999999 | \n",
" Triple Attraction | \n",
" 6 | \n",
" 1 | \n",
" 17.4 | \n",
" 17.0 | \n",
" 14.7 | \n",
" 18.4 | \n",
" 100.0 | \n",
" 6/17/10 19:16 | \n",
" 6/17/10 20:17 | \n",
"
\n",
" \n",
" 262524 | \n",
" 999999999 | \n",
" Triple Attraction | \n",
" 7 | \n",
" 1 | \n",
" 31.5 | \n",
" 14.2 | \n",
" 13.4 | \n",
" 15.1 | \n",
" 100.0 | \n",
" 6/18/10 2:52 | \n",
" 6/18/10 5:03 | \n",
"
\n",
" \n",
" 262525 | \n",
" 999999999 | \n",
" Triple Attraction | \n",
" 8 | \n",
" 1 | \n",
" 19.8 | \n",
" 18.6 | \n",
" 16.1 | \n",
" 19.5 | \n",
" 100.0 | \n",
" 6/18/10 10:19 | \n",
" 6/18/10 11:22 | \n",
"
\n",
" \n",
"
\n",
"
262526 rows × 11 columns
\n",
"
"
],
"text/plain": [
" mmsi name transit segment seg_length avg_sog \\\n",
"0 1 Us Govt Ves 1 1 5.1 13.2 \n",
"1 1 Dredge Capt Frank 1 1 13.5 18.6 \n",
"2 1 Us Gov Vessel 1 1 4.3 16.2 \n",
"3 1 Us Gov Vessel 2 1 9.2 15.4 \n",
"4 1 Dredge Capt Frank 2 1 9.2 15.4 \n",
"... ... ... ... ... ... ... \n",
"262521 999999999 Triple Attraction 3 1 5.3 20.0 \n",
"262522 999999999 Triple Attraction 4 1 18.7 19.2 \n",
"262523 999999999 Triple Attraction 6 1 17.4 17.0 \n",
"262524 999999999 Triple Attraction 7 1 31.5 14.2 \n",
"262525 999999999 Triple Attraction 8 1 19.8 18.6 \n",
"\n",
" min_sog max_sog pdgt10 st_time end_time \n",
"0 9.2 14.5 96.5 2/10/09 16:03 2/10/09 16:27 \n",
"1 10.4 20.6 100.0 4/6/09 14:31 4/6/09 15:20 \n",
"2 10.3 20.5 100.0 4/6/09 14:36 4/6/09 14:55 \n",
"3 14.5 16.1 100.0 4/10/09 17:58 4/10/09 18:34 \n",
"4 14.6 16.2 100.0 4/10/09 17:59 4/10/09 18:35 \n",
"... ... ... ... ... ... \n",
"262521 19.6 20.4 100.0 6/15/10 12:49 6/15/10 13:05 \n",
"262522 18.4 19.9 100.0 6/15/10 21:32 6/15/10 22:29 \n",
"262523 14.7 18.4 100.0 6/17/10 19:16 6/17/10 20:17 \n",
"262524 13.4 15.1 100.0 6/18/10 2:52 6/18/10 5:03 \n",
"262525 16.1 19.5 100.0 6/18/10 10:19 6/18/10 11:22 \n",
"\n",
"[262526 rows x 11 columns]"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"segments = pd.read_csv(\"data/AIS/transit_segments.csv\")\n",
"segments"
]
},
{
"cell_type": "markdown",
"id": "90181955",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- In addition to the behavior of each vessel, we may want a little more information regarding the vessels themselves. In the `data/AIS` folder there is a second table that contains information about each of the ships that traveled the segments in the `segments` table."
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "827cc4e1",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" num_names | \n",
" names | \n",
" sov | \n",
" flag | \n",
" flag_type | \n",
" num_loas | \n",
" loa | \n",
" max_loa | \n",
" num_types | \n",
" type | \n",
"
\n",
" \n",
" mmsi | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" 8 | \n",
" Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 7 | \n",
" 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | \n",
" 156.0 | \n",
" 4 | \n",
" Dredging/MilOps/Reserved/Towing | \n",
"
\n",
" \n",
" 9 | \n",
" 3 | \n",
" 000000009/Raven/Shearwater | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 2 | \n",
" 50.0/62.0 | \n",
" 62.0 | \n",
" 2 | \n",
" Pleasure/Tug | \n",
"
\n",
" \n",
" 21 | \n",
" 1 | \n",
" Us Gov Vessel | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 208.0 | \n",
" 208.0 | \n",
" 1 | \n",
" Unknown | \n",
"
\n",
" \n",
" 74 | \n",
" 2 | \n",
" Mcfaul/Sarah Bell | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 155.0 | \n",
" 155.0 | \n",
" 1 | \n",
" Unknown | \n",
"
\n",
" \n",
" 103 | \n",
" 3 | \n",
" Ron G/Us Navy Warship 103/Us Warship 103 | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 2 | \n",
" 26.0/155.0 | \n",
" 155.0 | \n",
" 2 | \n",
" Tanker/Unknown | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 919191919 | \n",
" 1 | \n",
" Oi | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 20.0 | \n",
" 20.0 | \n",
" 1 | \n",
" Pleasure | \n",
"
\n",
" \n",
" 967191190 | \n",
" 1 | \n",
" Pathfinder | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 31.0 | \n",
" 31.0 | \n",
" 2 | \n",
" BigTow/Towing | \n",
"
\n",
" \n",
" 975318642 | \n",
" 1 | \n",
" Island Express | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 20.0 | \n",
" 20.0 | \n",
" 1 | \n",
" Towing | \n",
"
\n",
" \n",
" 987654321 | \n",
" 2 | \n",
" Island Lookout/Island Tide | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 2 | \n",
" 22.0/23.0 | \n",
" 23.0 | \n",
" 2 | \n",
" Fishing/Towing | \n",
"
\n",
" \n",
" 999999999 | \n",
" 1 | \n",
" Triple Attraction | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 30.0 | \n",
" 30.0 | \n",
" 1 | \n",
" Pleasure | \n",
"
\n",
" \n",
"
\n",
"
10771 rows × 10 columns
\n",
"
"
],
"text/plain": [
" num_names names sov \\\n",
"mmsi \n",
"1 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n",
"9 3 000000009/Raven/Shearwater N \n",
"21 1 Us Gov Vessel Y \n",
"74 2 Mcfaul/Sarah Bell N \n",
"103 3 Ron G/Us Navy Warship 103/Us Warship 103 Y \n",
"... ... ... .. \n",
"919191919 1 Oi N \n",
"967191190 1 Pathfinder N \n",
"975318642 1 Island Express N \n",
"987654321 2 Island Lookout/Island Tide N \n",
"999999999 1 Triple Attraction N \n",
"\n",
" flag flag_type num_loas loa \\\n",
"mmsi \n",
"1 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n",
"9 Unknown Unknown 2 50.0/62.0 \n",
"21 Unknown Unknown 1 208.0 \n",
"74 Unknown Unknown 1 155.0 \n",
"103 Unknown Unknown 2 26.0/155.0 \n",
"... ... ... ... ... \n",
"919191919 Unknown Unknown 1 20.0 \n",
"967191190 Unknown Unknown 1 31.0 \n",
"975318642 Unknown Unknown 1 20.0 \n",
"987654321 Unknown Unknown 2 22.0/23.0 \n",
"999999999 Unknown Unknown 1 30.0 \n",
"\n",
" max_loa num_types type \n",
"mmsi \n",
"1 156.0 4 Dredging/MilOps/Reserved/Towing \n",
"9 62.0 2 Pleasure/Tug \n",
"21 208.0 1 Unknown \n",
"74 155.0 1 Unknown \n",
"103 155.0 2 Tanker/Unknown \n",
"... ... ... ... \n",
"919191919 20.0 1 Pleasure \n",
"967191190 31.0 2 BigTow/Towing \n",
"975318642 20.0 1 Towing \n",
"987654321 23.0 2 Fishing/Towing \n",
"999999999 30.0 1 Pleasure \n",
"\n",
"[10771 rows x 10 columns]"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vessels = pd.read_csv(\"data/AIS/vessel_information.csv\", index_col='mmsi')\n",
"vessels"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "fd4a43ac",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Cargo 5622\n",
"Tanker 2440\n",
"Pleasure 601\n",
"Tug 221\n",
"Sailing 205\n",
" ... \n",
"AntiPol/Other 1\n",
"Fishing/Law 1\n",
"Cargo/Other/Towing 1\n",
"Cargo/Fishing 1\n",
"Fishing/Reserved/Towing 1\n",
"Name: type, Length: 206, dtype: int64"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vessels.type.value_counts()"
]
},
{
"cell_type": "markdown",
"id": "55e31328",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- The challenge, however, is that several ships have travelled multiple segments, so there is not a one-to-one relationship between the rows of the two tables. The table of vessel information has a *one-to-many* relationship with the segments.\n",
"\n",
"- In Pandas, we can combine tables according to the value of one or more *keys* that are used to identify rows, much like an index. Using a trivial example:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "fdb16698",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"df1 = pd.DataFrame(dict(id=range(4), age=np.random.randint(18, 31, size=4)))\n",
"df2 = pd.DataFrame(dict(id=range(6), score=np.random.random(size=6)))"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "22977cbb",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" age | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 24 | \n",
" 0.666117 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 23 | \n",
" 0.249913 | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" 26 | \n",
" 0.261396 | \n",
"
\n",
" \n",
" 3 | \n",
" 3 | \n",
" 30 | \n",
" 0.651306 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id age score\n",
"0 0 24 0.666117\n",
"1 1 23 0.249913\n",
"2 2 26 0.261396\n",
"3 3 30 0.651306"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(df1, df2)"
]
},
{
"cell_type": "markdown",
"id": "1d27e510",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Notice that without any information about which column to use as a key, Pandas did the right thing and used the `id` column in both tables. Unless specified otherwise, `merge` will use any common column names as keys for merging the tables. \n",
"\n",
"- By default, `merge` performs an **inner join** on the tables, meaning that the merged table represents an intersection of the two tables."
]
},
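{
"cell_type": "markdown",
"id": "c1a2b3d4",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- The key inference can also be made explicit. A minimal sketch on hypothetical toy frames (not the AIS data), showing that spelling out `on` and `how` reproduces the default behaviour:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"left = pd.DataFrame({'id': [0, 1, 2], 'age': [24, 23, 26]})\n",
"right = pd.DataFrame({'id': [1, 2, 3], 'score': [0.5, 0.8, 0.9]})\n",
"\n",
"# Explicit key and join type; equivalent to pd.merge(left, right)\n",
"merged = pd.merge(left, right, on='id', how='inner')\n",
"# Only ids 1 and 2 appear in both tables, so two rows survive\n",
"```"
]
},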
{
"cell_type": "code",
"execution_count": 42,
"id": "dab183d4",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" age | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 24.0 | \n",
" 0.666117 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 23.0 | \n",
" 0.249913 | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" 26.0 | \n",
" 0.261396 | \n",
"
\n",
" \n",
" 3 | \n",
" 3 | \n",
" 30.0 | \n",
" 0.651306 | \n",
"
\n",
" \n",
" 4 | \n",
" 4 | \n",
" NaN | \n",
" 0.468143 | \n",
"
\n",
" \n",
" 5 | \n",
" 5 | \n",
" NaN | \n",
" 0.461524 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id age score\n",
"0 0 24.0 0.666117\n",
"1 1 23.0 0.249913\n",
"2 2 26.0 0.261396\n",
"3 3 30.0 0.651306\n",
"4 4 NaN 0.468143\n",
"5 5 NaN 0.461524"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(df1, df2, how='outer')"
]
},
{
"cell_type": "markdown",
"id": "2d8409a8",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- The **outer join** above yields the union of the two tables, so all rows are represented, with missing values inserted as appropriate. One can also perform **left** and **right** joins to include all rows of the left or right table (*i.e.* the first or second argument to `merge`), but not necessarily of the other.\n",
"\n",
"- Looking at the two datasets that we wish to merge:"
]
},
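{
"cell_type": "markdown",
"id": "e5f6a7b8",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- For instance, on hypothetical toy frames (not the AIS data), a **left** join keeps every row of the first table and a **right** join keeps every row of the second, padding the unmatched side with `NaN`:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"left = pd.DataFrame({'id': [0, 1, 2], 'age': [24, 23, 26]})\n",
"right = pd.DataFrame({'id': [1, 2, 3], 'score': [0.5, 0.8, 0.9]})\n",
"\n",
"left_join = pd.merge(left, right, how='left')    # ids 0, 1, 2; score is NaN for id 0\n",
"right_join = pd.merge(left, right, how='right')  # ids 1, 2, 3; age is NaN for id 3\n",
"```"
]
},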
{
"cell_type": "code",
"execution_count": 43,
"id": "32ceffa7",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mmsi | \n",
" name | \n",
" transit | \n",
" segment | \n",
" seg_length | \n",
" avg_sog | \n",
" min_sog | \n",
" max_sog | \n",
" pdgt10 | \n",
" st_time | \n",
" end_time | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Us Govt Ves | \n",
" 1 | \n",
" 1 | \n",
" 5.1 | \n",
" 13.2 | \n",
" 9.2 | \n",
" 14.5 | \n",
" 96.5 | \n",
" 2/10/09 16:03 | \n",
" 2/10/09 16:27 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mmsi name transit segment seg_length avg_sog min_sog max_sog \\\n",
"0 1 Us Govt Ves 1 1 5.1 13.2 9.2 14.5 \n",
"\n",
" pdgt10 st_time end_time \n",
"0 96.5 2/10/09 16:03 2/10/09 16:27 "
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"segments.head(1)"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "17fa25c2",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" num_names | \n",
" names | \n",
" sov | \n",
" flag | \n",
" flag_type | \n",
" num_loas | \n",
" loa | \n",
" max_loa | \n",
" num_types | \n",
" type | \n",
"
\n",
" \n",
" mmsi | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" 8 | \n",
" Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 7 | \n",
" 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | \n",
" 156.0 | \n",
" 4 | \n",
" Dredging/MilOps/Reserved/Towing | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" num_names names sov \\\n",
"mmsi \n",
"1 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n",
"\n",
" flag flag_type num_loas loa \\\n",
"mmsi \n",
"1 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n",
"\n",
" max_loa num_types type \n",
"mmsi \n",
"1 156.0 4 Dredging/MilOps/Reserved/Towing "
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vessels.head(1)"
]
},
{
"cell_type": "markdown",
"id": "0766d18f",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- We see that there is a `mmsi` value (a vessel identifier) in each table, but it is used as the index of the `vessels` table. In this case, we have to tell `merge` to join on the index for this table and on the `mmsi` column for the other."
]
},
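{
"cell_type": "markdown",
"id": "f9a0b1c2",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- The same pattern on hypothetical toy frames (not the AIS data): `left_index=True` joins on the left table's index, while `right_on` names a column in the right table:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# A vessels-like table indexed by identifier, and a segments-like table with a key column\n",
"info = pd.DataFrame({'type': ['Tug', 'Cargo']},\n",
"                    index=pd.Index([1, 9], name='mmsi'))\n",
"trips = pd.DataFrame({'mmsi': [1, 1, 9], 'seg_length': [5.1, 13.5, 4.3]})\n",
"\n",
"# Join the indexed table to the column-keyed table\n",
"merged = pd.merge(info, trips, left_index=True, right_on='mmsi')\n",
"```"
]
},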
{
"cell_type": "code",
"execution_count": 45,
"id": "1c9d7b17",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"segments_merged = pd.merge(vessels, segments, left_index=True, right_on='mmsi')"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "6a8100a4",
"metadata": {
"scrolled": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" num_names | \n",
" names | \n",
" sov | \n",
" flag | \n",
" flag_type | \n",
" num_loas | \n",
" loa | \n",
" max_loa | \n",
" num_types | \n",
" type | \n",
" ... | \n",
" name | \n",
" transit | \n",
" segment | \n",
" seg_length | \n",
" avg_sog | \n",
" min_sog | \n",
" max_sog | \n",
" pdgt10 | \n",
" st_time | \n",
" end_time | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8 | \n",
" Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 7 | \n",
" 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | \n",
" 156.0 | \n",
" 4 | \n",
" Dredging/MilOps/Reserved/Towing | \n",
" ... | \n",
" Us Govt Ves | \n",
" 1 | \n",
" 1 | \n",
" 5.1 | \n",
" 13.2 | \n",
" 9.2 | \n",
" 14.5 | \n",
" 96.5 | \n",
" 2/10/09 16:03 | \n",
" 2/10/09 16:27 | \n",
"
\n",
" \n",
" 1 | \n",
" 8 | \n",
" Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 7 | \n",
" 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | \n",
" 156.0 | \n",
" 4 | \n",
" Dredging/MilOps/Reserved/Towing | \n",
" ... | \n",
" Dredge Capt Frank | \n",
" 1 | \n",
" 1 | \n",
" 13.5 | \n",
" 18.6 | \n",
" 10.4 | \n",
" 20.6 | \n",
" 100.0 | \n",
" 4/6/09 14:31 | \n",
" 4/6/09 15:20 | \n",
"
\n",
" \n",
" 2 | \n",
" 8 | \n",
" Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 7 | \n",
" 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | \n",
" 156.0 | \n",
" 4 | \n",
" Dredging/MilOps/Reserved/Towing | \n",
" ... | \n",
" Us Gov Vessel | \n",
" 1 | \n",
" 1 | \n",
" 4.3 | \n",
" 16.2 | \n",
" 10.3 | \n",
" 20.5 | \n",
" 100.0 | \n",
" 4/6/09 14:36 | \n",
" 4/6/09 14:55 | \n",
"
\n",
" \n",
" 3 | \n",
" 8 | \n",
" Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 7 | \n",
" 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | \n",
" 156.0 | \n",
" 4 | \n",
" Dredging/MilOps/Reserved/Towing | \n",
" ... | \n",
" Us Gov Vessel | \n",
" 2 | \n",
" 1 | \n",
" 9.2 | \n",
" 15.4 | \n",
" 14.5 | \n",
" 16.1 | \n",
" 100.0 | \n",
" 4/10/09 17:58 | \n",
" 4/10/09 18:34 | \n",
"
\n",
" \n",
" 4 | \n",
" 8 | \n",
" Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 7 | \n",
" 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | \n",
" 156.0 | \n",
" 4 | \n",
" Dredging/MilOps/Reserved/Towing | \n",
" ... | \n",
" Dredge Capt Frank | \n",
" 2 | \n",
" 1 | \n",
" 9.2 | \n",
" 15.4 | \n",
" 14.6 | \n",
" 16.2 | \n",
" 100.0 | \n",
" 4/10/09 17:59 | \n",
" 4/10/09 18:35 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 262521 | \n",
" 1 | \n",
" Triple Attraction | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 30.0 | \n",
" 30.0 | \n",
" 1 | \n",
" Pleasure | \n",
" ... | \n",
" Triple Attraction | \n",
" 3 | \n",
" 1 | \n",
" 5.3 | \n",
" 20.0 | \n",
" 19.6 | \n",
" 20.4 | \n",
" 100.0 | \n",
" 6/15/10 12:49 | \n",
" 6/15/10 13:05 | \n",
"
\n",
" \n",
" 262522 | \n",
" 1 | \n",
" Triple Attraction | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 30.0 | \n",
" 30.0 | \n",
" 1 | \n",
" Pleasure | \n",
" ... | \n",
" Triple Attraction | \n",
" 4 | \n",
" 1 | \n",
" 18.7 | \n",
" 19.2 | \n",
" 18.4 | \n",
" 19.9 | \n",
" 100.0 | \n",
" 6/15/10 21:32 | \n",
" 6/15/10 22:29 | \n",
"
\n",
" \n",
" 262523 | \n",
" 1 | \n",
" Triple Attraction | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 30.0 | \n",
" 30.0 | \n",
" 1 | \n",
" Pleasure | \n",
" ... | \n",
" Triple Attraction | \n",
" 6 | \n",
" 1 | \n",
" 17.4 | \n",
" 17.0 | \n",
" 14.7 | \n",
" 18.4 | \n",
" 100.0 | \n",
" 6/17/10 19:16 | \n",
" 6/17/10 20:17 | \n",
"
\n",
" \n",
" 262524 | \n",
" 1 | \n",
" Triple Attraction | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 30.0 | \n",
" 30.0 | \n",
" 1 | \n",
" Pleasure | \n",
" ... | \n",
" Triple Attraction | \n",
" 7 | \n",
" 1 | \n",
" 31.5 | \n",
" 14.2 | \n",
" 13.4 | \n",
" 15.1 | \n",
" 100.0 | \n",
" 6/18/10 2:52 | \n",
" 6/18/10 5:03 | \n",
"
\n",
" \n",
" 262525 | \n",
" 1 | \n",
" Triple Attraction | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 30.0 | \n",
" 30.0 | \n",
" 1 | \n",
" Pleasure | \n",
" ... | \n",
" Triple Attraction | \n",
" 8 | \n",
" 1 | \n",
" 19.8 | \n",
" 18.6 | \n",
" 16.1 | \n",
" 19.5 | \n",
" 100.0 | \n",
" 6/18/10 10:19 | \n",
" 6/18/10 11:22 | \n",
"
\n",
" \n",
"
\n",
"
262353 rows × 21 columns
\n",
"
"
],
"text/plain": [
" num_names names sov \\\n",
"0 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n",
"1 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n",
"2 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n",
"3 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n",
"4 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n",
"... ... ... .. \n",
"262521 1 Triple Attraction N \n",
"262522 1 Triple Attraction N \n",
"262523 1 Triple Attraction N \n",
"262524 1 Triple Attraction N \n",
"262525 1 Triple Attraction N \n",
"\n",
" flag flag_type num_loas loa \\\n",
"0 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n",
"1 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n",
"2 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n",
"3 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n",
"4 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n",
"... ... ... ... ... \n",
"262521 Unknown Unknown 1 30.0 \n",
"262522 Unknown Unknown 1 30.0 \n",
"262523 Unknown Unknown 1 30.0 \n",
"262524 Unknown Unknown 1 30.0 \n",
"262525 Unknown Unknown 1 30.0 \n",
"\n",
" max_loa num_types type ... \\\n",
"0 156.0 4 Dredging/MilOps/Reserved/Towing ... \n",
"1 156.0 4 Dredging/MilOps/Reserved/Towing ... \n",
"2 156.0 4 Dredging/MilOps/Reserved/Towing ... \n",
"3 156.0 4 Dredging/MilOps/Reserved/Towing ... \n",
"4 156.0 4 Dredging/MilOps/Reserved/Towing ... \n",
"... ... ... ... ... \n",
"262521 30.0 1 Pleasure ... \n",
"262522 30.0 1 Pleasure ... \n",
"262523 30.0 1 Pleasure ... \n",
"262524 30.0 1 Pleasure ... \n",
"262525 30.0 1 Pleasure ... \n",
"\n",
" name transit segment seg_length avg_sog min_sog \\\n",
"0 Us Govt Ves 1 1 5.1 13.2 9.2 \n",
"1 Dredge Capt Frank 1 1 13.5 18.6 10.4 \n",
"2 Us Gov Vessel 1 1 4.3 16.2 10.3 \n",
"3 Us Gov Vessel 2 1 9.2 15.4 14.5 \n",
"4 Dredge Capt Frank 2 1 9.2 15.4 14.6 \n",
"... ... ... ... ... ... ... \n",
"262521 Triple Attraction 3 1 5.3 20.0 19.6 \n",
"262522 Triple Attraction 4 1 18.7 19.2 18.4 \n",
"262523 Triple Attraction 6 1 17.4 17.0 14.7 \n",
"262524 Triple Attraction 7 1 31.5 14.2 13.4 \n",
"262525 Triple Attraction 8 1 19.8 18.6 16.1 \n",
"\n",
" max_sog pdgt10 st_time end_time \n",
"0 14.5 96.5 2/10/09 16:03 2/10/09 16:27 \n",
"1 20.6 100.0 4/6/09 14:31 4/6/09 15:20 \n",
"2 20.5 100.0 4/6/09 14:36 4/6/09 14:55 \n",
"3 16.1 100.0 4/10/09 17:58 4/10/09 18:34 \n",
"4 16.2 100.0 4/10/09 17:59 4/10/09 18:35 \n",
"... ... ... ... ... \n",
"262521 20.4 100.0 6/15/10 12:49 6/15/10 13:05 \n",
"262522 19.9 100.0 6/15/10 21:32 6/15/10 22:29 \n",
"262523 18.4 100.0 6/17/10 19:16 6/17/10 20:17 \n",
"262524 15.1 100.0 6/18/10 2:52 6/18/10 5:03 \n",
"262525 19.5 100.0 6/18/10 10:19 6/18/10 11:22 \n",
"\n",
"[262353 rows x 21 columns]"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"segments_merged"
]
},
{
"cell_type": "markdown",
"id": "5e9869a2",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- In this case, the default inner join is suitable; we are not interested in observations from either table that do not have corresponding entries in the other. \n",
"\n",
"- Notice that the `mmsi` field, which was the index of the `vessels` table, is no longer an index on the merged table.\n",
"\n",
"- Here, we used the `merge` function to perform the merge; we could also have used the `merge` method for either of the tables:"
]
},
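{
"cell_type": "markdown",
"id": "d3e4f5a6",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- If the identifier is wanted back on the index after merging, it can be restored with `set_index`. A small sketch on hypothetical toy frames (not the AIS data):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"info = pd.DataFrame({'type': ['Tug', 'Cargo']},\n",
"                    index=pd.Index([1, 9], name='mmsi'))\n",
"trips = pd.DataFrame({'mmsi': [1, 9, 9], 'seg_length': [5.1, 4.3, 2.0]})\n",
"\n",
"# Merge, then put the identifier back on the index\n",
"merged = info.merge(trips, left_index=True, right_on='mmsi').set_index('mmsi')\n",
"```"
]
},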
{
"cell_type": "code",
"execution_count": 47,
"id": "34588c9a",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" num_names | \n",
" names | \n",
" sov | \n",
" flag | \n",
" flag_type | \n",
" num_loas | \n",
" loa | \n",
" max_loa | \n",
" num_types | \n",
" type | \n",
" ... | \n",
" name | \n",
" transit | \n",
" segment | \n",
" seg_length | \n",
" avg_sog | \n",
" min_sog | \n",
" max_sog | \n",
" pdgt10 | \n",
" st_time | \n",
" end_time | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8 | \n",
" Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 7 | \n",
" 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | \n",
" 156.0 | \n",
" 4 | \n",
" Dredging/MilOps/Reserved/Towing | \n",
" ... | \n",
" Us Govt Ves | \n",
" 1 | \n",
" 1 | \n",
" 5.1 | \n",
" 13.2 | \n",
" 9.2 | \n",
" 14.5 | \n",
" 96.5 | \n",
" 2/10/09 16:03 | \n",
" 2/10/09 16:27 | \n",
"
\n",
" \n",
" 1 | \n",
" 8 | \n",
" Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 7 | \n",
" 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | \n",
" 156.0 | \n",
" 4 | \n",
" Dredging/MilOps/Reserved/Towing | \n",
" ... | \n",
" Dredge Capt Frank | \n",
" 1 | \n",
" 1 | \n",
" 13.5 | \n",
" 18.6 | \n",
" 10.4 | \n",
" 20.6 | \n",
" 100.0 | \n",
" 4/6/09 14:31 | \n",
" 4/6/09 15:20 | \n",
"
\n",
" \n",
" 2 | \n",
" 8 | \n",
" Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 7 | \n",
" 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | \n",
" 156.0 | \n",
" 4 | \n",
" Dredging/MilOps/Reserved/Towing | \n",
" ... | \n",
" Us Gov Vessel | \n",
" 1 | \n",
" 1 | \n",
" 4.3 | \n",
" 16.2 | \n",
" 10.3 | \n",
" 20.5 | \n",
" 100.0 | \n",
" 4/6/09 14:36 | \n",
" 4/6/09 14:55 | \n",
"
\n",
" \n",
" 3 | \n",
" 8 | \n",
" Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 7 | \n",
" 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | \n",
" 156.0 | \n",
" 4 | \n",
" Dredging/MilOps/Reserved/Towing | \n",
" ... | \n",
" Us Gov Vessel | \n",
" 2 | \n",
" 1 | \n",
" 9.2 | \n",
" 15.4 | \n",
" 14.5 | \n",
" 16.1 | \n",
" 100.0 | \n",
" 4/10/09 17:58 | \n",
" 4/10/09 18:34 | \n",
"
\n",
" \n",
" 4 | \n",
" 8 | \n",
" Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 7 | \n",
" 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | \n",
" 156.0 | \n",
" 4 | \n",
" Dredging/MilOps/Reserved/Towing | \n",
" ... | \n",
" Dredge Capt Frank | \n",
" 2 | \n",
" 1 | \n",
" 9.2 | \n",
" 15.4 | \n",
" 14.6 | \n",
" 16.2 | \n",
" 100.0 | \n",
" 4/10/09 17:59 | \n",
" 4/10/09 18:35 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 262521 | \n",
" 1 | \n",
" Triple Attraction | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 30.0 | \n",
" 30.0 | \n",
" 1 | \n",
" Pleasure | \n",
" ... | \n",
" Triple Attraction | \n",
" 3 | \n",
" 1 | \n",
" 5.3 | \n",
" 20.0 | \n",
" 19.6 | \n",
" 20.4 | \n",
" 100.0 | \n",
" 6/15/10 12:49 | \n",
" 6/15/10 13:05 | \n",
"
\n",
" \n",
" 262522 | \n",
" 1 | \n",
" Triple Attraction | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 30.0 | \n",
" 30.0 | \n",
" 1 | \n",
" Pleasure | \n",
" ... | \n",
" Triple Attraction | \n",
" 4 | \n",
" 1 | \n",
" 18.7 | \n",
" 19.2 | \n",
" 18.4 | \n",
" 19.9 | \n",
" 100.0 | \n",
" 6/15/10 21:32 | \n",
" 6/15/10 22:29 | \n",
"
\n",
" \n",
" 262523 | \n",
" 1 | \n",
" Triple Attraction | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 30.0 | \n",
" 30.0 | \n",
" 1 | \n",
" Pleasure | \n",
" ... | \n",
" Triple Attraction | \n",
" 6 | \n",
" 1 | \n",
" 17.4 | \n",
" 17.0 | \n",
" 14.7 | \n",
" 18.4 | \n",
" 100.0 | \n",
" 6/17/10 19:16 | \n",
" 6/17/10 20:17 | \n",
"
\n",
" \n",
" 262524 | \n",
" 1 | \n",
" Triple Attraction | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 30.0 | \n",
" 30.0 | \n",
" 1 | \n",
" Pleasure | \n",
" ... | \n",
" Triple Attraction | \n",
" 7 | \n",
" 1 | \n",
" 31.5 | \n",
" 14.2 | \n",
" 13.4 | \n",
" 15.1 | \n",
" 100.0 | \n",
" 6/18/10 2:52 | \n",
" 6/18/10 5:03 | \n",
"
\n",
" \n",
" 262525 | \n",
" 1 | \n",
" Triple Attraction | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 30.0 | \n",
" 30.0 | \n",
" 1 | \n",
" Pleasure | \n",
" ... | \n",
" Triple Attraction | \n",
" 8 | \n",
" 1 | \n",
" 19.8 | \n",
" 18.6 | \n",
" 16.1 | \n",
" 19.5 | \n",
" 100.0 | \n",
" 6/18/10 10:19 | \n",
" 6/18/10 11:22 | \n",
"
\n",
" \n",
"
\n",
"
262353 rows × 21 columns
\n",
"
"
],
"text/plain": [
" num_names names sov \\\n",
"0 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n",
"1 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n",
"2 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n",
"3 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n",
"4 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n",
"... ... ... .. \n",
"262521 1 Triple Attraction N \n",
"262522 1 Triple Attraction N \n",
"262523 1 Triple Attraction N \n",
"262524 1 Triple Attraction N \n",
"262525 1 Triple Attraction N \n",
"\n",
" flag flag_type num_loas loa \\\n",
"0 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n",
"1 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n",
"2 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n",
"3 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n",
"4 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n",
"... ... ... ... ... \n",
"262521 Unknown Unknown 1 30.0 \n",
"262522 Unknown Unknown 1 30.0 \n",
"262523 Unknown Unknown 1 30.0 \n",
"262524 Unknown Unknown 1 30.0 \n",
"262525 Unknown Unknown 1 30.0 \n",
"\n",
" max_loa num_types type ... \\\n",
"0 156.0 4 Dredging/MilOps/Reserved/Towing ... \n",
"1 156.0 4 Dredging/MilOps/Reserved/Towing ... \n",
"2 156.0 4 Dredging/MilOps/Reserved/Towing ... \n",
"3 156.0 4 Dredging/MilOps/Reserved/Towing ... \n",
"4 156.0 4 Dredging/MilOps/Reserved/Towing ... \n",
"... ... ... ... ... \n",
"262521 30.0 1 Pleasure ... \n",
"262522 30.0 1 Pleasure ... \n",
"262523 30.0 1 Pleasure ... \n",
"262524 30.0 1 Pleasure ... \n",
"262525 30.0 1 Pleasure ... \n",
"\n",
" name transit segment seg_length avg_sog min_sog \\\n",
"0 Us Govt Ves 1 1 5.1 13.2 9.2 \n",
"1 Dredge Capt Frank 1 1 13.5 18.6 10.4 \n",
"2 Us Gov Vessel 1 1 4.3 16.2 10.3 \n",
"3 Us Gov Vessel 2 1 9.2 15.4 14.5 \n",
"4 Dredge Capt Frank 2 1 9.2 15.4 14.6 \n",
"... ... ... ... ... ... ... \n",
"262521 Triple Attraction 3 1 5.3 20.0 19.6 \n",
"262522 Triple Attraction 4 1 18.7 19.2 18.4 \n",
"262523 Triple Attraction 6 1 17.4 17.0 14.7 \n",
"262524 Triple Attraction 7 1 31.5 14.2 13.4 \n",
"262525 Triple Attraction 8 1 19.8 18.6 16.1 \n",
"\n",
" max_sog pdgt10 st_time end_time \n",
"0 14.5 96.5 2/10/09 16:03 2/10/09 16:27 \n",
"1 20.6 100.0 4/6/09 14:31 4/6/09 15:20 \n",
"2 20.5 100.0 4/6/09 14:36 4/6/09 14:55 \n",
"3 16.1 100.0 4/10/09 17:58 4/10/09 18:34 \n",
"4 16.2 100.0 4/10/09 17:59 4/10/09 18:35 \n",
"... ... ... ... ... \n",
"262521 20.4 100.0 6/15/10 12:49 6/15/10 13:05 \n",
"262522 19.9 100.0 6/15/10 21:32 6/15/10 22:29 \n",
"262523 18.4 100.0 6/17/10 19:16 6/17/10 20:17 \n",
"262524 15.1 100.0 6/18/10 2:52 6/18/10 5:03 \n",
"262525 19.5 100.0 6/18/10 10:19 6/18/10 11:22 \n",
"\n",
"[262353 rows x 21 columns]"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vessels.merge(segments, left_index=True, right_on='mmsi')"
]
},
{
"cell_type": "markdown",
"id": "b29c609f",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Occasionally, there will be fields with the same name in both tables that we do not wish to use to join the tables; they may contain different information, despite sharing a name. In this case, Pandas will by default append the suffixes `_x` and `_y` to the columns to uniquely identify them."
]
},
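{
"cell_type": "markdown",
"id": "b7c8d9e0",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- The suffixes can be overridden with the `suffixes` argument to `merge`. A minimal sketch on hypothetical toy frames (not the AIS data):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"a = pd.DataFrame({'id': [1, 2], 'type': ['Tug', 'Cargo']})\n",
"b = pd.DataFrame({'id': [1, 2], 'type': ['foo', 'bar']})\n",
"\n",
"# 'id' is the join key; the shared 'type' column gets the given suffixes\n",
"merged = pd.merge(a, b, on='id', suffixes=('_left', '_right'))\n",
"# columns: id, type_left, type_right\n",
"```"
]
},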
{
"cell_type": "code",
"execution_count": 48,
"id": "98fa379d",
"metadata": {
"scrolled": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" num_names | \n",
" names | \n",
" sov | \n",
" flag | \n",
" flag_type | \n",
" num_loas | \n",
" loa | \n",
" max_loa | \n",
" num_types | \n",
" type_x | \n",
" ... | \n",
" transit | \n",
" segment | \n",
" seg_length | \n",
" avg_sog | \n",
" min_sog | \n",
" max_sog | \n",
" pdgt10 | \n",
" st_time | \n",
" end_time | \n",
" type_y | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8 | \n",
" Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 7 | \n",
" 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | \n",
" 156.0 | \n",
" 4 | \n",
" Dredging/MilOps/Reserved/Towing | \n",
" ... | \n",
" 1 | \n",
" 1 | \n",
" 5.1 | \n",
" 13.2 | \n",
" 9.2 | \n",
" 14.5 | \n",
" 96.5 | \n",
" 2/10/09 16:03 | \n",
" 2/10/09 16:27 | \n",
" foo | \n",
"
\n",
" \n",
" 1 | \n",
" 8 | \n",
" Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 7 | \n",
" 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | \n",
" 156.0 | \n",
" 4 | \n",
" Dredging/MilOps/Reserved/Towing | \n",
" ... | \n",
" 1 | \n",
" 1 | \n",
" 13.5 | \n",
" 18.6 | \n",
" 10.4 | \n",
" 20.6 | \n",
" 100.0 | \n",
" 4/6/09 14:31 | \n",
" 4/6/09 15:20 | \n",
" foo | \n",
"
\n",
" \n",
" 2 | \n",
" 8 | \n",
" Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 7 | \n",
" 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | \n",
" 156.0 | \n",
" 4 | \n",
" Dredging/MilOps/Reserved/Towing | \n",
" ... | \n",
" 1 | \n",
" 1 | \n",
" 4.3 | \n",
" 16.2 | \n",
" 10.3 | \n",
" 20.5 | \n",
" 100.0 | \n",
" 4/6/09 14:36 | \n",
" 4/6/09 14:55 | \n",
" foo | \n",
"
\n",
" \n",
" 3 | \n",
" 8 | \n",
" Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 7 | \n",
" 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | \n",
" 156.0 | \n",
" 4 | \n",
" Dredging/MilOps/Reserved/Towing | \n",
" ... | \n",
" 2 | \n",
" 1 | \n",
" 9.2 | \n",
" 15.4 | \n",
" 14.5 | \n",
" 16.1 | \n",
" 100.0 | \n",
" 4/10/09 17:58 | \n",
" 4/10/09 18:34 | \n",
" foo | \n",
"
\n",
" \n",
" 4 | \n",
" 8 | \n",
" Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | \n",
" Y | \n",
" Unknown | \n",
" Unknown | \n",
" 7 | \n",
" 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | \n",
" 156.0 | \n",
" 4 | \n",
" Dredging/MilOps/Reserved/Towing | \n",
" ... | \n",
" 2 | \n",
" 1 | \n",
" 9.2 | \n",
" 15.4 | \n",
" 14.6 | \n",
" 16.2 | \n",
" 100.0 | \n",
" 4/10/09 17:59 | \n",
" 4/10/09 18:35 | \n",
" foo | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 262521 | \n",
" 1 | \n",
" Triple Attraction | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 30.0 | \n",
" 30.0 | \n",
" 1 | \n",
" Pleasure | \n",
" ... | \n",
" 3 | \n",
" 1 | \n",
" 5.3 | \n",
" 20.0 | \n",
" 19.6 | \n",
" 20.4 | \n",
" 100.0 | \n",
" 6/15/10 12:49 | \n",
" 6/15/10 13:05 | \n",
" foo | \n",
"
\n",
" \n",
" 262522 | \n",
" 1 | \n",
" Triple Attraction | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 30.0 | \n",
" 30.0 | \n",
" 1 | \n",
" Pleasure | \n",
" ... | \n",
" 4 | \n",
" 1 | \n",
" 18.7 | \n",
" 19.2 | \n",
" 18.4 | \n",
" 19.9 | \n",
" 100.0 | \n",
" 6/15/10 21:32 | \n",
" 6/15/10 22:29 | \n",
" foo | \n",
"
\n",
" \n",
" 262523 | \n",
" 1 | \n",
" Triple Attraction | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 30.0 | \n",
" 30.0 | \n",
" 1 | \n",
" Pleasure | \n",
" ... | \n",
" 6 | \n",
" 1 | \n",
" 17.4 | \n",
" 17.0 | \n",
" 14.7 | \n",
" 18.4 | \n",
" 100.0 | \n",
" 6/17/10 19:16 | \n",
" 6/17/10 20:17 | \n",
" foo | \n",
"
\n",
" \n",
" 262524 | \n",
" 1 | \n",
" Triple Attraction | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 30.0 | \n",
" 30.0 | \n",
" 1 | \n",
" Pleasure | \n",
" ... | \n",
" 7 | \n",
" 1 | \n",
" 31.5 | \n",
" 14.2 | \n",
" 13.4 | \n",
" 15.1 | \n",
" 100.0 | \n",
" 6/18/10 2:52 | \n",
" 6/18/10 5:03 | \n",
" foo | \n",
"
\n",
" \n",
" 262525 | \n",
" 1 | \n",
" Triple Attraction | \n",
" N | \n",
" Unknown | \n",
" Unknown | \n",
" 1 | \n",
" 30.0 | \n",
" 30.0 | \n",
" 1 | \n",
" Pleasure | \n",
" ... | \n",
" 8 | \n",
" 1 | \n",
" 19.8 | \n",
" 18.6 | \n",
" 16.1 | \n",
" 19.5 | \n",
" 100.0 | \n",
" 6/18/10 10:19 | \n",
" 6/18/10 11:22 | \n",
" foo | \n",
"
\n",
" \n",
"
\n",
"
262353 rows × 22 columns
\n",
"
"
],
"text/plain": [
" num_names names sov \\\n",
"0 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n",
"1 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n",
"2 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n",
"3 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n",
"4 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y \n",
"... ... ... .. \n",
"262521 1 Triple Attraction N \n",
"262522 1 Triple Attraction N \n",
"262523 1 Triple Attraction N \n",
"262524 1 Triple Attraction N \n",
"262525 1 Triple Attraction N \n",
"\n",
" flag flag_type num_loas loa \\\n",
"0 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n",
"1 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n",
"2 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n",
"3 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n",
"4 Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 \n",
"... ... ... ... ... \n",
"262521 Unknown Unknown 1 30.0 \n",
"262522 Unknown Unknown 1 30.0 \n",
"262523 Unknown Unknown 1 30.0 \n",
"262524 Unknown Unknown 1 30.0 \n",
"262525 Unknown Unknown 1 30.0 \n",
"\n",
" max_loa num_types type_x ... transit \\\n",
"0 156.0 4 Dredging/MilOps/Reserved/Towing ... 1 \n",
"1 156.0 4 Dredging/MilOps/Reserved/Towing ... 1 \n",
"2 156.0 4 Dredging/MilOps/Reserved/Towing ... 1 \n",
"3 156.0 4 Dredging/MilOps/Reserved/Towing ... 2 \n",
"4 156.0 4 Dredging/MilOps/Reserved/Towing ... 2 \n",
"... ... ... ... ... ... \n",
"262521 30.0 1 Pleasure ... 3 \n",
"262522 30.0 1 Pleasure ... 4 \n",
"262523 30.0 1 Pleasure ... 6 \n",
"262524 30.0 1 Pleasure ... 7 \n",
"262525 30.0 1 Pleasure ... 8 \n",
"\n",
" segment seg_length avg_sog min_sog max_sog pdgt10 st_time \\\n",
"0 1 5.1 13.2 9.2 14.5 96.5 2/10/09 16:03 \n",
"1 1 13.5 18.6 10.4 20.6 100.0 4/6/09 14:31 \n",
"2 1 4.3 16.2 10.3 20.5 100.0 4/6/09 14:36 \n",
"3 1 9.2 15.4 14.5 16.1 100.0 4/10/09 17:58 \n",
"4 1 9.2 15.4 14.6 16.2 100.0 4/10/09 17:59 \n",
"... ... ... ... ... ... ... ... \n",
"262521 1 5.3 20.0 19.6 20.4 100.0 6/15/10 12:49 \n",
"262522 1 18.7 19.2 18.4 19.9 100.0 6/15/10 21:32 \n",
"262523 1 17.4 17.0 14.7 18.4 100.0 6/17/10 19:16 \n",
"262524 1 31.5 14.2 13.4 15.1 100.0 6/18/10 2:52 \n",
"262525 1 19.8 18.6 16.1 19.5 100.0 6/18/10 10:19 \n",
"\n",
" end_time type_y \n",
"0 2/10/09 16:27 foo \n",
"1 4/6/09 15:20 foo \n",
"2 4/6/09 14:55 foo \n",
"3 4/10/09 18:34 foo \n",
"4 4/10/09 18:35 foo \n",
"... ... ... \n",
"262521 6/15/10 13:05 foo \n",
"262522 6/15/10 22:29 foo \n",
"262523 6/17/10 20:17 foo \n",
"262524 6/18/10 5:03 foo \n",
"262525 6/18/10 11:22 foo \n",
"\n",
"[262353 rows x 22 columns]"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"segments['type'] = 'foo'\n",
"pd.merge(vessels, segments, left_index=True, right_on='mmsi')"
]
},
{
"cell_type": "markdown",
"id": "b034ddd3",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- This behavior can be overridden by specifying a `suffixes` argument, containing a tuple of the suffixes to be appended to the overlapping column names of the left and right DataFrames, respectively."
]
},
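{
"cell_type": "markdown",
"id": "f1a2b3c4",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- As a minimal sketch of the `suffixes` argument (using two small hypothetical DataFrames, `left` and `right`, rather than the vessel data):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f1a2b3c5",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"left = pd.DataFrame({'key': [1, 2], 'value': [10, 20]})\n",
"right = pd.DataFrame({'key': [1, 2], 'value': [30, 40]})\n",
"# The overlapping 'value' columns become 'value_left' and 'value_right'\n",
"pd.merge(left, right, on='key', suffixes=('_left', '_right'))"
]
},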
{
"cell_type": "markdown",
"id": "0efb4d13",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Concatenation\n",
"\n",
"A common data manipulation is appending rows or columns to a dataset, where the new data must conform to the dimensions of the existing columns or rows, respectively. In NumPy, this is done either with `concatenate` or the convenience functions `c_` and `r_`:"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "a1460278",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([0.42336207, 0.74807952, 0.61839076, 0.54794432, 0.06227732,\n",
" 0.71618874, 0.31763132, 0.26021656, 0.22395665, 0.08499033])"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.concatenate([np.random.random(5), np.random.random(5)])"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "6fb9b19a",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([0.48601962, 0.11484701, 0.93892836, 0.16884999, 0.71700162,\n",
" 0.92519913, 0.26827622, 0.41866975, 0.59348726, 0.06054373])"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.r_[np.random.random(5), np.random.random(5)]"
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "52cc9244",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.48861909, 0.23961022],\n",
" [0.68685816, 0.7662155 ],\n",
" [0.76304197, 0.63356894],\n",
" [0.45533848, 0.36265383],\n",
" [0.85205653, 0.84605096]])"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.c_[np.random.random(5), np.random.random(5)]"
]
},
{
"cell_type": "markdown",
"id": "ef185e6f",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- This operation is also called *binding* or *stacking*. With Pandas' indexed data structures, there are additional considerations, as the overlap in index values between two data structures affects how they are concatenated.\n",
"\n",
"- Let's import two microbiome datasets, each consisting of counts of microorganisms from a particular patient. We will use the first column of each dataset as the index."
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "85e0f589",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Looking in indexes: https://mirrors.163.com/pypi/simple/\r\n",
"Requirement already satisfied: xlrd in /home/fli/.local/lib/python3.9/site-packages (2.0.1)\r\n",
"Requirement already satisfied: openpyxl in /home/fli/.local/lib/python3.9/site-packages (3.0.9)\r\n",
"Requirement already satisfied: et-xmlfile in /home/fli/.local/lib/python3.9/site-packages (from openpyxl) (1.0.1)\r\n"
]
}
],
"source": [
"# Pandas requires external modules to read Excel files\n",
"! pip3 install xlrd openpyxl --user"
]
},
{
"cell_type": "code",
"execution_count": 53,
"id": "bc15a3d7",
"metadata": {
"scrolled": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"((272, 1), (288, 1))"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mb1 = pd.read_excel('data/microbiome/MID1.xls', 'Sheet 1', index_col=0, header=None)\n",
"mb2 = pd.read_excel('data/microbiome/MID2.xls', 'Sheet 1', index_col=0, header=None)\n",
"mb1.shape, mb2.shape"
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "f348aed9",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" 1\n",
"0 \n",
"Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 7\n",
"Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 2\n",
"Archaea \"Crenarchaeota\" Thermoprotei Sulfolobal... 3\n",
"Archaea \"Crenarchaeota\" Thermoprotei Thermoprot... 3\n",
"Archaea \"Euryarchaeota\" \"Methanomicrobia\" Metha... 7\n",
"... ...\n",
"Bacteria \"Thermotogae\" Thermotogae Thermotogale... 9\n",
"Bacteria \"Verrucomicrobia\" Opitutae Opitutales ... 1\n",
"Bacteria Cyanobacteria Cyanobacteria Chloropla... 2\n",
"Bacteria Cyanobacteria Cyanobacteria Chloropla... 85\n",
"Bacteria Cyanobacteria Cyanobacteria Chloropla... 1388\n",
"\n",
"[272 rows x 1 columns]"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mb1"
]
},
{
"cell_type": "markdown",
"id": "408fb81f",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Let's give the index and columns meaningful labels:"
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "d0125a78",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"mb1.columns = mb2.columns = ['Count']"
]
},
{
"cell_type": "code",
"execution_count": 56,
"id": "47a30002",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"mb1.index.name = mb2.index.name = 'Taxon'"
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "7cf2b2ab",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Count\n",
"Taxon \n",
"Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 7\n",
"Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 2\n",
"Archaea \"Crenarchaeota\" Thermoprotei Sulfolobal... 3\n",
"Archaea \"Crenarchaeota\" Thermoprotei Thermoprot... 3\n",
"Archaea \"Euryarchaeota\" \"Methanomicrobia\" Metha... 7\n",
"... ...\n",
"Bacteria \"Thermotogae\" Thermotogae Thermotogale... 9\n",
"Bacteria \"Verrucomicrobia\" Opitutae Opitutales ... 1\n",
"Bacteria Cyanobacteria Cyanobacteria Chloropla... 2\n",
"Bacteria Cyanobacteria Cyanobacteria Chloropla... 85\n",
"Bacteria Cyanobacteria Cyanobacteria Chloropla... 1388\n",
"\n",
"[272 rows x 1 columns]"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mb1"
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "483dcac0",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Archaea \"Crenarchaeota\" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera',\n",
" 'Archaea \"Crenarchaeota\" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus',\n",
" 'Archaea \"Crenarchaeota\" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus'],\n",
" dtype='object', name='Taxon')"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mb1.index[:3]"
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "6c35dc94",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mb1.index.is_unique"
]
},
{
"cell_type": "markdown",
"id": "8d2a5674",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- If we concatenate along `axis=0` (the default), we will obtain another DataFrame with the rows concatenated:"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "d03a160f",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(560, 1)"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([mb1, mb2], axis=0).shape"
]
},
{
"cell_type": "markdown",
"id": "7f3bd46b",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"However, the index is no longer unique, due to overlap between the two DataFrames."
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "bc511917",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([mb1, mb2], axis=0).index.is_unique"
]
},
{
"cell_type": "markdown",
"id": "2d4057aa",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Concatenating along `axis=1` will concatenate column-wise, but respecting the indices of the two DataFrames."
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "100de543",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(438, 2)"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([mb1, mb2], axis=1).shape"
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "a515f6bd",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Count Count\n",
"Taxon \n",
"Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 7.0 23.0\n",
"Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 2.0 2.0\n",
"Archaea \"Crenarchaeota\" Thermoprotei Sulfolobal... 3.0 10.0\n",
"Archaea \"Crenarchaeota\" Thermoprotei Thermoprot... 3.0 9.0\n",
"Archaea \"Euryarchaeota\" \"Methanomicrobia\" Metha... 7.0 9.0\n",
"... ... ...\n",
"Bacteria \"Proteobacteria\" Gammaproteobacteria O... NaN 1.0\n",
"Bacteria \"Proteobacteria\" Gammaproteobacteria T... NaN 9.0\n",
"Bacteria \"Proteobacteria\" Gammaproteobacteria T... NaN 1.0\n",
"Bacteria \"Thermodesulfobacteria\" Thermodesulfob... NaN 3.0\n",
"Bacteria TM7 TM7_genera_incertae_sedis NaN 2.0\n",
"\n",
"[438 rows x 2 columns]"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([mb1, mb2], axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 64,
"id": "5613a854",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 7., 23.],\n",
" [ 2., 2.],\n",
" [ 3., 10.],\n",
" [ 3., 9.],\n",
" [ 7., 9.]])"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([mb1, mb2], axis=1).values[:5]"
]
},
{
"cell_type": "markdown",
"id": "44b0adc2",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- If we are only interested in taxa that are included in both DataFrames, we can specify a `join='inner'` argument."
]
},
{
"cell_type": "code",
"execution_count": 65,
"id": "dc8b4304",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Count Count\n",
"Taxon \n",
"Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 7 23\n",
"Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 2 2\n",
"Archaea \"Crenarchaeota\" Thermoprotei Sulfolobal... 3 10\n",
"Archaea \"Crenarchaeota\" Thermoprotei Thermoprot... 3 9\n",
"Archaea \"Euryarchaeota\" \"Methanomicrobia\" Metha... 7 9\n",
"... ... ...\n",
"Bacteria \"Thermodesulfobacteria\" Thermodesulfob... 1 1\n",
"Bacteria \"Thermotogae\" Thermotogae Thermotogale... 7 15\n",
"Bacteria \"Thermotogae\" Thermotogae Thermotogale... 9 22\n",
"Bacteria Cyanobacteria Cyanobacteria Chloropla... 85 1\n",
"Bacteria Cyanobacteria Cyanobacteria Chloropla... 1388 2\n",
"\n",
"[122 rows x 2 columns]"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([mb1, mb2], axis=1, join='inner')"
]
},
{
"cell_type": "markdown",
"id": "9f463b07",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- If we wanted to use the second table to fill values absent from the first table, we could use `combine_first`."
]
},
{
"cell_type": "code",
"execution_count": 66,
"id": "eae22a1d",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Count\n",
"Taxon \n",
"Archaea \"Crenarchaeota\" Thermoprotei Acidilobal... 2\n",
"Archaea \"Crenarchaeota\" Thermoprotei Acidilobal... 14\n",
"Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 7\n",
"Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 1\n",
"Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 2\n",
"... ...\n",
"Bacteria \"Verrucomicrobia\" Opitutae Opitutales ... 1\n",
"Bacteria Cyanobacteria Cyanobacteria Chloropla... 2\n",
"Bacteria Cyanobacteria Cyanobacteria Chloropla... 85\n",
"Bacteria Cyanobacteria Cyanobacteria Chloropla... 1388\n",
"Bacteria TM7 TM7_genera_incertae_sedis 2\n",
"\n",
"[438 rows x 1 columns]"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mb1.combine_first(mb2)"
]
},
{
"cell_type": "markdown",
"id": "f11404c4",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Alternatively, you can pass keys to the concatenation by supplying the DataFrames (or Series) as a dict."
]
},
{
"cell_type": "code",
"execution_count": 67,
"id": "d96c0db5",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
" patient1 patient2\n",
" Count Count\n",
"Taxon \n",
"Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 7.0 23.0\n",
"Archaea \"Crenarchaeota\" Thermoprotei Desulfuroc... 2.0 2.0\n",
"Archaea \"Crenarchaeota\" Thermoprotei Sulfolobal... 3.0 10.0\n",
"Archaea \"Crenarchaeota\" Thermoprotei Thermoprot... 3.0 9.0\n",
"Archaea \"Euryarchaeota\" \"Methanomicrobia\" Metha... 7.0 9.0\n",
"... ... ...\n",
"Bacteria \"Proteobacteria\" Gammaproteobacteria O... NaN 1.0\n",
"Bacteria \"Proteobacteria\" Gammaproteobacteria T... NaN 9.0\n",
"Bacteria \"Proteobacteria\" Gammaproteobacteria T... NaN 1.0\n",
"Bacteria \"Thermodesulfobacteria\" Thermodesulfob... NaN 3.0\n",
"Bacteria TM7 TM7_genera_incertae_sedis NaN 2.0\n",
"\n",
"[438 rows x 2 columns]"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat(dict(patient1=mb1, patient2=mb2), axis=1)"
]
},
{
"cell_type": "markdown",
"id": "c86b6d65",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- If you want `concat` to work like `numpy.concatenate`, you may provide the `ignore_index=True` argument."
]
},
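{
"cell_type": "markdown",
"id": "e5f6a7b8",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- A minimal sketch with two small hypothetical Series: `ignore_index=True` discards the (overlapping) labels and builds a fresh `RangeIndex` instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e5f6a7b9",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"s1 = pd.Series([1, 2], index=['a', 'b'])\n",
"s2 = pd.Series([3, 4], index=['a', 'b'])\n",
"# The resulting index is 0, 1, 2, 3 rather than a, b, a, b\n",
"pd.concat([s1, s2], ignore_index=True)"
]
},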
{
"cell_type": "markdown",
"id": "cf86c7c6",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Missing data\n",
"\n",
"The occurrence of missing data is so prevalent that it pays to use tools like Pandas, which seamlessly integrates missing-data handling so that it can be dealt with easily, and in the manner required by the analysis at hand.\n",
"\n",
"Missing data are represented in `Series` and `DataFrame` objects by the `NaN` floating point value. However, `None` is also treated as missing, since it is commonly used as such in other contexts (*e.g.* NumPy)."
]
},
{
"cell_type": "code",
"execution_count": 68,
"id": "d534af5b",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 NaN\n",
"1 -3\n",
"2 None\n",
"3 foobar\n",
"dtype: object"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"foo = pd.Series([np.nan, -3, None, 'foobar'])\n",
"foo"
]
},
{
"cell_type": "code",
"execution_count": 69,
"id": "f47f7613",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 True\n",
"1 False\n",
"2 True\n",
"3 False\n",
"dtype: bool"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"foo.isnull()"
]
},
{
"cell_type": "markdown",
"id": "1c83fa2b",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Missing values may be dropped or indexed out:"
]
},
{
"cell_type": "code",
"execution_count": 74,
"id": "85eebf12",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Cyanobacteria NaN\n",
"Firmicutes 632.0\n",
"Proteobacteria 1638.0\n",
"Actinobacteria 569.0\n",
"dtype: float64"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}\n",
"bacteria2 = pd.Series(bacteria_dict, index=['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria'])\n",
"bacteria2\n"
]
},
{
"cell_type": "code",
"execution_count": 75,
"id": "ecfc04d7",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Firmicutes 632.0\n",
"Proteobacteria 1638.0\n",
"Actinobacteria 569.0\n",
"dtype: float64"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bacteria2.dropna()"
]
},
{
"cell_type": "code",
"execution_count": 76,
"id": "9e055763",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Firmicutes 632.0\n",
"Proteobacteria 1638.0\n",
"Actinobacteria 569.0\n",
"dtype: float64"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bacteria2[bacteria2.notnull()]"
]
},
{
"cell_type": "markdown",
"id": "017f85a1",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- By default, `dropna` on a DataFrame drops any row that contains a missing value. This can be overridden by passing the `how='all'` argument, which only drops a row when every field is a missing value."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "708a9a0a",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data.dropna(how='all')"
]
},
{
"cell_type": "markdown",
"id": "f217a7b6",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- This can be customized further by specifying how many values need to be present before a row is dropped via the `thresh` argument."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9bc37ea9",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data.dropna(thresh=4)"
]
},
{
"cell_type": "markdown",
"id": "6fa5235a",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- This is typically used in time series applications, where there are repeated measurements that are incomplete for some subjects.\n",
"\n",
"- If we want to drop missing values column-wise instead of row-wise, we use `axis=1`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e98b3b29",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data.dropna(axis=1)"
]
},
{
"cell_type": "markdown",
"id": "8d49d8f4",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Rather than omitting missing data from an analysis, in some cases it may be suitable to fill the missing value in, either with a default value (such as zero) or with a value that is imputed or carried forward/backward from similar data points. We can do this programmatically in Pandas with the `fillna` method."
]
},
{
"cell_type": "code",
"execution_count": 79,
"id": "a4049207",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Cyanobacteria 0.0\n",
"Firmicutes 632.0\n",
"Proteobacteria 1638.0\n",
"Actinobacteria 569.0\n",
"dtype: float64"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bacteria2.fillna(0)"
]
},
{
"cell_type": "code",
"execution_count": 89,
"id": "8eb46c55",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" value treatment year\n",
"0 632 1.0 1994.0\n",
"1 1638 1.0 1997.0\n",
"2 569 1.0 1999.0\n",
"3 115 2.0 2013.0\n",
"4 433 2.0 2015.0\n",
"5 1130 2.0 2017.0\n",
"6 754 2.0 2019.0\n",
"7 555 2.0 2021.0"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],\n",
" 'treatment':[1, 1, 1, None, 2, 2, 2, 2],\n",
" 'year':[1994,1997,1999, None,2015,2017,2019,2021]})\n",
"\n",
"data.fillna({'year': 2013, 'treatment':2})"
]
},
{
"cell_type": "markdown",
"id": "1023ad7e",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Notice that `fillna` by default returns a new object with the desired filling behavior, rather than changing the `Series` or `DataFrame` in place.\n",
"\n",
"- We can alter values in-place using `inplace=True`."
]
},
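{
"cell_type": "markdown",
"id": "fillna-copy-sketch",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A minimal sketch of this copy-versus-in-place behavior, using a small hypothetical Series rather than the bacteria data:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"s = pd.Series([1.0, np.nan, 3.0])\n",
"filled = s.fillna(0)    # a new Series; the gap is filled here\n",
"s.isna().sum()          # the original Series still has its missing value\n",
"```"
]
},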
{
"cell_type": "markdown",
"id": "04ddff7c",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Data aggregation and GroupBy operations\n",
"\n",
"One of the most powerful features of Pandas is its **GroupBy** functionality. On occasion we may want to perform operations on *groups* of observations within a dataset. For example:\n",
"\n",
"* **aggregation**, such as computing the sum or mean of each group, which involves applying a function to each group and returning the aggregated results\n",
"* **slicing** the DataFrame into groups and then doing something with the resulting slices (*e.g.* plotting)\n",
"* group-wise **transformation**, such as standardization/normalization"
]
},
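{
"cell_type": "markdown",
"id": "split-apply-combine-sketch",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Each of these patterns follows the same split-apply-combine recipe; a one-line sketch with a toy DataFrame (hypothetical data):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],\n",
"                    'value': [1, 2, 3, 4]})\n",
"group_means = toy.groupby('group')['value'].mean()  # split by 'group', apply mean, combine\n",
"```"
]
},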
{
"cell_type": "code",
"execution_count": 91,
"id": "45250d06",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" patient | \n",
" obs | \n",
" week | \n",
" site | \n",
" id | \n",
" treat | \n",
" age | \n",
" sex | \n",
" twstrs | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 5000U | \n",
" 65 | \n",
" F | \n",
" 32 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" 5000U | \n",
" 65 | \n",
" F | \n",
" 30 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 3 | \n",
" 4 | \n",
" 1 | \n",
" 1 | \n",
" 5000U | \n",
" 65 | \n",
" F | \n",
" 24 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 4 | \n",
" 8 | \n",
" 1 | \n",
" 1 | \n",
" 5000U | \n",
" 65 | \n",
" F | \n",
" 37 | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" 5 | \n",
" 12 | \n",
" 1 | \n",
" 1 | \n",
" 5000U | \n",
" 65 | \n",
" F | \n",
" 39 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 626 | \n",
" 109 | \n",
" 1 | \n",
" 0 | \n",
" 9 | \n",
" 11 | \n",
" 5000U | \n",
" 57 | \n",
" M | \n",
" 53 | \n",
"
\n",
" \n",
" 627 | \n",
" 109 | \n",
" 2 | \n",
" 2 | \n",
" 9 | \n",
" 11 | \n",
" 5000U | \n",
" 57 | \n",
" M | \n",
" 38 | \n",
"
\n",
" \n",
" 628 | \n",
" 109 | \n",
" 4 | \n",
" 8 | \n",
" 9 | \n",
" 11 | \n",
" 5000U | \n",
" 57 | \n",
" M | \n",
" 33 | \n",
"
\n",
" \n",
" 629 | \n",
" 109 | \n",
" 5 | \n",
" 12 | \n",
" 9 | \n",
" 11 | \n",
" 5000U | \n",
" 57 | \n",
" M | \n",
" 36 | \n",
"
\n",
" \n",
" 630 | \n",
" 109 | \n",
" 6 | \n",
" 16 | \n",
" 9 | \n",
" 11 | \n",
" 5000U | \n",
" 57 | \n",
" M | \n",
" 51 | \n",
"
\n",
" \n",
"
\n",
"
631 rows × 9 columns
\n",
"
"
],
"text/plain": [
" patient obs week site id treat age sex twstrs\n",
"0 1 1 0 1 1 5000U 65 F 32\n",
"1 1 2 2 1 1 5000U 65 F 30\n",
"2 1 3 4 1 1 5000U 65 F 24\n",
"3 1 4 8 1 1 5000U 65 F 37\n",
"4 1 5 12 1 1 5000U 65 F 39\n",
".. ... ... ... ... .. ... ... .. ...\n",
"626 109 1 0 9 11 5000U 57 M 53\n",
"627 109 2 2 9 11 5000U 57 M 38\n",
"628 109 4 8 9 11 5000U 57 M 33\n",
"629 109 5 12 9 11 5000U 57 M 36\n",
"630 109 6 16 9 11 5000U 57 M 51\n",
"\n",
"[631 rows x 9 columns]"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cdystonia = pd.read_csv(\"data/cdystonia.csv\")\n",
"cdystonia"
]
},
{
"cell_type": "code",
"execution_count": 92,
"id": "f75c1ce2",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cdystonia_grouped = cdystonia.groupby(cdystonia.patient)\n",
"cdystonia_grouped"
]
},
{
"cell_type": "markdown",
"id": "98c33a39",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- However, the grouping is only an intermediate step; for example, we may want to **iterate** over each of the patient groups:"
]
},
{
"cell_type": "code",
"execution_count": 93,
"id": "08779a9c",
"metadata": {
"scrolled": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# for patient, group in cdystonia_grouped:\n",
"# print(patient)\n",
"# print(group)"
]
},
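{
"cell_type": "markdown",
"id": "get-group-sketch",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Besides iterating, a single group can be retrieved with `get_group`; sketched here on a toy DataFrame so it does not depend on the CSV file:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({'patient': [1, 1, 2], 'twstrs': [32, 30, 47]})\n",
"grouped = toy.groupby('patient')\n",
"patient1 = grouped.get_group(1)   # just the rows belonging to patient 1\n",
"```"
]
},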
{
"cell_type": "markdown",
"id": "5d96b068",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- A common data analysis procedure is the **split-apply-combine** operation, which groups subsets of data together, applies a function to each of the groups, then recombines them into a new data table. For example, we may want to aggregate our data with some function.\n",
"\n",
"- We can aggregate in Pandas using the `aggregate` (or `agg`, for short) method:"
]
},
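{
"cell_type": "markdown",
"id": "agg-list-sketch",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"`agg` also accepts a list of functions, returning one column per function; a sketch on toy data (not the cdystonia dataset):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 3, 5]})\n",
"summary = toy.groupby('group')['value'].agg(['mean', 'sum'])\n",
"```"
]
},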
{
"cell_type": "code",
"execution_count": 94,
"id": "bde72bdc",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" obs | \n",
" week | \n",
" site | \n",
" id | \n",
" age | \n",
" twstrs | \n",
"
\n",
" \n",
" patient | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 65.0 | \n",
" 33.000000 | \n",
"
\n",
" \n",
" 2 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
" 70.0 | \n",
" 47.666667 | \n",
"
\n",
" \n",
" 3 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 1.0 | \n",
" 3.0 | \n",
" 64.0 | \n",
" 30.500000 | \n",
"
\n",
" \n",
" 4 | \n",
" 2.5 | \n",
" 3.5 | \n",
" 1.0 | \n",
" 4.0 | \n",
" 59.0 | \n",
" 60.000000 | \n",
"
\n",
" \n",
" 5 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 1.0 | \n",
" 5.0 | \n",
" 76.0 | \n",
" 46.166667 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 105 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 9.0 | \n",
" 7.0 | \n",
" 79.0 | \n",
" 43.666667 | \n",
"
\n",
" \n",
" 106 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 9.0 | \n",
" 8.0 | \n",
" 43.0 | \n",
" 67.666667 | \n",
"
\n",
" \n",
" 107 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 9.0 | \n",
" 9.0 | \n",
" 50.0 | \n",
" 42.000000 | \n",
"
\n",
" \n",
" 108 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 9.0 | \n",
" 10.0 | \n",
" 39.0 | \n",
" 52.333333 | \n",
"
\n",
" \n",
" 109 | \n",
" 3.6 | \n",
" 7.6 | \n",
" 9.0 | \n",
" 11.0 | \n",
" 57.0 | \n",
" 42.200000 | \n",
"
\n",
" \n",
"
\n",
"
109 rows × 6 columns
\n",
"
"
],
"text/plain": [
" obs week site id age twstrs\n",
"patient \n",
"1 3.5 7.0 1.0 1.0 65.0 33.000000\n",
"2 3.5 7.0 1.0 2.0 70.0 47.666667\n",
"3 3.5 7.0 1.0 3.0 64.0 30.500000\n",
"4 2.5 3.5 1.0 4.0 59.0 60.000000\n",
"5 3.5 7.0 1.0 5.0 76.0 46.166667\n",
"... ... ... ... ... ... ...\n",
"105 3.5 7.0 9.0 7.0 79.0 43.666667\n",
"106 3.5 7.0 9.0 8.0 43.0 67.666667\n",
"107 3.5 7.0 9.0 9.0 50.0 42.000000\n",
"108 3.5 7.0 9.0 10.0 39.0 52.333333\n",
"109 3.6 7.6 9.0 11.0 57.0 42.200000\n",
"\n",
"[109 rows x 6 columns]"
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"cdystonia_grouped.agg(np.mean)"
]
},
{
"cell_type": "markdown",
"id": "cb55ed92",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Notice that the `treat` and `sex` variables are not included in the aggregation. Since it does not make sense to compute the mean of string variables, these columns are simply ignored by the method.\n",
"\n",
"- Some aggregation functions are so common that Pandas has a convenience method for them, such as `mean`:"
]
},
{
"cell_type": "code",
"execution_count": 95,
"id": "9ec0d138",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" obs | \n",
" week | \n",
" site | \n",
" id | \n",
" age | \n",
" twstrs | \n",
"
\n",
" \n",
" patient | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 65.0 | \n",
" 33.000000 | \n",
"
\n",
" \n",
" 2 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
" 70.0 | \n",
" 47.666667 | \n",
"
\n",
" \n",
" 3 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 1.0 | \n",
" 3.0 | \n",
" 64.0 | \n",
" 30.500000 | \n",
"
\n",
" \n",
" 4 | \n",
" 2.5 | \n",
" 3.5 | \n",
" 1.0 | \n",
" 4.0 | \n",
" 59.0 | \n",
" 60.000000 | \n",
"
\n",
" \n",
" 5 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 1.0 | \n",
" 5.0 | \n",
" 76.0 | \n",
" 46.166667 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 105 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 9.0 | \n",
" 7.0 | \n",
" 79.0 | \n",
" 43.666667 | \n",
"
\n",
" \n",
" 106 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 9.0 | \n",
" 8.0 | \n",
" 43.0 | \n",
" 67.666667 | \n",
"
\n",
" \n",
" 107 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 9.0 | \n",
" 9.0 | \n",
" 50.0 | \n",
" 42.000000 | \n",
"
\n",
" \n",
" 108 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 9.0 | \n",
" 10.0 | \n",
" 39.0 | \n",
" 52.333333 | \n",
"
\n",
" \n",
" 109 | \n",
" 3.6 | \n",
" 7.6 | \n",
" 9.0 | \n",
" 11.0 | \n",
" 57.0 | \n",
" 42.200000 | \n",
"
\n",
" \n",
"
\n",
"
109 rows × 6 columns
\n",
"
"
],
"text/plain": [
" obs week site id age twstrs\n",
"patient \n",
"1 3.5 7.0 1.0 1.0 65.0 33.000000\n",
"2 3.5 7.0 1.0 2.0 70.0 47.666667\n",
"3 3.5 7.0 1.0 3.0 64.0 30.500000\n",
"4 2.5 3.5 1.0 4.0 59.0 60.000000\n",
"5 3.5 7.0 1.0 5.0 76.0 46.166667\n",
"... ... ... ... ... ... ...\n",
"105 3.5 7.0 9.0 7.0 79.0 43.666667\n",
"106 3.5 7.0 9.0 8.0 43.0 67.666667\n",
"107 3.5 7.0 9.0 9.0 50.0 42.000000\n",
"108 3.5 7.0 9.0 10.0 39.0 52.333333\n",
"109 3.6 7.6 9.0 11.0 57.0 42.200000\n",
"\n",
"[109 rows x 6 columns]"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cdystonia_grouped.mean()"
]
},
{
"cell_type": "markdown",
"id": "2c1ee1ff",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- The `add_prefix` and `add_suffix` methods can be used to give the columns of the resulting table labels that reflect the transformation:"
]
},
{
"cell_type": "code",
"execution_count": 96,
"id": "d12a15bf",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" obs_mean | \n",
" week_mean | \n",
" site_mean | \n",
" id_mean | \n",
" age_mean | \n",
" twstrs_mean | \n",
"
\n",
" \n",
" patient | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 65.0 | \n",
" 33.000000 | \n",
"
\n",
" \n",
" 2 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
" 70.0 | \n",
" 47.666667 | \n",
"
\n",
" \n",
" 3 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 1.0 | \n",
" 3.0 | \n",
" 64.0 | \n",
" 30.500000 | \n",
"
\n",
" \n",
" 4 | \n",
" 2.5 | \n",
" 3.5 | \n",
" 1.0 | \n",
" 4.0 | \n",
" 59.0 | \n",
" 60.000000 | \n",
"
\n",
" \n",
" 5 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 1.0 | \n",
" 5.0 | \n",
" 76.0 | \n",
" 46.166667 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 105 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 9.0 | \n",
" 7.0 | \n",
" 79.0 | \n",
" 43.666667 | \n",
"
\n",
" \n",
" 106 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 9.0 | \n",
" 8.0 | \n",
" 43.0 | \n",
" 67.666667 | \n",
"
\n",
" \n",
" 107 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 9.0 | \n",
" 9.0 | \n",
" 50.0 | \n",
" 42.000000 | \n",
"
\n",
" \n",
" 108 | \n",
" 3.5 | \n",
" 7.0 | \n",
" 9.0 | \n",
" 10.0 | \n",
" 39.0 | \n",
" 52.333333 | \n",
"
\n",
" \n",
" 109 | \n",
" 3.6 | \n",
" 7.6 | \n",
" 9.0 | \n",
" 11.0 | \n",
" 57.0 | \n",
" 42.200000 | \n",
"
\n",
" \n",
"
\n",
"
109 rows × 6 columns
\n",
"
"
],
"text/plain": [
" obs_mean week_mean site_mean id_mean age_mean twstrs_mean\n",
"patient \n",
"1 3.5 7.0 1.0 1.0 65.0 33.000000\n",
"2 3.5 7.0 1.0 2.0 70.0 47.666667\n",
"3 3.5 7.0 1.0 3.0 64.0 30.500000\n",
"4 2.5 3.5 1.0 4.0 59.0 60.000000\n",
"5 3.5 7.0 1.0 5.0 76.0 46.166667\n",
"... ... ... ... ... ... ...\n",
"105 3.5 7.0 9.0 7.0 79.0 43.666667\n",
"106 3.5 7.0 9.0 8.0 43.0 67.666667\n",
"107 3.5 7.0 9.0 9.0 50.0 42.000000\n",
"108 3.5 7.0 9.0 10.0 39.0 52.333333\n",
"109 3.6 7.6 9.0 11.0 57.0 42.200000\n",
"\n",
"[109 rows x 6 columns]"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cdystonia_grouped.mean().add_suffix('_mean')"
]
},
{
"cell_type": "code",
"execution_count": 97,
"id": "313219c9",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"patient\n",
"1 34.0\n",
"2 50.5\n",
"3 30.5\n",
"4 61.5\n",
"5 48.5\n",
" ... \n",
"105 45.5\n",
"106 67.5\n",
"107 44.0\n",
"108 50.5\n",
"109 38.0\n",
"Name: twstrs, Length: 109, dtype: float64"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The median of the `twstrs` variable\n",
"cdystonia_grouped['twstrs'].quantile(0.5)"
]
},
{
"cell_type": "markdown",
"id": "390ffe52",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- If we wish, we can easily aggregate according to multiple keys:"
]
},
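{
"cell_type": "markdown",
"id": "multikey-groupby-sketch",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"With multiple keys, the result is indexed by every observed key combination (a MultiIndex); a toy sketch with hypothetical data:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({'week': [0, 0, 2, 2],\n",
"                    'site': [1, 2, 1, 2],\n",
"                    'twstrs': [30, 40, 20, 60]})\n",
"multi = toy.groupby(['week', 'site']).mean()  # one row per (week, site) pair\n",
"```"
]
},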
{
"cell_type": "code",
"execution_count": 110,
"id": "a0d1e443",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" patient | \n",
" obs | \n",
" id | \n",
" age | \n",
" twstrs | \n",
"
\n",
" \n",
" week | \n",
" site | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 6.5 | \n",
" 1.0 | \n",
" 6.5 | \n",
" 59.000000 | \n",
" 43.083333 | \n",
"
\n",
" \n",
" 2 | \n",
" 19.5 | \n",
" 1.0 | \n",
" 7.5 | \n",
" 53.928571 | \n",
" 51.857143 | \n",
"
\n",
" \n",
" 3 | \n",
" 32.5 | \n",
" 1.0 | \n",
" 6.5 | \n",
" 51.500000 | \n",
" 38.750000 | \n",
"
\n",
" \n",
" 4 | \n",
" 42.5 | \n",
" 1.0 | \n",
" 4.5 | \n",
" 59.250000 | \n",
" 48.125000 | \n",
"
\n",
" \n",
" 5 | \n",
" 49.5 | \n",
" 1.0 | \n",
" 3.5 | \n",
" 51.833333 | \n",
" 49.333333 | \n",
"
\n",
" \n",
" 6 | \n",
" 60.0 | \n",
" 1.0 | \n",
" 8.0 | \n",
" 51.866667 | \n",
" 49.400000 | \n",
"
\n",
" \n",
" 7 | \n",
" 73.5 | \n",
" 1.0 | \n",
" 6.5 | \n",
" 59.250000 | \n",
" 44.333333 | \n",
"
\n",
" \n",
" 8 | \n",
" 89.0 | \n",
" 1.0 | \n",
" 10.0 | \n",
" 57.263158 | \n",
" 38.631579 | \n",
"
\n",
" \n",
" 9 | \n",
" 104.0 | \n",
" 1.0 | \n",
" 6.0 | \n",
" 55.454545 | \n",
" 52.727273 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 6.5 | \n",
" 2.0 | \n",
" 6.5 | \n",
" 59.000000 | \n",
" 31.083333 | \n",
"
\n",
" \n",
" 2 | \n",
" 19.0 | \n",
" 2.0 | \n",
" 7.0 | \n",
" 52.923077 | \n",
" 48.769231 | \n",
"
\n",
" \n",
" 3 | \n",
" 32.5 | \n",
" 2.0 | \n",
" 6.5 | \n",
" 51.500000 | \n",
" 32.416667 | \n",
"
\n",
" \n",
" 4 | \n",
" 42.5 | \n",
" 2.0 | \n",
" 4.5 | \n",
" 59.250000 | \n",
" 39.125000 | \n",
"
\n",
" \n",
" 5 | \n",
" 49.0 | \n",
" 2.0 | \n",
" 3.0 | \n",
" 50.000000 | \n",
" 44.200000 | \n",
"
\n",
" \n",
" 6 | \n",
" 60.0 | \n",
" 2.0 | \n",
" 8.0 | \n",
" 51.866667 | \n",
" 44.066667 | \n",
"
\n",
" \n",
" 7 | \n",
" 73.5 | \n",
" 2.0 | \n",
" 6.5 | \n",
" 59.250000 | \n",
" 32.916667 | \n",
"
\n",
" \n",
" 8 | \n",
" 88.5 | \n",
" 2.0 | \n",
" 9.5 | \n",
" 58.562500 | \n",
" 29.500000 | \n",
"
\n",
" \n",
" 9 | \n",
" 103.7 | \n",
" 2.0 | \n",
" 5.7 | \n",
" 56.000000 | \n",
" 41.600000 | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" 6.5 | \n",
" 3.0 | \n",
" 6.5 | \n",
" 59.000000 | \n",
" 33.333333 | \n",
"
\n",
" \n",
" 2 | \n",
" 19.5 | \n",
" 3.0 | \n",
" 7.5 | \n",
" 53.928571 | \n",
" 48.785714 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" patient obs id age twstrs\n",
"week site \n",
"0 1 6.5 1.0 6.5 59.000000 43.083333\n",
" 2 19.5 1.0 7.5 53.928571 51.857143\n",
" 3 32.5 1.0 6.5 51.500000 38.750000\n",
" 4 42.5 1.0 4.5 59.250000 48.125000\n",
" 5 49.5 1.0 3.5 51.833333 49.333333\n",
" 6 60.0 1.0 8.0 51.866667 49.400000\n",
" 7 73.5 1.0 6.5 59.250000 44.333333\n",
" 8 89.0 1.0 10.0 57.263158 38.631579\n",
" 9 104.0 1.0 6.0 55.454545 52.727273\n",
"2 1 6.5 2.0 6.5 59.000000 31.083333\n",
" 2 19.0 2.0 7.0 52.923077 48.769231\n",
" 3 32.5 2.0 6.5 51.500000 32.416667\n",
" 4 42.5 2.0 4.5 59.250000 39.125000\n",
" 5 49.0 2.0 3.0 50.000000 44.200000\n",
" 6 60.0 2.0 8.0 51.866667 44.066667\n",
" 7 73.5 2.0 6.5 59.250000 32.916667\n",
" 8 88.5 2.0 9.5 58.562500 29.500000\n",
" 9 103.7 2.0 5.7 56.000000 41.600000\n",
"4 1 6.5 3.0 6.5 59.000000 33.333333\n",
" 2 19.5 3.0 7.5 53.928571 48.785714"
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cdystonia.groupby(['week','site']).mean()"
]
},
{
"cell_type": "markdown",
"id": "838bc9d9",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Alternately, we can **transform** the data, using a function of our choice with the `transform` method:"
]
},
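{
"cell_type": "markdown",
"id": "transform-standardize-sketch",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Unlike `agg`, `transform` returns a result with the same shape as the input; a group-wise standardization sketch on toy data (hypothetical values):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],\n",
"                    'value': [1.0, 3.0, 10.0, 14.0]})\n",
"normalize = lambda x: (x - x.mean()) / x.std()\n",
"z = toy.groupby('group')['value'].transform(normalize)  # zero mean, unit std within each group\n",
"```"
]
},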
{
"cell_type": "code",
"execution_count": 107,
"id": "684b5283",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" obs | \n",
" week | \n",
" twstrs | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -1.336306 | \n",
" -1.135550 | \n",
" -0.181369 | \n",
"
\n",
" \n",
" 1 | \n",
" -0.801784 | \n",
" -0.811107 | \n",
" -0.544107 | \n",
"
\n",
" \n",
" 2 | \n",
" -0.267261 | \n",
" -0.486664 | \n",
" -1.632322 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.267261 | \n",
" 0.162221 | \n",
" 0.725476 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.801784 | \n",
" 0.811107 | \n",
" 1.088214 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 626 | \n",
" -1.253831 | \n",
" -1.135467 | \n",
" 1.180487 | \n",
"
\n",
" \n",
" 627 | \n",
" -0.771589 | \n",
" -0.836660 | \n",
" -0.459078 | \n",
"
\n",
" \n",
" 628 | \n",
" 0.192897 | \n",
" 0.059761 | \n",
" -1.005600 | \n",
"
\n",
" \n",
" 629 | \n",
" 0.675140 | \n",
" 0.657376 | \n",
" -0.677687 | \n",
"
\n",
" \n",
" 630 | \n",
" 1.157383 | \n",
" 1.254990 | \n",
" 0.961878 | \n",
"
\n",
" \n",
"
\n",
"
631 rows × 3 columns
\n",
"
"
],
"text/plain": [
" obs week twstrs\n",
"0 -1.336306 -1.135550 -0.181369\n",
"1 -0.801784 -0.811107 -0.544107\n",
"2 -0.267261 -0.486664 -1.632322\n",
"3 0.267261 0.162221 0.725476\n",
"4 0.801784 0.811107 1.088214\n",
".. ... ... ...\n",
"626 -1.253831 -1.135467 1.180487\n",
"627 -0.771589 -0.836660 -0.459078\n",
"628 0.192897 0.059761 -1.005600\n",
"629 0.675140 0.657376 -0.677687\n",
"630 1.157383 1.254990 0.961878\n",
"\n",
"[631 rows x 3 columns]"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cdystonia2 = cdystonia_grouped[[\"obs\", \"week\", \"twstrs\"]]\n",
"normalize = lambda x: (x - x.mean())/x.std()\n",
"\n",
"cdystonia2.transform(normalize)"
]
},
{
"cell_type": "markdown",
"id": "171b910c",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- It is easy to do column selection within `groupby` operations, if we are only interested in split-apply-combine operations on a subset of columns:"
]
},
{
"cell_type": "code",
"execution_count": 100,
"id": "73001faa",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"patient\n",
"1 33.000000\n",
"2 47.666667\n",
"3 30.500000\n",
"4 60.000000\n",
"5 46.166667\n",
" ... \n",
"105 43.666667\n",
"106 67.666667\n",
"107 42.000000\n",
"108 52.333333\n",
"109 42.200000\n",
"Name: twstrs, Length: 109, dtype: float64"
]
},
"execution_count": 100,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cdystonia_grouped['twstrs'].mean()"
]
},
{
"cell_type": "markdown",
"id": "a05c2b75",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- If you simply want to divide your DataFrame into chunks for later use, it's easy to convert the grouped object into a dict so that the chunks can be indexed out as needed:"
]
},
{
"cell_type": "code",
"execution_count": 101,
"id": "76648337",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" patient | \n",
" obs | \n",
" week | \n",
" site | \n",
" id | \n",
" treat | \n",
" age | \n",
" sex | \n",
" twstrs | \n",
"
\n",
" \n",
" \n",
" \n",
" 18 | \n",
" 4 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 4 | \n",
" Placebo | \n",
" 59 | \n",
" F | \n",
" 53 | \n",
"
\n",
" \n",
" 19 | \n",
" 4 | \n",
" 2 | \n",
" 2 | \n",
" 1 | \n",
" 4 | \n",
" Placebo | \n",
" 59 | \n",
" F | \n",
" 61 | \n",
"
\n",
" \n",
" 20 | \n",
" 4 | \n",
" 3 | \n",
" 4 | \n",
" 1 | \n",
" 4 | \n",
" Placebo | \n",
" 59 | \n",
" F | \n",
" 64 | \n",
"
\n",
" \n",
" 21 | \n",
" 4 | \n",
" 4 | \n",
" 8 | \n",
" 1 | \n",
" 4 | \n",
" Placebo | \n",
" 59 | \n",
" F | \n",
" 62 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" patient obs week site id treat age sex twstrs\n",
"18 4 1 0 1 4 Placebo 59 F 53\n",
"19 4 2 2 1 4 Placebo 59 F 61\n",
"20 4 3 4 1 4 Placebo 59 F 64\n",
"21 4 4 8 1 4 Placebo 59 F 62"
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chunks = dict(list(cdystonia_grouped))\n",
"chunks[4]"
]
},
{
"cell_type": "markdown",
"id": "e19eaf3e",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- By default, `groupby` groups by row, but we can specify the `axis` argument to change this. For example, we can group our columns by type this way:"
]
},
{
"cell_type": "code",
"execution_count": 102,
"id": "28465087",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{dtype('int64'): patient obs week site id age twstrs\n",
" 0 1 1 0 1 1 65 32\n",
" 1 1 2 2 1 1 65 30\n",
" 2 1 3 4 1 1 65 24\n",
" 3 1 4 8 1 1 65 37\n",
" 4 1 5 12 1 1 65 39\n",
" .. ... ... ... ... .. ... ...\n",
" 626 109 1 0 9 11 57 53\n",
" 627 109 2 2 9 11 57 38\n",
" 628 109 4 8 9 11 57 33\n",
" 629 109 5 12 9 11 57 36\n",
" 630 109 6 16 9 11 57 51\n",
" \n",
" [631 rows x 7 columns],\n",
" dtype('O'): treat sex\n",
" 0 5000U F\n",
" 1 5000U F\n",
" 2 5000U F\n",
" 3 5000U F\n",
" 4 5000U F\n",
" .. ... ..\n",
" 626 5000U M\n",
" 627 5000U M\n",
" 628 5000U M\n",
" 629 5000U M\n",
" 630 5000U M\n",
" \n",
" [631 rows x 2 columns]}"
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dict(list(cdystonia.groupby(cdystonia.dtypes, axis=1)))"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.17"
},
"rise": {
"auto_select": "first",
"autolaunch": false,
"enable_chalkboard": true,
"start_slideshow_at": "selected",
"theme": "black"
}
},
"nbformat": 4,
"nbformat_minor": 5
}