{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Data Wrangling with Pandas\n",
"\n",
"Feng Li\n",
"\n",
"School of Statistics and Mathematics\n",
"\n",
"Central University of Finance and Economics\n",
"\n",
"[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)\n",
"\n",
"[https://feng.li/python](https://feng.li/python)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"## Introduction to Pandas\n",
"\n",
"- Python is a terrific platform for statistical data analysis partly because of the features of the language itself, but also because of a rich suite of 3rd party packages that provide robust and flexible data structures, efficient implementations of mathematical and statistical functions, and facitities for generating publication-quality graphics. \n",
"\n",
"- Pandas is at the top of the \"scientific stack\", because it allows data to be imported, manipulated and exported so easily. In contrast, NumPy supports the bottom of the stack with fundamental infrastructure for array operations, mathematical calculations, and random number generation. \n",
"\n",
"- We will cover both of these in some detail before getting down to the business of analyzing data."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- **Pandas** is a Python package providing fast, flexible, and expressive data structures designed to work with *relational* or *labeled* data both. It is a fundamental high-level building block for doing practical, real world data analysis in Python. \n",
"\n",
"- Pandas is well suited for:\n",
"\n",
" - Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet\n",
" - Ordered and unordered (not necessarily fixed-frequency) time series data.\n",
" - Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels\n",
" - Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Key features of Pandas:\n",
" \n",
"- Easy handling of **missing data**\n",
"- **Size mutability**: columns can be inserted and deleted from DataFrame and higher dimensional objects\n",
"- Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically\n",
"- Powerful, flexible **group by functionality** to perform split-apply-combine operations on data sets\n",
"- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets\n",
"- Intuitive **merging and joining** data sets\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Flexible **reshaping and pivoting** of data sets\n",
"- **Hierarchical labeling** of axes\n",
"- Robust **IO tools** for loading data from flat files, Excel files, databases, and HDF5\n",
"- **Time series functionality**: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Looking in indexes: https://mirrors.163.com/pypi/simple/\n",
"Requirement already satisfied: pandas in /usr/local/lib/python3.9/dist-packages (1.3.4)\n",
"Collecting pandas\n",
" Using cached https://mirrors.163.com/pypi/packages/48/b4/1081d66b71c4dfc1bc1e19d6f2abbf93ed42f69df7703eb323742d45423e/pandas-1.3.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.5 MB)\n",
" Using cached https://mirrors.163.com/pypi/packages/03/ea/98d488a4047b3fd8075b5c1e00469ad42d715e2c1e4fd15fa1ffaef8d635/pandas-1.3.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.5 MB)\n",
"Requirement already satisfied: pytz>=2017.3 in /usr/lib/python3/dist-packages (from pandas) (2021.3)\n",
"Requirement already satisfied: python-dateutil>=2.7.3 in /usr/lib/python3/dist-packages (from pandas) (2.8.1)\n",
"Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.9/dist-packages (from pandas) (1.21.2)\n"
]
}
],
"source": [
"# Install pandas within a terminal\n",
"# ! pip3 install pandas -U \n",
"! pip install pandas -U "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Pandas Data Structures: Series\n",
"\n",
"A **Series** is a single vector of data (like a NumPy array) with an *index* that labels each element in the vector."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 632\n",
"1 1638\n",
"2 569\n",
"3 115\n",
"dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts = pd.Series([632, 1638, 569, 115])\n",
"counts"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the `Series`, while the index is a pandas `Index` object."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 632, 1638, 569, 115])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts.values"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"RangeIndex(start=0, stop=4, step=1)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts.index"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- We can assign meaningful labels to the index, if they are available:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Firmicutes 632\n",
"Proteobacteria 1638\n",
"Actinobacteria 569\n",
"Bacteroidetes 115\n",
"dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bacteria = pd.Series([632, 1638, 569, 115], \n",
" index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])\n",
"\n",
"bacteria"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- These labels can be used to refer to the values in the `Series`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"569"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bacteria['Actinobacteria']"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'], dtype='object', name='phylum')"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bacteria.index"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[False, True, True, False]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[name.endswith('bacteria') for name in bacteria.index]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Proteobacteria 1638\n",
"Actinobacteria 569\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bacteria[[name.endswith('bacteria') for name in bacteria.index]]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Notice that the indexing operation preserved the association between the values and the corresponding indices. We can still use **positional indexing** if we wish."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"632"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bacteria[0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- We can give both the **array of values** and **the index** meaningful labels themselves:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"phylum\n",
"Firmicutes 632\n",
"Proteobacteria 1638\n",
"Actinobacteria 569\n",
"Bacteroidetes 115\n",
"Name: counts, dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bacteria.name = 'counts'\n",
"bacteria.index.name = 'phylum'\n",
"bacteria"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- We can also filter according to the values in the `Series`:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"phylum\n",
"Proteobacteria 1638\n",
"Name: counts, dtype: int64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bacteria[bacteria>1000]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- A `Series` can be thought of as an **ordered key-value** store. In fact, we can create one from a `dict`:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Firmicutes 632\n",
"Proteobacteria 1638\n",
"Actinobacteria 569\n",
"Bacteroidetes 115\n",
"dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}\n",
"pd.Series(bacteria_dict)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Notice that the `Series` is created in **key-sorted order**. If we pass a custom index to `Series`, it will select the corresponding values from the dict, and treat indices without corrsponding values as missing. Pandas uses the `NaN` (not a number) type for missing values."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Cyanobacteria NaN\n",
"Firmicutes 632.0\n",
"Proteobacteria 1638.0\n",
"Actinobacteria 569.0\n",
"dtype: float64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bacteria2 = pd.Series(bacteria_dict, index=['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria'])\n",
"bacteria2"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Cyanobacteria True\n",
"Firmicutes False\n",
"Proteobacteria False\n",
"Actinobacteria False\n",
"dtype: bool"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bacteria2.isnull()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Critically, the labels are used to **align data** when used in operations with other Series objects:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Actinobacteria 1138.0\n",
"Bacteroidetes NaN\n",
"Cyanobacteria NaN\n",
"Firmicutes 1264.0\n",
"Proteobacteria 3276.0\n",
"dtype: float64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bacteria + bacteria2"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Pandas Data Structures: DataFrame\n",
"\n",
"Inevitably, we want to be able to store, view and manipulate data that is *multivariate*, where for every index there are multiple fields or columns of data, often of varying data type.\n",
"\n",
"A `DataFrame` is a tabular data structure, encapsulating multiple series like columns in a spreadsheet. Data are stored internally as a 2-dimensional object, but the `DataFrame` allows us to represent and manipulate higher-dimensional data."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" patient | \n",
" phylum | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 632 | \n",
" 1 | \n",
" Firmicutes | \n",
"
\n",
" \n",
" 1 | \n",
" 1638 | \n",
" 1 | \n",
" Proteobacteria | \n",
"
\n",
" \n",
" 2 | \n",
" 569 | \n",
" 1 | \n",
" Actinobacteria | \n",
"
\n",
" \n",
" 3 | \n",
" 115 | \n",
" 1 | \n",
" Bacteroidetes | \n",
"
\n",
" \n",
" 4 | \n",
" 433 | \n",
" 2 | \n",
" Firmicutes | \n",
"
\n",
" \n",
" 5 | \n",
" 1130 | \n",
" 2 | \n",
" Proteobacteria | \n",
"
\n",
" \n",
" 6 | \n",
" 754 | \n",
" 2 | \n",
" Actinobacteria | \n",
"
\n",
" \n",
" 7 | \n",
" 555 | \n",
" 2 | \n",
" Bacteroidetes | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value patient phylum\n",
"0 632 1 Firmicutes\n",
"1 1638 1 Proteobacteria\n",
"2 569 1 Actinobacteria\n",
"3 115 1 Bacteroidetes\n",
"4 433 2 Firmicutes\n",
"5 1130 2 Proteobacteria\n",
"6 754 2 Actinobacteria\n",
"7 555 2 Bacteroidetes"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],\n",
" 'patient':[1, 1, 1, 1, 2, 2, 2, 2],\n",
" 'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', \n",
" 'Bacteroidetes', 'Firmicutes', 'Proteobacteria', \n",
" 'Actinobacteria', 'Bacteroidetes']})\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Indexing"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Notice the `DataFrame` is sorted by column name. We can change the order by indexing them in the order we desire:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" phylum | \n",
" value | \n",
" patient | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Firmicutes | \n",
" 632 | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" Proteobacteria | \n",
" 1638 | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" Actinobacteria | \n",
" 569 | \n",
" 1 | \n",
"
\n",
" \n",
" 3 | \n",
" Bacteroidetes | \n",
" 115 | \n",
" 1 | \n",
"
\n",
" \n",
" 4 | \n",
" Firmicutes | \n",
" 433 | \n",
" 2 | \n",
"
\n",
" \n",
" 5 | \n",
" Proteobacteria | \n",
" 1130 | \n",
" 2 | \n",
"
\n",
" \n",
" 6 | \n",
" Actinobacteria | \n",
" 754 | \n",
" 2 | \n",
"
\n",
" \n",
" 7 | \n",
" Bacteroidetes | \n",
" 555 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" phylum value patient\n",
"0 Firmicutes 632 1\n",
"1 Proteobacteria 1638 1\n",
"2 Actinobacteria 569 1\n",
"3 Bacteroidetes 115 1\n",
"4 Firmicutes 433 2\n",
"5 Proteobacteria 1130 2\n",
"6 Actinobacteria 754 2\n",
"7 Bacteroidetes 555 2"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[['phylum','value','patient']]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- A `DataFrame` has a second index, representing the columns:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['value', 'patient', 'phylum'], dtype='object')"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- If we wish to access columns, we can do so either **by dict-like indexing** or **by attribute**:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 632\n",
"1 1638\n",
"2 569\n",
"3 115\n",
"4 433\n",
"5 1130\n",
"6 754\n",
"7 555\n",
"Name: value, dtype: int64"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['value']"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 632\n",
"1 1638\n",
"2 569\n",
"3 115\n",
"4 433\n",
"5 1130\n",
"6 754\n",
"7 555\n",
"Name: value, dtype: int64"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.value"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(data.value)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(data[['value']])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- `.iloc[ ]` Purely integer-location based indexing for selection by position.\n",
"\n",
" - With a **scalar integer** will retrieve the given row"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"value 632\n",
"patient 1\n",
"phylum Firmicutes\n",
"Name: 0, dtype: object"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- With a slice object for row selection. Note that **stop of the slice are not included** `[2, 5)`."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" patient | \n",
" phylum | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" 569 | \n",
" 1 | \n",
" Actinobacteria | \n",
"
\n",
" \n",
" 3 | \n",
" 115 | \n",
" 1 | \n",
" Bacteroidetes | \n",
"
\n",
" \n",
" 4 | \n",
" 433 | \n",
" 2 | \n",
" Firmicutes | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value patient phylum\n",
"2 569 1 Actinobacteria\n",
"3 115 1 Bacteroidetes\n",
"4 433 2 Firmicutes"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[2:5]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- List Comprehensions is also possible for row selection"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" patient | \n",
" phylum | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 632 | \n",
" 1 | \n",
" Firmicutes | \n",
"
\n",
" \n",
" 2 | \n",
" 569 | \n",
" 1 | \n",
" Actinobacteria | \n",
"
\n",
" \n",
" 4 | \n",
" 433 | \n",
" 2 | \n",
" Firmicutes | \n",
"
\n",
" \n",
" 6 | \n",
" 754 | \n",
" 2 | \n",
" Actinobacteria | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value patient phylum\n",
"0 632 1 Firmicutes\n",
"2 569 1 Actinobacteria\n",
"4 433 2 Firmicutes\n",
"6 754 2 Actinobacteria"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[lambda x: x.index % 2 == 0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- With a **list of integers** will retrieve the given row"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" patient | \n",
" phylum | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" 1638 | \n",
" 1 | \n",
" Proteobacteria | \n",
"
\n",
" \n",
" 2 | \n",
" 569 | \n",
" 1 | \n",
" Actinobacteria | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value patient phylum\n",
"1 1638 1 Proteobacteria\n",
"2 569 1 Actinobacteria"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[[1,2]]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- With **a boolean list** mask the same length as the index."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" patient | \n",
" phylum | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 632 | \n",
" 1 | \n",
" Firmicutes | \n",
"
\n",
" \n",
" 2 | \n",
" 569 | \n",
" 1 | \n",
" Actinobacteria | \n",
"
\n",
" \n",
" 4 | \n",
" 433 | \n",
" 2 | \n",
" Firmicutes | \n",
"
\n",
" \n",
" 6 | \n",
" 754 | \n",
" 2 | \n",
" Actinobacteria | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value patient phylum\n",
"0 632 1 Firmicutes\n",
"2 569 1 Actinobacteria\n",
"4 433 2 Firmicutes\n",
"6 754 2 Actinobacteria"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[[True, False,True, False,True, False,True, False]]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- **Indexing both axes**\n",
"\n",
"You can mix the indexer types for the index and columns. Use `:` to select the entire axis."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[0, 1] # With scalar integers"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" phylum | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 632 | \n",
" Firmicutes | \n",
"
\n",
" \n",
" 4 | \n",
" 433 | \n",
" Firmicutes | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value phylum\n",
"0 632 Firmicutes\n",
"4 433 Firmicutes"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[[0, 4], [0, 2]] # With lists of integers."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" patient | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" 1638 | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" 569 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value patient\n",
"1 1638 1\n",
"2 569 1"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[1:3, 0:2]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- `.loc[ ]` Access a group of rows and columns by label(s) or a boolean array."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" max_speed | \n",
" shield | \n",
"
\n",
" \n",
" \n",
" \n",
" cobra | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" viper | \n",
" 4 | \n",
" 5 | \n",
"
\n",
" \n",
" sidewinder | \n",
" 7 | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" max_speed shield\n",
"cobra 1 2\n",
"viper 4 5\n",
"sidewinder 7 8"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],\n",
" index=['cobra', 'viper', 'sidewinder'],\n",
" columns=['max_speed', 'shield'])\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"max_speed 4\n",
"shield 5\n",
"Name: viper, dtype: int64"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc['viper'] # Single label returns the row as a Series."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Slice with labels for row and single label for column. Note that **both the start and stop of the slice are included**."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"cobra 1\n",
"viper 4\n",
"Name: max_speed, dtype: int64"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc['cobra':'viper', 'max_speed']"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" max_speed | \n",
" shield | \n",
"
\n",
" \n",
" \n",
" \n",
" viper | \n",
" 4 | \n",
" 5 | \n",
"
\n",
" \n",
" sidewinder | \n",
" 7 | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" max_speed shield\n",
"viper 4 5\n",
"sidewinder 7 8"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[['viper', 'sidewinder']] # List of labels returns a DataFrame."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Boolean list with the same length as the row axis"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" max_speed | \n",
" shield | \n",
"
\n",
" \n",
" \n",
" \n",
" sidewinder | \n",
" 7 | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" max_speed shield\n",
"sidewinder 7 8"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[[False, False, True]]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Conditional that returns a boolean Series"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" max_speed | \n",
" shield | \n",
"
\n",
" \n",
" \n",
" \n",
" sidewinder | \n",
" 7 | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" max_speed shield\n",
"sidewinder 7 8"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[df['shield'] > 6]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Slice with integer labels for rows using `.loc[ ]`. Note that **both the start and stop of the slice are included**."
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" patient | \n",
" phylum | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 632 | \n",
" 1 | \n",
" Firmicutes | \n",
"
\n",
" \n",
" 1 | \n",
" 1638 | \n",
" 1 | \n",
" Proteobacteria | \n",
"
\n",
" \n",
" 2 | \n",
" 569 | \n",
" 1 | \n",
" Actinobacteria | \n",
"
\n",
" \n",
" 3 | \n",
" 115 | \n",
" 1 | \n",
" Bacteroidetes | \n",
"
\n",
" \n",
" 4 | \n",
" 433 | \n",
" 2 | \n",
" Firmicutes | \n",
"
\n",
" \n",
" 5 | \n",
" 0 | \n",
" 2 | \n",
" Proteobacteria | \n",
"
\n",
" \n",
" 6 | \n",
" 754 | \n",
" 2 | \n",
" Actinobacteria | \n",
"
\n",
" \n",
" 7 | \n",
" 555 | \n",
" 2 | \n",
" Bacteroidetes | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value patient phylum\n",
"0 632 1 Firmicutes\n",
"1 1638 1 Proteobacteria\n",
"2 569 1 Actinobacteria\n",
"3 115 1 Bacteroidetes\n",
"4 433 2 Firmicutes\n",
"5 0 2 Proteobacteria\n",
"6 754 2 Actinobacteria\n",
"7 555 2 Bacteroidetes"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"scrolled": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" patient | \n",
" phylum | \n",
"
\n",
" \n",
" \n",
" \n",
" 3 | \n",
" 115 | \n",
" 1 | \n",
" Bacteroidetes | \n",
"
\n",
" \n",
" 4 | \n",
" 433 | \n",
" 2 | \n",
" Firmicutes | \n",
"
\n",
" \n",
" 5 | \n",
" 0 | \n",
" 2 | \n",
" Proteobacteria | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value patient phylum\n",
"3 115 1 Bacteroidetes\n",
"4 433 2 Firmicutes\n",
"5 0 2 Proteobacteria"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.loc[3:5]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Set values"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" max_speed | \n",
" shield | \n",
"
\n",
" \n",
" \n",
" \n",
" cobra | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" viper | \n",
" 4 | \n",
" 50 | \n",
"
\n",
" \n",
" sidewinder | \n",
" 7 | \n",
" 50 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" max_speed shield\n",
"cobra 1 2\n",
"viper 4 50\n",
"sidewinder 7 50"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Set value for all items matching the list of labels\n",
"df.loc[['viper', 'sidewinder'], ['shield']] = 50\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" max_speed | \n",
" shield | \n",
"
\n",
" \n",
" \n",
" \n",
" cobra | \n",
" 10 | \n",
" 10 | \n",
"
\n",
" \n",
" viper | \n",
" 4 | \n",
" 50 | \n",
"
\n",
" \n",
" sidewinder | \n",
" 7 | \n",
" 50 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" max_speed shield\n",
"cobra 10 10\n",
"viper 4 50\n",
"sidewinder 7 50"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc['cobra'] = 10 # Set value for an entire row\n",
"\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" max_speed | \n",
" shield | \n",
"
\n",
" \n",
" \n",
" \n",
" cobra | \n",
" 30 | \n",
" 10 | \n",
"
\n",
" \n",
" viper | \n",
" 30 | \n",
" 50 | \n",
"
\n",
" \n",
" sidewinder | \n",
" 30 | \n",
" 50 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" max_speed shield\n",
"cobra 30 10\n",
"viper 30 50\n",
"sidewinder 30 50"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[:, 'max_speed'] = 30 # Set value for an entire column\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- It is important to note that the Series returned when a DataFrame is indexted is merely a **view** on the DataFrame, and not a copy of the data itself. So you must be cautious when manipulating this data:"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 632\n",
"1 1638\n",
"2 569\n",
"3 115\n",
"4 433\n",
"5 1130\n",
"6 754\n",
"7 555\n",
"Name: value, dtype: int64"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vals = data.value\n",
"vals"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_229928/1693880163.py:1: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" vals[5] = 0\n"
]
},
{
"data": {
"text/plain": [
"0 632\n",
"1 1638\n",
"2 569\n",
"3 115\n",
"4 433\n",
"5 0\n",
"6 754\n",
"7 555\n",
"Name: value, dtype: int64"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vals[5] = 0\n",
"vals"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" patient | \n",
" phylum | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 632 | \n",
" 1 | \n",
" Firmicutes | \n",
"
\n",
" \n",
" 1 | \n",
" 1638 | \n",
" 1 | \n",
" Proteobacteria | \n",
"
\n",
" \n",
" 2 | \n",
" 569 | \n",
" 1 | \n",
" Actinobacteria | \n",
"
\n",
" \n",
" 3 | \n",
" 115 | \n",
" 1 | \n",
" Bacteroidetes | \n",
"
\n",
" \n",
" 4 | \n",
" 433 | \n",
" 2 | \n",
" Firmicutes | \n",
"
\n",
" \n",
" 5 | \n",
" 0 | \n",
" 2 | \n",
" Proteobacteria | \n",
"
\n",
" \n",
" 6 | \n",
" 754 | \n",
" 2 | \n",
" Actinobacteria | \n",
"
\n",
" \n",
" 7 | \n",
" 555 | \n",
" 2 | \n",
" Bacteroidetes | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value patient phylum\n",
"0 632 1 Firmicutes\n",
"1 1638 1 Proteobacteria\n",
"2 569 1 Actinobacteria\n",
"3 115 1 Bacteroidetes\n",
"4 433 2 Firmicutes\n",
"5 0 2 Proteobacteria\n",
"6 754 2 Actinobacteria\n",
"7 555 2 Bacteroidetes"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"vals = data.value.copy()\n",
"vals[5] = 1000\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- We can create or modify columns by assignment:"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_229928/2998967180.py:1: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" data.value[3] = 14\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" patient | \n",
" phylum | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 632 | \n",
" 1 | \n",
" Firmicutes | \n",
"
\n",
" \n",
" 1 | \n",
" 1638 | \n",
" 1 | \n",
" Proteobacteria | \n",
"
\n",
" \n",
" 2 | \n",
" 569 | \n",
" 1 | \n",
" Actinobacteria | \n",
"
\n",
" \n",
" 3 | \n",
" 14 | \n",
" 1 | \n",
" Bacteroidetes | \n",
"
\n",
" \n",
" 4 | \n",
" 433 | \n",
" 2 | \n",
" Firmicutes | \n",
"
\n",
" \n",
" 5 | \n",
" 0 | \n",
" 2 | \n",
" Proteobacteria | \n",
"
\n",
" \n",
" 6 | \n",
" 754 | \n",
" 2 | \n",
" Actinobacteria | \n",
"
\n",
" \n",
" 7 | \n",
" 555 | \n",
" 2 | \n",
" Bacteroidetes | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value patient phylum\n",
"0 632 1 Firmicutes\n",
"1 1638 1 Proteobacteria\n",
"2 569 1 Actinobacteria\n",
"3 14 1 Bacteroidetes\n",
"4 433 2 Firmicutes\n",
"5 0 2 Proteobacteria\n",
"6 754 2 Actinobacteria\n",
"7 555 2 Bacteroidetes"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.value[3] = 14\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" patient | \n",
" phylum | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 632 | \n",
" 1 | \n",
" Firmicutes | \n",
" 2013 | \n",
"
\n",
" \n",
" 1 | \n",
" 1638 | \n",
" 1 | \n",
" Proteobacteria | \n",
" 2013 | \n",
"
\n",
" \n",
" 2 | \n",
" 569 | \n",
" 1 | \n",
" Actinobacteria | \n",
" 2013 | \n",
"
\n",
" \n",
" 3 | \n",
" 14 | \n",
" 1 | \n",
" Bacteroidetes | \n",
" 2013 | \n",
"
\n",
" \n",
" 4 | \n",
" 433 | \n",
" 2 | \n",
" Firmicutes | \n",
" 2013 | \n",
"
\n",
" \n",
" 5 | \n",
" 0 | \n",
" 2 | \n",
" Proteobacteria | \n",
" 2013 | \n",
"
\n",
" \n",
" 6 | \n",
" 754 | \n",
" 2 | \n",
" Actinobacteria | \n",
" 2013 | \n",
"
\n",
" \n",
" 7 | \n",
" 555 | \n",
" 2 | \n",
" Bacteroidetes | \n",
" 2013 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value patient phylum year\n",
"0 632 1 Firmicutes 2013\n",
"1 1638 1 Proteobacteria 2013\n",
"2 569 1 Actinobacteria 2013\n",
"3 14 1 Bacteroidetes 2013\n",
"4 433 2 Firmicutes 2013\n",
"5 0 2 Proteobacteria 2013\n",
"6 754 2 Actinobacteria 2013\n",
"7 555 2 Bacteroidetes 2013"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['year'] = 2013\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- But note, we cannot use the attribute indexing method to add a new column:"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" patient | \n",
" phylum | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 632 | \n",
" 1 | \n",
" Firmicutes | \n",
" 2013 | \n",
"
\n",
" \n",
" 1 | \n",
" 1638 | \n",
" 1 | \n",
" Proteobacteria | \n",
" 2013 | \n",
"
\n",
" \n",
" 2 | \n",
" 569 | \n",
" 1 | \n",
" Actinobacteria | \n",
" 2013 | \n",
"
\n",
" \n",
" 3 | \n",
" 14 | \n",
" 1 | \n",
" Bacteroidetes | \n",
" 2013 | \n",
"
\n",
" \n",
" 4 | \n",
" 433 | \n",
" 2 | \n",
" Firmicutes | \n",
" 2013 | \n",
"
\n",
" \n",
" 5 | \n",
" 0 | \n",
" 2 | \n",
" Proteobacteria | \n",
" 2013 | \n",
"
\n",
" \n",
" 6 | \n",
" 754 | \n",
" 2 | \n",
" Actinobacteria | \n",
" 2013 | \n",
"
\n",
" \n",
" 7 | \n",
" 555 | \n",
" 2 | \n",
" Bacteroidetes | \n",
" 2013 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value patient phylum year\n",
"0 632 1 Firmicutes 2013\n",
"1 1638 1 Proteobacteria 2013\n",
"2 569 1 Actinobacteria 2013\n",
"3 14 1 Bacteroidetes 2013\n",
"4 433 2 Firmicutes 2013\n",
"5 0 2 Proteobacteria 2013\n",
"6 754 2 Actinobacteria 2013\n",
"7 555 2 Bacteroidetes 2013"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.treatment = 1\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.treatment"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Specifying a `Series` as a new columns cause its values to be added according to the `DataFrame`'s index:"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 0\n",
"2 0\n",
"3 0\n",
"4 1\n",
"5 1\n",
"dtype: int64"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"treatment = pd.Series([0]*4 + [1]*2)\n",
"treatment"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" patient | \n",
" phylum | \n",
" year | \n",
" treatment | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 632 | \n",
" 1 | \n",
" Firmicutes | \n",
" 2013 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1638 | \n",
" 1 | \n",
" Proteobacteria | \n",
" 2013 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 569 | \n",
" 1 | \n",
" Actinobacteria | \n",
" 2013 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 14 | \n",
" 1 | \n",
" Bacteroidetes | \n",
" 2013 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 433 | \n",
" 2 | \n",
" Firmicutes | \n",
" 2013 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 5 | \n",
" 0 | \n",
" 2 | \n",
" Proteobacteria | \n",
" 2013 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 6 | \n",
" 754 | \n",
" 2 | \n",
" Actinobacteria | \n",
" 2013 | \n",
" NaN | \n",
"
\n",
" \n",
" 7 | \n",
" 555 | \n",
" 2 | \n",
" Bacteroidetes | \n",
" 2013 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value patient phylum year treatment\n",
"0 632 1 Firmicutes 2013 0.0\n",
"1 1638 1 Proteobacteria 2013 0.0\n",
"2 569 1 Actinobacteria 2013 0.0\n",
"3 14 1 Bacteroidetes 2013 0.0\n",
"4 433 2 Firmicutes 2013 1.0\n",
"5 0 2 Proteobacteria 2013 1.0\n",
"6 754 2 Actinobacteria 2013 NaN\n",
"7 555 2 Bacteroidetes 2013 NaN"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['treatment'] = treatment\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Other Python data structures (ones without an index) need to be the same length as the `DataFrame`. The following produces an error."
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# month = ['Jan', 'Feb', 'Mar', 'Apr']\n",
"# data['month'] = month"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- We can extract the underlying data as a simple `ndarray` by accessing the `values` attribute:"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[632, 1, 'Firmicutes', 2013, 0.0],\n",
" [1638, 1, 'Proteobacteria', 2013, 0.0],\n",
" [569, 1, 'Actinobacteria', 2013, 0.0],\n",
" [14, 1, 'Bacteroidetes', 2013, 0.0],\n",
" [433, 2, 'Firmicutes', 2013, 1.0],\n",
" [0, 2, 'Proteobacteria', 2013, 1.0],\n",
" [754, 2, 'Actinobacteria', 2013, nan],\n",
" [555, 2, 'Bacteroidetes', 2013, nan]], dtype=object)"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.values"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Lab\n",
"\n",
"- What are the differences between `.iloc[ ]` and ``loc[ ]``?\n",
"- Within `.iloc[ ]` and ``loc[ ]``, when shall we (not) use `[ ]`? \n",
"- When doing indexing, what is the sign `:` used for?\n"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.17"
},
"rise": {
"auto_select": "first",
"autolaunch": false,
"enable_chalkboard": true,
"start_slideshow_at": "selected",
"theme": "black"
}
},
"nbformat": 4,
"nbformat_minor": 1
}