Feng Li
School of Statistics and Mathematics
Central University of Finance and Economics
Python is a terrific platform for statistical data analysis, partly because of the features of the language itself, but also because of a rich suite of third-party packages that provide robust and flexible data structures, efficient implementations of mathematical and statistical functions, and facilities for generating publication-quality graphics.
Pandas is at the top of the "scientific stack", because it allows data to be imported, manipulated and exported so easily. In contrast, NumPy supports the bottom of the stack with fundamental infrastructure for array operations, mathematical calculations, and random number generation.
We will cover both of these in some detail before getting down to the business of analyzing data.
Pandas is a Python package providing fast, flexible, and expressive data structures designed to work with both relational and labeled data. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.
Pandas is well suited for:
# Install pandas within a terminal
# ! pip3 install pandas -U
! pip install pandas -U
Looking in indexes: https://mirrors.163.com/pypi/simple/
Requirement already satisfied: pandas in /usr/local/lib/python3.9/dist-packages (1.3.4)
Collecting pandas
  Using cached https://mirrors.163.com/pypi/packages/48/b4/1081d66b71c4dfc1bc1e19d6f2abbf93ed42f69df7703eb323742d45423e/pandas-1.3.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.5 MB)
  Using cached https://mirrors.163.com/pypi/packages/03/ea/98d488a4047b3fd8075b5c1e00469ad42d715e2c1e4fd15fa1ffaef8d635/pandas-1.3.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.5 MB)
Requirement already satisfied: pytz>=2017.3 in /usr/lib/python3/dist-packages (from pandas) (2021.3)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/lib/python3/dist-packages (from pandas) (2.8.1)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.9/dist-packages (from pandas) (1.21.2)
import pandas as pd
A Series is a single vector of data (like a NumPy array) with an index that labels each element in the vector.
counts = pd.Series([632, 1638, 569, 115])
counts
0     632
1    1638
2     569
3     115
dtype: int64
The values of a Series are stored as a NumPy array, while the index is a pandas Index object.
counts.values
array([ 632, 1638, 569, 115])
counts.index
RangeIndex(start=0, stop=4, step=1)
bacteria = pd.Series([632, 1638, 569, 115],
                     index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])
bacteria
Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
dtype: int64
These labels can be used to refer to the values in the Series.
bacteria['Actinobacteria']
569
bacteria.index
Index(['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'], dtype='object')
[name.endswith('bacteria') for name in bacteria.index]
[False, True, True, False]
bacteria[[name.endswith('bacteria') for name in bacteria.index]]
Proteobacteria    1638
Actinobacteria     569
dtype: int64
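The list comprehension above builds the boolean mask one label at a time. The same mask can be sketched with the index's vectorized string methods. This is a minimal example that rebuilds the bacteria Series from the text so it runs on its own; the names mask and selected are introduced here for illustration only.

```python
import pandas as pd

# Rebuild the Series from the text so the example is self-contained.
bacteria = pd.Series(
    [632, 1638, 569, 115],
    index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'],
)

# Index.str applies string methods element-wise and returns a boolean array.
mask = bacteria.index.str.endswith('bacteria')

# Selecting with a boolean mask keeps each label attached to its value.
selected = bacteria[mask]
print(selected)
```

For a short index the two forms behave the same; the vectorized form is mostly a matter of readability.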
bacteria[0]
632
bacteria.name = 'counts'
bacteria.index.name = 'phylum'
bacteria
phylum
Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
Name: counts, dtype: int64
We can also filter according to the values in the Series:
bacteria[bacteria>1000]
phylum
Proteobacteria    1638
Name: counts, dtype: int64
A Series can be thought of as an ordered key-value store. In fact, we can create one from a dict:
bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}
pd.Series(bacteria_dict)
Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
dtype: int64
Notice that the Series keeps the keys in the order in which they appear in the dict (older versions of pandas sorted the keys instead). If we pass a custom index to Series, it will select the corresponding values from the dict, and treat indices without corresponding values as missing. Pandas uses the NaN (not a number) value to mark missing values.
bacteria2 = pd.Series(bacteria_dict, index=['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria'])
bacteria2
Cyanobacteria        NaN
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64
bacteria2.isnull()
Cyanobacteria      True
Firmicutes        False
Proteobacteria    False
Actinobacteria    False
dtype: bool
bacteria + bacteria2
Actinobacteria    1138.0
Bacteroidetes        NaN
Cyanobacteria        NaN
Firmicutes        1264.0
Proteobacteria    3276.0
dtype: float64
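The sum above is aligned on the labels, so any label that lacks a value on either side comes out as NaN. Two common ways of handling this are sketched below, assuming the same bacteria_dict as in the text; the names total, total_filled and total_complete are introduced here for illustration only.

```python
import pandas as pd

bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638,
                 'Actinobacteria': 569, 'Bacteroidetes': 115}
bacteria = pd.Series(bacteria_dict)
bacteria2 = pd.Series(bacteria_dict,
                      index=['Cyanobacteria', 'Firmicutes',
                             'Proteobacteria', 'Actinobacteria'])

# Plain addition aligns on labels; a value missing on either side gives NaN.
total = bacteria + bacteria2

# Series.add with fill_value=0 treats a value missing on only one side as 0.
# Cyanobacteria stays NaN because it has no value on either side.
total_filled = bacteria.add(bacteria2, fill_value=0)

# Alternatively, keep only the labels with a value on both sides.
total_complete = total.dropna()

print(total_filled)
print(total_complete)
```

Which option is appropriate depends on whether a missing count should be read as zero or as unknown.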
Inevitably, we want to be able to store, view and manipulate data that is multivariate, where for every index there are multiple fields or columns of data, often of varying data type.
A DataFrame is a tabular data structure, encapsulating multiple series like columns in a spreadsheet. Data are stored internally as a 2-dimensional object, but the DataFrame allows us to represent and manipulate higher-dimensional data.
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient':[1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria',
                               'Bacteroidetes', 'Firmicutes', 'Proteobacteria',
                               'Actinobacteria', 'Bacteroidetes']})
data
| | value | patient | phylum |
---|---|---|---|
0 | 632 | 1 | Firmicutes |
1 | 1638 | 1 | Proteobacteria |
2 | 569 | 1 | Actinobacteria |
3 | 115 | 1 | Bacteroidetes |
4 | 433 | 2 | Firmicutes |
5 | 1130 | 2 | Proteobacteria |
6 | 754 | 2 | Actinobacteria |
7 | 555 | 2 | Bacteroidetes |
Notice that the DataFrame keeps the columns in the order given in the dict (older versions of pandas sorted them by name). We can change the order by indexing them in the order we desire:
data[['phylum','value','patient']]
| | phylum | value | patient |
---|---|---|---|
0 | Firmicutes | 632 | 1 |
1 | Proteobacteria | 1638 | 1 |
2 | Actinobacteria | 569 | 1 |
3 | Bacteroidetes | 115 | 1 |
4 | Firmicutes | 433 | 2 |
5 | Proteobacteria | 1130 | 2 |
6 | Actinobacteria | 754 | 2 |
7 | Bacteroidetes | 555 | 2 |
A DataFrame has a second index, representing the columns:
data.columns
Index(['value', 'patient', 'phylum'], dtype='object')
data['value']
0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64
data.value
0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64
type(data.value)
pandas.core.series.Series
type(data[['value']])
pandas.core.frame.DataFrame
.iloc[ ]
Purely integer-location based indexing for selection by position.
data.iloc[0]
value            632
patient            1
phylum    Firmicutes
Name: 0, dtype: object
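A slice selects the rows at positions in the half-open interval [2, 5), so the stop position is excluded.
data.iloc[2:5]
| | value | patient | phylum |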
---|---|---|---|
2 | 569 | 1 | Actinobacteria |
3 | 115 | 1 | Bacteroidetes |
4 | 433 | 2 | Firmicutes |
data.iloc[lambda x: x.index % 2 == 0]
| | value | patient | phylum |
---|---|---|---|
0 | 632 | 1 | Firmicutes |
2 | 569 | 1 | Actinobacteria |
4 | 433 | 2 | Firmicutes |
6 | 754 | 2 | Actinobacteria |
data.iloc[[1,2]]
| | value | patient | phylum |
---|---|---|---|
1 | 1638 | 1 | Proteobacteria |
2 | 569 | 1 | Actinobacteria |
data.iloc[[True, False,True, False,True, False,True, False]]
| | value | patient | phylum |
---|---|---|---|
0 | 632 | 1 | Firmicutes |
2 | 569 | 1 | Actinobacteria |
4 | 433 | 2 | Firmicutes |
6 | 754 | 2 | Actinobacteria |
You can mix the indexer types for the index and columns. Use : to select the entire axis.
data.iloc[0, 1] # With scalar integers
1
data.iloc[[0, 4], [0, 2]] # With lists of integers.
| | value | phylum |
---|---|---|
0 | 632 | Firmicutes |
4 | 433 | Firmicutes |
data.iloc[1:3, 0:2]
| | value | patient |
---|---|---|
1 | 1638 | 1 |
2 | 569 | 1 |
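The cells above mix scalars, lists and slices, but none of them uses a bare : for a whole axis. A minimal sketch of that case follows, rebuilding the data frame from the text so the block runs on its own; the names all_rows and first_rows are introduced here for illustration only.

```python
import pandas as pd

data = pd.DataFrame({
    'value': [632, 1638, 569, 115, 433, 1130, 754, 555],
    'patient': [1, 1, 1, 1, 2, 2, 2, 2],
    'phylum': ['Firmicutes', 'Proteobacteria', 'Actinobacteria',
               'Bacteroidetes', 'Firmicutes', 'Proteobacteria',
               'Actinobacteria', 'Bacteroidetes'],
})

# ':' keeps every row; the integer list picks the first and third columns.
all_rows = data.iloc[:, [0, 2]]

# An integer slice for the rows combined with ':' keeps every column.
first_rows = data.iloc[0:2, :]

print(all_rows.shape, first_rows.shape)
```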
.loc[ ]
Access a group of rows and columns by label(s) or a boolean array.
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])
df
| | max_speed | shield |
---|---|---|
cobra | 1 | 2 |
viper | 4 | 5 |
sidewinder | 7 | 8 |
df.loc['viper'] # Single label returns the row as a Series.
max_speed    4
shield       5
Name: viper, dtype: int64
df.loc['cobra':'viper', 'max_speed']
cobra    1
viper    4
Name: max_speed, dtype: int64
df.loc[['viper', 'sidewinder']] # List of labels returns a DataFrame.
| | max_speed | shield |
---|---|---|
viper | 4 | 5 |
sidewinder | 7 | 8 |
df.loc[[False, False, True]]
| | max_speed | shield |
---|---|---|
sidewinder | 7 | 8 |
df.loc[df['shield'] > 6]
| | max_speed | shield |
---|---|---|
sidewinder | 7 | 8 |
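A boolean condition on the rows can be combined with a column label in a single .loc[ ] call. The sketch below assumes the same small df as above; the name fast_shields is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])

# Rows where max_speed exceeds 1, restricted to the 'shield' column.
# A single column label returns a Series.
fast_shields = df.loc[df['max_speed'] > 1, 'shield']
print(fast_shields)
```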
Slices can also be used with .loc[ ]. Note that both the start and stop of the slice are included.
data
| | value | patient | phylum |
---|---|---|---|
0 | 632 | 1 | Firmicutes |
1 | 1638 | 1 | Proteobacteria |
2 | 569 | 1 | Actinobacteria |
3 | 115 | 1 | Bacteroidetes |
4 | 433 | 2 | Firmicutes |
5 | 1130 | 2 | Proteobacteria |
6 | 754 | 2 | Actinobacteria |
7 | 555 | 2 | Bacteroidetes |
data.loc[3:5]
| | value | patient | phylum |
---|---|---|---|
3 | 115 | 1 | Bacteroidetes |
4 | 433 | 2 | Firmicutes |
5 | 1130 | 2 | Proteobacteria |
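The difference between label slices and position slices is easiest to see side by side. The sketch below rebuilds a reduced version of data and adds a copy with a shifted index; the names by_label, by_position and shifted are introduced here for illustration only.

```python
import pandas as pd

data = pd.DataFrame({
    'value': [632, 1638, 569, 115, 433, 1130, 754, 555],
    'patient': [1, 1, 1, 1, 2, 2, 2, 2],
})

# .loc slices by label and includes the stop label: rows 3, 4 and 5.
by_label = data.loc[3:5]

# .iloc slices by position and excludes the stop position: rows 3 and 4.
by_position = data.iloc[3:5]

# With an index that no longer starts at 0, labels and positions diverge.
shifted = data.copy()
shifted.index = shifted.index + 10

print(len(by_label), len(by_position))
print(shifted.loc[13:15].index.tolist())
print(shifted.iloc[3:5].index.tolist())
```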
# Set value for all items matching the list of labels
df.loc[['viper', 'sidewinder'], ['shield']] = 50
df
| | max_speed | shield |
---|---|---|
cobra | 1 | 2 |
viper | 4 | 50 |
sidewinder | 7 | 50 |
df.loc['cobra'] = 10 # Set value for an entire row
df
| | max_speed | shield |
---|---|---|
cobra | 10 | 10 |
viper | 4 | 50 |
sidewinder | 7 | 50 |
df.loc[:, 'max_speed'] = 30 # Set value for an entire column
df
| | max_speed | shield |
---|---|---|
cobra | 30 | 10 |
viper | 30 | 50 |
sidewinder | 30 | 50 |
vals = data.value
vals
0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64
vals[5] = 0
vals
/tmp/ipykernel_229928/1693880163.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  vals[5] = 0
0     632
1    1638
2     569
3     115
4     433
5       0
6     754
7     555
Name: value, dtype: int64
data
| | value | patient | phylum |
---|---|---|---|
0 | 632 | 1 | Firmicutes |
1 | 1638 | 1 | Proteobacteria |
2 | 569 | 1 | Actinobacteria |
3 | 115 | 1 | Bacteroidetes |
4 | 433 | 2 | Firmicutes |
5 | 0 | 2 | Proteobacteria |
6 | 754 | 2 | Actinobacteria |
7 | 555 | 2 | Bacteroidetes |
vals = data.value.copy()
vals[5] = 1000
data
data.value[3] = 14
data
/tmp/ipykernel_229928/2998967180.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.value[3] = 14
| | value | patient | phylum |
---|---|---|---|
0 | 632 | 1 | Firmicutes |
1 | 1638 | 1 | Proteobacteria |
2 | 569 | 1 | Actinobacteria |
3 | 14 | 1 | Bacteroidetes |
4 | 433 | 2 | Firmicutes |
5 | 0 | 2 | Proteobacteria |
6 | 754 | 2 | Actinobacteria |
7 | 555 | 2 | Bacteroidetes |
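The SettingWithCopyWarning messages above appear because the assignment goes through an intermediate object (vals or data.value), and whether the original DataFrame changes can depend on the pandas version. A sketch of two less ambiguous patterns follows, assuming a reduced version of data: take an explicit copy when the original must stay untouched, and use a single .loc[ ] call when it should change.

```python
import pandas as pd

data = pd.DataFrame({
    'value': [632, 1638, 569, 115, 433, 1130, 754, 555],
    'patient': [1, 1, 1, 1, 2, 2, 2, 2],
})

# Explicit copy: changes to vals never reach data.
vals = data['value'].copy()
vals[5] = 1000

# Single .loc call with row and column labels: modifies data itself.
data.loc[3, 'value'] = 14

print(data.loc[5, 'value'], vals[5], data.loc[3, 'value'])
```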
data['year'] = 2013
data
| | value | patient | phylum | year |
---|---|---|---|---|
0 | 632 | 1 | Firmicutes | 2013 |
1 | 1638 | 1 | Proteobacteria | 2013 |
2 | 569 | 1 | Actinobacteria | 2013 |
3 | 14 | 1 | Bacteroidetes | 2013 |
4 | 433 | 2 | Firmicutes | 2013 |
5 | 0 | 2 | Proteobacteria | 2013 |
6 | 754 | 2 | Actinobacteria | 2013 |
7 | 555 | 2 | Bacteroidetes | 2013 |
data.treatment = 1
data
| | value | patient | phylum | year |
---|---|---|---|---|
0 | 632 | 1 | Firmicutes | 2013 |
1 | 1638 | 1 | Proteobacteria | 2013 |
2 | 569 | 1 | Actinobacteria | 2013 |
3 | 14 | 1 | Bacteroidetes | 2013 |
4 | 433 | 2 | Firmicutes | 2013 |
5 | 0 | 2 | Proteobacteria | 2013 |
6 | 754 | 2 | Actinobacteria | 2013 |
7 | 555 | 2 | Bacteroidetes | 2013 |
data.treatment
1
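The output above suggests that assigning through an attribute sets a plain Python attribute on the object instead of creating a column. A quick check of that reading, using a small hypothetical frame (df_small and the two flag variables are introduced here for illustration only):

```python
import pandas as pd

df_small = pd.DataFrame({'value': [632, 1638]})

# Attribute assignment with a new name: no column is created.
df_small.treatment = 1
added_by_attribute = 'treatment' in df_small.columns

# Dict-style assignment creates the column.
df_small['treatment'] = 1
added_by_key = 'treatment' in df_small.columns

print(added_by_attribute, added_by_key)
```

For this reason dict-style assignment is the safer habit whenever a new column is being created.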
Assigning a Series as a new column causes its values to be added according to the DataFrame's index:
treatment = pd.Series([0]*4 + [1]*2)
treatment
0    0
1    0
2    0
3    0
4    1
5    1
dtype: int64
data['treatment'] = treatment
data
| | value | patient | phylum | year | treatment |
---|---|---|---|---|---|
0 | 632 | 1 | Firmicutes | 2013 | 0.0 |
1 | 1638 | 1 | Proteobacteria | 2013 | 0.0 |
2 | 569 | 1 | Actinobacteria | 2013 | 0.0 |
3 | 14 | 1 | Bacteroidetes | 2013 | 0.0 |
4 | 433 | 2 | Firmicutes | 2013 | 1.0 |
5 | 0 | 2 | Proteobacteria | 2013 | 1.0 |
6 | 754 | 2 | Actinobacteria | 2013 | NaN |
7 | 555 | 2 | Bacteroidetes | 2013 | NaN |
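Because the assignment is aligned on the index, the order of the values in the Series does not matter, only their labels. A minimal sketch, assuming a one-column version of data; the Series shuffled and the column name flag are hypothetical and introduced here for illustration only.

```python
import pandas as pd

data = pd.DataFrame({'value': [632, 1638, 569, 115, 433, 1130, 754, 555]})

# The Series covers labels 0..5 only, so rows 6 and 7 receive NaN
# and the new column is stored as floating point.
treatment = pd.Series([0] * 4 + [1] * 2)
data['treatment'] = treatment

# Values are matched by label, not by position in the Series.
shuffled = pd.Series([10, 20], index=[7, 0])
data['flag'] = shuffled

print(data)
```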
Other Python data structures (ones without an index) need to be the same length as the DataFrame. The following produces an error.
# month = ['Jan', 'Feb', 'Mar', 'Apr']
# data['month'] = month
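The commented-out cell can be run safely by catching the exception. The sketch below assumes a one-column version of data and stores the message in a hypothetical variable error_message; the exact wording of the message may differ between pandas versions.

```python
import pandas as pd

data = pd.DataFrame({'value': [632, 1638, 569, 115, 433, 1130, 754, 555]})
month = ['Jan', 'Feb', 'Mar', 'Apr']  # 4 values for 8 rows

error_message = None
try:
    data['month'] = month
except ValueError as err:
    # A plain list has no index to align on, so the lengths must match.
    error_message = str(err)
    print(error_message)

# Repeating the list so that it has 8 entries makes the assignment valid.
data['month'] = month * 2
print(data)
```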
We can extract the underlying data as a simple ndarray by accessing the values attribute:
data.values
array([[632, 1, 'Firmicutes', 2013, 0.0],
       [1638, 1, 'Proteobacteria', 2013, 0.0],
       [569, 1, 'Actinobacteria', 2013, 0.0],
       [14, 1, 'Bacteroidetes', 2013, 0.0],
       [433, 2, 'Firmicutes', 2013, 1.0],
       [0, 2, 'Proteobacteria', 2013, 1.0],
       [754, 2, 'Actinobacteria', 2013, nan],
       [555, 2, 'Bacteroidetes', 2013, nan]], dtype=object)
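The dtype=object in the output above is a consequence of mixing numbers and strings in one array. A sketch of how selecting only the numeric columns gives a numeric array, using the to_numpy method on a reduced version of data; the names mixed and numeric are introduced here for illustration only.

```python
import pandas as pd

data = pd.DataFrame({
    'value': [632, 1638, 569, 115, 433, 1130, 754, 555],
    'patient': [1, 1, 1, 1, 2, 2, 2, 2],
    'phylum': ['Firmicutes', 'Proteobacteria', 'Actinobacteria',
               'Bacteroidetes', 'Firmicutes', 'Proteobacteria',
               'Actinobacteria', 'Bacteroidetes'],
})

# Integers and strings together fall back to the generic object dtype.
mixed = data.to_numpy()

# Restricting to the integer columns yields an integer array.
numeric = data[['value', 'patient']].to_numpy()

print(mixed.dtype, numeric.dtype, numeric.shape)
```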
What is the difference between .iloc[ ] and .loc[ ]?
Given .iloc[ ] and .loc[ ], when shall we (not) use [ ]?
What is : used for?