Data Wrangling with Pandas¶

Feng Li

School of Statistics and Mathematics

Central University of Finance and Economics

feng.li@cufe.edu.cn

https://feng.li/python

Introduction to Pandas¶

  • Python is a terrific platform for statistical data analysis partly because of the features of the language itself, but also because of a rich suite of 3rd party packages that provide robust and flexible data structures, efficient implementations of mathematical and statistical functions, and facitities for generating publication-quality graphics.

  • Pandas is at the top of the "scientific stack", because it allows data to be imported, manipulated and exported so easily. In contrast, NumPy supports the bottom of the stack with fundamental infrastructure for array operations, mathematical calculations, and random number generation.

  • We will cover both of these in some detail before getting down to the business of analyzing data.

  • Pandas is a Python package providing fast, flexible, and expressive data structures designed to work with relational or labeled data both. It is a fundamental high-level building block for doing practical, real world data analysis in Python.

  • Pandas is well suited for:

    • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
    • Ordered and unordered (not necessarily fixed-frequency) time series data.
    • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
    • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

Key features of Pandas:¶

  • Easy handling of missing data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes
  • Robust IO tools for loading data from flat files, Excel files, databases, and HDF5
  • Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
In [1]:
# Install pandas within a terminal
# ! pip3 install pandas -U 
! pip install pandas -U 
Looking in indexes: https://mirrors.163.com/pypi/simple/
Requirement already satisfied: pandas in /usr/local/lib/python3.9/dist-packages (1.3.4)
Collecting pandas
  Using cached https://mirrors.163.com/pypi/packages/48/b4/1081d66b71c4dfc1bc1e19d6f2abbf93ed42f69df7703eb323742d45423e/pandas-1.3.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.5 MB)
  Using cached https://mirrors.163.com/pypi/packages/03/ea/98d488a4047b3fd8075b5c1e00469ad42d715e2c1e4fd15fa1ffaef8d635/pandas-1.3.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.5 MB)
Requirement already satisfied: pytz>=2017.3 in /usr/lib/python3/dist-packages (from pandas) (2021.3)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/lib/python3/dist-packages (from pandas) (2.8.1)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.9/dist-packages (from pandas) (1.21.2)
In [2]:
import pandas as pd

Pandas Data Structures: Series¶

A Series is a single vector of data (like a NumPy array) with an index that labels each element in the vector.

In [3]:
counts = pd.Series([632, 1638, 569, 115])
counts
Out[3]:
0     632
1    1638
2     569
3     115
dtype: int64
  • If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the Series, while the index is a pandas Index object.
In [4]:
counts.values
Out[4]:
array([ 632, 1638,  569,  115])
In [5]:
counts.index
Out[5]:
RangeIndex(start=0, stop=4, step=1)
  • We can assign meaningful labels to the index, if they are available:
In [6]:
bacteria = pd.Series([632, 1638, 569, 115], 
                     index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

bacteria
Out[6]:
Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
dtype: int64
  • These labels can be used to refer to the values in the Series.
In [7]:
bacteria['Actinobacteria']
Out[7]:
569
In [27]:
bacteria.index
Out[27]:
Index(['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'], dtype='object', name='phylum')
In [9]:
[name.endswith('bacteria') for name in bacteria.index]
Out[9]:
[False, True, True, False]
In [8]:
bacteria[[name.endswith('bacteria') for name in bacteria.index]]
Out[8]:
Proteobacteria    1638
Actinobacteria     569
dtype: int64
  • Notice that the indexing operation preserved the association between the values and the corresponding indices. We can still use positional indexing if we wish.
In [10]:
bacteria[0]
Out[10]:
632
  • We can give both the array of values and the index meaningful labels themselves:
In [11]:
bacteria.name = 'counts'
bacteria.index.name = 'phylum'
bacteria
Out[11]:
phylum
Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
Name: counts, dtype: int64
  • We can also filter according to the values in the Series:
In [12]:
bacteria[bacteria>1000]
Out[12]:
phylum
Proteobacteria    1638
Name: counts, dtype: int64
  • A Series can be thought of as an ordered key-value store. In fact, we can create one from a dict:
In [13]:
bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}
pd.Series(bacteria_dict)
Out[13]:
Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
dtype: int64
  • Notice that the Series is created in key-sorted order. If we pass a custom index to Series, it will select the corresponding values from the dict, and treat indices without corrsponding values as missing. Pandas uses the NaN (not a number) type for missing values.
In [14]:
bacteria2 = pd.Series(bacteria_dict, index=['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria'])
bacteria2
Out[14]:
Cyanobacteria        NaN
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64
In [15]:
bacteria2.isnull()
Out[15]:
Cyanobacteria      True
Firmicutes        False
Proteobacteria    False
Actinobacteria    False
dtype: bool
  • Critically, the labels are used to align data when used in operations with other Series objects:
In [16]:
bacteria + bacteria2
Out[16]:
Actinobacteria    1138.0
Bacteroidetes        NaN
Cyanobacteria        NaN
Firmicutes        1264.0
Proteobacteria    3276.0
dtype: float64

Pandas Data Structures: DataFrame¶

Inevitably, we want to be able to store, view and manipulate data that is multivariate, where for every index there are multiple fields or columns of data, often of varying data type.

A DataFrame is a tabular data structure, encapsulating multiple series like columns in a spreadsheet. Data are stored internally as a 2-dimensional object, but the DataFrame allows us to represent and manipulate higher-dimensional data.

In [17]:
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient':[1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
                               'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 
                               'Actinobacteria', 'Bacteroidetes']})
data
Out[17]:
value patient phylum
0 632 1 Firmicutes
1 1638 1 Proteobacteria
2 569 1 Actinobacteria
3 115 1 Bacteroidetes
4 433 2 Firmicutes
5 1130 2 Proteobacteria
6 754 2 Actinobacteria
7 555 2 Bacteroidetes

Indexing¶

  • Notice the DataFrame is sorted by column name. We can change the order by indexing them in the order we desire:
In [18]:
data[['phylum','value','patient']]
Out[18]:
phylum value patient
0 Firmicutes 632 1
1 Proteobacteria 1638 1
2 Actinobacteria 569 1
3 Bacteroidetes 115 1
4 Firmicutes 433 2
5 Proteobacteria 1130 2
6 Actinobacteria 754 2
7 Bacteroidetes 555 2
  • A DataFrame has a second index, representing the columns:
In [19]:
data.columns
Out[19]:
Index(['value', 'patient', 'phylum'], dtype='object')
  • If we wish to access columns, we can do so either by dict-like indexing or by attribute:
In [20]:
data['value']
Out[20]:
0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64
In [21]:
data.value
Out[21]:
0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64
In [22]:
type(data.value)
Out[22]:
pandas.core.series.Series
In [23]:
type(data[['value']])
Out[23]:
pandas.core.frame.DataFrame
  • .iloc[ ] Purely integer-location based indexing for selection by position.

    • With a scalar integer will retrieve the given row
In [30]:
data.iloc[0]
Out[30]:
value             632
patient             1
phylum     Firmicutes
Name: 0, dtype: object
  • With a slice object for row selection. Note that stop of the slice are not included [2, 5).
In [64]:
data.iloc[2:5]
Out[64]:
value patient phylum
2 569 1 Actinobacteria
3 115 1 Bacteroidetes
4 433 2 Firmicutes
  • List Comprehensions is also possible for row selection
In [36]:
data.iloc[lambda x: x.index % 2 == 0]
Out[36]:
value patient phylum
0 632 1 Firmicutes
2 569 1 Actinobacteria
4 433 2 Firmicutes
6 754 2 Actinobacteria
  • With a list of integers will retrieve the given row
In [31]:
data.iloc[[1,2]]
Out[31]:
value patient phylum
1 1638 1 Proteobacteria
2 569 1 Actinobacteria
  • With a boolean list mask the same length as the index.
In [35]:
data.iloc[[True, False,True, False,True, False,True, False]]
Out[35]:
value patient phylum
0 632 1 Firmicutes
2 569 1 Actinobacteria
4 433 2 Firmicutes
6 754 2 Actinobacteria
  • Indexing both axes

You can mix the indexer types for the index and columns. Use : to select the entire axis.

In [38]:
data.iloc[0, 1] # With scalar integers
Out[38]:
1
In [43]:
data.iloc[[0, 4], [0, 2]] # With lists of integers.
Out[43]:
value phylum
0 632 Firmicutes
4 433 Firmicutes
In [45]:
data.iloc[1:3, 0:2]
Out[45]:
value patient
1 1638 1
2 569 1
  • .loc[ ] Access a group of rows and columns by label(s) or a boolean array.
In [47]:
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])
df
Out[47]:
max_speed shield
cobra 1 2
viper 4 5
sidewinder 7 8
In [48]:
df.loc['viper']  # Single label returns the row as a Series.
Out[48]:
max_speed    4
shield       5
Name: viper, dtype: int64
  • Slice with labels for row and single label for column. Note that both the start and stop of the slice are included.
In [52]:
df.loc['cobra':'viper', 'max_speed']
Out[52]:
cobra    1
viper    4
Name: max_speed, dtype: int64
In [49]:
df.loc[['viper', 'sidewinder']] # List of labels returns a DataFrame.
Out[49]:
max_speed shield
viper 4 5
sidewinder 7 8
  • Boolean list with the same length as the row axis
In [53]:
df.loc[[False, False, True]]
Out[53]:
max_speed shield
sidewinder 7 8
  • Conditional that returns a boolean Series
In [55]:
df.loc[df['shield'] > 6]
Out[55]:
max_speed shield
sidewinder 7 8
  • Slice with integer labels for rows using .loc[ ]. Note that both the start and stop of the slice are included.
In [66]:
data
Out[66]:
value patient phylum
0 632 1 Firmicutes
1 1638 1 Proteobacteria
2 569 1 Actinobacteria
3 115 1 Bacteroidetes
4 433 2 Firmicutes
5 0 2 Proteobacteria
6 754 2 Actinobacteria
7 555 2 Bacteroidetes
In [68]:
data.loc[3:5]
Out[68]:
value patient phylum
3 115 1 Bacteroidetes
4 433 2 Firmicutes
5 0 2 Proteobacteria

Set values¶

In [60]:
# Set value for all items matching the list of labels
df.loc[['viper', 'sidewinder'], ['shield']] = 50
df
Out[60]:
max_speed shield
cobra 1 2
viper 4 50
sidewinder 7 50
In [61]:
df.loc['cobra'] = 10 # Set value for an entire row

df
Out[61]:
max_speed shield
cobra 10 10
viper 4 50
sidewinder 7 50
In [62]:
df.loc[:, 'max_speed'] = 30 # Set value for an entire column
df
Out[62]:
max_speed shield
cobra 30 10
viper 30 50
sidewinder 30 50
  • It is important to note that the Series returned when a DataFrame is indexted is merely a view on the DataFrame, and not a copy of the data itself. So you must be cautious when manipulating this data:
In [56]:
vals = data.value
vals
Out[56]:
0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64
In [57]:
vals[5] = 0
vals
/tmp/ipykernel_229928/1693880163.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  vals[5] = 0
Out[57]:
0     632
1    1638
2     569
3     115
4     433
5       0
6     754
7     555
Name: value, dtype: int64
In [58]:
data
Out[58]:
value patient phylum
0 632 1 Firmicutes
1 1638 1 Proteobacteria
2 569 1 Actinobacteria
3 115 1 Bacteroidetes
4 433 2 Firmicutes
5 0 2 Proteobacteria
6 754 2 Actinobacteria
7 555 2 Bacteroidetes
In [ ]:
vals = data.value.copy()
vals[5] = 1000
data
  • We can create or modify columns by assignment:
In [69]:
data.value[3] = 14
data
/tmp/ipykernel_229928/2998967180.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.value[3] = 14
Out[69]:
value patient phylum
0 632 1 Firmicutes
1 1638 1 Proteobacteria
2 569 1 Actinobacteria
3 14 1 Bacteroidetes
4 433 2 Firmicutes
5 0 2 Proteobacteria
6 754 2 Actinobacteria
7 555 2 Bacteroidetes
In [72]:
data['year'] = 2013
data
Out[72]:
value patient phylum year
0 632 1 Firmicutes 2013
1 1638 1 Proteobacteria 2013
2 569 1 Actinobacteria 2013
3 14 1 Bacteroidetes 2013
4 433 2 Firmicutes 2013
5 0 2 Proteobacteria 2013
6 754 2 Actinobacteria 2013
7 555 2 Bacteroidetes 2013
  • But note, we cannot use the attribute indexing method to add a new column:
In [73]:
data.treatment = 1
data
Out[73]:
value patient phylum year
0 632 1 Firmicutes 2013
1 1638 1 Proteobacteria 2013
2 569 1 Actinobacteria 2013
3 14 1 Bacteroidetes 2013
4 433 2 Firmicutes 2013
5 0 2 Proteobacteria 2013
6 754 2 Actinobacteria 2013
7 555 2 Bacteroidetes 2013
In [74]:
data.treatment
Out[74]:
1
  • Specifying a Series as a new columns cause its values to be added according to the DataFrame's index:
In [75]:
treatment = pd.Series([0]*4 + [1]*2)
treatment
Out[75]:
0    0
1    0
2    0
3    0
4    1
5    1
dtype: int64
In [76]:
data['treatment'] = treatment
data
Out[76]:
value patient phylum year treatment
0 632 1 Firmicutes 2013 0.0
1 1638 1 Proteobacteria 2013 0.0
2 569 1 Actinobacteria 2013 0.0
3 14 1 Bacteroidetes 2013 0.0
4 433 2 Firmicutes 2013 1.0
5 0 2 Proteobacteria 2013 1.0
6 754 2 Actinobacteria 2013 NaN
7 555 2 Bacteroidetes 2013 NaN
  • Other Python data structures (ones without an index) need to be the same length as the DataFrame. The following produces an error.
In [79]:
# month = ['Jan', 'Feb', 'Mar', 'Apr']
# data['month'] = month
  • We can extract the underlying data as a simple ndarray by accessing the values attribute:
In [80]:
data.values
Out[80]:
array([[632, 1, 'Firmicutes', 2013, 0.0],
       [1638, 1, 'Proteobacteria', 2013, 0.0],
       [569, 1, 'Actinobacteria', 2013, 0.0],
       [14, 1, 'Bacteroidetes', 2013, 0.0],
       [433, 2, 'Firmicutes', 2013, 1.0],
       [0, 2, 'Proteobacteria', 2013, 1.0],
       [754, 2, 'Actinobacteria', 2013, nan],
       [555, 2, 'Bacteroidetes', 2013, nan]], dtype=object)

Lab¶

  • What are the differences between .iloc[ ] and loc[ ]?
  • Within .iloc[ ] and loc[ ], when shall we (not) use [ ]?
  • When doing indexing, what is the sign : used for?