Manipulating DataFrames with Pandas¶

Feng Li

School of Statistics and Mathematics

Central University of Finance and Economics

feng.li@cufe.edu.cn

https://feng.li/python

Importing data¶

  • A key, but often under-appreciated, step in data analysis is importing the data that we wish to analyze.

  • Pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a DataFrame object. These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported.

Read external data¶

  • There are several other data formats that can be imported into Python and converted into DataFrames, with the help of built-in or third-party libraries.

  • These include

    • CSV: read_csv()
    • Excel: read_excel()
    • JSON: read_json()
    • Parquet Format: read_parquet()
    • Stata: read_stata()
    • ...

A full treatment of these functions is beyond the scope of this tutorial, but they are covered in https://pandas.pydata.org/docs/user_guide/io.html .

Let's start with some more bacteria data, stored in csv format.

In [1]:
! head data/microbiome.csv
Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,1174,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605
Firmicutes,6,693,50
Firmicutes,7,718,717
Firmicutes,8,173,33
Firmicutes,9,228,80
  • This table can be read into a DataFrame using read_csv:
In [2]:
import pandas as pd
mb = pd.read_csv("data/microbiome.csv")
mb 
Out[2]:
Taxon Patient Tissue Stool
0 Firmicutes 1 632 305
1 Firmicutes 2 136 4182
2 Firmicutes 3 1174 703
3 Firmicutes 4 408 3946
4 Firmicutes 5 831 8605
... ... ... ... ...
70 Other 11 203 6
71 Other 12 392 6
72 Other 13 28 25
73 Other 14 12 22
74 Other 15 305 32

75 rows × 4 columns

  • Notice that read_csv automatically considered the first row in the file to be a header row. We can override the default behavior by customizing some of the arguments, such as header, names or index_col.
In [3]:
pd.read_csv("data/microbiome.csv", header=None)
Out[3]:
0 1 2 3
0 Taxon Patient Tissue Stool
1 Firmicutes 1 632 305
2 Firmicutes 2 136 4182
3 Firmicutes 3 1174 703
4 Firmicutes 4 408 3946
... ... ... ... ...
71 Other 11 203 6
72 Other 12 392 6
73 Other 13 28 25
74 Other 14 12 22
75 Other 15 305 32

76 rows × 4 columns
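
  • Similarly, the names argument can supply custom column labels; a minimal sketch, skipping the existing header row (the lowercase labels are illustrative):
In [ ]:
pd.read_csv("data/microbiome.csv", skiprows=1, header=None,
            names=['taxon', 'patient', 'tissue', 'stool'])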

  • For a more useful index, we can specify the first two columns, which together provide a unique index to the data.
In [4]:
mb = pd.read_csv("data/microbiome.csv", index_col=['Taxon','Patient'])
mb
Out[4]:
Tissue Stool
Taxon Patient
Firmicutes 1 632 305
2 136 4182
3 1174 703
4 408 3946
5 831 8605
... ... ... ...
Other 11 203 6
12 392 6
13 28 25
14 12 22
15 305 32

75 rows × 2 columns

  • This is called a hierarchical index, which we will revisit later in the tutorial.
  • If we have sections of data that we do not wish to import (for example, known bad data), we can pass the offending row numbers to the skiprows argument. This is useful for large datasets.
In [5]:
pd.read_csv("data/microbiome.csv", skiprows=[3,4,6])
Out[5]:
Taxon Patient Tissue Stool
0 Firmicutes 1 632 305
1 Firmicutes 2 136 4182
2 Firmicutes 5 831 8605
3 Firmicutes 7 718 717
4 Firmicutes 8 173 33
... ... ... ... ...
67 Other 11 203 6
68 Other 12 392 6
69 Other 13 28 25
70 Other 14 12 22
71 Other 15 305 32

72 rows × 4 columns

  • Conversely, if we only want to import a small number of rows from, say, a very large data file, we can use nrows to retrieve only that many rows from the top of the file.
In [6]:
pd.read_csv("data/microbiome.csv", nrows=4)
Out[6]:
Taxon Patient Tissue Stool
0 Firmicutes 1 632 305
1 Firmicutes 2 136 4182
2 Firmicutes 3 1174 703
3 Firmicutes 4 408 3946
  • Alternatively, if we want to process our data in reasonably sized chunks, the chunksize argument will return an iterable object that can be employed in a data-processing loop. For example, our microbiome data are organized by bacterial phylum, with 15 patients represented in each:
In [7]:
data_chunks = pd.read_csv("data/microbiome.csv", chunksize=15)
data_chunks
Out[7]:
<pandas.io.parsers.readers.TextFileReader at 0x7f4741161460>
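
  • The reader object is consumed by iterating over it; a minimal sketch of a chunk-wise processing loop (the per-chunk mean is just illustrative):
In [ ]:
chunk_means = []
for chunk in pd.read_csv("data/microbiome.csv", chunksize=15):
    # each chunk is an ordinary DataFrame holding up to 15 rows
    chunk_means.append(chunk['Tissue'].mean())
chunk_means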

Missing values¶

  • Most real-world data is incomplete, with values missing due to incomplete observation, data entry or transcription error, or other reasons. Pandas will automatically recognize and parse common missing data indicators, including NA and NULL.
In [8]:
!head data/microbiome_missing.csv
Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605
Firmicutes,6,693,50
Firmicutes,7,718,717
Firmicutes,8,173,33
Firmicutes,9,228,NA
In [9]:
import pandas as pd
pd.read_csv("data/microbiome_missing.csv").head(20)
Out[9]:
Taxon Patient Tissue Stool
0 Firmicutes 1 632 305.0
1 Firmicutes 2 136 4182.0
2 Firmicutes 3 NaN 703.0
3 Firmicutes 4 408 3946.0
4 Firmicutes 5 831 8605.0
5 Firmicutes 6 693 50.0
6 Firmicutes 7 718 717.0
7 Firmicutes 8 173 33.0
8 Firmicutes 9 228 NaN
9 Firmicutes 10 162 3196.0
10 Firmicutes 11 372 -99999.0
11 Firmicutes 12 4255 4361.0
12 Firmicutes 13 107 1667.0
13 Firmicutes 14 ? 223.0
14 Firmicutes 15 281 2377.0
15 Proteobacteria 1 1638 3886.0
16 Proteobacteria 2 2469 1821.0
17 Proteobacteria 3 839 661.0
18 Proteobacteria 4 4414 18.0
19 Proteobacteria 5 12044 83.0

Above, Pandas recognized NA and an empty field as missing data.

In [10]:
pd.isnull(pd.read_csv("data/microbiome_missing.csv")).head(20)
Out[10]:
Taxon Patient Tissue Stool
0 False False False False
1 False False False False
2 False False True False
3 False False False False
4 False False False False
5 False False False False
6 False False False False
7 False False False False
8 False False False True
9 False False False False
10 False False False False
11 False False False False
12 False False False False
13 False False False False
14 False False False False
15 False False False False
16 False False False False
17 False False False False
18 False False False False
19 False False False False

Unfortunately, there will sometimes be inconsistency with the conventions for missing data. In this example, there is a question mark "?" and a large negative number where there should have been a positive integer. We can specify additional symbols with the na_values argument:

In [11]:
pd.read_csv("data/microbiome_missing.csv", na_values=['?', -99999]).head(20)
Out[11]:
Taxon Patient Tissue Stool
0 Firmicutes 1 632.0 305.0
1 Firmicutes 2 136.0 4182.0
2 Firmicutes 3 NaN 703.0
3 Firmicutes 4 408.0 3946.0
4 Firmicutes 5 831.0 8605.0
5 Firmicutes 6 693.0 50.0
6 Firmicutes 7 718.0 717.0
7 Firmicutes 8 173.0 33.0
8 Firmicutes 9 228.0 NaN
9 Firmicutes 10 162.0 3196.0
10 Firmicutes 11 372.0 NaN
11 Firmicutes 12 4255.0 4361.0
12 Firmicutes 13 107.0 1667.0
13 Firmicutes 14 NaN 223.0
14 Firmicutes 15 281.0 2377.0
15 Proteobacteria 1 1638.0 3886.0
16 Proteobacteria 2 2469.0 1821.0
17 Proteobacteria 3 839.0 661.0
18 Proteobacteria 4 4414.0 18.0
19 Proteobacteria 5 12044.0 83.0

These can be specified on a column-wise basis using an appropriate dict as the argument for na_values.
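
For example, a minimal sketch that treats '?' as missing only in Tissue and -99999 only in Stool:
In [ ]:
pd.read_csv("data/microbiome_missing.csv",
            na_values={'Tissue': ['?'], 'Stool': [-99999]}).head(20)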

Manipulating indices¶

Reindexing allows users to manipulate the data labels in a DataFrame. It forces a DataFrame to conform to the new index, optionally filling in missing data if requested.

For some variety, we will leave our digestive tract bacteria behind and employ some baseball data.

  • Specify a unique index
In [12]:
baseball = pd.read_csv("data/baseball.csv", index_col='id')
baseball
Out[12]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
id
88641 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0
88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0
88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
88649 helliri01 2006 1 MIL NL 20 3 0 0 0 ... 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0
88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... 0.0 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
89525 benitar01 2007 2 FLO NL 34 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
89526 benitar01 2007 1 SFN NL 19 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0
89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0
89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0

100 rows × 22 columns

  • Notice that we specified the id column as the index, since it appears to be a unique identifier. We could try to create a unique index ourselves by combining player and year:
In [13]:
player_id = baseball.player + baseball.year.astype(str)
baseball_newind = baseball.copy()
baseball_newind.index = player_id
baseball_newind
Out[13]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
womacto012006 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0
schilcu012006 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0
myersmi012006 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
helliri012006 helliri01 2006 1 MIL NL 20 3 0 0 0 ... 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0
johnsra052006 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... 0.0 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
benitar012007 benitar01 2007 2 FLO NL 34 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
benitar012007 benitar01 2007 1 SFN NL 19 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
ausmubr012007 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0
aloumo012007 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0
alomasa022007 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0

100 rows × 22 columns

In [14]:
baseball_newind.index.is_unique
Out[14]:
False
  • So, indices need not be unique. Our choice is not unique because some players change teams within years. The most important consequence of a non-unique index is that indexing by label will return multiple values for some labels:
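  • For example, selecting the duplicated label below returns both of that player's 2007 stints (a minimal sketch):
In [ ]:
# 'benitar012007' occurs twice because the player changed teams during 2007
baseball_newind.loc['benitar012007']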
In [15]:
reverse_index = baseball.index[::-1]
baseball.reindex(reverse_index)
Out[15]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
id
89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0
89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0
89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0
89526 benitar01 2007 1 SFN NL 19 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
89525 benitar01 2007 2 FLO NL 34 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... 0.0 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0
88649 helliri01 2006 1 MIL NL 20 3 0 0 0 ... 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0
88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0
88641 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0

100 rows × 22 columns

  • Notice that the id index is not sequential. Say we wanted to populate the table with every id value. We could specify an index that is a sequence from the first to the last id numbers in the database, and Pandas would fill in the missing data with NaN values:
In [16]:
id_range = range(baseball.index.values.min(), baseball.index.values.max())
baseball.reindex(id_range)
Out[16]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
id
88641 womacto01 2006.0 2.0 CHN NL 19.0 50.0 6.0 14.0 1.0 ... 2.0 1.0 1.0 4.0 4.0 0.0 0.0 3.0 0.0 0.0
88642 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
88643 schilcu01 2006.0 1.0 BOS AL 31.0 2.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
88644 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
88645 myersmi01 2006.0 1.0 NYA AL 62.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
89529 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
89530 ausmubr01 2007.0 1.0 HOU NL 117.0 349.0 38.0 82.0 16.0 ... 25.0 6.0 1.0 37.0 74.0 3.0 6.0 4.0 1.0 11.0
89531 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
89532 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
89533 aloumo01 2007.0 1.0 NYN NL 87.0 328.0 51.0 112.0 19.0 ... 49.0 3.0 0.0 27.0 30.0 5.0 2.0 0.0 3.0 13.0

893 rows × 22 columns

In [17]:
baseball.reindex(id_range, fill_value='mr.nobody', columns=['player'])
Out[17]:
player
id
88641 womacto01
88642 mr.nobody
88643 schilcu01
88644 mr.nobody
88645 myersmi01
... ...
89529 mr.nobody
89530 ausmubr01
89531 mr.nobody
89532 mr.nobody
89533 aloumo01

893 rows × 1 columns

  • The index can also be sorted:
In [18]:
baseball_newind.sort_index()
Out[18]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
alomasa022007 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0
aloumo012007 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0
ausmubr012007 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0
benitar012007 benitar01 2007 2 FLO NL 34 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
benitar012007 benitar01 2007 1 SFN NL 19 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
wickmbo012007 wickmbo01 2007 1 ATL NL 47 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
williwo022007 williwo02 2007 1 HOU NL 33 59 3 6 0 ... 2.0 0.0 0.0 0 25.0 0.0 0.0 5.0 0.0 1.0
witasja012007 witasja01 2007 1 TBA AL 3 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
womacto012006 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0
zaungr012007 zaungr01 2007 1 TOR AL 110 331 43 80 24 ... 52.0 0.0 0.0 51 55.0 8.0 2.0 1.0 6.0 9.0

100 rows × 22 columns

In [19]:
baseball_newind.sort_index(ascending=False)
Out[19]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
zaungr012007 zaungr01 2007 1 TOR AL 110 331 43 80 24 ... 52.0 0.0 0.0 51 55.0 8.0 2.0 1.0 6.0 9.0
womacto012006 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0
witasja012007 witasja01 2007 1 TBA AL 3 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
williwo022007 williwo02 2007 1 HOU NL 33 59 3 6 0 ... 2.0 0.0 0.0 0 25.0 0.0 0.0 5.0 0.0 1.0
wickmbo012007 wickmbo01 2007 1 ATL NL 47 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
benitar012007 benitar01 2007 2 FLO NL 34 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
benitar012007 benitar01 2007 1 SFN NL 19 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
ausmubr012007 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0
aloumo012007 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0
alomasa022007 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0

100 rows × 22 columns

In [20]:
baseball_newind.sort_index(axis=1)
Out[20]:
X2b X3b ab bb cs g gidp h hbp hr ... player r rbi sb sf sh so stint team year
womacto012006 1 0 50 4 1.0 19 0.0 14 0.0 1 ... womacto01 6 2.0 1.0 0.0 3.0 4.0 2 CHN 2006
schilcu012006 0 0 2 0 0.0 31 0.0 1 0.0 0 ... schilcu01 0 0.0 0.0 0.0 0.0 1.0 1 BOS 2006
myersmi012006 0 0 0 0 0.0 62 0.0 0 0.0 0 ... myersmi01 0 0.0 0.0 0.0 0.0 0.0 1 NYA 2006
helliri012006 0 0 3 0 0.0 20 0.0 0 0.0 0 ... helliri01 0 0.0 0.0 0.0 0.0 2.0 1 MIL 2006
johnsra052006 0 0 6 0 0.0 33 0.0 1 0.0 0 ... johnsra05 0 0.0 0.0 0.0 0.0 4.0 1 NYA 2006
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
benitar012007 0 0 0 0 0.0 34 0.0 0 0.0 0 ... benitar01 0 0.0 0.0 0.0 0.0 0.0 2 FLO 2007
benitar012007 0 0 0 0 0.0 19 0.0 0 0.0 0 ... benitar01 0 0.0 0.0 0.0 0.0 0.0 1 SFN 2007
ausmubr012007 16 3 349 37 1.0 117 11.0 82 6.0 3 ... ausmubr01 38 25.0 6.0 1.0 4.0 74.0 1 HOU 2007
aloumo012007 19 1 328 27 0.0 87 13.0 112 2.0 13 ... aloumo01 51 49.0 3.0 3.0 0.0 30.0 1 NYN 2007
alomasa022007 1 0 22 0 0.0 8 0.0 3 0.0 0 ... alomasa02 1 0.0 0.0 0.0 0.0 3.0 1 NYN 2007

100 rows × 22 columns

  • Ranking does not re-arrange the data, but instead returns a Series of ranks, giving each value's rank relative to the others in the Series.
In [21]:
baseball.hr.rank()
Out[21]:
id
88641    62.5
88643    29.0
88645    29.0
88649    29.0
88650    29.0
         ... 
89525    29.0
89526    29.0
89530    71.5
89533    88.0
89534    29.0
Name: hr, Length: 100, dtype: float64
  • Alternatively, you can break ties via one of several methods, such as by the order in which they occur in the dataset:
In [22]:
baseball.hr.rank(method='first')
Out[22]:
id
88641    58.0
88643     1.0
88645     2.0
88649     3.0
88650     4.0
         ... 
89525    55.0
89526    56.0
89530    72.0
89533    88.0
89534    57.0
Name: hr, Length: 100, dtype: float64
  • Calling the DataFrame's rank method results in the ranks of all columns:
In [23]:
baseball.rank(ascending=False)
Out[23]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
id
88641 2.0 96.5 7.0 82.0 31.5 70.0 47.5 40.5 39.0 50.5 ... 51.0 24.5 17.5 44.5 59.0 66.0 65.5 16.0 70.0 76.5
88643 37.5 96.5 57.0 88.0 81.5 55.5 73.0 81.0 63.5 78.0 ... 78.5 63.5 62.5 79.0 73.0 66.0 65.5 67.5 70.0 76.5
88645 47.5 96.5 57.0 40.5 81.5 36.0 91.0 81.0 84.5 78.0 ... 78.5 63.5 62.5 79.0 89.0 66.0 65.5 67.5 70.0 76.5
88649 66.0 96.5 57.0 47.0 31.5 67.5 69.0 81.0 84.5 78.0 ... 78.5 63.5 62.5 79.0 67.0 66.0 65.5 67.5 70.0 76.5
88650 61.5 96.5 57.0 40.5 81.5 51.0 64.5 81.0 63.5 78.0 ... 78.5 63.5 62.5 79.0 59.0 66.0 65.5 67.5 70.0 76.5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
89525 96.5 46.5 7.0 64.0 31.5 47.0 91.0 81.0 84.5 78.0 ... 78.5 63.5 62.5 79.0 89.0 66.0 65.5 67.5 70.0 76.5
89526 96.5 46.5 57.0 13.5 31.5 70.0 91.0 81.0 84.5 78.0 ... 78.5 63.5 62.5 79.0 89.0 66.0 65.5 67.5 70.0 76.5
89530 98.0 46.5 57.0 61.5 31.5 17.5 19.0 24.0 23.0 21.5 ... 27.0 7.0 17.5 18.5 10.0 18.0 6.5 12.0 33.5 14.0
89533 99.0 46.5 57.0 31.5 31.5 23.0 22.0 18.5 14.0 17.5 ... 18.0 14.0 62.5 22.0 27.0 11.0 21.0 67.5 15.5 10.5
89534 100.0 46.5 57.0 31.5 31.5 77.0 57.0 58.0 58.0 50.5 ... 78.5 63.5 62.5 79.0 63.5 66.0 65.5 67.5 70.0 76.5

100 rows × 22 columns

In [24]:
baseball[['r','h','hr']].rank(ascending=False)
Out[24]:
r h hr
id
88641 40.5 39.0 38.5
88643 81.0 63.5 72.0
88645 81.0 84.5 72.0
88649 81.0 84.5 72.0
88650 81.0 63.5 72.0
... ... ... ...
89525 81.0 84.5 72.0
89526 81.0 84.5 72.0
89530 24.0 23.0 29.5
89533 18.5 14.0 13.0
89534 58.0 58.0 72.0

100 rows × 3 columns

Remove rows or columns via the drop method:¶

In [25]:
baseball.shape
Out[25]:
(100, 22)
In [26]:
baseball.drop([89525, 89526]) # does not modify the original DataFrame
Out[26]:
player year stint team lg g ab r h X2b ... rbi sb cs bb so ibb hbp sh sf gidp
id
88641 womacto01 2006 2 CHN NL 19 50 6 14 1 ... 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0
88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 ... 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0
88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
88649 helliri01 2006 1 MIL NL 20 3 0 0 0 ... 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0
88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 ... 0.0 0.0 0.0 0 4.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
89521 bondsba01 2007 1 SFN NL 126 340 75 94 14 ... 66.0 5.0 0.0 132 54.0 43.0 3.0 0.0 2.0 13.0
89523 biggicr01 2007 1 HOU NL 141 517 68 130 31 ... 50.0 4.0 3.0 23 112.0 0.0 3.0 7.0 5.0 5.0
89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 ... 25.0 6.0 1.0 37 74.0 3.0 6.0 4.0 1.0 11.0
89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 ... 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0
89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 ... 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0

98 rows × 22 columns

In [27]:
baseball.shape
Out[27]:
(100, 22)
In [28]:
baseball.drop(['ibb','hbp'], axis=1) # in Pandas, axis=0 refers to rows and axis=1 to columns
Out[28]:
player year stint team lg g ab r h X2b X3b hr rbi sb cs bb so sh sf gidp
id
88641 womacto01 2006 2 CHN NL 19 50 6 14 1 0 1 2.0 1.0 1.0 4 4.0 3.0 0.0 0.0
88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 0 0 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0
88645 myersmi01 2006 1 NYA AL 62 0 0 0 0 0 0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0
88649 helliri01 2006 1 MIL NL 20 3 0 0 0 0 0 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0
88650 johnsra05 2006 1 NYA AL 33 6 0 1 0 0 0 0.0 0.0 0.0 0 4.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
89525 benitar01 2007 2 FLO NL 34 0 0 0 0 0 0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0
89526 benitar01 2007 1 SFN NL 19 0 0 0 0 0 0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0
89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 3 3 25.0 6.0 1.0 37 74.0 4.0 1.0 11.0
89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 1 13 49.0 3.0 0.0 27 30.0 0.0 3.0 13.0
89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 0 0 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0

100 rows × 20 columns

Join two Pandas DataFrames¶

DataFrame and Series objects allow for several operations to take place either on a single object, or between two or more objects.

For example, we can perform arithmetic on the elements of two objects, such as combining baseball statistics across years:

In [29]:
hr2006 = baseball[baseball.year==2006].xs('hr', axis=1)
hr2006.index = baseball.player[baseball.year==2006]

hr2007 = baseball[baseball.year==2007].xs('hr', axis=1)
hr2007.index = baseball.player[baseball.year==2007]
In [30]:
hr2006 = pd.Series(baseball.hr[baseball.year==2006].values, index=baseball.player[baseball.year==2006])
hr2006
Out[30]:
player
womacto01     1
schilcu01     0
myersmi01     0
helliri01     0
johnsra05     0
finlest01     6
gonzalu01    15
seleaa01      0
dtype: int64
In [31]:
hr2007 = pd.Series(baseball.hr[baseball.year==2007].values, index=baseball.player[baseball.year==2007])
hr2007
Out[31]:
player
francju01     0
francju01     1
zaungr01     10
witasja01     0
williwo02     1
             ..
benitar01     0
benitar01     0
ausmubr01     3
aloumo01     13
alomasa02     0
Length: 92, dtype: int64
In [32]:
hr_total = hr2006 + hr2007 
hr_total
Out[32]:
player
alomasa02   NaN
aloumo01    NaN
ausmubr01   NaN
benitar01   NaN
benitar01   NaN
             ..
wickmbo01   NaN
williwo02   NaN
witasja01   NaN
womacto01   NaN
zaungr01    NaN
Length: 94, dtype: float64
  • Pandas' data alignment places NaN values for labels that do not overlap in the two Series. In fact, there are only 6 players that occur in both years.
In [33]:
hr_total[hr_total.notnull()]
Out[33]:
player
finlest01     7.0
gonzalu01    30.0
johnsra05     0.0
myersmi01     0.0
schilcu01     0.0
seleaa01      0.0
dtype: float64

Merging and joining DataFrame objects¶

In this section, we will manipulate data collected from ocean-going vessels on the eastern seaboard. Vessel operations are monitored using the Automatic Identification System (AIS), a safety at sea navigation technology which vessels are required to maintain and that uses transponders to transmit very high frequency (VHF) radio signals containing static information including ship name, call sign, and country of origin, as well as dynamic information unique to a particular voyage such as vessel location, heading, and speed.

The International Maritime Organization’s (IMO) International Convention for the Safety of Life at Sea requires functioning AIS capabilities on all vessels 300 gross tons or greater and the US Coast Guard requires AIS on nearly all vessels sailing in U.S. waters. The Coast Guard has established a national network of AIS receivers that provides coverage of nearly all U.S. waters. AIS signals are transmitted several times each minute and the network is capable of handling thousands of reports per minute and updates as often as every two seconds. Therefore, a typical voyage in our study might include the transmission of hundreds or thousands of AIS encoded signals. This provides a rich source of spatial data that includes both spatial and temporal information.

For our purposes, we will use summarized data that describes the transit of a given vessel through a particular administrative area. The data includes the start and end time of the transit segment, as well as information about the speed of the vessel, how far it travelled, etc.

In [34]:
segments = pd.read_csv("data/AIS/transit_segments.csv")
segments
Out[34]:
mmsi name transit segment seg_length avg_sog min_sog max_sog pdgt10 st_time end_time
0 1 Us Govt Ves 1 1 5.1 13.2 9.2 14.5 96.5 2/10/09 16:03 2/10/09 16:27
1 1 Dredge Capt Frank 1 1 13.5 18.6 10.4 20.6 100.0 4/6/09 14:31 4/6/09 15:20
2 1 Us Gov Vessel 1 1 4.3 16.2 10.3 20.5 100.0 4/6/09 14:36 4/6/09 14:55
3 1 Us Gov Vessel 2 1 9.2 15.4 14.5 16.1 100.0 4/10/09 17:58 4/10/09 18:34
4 1 Dredge Capt Frank 2 1 9.2 15.4 14.6 16.2 100.0 4/10/09 17:59 4/10/09 18:35
... ... ... ... ... ... ... ... ... ... ... ...
262521 999999999 Triple Attraction 3 1 5.3 20.0 19.6 20.4 100.0 6/15/10 12:49 6/15/10 13:05
262522 999999999 Triple Attraction 4 1 18.7 19.2 18.4 19.9 100.0 6/15/10 21:32 6/15/10 22:29
262523 999999999 Triple Attraction 6 1 17.4 17.0 14.7 18.4 100.0 6/17/10 19:16 6/17/10 20:17
262524 999999999 Triple Attraction 7 1 31.5 14.2 13.4 15.1 100.0 6/18/10 2:52 6/18/10 5:03
262525 999999999 Triple Attraction 8 1 19.8 18.6 16.1 19.5 100.0 6/18/10 10:19 6/18/10 11:22

262526 rows × 11 columns

  • In addition to the behavior of each vessel, we may want a little more information regarding the vessels themselves. In the data/AIS folder there is a second table that contains information about each of the ships that traveled the segments in the segments table.
In [35]:
vessels = pd.read_csv("data/AIS/vessel_information.csv", index_col='mmsi')
vessels
Out[35]:
num_names names sov flag flag_type num_loas loa max_loa num_types type
mmsi
1 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 156.0 4 Dredging/MilOps/Reserved/Towing
9 3 000000009/Raven/Shearwater N Unknown Unknown 2 50.0/62.0 62.0 2 Pleasure/Tug
21 1 Us Gov Vessel Y Unknown Unknown 1 208.0 208.0 1 Unknown
74 2 Mcfaul/Sarah Bell N Unknown Unknown 1 155.0 155.0 1 Unknown
103 3 Ron G/Us Navy Warship 103/Us Warship 103 Y Unknown Unknown 2 26.0/155.0 155.0 2 Tanker/Unknown
... ... ... ... ... ... ... ... ... ... ...
919191919 1 Oi N Unknown Unknown 1 20.0 20.0 1 Pleasure
967191190 1 Pathfinder N Unknown Unknown 1 31.0 31.0 2 BigTow/Towing
975318642 1 Island Express N Unknown Unknown 1 20.0 20.0 1 Towing
987654321 2 Island Lookout/Island Tide N Unknown Unknown 2 22.0/23.0 23.0 2 Fishing/Towing
999999999 1 Triple Attraction N Unknown Unknown 1 30.0 30.0 1 Pleasure

10771 rows × 10 columns

In [36]:
vessels.type.value_counts()
Out[36]:
Cargo                      5622
Tanker                     2440
Pleasure                    601
Tug                         221
Sailing                     205
                           ... 
AntiPol/Other                 1
Fishing/Law                   1
Cargo/Other/Towing            1
Cargo/Fishing                 1
Fishing/Reserved/Towing       1
Name: type, Length: 206, dtype: int64
  • The challenge, however, is that several ships have travelled multiple segments, so there is not a one-to-one relationship between the rows of the two tables. The table of vessel information has a one-to-many relationship with the segments.

  • In Pandas, we can combine tables according to the value of one or more keys that are used to identify rows, much like an index. Using a trivial example:

In [40]:
import numpy as np
df1 = pd.DataFrame(dict(id=range(4), age=np.random.randint(18, 31, size=4)))
df2 = pd.DataFrame(dict(id=range(6), score=np.random.random(size=6)))
In [41]:
pd.merge(df1, df2)
Out[41]:
id age score
0 0 24 0.666117
1 1 23 0.249913
2 2 26 0.261396
3 3 30 0.651306
  • Notice that without any information about which column to use as a key, Pandas did the right thing and used the id column in both tables. Unless specified otherwise, merge will use any common column names as keys for merging the tables.

  • By default, merge performs an inner join on the tables, meaning that the merged table represents an intersection of the two tables.

In [42]:
pd.merge(df1, df2, how='outer')
Out[42]:
id age score
0 0 24.0 0.666117
1 1 23.0 0.249913
2 2 26.0 0.261396
3 3 30.0 0.651306
4 4 NaN 0.468143
5 5 NaN 0.461524
  • The outer join above yields the union of the two tables, so all rows are represented, with missing values inserted as appropriate. One can also perform left and right joins to include all rows of the left or right table (i.e. the first or second argument to merge, respectively), but not necessarily the other.
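
  • For instance, a left join keeps every row of the first (left) table and fills in NaN where the second has no match; a minimal sketch using the toy tables above:
In [ ]:
# keep every row of df2; age is NaN for ids that df1 does not contain
pd.merge(df2, df1, how='left')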

  • Looking at the two datasets that we wish to merge:

In [43]:
segments.head(1)
Out[43]:
mmsi name transit segment seg_length avg_sog min_sog max_sog pdgt10 st_time end_time
0 1 Us Govt Ves 1 1 5.1 13.2 9.2 14.5 96.5 2/10/09 16:03 2/10/09 16:27
In [44]:
vessels.head(1)
Out[44]:
num_names names sov flag flag_type num_loas loa max_loa num_types type
mmsi
1 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 156.0 4 Dredging/MilOps/Reserved/Towing
  • We see that there is an mmsi value (a vessel identifier) in each table, but it is used as the index of the vessels table. In this case, we have to specify joining on the index for the vessels table and on the mmsi column for the segments table.
In [45]:
segments_merged = pd.merge(vessels, segments, left_index=True, right_on='mmsi')
In [46]:
segments_merged
Out[46]:
num_names names sov flag flag_type num_loas loa max_loa num_types type ... name transit segment seg_length avg_sog min_sog max_sog pdgt10 st_time end_time
0 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 156.0 4 Dredging/MilOps/Reserved/Towing ... Us Govt Ves 1 1 5.1 13.2 9.2 14.5 96.5 2/10/09 16:03 2/10/09 16:27
1 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 156.0 4 Dredging/MilOps/Reserved/Towing ... Dredge Capt Frank 1 1 13.5 18.6 10.4 20.6 100.0 4/6/09 14:31 4/6/09 15:20
2 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 156.0 4 Dredging/MilOps/Reserved/Towing ... Us Gov Vessel 1 1 4.3 16.2 10.3 20.5 100.0 4/6/09 14:36 4/6/09 14:55
3 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 156.0 4 Dredging/MilOps/Reserved/Towing ... Us Gov Vessel 2 1 9.2 15.4 14.5 16.1 100.0 4/10/09 17:58 4/10/09 18:34
4 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 156.0 4 Dredging/MilOps/Reserved/Towing ... Dredge Capt Frank 2 1 9.2 15.4 14.6 16.2 100.0 4/10/09 17:59 4/10/09 18:35
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
262521 1 Triple Attraction N Unknown Unknown 1 30.0 30.0 1 Pleasure ... Triple Attraction 3 1 5.3 20.0 19.6 20.4 100.0 6/15/10 12:49 6/15/10 13:05
262522 1 Triple Attraction N Unknown Unknown 1 30.0 30.0 1 Pleasure ... Triple Attraction 4 1 18.7 19.2 18.4 19.9 100.0 6/15/10 21:32 6/15/10 22:29
262523 1 Triple Attraction N Unknown Unknown 1 30.0 30.0 1 Pleasure ... Triple Attraction 6 1 17.4 17.0 14.7 18.4 100.0 6/17/10 19:16 6/17/10 20:17
262524 1 Triple Attraction N Unknown Unknown 1 30.0 30.0 1 Pleasure ... Triple Attraction 7 1 31.5 14.2 13.4 15.1 100.0 6/18/10 2:52 6/18/10 5:03
262525 1 Triple Attraction N Unknown Unknown 1 30.0 30.0 1 Pleasure ... Triple Attraction 8 1 19.8 18.6 16.1 19.5 100.0 6/18/10 10:19 6/18/10 11:22

262353 rows × 21 columns

  • In this case, the default inner join is suitable; we are not interested in observations from either table that do not have corresponding entries in the other.

  • Notice that the mmsi field, which was the index of the vessels table, is no longer an index on the merged table.

  • Here, we used the merge function to perform the merge; we could also have used the merge method for either of the tables:

In [47]:
vessels.merge(segments, left_index=True, right_on='mmsi')
Out[47]:
num_names names sov flag flag_type num_loas loa max_loa num_types type ... name transit segment seg_length avg_sog min_sog max_sog pdgt10 st_time end_time
0 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 156.0 4 Dredging/MilOps/Reserved/Towing ... Us Govt Ves 1 1 5.1 13.2 9.2 14.5 96.5 2/10/09 16:03 2/10/09 16:27
1 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 156.0 4 Dredging/MilOps/Reserved/Towing ... Dredge Capt Frank 1 1 13.5 18.6 10.4 20.6 100.0 4/6/09 14:31 4/6/09 15:20
2 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 156.0 4 Dredging/MilOps/Reserved/Towing ... Us Gov Vessel 1 1 4.3 16.2 10.3 20.5 100.0 4/6/09 14:36 4/6/09 14:55
3 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 156.0 4 Dredging/MilOps/Reserved/Towing ... Us Gov Vessel 2 1 9.2 15.4 14.5 16.1 100.0 4/10/09 17:58 4/10/09 18:34
4 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 156.0 4 Dredging/MilOps/Reserved/Towing ... Dredge Capt Frank 2 1 9.2 15.4 14.6 16.2 100.0 4/10/09 17:59 4/10/09 18:35
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
262521 1 Triple Attraction N Unknown Unknown 1 30.0 30.0 1 Pleasure ... Triple Attraction 3 1 5.3 20.0 19.6 20.4 100.0 6/15/10 12:49 6/15/10 13:05
262522 1 Triple Attraction N Unknown Unknown 1 30.0 30.0 1 Pleasure ... Triple Attraction 4 1 18.7 19.2 18.4 19.9 100.0 6/15/10 21:32 6/15/10 22:29
262523 1 Triple Attraction N Unknown Unknown 1 30.0 30.0 1 Pleasure ... Triple Attraction 6 1 17.4 17.0 14.7 18.4 100.0 6/17/10 19:16 6/17/10 20:17
262524 1 Triple Attraction N Unknown Unknown 1 30.0 30.0 1 Pleasure ... Triple Attraction 7 1 31.5 14.2 13.4 15.1 100.0 6/18/10 2:52 6/18/10 5:03
262525 1 Triple Attraction N Unknown Unknown 1 30.0 30.0 1 Pleasure ... Triple Attraction 8 1 19.8 18.6 16.1 19.5 100.0 6/18/10 10:19 6/18/10 11:22

262353 rows × 21 columns

  • Occasionally, there will be fields with the same name in both tables that we do not wish to use to join the tables; they may contain different information, despite having the same name. In this case, Pandas will by default append suffixes _x and _y to the columns to uniquely identify them.
In [48]:
segments['type'] = 'foo'
pd.merge(vessels, segments, left_index=True, right_on='mmsi')
Out[48]:
num_names names sov flag flag_type num_loas loa max_loa num_types type_x ... transit segment seg_length avg_sog min_sog max_sog pdgt10 st_time end_time type_y
0 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 156.0 4 Dredging/MilOps/Reserved/Towing ... 1 1 5.1 13.2 9.2 14.5 96.5 2/10/09 16:03 2/10/09 16:27 foo
1 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 156.0 4 Dredging/MilOps/Reserved/Towing ... 1 1 13.5 18.6 10.4 20.6 100.0 4/6/09 14:31 4/6/09 15:20 foo
2 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 156.0 4 Dredging/MilOps/Reserved/Towing ... 1 1 4.3 16.2 10.3 20.5 100.0 4/6/09 14:36 4/6/09 14:55 foo
3 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 156.0 4 Dredging/MilOps/Reserved/Towing ... 2 1 9.2 15.4 14.5 16.1 100.0 4/10/09 17:58 4/10/09 18:34 foo
4 8 Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... Y Unknown Unknown 7 42.0/48.0/57.0/90.0/138.0/154.0/156.0 156.0 4 Dredging/MilOps/Reserved/Towing ... 2 1 9.2 15.4 14.6 16.2 100.0 4/10/09 17:59 4/10/09 18:35 foo
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
262521 1 Triple Attraction N Unknown Unknown 1 30.0 30.0 1 Pleasure ... 3 1 5.3 20.0 19.6 20.4 100.0 6/15/10 12:49 6/15/10 13:05 foo
262522 1 Triple Attraction N Unknown Unknown 1 30.0 30.0 1 Pleasure ... 4 1 18.7 19.2 18.4 19.9 100.0 6/15/10 21:32 6/15/10 22:29 foo
262523 1 Triple Attraction N Unknown Unknown 1 30.0 30.0 1 Pleasure ... 6 1 17.4 17.0 14.7 18.4 100.0 6/17/10 19:16 6/17/10 20:17 foo
262524 1 Triple Attraction N Unknown Unknown 1 30.0 30.0 1 Pleasure ... 7 1 31.5 14.2 13.4 15.1 100.0 6/18/10 2:52 6/18/10 5:03 foo
262525 1 Triple Attraction N Unknown Unknown 1 30.0 30.0 1 Pleasure ... 8 1 19.8 18.6 16.1 19.5 100.0 6/18/10 10:19 6/18/10 11:22 foo

262353 rows × 22 columns

  • This behavior can be overridden by specifying a suffixes argument, containing a list of the suffixes to be used for the columns of the left and right tables, respectively.
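
  • For example, a minimal sketch reusing the merge above:
In [ ]:
pd.merge(vessels, segments, left_index=True, right_on='mmsi',
         suffixes=('_vessel', '_segment'))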

Concatenation¶

A common data manipulation is appending rows or columns to a dataset such that they conform to the dimensions of the existing rows or columns, respectively. In NumPy, this is done either with concatenate or the convenience functions c_ and r_:

In [49]:
np.concatenate([np.random.random(5), np.random.random(5)])
Out[49]:
array([0.42336207, 0.74807952, 0.61839076, 0.54794432, 0.06227732,
       0.71618874, 0.31763132, 0.26021656, 0.22395665, 0.08499033])
In [50]:
np.r_[np.random.random(5), np.random.random(5)]
Out[50]:
array([0.48601962, 0.11484701, 0.93892836, 0.16884999, 0.71700162,
       0.92519913, 0.26827622, 0.41866975, 0.59348726, 0.06054373])
In [51]:
np.c_[np.random.random(5), np.random.random(5)]
Out[51]:
array([[0.48861909, 0.23961022],
       [0.68685816, 0.7662155 ],
       [0.76304197, 0.63356894],
       [0.45533848, 0.36265383],
       [0.85205653, 0.84605096]])
  • This operation is also called binding or stacking. With Pandas' indexed data structures, there are additional considerations, as the overlap in index values between two data structures affects how they are concatenated.

  • Let's import two microbiome datasets, each consisting of counts of microorganisms from a particular patient. We will use the first column of each dataset as the index.

In [52]:
# Pandas requires external modules to read Excel files
! pip3 install xlrd openpyxl --user
Looking in indexes: https://mirrors.163.com/pypi/simple/
Requirement already satisfied: xlrd in /home/fli/.local/lib/python3.9/site-packages (2.0.1)
Requirement already satisfied: openpyxl in /home/fli/.local/lib/python3.9/site-packages (3.0.9)
Requirement already satisfied: et-xmlfile in /home/fli/.local/lib/python3.9/site-packages (from openpyxl) (1.0.1)
In [53]:
mb1 = pd.read_excel('data/microbiome/MID1.xls', 'Sheet 1', index_col=0, header=None)
mb2 = pd.read_excel('data/microbiome/MID2.xls', 'Sheet 1', index_col=0, header=None)
mb1.shape, mb2.shape
Out[53]:
((272, 1), (288, 1))
In [54]:
mb1
Out[54]:
1
0
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera 7
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus 2
Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus 3
Archaea "Crenarchaeota" Thermoprotei Thermoproteales Thermofilaceae Thermofilum 3
Archaea "Euryarchaeota" "Methanomicrobia" Methanocellales Methanocellaceae Methanocella 7
... ...
Bacteria "Thermotogae" Thermotogae Thermotogales Thermotogaceae Kosmotoga 9
Bacteria "Verrucomicrobia" Opitutae Opitutales Opitutaceae Alterococcus 1
Bacteria Cyanobacteria Cyanobacteria Chloroplast Bangiophyceae 2
Bacteria Cyanobacteria Cyanobacteria Chloroplast Chlorarachniophyceae 85
Bacteria Cyanobacteria Cyanobacteria Chloroplast Streptophyta 1388

272 rows × 1 columns

  • Let's give the index and columns meaningful labels:
In [55]:
mb1.columns = mb2.columns = ['Count']
In [56]:
mb1.index.name = mb2.index.name = 'Taxon'
In [57]:
mb1
Out[57]:
Count
Taxon
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera 7
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus 2
Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus 3
Archaea "Crenarchaeota" Thermoprotei Thermoproteales Thermofilaceae Thermofilum 3
Archaea "Euryarchaeota" "Methanomicrobia" Methanocellales Methanocellaceae Methanocella 7
... ...
Bacteria "Thermotogae" Thermotogae Thermotogales Thermotogaceae Kosmotoga 9
Bacteria "Verrucomicrobia" Opitutae Opitutales Opitutaceae Alterococcus 1
Bacteria Cyanobacteria Cyanobacteria Chloroplast Bangiophyceae 2
Bacteria Cyanobacteria Cyanobacteria Chloroplast Chlorarachniophyceae 85
Bacteria Cyanobacteria Cyanobacteria Chloroplast Streptophyta 1388

272 rows × 1 columns

In [58]:
mb1.index[:3]
Out[58]:
Index(['Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera',
       'Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus',
       'Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus'],
      dtype='object', name='Taxon')
In [59]:
mb1.index.is_unique
Out[59]:
True
  • If we concatenate along axis=0 (the default), we will obtain another data frame with the rows concatenated:
In [60]:
pd.concat([mb1, mb2], axis=0).shape
Out[60]:
(560, 1)

However, the index is no longer unique, due to overlap between the two DataFrames.

In [61]:
pd.concat([mb1, mb2], axis=0).index.is_unique
Out[61]:
False
  • Concatenating along axis=1 will concatenate column-wise, but respecting the indices of the two DataFrames.
In [62]:
pd.concat([mb1, mb2], axis=1).shape
Out[62]:
(438, 2)
In [63]:
pd.concat([mb1, mb2], axis=1)
Out[63]:
Count Count
Taxon
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera 7.0 23.0
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus 2.0 2.0
Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus 3.0 10.0
Archaea "Crenarchaeota" Thermoprotei Thermoproteales Thermofilaceae Thermofilum 3.0 9.0
Archaea "Euryarchaeota" "Methanomicrobia" Methanocellales Methanocellaceae Methanocella 7.0 9.0
... ... ...
Bacteria "Proteobacteria" Gammaproteobacteria Oceanospirillales Oceanospirillales_incertae_sedis Spongiispira NaN 1.0
Bacteria "Proteobacteria" Gammaproteobacteria Thiotrichales Piscirickettsiaceae Hydrogenovibrio NaN 9.0
Bacteria "Proteobacteria" Gammaproteobacteria Thiotrichales Piscirickettsiaceae Sulfurivirga NaN 1.0
Bacteria "Thermodesulfobacteria" Thermodesulfobacteria Thermodesulfobacteriales Thermodesulfobacteriaceae Thermodesulfatator NaN 3.0
Bacteria TM7 TM7_genera_incertae_sedis NaN 2.0

438 rows × 2 columns

In [64]:
pd.concat([mb1, mb2], axis=1).values[:5]
Out[64]:
array([[ 7., 23.],
       [ 2.,  2.],
       [ 3., 10.],
       [ 3.,  9.],
       [ 7.,  9.]])
  • If we are only interested in taxa that are included in both DataFrames, we can specify a join='inner' argument.
In [65]:
pd.concat([mb1, mb2], axis=1, join='inner')
Out[65]:
Count Count
Taxon
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera 7 23
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus 2 2
Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus 3 10
Archaea "Crenarchaeota" Thermoprotei Thermoproteales Thermofilaceae Thermofilum 3 9
Archaea "Euryarchaeota" "Methanomicrobia" Methanocellales Methanocellaceae Methanocella 7 9
... ... ...
Bacteria "Thermodesulfobacteria" Thermodesulfobacteria Thermodesulfobacteriales Thermodesulfobacteriaceae Caldimicrobium 1 1
Bacteria "Thermotogae" Thermotogae Thermotogales Thermotogaceae Geotoga 7 15
Bacteria "Thermotogae" Thermotogae Thermotogales Thermotogaceae Kosmotoga 9 22
Bacteria Cyanobacteria Cyanobacteria Chloroplast Chlorarachniophyceae 85 1
Bacteria Cyanobacteria Cyanobacteria Chloroplast Streptophyta 1388 2

122 rows × 2 columns

  • If we wanted to use the second table to fill values absent from the first table, we could use combine_first.
In [66]:
mb1.combine_first(mb2)
Out[66]:
Count
Taxon
Archaea "Crenarchaeota" Thermoprotei Acidilobales Acidilobaceae Acidilobus 2
Archaea "Crenarchaeota" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera 14
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera 7
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Sulfophobococcus 1
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Thermosphaera 2
... ...
Bacteria "Verrucomicrobia" Opitutae Opitutales Opitutaceae Alterococcus 1
Bacteria Cyanobacteria Cyanobacteria Chloroplast Bangiophyceae 2
Bacteria Cyanobacteria Cyanobacteria Chloroplast Chlorarachniophyceae 85
Bacteria Cyanobacteria Cyanobacteria Chloroplast Streptophyta 1388
Bacteria TM7 TM7_genera_incertae_sedis 2

438 rows × 1 columns

  • Alternatively, you can pass keys to the concatenation by supplying the DataFrames (or Series) as a dict.
In [67]:
pd.concat(dict(patient1=mb1, patient2=mb2), axis=1)
Out[67]:
patient1 patient2
Count Count
Taxon
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera 7.0 23.0
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus 2.0 2.0
Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus 3.0 10.0
Archaea "Crenarchaeota" Thermoprotei Thermoproteales Thermofilaceae Thermofilum 3.0 9.0
Archaea "Euryarchaeota" "Methanomicrobia" Methanocellales Methanocellaceae Methanocella 7.0 9.0
... ... ...
Bacteria "Proteobacteria" Gammaproteobacteria Oceanospirillales Oceanospirillales_incertae_sedis Spongiispira NaN 1.0
Bacteria "Proteobacteria" Gammaproteobacteria Thiotrichales Piscirickettsiaceae Hydrogenovibrio NaN 9.0
Bacteria "Proteobacteria" Gammaproteobacteria Thiotrichales Piscirickettsiaceae Sulfurivirga NaN 1.0
Bacteria "Thermodesulfobacteria" Thermodesulfobacteria Thermodesulfobacteriales Thermodesulfobacteriaceae Thermodesulfatator NaN 3.0
Bacteria TM7 TM7_genera_incertae_sedis NaN 2.0

438 rows × 2 columns

  • If you want concat to work like numpy.concatenate, you may provide the ignore_index=True argument.
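
  • For example, a minimal sketch: the concatenated result gets a fresh integer index instead of the overlapping taxon labels, so it is unique again:
In [ ]:
pd.concat([mb1, mb2], ignore_index=True).index.is_unique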

Missing data¶

The occurrence of missing data is so prevalent that it pays to use tools like Pandas, which seamlessly integrates missing-data handling so that it can be dealt with easily, and in the manner required by the analysis at hand.

Missing data are represented in Series and DataFrame objects by the NaN floating point value. However, None is also treated as missing, since it is commonly used to indicate a missing value in Python code.

In [68]:
import numpy as np
foo = pd.Series([np.nan, -3, None, 'foobar'])
foo
Out[68]:
0       NaN
1        -3
2      None
3    foobar
dtype: object
In [69]:
foo.isnull()
Out[69]:
0     True
1    False
2     True
3    False
dtype: bool
  • Missing values may be dropped or indexed out:
In [74]:
bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}
bacteria2 = pd.Series(bacteria_dict, index=['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria'])
bacteria2
Out[74]:
Cyanobacteria        NaN
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64
In [75]:
bacteria2.dropna()
Out[75]:
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64
In [76]:
bacteria2[bacteria2.notnull()]
Out[76]:
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64
  • By default, dropna drops a row if any of its values are missing. This can be overridden by passing the how='all' argument, which only drops a row when every field is a missing value.
In [ ]:
data.dropna(how='all')
  • This can be customized further by specifying how many values need to be present before a row is dropped via the thresh argument.
In [ ]:
data.dropna(thresh=4)
  • This is typically used in time series applications, where there are repeated measurements that are incomplete for some subjects.

  • If we want to drop missing values column-wise instead of row-wise, we use axis=1.

In [ ]:
data.dropna(axis=1)
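  • Since data is not defined until the fillna example below, here is a self-contained sketch of these dropna variants on a small illustrative DataFrame:
In [ ]:
import numpy as np
df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [2.0, 3.0, np.nan],
                   'c': [np.nan, np.nan, np.nan]})
df.dropna()                   # drops every row containing a missing value
df.dropna(how='all')          # drops only the last row, where every value is missing
df.dropna(thresh=2)           # keeps rows with at least 2 non-missing values (only the first)
df.dropna(axis=1, how='all')  # drops column 'c', which is entirely missing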
  • Rather than omitting missing data from an analysis, in some cases it may be suitable to fill the missing value in, either with a default value (such as zero) or a value that is either imputed or carried forward/backward from similar data points. We can do this programmatically in Pandas with the fillna method.
In [79]:
bacteria2.fillna(0)
Out[79]:
Cyanobacteria        0.0
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64
In [89]:
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
                     'treatment':[1, 1, 1, None, 2, 2, 2, 2],
                     'year':[1994,1997,1999, None,2015,2017,2019,2021]})

data.fillna({'year': 2013, 'treatment':2})
Out[89]:
value treatment year
0 632 1.0 1994.0
1 1638 1.0 1997.0
2 569 1.0 1999.0
3 115 2.0 2013.0
4 433 2.0 2015.0
5 1130 2.0 2017.0
6 754 2.0 2019.0
7 555 2.0 2021.0
  • Notice that fillna by default returns a new object with the desired filling behavior, rather than changing the Series or DataFrame in place.

  • We can alter values in-place using inplace=True.
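
  • For example, a minimal sketch:
In [ ]:
bacteria2_filled = bacteria2.copy()
bacteria2_filled.fillna(0, inplace=True)  # modifies the copy itself and returns None
bacteria2_filled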

Data aggregation and GroupBy operations¶

One of the most powerful features of Pandas is its GroupBy functionality. On occasion we may want to perform operations on groups of observations within a dataset. For example:

  • aggregation, such as computing the sum or mean of each group, which involves applying a function to each group and returning the aggregated results
  • slicing the DataFrame into groups and then doing something with the resulting slices (e.g. plotting)
  • group-wise transformation, such as standardization/normalization
In [91]:
cdystonia = pd.read_csv("data/cdystonia.csv")
cdystonia
Out[91]:
patient obs week site id treat age sex twstrs
0 1 1 0 1 1 5000U 65 F 32
1 1 2 2 1 1 5000U 65 F 30
2 1 3 4 1 1 5000U 65 F 24
3 1 4 8 1 1 5000U 65 F 37
4 1 5 12 1 1 5000U 65 F 39
... ... ... ... ... ... ... ... ... ...
626 109 1 0 9 11 5000U 57 M 53
627 109 2 2 9 11 5000U 57 M 38
628 109 4 8 9 11 5000U 57 M 33
629 109 5 12 9 11 5000U 57 M 36
630 109 6 16 9 11 5000U 57 M 51

631 rows × 9 columns

In [92]:
cdystonia_grouped = cdystonia.groupby(cdystonia.patient)
cdystonia_grouped
Out[92]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f473e94b5e0>
  • However, the grouping is only an intermediate step; for example, we may want to iterate over each of the patient groups:
In [93]:
# for patient, group in cdystonia_grouped:
#    print(patient)
#    print(group)
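  • A single patient's records can also be pulled out directly, without looping, using get_group (a minimal sketch):
In [ ]:
cdystonia_grouped.get_group(4)  # all observations for patient 4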
  • A common data analysis procedure is the split-apply-combine operation, which groups subsets of data together, applies a function to each of the groups, then recombines them into a new data table. For example, we may want to aggregate our data with some function.

  • We can aggregate in Pandas using the aggregate (or agg, for short) method:

In [94]:
import numpy as np
cdystonia_grouped.agg(np.mean)
Out[94]:
obs week site id age twstrs
patient
1 3.5 7.0 1.0 1.0 65.0 33.000000
2 3.5 7.0 1.0 2.0 70.0 47.666667
3 3.5 7.0 1.0 3.0 64.0 30.500000
4 2.5 3.5 1.0 4.0 59.0 60.000000
5 3.5 7.0 1.0 5.0 76.0 46.166667
... ... ... ... ... ... ...
105 3.5 7.0 9.0 7.0 79.0 43.666667
106 3.5 7.0 9.0 8.0 43.0 67.666667
107 3.5 7.0 9.0 9.0 50.0 42.000000
108 3.5 7.0 9.0 10.0 39.0 52.333333
109 3.6 7.6 9.0 11.0 57.0 42.200000

109 rows × 6 columns

  • Notice that the treat and sex variables are not included in the aggregation. Since it does not make sense to average string variables, these columns are simply ignored by the method.

  • Some aggregation functions are so common that Pandas has a convenience method for them, such as mean:

In [95]:
cdystonia_grouped.mean()
Out[95]:
obs week site id age twstrs
patient
1 3.5 7.0 1.0 1.0 65.0 33.000000
2 3.5 7.0 1.0 2.0 70.0 47.666667
3 3.5 7.0 1.0 3.0 64.0 30.500000
4 2.5 3.5 1.0 4.0 59.0 60.000000
5 3.5 7.0 1.0 5.0 76.0 46.166667
... ... ... ... ... ... ...
105 3.5 7.0 9.0 7.0 79.0 43.666667
106 3.5 7.0 9.0 8.0 43.0 67.666667
107 3.5 7.0 9.0 9.0 50.0 42.000000
108 3.5 7.0 9.0 10.0 39.0 52.333333
109 3.6 7.6 9.0 11.0 57.0 42.200000

109 rows × 6 columns

  • The add_prefix and add_suffix methods can be used to give the columns of the resulting table labels that reflect the transformation:
In [96]:
cdystonia_grouped.mean().add_suffix('_mean')
Out[96]:
obs_mean week_mean site_mean id_mean age_mean twstrs_mean
patient
1 3.5 7.0 1.0 1.0 65.0 33.000000
2 3.5 7.0 1.0 2.0 70.0 47.666667
3 3.5 7.0 1.0 3.0 64.0 30.500000
4 2.5 3.5 1.0 4.0 59.0 60.000000
5 3.5 7.0 1.0 5.0 76.0 46.166667
... ... ... ... ... ... ...
105 3.5 7.0 9.0 7.0 79.0 43.666667
106 3.5 7.0 9.0 8.0 43.0 67.666667
107 3.5 7.0 9.0 9.0 50.0 42.000000
108 3.5 7.0 9.0 10.0 39.0 52.333333
109 3.6 7.6 9.0 11.0 57.0 42.200000

109 rows × 6 columns

In [97]:
# The median of the `twstrs` variable
cdystonia_grouped['twstrs'].quantile(0.5)
Out[97]:
patient
1      34.0
2      50.5
3      30.5
4      61.5
5      48.5
       ... 
105    45.5
106    67.5
107    44.0
108    50.5
109    38.0
Name: twstrs, Length: 109, dtype: float64
  • If we wish, we can easily aggregate according to multiple keys:
In [110]:
cdystonia.groupby(['week','site']).mean()
Out[110]:
patient obs id age twstrs
week site
0 1 6.5 1.0 6.5 59.000000 43.083333
2 19.5 1.0 7.5 53.928571 51.857143
3 32.5 1.0 6.5 51.500000 38.750000
4 42.5 1.0 4.5 59.250000 48.125000
5 49.5 1.0 3.5 51.833333 49.333333
6 60.0 1.0 8.0 51.866667 49.400000
7 73.5 1.0 6.5 59.250000 44.333333
8 89.0 1.0 10.0 57.263158 38.631579
9 104.0 1.0 6.0 55.454545 52.727273
2 1 6.5 2.0 6.5 59.000000 31.083333
2 19.0 2.0 7.0 52.923077 48.769231
3 32.5 2.0 6.5 51.500000 32.416667
4 42.5 2.0 4.5 59.250000 39.125000
5 49.0 2.0 3.0 50.000000 44.200000
6 60.0 2.0 8.0 51.866667 44.066667
7 73.5 2.0 6.5 59.250000 32.916667
8 88.5 2.0 9.5 58.562500 29.500000
9 103.7 2.0 5.7 56.000000 41.600000
4 1 6.5 3.0 6.5 59.000000 33.333333
2 19.5 3.0 7.5 53.928571 48.785714
  • Alternatively, we can transform the data, using a function of our choice with the transform method:
In [107]:
cdystonia2 = cdystonia_grouped[["obs", "week", "twstrs"]]
normalize = lambda x: (x - x.mean())/x.std()

cdystonia2.transform(normalize)
Out[107]:
obs week twstrs
0 -1.336306 -1.135550 -0.181369
1 -0.801784 -0.811107 -0.544107
2 -0.267261 -0.486664 -1.632322
3 0.267261 0.162221 0.725476
4 0.801784 0.811107 1.088214
... ... ... ...
626 -1.253831 -1.135467 1.180487
627 -0.771589 -0.836660 -0.459078
628 0.192897 0.059761 -1.005600
629 0.675140 0.657376 -0.677687
630 1.157383 1.254990 0.961878

631 rows × 3 columns

  • It is easy to do column selection within groupby operations, if we are only interested in split-apply-combine operations on a subset of columns:
In [100]:
cdystonia_grouped['twstrs'].mean()
Out[100]:
patient
1      33.000000
2      47.666667
3      30.500000
4      60.000000
5      46.166667
         ...    
105    43.666667
106    67.666667
107    42.000000
108    52.333333
109    42.200000
Name: twstrs, Length: 109, dtype: float64
  • If you simply want to divide your DataFrame into chunks for later use, it's easy to convert them into a dict so that they can be easily indexed out as needed:
In [101]:
chunks = dict(list(cdystonia_grouped))
chunks[4]
Out[101]:
patient obs week site id treat age sex twstrs
18 4 1 0 1 4 Placebo 59 F 53
19 4 2 2 1 4 Placebo 59 F 61
20 4 3 4 1 4 Placebo 59 F 64
21 4 4 8 1 4 Placebo 59 F 62
  • By default, groupby groups by row, but we can specify the axis argument to change this. For example, we can group our columns by data type (dtype) this way:
In [102]:
dict(list(cdystonia.groupby(cdystonia.dtypes, axis=1)))
Out[102]:
{dtype('int64'):      patient  obs  week  site  id  age  twstrs
 0          1    1     0     1   1   65      32
 1          1    2     2     1   1   65      30
 2          1    3     4     1   1   65      24
 3          1    4     8     1   1   65      37
 4          1    5    12     1   1   65      39
 ..       ...  ...   ...   ...  ..  ...     ...
 626      109    1     0     9  11   57      53
 627      109    2     2     9  11   57      38
 628      109    4     8     9  11   57      33
 629      109    5    12     9  11   57      36
 630      109    6    16     9  11   57      51
 
 [631 rows x 7 columns],
 dtype('O'):      treat sex
 0    5000U   F
 1    5000U   F
 2    5000U   F
 3    5000U   F
 4    5000U   F
 ..     ...  ..
 626  5000U   M
 627  5000U   M
 628  5000U   M
 629  5000U   M
 630  5000U   M
 
 [631 rows x 2 columns]}