Feng Li
School of Statistics and Mathematics
Central University of Finance and Economics
A key, but often under-appreciated, step in data analysis is importing the data that we wish to analyze.
Pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a DataFrame
object. These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported.
These functions include:
read_csv()
read_excel()
read_json()
read_parquet()
read_stata()
There are several other data formats that can be imported into Python and converted into DataFrames with the help of built-in or third-party libraries; these are beyond the scope of this tutorial, but are covered in https://pandas.pydata.org/docs/user_guide/io.html .
Let's start with some more bacteria data, stored in CSV format.
! head data/microbiome.csv
Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,1174,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605
Firmicutes,6,693,50
Firmicutes,7,718,717
Firmicutes,8,173,33
Firmicutes,9,228,80
This table can be read into a DataFrame using read_csv:
import pandas as pd
mb = pd.read_csv("data/microbiome.csv")
mb
Taxon | Patient | Tissue | Stool | |
---|---|---|---|---|
0 | Firmicutes | 1 | 632 | 305 |
1 | Firmicutes | 2 | 136 | 4182 |
2 | Firmicutes | 3 | 1174 | 703 |
3 | Firmicutes | 4 | 408 | 3946 |
4 | Firmicutes | 5 | 831 | 8605 |
... | ... | ... | ... | ... |
70 | Other | 11 | 203 | 6 |
71 | Other | 12 | 392 | 6 |
72 | Other | 13 | 28 | 25 |
73 | Other | 14 | 12 | 22 |
74 | Other | 15 | 305 | 32 |
75 rows × 4 columns
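By default, read_csv expects comma-separated fields, but the sep argument accepts other delimiters, including regular expressions. A minimal sketch, assuming a hypothetical whitespace-delimited file:
# hypothetical file: any run of whitespace separates the fields
pd.read_csv("data/hypothetical_table.txt", sep=r'\s+')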
read_csv automatically considered the first row in the file to be a header row. We can override this default behavior by customizing some of the arguments, such as header, names or index_col.
pd.read_csv("data/microbiome.csv", header=None)
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | Taxon | Patient | Tissue | Stool |
1 | Firmicutes | 1 | 632 | 305 |
2 | Firmicutes | 2 | 136 | 4182 |
3 | Firmicutes | 3 | 1174 | 703 |
4 | Firmicutes | 4 | 408 | 3946 |
... | ... | ... | ... | ... |
71 | Other | 11 | 203 | 6 |
72 | Other | 12 | 392 | 6 |
73 | Other | 13 | 28 | 25 |
74 | Other | 14 | 12 | 22 |
75 | Other | 15 | 305 | 32 |
76 rows × 4 columns
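The names argument supplies our own column labels; combined with header=0, the labels in the file's first row are replaced (the lowercase names here are just illustrative):
# replace the header row's labels with our own
pd.read_csv("data/microbiome.csv", header=0,
            names=['taxon', 'patient', 'tissue', 'stool'])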
mb = pd.read_csv("data/microbiome.csv", index_col=['Taxon','Patient'])
mb
Tissue | Stool | ||
---|---|---|---|
Taxon | Patient | ||
Firmicutes | 1 | 632 | 305 |
2 | 136 | 4182 | |
3 | 1174 | 703 | |
4 | 408 | 3946 | |
5 | 831 | 8605 | |
... | ... | ... | ... |
Other | 11 | 203 | 6 |
12 | 392 | 6 | |
13 | 28 | 25 | |
14 | 12 | 22 | |
15 | 305 | 32 |
75 rows × 2 columns
If we only want to import part of a file, rows can be skipped using the skiprows argument. This is useful for large datasets.
pd.read_csv("data/microbiome.csv", skiprows=[3,4,6])
Taxon | Patient | Tissue | Stool | |
---|---|---|---|---|
0 | Firmicutes | 1 | 632 | 305 |
1 | Firmicutes | 2 | 136 | 4182 |
2 | Firmicutes | 5 | 831 | 8605 |
3 | Firmicutes | 7 | 718 | 717 |
4 | Firmicutes | 8 | 173 | 33 |
... | ... | ... | ... | ... |
67 | Other | 11 | 203 | 6 |
68 | Other | 12 | 392 | 6 |
69 | Other | 13 | 28 | 25 |
70 | Other | 14 | 12 | 22 |
71 | Other | 15 | 305 | 32 |
72 rows × 4 columns
Conversely, we can use nrows to retrieve only the first n rows.
pd.read_csv("data/microbiome.csv", nrows=4)
Taxon | Patient | Tissue | Stool | |
---|---|---|---|---|
0 | Firmicutes | 1 | 632 | 305 |
1 | Firmicutes | 2 | 136 | 4182 |
2 | Firmicutes | 3 | 1174 | 703 |
3 | Firmicutes | 4 | 408 | 3946 |
Alternatively, the chunksize argument will return an iterable object that can be employed in a data-processing loop. For example, our microbiome data are organized by bacterial phylum, with 15 patients represented in each:
data_chunks = pd.read_csv("data/microbiome.csv", chunksize=15)
data_chunks
<pandas.io.parsers.readers.TextFileReader at 0x7f4741161460>
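Iterating over the TextFileReader yields one 15-row DataFrame per phylum. As a minimal sketch, we can summarize each chunk as it streams in (the iterator is consumed once, so we re-create it here):
# stream the file in 15-row chunks and summarize each one
for chunk in pd.read_csv("data/microbiome.csv", chunksize=15):
    print(chunk.Taxon.iloc[0], chunk.Tissue.mean())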
Most real-world data are incomplete, with values missing due to incomplete observation or data entry. Pandas automatically recognizes and parses common missing-data indicators, including NA and NULL.
!head data/microbiome_missing.csv
Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605
Firmicutes,6,693,50
Firmicutes,7,718,717
Firmicutes,8,173,33
Firmicutes,9,228,NA
import pandas as pd
pd.read_csv("data/microbiome_missing.csv").head(20)
Taxon | Patient | Tissue | Stool | |
---|---|---|---|---|
0 | Firmicutes | 1 | 632 | 305.0 |
1 | Firmicutes | 2 | 136 | 4182.0 |
2 | Firmicutes | 3 | NaN | 703.0 |
3 | Firmicutes | 4 | 408 | 3946.0 |
4 | Firmicutes | 5 | 831 | 8605.0 |
5 | Firmicutes | 6 | 693 | 50.0 |
6 | Firmicutes | 7 | 718 | 717.0 |
7 | Firmicutes | 8 | 173 | 33.0 |
8 | Firmicutes | 9 | 228 | NaN |
9 | Firmicutes | 10 | 162 | 3196.0 |
10 | Firmicutes | 11 | 372 | -99999.0 |
11 | Firmicutes | 12 | 4255 | 4361.0 |
12 | Firmicutes | 13 | 107 | 1667.0 |
13 | Firmicutes | 14 | ? | 223.0 |
14 | Firmicutes | 15 | 281 | 2377.0 |
15 | Proteobacteria | 1 | 1638 | 3886.0 |
16 | Proteobacteria | 2 | 2469 | 1821.0 |
17 | Proteobacteria | 3 | 839 | 661.0 |
18 | Proteobacteria | 4 | 4414 | 18.0 |
19 | Proteobacteria | 5 | 12044 | 83.0 |
Above, Pandas recognized NA
and an empty field as missing data.
pd.isnull(pd.read_csv("data/microbiome_missing.csv")).head(20)
Taxon | Patient | Tissue | Stool | |
---|---|---|---|---|
0 | False | False | False | False |
1 | False | False | False | False |
2 | False | False | True | False |
3 | False | False | False | False |
4 | False | False | False | False |
5 | False | False | False | False |
6 | False | False | False | False |
7 | False | False | False | False |
8 | False | False | False | True |
9 | False | False | False | False |
10 | False | False | False | False |
11 | False | False | False | False |
12 | False | False | False | False |
13 | False | False | False | False |
14 | False | False | False | False |
15 | False | False | False | False |
16 | False | False | False | False |
17 | False | False | False | False |
18 | False | False | False | False |
19 | False | False | False | False |
Unfortunately, there will sometimes be inconsistency with the conventions for missing data. In this example, there is a question mark "?" and a large negative number where there should have been a positive integer. We can specify additional symbols with the na_values
argument:
pd.read_csv("data/microbiome_missing.csv", na_values=['?', -99999]).head(20)
Taxon | Patient | Tissue | Stool | |
---|---|---|---|---|
0 | Firmicutes | 1 | 632.0 | 305.0 |
1 | Firmicutes | 2 | 136.0 | 4182.0 |
2 | Firmicutes | 3 | NaN | 703.0 |
3 | Firmicutes | 4 | 408.0 | 3946.0 |
4 | Firmicutes | 5 | 831.0 | 8605.0 |
5 | Firmicutes | 6 | 693.0 | 50.0 |
6 | Firmicutes | 7 | 718.0 | 717.0 |
7 | Firmicutes | 8 | 173.0 | 33.0 |
8 | Firmicutes | 9 | 228.0 | NaN |
9 | Firmicutes | 10 | 162.0 | 3196.0 |
10 | Firmicutes | 11 | 372.0 | NaN |
11 | Firmicutes | 12 | 4255.0 | 4361.0 |
12 | Firmicutes | 13 | 107.0 | 1667.0 |
13 | Firmicutes | 14 | NaN | 223.0 |
14 | Firmicutes | 15 | 281.0 | 2377.0 |
15 | Proteobacteria | 1 | 1638.0 | 3886.0 |
16 | Proteobacteria | 2 | 2469.0 | 1821.0 |
17 | Proteobacteria | 3 | 839.0 | 661.0 |
18 | Proteobacteria | 4 | 4414.0 | 18.0 |
19 | Proteobacteria | 5 | 12044.0 | 83.0 |
These can be specified on a column-wise basis using an appropriate dict as the argument for na_values
.
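For example, a sketch that treats '?' as missing only in Tissue and -99999 only in Stool:
# per-column missing-value sentinels, in addition to the defaults
pd.read_csv("data/microbiome_missing.csv",
            na_values={'Tissue': ['?'], 'Stool': [-99999]}).head(20)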
Reindexing allows users to manipulate the data labels in a DataFrame: it forces the DataFrame to conform to a new index, optionally filling in missing data.
For some variety, we will leave our digestive tract bacteria behind and employ some baseball data.
baseball = pd.read_csv("data/baseball.csv", index_col='id')
baseball
player | year | stint | team | lg | g | ab | r | h | X2b | ... | rbi | sb | cs | bb | so | ibb | hbp | sh | sf | gidp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
88641 | womacto01 | 2006 | 2 | CHN | NL | 19 | 50 | 6 | 14 | 1 | ... | 2.0 | 1.0 | 1.0 | 4 | 4.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 |
88643 | schilcu01 | 2006 | 1 | BOS | AL | 31 | 2 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
88645 | myersmi01 | 2006 | 1 | NYA | AL | 62 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
88649 | helliri01 | 2006 | 1 | MIL | NL | 20 | 3 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
88650 | johnsra05 | 2006 | 1 | NYA | AL | 33 | 6 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
89525 | benitar01 | 2007 | 2 | FLO | NL | 34 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
89526 | benitar01 | 2007 | 1 | SFN | NL | 19 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
89530 | ausmubr01 | 2007 | 1 | HOU | NL | 117 | 349 | 38 | 82 | 16 | ... | 25.0 | 6.0 | 1.0 | 37 | 74.0 | 3.0 | 6.0 | 4.0 | 1.0 | 11.0 |
89533 | aloumo01 | 2007 | 1 | NYN | NL | 87 | 328 | 51 | 112 | 19 | ... | 49.0 | 3.0 | 0.0 | 27 | 30.0 | 5.0 | 2.0 | 0.0 | 3.0 | 13.0 |
89534 | alomasa02 | 2007 | 1 | NYN | NL | 8 | 22 | 1 | 3 | 1 | ... | 0.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
100 rows × 22 columns
Notice that we specified the id column as the index, since it appears to be a unique identifier. We could try to create a unique index ourselves by combining player and year:
player_id = baseball.player + baseball.year.astype(str)
baseball_newind = baseball.copy()
baseball_newind.index = player_id
baseball_newind
player | year | stint | team | lg | g | ab | r | h | X2b | ... | rbi | sb | cs | bb | so | ibb | hbp | sh | sf | gidp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
womacto012006 | womacto01 | 2006 | 2 | CHN | NL | 19 | 50 | 6 | 14 | 1 | ... | 2.0 | 1.0 | 1.0 | 4 | 4.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 |
schilcu012006 | schilcu01 | 2006 | 1 | BOS | AL | 31 | 2 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
myersmi012006 | myersmi01 | 2006 | 1 | NYA | AL | 62 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
helliri012006 | helliri01 | 2006 | 1 | MIL | NL | 20 | 3 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
johnsra052006 | johnsra05 | 2006 | 1 | NYA | AL | 33 | 6 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
benitar012007 | benitar01 | 2007 | 2 | FLO | NL | 34 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
benitar012007 | benitar01 | 2007 | 1 | SFN | NL | 19 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
ausmubr012007 | ausmubr01 | 2007 | 1 | HOU | NL | 117 | 349 | 38 | 82 | 16 | ... | 25.0 | 6.0 | 1.0 | 37 | 74.0 | 3.0 | 6.0 | 4.0 | 1.0 | 11.0 |
aloumo012007 | aloumo01 | 2007 | 1 | NYN | NL | 87 | 328 | 51 | 112 | 19 | ... | 49.0 | 3.0 | 0.0 | 27 | 30.0 | 5.0 | 2.0 | 0.0 | 3.0 | 13.0 |
alomasa022007 | alomasa02 | 2007 | 1 | NYN | NL | 8 | 22 | 1 | 3 | 1 | ... | 0.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
100 rows × 22 columns
baseball_newind.index.is_unique
False
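The index is not unique because some players appear twice in the same year (once per stint). Selecting one of the duplicated labels therefore returns every matching row; for example:
# both of benitar01's 2007 stints are returned
baseball_newind.loc['benitar012007']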
A simple use of reindex is to alter the order of the rows, here reversing them:
reverse_index = baseball.index[::-1]
baseball.reindex(reverse_index)
player | year | stint | team | lg | g | ab | r | h | X2b | ... | rbi | sb | cs | bb | so | ibb | hbp | sh | sf | gidp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
89534 | alomasa02 | 2007 | 1 | NYN | NL | 8 | 22 | 1 | 3 | 1 | ... | 0.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
89533 | aloumo01 | 2007 | 1 | NYN | NL | 87 | 328 | 51 | 112 | 19 | ... | 49.0 | 3.0 | 0.0 | 27 | 30.0 | 5.0 | 2.0 | 0.0 | 3.0 | 13.0 |
89530 | ausmubr01 | 2007 | 1 | HOU | NL | 117 | 349 | 38 | 82 | 16 | ... | 25.0 | 6.0 | 1.0 | 37 | 74.0 | 3.0 | 6.0 | 4.0 | 1.0 | 11.0 |
89526 | benitar01 | 2007 | 1 | SFN | NL | 19 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
89525 | benitar01 | 2007 | 2 | FLO | NL | 34 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
88650 | johnsra05 | 2006 | 1 | NYA | AL | 33 | 6 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
88649 | helliri01 | 2006 | 1 | MIL | NL | 20 | 3 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
88645 | myersmi01 | 2006 | 1 | NYA | AL | 62 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
88643 | schilcu01 | 2006 | 1 | BOS | AL | 31 | 2 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
88641 | womacto01 | 2006 | 2 | CHN | NL | 19 | 50 | 6 | 14 | 1 | ... | 2.0 | 1.0 | 1.0 | 4 | 4.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 |
100 rows × 22 columns
Notice that the id index is not sequential. Say we wanted to populate the table with every id value. We could specify an index that is a sequence from the first to the last id number in the database, and Pandas would fill in the missing data with NaN values:
id_range = range(baseball.index.values.min(), baseball.index.values.max())
baseball.reindex(id_range)
player | year | stint | team | lg | g | ab | r | h | X2b | ... | rbi | sb | cs | bb | so | ibb | hbp | sh | sf | gidp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
88641 | womacto01 | 2006.0 | 2.0 | CHN | NL | 19.0 | 50.0 | 6.0 | 14.0 | 1.0 | ... | 2.0 | 1.0 | 1.0 | 4.0 | 4.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 |
88642 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
88643 | schilcu01 | 2006.0 | 1.0 | BOS | AL | 31.0 | 2.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
88644 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
88645 | myersmi01 | 2006.0 | 1.0 | NYA | AL | 62.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
89529 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
89530 | ausmubr01 | 2007.0 | 1.0 | HOU | NL | 117.0 | 349.0 | 38.0 | 82.0 | 16.0 | ... | 25.0 | 6.0 | 1.0 | 37.0 | 74.0 | 3.0 | 6.0 | 4.0 | 1.0 | 11.0 |
89531 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
89532 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
89533 | aloumo01 | 2007.0 | 1.0 | NYN | NL | 87.0 | 328.0 | 51.0 | 112.0 | 19.0 | ... | 49.0 | 3.0 | 0.0 | 27.0 | 30.0 | 5.0 | 2.0 | 0.0 | 3.0 | 13.0 |
893 rows × 22 columns
Missing values can be filled as desired, either with selected values or by rule:
baseball.reindex(id_range, fill_value='mr.nobody', columns=['player'])
player | |
---|---|
id | |
88641 | womacto01 |
88642 | mr.nobody |
88643 | schilcu01 |
88644 | mr.nobody |
88645 | myersmi01 |
... | ... |
89529 | mr.nobody |
89530 | ausmubr01 |
89531 | mr.nobody |
89532 | mr.nobody |
89533 | aloumo01 |
893 rows × 1 columns
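Instead of a constant, a filling rule can be applied with the method argument; for example, method='ffill' carries the last valid observation forward (this requires a monotonically ordered index, as here):
# propagate the previous player's name into the gaps
baseball.reindex(id_range, method='ffill', columns=['player'])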
We can also sort a DataFrame. sort_index orders the rows by their index labels:
baseball_newind.sort_index()
player | year | stint | team | lg | g | ab | r | h | X2b | ... | rbi | sb | cs | bb | so | ibb | hbp | sh | sf | gidp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
alomasa022007 | alomasa02 | 2007 | 1 | NYN | NL | 8 | 22 | 1 | 3 | 1 | ... | 0.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
aloumo012007 | aloumo01 | 2007 | 1 | NYN | NL | 87 | 328 | 51 | 112 | 19 | ... | 49.0 | 3.0 | 0.0 | 27 | 30.0 | 5.0 | 2.0 | 0.0 | 3.0 | 13.0 |
ausmubr012007 | ausmubr01 | 2007 | 1 | HOU | NL | 117 | 349 | 38 | 82 | 16 | ... | 25.0 | 6.0 | 1.0 | 37 | 74.0 | 3.0 | 6.0 | 4.0 | 1.0 | 11.0 |
benitar012007 | benitar01 | 2007 | 2 | FLO | NL | 34 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
benitar012007 | benitar01 | 2007 | 1 | SFN | NL | 19 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
wickmbo012007 | wickmbo01 | 2007 | 1 | ATL | NL | 47 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
williwo022007 | williwo02 | 2007 | 1 | HOU | NL | 33 | 59 | 3 | 6 | 0 | ... | 2.0 | 0.0 | 0.0 | 0 | 25.0 | 0.0 | 0.0 | 5.0 | 0.0 | 1.0 |
witasja012007 | witasja01 | 2007 | 1 | TBA | AL | 3 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
womacto012006 | womacto01 | 2006 | 2 | CHN | NL | 19 | 50 | 6 | 14 | 1 | ... | 2.0 | 1.0 | 1.0 | 4 | 4.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 |
zaungr012007 | zaungr01 | 2007 | 1 | TOR | AL | 110 | 331 | 43 | 80 | 24 | ... | 52.0 | 0.0 | 0.0 | 51 | 55.0 | 8.0 | 2.0 | 1.0 | 6.0 | 9.0 |
100 rows × 22 columns
baseball_newind.sort_index(ascending=False)
player | year | stint | team | lg | g | ab | r | h | X2b | ... | rbi | sb | cs | bb | so | ibb | hbp | sh | sf | gidp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
zaungr012007 | zaungr01 | 2007 | 1 | TOR | AL | 110 | 331 | 43 | 80 | 24 | ... | 52.0 | 0.0 | 0.0 | 51 | 55.0 | 8.0 | 2.0 | 1.0 | 6.0 | 9.0 |
womacto012006 | womacto01 | 2006 | 2 | CHN | NL | 19 | 50 | 6 | 14 | 1 | ... | 2.0 | 1.0 | 1.0 | 4 | 4.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 |
witasja012007 | witasja01 | 2007 | 1 | TBA | AL | 3 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
williwo022007 | williwo02 | 2007 | 1 | HOU | NL | 33 | 59 | 3 | 6 | 0 | ... | 2.0 | 0.0 | 0.0 | 0 | 25.0 | 0.0 | 0.0 | 5.0 | 0.0 | 1.0 |
wickmbo012007 | wickmbo01 | 2007 | 1 | ATL | NL | 47 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
benitar012007 | benitar01 | 2007 | 2 | FLO | NL | 34 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
benitar012007 | benitar01 | 2007 | 1 | SFN | NL | 19 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
ausmubr012007 | ausmubr01 | 2007 | 1 | HOU | NL | 117 | 349 | 38 | 82 | 16 | ... | 25.0 | 6.0 | 1.0 | 37 | 74.0 | 3.0 | 6.0 | 4.0 | 1.0 | 11.0 |
aloumo012007 | aloumo01 | 2007 | 1 | NYN | NL | 87 | 328 | 51 | 112 | 19 | ... | 49.0 | 3.0 | 0.0 | 27 | 30.0 | 5.0 | 2.0 | 0.0 | 3.0 | 13.0 |
alomasa022007 | alomasa02 | 2007 | 1 | NYN | NL | 8 | 22 | 1 | 3 | 1 | ... | 0.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
100 rows × 22 columns
baseball_newind.sort_index(axis=1)
X2b | X3b | ab | bb | cs | g | gidp | h | hbp | hr | ... | player | r | rbi | sb | sf | sh | so | stint | team | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
womacto012006 | 1 | 0 | 50 | 4 | 1.0 | 19 | 0.0 | 14 | 0.0 | 1 | ... | womacto01 | 6 | 2.0 | 1.0 | 0.0 | 3.0 | 4.0 | 2 | CHN | 2006 |
schilcu012006 | 0 | 0 | 2 | 0 | 0.0 | 31 | 0.0 | 1 | 0.0 | 0 | ... | schilcu01 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 | BOS | 2006 |
myersmi012006 | 0 | 0 | 0 | 0 | 0.0 | 62 | 0.0 | 0 | 0.0 | 0 | ... | myersmi01 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | NYA | 2006 |
helliri012006 | 0 | 0 | 3 | 0 | 0.0 | 20 | 0.0 | 0 | 0.0 | 0 | ... | helliri01 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1 | MIL | 2006 |
johnsra052006 | 0 | 0 | 6 | 0 | 0.0 | 33 | 0.0 | 1 | 0.0 | 0 | ... | johnsra05 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1 | NYA | 2006 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
benitar012007 | 0 | 0 | 0 | 0 | 0.0 | 34 | 0.0 | 0 | 0.0 | 0 | ... | benitar01 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | FLO | 2007 |
benitar012007 | 0 | 0 | 0 | 0 | 0.0 | 19 | 0.0 | 0 | 0.0 | 0 | ... | benitar01 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | SFN | 2007 |
ausmubr012007 | 16 | 3 | 349 | 37 | 1.0 | 117 | 11.0 | 82 | 6.0 | 3 | ... | ausmubr01 | 38 | 25.0 | 6.0 | 1.0 | 4.0 | 74.0 | 1 | HOU | 2007 |
aloumo012007 | 19 | 1 | 328 | 27 | 0.0 | 87 | 13.0 | 112 | 2.0 | 13 | ... | aloumo01 | 51 | 49.0 | 3.0 | 3.0 | 0.0 | 30.0 | 1 | NYN | 2007 |
alomasa022007 | 1 | 0 | 22 | 0 | 0.0 | 8 | 0.0 | 3 | 0.0 | 0 | ... | alomasa02 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1 | NYN | 2007 |
100 rows × 22 columns
The rank method returns each observation's rank within a Series; tied values receive the mean of their ranks by default:
baseball.hr.rank()
id
88641    62.5
88643    29.0
88645    29.0
88649    29.0
88650    29.0
         ... 
89525    29.0
89526    29.0
89530    71.5
89533    88.0
89534    29.0
Name: hr, Length: 100, dtype: float64
Ties can instead be broken in order of appearance:
baseball.hr.rank(method='first')
id
88641    58.0
88643     1.0
88645     2.0
88649     3.0
88650     4.0
         ... 
89525    55.0
89526    56.0
89530    72.0
89533    88.0
89534    57.0
Name: hr, Length: 100, dtype: float64
Calling a DataFrame's rank method returns the ranks of all columns:
baseball.rank(ascending=False)
player | year | stint | team | lg | g | ab | r | h | X2b | ... | rbi | sb | cs | bb | so | ibb | hbp | sh | sf | gidp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
88641 | 2.0 | 96.5 | 7.0 | 82.0 | 31.5 | 70.0 | 47.5 | 40.5 | 39.0 | 50.5 | ... | 51.0 | 24.5 | 17.5 | 44.5 | 59.0 | 66.0 | 65.5 | 16.0 | 70.0 | 76.5 |
88643 | 37.5 | 96.5 | 57.0 | 88.0 | 81.5 | 55.5 | 73.0 | 81.0 | 63.5 | 78.0 | ... | 78.5 | 63.5 | 62.5 | 79.0 | 73.0 | 66.0 | 65.5 | 67.5 | 70.0 | 76.5 |
88645 | 47.5 | 96.5 | 57.0 | 40.5 | 81.5 | 36.0 | 91.0 | 81.0 | 84.5 | 78.0 | ... | 78.5 | 63.5 | 62.5 | 79.0 | 89.0 | 66.0 | 65.5 | 67.5 | 70.0 | 76.5 |
88649 | 66.0 | 96.5 | 57.0 | 47.0 | 31.5 | 67.5 | 69.0 | 81.0 | 84.5 | 78.0 | ... | 78.5 | 63.5 | 62.5 | 79.0 | 67.0 | 66.0 | 65.5 | 67.5 | 70.0 | 76.5 |
88650 | 61.5 | 96.5 | 57.0 | 40.5 | 81.5 | 51.0 | 64.5 | 81.0 | 63.5 | 78.0 | ... | 78.5 | 63.5 | 62.5 | 79.0 | 59.0 | 66.0 | 65.5 | 67.5 | 70.0 | 76.5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
89525 | 96.5 | 46.5 | 7.0 | 64.0 | 31.5 | 47.0 | 91.0 | 81.0 | 84.5 | 78.0 | ... | 78.5 | 63.5 | 62.5 | 79.0 | 89.0 | 66.0 | 65.5 | 67.5 | 70.0 | 76.5 |
89526 | 96.5 | 46.5 | 57.0 | 13.5 | 31.5 | 70.0 | 91.0 | 81.0 | 84.5 | 78.0 | ... | 78.5 | 63.5 | 62.5 | 79.0 | 89.0 | 66.0 | 65.5 | 67.5 | 70.0 | 76.5 |
89530 | 98.0 | 46.5 | 57.0 | 61.5 | 31.5 | 17.5 | 19.0 | 24.0 | 23.0 | 21.5 | ... | 27.0 | 7.0 | 17.5 | 18.5 | 10.0 | 18.0 | 6.5 | 12.0 | 33.5 | 14.0 |
89533 | 99.0 | 46.5 | 57.0 | 31.5 | 31.5 | 23.0 | 22.0 | 18.5 | 14.0 | 17.5 | ... | 18.0 | 14.0 | 62.5 | 22.0 | 27.0 | 11.0 | 21.0 | 67.5 | 15.5 | 10.5 |
89534 | 100.0 | 46.5 | 57.0 | 31.5 | 31.5 | 77.0 | 57.0 | 58.0 | 58.0 | 50.5 | ... | 78.5 | 63.5 | 62.5 | 79.0 | 63.5 | 66.0 | 65.5 | 67.5 | 70.0 | 76.5 |
100 rows × 22 columns
baseball[['r','h','hr']].rank(ascending=False)
r | h | hr | |
---|---|---|---|
id | |||
88641 | 40.5 | 39.0 | 38.5 |
88643 | 81.0 | 63.5 | 72.0 |
88645 | 81.0 | 84.5 | 72.0 |
88649 | 81.0 | 84.5 | 72.0 |
88650 | 81.0 | 63.5 | 72.0 |
... | ... | ... | ... |
89525 | 81.0 | 84.5 | 72.0 |
89526 | 81.0 | 84.5 | 72.0 |
89530 | 24.0 | 23.0 | 29.5 |
89533 | 18.5 | 14.0 | 13.0 |
89534 | 58.0 | 58.0 | 72.0 |
100 rows × 3 columns
Rows or columns can be removed from a DataFrame with the drop method:
baseball.shape
(100, 22)
baseball.drop([89525, 89526]) # does not modify the original DataFrame
player | year | stint | team | lg | g | ab | r | h | X2b | ... | rbi | sb | cs | bb | so | ibb | hbp | sh | sf | gidp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
88641 | womacto01 | 2006 | 2 | CHN | NL | 19 | 50 | 6 | 14 | 1 | ... | 2.0 | 1.0 | 1.0 | 4 | 4.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 |
88643 | schilcu01 | 2006 | 1 | BOS | AL | 31 | 2 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
88645 | myersmi01 | 2006 | 1 | NYA | AL | 62 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
88649 | helliri01 | 2006 | 1 | MIL | NL | 20 | 3 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
88650 | johnsra05 | 2006 | 1 | NYA | AL | 33 | 6 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
89521 | bondsba01 | 2007 | 1 | SFN | NL | 126 | 340 | 75 | 94 | 14 | ... | 66.0 | 5.0 | 0.0 | 132 | 54.0 | 43.0 | 3.0 | 0.0 | 2.0 | 13.0 |
89523 | biggicr01 | 2007 | 1 | HOU | NL | 141 | 517 | 68 | 130 | 31 | ... | 50.0 | 4.0 | 3.0 | 23 | 112.0 | 0.0 | 3.0 | 7.0 | 5.0 | 5.0 |
89530 | ausmubr01 | 2007 | 1 | HOU | NL | 117 | 349 | 38 | 82 | 16 | ... | 25.0 | 6.0 | 1.0 | 37 | 74.0 | 3.0 | 6.0 | 4.0 | 1.0 | 11.0 |
89533 | aloumo01 | 2007 | 1 | NYN | NL | 87 | 328 | 51 | 112 | 19 | ... | 49.0 | 3.0 | 0.0 | 27 | 30.0 | 5.0 | 2.0 | 0.0 | 3.0 | 13.0 |
89534 | alomasa02 | 2007 | 1 | NYN | NL | 8 | 22 | 1 | 3 | 1 | ... | 0.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
98 rows × 22 columns
baseball.shape
(100, 22)
baseball.drop(['ibb','hbp'], axis=1)  # in Pandas, axis=0 refers to rows and axis=1 to columns
player | year | stint | team | lg | g | ab | r | h | X2b | X3b | hr | rbi | sb | cs | bb | so | sh | sf | gidp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||||
88641 | womacto01 | 2006 | 2 | CHN | NL | 19 | 50 | 6 | 14 | 1 | 0 | 1 | 2.0 | 1.0 | 1.0 | 4 | 4.0 | 3.0 | 0.0 | 0.0 |
88643 | schilcu01 | 2006 | 1 | BOS | AL | 31 | 2 | 0 | 1 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 |
88645 | myersmi01 | 2006 | 1 | NYA | AL | 62 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
88649 | helliri01 | 2006 | 1 | MIL | NL | 20 | 3 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 0.0 |
88650 | johnsra05 | 2006 | 1 | NYA | AL | 33 | 6 | 0 | 1 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 4.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
89525 | benitar01 | 2007 | 2 | FLO | NL | 34 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
89526 | benitar01 | 2007 | 1 | SFN | NL | 19 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
89530 | ausmubr01 | 2007 | 1 | HOU | NL | 117 | 349 | 38 | 82 | 16 | 3 | 3 | 25.0 | 6.0 | 1.0 | 37 | 74.0 | 4.0 | 1.0 | 11.0 |
89533 | aloumo01 | 2007 | 1 | NYN | NL | 87 | 328 | 51 | 112 | 19 | 1 | 13 | 49.0 | 3.0 | 0.0 | 27 | 30.0 | 0.0 | 3.0 | 13.0 |
89534 | alomasa02 | 2007 | 1 | NYN | NL | 8 | 22 | 1 | 3 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 0.0 |
100 rows × 20 columns
DataFrame
and Series
objects allow for several operations to take place either on a single object, or between two or more objects.
For example, we can perform arithmetic on the elements of two objects, such as combining baseball statistics across years:
hr2006 = baseball[baseball.year==2006].xs('hr', axis=1)
hr2006.index = baseball.player[baseball.year==2006]
hr2007 = baseball[baseball.year==2007].xs('hr', axis=1)
hr2007.index = baseball.player[baseball.year==2007]
# equivalently, construct the Series directly from the values
hr2006 = pd.Series(baseball.hr[baseball.year==2006].values, index=baseball.player[baseball.year==2006])
hr2006
player
womacto01     1
schilcu01     0
myersmi01     0
helliri01     0
johnsra05     0
finlest01     6
gonzalu01    15
seleaa01      0
dtype: int64
hr2007 = pd.Series(baseball.hr[baseball.year==2007].values, index=baseball.player[baseball.year==2007])
hr2007
player
francju01     0
francju01     1
zaungr01     10
witasja01     0
williwo02     1
             ..
benitar01     0
benitar01     0
ausmubr01     3
aloumo01     13
alomasa02     0
Length: 92, dtype: int64
hr_total = hr2006 + hr2007
hr_total
player
alomasa02   NaN
aloumo01    NaN
ausmubr01   NaN
benitar01   NaN
benitar01   NaN
             ..
wickmbo01   NaN
williwo02   NaN
witasja01   NaN
womacto01   NaN
zaungr01    NaN
Length: 94, dtype: float64
Pandas' data alignment places NaN values for labels that do not overlap in the two Series. In fact, there are only 6 players that occur in both years.
hr_total[hr_total.notnull()]
player
finlest01     7.0
gonzalu01    30.0
johnsra05     0.0
myersmi01     0.0
schilcu01     0.0
seleaa01      0.0
dtype: float64
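If we would rather treat a year with no record as zero home runs instead of propagating NaN, the add method accepts a fill_value:
# align on player, substituting 0 where a label is missing from one year
hr2006.add(hr2007, fill_value=0)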
In this section, we will manipulate data collected from ocean-going vessels on the eastern seaboard. Vessel operations are monitored using the Automatic Identification System (AIS), a safety at sea navigation technology which vessels are required to maintain and that uses transponders to transmit very high frequency (VHF) radio signals containing static information including ship name, call sign, and country of origin, as well as dynamic information unique to a particular voyage such as vessel location, heading, and speed.
The International Maritime Organization’s (IMO) International Convention for the Safety of Life at Sea requires functioning AIS capabilities on all vessels 300 gross tons or greater and the US Coast Guard requires AIS on nearly all vessels sailing in U.S. waters. The Coast Guard has established a national network of AIS receivers that provides coverage of nearly all U.S. waters. AIS signals are transmitted several times each minute and the network is capable of handling thousands of reports per minute and updates as often as every two seconds. Therefore, a typical voyage in our study might include the transmission of hundreds or thousands of AIS encoded signals. This provides a rich source of spatial data that includes both spatial and temporal information.
For our purposes, we will use summarized data that describes the transit of a given vessel through a particular administrative area. The data includes the start and end time of the transit segment, as well as information about the speed of the vessel, how far it travelled, etc.
segments = pd.read_csv("data/AIS/transit_segments.csv")
segments
mmsi | name | transit | segment | seg_length | avg_sog | min_sog | max_sog | pdgt10 | st_time | end_time | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Us Govt Ves | 1 | 1 | 5.1 | 13.2 | 9.2 | 14.5 | 96.5 | 2/10/09 16:03 | 2/10/09 16:27 |
1 | 1 | Dredge Capt Frank | 1 | 1 | 13.5 | 18.6 | 10.4 | 20.6 | 100.0 | 4/6/09 14:31 | 4/6/09 15:20 |
2 | 1 | Us Gov Vessel | 1 | 1 | 4.3 | 16.2 | 10.3 | 20.5 | 100.0 | 4/6/09 14:36 | 4/6/09 14:55 |
3 | 1 | Us Gov Vessel | 2 | 1 | 9.2 | 15.4 | 14.5 | 16.1 | 100.0 | 4/10/09 17:58 | 4/10/09 18:34 |
4 | 1 | Dredge Capt Frank | 2 | 1 | 9.2 | 15.4 | 14.6 | 16.2 | 100.0 | 4/10/09 17:59 | 4/10/09 18:35 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
262521 | 999999999 | Triple Attraction | 3 | 1 | 5.3 | 20.0 | 19.6 | 20.4 | 100.0 | 6/15/10 12:49 | 6/15/10 13:05 |
262522 | 999999999 | Triple Attraction | 4 | 1 | 18.7 | 19.2 | 18.4 | 19.9 | 100.0 | 6/15/10 21:32 | 6/15/10 22:29 |
262523 | 999999999 | Triple Attraction | 6 | 1 | 17.4 | 17.0 | 14.7 | 18.4 | 100.0 | 6/17/10 19:16 | 6/17/10 20:17 |
262524 | 999999999 | Triple Attraction | 7 | 1 | 31.5 | 14.2 | 13.4 | 15.1 | 100.0 | 6/18/10 2:52 | 6/18/10 5:03 |
262525 | 999999999 | Triple Attraction | 8 | 1 | 19.8 | 18.6 | 16.1 | 19.5 | 100.0 | 6/18/10 10:19 | 6/18/10 11:22 |
262526 rows × 11 columns
In the data/AIS folder there is a second table that contains information about each of the ships that travelled the segments in the segments table.
vessels = pd.read_csv("data/AIS/vessel_information.csv", index_col='mmsi')
vessels
num_names | names | sov | flag | flag_type | num_loas | loa | max_loa | num_types | type | |
---|---|---|---|---|---|---|---|---|---|---|
mmsi | ||||||||||
1 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing |
9 | 3 | 000000009/Raven/Shearwater | N | Unknown | Unknown | 2 | 50.0/62.0 | 62.0 | 2 | Pleasure/Tug |
21 | 1 | Us Gov Vessel | Y | Unknown | Unknown | 1 | 208.0 | 208.0 | 1 | Unknown |
74 | 2 | Mcfaul/Sarah Bell | N | Unknown | Unknown | 1 | 155.0 | 155.0 | 1 | Unknown |
103 | 3 | Ron G/Us Navy Warship 103/Us Warship 103 | Y | Unknown | Unknown | 2 | 26.0/155.0 | 155.0 | 2 | Tanker/Unknown |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
919191919 | 1 | Oi | N | Unknown | Unknown | 1 | 20.0 | 20.0 | 1 | Pleasure |
967191190 | 1 | Pathfinder | N | Unknown | Unknown | 1 | 31.0 | 31.0 | 2 | BigTow/Towing |
975318642 | 1 | Island Express | N | Unknown | Unknown | 1 | 20.0 | 20.0 | 1 | Towing |
987654321 | 2 | Island Lookout/Island Tide | N | Unknown | Unknown | 2 | 22.0/23.0 | 23.0 | 2 | Fishing/Towing |
999999999 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure |
10771 rows × 10 columns
vessels.type.value_counts()
Cargo                      5622
Tanker                     2440
Pleasure                    601
Tug                         221
Sailing                     205
                           ... 
AntiPol/Other                 1
Fishing/Law                   1
Cargo/Other/Towing            1
Cargo/Fishing                 1
Fishing/Reserved/Towing       1
Name: type, Length: 206, dtype: int64
The challenge, however, is that several ships have travelled multiple segments, so there is not a one-to-one relationship between the rows of the two tables. The table of vessel information has a one-to-many relationship with the segments.
In Pandas, we can combine tables according to the value of one or more keys that are used to identify rows, much like an index. Using a trivial example:
import numpy as np
df1 = pd.DataFrame(dict(id=range(4), age=np.random.randint(18, 31, size=4)))
df2 = pd.DataFrame(dict(id=range(6), score=np.random.random(size=6)))
pd.merge(df1, df2)
id | age | score | |
---|---|---|---|
0 | 0 | 24 | 0.666117 |
1 | 1 | 23 | 0.249913 |
2 | 2 | 26 | 0.261396 |
3 | 3 | 30 | 0.651306 |
Notice that without any information about which column to use as a key, Pandas did the right thing and used the id column in both tables. Unless specified otherwise, merge will use any common column names as keys for merging the tables.
By default, merge
performs an inner join on the tables, meaning that the merged table represents an intersection of the two tables.
pd.merge(df1, df2, how='outer')
id | age | score | |
---|---|---|---|
0 | 0 | 24.0 | 0.666117 |
1 | 1 | 23.0 | 0.249913 |
2 | 2 | 26.0 | 0.261396 |
3 | 3 | 30.0 | 0.651306 |
4 | 4 | NaN | 0.468143 |
5 | 5 | NaN | 0.461524 |
The outer join above yields the union of the two tables, so all rows are represented, with missing values inserted as appropriate. One can also perform right and left joins to include all rows of the right or left table (i.e. first or second argument to merge
), but not necessarily the other.
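Continuing the toy example above, a left join keeps every row of the first argument:
# every row of df1 is retained; df2 rows without a matching id are dropped
pd.merge(df1, df2, how='left')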
Looking at the two datasets that we wish to merge:
segments.head(1)
mmsi | name | transit | segment | seg_length | avg_sog | min_sog | max_sog | pdgt10 | st_time | end_time | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Us Govt Ves | 1 | 1 | 5.1 | 13.2 | 9.2 | 14.5 | 96.5 | 2/10/09 16:03 | 2/10/09 16:27 |
vessels.head(1)
num_names | names | sov | flag | flag_type | num_loas | loa | max_loa | num_types | type | |
---|---|---|---|---|---|---|---|---|---|---|
mmsi | ||||||||||
1 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing |
Notice that there is an mmsi value (a vessel identifier) in each table, but in the vessels table it is used as the index. In this case, we have to specify that we are joining on the index for that table, and on the mmsi column for the other:
segments_merged = pd.merge(vessels, segments, left_index=True, right_on='mmsi')
segments_merged
num_names | names | sov | flag | flag_type | num_loas | loa | max_loa | num_types | type | ... | name | transit | segment | seg_length | avg_sog | min_sog | max_sog | pdgt10 | st_time | end_time | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Us Govt Ves | 1 | 1 | 5.1 | 13.2 | 9.2 | 14.5 | 96.5 | 2/10/09 16:03 | 2/10/09 16:27 |
1 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Dredge Capt Frank | 1 | 1 | 13.5 | 18.6 | 10.4 | 20.6 | 100.0 | 4/6/09 14:31 | 4/6/09 15:20 |
2 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Us Gov Vessel | 1 | 1 | 4.3 | 16.2 | 10.3 | 20.5 | 100.0 | 4/6/09 14:36 | 4/6/09 14:55 |
3 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Us Gov Vessel | 2 | 1 | 9.2 | 15.4 | 14.5 | 16.1 | 100.0 | 4/10/09 17:58 | 4/10/09 18:34 |
4 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Dredge Capt Frank | 2 | 1 | 9.2 | 15.4 | 14.6 | 16.2 | 100.0 | 4/10/09 17:59 | 4/10/09 18:35 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
262521 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 3 | 1 | 5.3 | 20.0 | 19.6 | 20.4 | 100.0 | 6/15/10 12:49 | 6/15/10 13:05 |
262522 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 4 | 1 | 18.7 | 19.2 | 18.4 | 19.9 | 100.0 | 6/15/10 21:32 | 6/15/10 22:29 |
262523 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 6 | 1 | 17.4 | 17.0 | 14.7 | 18.4 | 100.0 | 6/17/10 19:16 | 6/17/10 20:17 |
262524 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 7 | 1 | 31.5 | 14.2 | 13.4 | 15.1 | 100.0 | 6/18/10 2:52 | 6/18/10 5:03 |
262525 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 8 | 1 | 19.8 | 18.6 | 16.1 | 19.5 | 100.0 | 6/18/10 10:19 | 6/18/10 11:22 |
262353 rows × 21 columns
In this case, the default inner join is suitable; we are not interested in observations from either table that do not have corresponding entries in the other.
Notice that the mmsi field, which was the index of the vessels table, is no longer an index on the merged table.
Here, we used the merge
function to perform the merge; we could also have used the merge
method for either of the tables:
vessels.merge(segments, left_index=True, right_on='mmsi')
num_names | names | sov | flag | flag_type | num_loas | loa | max_loa | num_types | type | ... | name | transit | segment | seg_length | avg_sog | min_sog | max_sog | pdgt10 | st_time | end_time | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Us Govt Ves | 1 | 1 | 5.1 | 13.2 | 9.2 | 14.5 | 96.5 | 2/10/09 16:03 | 2/10/09 16:27 |
1 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Dredge Capt Frank | 1 | 1 | 13.5 | 18.6 | 10.4 | 20.6 | 100.0 | 4/6/09 14:31 | 4/6/09 15:20 |
2 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Us Gov Vessel | 1 | 1 | 4.3 | 16.2 | 10.3 | 20.5 | 100.0 | 4/6/09 14:36 | 4/6/09 14:55 |
3 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Us Gov Vessel | 2 | 1 | 9.2 | 15.4 | 14.5 | 16.1 | 100.0 | 4/10/09 17:58 | 4/10/09 18:34 |
4 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Dredge Capt Frank | 2 | 1 | 9.2 | 15.4 | 14.6 | 16.2 | 100.0 | 4/10/09 17:59 | 4/10/09 18:35 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
262521 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 3 | 1 | 5.3 | 20.0 | 19.6 | 20.4 | 100.0 | 6/15/10 12:49 | 6/15/10 13:05 |
262522 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 4 | 1 | 18.7 | 19.2 | 18.4 | 19.9 | 100.0 | 6/15/10 21:32 | 6/15/10 22:29 |
262523 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 6 | 1 | 17.4 | 17.0 | 14.7 | 18.4 | 100.0 | 6/17/10 19:16 | 6/17/10 20:17 |
262524 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 7 | 1 | 31.5 | 14.2 | 13.4 | 15.1 | 100.0 | 6/18/10 2:52 | 6/18/10 5:03 |
262525 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 8 | 1 | 19.8 | 18.6 | 16.1 | 19.5 | 100.0 | 6/18/10 10:19 | 6/18/10 11:22 |
262353 rows × 21 columns
If the tables share column names that are not merge keys, Pandas appends the suffixes _x and _y to those columns to uniquely identify them. To illustrate, we create a clashing type column in the segments table:
segments['type'] = 'foo'
pd.merge(vessels, segments, left_index=True, right_on='mmsi')
num_names | names | sov | flag | flag_type | num_loas | loa | max_loa | num_types | type_x | ... | transit | segment | seg_length | avg_sog | min_sog | max_sog | pdgt10 | st_time | end_time | type_y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | 1 | 1 | 5.1 | 13.2 | 9.2 | 14.5 | 96.5 | 2/10/09 16:03 | 2/10/09 16:27 | foo |
1 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | 1 | 1 | 13.5 | 18.6 | 10.4 | 20.6 | 100.0 | 4/6/09 14:31 | 4/6/09 15:20 | foo |
2 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | 1 | 1 | 4.3 | 16.2 | 10.3 | 20.5 | 100.0 | 4/6/09 14:36 | 4/6/09 14:55 | foo |
3 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | 2 | 1 | 9.2 | 15.4 | 14.5 | 16.1 | 100.0 | 4/10/09 17:58 | 4/10/09 18:34 | foo |
4 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | 2 | 1 | 9.2 | 15.4 | 14.6 | 16.2 | 100.0 | 4/10/09 17:59 | 4/10/09 18:35 | foo |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
262521 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | 3 | 1 | 5.3 | 20.0 | 19.6 | 20.4 | 100.0 | 6/15/10 12:49 | 6/15/10 13:05 | foo |
262522 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | 4 | 1 | 18.7 | 19.2 | 18.4 | 19.9 | 100.0 | 6/15/10 21:32 | 6/15/10 22:29 | foo |
262523 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | 6 | 1 | 17.4 | 17.0 | 14.7 | 18.4 | 100.0 | 6/17/10 19:16 | 6/17/10 20:17 | foo |
262524 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | 7 | 1 | 31.5 | 14.2 | 13.4 | 15.1 | 100.0 | 6/18/10 2:52 | 6/18/10 5:03 | foo |
262525 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | 8 | 1 | 19.8 | 18.6 | 16.1 | 19.5 | 100.0 | 6/18/10 10:19 | 6/18/10 11:22 | foo |
262353 rows × 22 columns
This behavior can be overridden by specifying a suffixes argument, containing a list of the suffixes to be used for the columns of the left and right tables, respectively.

A common data manipulation is appending rows or columns to a dataset that already conform to the dimensions of the existing rows or columns, respectively. In NumPy, this is done either with concatenate
or the convenience functions c_
and r_
:
np.concatenate([np.random.random(5), np.random.random(5)])
array([0.42336207, 0.74807952, 0.61839076, 0.54794432, 0.06227732, 0.71618874, 0.31763132, 0.26021656, 0.22395665, 0.08499033])
np.r_[np.random.random(5), np.random.random(5)]
array([0.48601962, 0.11484701, 0.93892836, 0.16884999, 0.71700162, 0.92519913, 0.26827622, 0.41866975, 0.59348726, 0.06054373])
np.c_[np.random.random(5), np.random.random(5)]
array([[0.48861909, 0.23961022],
       [0.68685816, 0.7662155 ],
       [0.76304197, 0.63356894],
       [0.45533848, 0.36265383],
       [0.85205653, 0.84605096]])
This operation is also called binding or stacking. With Pandas' indexed data structures, there are additional considerations, as the overlap in index values between two data structures affects how they are concatenated.

Let's import two microbiome datasets, each consisting of counts of microorganisms from a particular patient. We will use the first column of each dataset as the index.
# Pandas requires external modules to read Excel files
! pip3 install xlrd openpyxl --user
Looking in indexes: https://mirrors.163.com/pypi/simple/
Requirement already satisfied: xlrd in /home/fli/.local/lib/python3.9/site-packages (2.0.1)
Requirement already satisfied: openpyxl in /home/fli/.local/lib/python3.9/site-packages (3.0.9)
Requirement already satisfied: et-xmlfile in /home/fli/.local/lib/python3.9/site-packages (from openpyxl) (1.0.1)
mb1 = pd.read_excel('data/microbiome/MID1.xls', 'Sheet 1', index_col=0, header=None)
mb2 = pd.read_excel('data/microbiome/MID2.xls', 'Sheet 1', index_col=0, header=None)
mb1.shape, mb2.shape
((272, 1), (288, 1))
mb1
1 | |
---|---|
0 | |
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera | 7 |
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus | 2 |
Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus | 3 |
Archaea "Crenarchaeota" Thermoprotei Thermoproteales Thermofilaceae Thermofilum | 3 |
Archaea "Euryarchaeota" "Methanomicrobia" Methanocellales Methanocellaceae Methanocella | 7 |
... | ... |
Bacteria "Thermotogae" Thermotogae Thermotogales Thermotogaceae Kosmotoga | 9 |
Bacteria "Verrucomicrobia" Opitutae Opitutales Opitutaceae Alterococcus | 1 |
Bacteria Cyanobacteria Cyanobacteria Chloroplast Bangiophyceae | 2 |
Bacteria Cyanobacteria Cyanobacteria Chloroplast Chlorarachniophyceae | 85 |
Bacteria Cyanobacteria Cyanobacteria Chloroplast Streptophyta | 1388 |
272 rows × 1 columns
mb1.columns = mb2.columns = ['Count']
mb1.index.name = mb2.index.name = 'Taxon'
mb1
Count | |
---|---|
Taxon | |
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera | 7 |
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus | 2 |
Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus | 3 |
Archaea "Crenarchaeota" Thermoprotei Thermoproteales Thermofilaceae Thermofilum | 3 |
Archaea "Euryarchaeota" "Methanomicrobia" Methanocellales Methanocellaceae Methanocella | 7 |
... | ... |
Bacteria "Thermotogae" Thermotogae Thermotogales Thermotogaceae Kosmotoga | 9 |
Bacteria "Verrucomicrobia" Opitutae Opitutales Opitutaceae Alterococcus | 1 |
Bacteria Cyanobacteria Cyanobacteria Chloroplast Bangiophyceae | 2 |
Bacteria Cyanobacteria Cyanobacteria Chloroplast Chlorarachniophyceae | 85 |
Bacteria Cyanobacteria Cyanobacteria Chloroplast Streptophyta | 1388 |
272 rows × 1 columns
mb1.index[:3]
Index(['Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera', 'Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus', 'Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus'], dtype='object', name='Taxon')
mb1.index.is_unique
True
If we concatenate along axis=0 (the default), we obtain another DataFrame with the rows concatenated:
pd.concat([mb1, mb2], axis=0).shape
(560, 1)
However, the index is no longer unique, due to overlap between the two DataFrames.
pd.concat([mb1, mb2], axis=0).index.is_unique
False
Concatenating along axis=1 instead combines the DataFrames column-wise, respecting the indices of the two DataFrames:
pd.concat([mb1, mb2], axis=1).shape
(438, 2)
pd.concat([mb1, mb2], axis=1)
Count | Count | |
---|---|---|
Taxon | ||
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera | 7.0 | 23.0 |
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus | 2.0 | 2.0 |
Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus | 3.0 | 10.0 |
Archaea "Crenarchaeota" Thermoprotei Thermoproteales Thermofilaceae Thermofilum | 3.0 | 9.0 |
Archaea "Euryarchaeota" "Methanomicrobia" Methanocellales Methanocellaceae Methanocella | 7.0 | 9.0 |
... | ... | ... |
Bacteria "Proteobacteria" Gammaproteobacteria Oceanospirillales Oceanospirillales_incertae_sedis Spongiispira | NaN | 1.0 |
Bacteria "Proteobacteria" Gammaproteobacteria Thiotrichales Piscirickettsiaceae Hydrogenovibrio | NaN | 9.0 |
Bacteria "Proteobacteria" Gammaproteobacteria Thiotrichales Piscirickettsiaceae Sulfurivirga | NaN | 1.0 |
Bacteria "Thermodesulfobacteria" Thermodesulfobacteria Thermodesulfobacteriales Thermodesulfobacteriaceae Thermodesulfatator | NaN | 3.0 |
Bacteria TM7 TM7_genera_incertae_sedis | NaN | 2.0 |
438 rows × 2 columns
pd.concat([mb1, mb2], axis=1).values[:5]
array([[   7.,   23.],
       [   2.,    2.],
       [   3.,   10.],
       [   3.,    9.],
       [   7.,    9.]])
If we only want the rows whose index labels appear in both DataFrames (an inner join), we can pass the join='inner' argument:
pd.concat([mb1, mb2], axis=1, join='inner')
Count | Count | |
---|---|---|
Taxon | ||
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera | 7 | 23 |
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus | 2 | 2 |
Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus | 3 | 10 |
Archaea "Crenarchaeota" Thermoprotei Thermoproteales Thermofilaceae Thermofilum | 3 | 9 |
Archaea "Euryarchaeota" "Methanomicrobia" Methanocellales Methanocellaceae Methanocella | 7 | 9 |
... | ... | ... |
Bacteria "Thermodesulfobacteria" Thermodesulfobacteria Thermodesulfobacteriales Thermodesulfobacteriaceae Caldimicrobium | 1 | 1 |
Bacteria "Thermotogae" Thermotogae Thermotogales Thermotogaceae Geotoga | 7 | 15 |
Bacteria "Thermotogae" Thermotogae Thermotogales Thermotogaceae Kosmotoga | 9 | 22 |
Bacteria Cyanobacteria Cyanobacteria Chloroplast Chlorarachniophyceae | 85 | 1 |
Bacteria Cyanobacteria Cyanobacteria Chloroplast Streptophyta | 1388 | 2 |
122 rows × 2 columns
If we wanted to use the first table's values, filling in any missing labels with values from the second table, we could use combine_first:
mb1.combine_first(mb2)
Count | |
---|---|
Taxon | |
Archaea "Crenarchaeota" Thermoprotei Acidilobales Acidilobaceae Acidilobus | 2 |
Archaea "Crenarchaeota" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera | 14 |
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera | 7 |
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Sulfophobococcus | 1 |
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Thermosphaera | 2 |
... | ... |
Bacteria "Verrucomicrobia" Opitutae Opitutales Opitutaceae Alterococcus | 1 |
Bacteria Cyanobacteria Cyanobacteria Chloroplast Bangiophyceae | 2 |
Bacteria Cyanobacteria Cyanobacteria Chloroplast Chlorarachniophyceae | 85 |
Bacteria Cyanobacteria Cyanobacteria Chloroplast Streptophyta | 1388 |
Bacteria TM7 TM7_genera_incertae_sedis | 2 |
438 rows × 1 columns
Alternatively, passing a dict of DataFrames to concat creates a hierarchical column index that identifies each source:
pd.concat(dict(patient1=mb1, patient2=mb2), axis=1)
patient1 | patient2 | |
---|---|---|
Count | Count | |
Taxon | ||
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera | 7.0 | 23.0 |
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus | 2.0 | 2.0 |
Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus | 3.0 | 10.0 |
Archaea "Crenarchaeota" Thermoprotei Thermoproteales Thermofilaceae Thermofilum | 3.0 | 9.0 |
Archaea "Euryarchaeota" "Methanomicrobia" Methanocellales Methanocellaceae Methanocella | 7.0 | 9.0 |
... | ... | ... |
Bacteria "Proteobacteria" Gammaproteobacteria Oceanospirillales Oceanospirillales_incertae_sedis Spongiispira | NaN | 1.0 |
Bacteria "Proteobacteria" Gammaproteobacteria Thiotrichales Piscirickettsiaceae Hydrogenovibrio | NaN | 9.0 |
Bacteria "Proteobacteria" Gammaproteobacteria Thiotrichales Piscirickettsiaceae Sulfurivirga | NaN | 1.0 |
Bacteria "Thermodesulfobacteria" Thermodesulfobacteria Thermodesulfobacteriales Thermodesulfobacteriaceae Thermodesulfatator | NaN | 3.0 |
Bacteria TM7 TM7_genera_incertae_sedis | NaN | 2.0 |
438 rows × 2 columns
If you want concat to work like numpy.concatenate, you may provide the ignore_index=True argument, which discards the existing indices and renumbers the rows.

The occurrence of missing data is so prevalent that it pays to use tools like Pandas, which seamlessly integrates missing-data handling so that it can be dealt with easily, and in the manner required by the analysis at hand.
Missing data are represented in Series
and DataFrame
objects by the NaN
floating point value. However, None
is also treated as missing, since it is commonly used as such in other contexts (e.g. NumPy).
import numpy as np
foo = pd.Series([np.nan, -3, None, 'foobar'])
foo
0       NaN
1        -3
2      None
3    foobar
dtype: object
foo.isnull()
0     True
1    False
2     True
3    False
dtype: bool
bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}
bacteria2 = pd.Series(bacteria_dict, index=['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria'])
bacteria2
Cyanobacteria        NaN
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64
Missing values can be dropped with dropna:
bacteria2.dropna()
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64
This is equivalent to boolean indexing with notnull:
bacteria2[bacteria2.notnull()]
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64
By default, dropna on a DataFrame drops any row that contains at least one missing value. This can be overridden by passing the how='all' argument, which only drops a row when every field is a missing value:
data.dropna(how='all')
A related approach keeps only rows with at least a minimum number of non-missing observations, via the thresh argument:
data.dropna(thresh=4)
This is typically used in time series applications, where there are repeated measurements that are incomplete for some subjects.
If we want to drop missing values column-wise instead of row-wise, we use axis=1
.
data.dropna(axis=1)
Rather than omitting missing data, we can fill in values using the fillna method:
bacteria2.fillna(0)
Cyanobacteria        0.0
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
'treatment':[1, 1, 1, None, 2, 2, 2, 2],
'year':[1994,1997,1999, None,2015,2017,2019,2021]})
Passing a dict to fillna specifies a different fill value for each column:
data.fillna({'year': 2013, 'treatment':2})
value | treatment | year | |
---|---|---|---|
0 | 632 | 1.0 | 1994.0 |
1 | 1638 | 1.0 | 1997.0 |
2 | 569 | 1.0 | 1999.0 |
3 | 115 | 2.0 | 2013.0 |
4 | 433 | 2.0 | 2015.0 |
5 | 1130 | 2.0 | 2017.0 |
6 | 754 | 2.0 | 2019.0 |
7 | 555 | 2.0 | 2021.0 |
Notice that fillna
by default returns a new object with the desired filling behavior, rather than changing the Series
or DataFrame
in place.
We can alter values in place by passing inplace=True.
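Missing values can also be propagated rather than replaced by a constant; for example, ffill carries the last valid observation forward down each column:
# fill each NaN with the most recent non-missing value above it
data.ffill()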
One of the most powerful features of Pandas is its GroupBy functionality. On occasion we may want to perform operations on groups of observations within a dataset. For example:
cdystonia = pd.read_csv("data/cdystonia.csv")
cdystonia
patient | obs | week | site | id | treat | age | sex | twstrs | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 0 | 1 | 1 | 5000U | 65 | F | 32 |
1 | 1 | 2 | 2 | 1 | 1 | 5000U | 65 | F | 30 |
2 | 1 | 3 | 4 | 1 | 1 | 5000U | 65 | F | 24 |
3 | 1 | 4 | 8 | 1 | 1 | 5000U | 65 | F | 37 |
4 | 1 | 5 | 12 | 1 | 1 | 5000U | 65 | F | 39 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
626 | 109 | 1 | 0 | 9 | 11 | 5000U | 57 | M | 53 |
627 | 109 | 2 | 2 | 9 | 11 | 5000U | 57 | M | 38 |
628 | 109 | 4 | 8 | 9 | 11 | 5000U | 57 | M | 33 |
629 | 109 | 5 | 12 | 9 | 11 | 5000U | 57 | M | 36 |
630 | 109 | 6 | 16 | 9 | 11 | 5000U | 57 | M | 51 |
631 rows × 9 columns
cdystonia_grouped = cdystonia.groupby(cdystonia.patient)
cdystonia_grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f473e94b5e0>
# iterating over a GroupBy yields (group key, sub-DataFrame) pairs; output suppressed here
# for patient, group in cdystonia_grouped:
#     print(patient)
#     print(group)
A common data analysis procedure is the split-apply-combine operation, which groups subsets of data together, applies a function to each of the groups, then recombines them into a new data table. For example, we may want to aggregate our data with some function.
We can aggregate in Pandas using the aggregate
(or agg
, for short) method:
import numpy as np
cdystonia_grouped.agg(np.mean)
obs | week | site | id | age | twstrs | |
---|---|---|---|---|---|---|
patient | ||||||
1 | 3.5 | 7.0 | 1.0 | 1.0 | 65.0 | 33.000000 |
2 | 3.5 | 7.0 | 1.0 | 2.0 | 70.0 | 47.666667 |
3 | 3.5 | 7.0 | 1.0 | 3.0 | 64.0 | 30.500000 |
4 | 2.5 | 3.5 | 1.0 | 4.0 | 59.0 | 60.000000 |
5 | 3.5 | 7.0 | 1.0 | 5.0 | 76.0 | 46.166667 |
... | ... | ... | ... | ... | ... | ... |
105 | 3.5 | 7.0 | 9.0 | 7.0 | 79.0 | 43.666667 |
106 | 3.5 | 7.0 | 9.0 | 8.0 | 43.0 | 67.666667 |
107 | 3.5 | 7.0 | 9.0 | 9.0 | 50.0 | 42.000000 |
108 | 3.5 | 7.0 | 9.0 | 10.0 | 39.0 | 52.333333 |
109 | 3.6 | 7.6 | 9.0 | 11.0 | 57.0 | 42.200000 |
109 rows × 6 columns
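agg also accepts a list of functions, producing one column per statistic; for example:
# several summary statistics for one column in a single pass
cdystonia_grouped['twstrs'].agg(['mean', 'std', 'max'])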
Notice that the treat and sex variables are not included in the aggregation. Since it does not make sense to compute the mean of string variables, these columns are simply ignored by the method.
Some aggregation functions are so common that Pandas has a convenience method for them, such as mean
:
cdystonia_grouped.mean()
obs | week | site | id | age | twstrs | |
---|---|---|---|---|---|---|
patient | ||||||
1 | 3.5 | 7.0 | 1.0 | 1.0 | 65.0 | 33.000000 |
2 | 3.5 | 7.0 | 1.0 | 2.0 | 70.0 | 47.666667 |
3 | 3.5 | 7.0 | 1.0 | 3.0 | 64.0 | 30.500000 |
4 | 2.5 | 3.5 | 1.0 | 4.0 | 59.0 | 60.000000 |
5 | 3.5 | 7.0 | 1.0 | 5.0 | 76.0 | 46.166667 |
... | ... | ... | ... | ... | ... | ... |
105 | 3.5 | 7.0 | 9.0 | 7.0 | 79.0 | 43.666667 |
106 | 3.5 | 7.0 | 9.0 | 8.0 | 43.0 | 67.666667 |
107 | 3.5 | 7.0 | 9.0 | 9.0 | 50.0 | 42.000000 |
108 | 3.5 | 7.0 | 9.0 | 10.0 | 39.0 | 52.333333 |
109 | 3.6 | 7.6 | 9.0 | 11.0 | 57.0 | 42.200000 |
109 rows × 6 columns
The add_prefix and add_suffix methods can be used to give the columns of the resulting table labels that reflect the transformation:
cdystonia_grouped.mean().add_suffix('_mean')
obs_mean | week_mean | site_mean | id_mean | age_mean | twstrs_mean | |
---|---|---|---|---|---|---|
patient | ||||||
1 | 3.5 | 7.0 | 1.0 | 1.0 | 65.0 | 33.000000 |
2 | 3.5 | 7.0 | 1.0 | 2.0 | 70.0 | 47.666667 |
3 | 3.5 | 7.0 | 1.0 | 3.0 | 64.0 | 30.500000 |
4 | 2.5 | 3.5 | 1.0 | 4.0 | 59.0 | 60.000000 |
5 | 3.5 | 7.0 | 1.0 | 5.0 | 76.0 | 46.166667 |
... | ... | ... | ... | ... | ... | ... |
105 | 3.5 | 7.0 | 9.0 | 7.0 | 79.0 | 43.666667 |
106 | 3.5 | 7.0 | 9.0 | 8.0 | 43.0 | 67.666667 |
107 | 3.5 | 7.0 | 9.0 | 9.0 | 50.0 | 42.000000 |
108 | 3.5 | 7.0 | 9.0 | 10.0 | 39.0 | 52.333333 |
109 | 3.6 | 7.6 | 9.0 | 11.0 | 57.0 | 42.200000 |
109 rows × 6 columns
# The median of the `twstrs` variable
cdystonia_grouped['twstrs'].quantile(0.5)
patient
1      34.0
2      50.5
3      30.5
4      61.5
5      48.5
       ... 
105    45.5
106    67.5
107    44.0
108    50.5
109    38.0
Name: twstrs, Length: 109, dtype: float64
We can also aggregate according to multiple keys:
cdystonia.groupby(['week','site']).mean()
patient | obs | id | age | twstrs | ||
---|---|---|---|---|---|---|
week | site | |||||
0 | 1 | 6.5 | 1.0 | 6.5 | 59.000000 | 43.083333 |
2 | 19.5 | 1.0 | 7.5 | 53.928571 | 51.857143 | |
3 | 32.5 | 1.0 | 6.5 | 51.500000 | 38.750000 | |
4 | 42.5 | 1.0 | 4.5 | 59.250000 | 48.125000 | |
5 | 49.5 | 1.0 | 3.5 | 51.833333 | 49.333333 | |
6 | 60.0 | 1.0 | 8.0 | 51.866667 | 49.400000 | |
7 | 73.5 | 1.0 | 6.5 | 59.250000 | 44.333333 | |
8 | 89.0 | 1.0 | 10.0 | 57.263158 | 38.631579 | |
9 | 104.0 | 1.0 | 6.0 | 55.454545 | 52.727273 | |
2 | 1 | 6.5 | 2.0 | 6.5 | 59.000000 | 31.083333 |
2 | 19.0 | 2.0 | 7.0 | 52.923077 | 48.769231 | |
3 | 32.5 | 2.0 | 6.5 | 51.500000 | 32.416667 | |
4 | 42.5 | 2.0 | 4.5 | 59.250000 | 39.125000 | |
5 | 49.0 | 2.0 | 3.0 | 50.000000 | 44.200000 | |
6 | 60.0 | 2.0 | 8.0 | 51.866667 | 44.066667 | |
7 | 73.5 | 2.0 | 6.5 | 59.250000 | 32.916667 | |
8 | 88.5 | 2.0 | 9.5 | 58.562500 | 29.500000 | |
9 | 103.7 | 2.0 | 5.7 | 56.000000 | 41.600000 | |
4 | 1 | 6.5 | 3.0 | 6.5 | 59.000000 | 33.333333 |
2 | 19.5 | 3.0 | 7.5 | 53.928571 | 48.785714 |
Alternatively, we can transform the data, applying a function group-wise with the transform method:
cdystonia2 = cdystonia_grouped[["obs", "week", "twstrs"]]
normalize = lambda x: (x - x.mean())/x.std()
cdystonia2.transform(normalize)
obs | week | twstrs | |
---|---|---|---|
0 | -1.336306 | -1.135550 | -0.181369 |
1 | -0.801784 | -0.811107 | -0.544107 |
2 | -0.267261 | -0.486664 | -1.632322 |
3 | 0.267261 | 0.162221 | 0.725476 |
4 | 0.801784 | 0.811107 | 1.088214 |
... | ... | ... | ... |
626 | -1.253831 | -1.135467 | 1.180487 |
627 | -0.771589 | -0.836660 | -0.459078 |
628 | 0.192897 | 0.059761 | -1.005600 |
629 | 0.675140 | 0.657376 | -0.677687 |
630 | 1.157383 | 1.254990 | 0.961878 |
631 rows × 3 columns
It is easy to do column selection within groupby operations, if we are only interested in split-apply-combine operations on a subset of columns:
cdystonia_grouped['twstrs'].mean()
patient
1      33.000000
2      47.666667
3      30.500000
4      60.000000
5      46.166667
         ...    
105    43.666667
106    67.666667
107    42.000000
108    52.333333
109    42.200000
Name: twstrs, Length: 109, dtype: float64
If you simply want to divide your DataFrame into chunks for later use, it is easy to convert the GroupBy object into a dict:
chunks = dict(list(cdystonia_grouped))
chunks[4]
patient | obs | week | site | id | treat | age | sex | twstrs | |
---|---|---|---|---|---|---|---|---|---|
18 | 4 | 1 | 0 | 1 | 4 | Placebo | 59 | F | 53 |
19 | 4 | 2 | 2 | 1 | 4 | Placebo | 59 | F | 61 |
20 | 4 | 3 | 4 | 1 | 4 | Placebo | 59 | F | 64 |
21 | 4 | 4 | 8 | 1 | 4 | Placebo | 59 | F | 62 |
By default, groupby groups by row, but we can specify the axis argument to change this. For example, we can group our columns by dtype this way:
dict(list(cdystonia.groupby(cdystonia.dtypes, axis=1)))
{dtype('int64'):      patient  obs  week  site  id  age  twstrs
 0          1    1     0     1   1   65      32
 1          1    2     2     1   1   65      30
 2          1    3     4     1   1   65      24
 3          1    4     8     1   1   65      37
 4          1    5    12     1   1   65      39
 ..       ...  ...   ...   ...  ..  ...     ...
 626      109    1     0     9  11   57      53
 627      109    2     2     9  11   57      38
 628      109    4     8     9  11   57      33
 629      109    5    12     9  11   57      36
 630      109    6    16     9  11   57      51

 [631 rows x 7 columns],
 dtype('O'):     treat sex
 0    5000U   F
 1    5000U   F
 2    5000U   F
 3    5000U   F
 4    5000U   F
 ..     ...  ..
 626  5000U   M
 627  5000U   M
 628  5000U   M
 629  5000U   M
 630  5000U   M

 [631 rows x 2 columns]}