Feng Li
School of Statistics and Mathematics
Central University of Finance and Economics
A key, but often under-appreciated, step in data analysis is importing the data that we wish to analyze.
Pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a DataFrame object. These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported.
These functions include:

- read_csv()
- read_excel()
- read_json()
- read_parquet()
- read_stata()

There are several other data formats that can be imported into Python and converted into DataFrames, with the help of built-in or third-party libraries. These are beyond the scope of this tutorial, but are covered in https://pandas.pydata.org/docs/user_guide/io.html .
Let's start with some more bacteria data, stored in csv format.
! head data/microbiome.csv
Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,1174,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605
Firmicutes,6,693,50
Firmicutes,7,718,717
Firmicutes,8,173,33
Firmicutes,9,228,80
This table can be read directly into a DataFrame using read_csv:

import pandas as pd
mb = pd.read_csv("data/microbiome.csv")
mb
| | Taxon | Patient | Tissue | Stool |
|---|---|---|---|---|
| 0 | Firmicutes | 1 | 632 | 305 |
| 1 | Firmicutes | 2 | 136 | 4182 |
| 2 | Firmicutes | 3 | 1174 | 703 |
| 3 | Firmicutes | 4 | 408 | 3946 |
| 4 | Firmicutes | 5 | 831 | 8605 |
| ... | ... | ... | ... | ... |
| 70 | Other | 11 | 203 | 6 |
| 71 | Other | 12 | 392 | 6 |
| 72 | Other | 13 | 28 | 25 |
| 73 | Other | 14 | 12 | 22 |
| 74 | Other | 15 | 305 | 32 |
75 rows × 4 columns
read_csv automatically considered the first row in the file to be a header row. We can override this default behavior by customizing some of the arguments, like header, names or index_col:

pd.read_csv("data/microbiome.csv", header=None)
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | Taxon | Patient | Tissue | Stool |
| 1 | Firmicutes | 1 | 632 | 305 |
| 2 | Firmicutes | 2 | 136 | 4182 |
| 3 | Firmicutes | 3 | 1174 | 703 |
| 4 | Firmicutes | 4 | 408 | 3946 |
| ... | ... | ... | ... | ... |
| 71 | Other | 11 | 203 | 6 |
| 72 | Other | 12 | 392 | 6 |
| 73 | Other | 13 | 28 | 25 |
| 74 | Other | 14 | 12 | 22 |
| 75 | Other | 15 | 305 | 32 |
76 rows × 4 columns
We can also use one or more columns as the row index via index_col; here, a hierarchical index built from Taxon and Patient:

mb = pd.read_csv("data/microbiome.csv", index_col=['Taxon','Patient'])
mb
| | | Tissue | Stool |
|---|---|---|---|
| Taxon | Patient | | |
| Firmicutes | 1 | 632 | 305 |
| | 2 | 136 | 4182 |
| | 3 | 1174 | 703 |
| | 4 | 408 | 3946 |
| | 5 | 831 | 8605 |
| ... | ... | ... | ... |
| Other | 11 | 203 | 6 |
| | 12 | 392 | 6 |
| | 13 | 28 | 25 |
| | 14 | 12 | 22 |
| | 15 | 305 | 32 |
75 rows × 2 columns
If we only want to import a subset of the rows, we can skip some of them with the skiprows argument. This is useful for large datasets:

pd.read_csv("data/microbiome.csv", skiprows=[3,4,6])
| | Taxon | Patient | Tissue | Stool |
|---|---|---|---|---|
| 0 | Firmicutes | 1 | 632 | 305 |
| 1 | Firmicutes | 2 | 136 | 4182 |
| 2 | Firmicutes | 5 | 831 | 8605 |
| 3 | Firmicutes | 7 | 718 | 717 |
| 4 | Firmicutes | 8 | 173 | 33 |
| ... | ... | ... | ... | ... |
| 67 | Other | 11 | 203 | 6 |
| 68 | Other | 12 | 392 | 6 |
| 69 | Other | 13 | 28 | 25 |
| 70 | Other | 14 | 12 | 22 |
| 71 | Other | 15 | 305 | 32 |
72 rows × 4 columns
Conversely, we can use nrows to retrieve only the first few rows of the file:

pd.read_csv("data/microbiome.csv", nrows=4)
| | Taxon | Patient | Tissue | Stool |
|---|---|---|---|---|
| 0 | Firmicutes | 1 | 632 | 305 |
| 1 | Firmicutes | 2 | 136 | 4182 |
| 2 | Firmicutes | 3 | 1174 | 703 |
| 3 | Firmicutes | 4 | 408 | 3946 |
Alternatively, providing a chunksize argument will return an iterable object that can be employed in a data processing loop. For example, our microbiome data are organized by bacterial phylum, with 15 patients represented in each:

data_chunks = pd.read_csv("data/microbiome.csv", chunksize=15)
data_chunks
<pandas.io.parsers.readers.TextFileReader at 0x7f4741161460>
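Each element yielded by the iterator is an ordinary 15-row DataFrame. As a minimal sketch of consuming it (column names as in the file above), we might total the Tissue counts per phylum without holding the whole file in memory:

# Accumulate per-phylum Tissue totals one 15-row chunk at a time
chunk_totals = {}
for chunk in pd.read_csv("data/microbiome.csv", chunksize=15):
    for taxon, total in chunk.groupby("Taxon")["Tissue"].sum().items():
        chunk_totals[taxon] = chunk_totals.get(taxon, 0) + total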
Pandas automatically recognizes common missing-data indicators when parsing, such as NA and NULL:

!head data/microbiome_missing.csv
Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605
Firmicutes,6,693,50
Firmicutes,7,718,717
Firmicutes,8,173,33
Firmicutes,9,228,NA
import pandas as pd
pd.read_csv("data/microbiome_missing.csv").head(20)
| | Taxon | Patient | Tissue | Stool |
|---|---|---|---|---|
| 0 | Firmicutes | 1 | 632 | 305.0 |
| 1 | Firmicutes | 2 | 136 | 4182.0 |
| 2 | Firmicutes | 3 | NaN | 703.0 |
| 3 | Firmicutes | 4 | 408 | 3946.0 |
| 4 | Firmicutes | 5 | 831 | 8605.0 |
| 5 | Firmicutes | 6 | 693 | 50.0 |
| 6 | Firmicutes | 7 | 718 | 717.0 |
| 7 | Firmicutes | 8 | 173 | 33.0 |
| 8 | Firmicutes | 9 | 228 | NaN |
| 9 | Firmicutes | 10 | 162 | 3196.0 |
| 10 | Firmicutes | 11 | 372 | -99999.0 |
| 11 | Firmicutes | 12 | 4255 | 4361.0 |
| 12 | Firmicutes | 13 | 107 | 1667.0 |
| 13 | Firmicutes | 14 | ? | 223.0 |
| 14 | Firmicutes | 15 | 281 | 2377.0 |
| 15 | Proteobacteria | 1 | 1638 | 3886.0 |
| 16 | Proteobacteria | 2 | 2469 | 1821.0 |
| 17 | Proteobacteria | 3 | 839 | 661.0 |
| 18 | Proteobacteria | 4 | 4414 | 18.0 |
| 19 | Proteobacteria | 5 | 12044 | 83.0 |
Above, Pandas recognized NA and an empty field as missing data.
pd.isnull(pd.read_csv("data/microbiome_missing.csv")).head(20)
| | Taxon | Patient | Tissue | Stool |
|---|---|---|---|---|
| 0 | False | False | False | False |
| 1 | False | False | False | False |
| 2 | False | False | True | False |
| 3 | False | False | False | False |
| 4 | False | False | False | False |
| 5 | False | False | False | False |
| 6 | False | False | False | False |
| 7 | False | False | False | False |
| 8 | False | False | False | True |
| 9 | False | False | False | False |
| 10 | False | False | False | False |
| 11 | False | False | False | False |
| 12 | False | False | False | False |
| 13 | False | False | False | False |
| 14 | False | False | False | False |
| 15 | False | False | False | False |
| 16 | False | False | False | False |
| 17 | False | False | False | False |
| 18 | False | False | False | False |
| 19 | False | False | False | False |
Unfortunately, there will sometimes be inconsistency with the conventions for missing data. In this example, there is a question mark "?" and a large negative number where there should have been a positive integer. We can specify additional symbols with the na_values argument:
pd.read_csv("data/microbiome_missing.csv", na_values=['?', -99999]).head(20)
| | Taxon | Patient | Tissue | Stool |
|---|---|---|---|---|
| 0 | Firmicutes | 1 | 632.0 | 305.0 |
| 1 | Firmicutes | 2 | 136.0 | 4182.0 |
| 2 | Firmicutes | 3 | NaN | 703.0 |
| 3 | Firmicutes | 4 | 408.0 | 3946.0 |
| 4 | Firmicutes | 5 | 831.0 | 8605.0 |
| 5 | Firmicutes | 6 | 693.0 | 50.0 |
| 6 | Firmicutes | 7 | 718.0 | 717.0 |
| 7 | Firmicutes | 8 | 173.0 | 33.0 |
| 8 | Firmicutes | 9 | 228.0 | NaN |
| 9 | Firmicutes | 10 | 162.0 | 3196.0 |
| 10 | Firmicutes | 11 | 372.0 | NaN |
| 11 | Firmicutes | 12 | 4255.0 | 4361.0 |
| 12 | Firmicutes | 13 | 107.0 | 1667.0 |
| 13 | Firmicutes | 14 | NaN | 223.0 |
| 14 | Firmicutes | 15 | 281.0 | 2377.0 |
| 15 | Proteobacteria | 1 | 1638.0 | 3886.0 |
| 16 | Proteobacteria | 2 | 2469.0 | 1821.0 |
| 17 | Proteobacteria | 3 | 839.0 | 661.0 |
| 18 | Proteobacteria | 4 | 4414.0 | 18.0 |
| 19 | Proteobacteria | 5 | 12044.0 | 83.0 |
These can be specified on a column-wise basis using an appropriate dict as the argument for na_values.
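For instance, a sketch that restricts each sentinel to the column where it actually occurs (column names as in the file above):

pd.read_csv("data/microbiome_missing.csv",
            na_values={'Tissue': ['?'], 'Stool': [-99999]})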
Reindexing allows users to manipulate the data labels in a DataFrame. It forces a DataFrame to conform to the new index, and optionally, fill in missing data if requested.
For some variety, we will leave our digestive tract bacteria behind and employ some baseball data.
baseball = pd.read_csv("data/baseball.csv", index_col='id')
baseball
| | player | year | stint | team | lg | g | ab | r | h | X2b | ... | rbi | sb | cs | bb | so | ibb | hbp | sh | sf | gidp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||
| 88641 | womacto01 | 2006 | 2 | CHN | NL | 19 | 50 | 6 | 14 | 1 | ... | 2.0 | 1.0 | 1.0 | 4 | 4.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| 88643 | schilcu01 | 2006 | 1 | BOS | AL | 31 | 2 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 88645 | myersmi01 | 2006 | 1 | NYA | AL | 62 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 88649 | helliri01 | 2006 | 1 | MIL | NL | 20 | 3 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 88650 | johnsra05 | 2006 | 1 | NYA | AL | 33 | 6 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 89525 | benitar01 | 2007 | 2 | FLO | NL | 34 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 89526 | benitar01 | 2007 | 1 | SFN | NL | 19 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 89530 | ausmubr01 | 2007 | 1 | HOU | NL | 117 | 349 | 38 | 82 | 16 | ... | 25.0 | 6.0 | 1.0 | 37 | 74.0 | 3.0 | 6.0 | 4.0 | 1.0 | 11.0 |
| 89533 | aloumo01 | 2007 | 1 | NYN | NL | 87 | 328 | 51 | 112 | 19 | ... | 49.0 | 3.0 | 0.0 | 27 | 30.0 | 5.0 | 2.0 | 0.0 | 3.0 | 13.0 |
| 89534 | alomasa02 | 2007 | 1 | NYN | NL | 8 | 22 | 1 | 3 | 1 | ... | 0.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
100 rows × 22 columns
Notice that we specified the id column as the index, since it appears to be a unique identifier. We could try to create a unique index ourselves by combining player and year:

player_id = baseball.player + baseball.year.astype(str)
baseball_newind = baseball.copy()
baseball_newind.index = player_id
baseball_newind
| | player | year | stint | team | lg | g | ab | r | h | X2b | ... | rbi | sb | cs | bb | so | ibb | hbp | sh | sf | gidp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| womacto012006 | womacto01 | 2006 | 2 | CHN | NL | 19 | 50 | 6 | 14 | 1 | ... | 2.0 | 1.0 | 1.0 | 4 | 4.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| schilcu012006 | schilcu01 | 2006 | 1 | BOS | AL | 31 | 2 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| myersmi012006 | myersmi01 | 2006 | 1 | NYA | AL | 62 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| helliri012006 | helliri01 | 2006 | 1 | MIL | NL | 20 | 3 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| johnsra052006 | johnsra05 | 2006 | 1 | NYA | AL | 33 | 6 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| benitar012007 | benitar01 | 2007 | 2 | FLO | NL | 34 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| benitar012007 | benitar01 | 2007 | 1 | SFN | NL | 19 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ausmubr012007 | ausmubr01 | 2007 | 1 | HOU | NL | 117 | 349 | 38 | 82 | 16 | ... | 25.0 | 6.0 | 1.0 | 37 | 74.0 | 3.0 | 6.0 | 4.0 | 1.0 | 11.0 |
| aloumo012007 | aloumo01 | 2007 | 1 | NYN | NL | 87 | 328 | 51 | 112 | 19 | ... | 49.0 | 3.0 | 0.0 | 27 | 30.0 | 5.0 | 2.0 | 0.0 | 3.0 | 13.0 |
| alomasa022007 | alomasa02 | 2007 | 1 | NYN | NL | 8 | 22 | 1 | 3 | 1 | ... | 0.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
100 rows × 22 columns
baseball_newind.index.is_unique
False
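A non-unique index is still usable, but label-based selection may then return several rows; for example, the label for a player with two 2007 stints:

baseball_newind.loc['benitar012007']  # returns both 2007 stints for this player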
For example, we can use reindex to reverse the row order:

reverse_index = baseball.index[::-1]
baseball.reindex(reverse_index)
| | player | year | stint | team | lg | g | ab | r | h | X2b | ... | rbi | sb | cs | bb | so | ibb | hbp | sh | sf | gidp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||
| 89534 | alomasa02 | 2007 | 1 | NYN | NL | 8 | 22 | 1 | 3 | 1 | ... | 0.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 89533 | aloumo01 | 2007 | 1 | NYN | NL | 87 | 328 | 51 | 112 | 19 | ... | 49.0 | 3.0 | 0.0 | 27 | 30.0 | 5.0 | 2.0 | 0.0 | 3.0 | 13.0 |
| 89530 | ausmubr01 | 2007 | 1 | HOU | NL | 117 | 349 | 38 | 82 | 16 | ... | 25.0 | 6.0 | 1.0 | 37 | 74.0 | 3.0 | 6.0 | 4.0 | 1.0 | 11.0 |
| 89526 | benitar01 | 2007 | 1 | SFN | NL | 19 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 89525 | benitar01 | 2007 | 2 | FLO | NL | 34 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 88650 | johnsra05 | 2006 | 1 | NYA | AL | 33 | 6 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 88649 | helliri01 | 2006 | 1 | MIL | NL | 20 | 3 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 88645 | myersmi01 | 2006 | 1 | NYA | AL | 62 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 88643 | schilcu01 | 2006 | 1 | BOS | AL | 31 | 2 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 88641 | womacto01 | 2006 | 2 | CHN | NL | 19 | 50 | 6 | 14 | 1 | ... | 2.0 | 1.0 | 1.0 | 4 | 4.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 |
100 rows × 22 columns
Notice that the id index is not sequential. Say we wanted to populate the table with every id value. We could specify an index that is a sequence from the first to the last id numbers in the database, and Pandas would fill in the missing data with NaN values:

id_range = range(baseball.index.values.min(), baseball.index.values.max())
baseball.reindex(id_range)
| | player | year | stint | team | lg | g | ab | r | h | X2b | ... | rbi | sb | cs | bb | so | ibb | hbp | sh | sf | gidp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||
| 88641 | womacto01 | 2006.0 | 2.0 | CHN | NL | 19.0 | 50.0 | 6.0 | 14.0 | 1.0 | ... | 2.0 | 1.0 | 1.0 | 4.0 | 4.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| 88642 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 88643 | schilcu01 | 2006.0 | 1.0 | BOS | AL | 31.0 | 2.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 88644 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 88645 | myersmi01 | 2006.0 | 1.0 | NYA | AL | 62.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 89529 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 89530 | ausmubr01 | 2007.0 | 1.0 | HOU | NL | 117.0 | 349.0 | 38.0 | 82.0 | 16.0 | ... | 25.0 | 6.0 | 1.0 | 37.0 | 74.0 | 3.0 | 6.0 | 4.0 | 1.0 | 11.0 |
| 89531 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 89532 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 89533 | aloumo01 | 2007.0 | 1.0 | NYN | NL | 87.0 | 328.0 | 51.0 | 112.0 | 19.0 | ... | 49.0 | 3.0 | 0.0 | 27.0 | 30.0 | 5.0 | 2.0 | 0.0 | 3.0 | 13.0 |
893 rows × 22 columns
Missing values can be filled as desired, either with selected values or by rule, using the fill_value argument:

baseball.reindex(id_range, fill_value='mr.nobody', columns=['player'])
| | player |
|---|---|
| id | |
| 88641 | womacto01 |
| 88642 | mr.nobody |
| 88643 | schilcu01 |
| 88644 | mr.nobody |
| 88645 | myersmi01 |
| ... | ... |
| 89529 | mr.nobody |
| 89530 | ausmubr01 |
| 89531 | mr.nobody |
| 89532 | mr.nobody |
| 89533 | aloumo01 |
893 rows × 1 columns
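Instead of a constant, reindex can also fill gaps by rule via the method argument; 'ffill', for example, propagates the last valid observation forward (a sketch that relies on the index being monotonic, as it is here):

baseball['player'].reindex(id_range, method='ffill')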
Rows can be sorted by their index labels with the sort_index method:

baseball_newind.sort_index()
| | player | year | stint | team | lg | g | ab | r | h | X2b | ... | rbi | sb | cs | bb | so | ibb | hbp | sh | sf | gidp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| alomasa022007 | alomasa02 | 2007 | 1 | NYN | NL | 8 | 22 | 1 | 3 | 1 | ... | 0.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| aloumo012007 | aloumo01 | 2007 | 1 | NYN | NL | 87 | 328 | 51 | 112 | 19 | ... | 49.0 | 3.0 | 0.0 | 27 | 30.0 | 5.0 | 2.0 | 0.0 | 3.0 | 13.0 |
| ausmubr012007 | ausmubr01 | 2007 | 1 | HOU | NL | 117 | 349 | 38 | 82 | 16 | ... | 25.0 | 6.0 | 1.0 | 37 | 74.0 | 3.0 | 6.0 | 4.0 | 1.0 | 11.0 |
| benitar012007 | benitar01 | 2007 | 2 | FLO | NL | 34 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| benitar012007 | benitar01 | 2007 | 1 | SFN | NL | 19 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| wickmbo012007 | wickmbo01 | 2007 | 1 | ATL | NL | 47 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| williwo022007 | williwo02 | 2007 | 1 | HOU | NL | 33 | 59 | 3 | 6 | 0 | ... | 2.0 | 0.0 | 0.0 | 0 | 25.0 | 0.0 | 0.0 | 5.0 | 0.0 | 1.0 |
| witasja012007 | witasja01 | 2007 | 1 | TBA | AL | 3 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| womacto012006 | womacto01 | 2006 | 2 | CHN | NL | 19 | 50 | 6 | 14 | 1 | ... | 2.0 | 1.0 | 1.0 | 4 | 4.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| zaungr012007 | zaungr01 | 2007 | 1 | TOR | AL | 110 | 331 | 43 | 80 | 24 | ... | 52.0 | 0.0 | 0.0 | 51 | 55.0 | 8.0 | 2.0 | 1.0 | 6.0 | 9.0 |
100 rows × 22 columns
baseball_newind.sort_index(ascending=False)
| | player | year | stint | team | lg | g | ab | r | h | X2b | ... | rbi | sb | cs | bb | so | ibb | hbp | sh | sf | gidp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zaungr012007 | zaungr01 | 2007 | 1 | TOR | AL | 110 | 331 | 43 | 80 | 24 | ... | 52.0 | 0.0 | 0.0 | 51 | 55.0 | 8.0 | 2.0 | 1.0 | 6.0 | 9.0 |
| womacto012006 | womacto01 | 2006 | 2 | CHN | NL | 19 | 50 | 6 | 14 | 1 | ... | 2.0 | 1.0 | 1.0 | 4 | 4.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| witasja012007 | witasja01 | 2007 | 1 | TBA | AL | 3 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| williwo022007 | williwo02 | 2007 | 1 | HOU | NL | 33 | 59 | 3 | 6 | 0 | ... | 2.0 | 0.0 | 0.0 | 0 | 25.0 | 0.0 | 0.0 | 5.0 | 0.0 | 1.0 |
| wickmbo012007 | wickmbo01 | 2007 | 1 | ATL | NL | 47 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| benitar012007 | benitar01 | 2007 | 2 | FLO | NL | 34 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| benitar012007 | benitar01 | 2007 | 1 | SFN | NL | 19 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ausmubr012007 | ausmubr01 | 2007 | 1 | HOU | NL | 117 | 349 | 38 | 82 | 16 | ... | 25.0 | 6.0 | 1.0 | 37 | 74.0 | 3.0 | 6.0 | 4.0 | 1.0 | 11.0 |
| aloumo012007 | aloumo01 | 2007 | 1 | NYN | NL | 87 | 328 | 51 | 112 | 19 | ... | 49.0 | 3.0 | 0.0 | 27 | 30.0 | 5.0 | 2.0 | 0.0 | 3.0 | 13.0 |
| alomasa022007 | alomasa02 | 2007 | 1 | NYN | NL | 8 | 22 | 1 | 3 | 1 | ... | 0.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
100 rows × 22 columns
Passing axis=1 sorts the columns by their labels instead:

baseball_newind.sort_index(axis=1)
| | X2b | X3b | ab | bb | cs | g | gidp | h | hbp | hr | ... | player | r | rbi | sb | sf | sh | so | stint | team | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| womacto012006 | 1 | 0 | 50 | 4 | 1.0 | 19 | 0.0 | 14 | 0.0 | 1 | ... | womacto01 | 6 | 2.0 | 1.0 | 0.0 | 3.0 | 4.0 | 2 | CHN | 2006 |
| schilcu012006 | 0 | 0 | 2 | 0 | 0.0 | 31 | 0.0 | 1 | 0.0 | 0 | ... | schilcu01 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 | BOS | 2006 |
| myersmi012006 | 0 | 0 | 0 | 0 | 0.0 | 62 | 0.0 | 0 | 0.0 | 0 | ... | myersmi01 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | NYA | 2006 |
| helliri012006 | 0 | 0 | 3 | 0 | 0.0 | 20 | 0.0 | 0 | 0.0 | 0 | ... | helliri01 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1 | MIL | 2006 |
| johnsra052006 | 0 | 0 | 6 | 0 | 0.0 | 33 | 0.0 | 1 | 0.0 | 0 | ... | johnsra05 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1 | NYA | 2006 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| benitar012007 | 0 | 0 | 0 | 0 | 0.0 | 34 | 0.0 | 0 | 0.0 | 0 | ... | benitar01 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | FLO | 2007 |
| benitar012007 | 0 | 0 | 0 | 0 | 0.0 | 19 | 0.0 | 0 | 0.0 | 0 | ... | benitar01 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | SFN | 2007 |
| ausmubr012007 | 16 | 3 | 349 | 37 | 1.0 | 117 | 11.0 | 82 | 6.0 | 3 | ... | ausmubr01 | 38 | 25.0 | 6.0 | 1.0 | 4.0 | 74.0 | 1 | HOU | 2007 |
| aloumo012007 | 19 | 1 | 328 | 27 | 0.0 | 87 | 13.0 | 112 | 2.0 | 13 | ... | aloumo01 | 51 | 49.0 | 3.0 | 3.0 | 0.0 | 30.0 | 1 | NYN | 2007 |
| alomasa022007 | 1 | 0 | 22 | 0 | 0.0 | 8 | 0.0 | 3 | 0.0 | 0 | ... | alomasa02 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1 | NYN | 2007 |
100 rows × 22 columns
Ranking does not rearrange the data; instead, rank returns each value's rank relative to the others in the Series, with ties assigned the average of the tied ranks by default:

baseball.hr.rank()
id
88641 62.5
88643 29.0
88645 29.0
88649 29.0
88650 29.0
...
89525 29.0
89526 29.0
89530 71.5
89533 88.0
89534 29.0
Name: hr, Length: 100, dtype: float64
Ties can instead be broken by the order in which the values occur, using method='first':

baseball.hr.rank(method='first')
id
88641 58.0
88643 1.0
88645 2.0
88649 3.0
88650 4.0
...
89525 55.0
89526 56.0
89530 72.0
89533 88.0
89534 57.0
Name: hr, Length: 100, dtype: float64
Calling the DataFrame's rank method results in the ranks of all columns:

baseball.rank(ascending=False)
| | player | year | stint | team | lg | g | ab | r | h | X2b | ... | rbi | sb | cs | bb | so | ibb | hbp | sh | sf | gidp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||
| 88641 | 2.0 | 96.5 | 7.0 | 82.0 | 31.5 | 70.0 | 47.5 | 40.5 | 39.0 | 50.5 | ... | 51.0 | 24.5 | 17.5 | 44.5 | 59.0 | 66.0 | 65.5 | 16.0 | 70.0 | 76.5 |
| 88643 | 37.5 | 96.5 | 57.0 | 88.0 | 81.5 | 55.5 | 73.0 | 81.0 | 63.5 | 78.0 | ... | 78.5 | 63.5 | 62.5 | 79.0 | 73.0 | 66.0 | 65.5 | 67.5 | 70.0 | 76.5 |
| 88645 | 47.5 | 96.5 | 57.0 | 40.5 | 81.5 | 36.0 | 91.0 | 81.0 | 84.5 | 78.0 | ... | 78.5 | 63.5 | 62.5 | 79.0 | 89.0 | 66.0 | 65.5 | 67.5 | 70.0 | 76.5 |
| 88649 | 66.0 | 96.5 | 57.0 | 47.0 | 31.5 | 67.5 | 69.0 | 81.0 | 84.5 | 78.0 | ... | 78.5 | 63.5 | 62.5 | 79.0 | 67.0 | 66.0 | 65.5 | 67.5 | 70.0 | 76.5 |
| 88650 | 61.5 | 96.5 | 57.0 | 40.5 | 81.5 | 51.0 | 64.5 | 81.0 | 63.5 | 78.0 | ... | 78.5 | 63.5 | 62.5 | 79.0 | 59.0 | 66.0 | 65.5 | 67.5 | 70.0 | 76.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 89525 | 96.5 | 46.5 | 7.0 | 64.0 | 31.5 | 47.0 | 91.0 | 81.0 | 84.5 | 78.0 | ... | 78.5 | 63.5 | 62.5 | 79.0 | 89.0 | 66.0 | 65.5 | 67.5 | 70.0 | 76.5 |
| 89526 | 96.5 | 46.5 | 57.0 | 13.5 | 31.5 | 70.0 | 91.0 | 81.0 | 84.5 | 78.0 | ... | 78.5 | 63.5 | 62.5 | 79.0 | 89.0 | 66.0 | 65.5 | 67.5 | 70.0 | 76.5 |
| 89530 | 98.0 | 46.5 | 57.0 | 61.5 | 31.5 | 17.5 | 19.0 | 24.0 | 23.0 | 21.5 | ... | 27.0 | 7.0 | 17.5 | 18.5 | 10.0 | 18.0 | 6.5 | 12.0 | 33.5 | 14.0 |
| 89533 | 99.0 | 46.5 | 57.0 | 31.5 | 31.5 | 23.0 | 22.0 | 18.5 | 14.0 | 17.5 | ... | 18.0 | 14.0 | 62.5 | 22.0 | 27.0 | 11.0 | 21.0 | 67.5 | 15.5 | 10.5 |
| 89534 | 100.0 | 46.5 | 57.0 | 31.5 | 31.5 | 77.0 | 57.0 | 58.0 | 58.0 | 50.5 | ... | 78.5 | 63.5 | 62.5 | 79.0 | 63.5 | 66.0 | 65.5 | 67.5 | 70.0 | 76.5 |
100 rows × 22 columns
baseball[['r','h','hr']].rank(ascending=False)
| | r | h | hr |
|---|---|---|---|
| id | |||
| 88641 | 40.5 | 39.0 | 38.5 |
| 88643 | 81.0 | 63.5 | 72.0 |
| 88645 | 81.0 | 84.5 | 72.0 |
| 88649 | 81.0 | 84.5 | 72.0 |
| 88650 | 81.0 | 63.5 | 72.0 |
| ... | ... | ... | ... |
| 89525 | 81.0 | 84.5 | 72.0 |
| 89526 | 81.0 | 84.5 | 72.0 |
| 89530 | 24.0 | 23.0 | 29.5 |
| 89533 | 18.5 | 14.0 | 13.0 |
| 89534 | 58.0 | 58.0 | 72.0 |
100 rows × 3 columns
We can remove rows or columns via the drop method:

baseball.shape
(100, 22)
baseball.drop([89525, 89526]) # does not modify the original DataFrame
| | player | year | stint | team | lg | g | ab | r | h | X2b | ... | rbi | sb | cs | bb | so | ibb | hbp | sh | sf | gidp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||
| 88641 | womacto01 | 2006 | 2 | CHN | NL | 19 | 50 | 6 | 14 | 1 | ... | 2.0 | 1.0 | 1.0 | 4 | 4.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| 88643 | schilcu01 | 2006 | 1 | BOS | AL | 31 | 2 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 88645 | myersmi01 | 2006 | 1 | NYA | AL | 62 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 88649 | helliri01 | 2006 | 1 | MIL | NL | 20 | 3 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 88650 | johnsra05 | 2006 | 1 | NYA | AL | 33 | 6 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 89521 | bondsba01 | 2007 | 1 | SFN | NL | 126 | 340 | 75 | 94 | 14 | ... | 66.0 | 5.0 | 0.0 | 132 | 54.0 | 43.0 | 3.0 | 0.0 | 2.0 | 13.0 |
| 89523 | biggicr01 | 2007 | 1 | HOU | NL | 141 | 517 | 68 | 130 | 31 | ... | 50.0 | 4.0 | 3.0 | 23 | 112.0 | 0.0 | 3.0 | 7.0 | 5.0 | 5.0 |
| 89530 | ausmubr01 | 2007 | 1 | HOU | NL | 117 | 349 | 38 | 82 | 16 | ... | 25.0 | 6.0 | 1.0 | 37 | 74.0 | 3.0 | 6.0 | 4.0 | 1.0 | 11.0 |
| 89533 | aloumo01 | 2007 | 1 | NYN | NL | 87 | 328 | 51 | 112 | 19 | ... | 49.0 | 3.0 | 0.0 | 27 | 30.0 | 5.0 | 2.0 | 0.0 | 3.0 | 13.0 |
| 89534 | alomasa02 | 2007 | 1 | NYN | NL | 8 | 22 | 1 | 3 | 1 | ... | 0.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
98 rows × 22 columns
baseball.shape
(100, 22)
baseball.drop(['ibb','hbp'], axis=1) # in Pandas, axis=0 refers to rows and axis=1 to columns
| | player | year | stint | team | lg | g | ab | r | h | X2b | X3b | hr | rbi | sb | cs | bb | so | sh | sf | gidp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | ||||||||||||||||||||
| 88641 | womacto01 | 2006 | 2 | CHN | NL | 19 | 50 | 6 | 14 | 1 | 0 | 1 | 2.0 | 1.0 | 1.0 | 4 | 4.0 | 3.0 | 0.0 | 0.0 |
| 88643 | schilcu01 | 2006 | 1 | BOS | AL | 31 | 2 | 0 | 1 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 88645 | myersmi01 | 2006 | 1 | NYA | AL | 62 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 88649 | helliri01 | 2006 | 1 | MIL | NL | 20 | 3 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 88650 | johnsra05 | 2006 | 1 | NYA | AL | 33 | 6 | 0 | 1 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 4.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 89525 | benitar01 | 2007 | 2 | FLO | NL | 34 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 89526 | benitar01 | 2007 | 1 | SFN | NL | 19 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 89530 | ausmubr01 | 2007 | 1 | HOU | NL | 117 | 349 | 38 | 82 | 16 | 3 | 3 | 25.0 | 6.0 | 1.0 | 37 | 74.0 | 4.0 | 1.0 | 11.0 |
| 89533 | aloumo01 | 2007 | 1 | NYN | NL | 87 | 328 | 51 | 112 | 19 | 1 | 13 | 49.0 | 3.0 | 0.0 | 27 | 30.0 | 0.0 | 3.0 | 13.0 |
| 89534 | alomasa02 | 2007 | 1 | NYN | NL | 8 | 22 | 1 | 3 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 0.0 |
100 rows × 20 columns
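Since drop returns a new object, keep the result by assigning it (or by passing inplace=True to modify the original):

baseball_trimmed = baseball.drop(['ibb', 'hbp'], axis=1)
baseball_trimmed.shape  # (100, 20)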
DataFrame and Series objects allow for several operations to take place either on a single object, or between two or more objects.
For example, we can perform arithmetic on the elements of two objects, such as combining baseball statistics across years:
# One approach: extract the hr column for each year, then re-label by player
hr2006 = baseball[baseball.year==2006].xs('hr', axis=1)
hr2006.index = baseball.player[baseball.year==2006]
hr2007 = baseball[baseball.year==2007].xs('hr', axis=1)
hr2007.index = baseball.player[baseball.year==2007]

# Equivalently, construct each Series directly with the player index
hr2006 = pd.Series(baseball.hr[baseball.year==2006].values, index=baseball.player[baseball.year==2006])
hr2006
player
womacto01     1
schilcu01     0
myersmi01     0
helliri01     0
johnsra05     0
finlest01     6
gonzalu01    15
seleaa01      0
dtype: int64
hr2007 = pd.Series(baseball.hr[baseball.year==2007].values, index=baseball.player[baseball.year==2007])
hr2007
player
francju01 0
francju01 1
zaungr01 10
witasja01 0
williwo02 1
..
benitar01 0
benitar01 0
ausmubr01 3
aloumo01 13
alomasa02 0
Length: 92, dtype: int64
hr_total = hr2006 + hr2007
hr_total
player
alomasa02 NaN
aloumo01 NaN
ausmubr01 NaN
benitar01 NaN
benitar01 NaN
..
wickmbo01 NaN
williwo02 NaN
witasja01 NaN
womacto01 NaN
zaungr01 NaN
Length: 94, dtype: float64
Pandas aligned the two Series on their labels, inserting NaN values for labels that do not overlap. In fact, there are only 6 players that occur in both years:

hr_total[hr_total.notnull()]
player
finlest01     7.0
gonzalu01    30.0
johnsra05     0.0
myersmi01     0.0
schilcu01     0.0
seleaa01      0.0
dtype: float64
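If we would rather treat a missing year as zero home runs than propagate NaN, the add method accepts a fill_value (a sketch using the Series built above):

hr_total_filled = hr2006.add(hr2007, fill_value=0)
hr_total_filled  # players from either year appear, with no NaNs from non-overlap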
In this section, we will manipulate data collected from ocean-going vessels on the eastern seaboard. Vessel operations are monitored using the Automatic Identification System (AIS), a safety at sea navigation technology which vessels are required to maintain and that uses transponders to transmit very high frequency (VHF) radio signals containing static information including ship name, call sign, and country of origin, as well as dynamic information unique to a particular voyage such as vessel location, heading, and speed.
The International Maritime Organization's (IMO) International Convention for the Safety of Life at Sea requires functioning AIS capabilities on all vessels 300 gross tons or greater, and the US Coast Guard requires AIS on nearly all vessels sailing in U.S. waters. The Coast Guard has established a national network of AIS receivers that provides coverage of nearly all U.S. waters. AIS signals are transmitted several times each minute, and the network is capable of handling thousands of reports per minute, with updates as often as every two seconds. A typical voyage in our study might therefore include the transmission of hundreds or thousands of AIS-encoded signals, providing a rich source of data with both spatial and temporal information.
For our purposes, we will use summarized data that describes the transit of a given vessel through a particular administrative area. The data includes the start and end time of the transit segment, as well as information about the speed of the vessel, how far it travelled, etc.
segments = pd.read_csv("data/AIS/transit_segments.csv")
segments
| | mmsi | name | transit | segment | seg_length | avg_sog | min_sog | max_sog | pdgt10 | st_time | end_time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Us Govt Ves | 1 | 1 | 5.1 | 13.2 | 9.2 | 14.5 | 96.5 | 2/10/09 16:03 | 2/10/09 16:27 |
| 1 | 1 | Dredge Capt Frank | 1 | 1 | 13.5 | 18.6 | 10.4 | 20.6 | 100.0 | 4/6/09 14:31 | 4/6/09 15:20 |
| 2 | 1 | Us Gov Vessel | 1 | 1 | 4.3 | 16.2 | 10.3 | 20.5 | 100.0 | 4/6/09 14:36 | 4/6/09 14:55 |
| 3 | 1 | Us Gov Vessel | 2 | 1 | 9.2 | 15.4 | 14.5 | 16.1 | 100.0 | 4/10/09 17:58 | 4/10/09 18:34 |
| 4 | 1 | Dredge Capt Frank | 2 | 1 | 9.2 | 15.4 | 14.6 | 16.2 | 100.0 | 4/10/09 17:59 | 4/10/09 18:35 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 262521 | 999999999 | Triple Attraction | 3 | 1 | 5.3 | 20.0 | 19.6 | 20.4 | 100.0 | 6/15/10 12:49 | 6/15/10 13:05 |
| 262522 | 999999999 | Triple Attraction | 4 | 1 | 18.7 | 19.2 | 18.4 | 19.9 | 100.0 | 6/15/10 21:32 | 6/15/10 22:29 |
| 262523 | 999999999 | Triple Attraction | 6 | 1 | 17.4 | 17.0 | 14.7 | 18.4 | 100.0 | 6/17/10 19:16 | 6/17/10 20:17 |
| 262524 | 999999999 | Triple Attraction | 7 | 1 | 31.5 | 14.2 | 13.4 | 15.1 | 100.0 | 6/18/10 2:52 | 6/18/10 5:03 |
| 262525 | 999999999 | Triple Attraction | 8 | 1 | 19.8 | 18.6 | 16.1 | 19.5 | 100.0 | 6/18/10 10:19 | 6/18/10 11:22 |
262526 rows × 11 columns
In the data/AIS folder there is a second table that contains information about each of the ships that traveled the segments in the segments table:

vessels = pd.read_csv("data/AIS/vessel_information.csv", index_col='mmsi')
vessels
| | num_names | names | sov | flag | flag_type | num_loas | loa | max_loa | num_types | type |
|---|---|---|---|---|---|---|---|---|---|---|
| mmsi | ||||||||||
| 1 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing |
| 9 | 3 | 000000009/Raven/Shearwater | N | Unknown | Unknown | 2 | 50.0/62.0 | 62.0 | 2 | Pleasure/Tug |
| 21 | 1 | Us Gov Vessel | Y | Unknown | Unknown | 1 | 208.0 | 208.0 | 1 | Unknown |
| 74 | 2 | Mcfaul/Sarah Bell | N | Unknown | Unknown | 1 | 155.0 | 155.0 | 1 | Unknown |
| 103 | 3 | Ron G/Us Navy Warship 103/Us Warship 103 | Y | Unknown | Unknown | 2 | 26.0/155.0 | 155.0 | 2 | Tanker/Unknown |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 919191919 | 1 | Oi | N | Unknown | Unknown | 1 | 20.0 | 20.0 | 1 | Pleasure |
| 967191190 | 1 | Pathfinder | N | Unknown | Unknown | 1 | 31.0 | 31.0 | 2 | BigTow/Towing |
| 975318642 | 1 | Island Express | N | Unknown | Unknown | 1 | 20.0 | 20.0 | 1 | Towing |
| 987654321 | 2 | Island Lookout/Island Tide | N | Unknown | Unknown | 2 | 22.0/23.0 | 23.0 | 2 | Fishing/Towing |
| 999999999 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure |
10771 rows × 10 columns
vessels.type.value_counts()
Cargo 5622
Tanker 2440
Pleasure 601
Tug 221
Sailing 205
...
AntiPol/Other 1
Fishing/Law 1
Cargo/Other/Towing 1
Cargo/Fishing 1
Fishing/Reserved/Towing 1
Name: type, Length: 206, dtype: int64
The challenge, however, is that several ships have travelled multiple segments, so there is not a one-to-one relationship between the rows of the two tables. The table of vessel information has a one-to-many relationship with the segments.
In Pandas, we can combine tables according to the value of one or more keys that are used to identify rows, much like an index. Using a trivial example:
import numpy as np
df1 = pd.DataFrame(dict(id=range(4), age=np.random.randint(18, 31, size=4)))
df2 = pd.DataFrame(dict(id=range(6), score=np.random.random(size=6)))
pd.merge(df1, df2)
| | id | age | score |
|---|---|---|---|
| 0 | 0 | 24 | 0.666117 |
| 1 | 1 | 23 | 0.249913 |
| 2 | 2 | 26 | 0.261396 |
| 3 | 3 | 30 | 0.651306 |
Notice that without any information about which column to use as a key, Pandas did the right thing and used the id column in both tables. Unless specified otherwise, merge will use any common column names as keys for merging the tables.
By default, merge performs an inner join on the tables, meaning that the merged table represents an intersection of the two tables.
pd.merge(df1, df2, how='outer')
| | id | age | score |
|---|---|---|---|
| 0 | 0 | 24.0 | 0.666117 |
| 1 | 1 | 23.0 | 0.249913 |
| 2 | 2 | 26.0 | 0.261396 |
| 3 | 3 | 30.0 | 0.651306 |
| 4 | 4 | NaN | 0.468143 |
| 5 | 5 | NaN | 0.461524 |
The outer join above yields the union of the two tables, so all rows are represented, with missing values inserted as appropriate. One can also perform right and left joins to include all rows of the right or left table (i.e. the second or first argument to merge, respectively), but not necessarily the other.
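For instance, on the same toy tables:

pd.merge(df1, df2, how='left')   # keeps all rows of df1, the left table
pd.merge(df1, df2, how='right')  # keeps all rows of df2, the right table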
Looking at the two datasets that we wish to merge:
segments.head(1)
| | mmsi | name | transit | segment | seg_length | avg_sog | min_sog | max_sog | pdgt10 | st_time | end_time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Us Govt Ves | 1 | 1 | 5.1 | 13.2 | 9.2 | 14.5 | 96.5 | 2/10/09 16:03 | 2/10/09 16:27 |
vessels.head(1)
| | num_names | names | sov | flag | flag_type | num_loas | loa | max_loa | num_types | type |
|---|---|---|---|---|---|---|---|---|---|---|
| mmsi | ||||||||||
| 1 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing |
The two tables share an mmsi value (a vessel identifier), but in the vessels table it serves as the index. In this case, we have to specify that we join on the index for that table, and on the mmsi column for the other:

segments_merged = pd.merge(vessels, segments, left_index=True, right_on='mmsi')
segments_merged
| | num_names | names | sov | flag | flag_type | num_loas | loa | max_loa | num_types | type | ... | name | transit | segment | seg_length | avg_sog | min_sog | max_sog | pdgt10 | st_time | end_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Us Govt Ves | 1 | 1 | 5.1 | 13.2 | 9.2 | 14.5 | 96.5 | 2/10/09 16:03 | 2/10/09 16:27 |
| 1 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Dredge Capt Frank | 1 | 1 | 13.5 | 18.6 | 10.4 | 20.6 | 100.0 | 4/6/09 14:31 | 4/6/09 15:20 |
| 2 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Us Gov Vessel | 1 | 1 | 4.3 | 16.2 | 10.3 | 20.5 | 100.0 | 4/6/09 14:36 | 4/6/09 14:55 |
| 3 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Us Gov Vessel | 2 | 1 | 9.2 | 15.4 | 14.5 | 16.1 | 100.0 | 4/10/09 17:58 | 4/10/09 18:34 |
| 4 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Dredge Capt Frank | 2 | 1 | 9.2 | 15.4 | 14.6 | 16.2 | 100.0 | 4/10/09 17:59 | 4/10/09 18:35 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 262521 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 3 | 1 | 5.3 | 20.0 | 19.6 | 20.4 | 100.0 | 6/15/10 12:49 | 6/15/10 13:05 |
| 262522 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 4 | 1 | 18.7 | 19.2 | 18.4 | 19.9 | 100.0 | 6/15/10 21:32 | 6/15/10 22:29 |
| 262523 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 6 | 1 | 17.4 | 17.0 | 14.7 | 18.4 | 100.0 | 6/17/10 19:16 | 6/17/10 20:17 |
| 262524 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 7 | 1 | 31.5 | 14.2 | 13.4 | 15.1 | 100.0 | 6/18/10 2:52 | 6/18/10 5:03 |
| 262525 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 8 | 1 | 19.8 | 18.6 | 16.1 | 19.5 | 100.0 | 6/18/10 10:19 | 6/18/10 11:22 |
262353 rows × 21 columns
In this case, the default inner join is suitable; we are not interested in observations from either table that do not have corresponding entries in the other.
Notice that the mmsi field that was an index on the vessels table is no longer an index on the merged table.
Here, we used the merge function to perform the merge; we could also have used the merge method for either of the tables:
vessels.merge(segments, left_index=True, right_on='mmsi')
| | num_names | names | sov | flag | flag_type | num_loas | loa | max_loa | num_types | type | ... | name | transit | segment | seg_length | avg_sog | min_sog | max_sog | pdgt10 | st_time | end_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Us Govt Ves | 1 | 1 | 5.1 | 13.2 | 9.2 | 14.5 | 96.5 | 2/10/09 16:03 | 2/10/09 16:27 |
| 1 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Dredge Capt Frank | 1 | 1 | 13.5 | 18.6 | 10.4 | 20.6 | 100.0 | 4/6/09 14:31 | 4/6/09 15:20 |
| 2 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Us Gov Vessel | 1 | 1 | 4.3 | 16.2 | 10.3 | 20.5 | 100.0 | 4/6/09 14:36 | 4/6/09 14:55 |
| 3 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Us Gov Vessel | 2 | 1 | 9.2 | 15.4 | 14.5 | 16.1 | 100.0 | 4/10/09 17:58 | 4/10/09 18:34 |
| 4 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | Dredge Capt Frank | 2 | 1 | 9.2 | 15.4 | 14.6 | 16.2 | 100.0 | 4/10/09 17:59 | 4/10/09 18:35 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 262521 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 3 | 1 | 5.3 | 20.0 | 19.6 | 20.4 | 100.0 | 6/15/10 12:49 | 6/15/10 13:05 |
| 262522 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 4 | 1 | 18.7 | 19.2 | 18.4 | 19.9 | 100.0 | 6/15/10 21:32 | 6/15/10 22:29 |
| 262523 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 6 | 1 | 17.4 | 17.0 | 14.7 | 18.4 | 100.0 | 6/17/10 19:16 | 6/17/10 20:17 |
| 262524 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 7 | 1 | 31.5 | 14.2 | 13.4 | 15.1 | 100.0 | 6/18/10 2:52 | 6/18/10 5:03 |
| 262525 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | Triple Attraction | 8 | 1 | 19.8 | 18.6 | 16.1 | 19.5 | 100.0 | 6/18/10 10:19 | 6/18/10 11:22 |
262353 rows × 21 columns
If the two tables contain columns with the same name that are not used as merge keys, Pandas appends the suffixes _x and _y to the columns to uniquely identify them. To demonstrate, add a type column to segments that clashes with the one in vessels:

segments['type'] = 'foo'
pd.merge(vessels, segments, left_index=True, right_on='mmsi')
| | num_names | names | sov | flag | flag_type | num_loas | loa | max_loa | num_types | type_x | ... | transit | segment | seg_length | avg_sog | min_sog | max_sog | pdgt10 | st_time | end_time | type_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | 1 | 1 | 5.1 | 13.2 | 9.2 | 14.5 | 96.5 | 2/10/09 16:03 | 2/10/09 16:27 | foo |
| 1 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | 1 | 1 | 13.5 | 18.6 | 10.4 | 20.6 | 100.0 | 4/6/09 14:31 | 4/6/09 15:20 | foo |
| 2 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | 1 | 1 | 4.3 | 16.2 | 10.3 | 20.5 | 100.0 | 4/6/09 14:36 | 4/6/09 14:55 | foo |
| 3 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | 2 | 1 | 9.2 | 15.4 | 14.5 | 16.1 | 100.0 | 4/10/09 17:58 | 4/10/09 18:34 | foo |
| 4 | 8 | Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho... | Y | Unknown | Unknown | 7 | 42.0/48.0/57.0/90.0/138.0/154.0/156.0 | 156.0 | 4 | Dredging/MilOps/Reserved/Towing | ... | 2 | 1 | 9.2 | 15.4 | 14.6 | 16.2 | 100.0 | 4/10/09 17:59 | 4/10/09 18:35 | foo |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 262521 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | 3 | 1 | 5.3 | 20.0 | 19.6 | 20.4 | 100.0 | 6/15/10 12:49 | 6/15/10 13:05 | foo |
| 262522 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | 4 | 1 | 18.7 | 19.2 | 18.4 | 19.9 | 100.0 | 6/15/10 21:32 | 6/15/10 22:29 | foo |
| 262523 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | 6 | 1 | 17.4 | 17.0 | 14.7 | 18.4 | 100.0 | 6/17/10 19:16 | 6/17/10 20:17 | foo |
| 262524 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | 7 | 1 | 31.5 | 14.2 | 13.4 | 15.1 | 100.0 | 6/18/10 2:52 | 6/18/10 5:03 | foo |
| 262525 | 1 | Triple Attraction | N | Unknown | Unknown | 1 | 30.0 | 30.0 | 1 | Pleasure | ... | 8 | 1 | 19.8 | 18.6 | 16.1 | 19.5 | 100.0 | 6/18/10 10:19 | 6/18/10 11:22 | foo |
262353 rows × 22 columns
This behavior can be overridden via the suffixes argument, which takes the pair of suffixes to append to the columns of the left and right tables, respectively (e.g. suffixes=('_vessel', '_segment')).

A common data manipulation is appending rows or columns to a dataset that already conform to the dimensions of the existing rows or columns, respectively. In NumPy, this is done either with concatenate or the convenience functions c_ and r_:
np.concatenate([np.random.random(5), np.random.random(5)])
array([0.42336207, 0.74807952, 0.61839076, 0.54794432, 0.06227732,
0.71618874, 0.31763132, 0.26021656, 0.22395665, 0.08499033])
np.r_[np.random.random(5), np.random.random(5)]
array([0.48601962, 0.11484701, 0.93892836, 0.16884999, 0.71700162,
0.92519913, 0.26827622, 0.41866975, 0.59348726, 0.06054373])
np.c_[np.random.random(5), np.random.random(5)]
array([[0.48861909, 0.23961022],
[0.68685816, 0.7662155 ],
[0.76304197, 0.63356894],
[0.45533848, 0.36265383],
[0.85205653, 0.84605096]])
This operation is also called binding or stacking. With Pandas' indexed data structures, there are additional considerations, as the overlap in index values between two data structures affects how they are concatenated.

Let's import two microbiome datasets, each consisting of counts of microorganisms from a particular patient. We will use the first column of each dataset as the index.
# Pandas requires external modules to read Excel files
! pip3 install xlrd openpyxl --user
Looking in indexes: https://mirrors.163.com/pypi/simple/
Requirement already satisfied: xlrd in /home/fli/.local/lib/python3.9/site-packages (2.0.1)
Requirement already satisfied: openpyxl in /home/fli/.local/lib/python3.9/site-packages (3.0.9)
Requirement already satisfied: et-xmlfile in /home/fli/.local/lib/python3.9/site-packages (from openpyxl) (1.0.1)
mb1 = pd.read_excel('data/microbiome/MID1.xls', 'Sheet 1', index_col=0, header=None)
mb2 = pd.read_excel('data/microbiome/MID2.xls', 'Sheet 1', index_col=0, header=None)
mb1.shape, mb2.shape
((272, 1), (288, 1))
mb1
| | 1 |
|---|---|
| 0 | |
| Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera | 7 |
| Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus | 2 |
| Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus | 3 |
| Archaea "Crenarchaeota" Thermoprotei Thermoproteales Thermofilaceae Thermofilum | 3 |
| Archaea "Euryarchaeota" "Methanomicrobia" Methanocellales Methanocellaceae Methanocella | 7 |
| ... | ... |
| Bacteria "Thermotogae" Thermotogae Thermotogales Thermotogaceae Kosmotoga | 9 |
| Bacteria "Verrucomicrobia" Opitutae Opitutales Opitutaceae Alterococcus | 1 |
| Bacteria Cyanobacteria Cyanobacteria Chloroplast Bangiophyceae | 2 |
| Bacteria Cyanobacteria Cyanobacteria Chloroplast Chlorarachniophyceae | 85 |
| Bacteria Cyanobacteria Cyanobacteria Chloroplast Streptophyta | 1388 |
272 rows × 1 columns
mb1.columns = mb2.columns = ['Count']
mb1.index.name = mb2.index.name = 'Taxon'
mb1
| | Count |
|---|---|
| Taxon | |
| Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera | 7 |
| Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus | 2 |
| Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus | 3 |
| Archaea "Crenarchaeota" Thermoprotei Thermoproteales Thermofilaceae Thermofilum | 3 |
| Archaea "Euryarchaeota" "Methanomicrobia" Methanocellales Methanocellaceae Methanocella | 7 |
| ... | ... |
| Bacteria "Thermotogae" Thermotogae Thermotogales Thermotogaceae Kosmotoga | 9 |
| Bacteria "Verrucomicrobia" Opitutae Opitutales Opitutaceae Alterococcus | 1 |
| Bacteria Cyanobacteria Cyanobacteria Chloroplast Bangiophyceae | 2 |
| Bacteria Cyanobacteria Cyanobacteria Chloroplast Chlorarachniophyceae | 85 |
| Bacteria Cyanobacteria Cyanobacteria Chloroplast Streptophyta | 1388 |
272 rows × 1 columns
mb1.index[:3]
Index(['Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera',
'Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus',
'Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus'],
dtype='object', name='Taxon')
mb1.index.is_unique
True
If we concatenate along axis=0 (the default), we obtain another DataFrame with the rows concatenated:

pd.concat([mb1, mb2], axis=0).shape
(560, 1)
However, the index is no longer unique, due to overlap between the two DataFrames.
pd.concat([mb1, mb2], axis=0).index.is_unique
False
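One way to keep the combined rows addressable is the keys argument, which adds an outer index level identifying each row's source; assuming each patient's taxa are themselves unique, the resulting MultiIndex is unique:

pd.concat([mb1, mb2], axis=0, keys=['patient1', 'patient2']).index.is_unique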
Passing axis=1 will concatenate column-wise, but respecting the indices of the two DataFrames:

pd.concat([mb1, mb2], axis=1).shape
(438, 2)
pd.concat([mb1, mb2], axis=1)
| | Count | Count |
|---|---|---|
| Taxon | ||
| Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera | 7.0 | 23.0 |
| Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus | 2.0 | 2.0 |
| Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus | 3.0 | 10.0 |
| Archaea "Crenarchaeota" Thermoprotei Thermoproteales Thermofilaceae Thermofilum | 3.0 | 9.0 |
| Archaea "Euryarchaeota" "Methanomicrobia" Methanocellales Methanocellaceae Methanocella | 7.0 | 9.0 |
| ... | ... | ... |
| Bacteria "Proteobacteria" Gammaproteobacteria Oceanospirillales Oceanospirillales_incertae_sedis Spongiispira | NaN | 1.0 |
| Bacteria "Proteobacteria" Gammaproteobacteria Thiotrichales Piscirickettsiaceae Hydrogenovibrio | NaN | 9.0 |
| Bacteria "Proteobacteria" Gammaproteobacteria Thiotrichales Piscirickettsiaceae Sulfurivirga | NaN | 1.0 |
| Bacteria "Thermodesulfobacteria" Thermodesulfobacteria Thermodesulfobacteriales Thermodesulfobacteriaceae Thermodesulfatator | NaN | 3.0 |
| Bacteria TM7 TM7_genera_incertae_sedis | NaN | 2.0 |
438 rows × 2 columns
pd.concat([mb1, mb2], axis=1).values[:5]
array([[ 7., 23.],
[ 2., 2.],
[ 3., 10.],
[ 3., 9.],
[ 7., 9.]])
If we are only interested in taxa that appear in both DataFrames, we can restrict the concatenation to the intersection of the indices with the join='inner' argument:

pd.concat([mb1, mb2], axis=1, join='inner')
| | Count | Count |
|---|---|---|
| Taxon | ||
| Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera | 7 | 23 |
| Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus | 2 | 2 |
| Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus | 3 | 10 |
| Archaea "Crenarchaeota" Thermoprotei Thermoproteales Thermofilaceae Thermofilum | 3 | 9 |
| Archaea "Euryarchaeota" "Methanomicrobia" Methanocellales Methanocellaceae Methanocella | 7 | 9 |
| ... | ... | ... |
| Bacteria "Thermodesulfobacteria" Thermodesulfobacteria Thermodesulfobacteriales Thermodesulfobacteriaceae Caldimicrobium | 1 | 1 |
| Bacteria "Thermotogae" Thermotogae Thermotogales Thermotogaceae Geotoga | 7 | 15 |
| Bacteria "Thermotogae" Thermotogae Thermotogales Thermotogaceae Kosmotoga | 9 | 22 |
| Bacteria Cyanobacteria Cyanobacteria Chloroplast Chlorarachniophyceae | 85 | 1 |
| Bacteria Cyanobacteria Cyanobacteria Chloroplast Streptophyta | 1388 | 2 |
122 rows × 2 columns
If we wanted to use the first table preferentially, filling in its missing values with the corresponding values from the second table, we could use combine_first:

mb1.combine_first(mb2)
| | Count |
|---|---|
| Taxon | |
| Archaea "Crenarchaeota" Thermoprotei Acidilobales Acidilobaceae Acidilobus | 2 |
| Archaea "Crenarchaeota" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera | 14 |
| Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera | 7 |
| Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Sulfophobococcus | 1 |
| Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Thermosphaera | 2 |
| ... | ... |
| Bacteria "Verrucomicrobia" Opitutae Opitutales Opitutaceae Alterococcus | 1 |
| Bacteria Cyanobacteria Cyanobacteria Chloroplast Bangiophyceae | 2 |
| Bacteria Cyanobacteria Cyanobacteria Chloroplast Chlorarachniophyceae | 85 |
| Bacteria Cyanobacteria Cyanobacteria Chloroplast Streptophyta | 1388 |
| Bacteria TM7 TM7_genera_incertae_sedis | 2 |
438 rows × 1 columns
We can also build a hierarchical column index with keys identifying the original tables, for example by passing a dict:

pd.concat(dict(patient1=mb1, patient2=mb2), axis=1)
| | patient1 | patient2 |
|---|---|---|
| | Count | Count |
| Taxon | ||
| Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera | 7.0 | 23.0 |
| Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus | 2.0 | 2.0 |
| Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus | 3.0 | 10.0 |
| Archaea "Crenarchaeota" Thermoprotei Thermoproteales Thermofilaceae Thermofilum | 3.0 | 9.0 |
| Archaea "Euryarchaeota" "Methanomicrobia" Methanocellales Methanocellaceae Methanocella | 7.0 | 9.0 |
| ... | ... | ... |
| Bacteria "Proteobacteria" Gammaproteobacteria Oceanospirillales Oceanospirillales_incertae_sedis Spongiispira | NaN | 1.0 |
| Bacteria "Proteobacteria" Gammaproteobacteria Thiotrichales Piscirickettsiaceae Hydrogenovibrio | NaN | 9.0 |
| Bacteria "Proteobacteria" Gammaproteobacteria Thiotrichales Piscirickettsiaceae Sulfurivirga | NaN | 1.0 |
| Bacteria "Thermodesulfobacteria" Thermodesulfobacteria Thermodesulfobacteriales Thermodesulfobacteriaceae Thermodesulfatator | NaN | 3.0 |
| Bacteria TM7 TM7_genera_incertae_sedis | NaN | 2.0 |
438 rows × 2 columns
If you want concat to behave like numpy.concatenate, discarding the existing indices, you may provide the ignore_index=True argument.

The occurrence of missing data is so prevalent that it pays to use tools like Pandas, which seamlessly integrates missing data handling so that it can be dealt with easily, and in the manner required by the analysis at hand.
Missing data are represented in Series and DataFrame objects by the NaN floating point value. However, None is also treated as missing, since it is commonly used as such in other contexts (e.g. NumPy).
import numpy as np
foo = pd.Series([np.nan, -3, None, 'foobar'])
foo
0       NaN
1        -3
2      None
3    foobar
dtype: object
The isnull method identifies the missing entries:

foo.isnull()
0     True
1    False
2     True
3    False
dtype: bool
Constructing a Series from a dict with an index label that has no corresponding entry also yields a missing value:

bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}
bacteria2 = pd.Series(bacteria_dict, index=['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria'])
bacteria2
Cyanobacteria        NaN
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64
Missing values can be dropped from a Series with dropna:

bacteria2.dropna()
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64
This is equivalent to indexing with notnull:

bacteria2[bacteria2.notnull()]
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64
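The dropna examples below operate on a DataFrame named data. As a stand-in (hypothetical values, not drawn from the datasets above), one might construct:

import numpy as np
# Hypothetical frame: row 1 is entirely missing, other gaps are scattered
data = pd.DataFrame({'a': [1, np.nan, 3, np.nan],
                     'b': [4, np.nan, 6, 7],
                     'c': [7, np.nan, 9, 10],
                     'd': [np.nan, np.nan, 11, 12],
                     'e': [13, np.nan, 14, 15]})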
By default, dropna on a DataFrame drops any row containing at least one missing value. Passing the how='all' argument only drops a row when every field is a missing value:

data.dropna(how='all')
This can be customized further with the thresh argument, which keeps only rows with at least the given number of non-missing values:

data.dropna(thresh=4)
This is typically used in time series applications, where there are repeated measurements that are incomplete for some subjects.
If we want to drop missing values column-wise instead of row-wise, we use axis=1.
data.dropna(axis=1)
Rather than omitting missing data, we can fill in the holes with a value of our choosing using the fillna method:

bacteria2.fillna(0)
Cyanobacteria        0.0
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64
fillna also accepts a dict that specifies a different fill value for each column:

data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
'treatment':[1, 1, 1, None, 2, 2, 2, 2],
'year':[1994,1997,1999, None,2015,2017,2019,2021]})
data.fillna({'year': 2013, 'treatment':2})
| | value | treatment | year |
|---|---|---|---|
| 0 | 632 | 1.0 | 1994.0 |
| 1 | 1638 | 1.0 | 1997.0 |
| 2 | 569 | 1.0 | 1999.0 |
| 3 | 115 | 2.0 | 2013.0 |
| 4 | 433 | 2.0 | 2015.0 |
| 5 | 1130 | 2.0 | 2017.0 |
| 6 | 754 | 2.0 | 2019.0 |
| 7 | 555 | 2.0 | 2021.0 |
Notice that fillna by default returns a new object with the desired filling behavior, rather than changing the Series or DataFrame in place.
We can alter values in-place using inplace=True.
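For example, the same dict fill done in place:

data.fillna({'year': 2013, 'treatment': 2}, inplace=True)  # modifies data itself; returns None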
One of the most powerful features of Pandas is its GroupBy functionality. On occasion we may want to perform operations on groups of observations within a dataset. For example:
cdystonia = pd.read_csv("data/cdystonia.csv")
cdystonia
| | patient | obs | week | site | id | treat | age | sex | twstrs |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 1 | 5000U | 65 | F | 32 |
| 1 | 1 | 2 | 2 | 1 | 1 | 5000U | 65 | F | 30 |
| 2 | 1 | 3 | 4 | 1 | 1 | 5000U | 65 | F | 24 |
| 3 | 1 | 4 | 8 | 1 | 1 | 5000U | 65 | F | 37 |
| 4 | 1 | 5 | 12 | 1 | 1 | 5000U | 65 | F | 39 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 626 | 109 | 1 | 0 | 9 | 11 | 5000U | 57 | M | 53 |
| 627 | 109 | 2 | 2 | 9 | 11 | 5000U | 57 | M | 38 |
| 628 | 109 | 4 | 8 | 9 | 11 | 5000U | 57 | M | 33 |
| 629 | 109 | 5 | 12 | 9 | 11 | 5000U | 57 | M | 36 |
| 630 | 109 | 6 | 16 | 9 | 11 | 5000U | 57 | M | 51 |
631 rows × 9 columns
cdystonia_grouped = cdystonia.groupby(cdystonia.patient)
cdystonia_grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f473e94b5e0>
# Iterating over the GroupBy object yields (group key, sub-DataFrame) pairs;
# uncomment to print all 109 patient groups
# for patient, group in cdystonia_grouped:
#     print(patient)
#     print(group)
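The grouping can also be inspected directly; for instance, get_group retrieves the rows of a single group by its key:

len(cdystonia_grouped)           # number of groups (patients)
cdystonia_grouped.get_group(4)   # all observations for patient 4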
A common data analysis procedure is the split-apply-combine operation, which groups subsets of data together, applies a function to each of the groups, then recombines them into a new data table. For example, we may want to aggregate our data with some function.
We can aggregate in Pandas using the aggregate (or agg, for short) method:
import numpy as np
cdystonia_grouped.agg(np.mean)
| | obs | week | site | id | age | twstrs |
|---|---|---|---|---|---|---|
| patient | ||||||
| 1 | 3.5 | 7.0 | 1.0 | 1.0 | 65.0 | 33.000000 |
| 2 | 3.5 | 7.0 | 1.0 | 2.0 | 70.0 | 47.666667 |
| 3 | 3.5 | 7.0 | 1.0 | 3.0 | 64.0 | 30.500000 |
| 4 | 2.5 | 3.5 | 1.0 | 4.0 | 59.0 | 60.000000 |
| 5 | 3.5 | 7.0 | 1.0 | 5.0 | 76.0 | 46.166667 |
| ... | ... | ... | ... | ... | ... | ... |
| 105 | 3.5 | 7.0 | 9.0 | 7.0 | 79.0 | 43.666667 |
| 106 | 3.5 | 7.0 | 9.0 | 8.0 | 43.0 | 67.666667 |
| 107 | 3.5 | 7.0 | 9.0 | 9.0 | 50.0 | 42.000000 |
| 108 | 3.5 | 7.0 | 9.0 | 10.0 | 39.0 | 52.333333 |
| 109 | 3.6 | 7.6 | 9.0 | 11.0 | 57.0 | 42.200000 |
109 rows × 6 columns
Notice that the treat and sex variables are not included in the aggregation. Since it does not make sense to compute the mean of string variables, these columns are simply ignored by the method.
Some aggregation functions are so common that Pandas has a convenience method for them, such as mean:
cdystonia_grouped.mean()
| | obs | week | site | id | age | twstrs |
|---|---|---|---|---|---|---|
| patient | ||||||
| 1 | 3.5 | 7.0 | 1.0 | 1.0 | 65.0 | 33.000000 |
| 2 | 3.5 | 7.0 | 1.0 | 2.0 | 70.0 | 47.666667 |
| 3 | 3.5 | 7.0 | 1.0 | 3.0 | 64.0 | 30.500000 |
| 4 | 2.5 | 3.5 | 1.0 | 4.0 | 59.0 | 60.000000 |
| 5 | 3.5 | 7.0 | 1.0 | 5.0 | 76.0 | 46.166667 |
| ... | ... | ... | ... | ... | ... | ... |
| 105 | 3.5 | 7.0 | 9.0 | 7.0 | 79.0 | 43.666667 |
| 106 | 3.5 | 7.0 | 9.0 | 8.0 | 43.0 | 67.666667 |
| 107 | 3.5 | 7.0 | 9.0 | 9.0 | 50.0 | 42.000000 |
| 108 | 3.5 | 7.0 | 9.0 | 10.0 | 39.0 | 52.333333 |
| 109 | 3.6 | 7.6 | 9.0 | 11.0 | 57.0 | 42.200000 |
109 rows × 6 columns
The add_prefix and add_suffix methods can be used to give the columns of the resulting table labels that reflect the transformation:

cdystonia_grouped.mean().add_suffix('_mean')
| | obs_mean | week_mean | site_mean | id_mean | age_mean | twstrs_mean |
|---|---|---|---|---|---|---|
| patient | ||||||
| 1 | 3.5 | 7.0 | 1.0 | 1.0 | 65.0 | 33.000000 |
| 2 | 3.5 | 7.0 | 1.0 | 2.0 | 70.0 | 47.666667 |
| 3 | 3.5 | 7.0 | 1.0 | 3.0 | 64.0 | 30.500000 |
| 4 | 2.5 | 3.5 | 1.0 | 4.0 | 59.0 | 60.000000 |
| 5 | 3.5 | 7.0 | 1.0 | 5.0 | 76.0 | 46.166667 |
| ... | ... | ... | ... | ... | ... | ... |
| 105 | 3.5 | 7.0 | 9.0 | 7.0 | 79.0 | 43.666667 |
| 106 | 3.5 | 7.0 | 9.0 | 8.0 | 43.0 | 67.666667 |
| 107 | 3.5 | 7.0 | 9.0 | 9.0 | 50.0 | 42.000000 |
| 108 | 3.5 | 7.0 | 9.0 | 10.0 | 39.0 | 52.333333 |
| 109 | 3.6 | 7.6 | 9.0 | 11.0 | 57.0 | 42.200000 |
109 rows × 6 columns
# The median of the `twstrs` variable
cdystonia_grouped['twstrs'].quantile(0.5)
patient
1 34.0
2 50.5
3 30.5
4 61.5
5 48.5
...
105 45.5
106 67.5
107 44.0
108 50.5
109 38.0
Name: twstrs, Length: 109, dtype: float64
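Several summaries can be computed in one pass by handing agg a list of functions or function names:

cdystonia_grouped['twstrs'].agg(['mean', 'min', 'max'])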
We can also aggregate according to multiple grouping keys:

cdystonia.groupby(['week','site']).mean()
| | | patient | obs | id | age | twstrs |
|---|---|---|---|---|---|---|
| week | site | | | | | |
| 0 | 1 | 6.5 | 1.0 | 6.5 | 59.000000 | 43.083333 |
| | 2 | 19.5 | 1.0 | 7.5 | 53.928571 | 51.857143 |
| | 3 | 32.5 | 1.0 | 6.5 | 51.500000 | 38.750000 |
| | 4 | 42.5 | 1.0 | 4.5 | 59.250000 | 48.125000 |
| | 5 | 49.5 | 1.0 | 3.5 | 51.833333 | 49.333333 |
| | 6 | 60.0 | 1.0 | 8.0 | 51.866667 | 49.400000 |
| | 7 | 73.5 | 1.0 | 6.5 | 59.250000 | 44.333333 |
| | 8 | 89.0 | 1.0 | 10.0 | 57.263158 | 38.631579 |
| | 9 | 104.0 | 1.0 | 6.0 | 55.454545 | 52.727273 |
| 2 | 1 | 6.5 | 2.0 | 6.5 | 59.000000 | 31.083333 |
| | 2 | 19.0 | 2.0 | 7.0 | 52.923077 | 48.769231 |
| | 3 | 32.5 | 2.0 | 6.5 | 51.500000 | 32.416667 |
| | 4 | 42.5 | 2.0 | 4.5 | 59.250000 | 39.125000 |
| | 5 | 49.0 | 2.0 | 3.0 | 50.000000 | 44.200000 |
| | 6 | 60.0 | 2.0 | 8.0 | 51.866667 | 44.066667 |
| | 7 | 73.5 | 2.0 | 6.5 | 59.250000 | 32.916667 |
| | 8 | 88.5 | 2.0 | 9.5 | 58.562500 | 29.500000 |
| | 9 | 103.7 | 2.0 | 5.7 | 56.000000 | 41.600000 |
| 4 | 1 | 6.5 | 3.0 | 6.5 | 59.000000 | 33.333333 |
| | 2 | 19.5 | 3.0 | 7.5 | 53.928571 | 48.785714 |
Alternatively, we can transform the data, using an arbitrary function that returns a same-shaped result, via the transform method:

cdystonia2 = cdystonia_grouped[["obs", "week", "twstrs"]]
normalize = lambda x: (x - x.mean())/x.std()
cdystonia2.transform(normalize)
| | obs | week | twstrs |
|---|---|---|---|
| 0 | -1.336306 | -1.135550 | -0.181369 |
| 1 | -0.801784 | -0.811107 | -0.544107 |
| 2 | -0.267261 | -0.486664 | -1.632322 |
| 3 | 0.267261 | 0.162221 | 0.725476 |
| 4 | 0.801784 | 0.811107 | 1.088214 |
| ... | ... | ... | ... |
| 626 | -1.253831 | -1.135467 | 1.180487 |
| 627 | -0.771589 | -0.836660 | -0.459078 |
| 628 | 0.192897 | 0.059761 | -1.005600 |
| 629 | 0.675140 | 0.657376 | -0.677687 |
| 630 | 1.157383 | 1.254990 | 0.961878 |
631 rows × 3 columns
It is easy to select a column within groupby operations, if we are only interested in split-apply-combine operations on a subset of columns:

cdystonia_grouped['twstrs'].mean()
patient
1 33.000000
2 47.666667
3 30.500000
4 60.000000
5 46.166667
...
105 43.666667
106 67.666667
107 42.000000
108 52.333333
109 42.200000
Name: twstrs, Length: 109, dtype: float64
If we simply want to divide the DataFrame into chunks for later use, it is easy to convert the grouping into a dict of DataFrames keyed by group:

chunks = dict(list(cdystonia_grouped))
chunks[4]
| | patient | obs | week | site | id | treat | age | sex | twstrs |
|---|---|---|---|---|---|---|---|---|---|
| 18 | 4 | 1 | 0 | 1 | 4 | Placebo | 59 | F | 53 |
| 19 | 4 | 2 | 2 | 1 | 4 | Placebo | 59 | F | 61 |
| 20 | 4 | 3 | 4 | 1 | 4 | Placebo | 59 | F | 64 |
| 21 | 4 | 4 | 8 | 1 | 4 | Placebo | 59 | F | 62 |
By default, groupby groups by row, but we can specify the axis argument to change this. For example, we can group our columns by dtype this way:

dict(list(cdystonia.groupby(cdystonia.dtypes, axis=1)))
{dtype('int64'): patient obs week site id age twstrs
0 1 1 0 1 1 65 32
1 1 2 2 1 1 65 30
2 1 3 4 1 1 65 24
3 1 4 8 1 1 65 37
4 1 5 12 1 1 65 39
.. ... ... ... ... .. ... ...
626 109 1 0 9 11 57 53
627 109 2 2 9 11 57 38
628 109 4 8 9 11 57 33
629 109 5 12 9 11 57 36
630 109 6 16 9 11 57 51
[631 rows x 7 columns],
dtype('O'): treat sex
0 5000U F
1 5000U F
2 5000U F
3 5000U F
4 5000U F
.. ... ..
626 5000U M
627 5000U M
628 5000U M
629 5000U M
630 5000U M
[631 rows x 2 columns]}