Strings and Custom Functions in Pandas¶

Feng Li

School of Statistics and Mathematics

Central University of Finance and Economics

feng.li@cufe.edu.cn

https://feng.li/python

Working with text columns¶

String methods in pandas¶

In [1]:
import pandas as pd
ledger_df = pd.read_excel('data/Q1Sales.xlsx') # read the first sheet
In [2]:
ledger_df['Channel'].str.upper()
Out[2]:
0            SHOPPE.COM
1               WALCART
2              BULLSEYE
3              BULLSEYE
4              BULLSEYE
              ...      
14049          BULLSEYE
14050          BULLSEYE
14051    UNDERSTOCK.COM
14052          IBAY.COM
14053    UNDERSTOCK.COM
Name: Channel, Length: 14054, dtype: object
In [4]:
ledger_df['Product Name']
Out[4]:
0                      Cannon Water Bomb Balloons 100 Pack
1        LEGO Ninja Turtles Stealth Shell in Pursuit 79102
2                                                      NaN
3        Transformers Age of Extinction Generations Del...
4        Transformers Age of Extinction Generations Del...
                               ...                        
14049    AC Adapter/Power Supply&Cord for Lenovo 3000 G...
14050         Cisco Systems Gigabit VPN Router (RV320K9NA)
14051    Philips AJ3116M/37 Digital Tuning Clock Radio ...
14052                                                  NaN
14053    Sirius Satellite Radio XADH2 Home Access Kit f...
Name: Product Name, Length: 14054, dtype: object
In [5]:
ledger_df['Product Name'].str.upper()
Out[5]:
0                      CANNON WATER BOMB BALLOONS 100 PACK
1        LEGO NINJA TURTLES STEALTH SHELL IN PURSUIT 79102
2                                                      NaN
3        TRANSFORMERS AGE OF EXTINCTION GENERATIONS DEL...
4        TRANSFORMERS AGE OF EXTINCTION GENERATIONS DEL...
                               ...                        
14049    AC ADAPTER/POWER SUPPLY&CORD FOR LENOVO 3000 G...
14050         CISCO SYSTEMS GIGABIT VPN ROUTER (RV320K9NA)
14051    PHILIPS AJ3116M/37 DIGITAL TUNING CLOCK RADIO ...
14052                                                  NaN
14053    SIRIUS SATELLITE RADIO XADH2 HOME ACCESS KIT F...
Name: Product Name, Length: 14054, dtype: object
In [6]:
ledger_df['Product Name'].str.contains('LEGO')
Out[6]:
0        False
1         True
2          NaN
3        False
4        False
         ...  
14049    False
14050    False
14051    False
14052      NaN
14053    False
Name: Product Name, Length: 14054, dtype: object
In [8]:
ledger_df[ledger_df['Product Name'].str.contains('LEGO').fillna(False)]
Out[8]:
InvoiceNo Channel Product Name ProductID Account AccountNo Date Deadline Currency Unit Price Quantity Total
1 1533 Walcart LEGO Ninja Turtles Stealth Shell in Pursuit 79102 T&G/LEG-37777 Sales 5004 2020-01-01 06/15/20 USD 6.70 1 6.70
43 1575 iBay.com LEGO Star Wars Clone Troopers vs Droidekas 75000 T&G/LEG-16040 Sales 5004 2020-01-01 3-22-20 USD 14.68 2 29.36
105 1637 Bullseye LEGO LOTR 79006 The Council of Elrond T&G/LEG-76682 Sales 5004 2020-01-01 2-19-20 USD 7.67 6 46.02
176 1708 Shoppe.com LEGO City Fire Chief Car 60001 T&G/LEG-89613 Sales 5004 2020-01-01 3-28-20 USD 24.95 1 24.95
228 1608 iBay.com LEGO Star Wars Clone Troopers vs Droidekas 75000 T&G/LEG-16040 Sales 5004 2020-01-01 3-22-20 USD 14.68 2 29.36
... ... ... ... ... ... ... ... ... ... ... ... ...
13525 15007 Understock.com LEGO City Trains High-speed Passenger Train 60... T&G/LEG-51950 Sales 5004 2020-01-30 4-24-20 USD 7.41 18 133.38
13550 15082 Walcart LEGO City Fire Chief Car 60001 T&G/LEG-89613 Sales 5004 2020-01-31 3-22-20 USD 25.11 8 200.88
13731 15131 Walcart LEGO City Fire Chief Car 60001 T&G/LEG-89613 Sales 5004 2020-01-31 3-22-20 USD 25.11 8 200.88
13753 15285 Understock.com LEGO Star Wars Clone Troopers vs Droidekas 75000 T&G/LEG-16040 Sales 5004 2020-01-31 7-28-20 USD 14.84 2 29.68
14031 15468 Understock.com LEGO Star Wars Clone Troopers vs Droidekas 75000 T&G/LEG-16040 Sales 5004 2020-01-31 7-28-20 USD 14.84 2 29.68

200 rows × 12 columns

Replacing parts of text¶

In [9]:
ledger_df['Channel'].str.replace('.com', '')
/tmp/ipykernel_34612/3620803427.py:1: FutureWarning: The default value of regex will change from True to False in a future version.
  ledger_df['Channel'].str.replace('.com', '')
Out[9]:
0            Shoppe
1           Walcart
2          Bullseye
3          Bullseye
4          Bullseye
            ...    
14049      Bullseye
14050      Bullseye
14051    Understock
14052          iBay
14053    Understock
Name: Channel, Length: 14054, dtype: object
In [10]:
ledger_df['Channel'].replace('iBay.com', 'Anazon.com')
Out[10]:
0            Shoppe.com
1               Walcart
2              Bullseye
3              Bullseye
4              Bullseye
              ...      
14049          Bullseye
14050          Bullseye
14051    Understock.com
14052        Anazon.com
14053    Understock.com
Name: Channel, Length: 14054, dtype: object

Splitting text values into multiple columns¶

In [11]:
ledger_df[['Product Name', 'ProductID', 'Unit Price', 'Quantity', 'Total']]
Out[11]:
Product Name ProductID Unit Price Quantity Total
0 Cannon Water Bomb Balloons 100 Pack T&G/CAN-97509 20.11 14 281.54
1 LEGO Ninja Turtles Stealth Shell in Pursuit 79102 T&G/LEG-37777 6.70 1 6.70
2 NaN T&G/PET-14209 11.67 5 58.35
3 Transformers Age of Extinction Generations Del... T&G/TRA-20170 13.46 6 80.76
4 Transformers Age of Extinction Generations Del... T&G/TRA-20170 13.46 6 80.76
... ... ... ... ... ...
14049 AC Adapter/Power Supply&Cord for Lenovo 3000 G... E/AC-63975 28.72 8 229.76
14050 Cisco Systems Gigabit VPN Router (RV320K9NA) E/CIS-74992 33.39 1 33.39
14051 Philips AJ3116M/37 Digital Tuning Clock Radio ... E/PHI-08100 4.18 1 4.18
14052 NaN E/POL-61164 4.78 25 119.50
14053 Sirius Satellite Radio XADH2 Home Access Kit f... E/SIR-83381 33.16 2 66.32

14054 rows × 5 columns

In [12]:
ledger_df['ProductID'].str.split('/')
Out[12]:
0        [T&G, CAN-97509]
1        [T&G, LEG-37777]
2        [T&G, PET-14209]
3        [T&G, TRA-20170]
4        [T&G, TRA-20170]
               ...       
14049       [E, AC-63975]
14050      [E, CIS-74992]
14051      [E, PHI-08100]
14052      [E, POL-61164]
14053      [E, SIR-83381]
Name: ProductID, Length: 14054, dtype: object
In [13]:
ledger_df['ProductID'].str.split('/', expand=True)
Out[13]:
0 1
0 T&G CAN-97509
1 T&G LEG-37777
2 T&G PET-14209
3 T&G TRA-20170
4 T&G TRA-20170
... ... ...
14049 E AC-63975
14050 E CIS-74992
14051 E PHI-08100
14052 E POL-61164
14053 E SIR-83381

14054 rows × 2 columns

In [14]:
ledger_df[['CategoryID', 'ItemID']] = ledger_df['ProductID'].str.split('/', expand=True)

ledger_df
Out[14]:
InvoiceNo Channel Product Name ProductID Account AccountNo Date Deadline Currency Unit Price Quantity Total CategoryID ItemID
0 1532 Shoppe.com Cannon Water Bomb Balloons 100 Pack T&G/CAN-97509 Sales 5004 2020-01-01 11/23/19 USD 20.11 14 281.54 T&G CAN-97509
1 1533 Walcart LEGO Ninja Turtles Stealth Shell in Pursuit 79102 T&G/LEG-37777 Sales 5004 2020-01-01 06/15/20 USD 6.70 1 6.70 T&G LEG-37777
2 1534 Bullseye NaN T&G/PET-14209 Sales 5004 2020-01-01 05/07/20 USD 11.67 5 58.35 T&G PET-14209
3 1535 Bullseye Transformers Age of Extinction Generations Del... T&G/TRA-20170 Sales 5004 2020-01-01 12/22/19 USD 13.46 6 80.76 T&G TRA-20170
4 1535 Bullseye Transformers Age of Extinction Generations Del... T&G/TRA-20170 Sales 5004 2020-01-01 12/22/19 USD 13.46 6 80.76 T&G TRA-20170
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14049 15581 Bullseye AC Adapter/Power Supply&Cord for Lenovo 3000 G... E/AC-63975 Sales 5004 2020-01-31 February 23 2020 USD 28.72 8 229.76 E AC-63975
14050 15582 Bullseye Cisco Systems Gigabit VPN Router (RV320K9NA) E/CIS-74992 Sales 5004 2020-01-31 January 21 2020 USD 33.39 1 33.39 E CIS-74992
14051 15583 Understock.com Philips AJ3116M/37 Digital Tuning Clock Radio ... E/PHI-08100 Sales 5004 2020-01-31 March 22 2020 USD 4.18 1 4.18 E PHI-08100
14052 15584 iBay.com NaN E/POL-61164 Sales 5004 2020-01-31 June 25 2020 USD 4.78 25 119.50 E POL-61164
14053 15585 Understock.com Sirius Satellite Radio XADH2 Home Access Kit f... E/SIR-83381 Sales 5004 2020-01-31 February 01 2020 USD 33.16 2 66.32 E SIR-83381

14054 rows × 14 columns

Concatenating text columns¶

In [15]:
'Category ID is: ' + ledger_df['CategoryID']
Out[15]:
0        Category ID is: T&G
1        Category ID is: T&G
2        Category ID is: T&G
3        Category ID is: T&G
4        Category ID is: T&G
                ...         
14049      Category ID is: E
14050      Category ID is: E
14051      Category ID is: E
14052      Category ID is: E
14053      Category ID is: E
Name: CategoryID, Length: 14054, dtype: object
In [16]:
ledger_df['CategoryID'] + '/' + ledger_df['ItemID']
Out[16]:
0        T&G/CAN-97509
1        T&G/LEG-37777
2        T&G/PET-14209
3        T&G/TRA-20170
4        T&G/TRA-20170
             ...      
14049       E/AC-63975
14050      E/CIS-74992
14051      E/PHI-08100
14052      E/POL-61164
14053      E/SIR-83381
Length: 14054, dtype: object
In [21]:
ledger_df['InvoiceNo'].astype('string') + '/' + ledger_df['AccountNo'].astype('string')
Out[21]:
0         1532/5004
1         1533/5004
2         1534/5004
3         1535/5004
4         1535/5004
            ...    
14049    15581/5004
14050    15582/5004
14051    15583/5004
14052    15584/5004
14053    15585/5004
Length: 14054, dtype: string

String data types in pandas¶

In [24]:
ledger_df = pd.read_excel('data/Q1Sales.xlsx').convert_dtypes()

ledger_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14054 entries, 0 to 14053
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   InvoiceNo     14054 non-null  Int64         
 1   Channel       14054 non-null  string        
 2   Product Name  12362 non-null  string        
 3   ProductID     14054 non-null  string        
 4   Account       14054 non-null  string        
 5   AccountNo     14054 non-null  Int64         
 6   Date          14054 non-null  datetime64[ns]
 7   Deadline      14054 non-null  string        
 8   Currency      14054 non-null  string        
 9   Unit Price    14054 non-null  Float64       
 10  Quantity      14054 non-null  Int64         
 11  Total         14054 non-null  Float64       
dtypes: Float64(2), Int64(3), datetime64[ns](1), string(6)
memory usage: 1.4 MB
In [25]:
ledger_df['Product Name'].astype('string')
Out[25]:
0                      Cannon Water Bomb Balloons 100 Pack
1        LEGO Ninja Turtles Stealth Shell in Pursuit 79102
2                                                     <NA>
3        Transformers Age of Extinction Generations Del...
4        Transformers Age of Extinction Generations Del...
                               ...                        
14049    AC Adapter/Power Supply&Cord for Lenovo 3000 G...
14050         Cisco Systems Gigabit VPN Router (RV320K9NA)
14051    Philips AJ3116M/37 Digital Tuning Clock Radio ...
14052                                                 <NA>
14053    Sirius Satellite Radio XADH2 Home Access Kit f...
Name: Product Name, Length: 14054, dtype: string
In [26]:
pd.Series([1011, '1320', "980", 645, 340])
Out[26]:
0    1011
1    1320
2     980
3     645
4     340
dtype: object
In [27]:
pd.Series([1011, '1320', "980", 645, 340]).str.strip('')
Out[27]:
0     NaN
1    1320
2     980
3     NaN
4     NaN
dtype: object
In [28]:
pd.Series([1011, '1320', "980", 645, 340]).astype('string').str.strip('')
Out[28]:
0    1011
1    1320
2     980
3     645
4     340
dtype: string

Overthinking: Regular expressions¶

In [29]:
pattern = '(Nikon).*(Camera)'
In [30]:
is_camera = ledger_df['Product Name'].str.contains('camera', case=False).fillna(False)

cameras_df = ledger_df[is_camera]
cameras_df = cameras_df[['ProductID', 'Product Name', 'Total']]

cameras_df
Out[30]:
ProductID Product Name Total
66 C&P/KID-94587 Kidz Digital Camera 275.64
117 C&P/KOD-01305 Kodak ZM1-NM 1 MP 1-Inch LCD CMOS Sensor Digit... 64.1
154 C&P/FOS-95687 Foscam FI8910W White Wireless IP Cameras 2-pack 669.2
265 C&P/Q-S-31839 Q-See QSC414D Outdoor Dome Color CCD Camera wi... 7.82
287 C&P/KOD-01305 Kodak ZM1-NM 1 MP 1-Inch LCD CMOS Sensor Digit... 64.1
... ... ... ...
13916 C&P/NEE-31972 NEEWER® 10x25 Zoom LCD Binoculars Built-in Dig... 37.68
13950 T&G/LIG-86589 Lights,Camera,Action Decor 23.02
13954 C&P/DAH-04621 Dahua IPC-HFW2100 S 1.3MP Weatherproof HD IP S... 126.0
13986 C&P/NIK-92147 Nikon D3100 14.2MP Digital SLR Camera with 18-... 54.1
13996 C&P/SEC-57209 Securityman Wi-Fi Interference Free Wireless O... 17.44

443 rows × 3 columns

In [31]:
pattern = '(Nikon|Canon|Kodak)'
cameras_df = cameras_df[cameras_df['Product Name'].str.contains(pattern)]

cameras_df
/tmp/ipykernel_34612/751983995.py:2: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
  cameras_df = cameras_df[cameras_df['Product Name'].str.contains(pattern)]
Out[31]:
ProductID Product Name Total
117 C&P/KOD-01305 Kodak ZM1-NM 1 MP 1-Inch LCD CMOS Sensor Digit... 64.1
287 C&P/KOD-01305 Kodak ZM1-NM 1 MP 1-Inch LCD CMOS Sensor Digit... 64.1
616 C&P/KOD-32137 Kodak EasyShare Z990 12 MP Digital Camera with... 166.3
2151 C&P/CAN-50721 Canon PowerShot SX50 HS 12MP Digital Camera wi... 6.45
2280 C&P/CAN-50721 Canon PowerShot SX50 HS 12MP Digital Camera wi... 6.45
... ... ... ...
12459 C&P/CAN-12514 Canon EOS Rebel T2i DSLR Camera (Body Only) 35.4
12522 C&P/CAN-12514 Canon EOS Rebel T2i DSLR Camera (Body Only) 35.4
12905 C&P/KOD-32137 Kodak EasyShare Z990 12 MP Digital Camera with... 332.6
13594 C&P/NIK-92147 Nikon D3100 14.2MP Digital SLR Camera with 18-... 54.1
13986 C&P/NIK-92147 Nikon D3100 14.2MP Digital SLR Camera with 18-... 54.1

77 rows × 3 columns

In [32]:
cameras_df[
    (cameras_df['Product Name'].str.contains('Nikon'))
    | (cameras_df['Product Name'].str.contains('Canon'))
    | (cameras_df['Product Name'].str.contains('Kodak'))
]
Out[32]:
ProductID Product Name Total
117 C&P/KOD-01305 Kodak ZM1-NM 1 MP 1-Inch LCD CMOS Sensor Digit... 64.1
287 C&P/KOD-01305 Kodak ZM1-NM 1 MP 1-Inch LCD CMOS Sensor Digit... 64.1
616 C&P/KOD-32137 Kodak EasyShare Z990 12 MP Digital Camera with... 166.3
2151 C&P/CAN-50721 Canon PowerShot SX50 HS 12MP Digital Camera wi... 6.45
2280 C&P/CAN-50721 Canon PowerShot SX50 HS 12MP Digital Camera wi... 6.45
... ... ... ...
12459 C&P/CAN-12514 Canon EOS Rebel T2i DSLR Camera (Body Only) 35.4
12522 C&P/CAN-12514 Canon EOS Rebel T2i DSLR Camera (Body Only) 35.4
12905 C&P/KOD-32137 Kodak EasyShare Z990 12 MP Digital Camera with... 332.6
13594 C&P/NIK-92147 Nikon D3100 14.2MP Digital SLR Camera with 18-... 54.1
13986 C&P/NIK-92147 Nikon D3100 14.2MP Digital SLR Camera with 18-... 54.1

77 rows × 3 columns

In [33]:
pattern = '(Nikon|Canon|Kodak)\s+(\w*)'

cameras_df['Product Name'].str.extract(pattern)
Out[33]:
0 1
117 Kodak ZM1
287 Kodak ZM1
616 Kodak EasyShare
2151 Canon PowerShot
2280 Canon PowerShot
... ... ...
12459 Canon EOS
12522 Canon EOS
12905 Kodak EasyShare
13594 Nikon D3100
13986 Nikon D3100

77 rows × 2 columns

Applying custom functions¶

Applying functions to columns¶

In [34]:
def process_channel(channel):
    return 'Name: ' + channel.upper()
In [35]:
process_channel('Shoppe.com')
Out[35]:
'Name: SHOPPE.COM'
In [36]:
process_channel('Bullseye')
Out[36]:
'Name: BULLSEYE'
In [37]:
for channel in ledger_df['Channel']:
    process_channel(channel)
In [38]:
ledger_df['Channel'].apply(process_channel)
Out[38]:
0            Name: SHOPPE.COM
1               Name: WALCART
2              Name: BULLSEYE
3              Name: BULLSEYE
4              Name: BULLSEYE
                 ...         
14049          Name: BULLSEYE
14050          Name: BULLSEYE
14051    Name: UNDERSTOCK.COM
14052          Name: IBAY.COM
14053    Name: UNDERSTOCK.COM
Name: Channel, Length: 14054, dtype: object
In [41]:
def process_product(product):
    if pd.isna(product):
        return 'EMPTY PRODUCT NAME'
    else:
        return 'Product: ' + product.upper()
        
In [42]:
ledger_df['Product Name'].apply(process_product)
Out[42]:
0             Product: CANNON WATER BOMB BALLOONS 100 PACK
1        Product: LEGO NINJA TURTLES STEALTH SHELL IN P...
2                                       EMPTY PRODUCT NAME
3        Product: TRANSFORMERS AGE OF EXTINCTION GENERA...
4        Product: TRANSFORMERS AGE OF EXTINCTION GENERA...
                               ...                        
14049    Product: AC ADAPTER/POWER SUPPLY&CORD FOR LENO...
14050    Product: CISCO SYSTEMS GIGABIT VPN ROUTER (RV3...
14051    Product: PHILIPS AJ3116M/37 DIGITAL TUNING CLO...
14052                                   EMPTY PRODUCT NAME
14053    Product: SIRIUS SATELLITE RADIO XADH2 HOME ACC...
Name: Product Name, Length: 14054, dtype: object

Overthinking: Functions without a name¶

In [43]:
ledger_df['Channel'].apply(lambda channel: 'Name: ' + channel.upper())
Out[43]:
0            Name: SHOPPE.COM
1               Name: WALCART
2              Name: BULLSEYE
3              Name: BULLSEYE
4              Name: BULLSEYE
                 ...         
14049          Name: BULLSEYE
14050          Name: BULLSEYE
14051    Name: UNDERSTOCK.COM
14052          Name: IBAY.COM
14053    Name: UNDERSTOCK.COM
Name: Channel, Length: 14054, dtype: object
In [44]:
# regular function definition
def process_channel(channel):
    return 'Name: ' + channel.upper()

# anonymous function definition
lambda channel: 'Name: ' + channel.upper()
Out[44]:
<function __main__.<lambda>(channel)>

Applying functions to rows¶

In [45]:
first_row = ledger_df.iloc[0]

first_row
Out[45]:
InvoiceNo                                      1532
Channel                                  Shoppe.com
Product Name    Cannon Water Bomb Balloons 100 Pack
ProductID                             T&G/CAN-97509
Account                                       Sales
AccountNo                                      5004
Date                            2020-01-01 00:00:00
Deadline                                   11/23/19
Currency                                        USD
Unit Price                                    20.11
Quantity                                         14
Total                                        281.54
Name: 0, dtype: object
In [46]:
first_row['Total']
Out[46]:
281.54
In [47]:
first_row['Channel']
Out[47]:
'Shoppe.com'
In [48]:
def calculate_tax(row):
    if row['Channel'] == 'Shoppe.com':
        return row['Total'] * (16 / 100)
    elif row['Channel'] == 'iBay.com':
        return row['Total'] * (11 / 100)
    elif row['Channel'] == 'Understock.com':
        return row['Total'] * (9 / 100)
    else:
        return 0
In [49]:
calculate_tax(first_row)
Out[49]:
45.046400000000006
In [50]:
calculate_tax(ledger_df.iloc[10])
Out[50]:
0
In [51]:
calculate_tax(ledger_df.iloc[100])
Out[51]:
3.355
In [52]:
ledger_df.apply(calculate_tax, axis='columns')
Out[52]:
0        45.0464
1         0.0000
2         0.0000
3         0.0000
4         0.0000
          ...   
14049     0.0000
14050     0.0000
14051     0.3762
14052    13.1450
14053     5.9688
Length: 14054, dtype: float64

Overthinking: Other function parameters¶

In [53]:
def calculate_tax(row, levels={}):
    channel = row['Channel']
    total = row['Total']
    
    tax = 0
    if channel in levels:
      tax = levels[channel] 

    return total * tax
In [54]:
ledger_df.apply(calculate_tax, axis='columns')
Out[54]:
0        0.0
1        0.0
2        0.0
3        0.0
4        0.0
        ... 
14049    0.0
14050    0.0
14051    0.0
14052    0.0
14053    0.0
Length: 14054, dtype: float64
In [55]:
ledger_df.apply(
    calculate_tax, 
    levels={
        'Shoppe.com': (16 / 100),
        'iBay.com': (11 / 100),
        'Understock.com': (9 / 100),
    }, 
    axis='columns'
)
Out[55]:
0        45.0464
1         0.0000
2         0.0000
3         0.0000
4         0.0000
          ...   
14049     0.0000
14050     0.0000
14051     0.3762
14052    13.1450
14053     5.9688
Length: 14054, dtype: float64
In [56]:
ledger_df.apply(
    calculate_tax, 
    levels={
        'Shoppe.com': (16 / 100),
        'iBay.com': (11 / 100),
        'Understock.com': (9 / 100),
        'Bullseye': (6 / 100),
        'Walcart': (4 / 100),
    }, 
    axis='columns'
)
Out[56]:
0        45.0464
1         0.2680
2         3.5010
3         4.8456
4         4.8456
          ...   
14049    13.7856
14050     2.0034
14051     0.3762
14052    13.1450
14053     5.9688
Length: 14054, dtype: float64