Feng Li
School of Statistics and Mathematics
Central University of Finance and Economics
import pandas as pd
ledger_df = pd.read_excel('data/Q1Sales.xlsx') # read the first sheet
ledger_df['Channel'].str.upper()
0 SHOPPE.COM 1 WALCART 2 BULLSEYE 3 BULLSEYE 4 BULLSEYE ... 14049 BULLSEYE 14050 BULLSEYE 14051 UNDERSTOCK.COM 14052 IBAY.COM 14053 UNDERSTOCK.COM Name: Channel, Length: 14054, dtype: object
ledger_df['Product Name']
0 Cannon Water Bomb Balloons 100 Pack 1 LEGO Ninja Turtles Stealth Shell in Pursuit 79102 2 NaN 3 Transformers Age of Extinction Generations Del... 4 Transformers Age of Extinction Generations Del... ... 14049 AC Adapter/Power Supply&Cord for Lenovo 3000 G... 14050 Cisco Systems Gigabit VPN Router (RV320K9NA) 14051 Philips AJ3116M/37 Digital Tuning Clock Radio ... 14052 NaN 14053 Sirius Satellite Radio XADH2 Home Access Kit f... Name: Product Name, Length: 14054, dtype: object
ledger_df['Product Name'].str.upper()
0 CANNON WATER BOMB BALLOONS 100 PACK 1 LEGO NINJA TURTLES STEALTH SHELL IN PURSUIT 79102 2 NaN 3 TRANSFORMERS AGE OF EXTINCTION GENERATIONS DEL... 4 TRANSFORMERS AGE OF EXTINCTION GENERATIONS DEL... ... 14049 AC ADAPTER/POWER SUPPLY&CORD FOR LENOVO 3000 G... 14050 CISCO SYSTEMS GIGABIT VPN ROUTER (RV320K9NA) 14051 PHILIPS AJ3116M/37 DIGITAL TUNING CLOCK RADIO ... 14052 NaN 14053 SIRIUS SATELLITE RADIO XADH2 HOME ACCESS KIT F... Name: Product Name, Length: 14054, dtype: object
ledger_df['Product Name'].str.contains('LEGO')
0 False 1 True 2 NaN 3 False 4 False ... 14049 False 14050 False 14051 False 14052 NaN 14053 False Name: Product Name, Length: 14054, dtype: object
ledger_df[ledger_df['Product Name'].str.contains('LEGO').fillna(False)]
InvoiceNo | Channel | Product Name | ProductID | Account | AccountNo | Date | Deadline | Currency | Unit Price | Quantity | Total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1533 | Walcart | LEGO Ninja Turtles Stealth Shell in Pursuit 79102 | T&G/LEG-37777 | Sales | 5004 | 2020-01-01 | 06/15/20 | USD | 6.70 | 1 | 6.70 |
43 | 1575 | iBay.com | LEGO Star Wars Clone Troopers vs Droidekas 75000 | T&G/LEG-16040 | Sales | 5004 | 2020-01-01 | 3-22-20 | USD | 14.68 | 2 | 29.36 |
105 | 1637 | Bullseye | LEGO LOTR 79006 The Council of Elrond | T&G/LEG-76682 | Sales | 5004 | 2020-01-01 | 2-19-20 | USD | 7.67 | 6 | 46.02 |
176 | 1708 | Shoppe.com | LEGO City Fire Chief Car 60001 | T&G/LEG-89613 | Sales | 5004 | 2020-01-01 | 3-28-20 | USD | 24.95 | 1 | 24.95 |
228 | 1608 | iBay.com | LEGO Star Wars Clone Troopers vs Droidekas 75000 | T&G/LEG-16040 | Sales | 5004 | 2020-01-01 | 3-22-20 | USD | 14.68 | 2 | 29.36 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
13525 | 15007 | Understock.com | LEGO City Trains High-speed Passenger Train 60... | T&G/LEG-51950 | Sales | 5004 | 2020-01-30 | 4-24-20 | USD | 7.41 | 18 | 133.38 |
13550 | 15082 | Walcart | LEGO City Fire Chief Car 60001 | T&G/LEG-89613 | Sales | 5004 | 2020-01-31 | 3-22-20 | USD | 25.11 | 8 | 200.88 |
13731 | 15131 | Walcart | LEGO City Fire Chief Car 60001 | T&G/LEG-89613 | Sales | 5004 | 2020-01-31 | 3-22-20 | USD | 25.11 | 8 | 200.88 |
13753 | 15285 | Understock.com | LEGO Star Wars Clone Troopers vs Droidekas 75000 | T&G/LEG-16040 | Sales | 5004 | 2020-01-31 | 7-28-20 | USD | 14.84 | 2 | 29.68 |
14031 | 15468 | Understock.com | LEGO Star Wars Clone Troopers vs Droidekas 75000 | T&G/LEG-16040 | Sales | 5004 | 2020-01-31 | 7-28-20 | USD | 14.84 | 2 | 29.68 |
200 rows × 12 columns
ledger_df['Channel'].str.replace('.com', '')
/tmp/ipykernel_34612/3620803427.py:1: FutureWarning: The default value of regex will change from True to False in a future version. ledger_df['Channel'].str.replace('.com', '')
0 Shoppe 1 Walcart 2 Bullseye 3 Bullseye 4 Bullseye ... 14049 Bullseye 14050 Bullseye 14051 Understock 14052 iBay 14053 Understock Name: Channel, Length: 14054, dtype: object
ledger_df['Channel'].replace('iBay.com', 'Anazon.com')
0 Shoppe.com 1 Walcart 2 Bullseye 3 Bullseye 4 Bullseye ... 14049 Bullseye 14050 Bullseye 14051 Understock.com 14052 Anazon.com 14053 Understock.com Name: Channel, Length: 14054, dtype: object
ledger_df[['Product Name', 'ProductID', 'Unit Price', 'Quantity', 'Total']]
Product Name | ProductID | Unit Price | Quantity | Total | |
---|---|---|---|---|---|
0 | Cannon Water Bomb Balloons 100 Pack | T&G/CAN-97509 | 20.11 | 14 | 281.54 |
1 | LEGO Ninja Turtles Stealth Shell in Pursuit 79102 | T&G/LEG-37777 | 6.70 | 1 | 6.70 |
2 | NaN | T&G/PET-14209 | 11.67 | 5 | 58.35 |
3 | Transformers Age of Extinction Generations Del... | T&G/TRA-20170 | 13.46 | 6 | 80.76 |
4 | Transformers Age of Extinction Generations Del... | T&G/TRA-20170 | 13.46 | 6 | 80.76 |
... | ... | ... | ... | ... | ... |
14049 | AC Adapter/Power Supply&Cord for Lenovo 3000 G... | E/AC-63975 | 28.72 | 8 | 229.76 |
14050 | Cisco Systems Gigabit VPN Router (RV320K9NA) | E/CIS-74992 | 33.39 | 1 | 33.39 |
14051 | Philips AJ3116M/37 Digital Tuning Clock Radio ... | E/PHI-08100 | 4.18 | 1 | 4.18 |
14052 | NaN | E/POL-61164 | 4.78 | 25 | 119.50 |
14053 | Sirius Satellite Radio XADH2 Home Access Kit f... | E/SIR-83381 | 33.16 | 2 | 66.32 |
14054 rows × 5 columns
ledger_df['ProductID'].str.split('/')
0 [T&G, CAN-97509] 1 [T&G, LEG-37777] 2 [T&G, PET-14209] 3 [T&G, TRA-20170] 4 [T&G, TRA-20170] ... 14049 [E, AC-63975] 14050 [E, CIS-74992] 14051 [E, PHI-08100] 14052 [E, POL-61164] 14053 [E, SIR-83381] Name: ProductID, Length: 14054, dtype: object
ledger_df['ProductID'].str.split('/', expand=True)
0 | 1 | |
---|---|---|
0 | T&G | CAN-97509 |
1 | T&G | LEG-37777 |
2 | T&G | PET-14209 |
3 | T&G | TRA-20170 |
4 | T&G | TRA-20170 |
... | ... | ... |
14049 | E | AC-63975 |
14050 | E | CIS-74992 |
14051 | E | PHI-08100 |
14052 | E | POL-61164 |
14053 | E | SIR-83381 |
14054 rows × 2 columns
ledger_df[['CategoryID', 'ItemID']] = ledger_df['ProductID'].str.split('/', expand=True)
ledger_df
InvoiceNo | Channel | Product Name | ProductID | Account | AccountNo | Date | Deadline | Currency | Unit Price | Quantity | Total | CategoryID | ItemID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1532 | Shoppe.com | Cannon Water Bomb Balloons 100 Pack | T&G/CAN-97509 | Sales | 5004 | 2020-01-01 | 11/23/19 | USD | 20.11 | 14 | 281.54 | T&G | CAN-97509 |
1 | 1533 | Walcart | LEGO Ninja Turtles Stealth Shell in Pursuit 79102 | T&G/LEG-37777 | Sales | 5004 | 2020-01-01 | 06/15/20 | USD | 6.70 | 1 | 6.70 | T&G | LEG-37777 |
2 | 1534 | Bullseye | NaN | T&G/PET-14209 | Sales | 5004 | 2020-01-01 | 05/07/20 | USD | 11.67 | 5 | 58.35 | T&G | PET-14209 |
3 | 1535 | Bullseye | Transformers Age of Extinction Generations Del... | T&G/TRA-20170 | Sales | 5004 | 2020-01-01 | 12/22/19 | USD | 13.46 | 6 | 80.76 | T&G | TRA-20170 |
4 | 1535 | Bullseye | Transformers Age of Extinction Generations Del... | T&G/TRA-20170 | Sales | 5004 | 2020-01-01 | 12/22/19 | USD | 13.46 | 6 | 80.76 | T&G | TRA-20170 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14049 | 15581 | Bullseye | AC Adapter/Power Supply&Cord for Lenovo 3000 G... | E/AC-63975 | Sales | 5004 | 2020-01-31 | February 23 2020 | USD | 28.72 | 8 | 229.76 | E | AC-63975 |
14050 | 15582 | Bullseye | Cisco Systems Gigabit VPN Router (RV320K9NA) | E/CIS-74992 | Sales | 5004 | 2020-01-31 | January 21 2020 | USD | 33.39 | 1 | 33.39 | E | CIS-74992 |
14051 | 15583 | Understock.com | Philips AJ3116M/37 Digital Tuning Clock Radio ... | E/PHI-08100 | Sales | 5004 | 2020-01-31 | March 22 2020 | USD | 4.18 | 1 | 4.18 | E | PHI-08100 |
14052 | 15584 | iBay.com | NaN | E/POL-61164 | Sales | 5004 | 2020-01-31 | June 25 2020 | USD | 4.78 | 25 | 119.50 | E | POL-61164 |
14053 | 15585 | Understock.com | Sirius Satellite Radio XADH2 Home Access Kit f... | E/SIR-83381 | Sales | 5004 | 2020-01-31 | February 01 2020 | USD | 33.16 | 2 | 66.32 | E | SIR-83381 |
14054 rows × 14 columns
'Category ID is: ' + ledger_df['CategoryID']
0 Category ID is: T&G 1 Category ID is: T&G 2 Category ID is: T&G 3 Category ID is: T&G 4 Category ID is: T&G ... 14049 Category ID is: E 14050 Category ID is: E 14051 Category ID is: E 14052 Category ID is: E 14053 Category ID is: E Name: CategoryID, Length: 14054, dtype: object
ledger_df['CategoryID'] + '/' + ledger_df['ItemID']
0 T&G/CAN-97509 1 T&G/LEG-37777 2 T&G/PET-14209 3 T&G/TRA-20170 4 T&G/TRA-20170 ... 14049 E/AC-63975 14050 E/CIS-74992 14051 E/PHI-08100 14052 E/POL-61164 14053 E/SIR-83381 Length: 14054, dtype: object
ledger_df['InvoiceNo'].astype('string') + '/' + ledger_df['AccountNo'].astype('string')
0 1532/5004 1 1533/5004 2 1534/5004 3 1535/5004 4 1535/5004 ... 14049 15581/5004 14050 15582/5004 14051 15583/5004 14052 15584/5004 14053 15585/5004 Length: 14054, dtype: string
ledger_df = pd.read_excel('data/Q1Sales.xlsx').convert_dtypes()
ledger_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 14054 entries, 0 to 14053 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 InvoiceNo 14054 non-null Int64 1 Channel 14054 non-null string 2 Product Name 12362 non-null string 3 ProductID 14054 non-null string 4 Account 14054 non-null string 5 AccountNo 14054 non-null Int64 6 Date 14054 non-null datetime64[ns] 7 Deadline 14054 non-null string 8 Currency 14054 non-null string 9 Unit Price 14054 non-null Float64 10 Quantity 14054 non-null Int64 11 Total 14054 non-null Float64 dtypes: Float64(2), Int64(3), datetime64[ns](1), string(6) memory usage: 1.4 MB
ledger_df['Product Name'].astype('string')
0 Cannon Water Bomb Balloons 100 Pack 1 LEGO Ninja Turtles Stealth Shell in Pursuit 79102 2 <NA> 3 Transformers Age of Extinction Generations Del... 4 Transformers Age of Extinction Generations Del... ... 14049 AC Adapter/Power Supply&Cord for Lenovo 3000 G... 14050 Cisco Systems Gigabit VPN Router (RV320K9NA) 14051 Philips AJ3116M/37 Digital Tuning Clock Radio ... 14052 <NA> 14053 Sirius Satellite Radio XADH2 Home Access Kit f... Name: Product Name, Length: 14054, dtype: string
pd.Series([1011, '1320', "980", 645, 340])
0 1011 1 1320 2 980 3 645 4 340 dtype: object
pd.Series([1011, '1320', "980", 645, 340]).str.strip('')
0 NaN 1 1320 2 980 3 NaN 4 NaN dtype: object
pd.Series([1011, '1320', "980", 645, 340]).astype('string').str.strip('')
0 1011 1 1320 2 980 3 645 4 340 dtype: string
pattern = '(Nikon).*(Camera)'
is_camera = ledger_df['Product Name'].str.contains('camera', case=False).fillna(False)
cameras_df = ledger_df[is_camera]
cameras_df = cameras_df[['ProductID', 'Product Name', 'Total']]
cameras_df
ProductID | Product Name | Total | |
---|---|---|---|
66 | C&P/KID-94587 | Kidz Digital Camera | 275.64 |
117 | C&P/KOD-01305 | Kodak ZM1-NM 1 MP 1-Inch LCD CMOS Sensor Digit... | 64.1 |
154 | C&P/FOS-95687 | Foscam FI8910W White Wireless IP Cameras 2-pack | 669.2 |
265 | C&P/Q-S-31839 | Q-See QSC414D Outdoor Dome Color CCD Camera wi... | 7.82 |
287 | C&P/KOD-01305 | Kodak ZM1-NM 1 MP 1-Inch LCD CMOS Sensor Digit... | 64.1 |
... | ... | ... | ... |
13916 | C&P/NEE-31972 | NEEWER® 10x25 Zoom LCD Binoculars Built-in Dig... | 37.68 |
13950 | T&G/LIG-86589 | Lights,Camera,Action Decor | 23.02 |
13954 | C&P/DAH-04621 | Dahua IPC-HFW2100 S 1.3MP Weatherproof HD IP S... | 126.0 |
13986 | C&P/NIK-92147 | Nikon D3100 14.2MP Digital SLR Camera with 18-... | 54.1 |
13996 | C&P/SEC-57209 | Securityman Wi-Fi Interference Free Wireless O... | 17.44 |
443 rows × 3 columns
pattern = '(Nikon|Canon|Kodak)'
cameras_df = cameras_df[cameras_df['Product Name'].str.contains(pattern)]
cameras_df
/tmp/ipykernel_34612/751983995.py:2: UserWarning: This pattern has match groups. To actually get the groups, use str.extract. cameras_df = cameras_df[cameras_df['Product Name'].str.contains(pattern)]
ProductID | Product Name | Total | |
---|---|---|---|
117 | C&P/KOD-01305 | Kodak ZM1-NM 1 MP 1-Inch LCD CMOS Sensor Digit... | 64.1 |
287 | C&P/KOD-01305 | Kodak ZM1-NM 1 MP 1-Inch LCD CMOS Sensor Digit... | 64.1 |
616 | C&P/KOD-32137 | Kodak EasyShare Z990 12 MP Digital Camera with... | 166.3 |
2151 | C&P/CAN-50721 | Canon PowerShot SX50 HS 12MP Digital Camera wi... | 6.45 |
2280 | C&P/CAN-50721 | Canon PowerShot SX50 HS 12MP Digital Camera wi... | 6.45 |
... | ... | ... | ... |
12459 | C&P/CAN-12514 | Canon EOS Rebel T2i DSLR Camera (Body Only) | 35.4 |
12522 | C&P/CAN-12514 | Canon EOS Rebel T2i DSLR Camera (Body Only) | 35.4 |
12905 | C&P/KOD-32137 | Kodak EasyShare Z990 12 MP Digital Camera with... | 332.6 |
13594 | C&P/NIK-92147 | Nikon D3100 14.2MP Digital SLR Camera with 18-... | 54.1 |
13986 | C&P/NIK-92147 | Nikon D3100 14.2MP Digital SLR Camera with 18-... | 54.1 |
77 rows × 3 columns
cameras_df[
(cameras_df['Product Name'].str.contains('Nikon'))
| (cameras_df['Product Name'].str.contains('Canon'))
| (cameras_df['Product Name'].str.contains('Kodak'))
]
ProductID | Product Name | Total | |
---|---|---|---|
117 | C&P/KOD-01305 | Kodak ZM1-NM 1 MP 1-Inch LCD CMOS Sensor Digit... | 64.1 |
287 | C&P/KOD-01305 | Kodak ZM1-NM 1 MP 1-Inch LCD CMOS Sensor Digit... | 64.1 |
616 | C&P/KOD-32137 | Kodak EasyShare Z990 12 MP Digital Camera with... | 166.3 |
2151 | C&P/CAN-50721 | Canon PowerShot SX50 HS 12MP Digital Camera wi... | 6.45 |
2280 | C&P/CAN-50721 | Canon PowerShot SX50 HS 12MP Digital Camera wi... | 6.45 |
... | ... | ... | ... |
12459 | C&P/CAN-12514 | Canon EOS Rebel T2i DSLR Camera (Body Only) | 35.4 |
12522 | C&P/CAN-12514 | Canon EOS Rebel T2i DSLR Camera (Body Only) | 35.4 |
12905 | C&P/KOD-32137 | Kodak EasyShare Z990 12 MP Digital Camera with... | 332.6 |
13594 | C&P/NIK-92147 | Nikon D3100 14.2MP Digital SLR Camera with 18-... | 54.1 |
13986 | C&P/NIK-92147 | Nikon D3100 14.2MP Digital SLR Camera with 18-... | 54.1 |
77 rows × 3 columns
pattern = '(Nikon|Canon|Kodak)\s+(\w*)'
cameras_df['Product Name'].str.extract(pattern)
0 | 1 | |
---|---|---|
117 | Kodak | ZM1 |
287 | Kodak | ZM1 |
616 | Kodak | EasyShare |
2151 | Canon | PowerShot |
2280 | Canon | PowerShot |
... | ... | ... |
12459 | Canon | EOS |
12522 | Canon | EOS |
12905 | Kodak | EasyShare |
13594 | Nikon | D3100 |
13986 | Nikon | D3100 |
77 rows × 2 columns
def process_channel(channel):
return 'Name: ' + channel.upper()
process_channel('Shoppe.com')
'Name: SHOPPE.COM'
process_channel('Bullseye')
'Name: BULLSEYE'
for channel in ledger_df['Channel']:
process_channel(channel)
ledger_df['Channel'].apply(process_channel)
0 Name: SHOPPE.COM 1 Name: WALCART 2 Name: BULLSEYE 3 Name: BULLSEYE 4 Name: BULLSEYE ... 14049 Name: BULLSEYE 14050 Name: BULLSEYE 14051 Name: UNDERSTOCK.COM 14052 Name: IBAY.COM 14053 Name: UNDERSTOCK.COM Name: Channel, Length: 14054, dtype: object
def process_product(product):
if pd.isna(product):
return 'EMPTY PRODUCT NAME'
else:
return 'Product: ' + product.upper()
ledger_df['Product Name'].apply(process_product)
0 Product: CANNON WATER BOMB BALLOONS 100 PACK 1 Product: LEGO NINJA TURTLES STEALTH SHELL IN P... 2 EMPTY PRODUCT NAME 3 Product: TRANSFORMERS AGE OF EXTINCTION GENERA... 4 Product: TRANSFORMERS AGE OF EXTINCTION GENERA... ... 14049 Product: AC ADAPTER/POWER SUPPLY&CORD FOR LENO... 14050 Product: CISCO SYSTEMS GIGABIT VPN ROUTER (RV3... 14051 Product: PHILIPS AJ3116M/37 DIGITAL TUNING CLO... 14052 EMPTY PRODUCT NAME 14053 Product: SIRIUS SATELLITE RADIO XADH2 HOME ACC... Name: Product Name, Length: 14054, dtype: object
ledger_df['Channel'].apply(lambda channel: 'Name: ' + channel.upper())
0 Name: SHOPPE.COM 1 Name: WALCART 2 Name: BULLSEYE 3 Name: BULLSEYE 4 Name: BULLSEYE ... 14049 Name: BULLSEYE 14050 Name: BULLSEYE 14051 Name: UNDERSTOCK.COM 14052 Name: IBAY.COM 14053 Name: UNDERSTOCK.COM Name: Channel, Length: 14054, dtype: object
# regular function definition
def process_channel(channel):
return 'Name: ' + channel.upper()
# anonymous function definition
lambda channel: 'Name: ' + channel.upper()
<function __main__.<lambda>(channel)>
first_row = ledger_df.iloc[0]
first_row
InvoiceNo 1532 Channel Shoppe.com Product Name Cannon Water Bomb Balloons 100 Pack ProductID T&G/CAN-97509 Account Sales AccountNo 5004 Date 2020-01-01 00:00:00 Deadline 11/23/19 Currency USD Unit Price 20.11 Quantity 14 Total 281.54 Name: 0, dtype: object
first_row['Total']
281.54
first_row['Channel']
'Shoppe.com'
def calculate_tax(row):
if row['Channel'] == 'Shoppe.com':
return row['Total'] * (16 / 100)
elif row['Channel'] == 'iBay.com':
return row['Total'] * (11 / 100)
elif row['Channel'] == 'Understock.com':
return row['Total'] * (9 / 100)
else:
return 0
calculate_tax(first_row)
45.046400000000006
calculate_tax(ledger_df.iloc[10])
0
calculate_tax(ledger_df.iloc[100])
3.355
ledger_df.apply(calculate_tax, axis='columns')
0 45.0464 1 0.0000 2 0.0000 3 0.0000 4 0.0000 ... 14049 0.0000 14050 0.0000 14051 0.3762 14052 13.1450 14053 5.9688 Length: 14054, dtype: float64
def calculate_tax(row, levels={}):
channel = row['Channel']
total = row['Total']
tax = 0
if channel in levels:
tax = levels[channel]
return total * tax
ledger_df.apply(calculate_tax, axis='columns')
0 0.0 1 0.0 2 0.0 3 0.0 4 0.0 ... 14049 0.0 14050 0.0 14051 0.0 14052 0.0 14053 0.0 Length: 14054, dtype: float64
ledger_df.apply(
calculate_tax,
levels={
'Shoppe.com': (16 / 100),
'iBay.com': (11 / 100),
'Understock.com': (9 / 100),
},
axis='columns'
)
0 45.0464 1 0.0000 2 0.0000 3 0.0000 4 0.0000 ... 14049 0.0000 14050 0.0000 14051 0.3762 14052 13.1450 14053 5.9688 Length: 14054, dtype: float64
ledger_df.apply(
calculate_tax,
levels={
'Shoppe.com': (16 / 100),
'iBay.com': (11 / 100),
'Understock.com': (9 / 100),
'Bullseye': (6 / 100),
'Walcart': (4 / 100),
},
axis='columns'
)
0 45.0464 1 0.2680 2 3.5010 3 4.8456 4 4.8456 ... 14049 13.7856 14050 2.0034 14051 0.3762 14052 13.1450 14053 5.9688 Length: 14054, dtype: float64