3 minute read

Pandas data wrangling (1) - Groupby

Loading packages

import pandas as pd
import numpy as np
import os
path = os.getcwd()
print(path)
/Users/steve.han/git/stevehan310.github.io/notebooks

Loading data file (Titanic Data from Kaggle Competition)

df = pd.read_csv(os.path.join(path, 'data', 'titanic_train.csv'))
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Data Dictionary

Variable Definition Key
survival Survival 0 = No, 1 = Yes
pclass Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
sex Sex
Age Age in years
sibsp # of siblings / spouses aboard the Titanic
parch # of parents / children aboard the Titanic
ticket Ticket number
fare Passenger fare
cabin Cabin number
embarked Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Pandas Groupby

# Creating groupby object with embarked and Pclass
df_agg = df.groupby(['Embarked', 'Pclass'], as_index = False)
# Average Fare price per Port of Embarkation and Ticket Class
df_agg.agg(fare_avg = ('Fare', np.mean), # Get the average of Fare per group and add them into a column with name 'fare_avg'
           fare_min = ('Fare', np.min), # Get a minimum of Fare per group and add them into a column with name 'fare_min'
           fare_max = ('Fare', np.max), # Get a maximum of Fare per group and add them into a column with name 'fare_max'
           num_passenger = ('PassengerId', 'nunique') # Get a number of passenger per group and add them into a column with name 'num_passenger'
          )

# np.unique
# np.count_nonzero  
# np.sum – Sum of values
# np.mean – Mean of values
# np.median – Arithmetic median of values
# np.min – Minimum
# np.max – Maximum
# np.std – Standard deviation
# np.var – Variance
Embarked Pclass fare_avg fare_min fare_max num_passenger
0 C 1 104.718529 26.5500 512.3292 85
1 C 2 25.358335 12.0000 41.5792 17
2 C 3 11.214083 4.0125 22.3583 66
3 Q 1 90.000000 90.0000 90.0000 2
4 Q 2 12.350000 12.3500 12.3500 3
5 Q 3 11.183393 6.7500 29.1250 72
6 S 1 70.364862 0.0000 263.0000 127
7 S 2 20.327439 0.0000 73.5000 164
8 S 3 14.644083 0.0000 69.5500 353
# You can also use 'apply' aggregation with your own function by using lambda
df_agg.apply(lambda x: x.Fare.max() - x.Fare.min()) # Range of Fare from min to max
Embarked Pclass None
0 C 1 485.7792
1 C 2 29.5792
2 C 3 18.3458
3 Q 1 0.0000
4 Q 2 0.0000
5 Q 3 22.3750
6 S 1 263.0000
7 S 2 73.5000
8 S 3 69.5500
# Same apply aggregation with different format
df_agg.agg({'Fare': lambda x: x.max() - x.min()})
Embarked Pclass Fare
0 C 1 485.7792
1 C 2 29.5792
2 C 3 18.3458
3 Q 1 0.0000
4 Q 2 0.0000
5 Q 3 22.3750
6 S 1 263.0000
7 S 2 73.5000
8 S 3 69.5500

Comments