Pandas: Groupby Method

February 6, 2024 3 minute read

Pandas data wrangling (1) - Groupby

Loading packages

import pandas as pd
import numpy as np
import os
path = os.getcwd()
print(path)

/Users/steve.han/git/stevehan310.github.io/notebooks

Loading data file (Titanic Data from Kaggle Competition)

df = pd.read_csv(os.path.join(path, 'data', 'titanic_train.csv'))

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Data Dictionary

Variable Definition Key
survival Survival 0 = No, 1 = Yes
pclass Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
sex Sex
Age Age in years
sibsp # of siblings / spouses aboard the Titanic
parch # of parents / children aboard the Titanic
ticket Ticket number
fare Passenger fare
cabin Cabin number
embarked Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Pandas Groupby

# Creating groupby object with embarked and Pclass
df_agg = df.groupby(['Embarked', 'Pclass'], as_index = False)

# Average Fare price per Port of Embarkation and Ticket Class
df_agg.agg(fare_avg = ('Fare', np.mean), # Get the average of Fare per group and add them into a column with name 'fare_avg'
           fare_min = ('Fare', np.min), # Get a minimum of Fare per group and add them into a column with name 'fare_min'
           fare_max = ('Fare', np.max), # Get a maximum of Fare per group and add them into a column with name 'fare_max'
           num_passenger = ('PassengerId', 'nunique') # Get a number of passenger per group and add them into a column with name 'num_passenger'
          )

# np.unique
# np.count_nonzero  
# np.sum – Sum of values
# np.mean – Mean of values
# np.median – Arithmetic median of values
# np.min – Minimum
# np.max – Maximum
# np.std – Standard deviation
# np.var – Variance

	Embarked	Pclass	fare_avg	fare_min	fare_max	num_passenger
0	C	1	104.718529	26.5500	512.3292	85
1	C	2	25.358335	12.0000	41.5792	17
2	C	3	11.214083	4.0125	22.3583	66
3	Q	1	90.000000	90.0000	90.0000	2
4	Q	2	12.350000	12.3500	12.3500	3
5	Q	3	11.183393	6.7500	29.1250	72
6	S	1	70.364862	0.0000	263.0000	127
7	S	2	20.327439	0.0000	73.5000	164
8	S	3	14.644083	0.0000	69.5500	353

# You can also use 'apply' aggregation with your own function by using lambda
df_agg.apply(lambda x: x.Fare.max() - x.Fare.min()) # Range of Fare from min to max

	Embarked	Pclass	None
0	C	1	485.7792
1	C	2	29.5792
2	C	3	18.3458
3	Q	1	0.0000
4	Q	2	0.0000
5	Q	3	22.3750
6	S	1	263.0000
7	S	2	73.5000
8	S	3	69.5500

# Same apply aggregation with different format
df_agg.agg({'Fare': lambda x: x.max() - x.min()})

	Embarked	Pclass	Fare
0	C	1	485.7792
1	C	2	29.5792
2	C	3	18.3458
3	Q	1	0.0000
4	Q	2	0.0000
5	Q	3	22.3750
6	S	1	263.0000
7	S	2	73.5000
8	S	3	69.5500

Twitter Facebook LinkedIn

Pandas: Groupby Method

Pandas data wrangling (1) - Groupby

Loading packages

Loading data file (Titanic Data from Kaggle Competition)

Data Dictionary

Variable Notes

Pandas Groupby

Comments

You May Also Enjoy

Simple Linear Regression Model in PyTorch and Tensorflow

Difference in Differences

Generalized Linear Models

Log Likelihood Estimation