Pandas data wrangling (1) - Groupby
Loading packages
import pandas as pd
import numpy as np
import os
path = os.getcwd()
print(path)
/Users/steve.han/git/stevehan310.github.io/notebooks
Loading data file (Titanic Data from Kaggle Competition)
df = pd.read_csv(os.path.join(path, 'data', 'titanic_train.csv'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Data Dictionary
Variable Definition Key
survival Survival 0 = No, 1 = Yes
pclass Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
sex Sex
Age Age in years
sibsp # of siblings / spouses aboard the Titanic
parch # of parents / children aboard the Titanic
ticket Ticket number
fare Passenger fare
cabin Cabin number
embarked Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
|
PassengerId |
Survived |
Pclass |
Name |
Sex |
Age |
SibSp |
Parch |
Ticket |
Fare |
Cabin |
Embarked |
0 |
1 |
0 |
3 |
Braund, Mr. Owen Harris |
male |
22.0 |
1 |
0 |
A/5 21171 |
7.2500 |
NaN |
S |
1 |
2 |
1 |
1 |
Cumings, Mrs. John Bradley (Florence Briggs Th... |
female |
38.0 |
1 |
0 |
PC 17599 |
71.2833 |
C85 |
C |
2 |
3 |
1 |
3 |
Heikkinen, Miss. Laina |
female |
26.0 |
0 |
0 |
STON/O2. 3101282 |
7.9250 |
NaN |
S |
3 |
4 |
1 |
1 |
Futrelle, Mrs. Jacques Heath (Lily May Peel) |
female |
35.0 |
1 |
0 |
113803 |
53.1000 |
C123 |
S |
4 |
5 |
0 |
3 |
Allen, Mr. William Henry |
male |
35.0 |
0 |
0 |
373450 |
8.0500 |
NaN |
S |
Pandas Groupby
# Creating groupby object with embarked and Pclass
df_agg = df.groupby(['Embarked', 'Pclass'], as_index = False)
# Average Fare price per Port of Embarkation and Ticket Class
df_agg.agg(fare_avg = ('Fare', np.mean), # Get the average of Fare per group and add them into a column with name 'fare_avg'
fare_min = ('Fare', np.min), # Get a minimum of Fare per group and add them into a column with name 'fare_min'
fare_max = ('Fare', np.max), # Get a maximum of Fare per group and add them into a column with name 'fare_max'
num_passenger = ('PassengerId', 'nunique') # Get a number of passenger per group and add them into a column with name 'num_passenger'
)
# np.unique
# np.count_nonzero
# np.sum – Sum of values
# np.mean – Mean of values
# np.median – Arithmetic median of values
# np.min – Minimum
# np.max – Maximum
# np.std – Standard deviation
# np.var – Variance
|
Embarked |
Pclass |
fare_avg |
fare_min |
fare_max |
num_passenger |
0 |
C |
1 |
104.718529 |
26.5500 |
512.3292 |
85 |
1 |
C |
2 |
25.358335 |
12.0000 |
41.5792 |
17 |
2 |
C |
3 |
11.214083 |
4.0125 |
22.3583 |
66 |
3 |
Q |
1 |
90.000000 |
90.0000 |
90.0000 |
2 |
4 |
Q |
2 |
12.350000 |
12.3500 |
12.3500 |
3 |
5 |
Q |
3 |
11.183393 |
6.7500 |
29.1250 |
72 |
6 |
S |
1 |
70.364862 |
0.0000 |
263.0000 |
127 |
7 |
S |
2 |
20.327439 |
0.0000 |
73.5000 |
164 |
8 |
S |
3 |
14.644083 |
0.0000 |
69.5500 |
353 |
# You can also use 'apply' aggregation with your own function by using lambda
df_agg.apply(lambda x: x.Fare.max() - x.Fare.min()) # Range of Fare from min to max
|
Embarked |
Pclass |
None |
0 |
C |
1 |
485.7792 |
1 |
C |
2 |
29.5792 |
2 |
C |
3 |
18.3458 |
3 |
Q |
1 |
0.0000 |
4 |
Q |
2 |
0.0000 |
5 |
Q |
3 |
22.3750 |
6 |
S |
1 |
263.0000 |
7 |
S |
2 |
73.5000 |
8 |
S |
3 |
69.5500 |
# Same apply aggregation with different format
df_agg.agg({'Fare': lambda x: x.max() - x.min()})
|
Embarked |
Pclass |
Fare |
0 |
C |
1 |
485.7792 |
1 |
C |
2 |
29.5792 |
2 |
C |
3 |
18.3458 |
3 |
Q |
1 |
0.0000 |
4 |
Q |
2 |
0.0000 |
5 |
Q |
3 |
22.3750 |
6 |
S |
1 |
263.0000 |
7 |
S |
2 |
73.5000 |
8 |
S |
3 |
69.5500 |
Comments