Difference in Differences
Difference in Differences in Python
import pandas as pd
pd.options.display.float_format = '{:.2f}'.format
%precision %.2f
'%.2f'
df = pd.read_csv('./data/employment.csv')
The dataset is adapted from Card and Krueger (1994), which estimates the causal effect of an increase in the state minimum wage on employment.
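On April 1, 1992, New Jersey (NJ) raised its state minimum wage from 4.25 USD to 5.05 USD, while the minimum wage in Pennsylvania (PA) stayed at 4.25 USD.
Employment data for fast-food restaurants in both states were collected in February 1992 and again in November 1992. After removing rows with null values, 384 restaurants remain.
In the state column, 0 denotes PA (the control group, 75 restaurants) and 1 denotes NJ (the treatment group, 309 restaurants).
The calculation of DID is simple: take the change in mean employment in the treatment group and subtract the change in mean employment in the control group.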
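As a minimal sketch of that calculation, the snippet below computes a DID estimate from four lists of values. The function name did_estimate and the numbers are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
from statistics import mean


def did_estimate(treat_before, treat_after, ctrl_before, ctrl_after):
    """Change in the treated group's mean minus change in the control group's mean."""
    treat_change = mean(treat_after) - mean(treat_before)
    ctrl_change = mean(ctrl_after) - mean(ctrl_before)
    return treat_change - ctrl_change


# Hypothetical employment figures per restaurant (not from employment.csv)
treat_before = [20.0, 22.0, 18.0]
treat_after = [21.0, 23.0, 19.0]
ctrl_before = [24.0, 22.0, 26.0]
ctrl_after = [22.0, 20.0, 24.0]

did = did_estimate(treat_before, treat_after, ctrl_before, ctrl_after)
print(f'DID estimate: {did:.2f}')  # treated change (+1) minus control change (-2)
```

Subtracting the control group's change removes any trend that both groups share, which is the point of using a comparison state.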
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 384 entries, 0 to 383
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 state 384 non-null int64
1 total_emp_feb 384 non-null float64
2 total_emp_nov 384 non-null float64
dtypes: float64(2), int64(1)
memory usage: 9.1 KB
df.head()
| | state | total_emp_feb | total_emp_nov |
|---|---|---|---|
| 0 | 0 | 40.50 | 24.00 |
| 1 | 0 | 13.75 | 11.50 |
| 2 | 0 | 8.50 | 10.50 |
| 3 | 0 | 34.00 | 20.00 |
| 4 | 0 | 24.00 | 35.50 |
df.state.value_counts()
1 309
0 75
Name: state, dtype: int64
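New Jersey data
# New Jersey is the treatment group: state == 1 (309 restaurants)
# .copy() avoids pandas' SettingWithCopyWarning when adding a column to a slice
df_nj = df[df.state == 1].copy()
df_nj['delta'] = df_nj['total_emp_nov'] - df_nj['total_emp_feb']
print(f'the change of employment from feb to nov in New Jersey is {df_nj.delta.mean():.2f}')
the change of employment from feb to nov in New Jersey is 0.47
Pennsylvania data
# Pennsylvania is the control group: state == 0 (75 restaurants)
df_pa = df[df.state == 0].copy()
df_pa['delta'] = df_pa['total_emp_nov'] - df_pa['total_emp_feb']
print(f'the change of employment from feb to nov in Pennsylvania is {df_pa.delta.mean():.2f}')
the change of employment from feb to nov in Pennsylvania is -2.28
The DID estimate is the change in New Jersey minus the change in Pennsylvania: 0.47 - (-2.28) = 2.75. But how can we tell whether this difference is statistically significant? We use a linear regression:
$$y = \beta_0 + \beta_1 \cdot \text{state} + \beta_2 \cdot \text{month} + \beta_3 \cdot (\text{state} \times \text{month}) + \varepsilon$$
Here state is 0 for the control group (Pennsylvania) and 1 for the treatment group (New Jersey), month is 0 for before (Feb) and 1 for after (Nov), and state × month is the interaction term.
Inserting the four combinations of state and month into the equation gives the expected employment in each cell:
| state | month | expected y |
|---|---|---|
| 0 (PA) | 0 (Feb) | $\beta_0$ |
| 0 (PA) | 1 (Nov) | $\beta_0 + \beta_2$ |
| 1 (NJ) | 0 (Feb) | $\beta_0 + \beta_1$ |
| 1 (NJ) | 1 (Nov) | $\beta_0 + \beta_1 + \beta_2 + \beta_3$ |
The change in NJ is $\beta_2 + \beta_3$ and the change in PA is $\beta_2$, so their difference, the DID, is exactly the interaction coefficient $\beta_3$.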
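The claim that the interaction coefficient equals the DID can be checked numerically. The sketch below uses synthetic data (the coefficients 20, 1.5, -2 and 3 and the noise level are made up) and plain NumPy least squares rather than statsmodels, so it illustrates the identity and does not reproduce the analysis in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Balanced synthetic design: 50 observations in each state/month cell
state = np.tile([0, 0, 1, 1], 50)
month = np.tile([0, 1, 0, 1], 50)
n = state.size

# Made-up data-generating process with a true interaction effect of 3
y = 20 + 1.5 * state - 2.0 * month + 3.0 * state * month + rng.normal(0, 5, size=n)

# OLS with columns: intercept, state, month, state * month
X = np.column_stack([np.ones(n), state, month, state * month])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)


def cell_mean(s, m):
    """Mean outcome for one state/month combination."""
    return y[(state == s) & (month == m)].mean()


# DID computed by hand from the four cell means
did = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

print(f'interaction coefficient: {beta[3]:.4f}')
print(f'manual DID:              {did:.4f}')
```

The two printed values agree up to floating-point error because the model has one parameter per cell, so the fitted values are the cell means themselves.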
df_train = pd.melt(df, id_vars=['state'], value_vars=['total_emp_feb', 'total_emp_nov'], var_name='month', value_name='total_emp')
df_train
| | state | month | total_emp |
|---|---|---|---|
| 0 | 0 | total_emp_feb | 40.50 |
| 1 | 0 | total_emp_feb | 13.75 |
| 2 | 0 | total_emp_feb | 8.50 |
| 3 | 0 | total_emp_feb | 34.00 |
| 4 | 0 | total_emp_feb | 24.00 |
| ... | ... | ... | ... |
| 763 | 1 | total_emp_nov | 23.75 |
| 764 | 1 | total_emp_nov | 17.50 |
| 765 | 1 | total_emp_nov | 20.50 |
| 766 | 1 | total_emp_nov | 20.50 |
| 767 | 1 | total_emp_nov | 25.00 |
768 rows × 3 columns
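# Recode month as 0 (Feb, before) and 1 (Nov, after)
df_train.loc[df_train.month == 'total_emp_feb', 'month'] = 0
df_train.loc[df_train.month == 'total_emp_nov', 'month'] = 1
# month keeps its object dtype, so the formula API treats month and interaction as categorical (hence the [T.1] labels in the summary)
df_train['interaction'] = df_train['state'] * df_train['month']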
df_train.state.value_counts()
1 618
0 150
Name: state, dtype: int64
df_train.month.value_counts()
0 384
1 384
Name: month, dtype: int64
df_train.interaction.value_counts()
0 459
1 309
Name: interaction, dtype: int64
import statsmodels.formula.api as smf
model = smf.ols('total_emp ~ state + month + interaction', data=df_train).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: total_emp R-squared: 0.008
Model: OLS Adj. R-squared: 0.004
Method: Least Squares F-statistic: 1.947
Date: Thu, 25 Apr 2024 Prob (F-statistic): 0.121
Time: 21:22:02 Log-Likelihood: -2817.6
No. Observations: 768 AIC: 5643.
Df Residuals: 764 BIC: 5662.
Df Model: 3
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
Intercept 23.3800 1.098 21.288 0.000 21.224 25.536
month[T.1] -2.2833 1.553 -1.470 0.142 -5.332 0.766
interaction[T.1] 2.7500 1.731 1.588 0.113 -0.649 6.149
state -2.9494 1.224 -2.409 0.016 -5.353 -0.546
==============================================================================
Omnibus: 212.243 Durbin-Watson: 1.835
Prob(Omnibus): 0.000 Jarque-Bera (JB): 761.734
Skew: 1.278 Prob(JB): 3.90e-166
Kurtosis: 7.155 Cond. No. 11.3
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The interaction coefficient is 2.75 with a p-value of 0.113, so it is not statistically significant at the 5% level. The point estimate says that average employment per restaurant in New Jersey rose by 2.75 FTE (full-time equivalents) relative to Pennsylvania after the minimum wage increase, but we cannot rule out that this difference is due to chance.
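To see where the significance figures in the summary come from, the sketch below recomputes the t-statistic, an approximate p-value, and an approximate 95% confidence interval from the reported coefficient (2.75) and standard error (1.731). It assumes a normal approximation in place of the t-distribution with 764 degrees of freedom that statsmodels uses, so the results will differ slightly from the table in the last digits.

```python
from statistics import NormalDist

coef = 2.7500    # interaction coefficient reported in the summary
std_err = 1.731  # its reported standard error

# t-statistic: the estimate measured in units of its standard error
t_stat = coef / std_err

# Two-sided p-value under a normal approximation (an assumption of this sketch)
std_normal = NormalDist()
p_approx = 2 * (1 - std_normal.cdf(abs(t_stat)))

# Approximate 95% confidence interval
z = std_normal.inv_cdf(0.975)
ci_low, ci_high = coef - z * std_err, coef + z * std_err

print(f't = {t_stat:.3f}, approximate p = {p_approx:.3f}')
print(f'approximate 95% CI: [{ci_low:.3f}, {ci_high:.3f}]')
```

Because the interval contains zero, the data are compatible with no employment effect at all, which matches the p-value being above 0.05.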