Difference in Differences

April 6, 2024 3 minute read

Difference in Differences in Python

import pandas as pd
pd.options.display.float_format = '{:.2f}'.format

%precision %.2f

'%.2f'

df = pd.read_csv('./data/employment.csv')

The dataset is adapted from Card and Krueger (1994), which estimates the causal effect of an increase in the state minimum wage on employment.

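On April 1, 1992, New Jersey raised its state minimum wage from 4.25 USD to 5.05 USD, while the minimum wage in Pennsylvania stayed at 4.25 USD. Employment data for fast-food restaurants in both states were collected in February 1992 (before the increase) and in November 1992 (after it). After removing rows with null values, 384 restaurants remain. In the state column, 1 is New Jersey (the treatment group, 309 restaurants) and 0 is Pennsylvania (the control group, 75 restaurants).

The calculation of DID is simple: take the change in mean employment in the treatment group and subtract the change in the control group.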
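Written out as a formula, with Y-bar denoting mean employment per restaurant in a given state and month (the notation is ours, chosen for illustration), the estimator is:

```latex
\[
\widehat{\mathrm{DID}}
  = \left(\bar{Y}_{\mathrm{NJ},\,\mathrm{Nov}} - \bar{Y}_{\mathrm{NJ},\,\mathrm{Feb}}\right)
  - \left(\bar{Y}_{\mathrm{PA},\,\mathrm{Nov}} - \bar{Y}_{\mathrm{PA},\,\mathrm{Feb}}\right)
\]
```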

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 384 entries, 0 to 383
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   state          384 non-null    int64  
 1   total_emp_feb  384 non-null    float64
 2   total_emp_nov  384 non-null    float64
dtypes: float64(2), int64(1)
memory usage: 9.1 KB

df.head()

	state	total_emp_feb	total_emp_nov
0	0	40.50	24.00
1	0	13.75	11.50
2	0	8.50	10.50
3	0	34.00	20.00
4	0	24.00	35.50

df.state.value_counts()

1    309
0     75
Name: state, dtype: int64

New Jersey data

# New Jersey (state == 1): change in employment per restaurant, Feb to Nov
df_nj = df[df.state == 1].copy()  # .copy() avoids SettingWithCopyWarning
df_nj['delta'] = df_nj['total_emp_nov'] - df_nj['total_emp_feb']
print(f'the change of employment from feb to nov in New Jersey is {df_nj.delta.mean():.2f}')

the change of employment from feb to nov in New Jersey is 0.47

Pennsylvania

# Pennsylvania (state == 0): change in employment per restaurant, Feb to Nov
df_pa = df[df.state == 0].copy()
df_pa['delta'] = df_pa['total_emp_nov'] - df_pa['total_emp_feb']
print(f'the change of employment from feb to nov in Pennsylvania is {df_pa.delta.mean():.2f}')

the change of employment from feb to nov in Pennsylvania is -2.28

The difference between New Jersey and Pennsylvania is 0.47 - (-2.28) = 2.75. This is the DID estimate: relative to Pennsylvania, employment per restaurant in New Jersey rose by 2.75. But how can we know whether this is statistically significant? We use linear regression.
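The same arithmetic can be wrapped in a small helper. This is only a sketch: the function name did_estimate and the toy numbers are made up for illustration, and it assumes the column layout shown above (state, total_emp_feb, total_emp_nov). The code that marks the treated state is passed in explicitly.

```python
import pandas as pd


def did_estimate(df: pd.DataFrame, treated_code: int) -> float:
    """Mean change in the treated group minus mean change in the other group."""
    delta = df["total_emp_nov"] - df["total_emp_feb"]
    is_treated = df["state"] == treated_code
    return delta[is_treated].mean() - delta[~is_treated].mean()


# Toy data (hypothetical numbers, not the Card and Krueger sample)
toy = pd.DataFrame({
    "state":         [0, 0, 1, 1],
    "total_emp_feb": [20.0, 30.0, 10.0, 20.0],
    "total_emp_nov": [18.0, 28.0, 11.0, 23.0],
})

toy_did = did_estimate(toy, treated_code=1)
print(f"DID on the toy data: {toy_did:.2f}")
```

On the toy data the treated group changes by +2 on average and the other group by -2, so the helper returns 4.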

y = β0 + β1·state + β2·month + β3·interaction + ε

Here state is 1 for the treatment group (New Jersey) and 0 for the control group (Pennsylvania), month is 0 for before (Feb) and 1 for after (Nov), and interaction is the product state × month. Inserting the four combinations of state and month into the equation shows that the coefficient of the interaction (β3) is the DID value.
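Plugging the four combinations of the two dummies into the regression makes this explicit. The derivation below is the standard one, written with the symbols of the equation above and with the interaction taken to be the product of state and month:

```latex
\begin{align*}
E[y \mid \text{state}=0,\ \text{month}=0] &= \beta_0 \\
E[y \mid \text{state}=0,\ \text{month}=1] &= \beta_0 + \beta_2 \\
E[y \mid \text{state}=1,\ \text{month}=0] &= \beta_0 + \beta_1 \\
E[y \mid \text{state}=1,\ \text{month}=1] &= \beta_0 + \beta_1 + \beta_2 + \beta_3 \\[4pt]
\text{DID} &= \bigl[(\beta_0+\beta_1+\beta_2+\beta_3) - (\beta_0+\beta_1)\bigr]
            - \bigl[(\beta_0+\beta_2) - \beta_0\bigr] = \beta_3
\end{align*}
```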

df_train = pd.melt(df, id_vars=['state'], value_vars=['total_emp_feb', 'total_emp_nov'], var_name='month', value_name='total_emp')

df_train

	state	month	total_emp
0	0	total_emp_feb	40.50
1	0	total_emp_feb	13.75
2	0	total_emp_feb	8.50
3	0	total_emp_feb	34.00
4	0	total_emp_feb	24.00
...	...	...	...
763	1	total_emp_nov	23.75
764	1	total_emp_nov	17.50
765	1	total_emp_nov	20.50
766	1	total_emp_nov	20.50
767	1	total_emp_nov	25.00

768 rows × 3 columns

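# Recode month as a dummy: 0 = Feb (before), 1 = Nov (after). The column stays
# object dtype, so the model summary labels it as categorical ([T.1]).
df_train.loc[df_train.month == 'total_emp_feb', 'month'] = 0
df_train.loc[df_train.month == 'total_emp_nov', 'month'] = 1
# interaction is 1 only for state == 1 rows observed in November
df_train['interaction'] = df_train['state'] * df_train['month']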

df_train.state.value_counts()

1    618
0    150
Name: state, dtype: int64

df_train.month.value_counts()

0    384
1    384
Name: month, dtype: int64

df_train.interaction.value_counts()

0    459
1    309
Name: interaction, dtype: int64
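Before fitting the model with a library, it can help to check that the interaction coefficient really equals the DID of the group means. The sketch below uses plain NumPy least squares on made-up toy numbers (not the restaurant data); the variable and function names are ours.

```python
import numpy as np

# Toy long-format data (hypothetical): one row per restaurant and period
state = np.array([0, 0, 1, 1, 0, 0, 1, 1])
month = np.array([0, 0, 0, 0, 1, 1, 1, 1])
total_emp = np.array([20.0, 30.0, 10.0, 20.0, 18.0, 28.0, 11.0, 23.0])

# Design matrix: intercept, state, month, state * month
X = np.column_stack([np.ones_like(total_emp), state, month, state * month])
beta, *_ = np.linalg.lstsq(X, total_emp, rcond=None)


def cell_mean(s: int, m: int) -> float:
    """Mean of total_emp for one state/month combination."""
    return total_emp[(state == s) & (month == m)].mean()


# DID computed directly from the four group means
did_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

print(f"beta_3 from least squares: {beta[3]:.2f}")
print(f"DID from group means:      {did_means:.2f}")
```

Because the model with both dummies and their product has one parameter per state/month cell, the fitted values are exactly the four cell means, and the two numbers agree.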

import statsmodels.formula.api as smf  # smf is the conventional alias for the formula API
model = smf.ols('total_emp ~ state + month + interaction', data=df_train).fit()

print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:              total_emp   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     1.947
Date:                Thu, 25 Apr 2024   Prob (F-statistic):              0.121
Time:                        21:22:02   Log-Likelihood:                -2817.6
No. Observations:                 768   AIC:                             5643.
Df Residuals:                     764   BIC:                             5662.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept           23.3800      1.098     21.288      0.000      21.224      25.536
month[T.1]          -2.2833      1.553     -1.470      0.142      -5.332       0.766
interaction[T.1]     2.7500      1.731      1.588      0.113      -0.649       6.149
state               -2.9494      1.224     -2.409      0.016      -5.353      -0.546
==============================================================================
Omnibus:                      212.243   Durbin-Watson:                   1.835
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              761.734
Skew:                           1.278   Prob(JB):                    3.90e-166
Kurtosis:                       7.155   Cond. No.                         11.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

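The coefficient of the interaction is 2.75 with a p-value of 0.113, so it is not statistically significant at the 5% level. The point estimate says that, relative to Pennsylvania, the average number of employees per restaurant in New Jersey increased by 2.75 FTE (full-time equivalent) after the minimum wage was raised, but we cannot rule out that the result is due to random variation.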
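As a rough cross-check of the summary table, the t statistic, p-value, and confidence interval can be recomputed from the coefficient and its standard error. The sketch below uses a normal approximation from the standard library rather than the t distribution that statsmodels uses, so the last digits differ slightly from the table.

```python
from statistics import NormalDist

coef = 2.7500    # interaction coefficient, copied from the summary table
std_err = 1.731  # its standard error, copied from the same table

t_stat = coef / std_err

# Two-sided p-value under a standard normal approximation
# (an assumption; reasonable here with 764 residual degrees of freedom)
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))

# Approximate 95% confidence interval
z = NormalDist().inv_cdf(0.975)
ci_low, ci_high = coef - z * std_err, coef + z * std_err

print(f"t = {t_stat:.3f}, p = {p_value:.3f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The interval contains zero, which is another way of saying that the estimate is not significant at the 5% level.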
