Ever wondered how to estimate Fama-MacBeth or cluster-robust standard errors in R? It can actually be very easy.

First, for some background information read Kevin Goulding’s blog post, Mitchell Petersen’s programming advice, Mahmood Arai’s paper/note and code (there is an earlier version of the code with some more comments in it). For more formal references you may want to look into Thompson (2011, JFE) and Petersen (2008, WP). Both papers focus on estimating robust SE using Stata.

After extensively discussing this with Giovanni Millo, co-author of `'plm'`

, it turns out that released R packages (`'plm'`

, `'lmtest'`

, `'sandwich'`

) can readily estimate clustered SEs. The results are not exactly the same as the Stata output, since in `'plm'`

the options `'HC0'`

through `'HC4'`

for `'vcovHC()'`

do not use the exact same weighting (by a function of sample size) that Stata uses for small-sample correction. But the results are sensibly similar when using `'HC1'`

.

It should be easy to (almost exactly) replicate M. Petersen’s benchmark results using the following code.

Import M. Petersen’s test data.

require(foreign) require(plm) require(lmtest) test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")

Estimate linear model using OLS. The second call estimates the Fama-MacBeth regression.

fpm <- plm(y ~ x, test, model='pooling', index=c('firmid', 'year')) fpmg <- pmg(y~x, test, index=c("year","firmid")) ##Fama-MacBeth

Define a function that would estimate robust SE with double-clustering.

##Double-clustering formula (Thompson, 2011) vcovDC <- function(x, ...){ vcovHC(x, cluster="group", ...) + vcovHC(x, cluster="time", ...) - vcovHC(x, method="white1", ...) }

Estimate OLS standard errors, White standard errors, standard errors clustered by group, by time, and by group and time. Compare the R output with M. Petersen’s benchmark results from Stata.

> ##OLS, White and clustering > coeftest(fpm) t test of coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.029680 0.028359 1.0466 0.2954 x 1.034833 0.028583 36.2041 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > coeftest(fpm, vcov=function(x) vcovHC(x, method="white1", type="HC1")) t test of coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.029680 0.028361 1.0465 0.2954 x 1.034833 0.028395 36.4440 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > coeftest(fpm, vcov=function(x) vcovHC(x, cluster="group", type="HC1")) t test of coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.029680 0.066952 0.4433 0.6576 x 1.034833 0.050550 20.4714 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > coeftest(fpm, vcov=function(x) vcovHC(x, cluster="time", type="HC1")) t test of coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.029680 0.022189 1.3376 0.1811 x 1.034833 0.031679 32.6666 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > coeftest(fpm, vcov=function(x) vcovDC(x, type="HC1")) t test of coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.029680 0.064580 0.4596 0.6458 x 1.034833 0.052465 19.7243 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As Giovanni interestingly pointed out to me (in a privately circulated draft paper), it seems that the Fama-MacBeth estimator is nothing more than what econometricians call the Mean Groups estimator, and `'plm'`

can readily estimate this. You only need to swap the `'group'`

and `'time'`

indices. (See `pmg()`

call above.)

> ##Fama-MacBeth > coeftest(fpmg) t test of coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.031278 0.023356 1.3392 0.1806 x 1.035586 0.033342 31.0599 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As a last remark, it may be a good idea to introduce a `type='HC5'`

, implementing the *exact* Stata small-sample correction procedure, to allow users to benchmark R output against Stata results.

June 7th, 2012 at 21:56

This is great, thanks!

June 9th, 2012 at 18:27

I agree that it is always nice to be able to benchmark results from R against other statistics programs (stata, SAS, SPSS, etc.). Good work here!

July 6th, 2012 at 07:41

I was able to replicate the results of Mitchell Petersen’s Stata code (http://iangow.wordpress.com/2012/01/19/iv-regression-and-two-way-cluster-robust-standard-errors/). As discussed in that post, I based my code on that of Mahmood Arai and was pleasantly surprised that I could use the same routine for instrumental variable regressions.

August 17th, 2013 at 01:20

alas, a simple program to do this by hand is about 5 times the speed of the generic panel plm version :-(

September 4th, 2013 at 23:20

Is coeftest able to handle clustered SE on a non-index variable? In addition, is PLM able to hand more than 2 indices? ie firm ID, country, time index?

September 6th, 2013 at 21:45

I queried Giovanni Millo, co-author of ‘plm’, and here’s what he had to say:

“Yes it is, provided this IS declared as the index variable. Suppose you want to cluster on industry inside a financial dataset and the variable is called ‘indus’, then specify only one index name: ‘index=”indus”‘, in the call to either plm() or pdata.frame(). ‘plm’ will add a second index, called “time”, which is meaningless here but won’t hurt anybody, and every procedure clustering on industry will be feasible.

Handling more than two indices is in principle not difficult, but is not implemented yet as it requires a different approach to the interface than that of ‘plm’, where everything is designed for seamlessly handling datasets with 2 of them. I will probably add it “soon”.”

November 11th, 2013 at 23:31

However, if you want to calculate SEs clustered by country, for example, and firms are grouped within countries, simply defining index=”country” will not work. You will get the “duplicate couples (time-id)” error since there will be more than one observation per country-time pairing. Arai’s solution allows you to define any variable as the cluster variable, but his function does not work for a plm object. Any suggestions?

September 26th, 2014 at 18:27

I posted an example of the “clustering at higher level” approach on SO:

http://stackoverflow.com/questions/8389843/clustered-standard-errors-r-panel-data/11173348#11173348

November 16th, 2013 at 11:12

Another reply coming from Giovanni:

“The short answer to the “clustering at higher level” issue is:

(assume you have a pdata.frame ‘mydata’ with indices “ind” and “tind”, while “clst” is the clustering variable)

forget about the “true” panel structure, wipe it out and substitute it with:

> mydata mydata <- pdata.frame(mydata, index="clst")

# do not specify "time" as second index! they are not compatible (duplicated years)

you get a "fake" panel where you can cluster on "clst" as you wish. Notice, you still have the original index variables in 'mydata' for reverting.

A longer explanation:

the algebra is very simple, while the interface problem is nasty. Panel models objects in 'plm' bring a lot of info with them, including data used in estimation and indices. Therefore it is possible to apply a vcov() method to an estimated model getting robust SEs. Using a different clustering variable would mean finding it inside the model object; among the regressors, if it doesn't have to coincide with one of the indices! But it won't probably be a regressor; and a higher-level clustering variable is sure to be time-invariant at the individual index level, hence it would probably be discarded by the internals.

Stata has it simpler because of its limited – and limiting – approach of "one dataset at a time"; hence it can afford to specify one variable in the cluster(.) option, being sure it will be compatible. If I remember correctly, Arai's solution also does everything inside a same procedure. The challenge is to have interoperable "pieces" (objects) where each one brings the necessary information with it, and any function knows what to do. Hence in 'plm' we should probably think of a way of "keeping" the clustering variable inside the model object — but this would break too much stuff…

The above solution does not allow to cluster at higher level while adding individual effects at lower level. If one wants a RE or FE model with different clustering:

- estimate FE, RE model according to original indexing

- extract demeaned data with pmodel.response(), model.matrix(), merge with original dataframe, redefine "clst" as individual index

- estimate pooling model on demeaned data (equivalent to FE or RE on original data)

- do clustered inference according to new clustering variable

(this is left as exercise ;^))"

November 19th, 2013 at 17:24

Thanks a lot for the response. The last section was what I was trying to do, clustering at a higher level but keeping the individual effects at the lower level. I’ll try out this method!

September 26th, 2014 at 16:52

Christina, I was curious if you got anywhere with this approach. Did you manage to make it work?

September 26th, 2014 at 19:22

Thank you for checking in. Yes, it worked!

Here is the code I used:

library(plm)

demY<-pmodel.response(plm.model)

demX<-model.matrix(plm.model)

mod<-lm(demY~0+demX) #pooled model using time-demeaned data; same result as original FE plm model

clx(mod,1,data$COUNTY) #gives cluster-robust SEs at county level

^with the clx function derived from Arai's method for lm objects.

September 28th, 2014 at 19:44

Thank you, Christina, for sharing this here. If you haven’t noticed it yet, you may be interested in the StackOverflow link above which gives an example of “clustering at higher level” using plm.