Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R

Ever wondered how to estimate Fama-MacBeth or cluster-robust standard errors in R? It can actually be very easy.

First, for some background information read Kevin Goulding’s blog post, Mitchell Petersen’s programming advice, and Mahmood Arai’s paper/note and code (an earlier version of the code contains more comments). For more formal references you may want to look into Thompson (2011, JFE) and Petersen (2008, WP). Both papers focus on estimating robust SEs using Stata.

After extensively discussing this with Giovanni Millo, co-author of 'plm', it turns out that the released R packages ('plm', 'lmtest', 'sandwich') can readily estimate clustered SEs. The results are not exactly the same as the Stata output, since the 'HC0' through 'HC4' options of 'vcovHC()' in 'plm' do not use the exact same small-sample weighting (a function of sample size) that Stata applies. The results are nonetheless very close when using 'HC1'.

It should be easy to (almost exactly) replicate M. Petersen’s benchmark results using the following code.

Import M. Petersen’s test data.

library(foreign)  # for read.dta()
library(plm)      # panel models and robust vcov estimators
library(lmtest)   # coeftest()

test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")

Estimate the linear model using OLS. The second call estimates the Fama-MacBeth regression.

fpm <- plm(y ~ x, test, model='pooling', index=c('firmid', 'year'))
fpmg <- pmg(y~x, test, index=c("year","firmid")) ##Fama-MacBeth

Define a function that estimates robust SEs with double clustering.

##Double-clustering formula (Thompson, 2011): cluster by group,
##plus cluster by time, minus the White (heteroskedasticity-only) vcov
vcovDC <- function(x, ...){
    vcovHC(x, cluster="group", ...) + vcovHC(x, cluster="time", ...) - 
        vcovHC(x, method="white1", ...)
}
Estimate OLS standard errors, White standard errors, standard errors clustered by group, by time, and by group and time. Compare the R output with M. Petersen’s benchmark results from Stata.

> ##OLS, White and clustering
> coeftest(fpm)

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.028359  1.0466   0.2954    
x           1.034833   0.028583 36.2041   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

> coeftest(fpm, vcov=function(x) vcovHC(x, method="white1", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.028361  1.0465   0.2954    
x           1.034833   0.028395 36.4440   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.066952  0.4433   0.6576    
x           1.034833   0.050550 20.4714   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="time", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.022189  1.3376   0.1811    
x           1.034833   0.031679 32.6666   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

> coeftest(fpm, vcov=function(x) vcovDC(x, type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.064580  0.4596   0.6458    
x           1.034833   0.052465 19.7243   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

As Giovanni interestingly pointed out to me (in a privately circulated draft paper), the Fama-MacBeth estimator is nothing more than what econometricians call the Mean Groups estimator, and 'plm' can readily estimate it: you only need to swap the 'group' and 'time' indices (see the pmg() call above).

> ##Fama-MacBeth
> coeftest(fpmg)

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.031278   0.023356  1.3392   0.1806    
x           1.035586   0.033342 31.0599   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
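This equivalence is easy to check by hand in base R. A minimal sketch on simulated data (all names and the data-generating process here are made up for illustration): run one cross-sectional OLS per year, average the yearly slopes, and take the time-series standard deviation of the slopes over sqrt(T) as the Fama-MacBeth standard error.

```r
## By-hand Fama-MacBeth on simulated panel data (base R only)
set.seed(42)
T <- 10; N <- 500                        # 10 years, 500 firms per year
sim <- data.frame(year = rep(1:T, each = N),
                  x    = rnorm(T * N))
sim$y <- 1 + 2 * sim$x + rnorm(T * N)    # true slope is 2

## One cross-sectional slope per year
slopes <- sapply(split(sim, sim$year),
                 function(d) coef(lm(y ~ x, data = d))["x"])

fm_est <- mean(slopes)                   # Fama-MacBeth point estimate
fm_se  <- sd(slopes) / sqrt(T)           # Fama-MacBeth standard error
```

With real data, `fm_est` and `fm_se` should line up with the `pmg()` output above.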

As a last remark, it may be a good idea to introduce a type='HC5', implementing the exact Stata small-sample correction procedure, to allow users to benchmark R output against Stata results.


13 responses to “Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R”

  • Joe

    This is great, thanks!

  • Kevin Goulding

    I agree that it is always nice to be able to benchmark results from R against other statistics programs (stata, SAS, SPSS, etc.). Good work here!

  • iangow

    I was able to replicate the results of Mitchell Petersen’s Stata code (http://iangow.wordpress.com/2012/01/19/iv-regression-and-two-way-cluster-robust-standard-errors/). As discussed in that post, I based my code on that of Mahmood Arai and was pleasantly surprised that I could use the same routine for instrumental variable regressions.

  • iaw4

    alas, a simple program to do this by hand is about 5 times the speed of the generic panel plm version :-(

  • Chris

    Is coeftest able to handle clustered SEs on a non-index variable? In addition, is plm able to handle more than two indices, i.e. firm ID, country, time index?

    • landroni

      I queried Giovanni Millo, co-author of ‘plm’, and here’s what he had to say:
      “Yes it is, provided this IS declared as the index variable. Suppose you want to cluster on industry inside a financial dataset and the variable is called ‘indus’; then specify only one index name, index="indus", in the call to either plm() or pdata.frame(). ‘plm’ will add a second index, called “time”, which is meaningless here but won’t hurt anybody, and every procedure clustering on industry will be feasible.
      Handling more than two indices is in principle not difficult, but it is not implemented yet, as it requires a different approach to the interface than that of ‘plm’, where everything is designed for seamlessly handling datasets with two of them. I will probably add it “soon”.”

    • landroni

      Another reply coming from Giovanni:
      “The short answer to the “clustering at higher level” issue is:
      (assume you have a pdata.frame ‘mydata’ with indices “ind” and “tind”, while “clst” is the clustering variable)

      forget about the “true” panel structure, wipe it out and substitute it with:

      > mydata <- pdata.frame(mydata, index="clst")
      # do not specify "time" as second index! they are not compatible (duplicated years)

      you get a "fake" panel where you can cluster on "clst" as you wish. Notice, you still have the original index variables in 'mydata' for reverting.

      A longer explanation:

      the algebra is very simple, while the interface problem is nasty. Panel models objects in 'plm' bring a lot of info with them, including data used in estimation and indices. Therefore it is possible to apply a vcov() method to an estimated model getting robust SEs. Using a different clustering variable would mean finding it inside the model object; among the regressors, if it doesn't have to coincide with one of the indices! But it won't probably be a regressor; and a higher-level clustering variable is sure to be time-invariant at the individual index level, hence it would probably be discarded by the internals.

      Stata has it simpler because of its limited – and limiting – approach of "one dataset at a time"; hence it can afford to specify one variable in the cluster(.) option, being sure it will be compatible. If I remember correctly, Arai's solution also does everything inside a single procedure. The challenge is to have interoperable "pieces" (objects) where each one brings the necessary information with it, and any function knows what to do. Hence in 'plm' we should probably think of a way of "keeping" the clustering variable inside the model object — but this would break too much stuff…

      The above solution does not allow to cluster at higher level while adding individual effects at lower level. If one wants a RE or FE model with different clustering:
      – estimate FE, RE model according to original indexing
      – extract demeaned data with pmodel.response(), model.matrix(), merge with original dataframe, redefine "clst" as individual index
      – estimate pooling model on demeaned data (equivalent to FE or RE on original data)
      – do clustered inference according to new clustering variable
      (this is left as exercise ;^))"
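      The recipe above can be sketched end to end in base R (a hypothetical illustration on simulated data; 'county' plays the role of the higher-level clustering variable, and the clustered vcov is assembled by hand rather than via vcovHC(), omitting any small-sample factor):

```r
## FE-equivalent estimation with clustering at a higher level
set.seed(1)
n_firm <- 100; n_year <- 8
d <- data.frame(firm = rep(1:n_firm, each = n_year),
                year = rep(1:n_year, n_firm))
d$county <- (d$firm - 1) %/% 10 + 1            # 10 firms per county
d$x <- rnorm(nrow(d))
d$y <- 2 * d$x + rnorm(n_firm)[d$firm] + rnorm(nrow(d))  # firm effects

## Steps 1-2: within (firm) demeaning of y and x
demean <- function(v, g) v - ave(v, g)
d$dy <- demean(d$y, d$firm); d$dx <- demean(d$x, d$firm)

## Step 3: pooled OLS on demeaned data (equivalent to FE on the original)
fit <- lm(dy ~ 0 + dx, data = d)

## Step 4: cluster-robust sandwich vcov at the county level, by hand
X <- model.matrix(fit); u <- resid(fit)
bread <- solve(crossprod(X))
meat  <- Reduce(`+`, lapply(split(seq_len(nrow(d)), d$county), function(i) {
  s <- crossprod(X[i, , drop = FALSE], u[i])   # per-cluster score
  s %*% t(s)
}))
vc <- bread %*% meat %*% bread
county_se <- sqrt(diag(vc))                    # county-clustered SE for dx
```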

  • Christina

    Thanks a lot for the response. The last section was what I was trying to do, clustering at a higher level but keeping the individual effects at the lower level. I’ll try out this method!

    • landroni

      Christina, I was curious if you got anywhere with this approach. Did you manage to make it work?

      • Christina

        Thank you for checking in. Yes, it worked!

        Here is the code I used:

        mod<-lm(demY~0+demX) #pooled model using time-demeaned data; same result as original FE plm model
        clx(mod,1,data$COUNTY) #gives cluster-robust SEs at county level

        ^with the clx function derived from Arai's method for lm objects.

    • landroni

      Thank you, Christina, for sharing this here. If you haven’t noticed it yet, you may be interested in the StackOverflow link above which gives an example of “clustering at higher level” using plm.
