Ever wondered how to estimate Fama-MacBeth or cluster-robust standard errors in R? It can actually be very easy.

First, for some background read Kevin Goulding’s blog post, Mitchell Petersen’s programming advice, and Mahmood Arai’s paper/note and code (there is an earlier version of the code with some more comments in it). For more formal references you may want to look into Thompson (2011, JFE) and Petersen (2008, WP). Both papers focus on estimating robust SEs using Stata.

After extensively discussing this with Giovanni Millo, co-author of `plm`, it turns out that released R packages (`plm`, `lmtest`, `sandwich`) can readily estimate clustered SEs. The results are not exactly the same as the Stata output, since in `plm` the options `HC0` through `HC4` for `vcovHC()` do not use the exact same weighting (by a function of sample size) that Stata uses for its small-sample correction. But the results are sensibly similar when using `HC1`.

It should be easy to (almost exactly) replicate M. Petersen’s benchmark results using the following code.

Import M. Petersen’s test data.

require(foreign)
require(plm)
require(lmtest)
test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")

Estimate the linear model using OLS. The second call estimates the Fama-MacBeth regression.

fpm <- plm(y ~ x, test, model='pooling', index=c('firmid', 'year'))
fpmg <- pmg(y ~ x, test, index=c("year", "firmid")) ## Fama-MacBeth

Define a function that estimates robust SEs with double-clustering.

## Double-clustering formula (Thompson, 2011)
vcovDC <- function(x, ...) {
    vcovHC(x, cluster = "group", ...) +
        vcovHC(x, cluster = "time", ...) -
        vcovHC(x, method = "white1", ...)
}

Estimate OLS standard errors, White standard errors, and standard errors clustered by group, by time, and by both group and time. Compare the R output with M. Petersen’s benchmark results from Stata.

> ## OLS, White and clustering
> coeftest(fpm)

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.028359  1.0466   0.2954    
x           1.034833   0.028583 36.2041   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> coeftest(fpm, vcov=function(x) vcovHC(x, method="white1", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.028361  1.0465   0.2954    
x           1.034833   0.028395 36.4440   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.066952  0.4433   0.6576    
x           1.034833   0.050550 20.4714   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="time", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.022189  1.3376   0.1811    
x           1.034833   0.031679 32.6666   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> coeftest(fpm, vcov=function(x) vcovDC(x, type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.064580  0.4596   0.6458    
x           1.034833   0.052465 19.7243   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As Giovanni interestingly pointed out to me (in a privately circulated draft paper), the Fama-MacBeth estimator is nothing more than what econometricians call the Mean Groups estimator, and `plm` can readily estimate it. You only need to swap the `group` and `time` indices. (See the `pmg()` call above.)

> ## Fama-MacBeth
> coeftest(fpmg)

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.031278   0.023356  1.3392   0.1806    
x           1.035586   0.033342 31.0599   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As a last remark, it may be a good idea to introduce a `type='HC5'` implementing the *exact* Stata small-sample correction procedure, to allow users to benchmark R output against Stata results.

June 7th, 2012 at 21:56

This is great, thanks!

June 9th, 2012 at 18:27

I agree that it is always nice to be able to benchmark results from R against other statistics programs (stata, SAS, SPSS, etc.). Good work here!

July 6th, 2012 at 07:41

I was able to replicate the results of Mitchell Petersen’s Stata code (http://iangow.wordpress.com/2012/01/19/iv-regression-and-two-way-cluster-robust-standard-errors/). As discussed in that post, I based my code on that of Mahmood Arai and was pleasantly surprised that I could use the same routine for instrumental variable regressions.

August 17th, 2013 at 01:20

Alas, a simple program to do this by hand is about five times faster than the generic panel plm version :-(

September 4th, 2013 at 23:20

Is coeftest able to handle clustered SEs on a non-index variable? In addition, is plm able to handle more than two indices, e.g. a firm ID, country, and time index?

September 6th, 2013 at 21:45

I queried Giovanni Millo, co-author of ‘plm’, and here’s what he had to say:

“Yes it is, provided this IS declared as the index variable. Suppose you want to cluster on industry inside a financial dataset and the variable is called ‘indus’; then specify only one index name, `index="indus"`, in the call to either plm() or pdata.frame(). ‘plm’ will add a second index, called “time”, which is meaningless here but won’t hurt anybody, and every procedure clustering on industry will be feasible.

Handling more than two indices is in principle not difficult, but it is not implemented yet, as it requires a different approach to the interface than that of ‘plm’, where everything is designed for seamlessly handling datasets with two of them. I will probably add it “soon”.”
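As a minimal sketch of Giovanni's suggestion (the data frame `fin` and its columns `indus`, `x` and `ret` are purely illustrative, not from the original discussion):

```r
library(plm)
library(lmtest)

## Toy data: 5 industries with 10 observations each (illustrative only)
set.seed(1)
fin <- data.frame(indus = rep(letters[1:5], each = 10), x = rnorm(50))
fin$ret <- 2 * fin$x + rnorm(50)

## Declare ONLY the clustering variable as the index;
## plm adds a second "time" index, which is meaningless here but harmless.
pfin <- pdata.frame(fin, index = "indus")
mod <- plm(ret ~ x, data = pfin, model = "pooling")

## Cluster on industry via the "group" dimension
coeftest(mod, vcov = function(m) vcovHC(m, cluster = "group", type = "HC1"))
```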

November 11th, 2013 at 23:31

However, if you want to calculate SEs clustered by country, for example, and firms are grouped within countries, simply defining `index="country"` will not work. You will get the “duplicate couples (time-id)” error, since there will be more than one observation per country-time pairing. Arai’s solution allows you to define any variable as the cluster variable, but his function does not work for a plm object. Any suggestions?

September 26th, 2014 at 18:27

I posted an example of the “clustering at higher level” approach on SO:

http://stackoverflow.com/questions/8389843/clustered-standard-errors-r-panel-data/11173348#11173348

November 16th, 2013 at 11:12

Another reply coming from Giovanni:

“The short answer to the “clustering at higher level” issue is:

(assume you have a pdata.frame ‘mydata’ with indices “ind” and “tind”, while “clst” is the clustering variable)

forget about the “true” panel structure, wipe it out and substitute it with:

mydata <- pdata.frame(mydata, index = "clst")

# do not specify "time" as the second index! they are not compatible (duplicated years)

you get a "fake" panel where you can cluster on "clst" as you wish. Notice, you still have the original index variables in 'mydata' for reverting.

A longer explanation:

the algebra is very simple, while the interface problem is nasty. Panel models objects in 'plm' bring a lot of info with them, including data used in estimation and indices. Therefore it is possible to apply a vcov() method to an estimated model getting robust SEs. Using a different clustering variable would mean finding it inside the model object; among the regressors, if it doesn't have to coincide with one of the indices! But it won't probably be a regressor; and a higher-level clustering variable is sure to be time-invariant at the individual index level, hence it would probably be discarded by the internals.

Stata has it simpler because of its limited – and limiting – approach of "one dataset at a time"; hence it can afford to specify one variable in the cluster(.) option, being sure it will be compatible. If I remember correctly, Arai's solution also does everything inside a same procedure. The challenge is to have interoperable "pieces" (objects) where each one brings the necessary information with it, and any function knows what to do. Hence in 'plm' we should probably think of a way of "keeping" the clustering variable inside the model object — but this would break too much stuff…

The above solution does not allow to cluster at higher level while adding individual effects at lower level. If one wants a RE or FE model with different clustering:

– estimate FE, RE model according to original indexing

– extract demeaned data with pmodel.response(), model.matrix(), merge with original dataframe, redefine "clst" as individual index

– estimate pooling model on demeaned data (equivalent to FE or RE on original data)

– do clustered inference according to new clustering variable

(this is left as exercise ;^))"

November 19th, 2013 at 17:24

Thanks a lot for the response. The last section was what I was trying to do, clustering at a higher level but keeping the individual effects at the lower level. I’ll try out this method!

September 26th, 2014 at 16:52

Christina, I was curious if you got anywhere with this approach. Did you manage to make it work?

September 26th, 2014 at 19:22

Thank you for checking in. Yes, it worked!

Here is the code I used:

library(plm)

demY <- pmodel.response(plm.model)
demX <- model.matrix(plm.model)
# pooled model using the time-demeaned data; same result as the original FE plm model
mod <- lm(demY ~ 0 + demX)
# gives cluster-robust SEs at the county level
clx(mod, 1, data$COUNTY)

^with the clx function derived from Arai's method for lm objects.
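For readers without Arai's code at hand, here is a minimal sketch of such a one-way cluster-robust helper for lm objects, using the standard degrees-of-freedom-corrected sandwich formula (this `clx` is an illustrative reconstruction following Arai's approach, not his exact code; `dfcw` is an extra correction factor, set to 1 when the data have not been demeaned):

```r
library(sandwich)
library(lmtest)

clx <- function(fm, dfcw, cluster) {
    M <- length(unique(cluster))  # number of clusters
    N <- length(cluster)          # number of observations
    K <- fm$rank                  # number of estimated coefficients
    ## small-sample correction, analogous to Stata's cluster() option
    dfc <- (M / (M - 1)) * ((N - 1) / (N - K))
    ## sum the score contributions within each cluster
    uj <- apply(estfun(fm), 2, function(s) tapply(s, cluster, sum))
    vcovCL <- dfc * dfcw * sandwich(fm, meat. = crossprod(uj) / N)
    coeftest(fm, vcov = vcovCL)
}
```

Called as `clx(mod, 1, data$COUNTY)`, it returns the coefficient table with SEs clustered on county.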

September 28th, 2014 at 19:44

Thank you, Christina, for sharing this here. If you haven’t noticed it yet, you may be interested in the StackOverflow link above which gives an example of “clustering at higher level” using plm.

July 9th, 2015 at 19:03

2 questions:

- According to the pmg documentation, indexes is “a vector (of length one or two) indicating the (individual and time) indexes”. As such I was expecting `index=c("firmid","year")` rather than `index=c("year","firmid")`…

- I’ve tried to code the Fama-MacBeth regression myself, using the step-by-step approach described at both http://www.jasonhsu.org/uploads/1/0/0/7/10075125/two_stage_fama_macbeth.pdf and https://en.wikipedia.org/wiki/Fama%E2%80%93MacBeth_regression, on Petersen’s data, but ended up with different results than yours. Are you using a different approach?

# mytest
# step 1: 500 firm-by-firm time-series regressions
betas <- matrix(nrow = 500, ncol = 2)
colnames(betas) <- c("Intercept", "beta")
for (i in 1:500) {
    betas[i, ] <- coef(lm(y ~ x, test[test$firmid == i, ]))
}
test$betas <- rep(betas[, 2], each = 10)

# step 2: 10 cross-sectional regressions
factor_premia <- matrix(nrow = 10)
colnames(factor_premia) <- "Ret_Premium"
for (yr in 1:10) {  # loop variable renamed from 'y' to avoid clashing with the regressand
    factor_premia[yr] <- coef(lm(y ~ betas, test[test$year == yr, ]))[2]
}

# factor risk premium, standard error and t-stat
factor_risk_premia <- mean(factor_premia)
standard_error <- sd(factor_premia) / sqrt(10)
t_stat <- factor_risk_premia / standard_error
c(factor_risk_premia, standard_error, t_stat)
# [1] -0.03912278  0.03083767 -1.26866857

July 10th, 2015 at 10:46

I am not sure if anything is wrong with your code, and I’ve never tried to do Fama-Macbeth “by hand”, but the Fama-Macbeth estimates here are the same as those of Petersen, so I assume that the coding approach in this post is correct.

“according to pmg documentation indexes is a vector (of length one or two) indicating the (individual and time) indexes.”

Indeed. This is why we need to “trick” `pmg()` by swapping the order of the `group` and `time` indices. Hence, to get Fama-MacBeth estimates, we need to use `index=c("year","firmid")`.

According to `?pmg`, “the standard Mean Groups estimator [is] based on the average of individual time series regressions”. So with the default setup (i.e. `index=c("firmid","year")`) we obtain a series of time-series regressions, one per group, and then take the average of the resulting estimates. By swapping the indices, we instead get a series of cross-sectional regressions, one per time period, and then take the average of those estimates.

For a different look at this topic, see also Accuracy of the different standard error estimates in R (based on Petersen’s simulation program).

July 10th, 2015 at 12:11

Thanks for your feedback and the link.

I have the feeling that the methodology used in pmg is different from what is usually done in finance for the Fama-MacBeth regression, as per the definition given in Wikipedia.

I will try to drop an email to the package maintainer of plm or Giovanni himself.

July 10th, 2015 at 12:46

Without having complete theoretical insight, it seems that theoretically Fama-Macbeth is indeed a fancy two-step setup, whereas empirically we can get away with a mere average of cross-section estimates.

See also this draft paper by Giovanni Millo which discusses these issues: Robust Standard Error Estimators for Panel Models: A Unifying Approach

“[A]s observed, Fama and MacBeth estimates are none other than a mean groups estimator where means are taken over time instead of, as is customary in panel time series econometrics, over individual observations. Therefore Petersen’s results can be replicated by swapping indices in the plm function pmg”

July 17th, 2015 at 11:07

@Paolo

On Petersen’s website I found this interesting programming note, along with a Stata implementation of Fama-MacBeth. It seems very similar to what `pmg()` is doing:

“Fama-MacBeth Standard Errors. Stata does not contain a routine for estimating the coefficients and standard errors by Fama-MacBeth (that I know of), but I have written an ado file which you can download. The ado file fm.ado runs a cross-sectional regression for each year in the data set.”
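That per-year cross-sectional procedure is straightforward to sketch in R (a sketch of the Petersen-style approach, assuming the `test` data loaded above; `fama_macbeth` is an illustrative helper name, not part of any package):

```r
## Petersen-style Fama-MacBeth: one cross-sectional regression of y on x
## per year, then average the coefficients over the years.
fama_macbeth <- function(df) {
    years <- unique(df$year)
    coefs <- t(sapply(years, function(yr) coef(lm(y ~ x, data = df[df$year == yr, ]))))
    est <- colMeans(coefs)                           # Fama-MacBeth point estimates
    se <- apply(coefs, 2, sd) / sqrt(length(years))  # time-series SEs of the means
    rbind(estimate = est, std.error = se)
}

## fama_macbeth(test)  # should line up with the coeftest(fpmg) output above
```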