Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R

Ever wondered how to estimate Fama-MacBeth or cluster-robust standard errors in R? It can actually be very easy.

First, for some background information read Kevin Goulding’s blog post, Mitchell Petersen’s programming advice, Mahmood Arai’s paper/note and code (there is an earlier version of the code with some more comments in it). For more formal references you may want to look into Thompson (2011, JFE) and Petersen (2008, WP). Both papers focus on estimating robust SE using Stata.

After an extensive discussion with Giovanni Millo, co-author of 'plm', it turns out that released R packages ('plm', 'lmtest', 'sandwich') can readily estimate clustered SEs. The results are not exactly the same as the Stata output, because in 'plm' the options 'HC0' through 'HC4' for 'vcovHC()' do not use the same small-sample weighting (a function of sample size) that Stata uses for its correction. But the results are reasonably similar when using 'HC1'.
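To see what this weighting amounts to in the plain White case, here is a minimal base-R sketch on simulated data (all variable names are mine): the White 'HC1' covariance is simply 'HC0' rescaled by n/(n - k).

```r
## Minimal base-R sketch on simulated data (all names are mine): the plain
## White "HC1" covariance is just "HC0" rescaled by n/(n - k).
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n) * (1 + abs(x))    # heteroskedastic errors
X <- cbind(1, x)                            # design matrix with intercept
fit  <- lm(y ~ x)
u    <- residuals(fit)
XtXi <- solve(crossprod(X))                 # (X'X)^{-1}, the "bread"
hc0  <- XtXi %*% crossprod(X * u) %*% XtXi  # meat = sum_i u_i^2 x_i x_i'
k    <- ncol(X)
hc1  <- hc0 * n / (n - k)                   # the small-sample rescaling
sqrt(diag(hc1))                             # White (HC1) standard errors
```

The 'HC2' through 'HC4' variants instead reweight each residual by a function of its leverage, which is where the R and Stata conventions can diverge.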

It should be easy to (almost exactly) replicate M. Petersen’s benchmark results using the following code.

Import M. Petersen’s test data.

library(foreign)  # read.dta() for Stata data files
library(plm)      # panel models and robust vcov estimators
library(lmtest)   # coeftest()

test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")

Estimate the linear model using pooled OLS. The second call estimates the Fama-MacBeth regression.

fpm <- plm(y ~ x, test, model='pooling', index=c('firmid', 'year'))
fpmg <- pmg(y~x, test, index=c("year","firmid")) ##Fama-MacBeth

Define a function that estimates robust SEs with double-clustering.

##Double-clustering formula (Thompson, 2011)
vcovDC <- function(x, ...){
    vcovHC(x, cluster="group", ...) + vcovHC(x, cluster="time", ...) - 
        vcovHC(x, method="white1", ...)
}

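The same double-clustering algebra can be hand-rolled in base R on simulated panel data, which makes the three sandwich pieces explicit (a sketch, not the 'plm' implementation; all names and the data are mine):

```r
## Hand-rolled double clustering on simulated panel data (a sketch of the same
## algebra, not the plm implementation; all names are mine).
set.seed(1)
G <- 50; T <- 10; n <- G * T
firm <- rep(1:G, each = T)
year <- rep(1:T, times = G)
x <- rnorm(n)
y <- 1 + x + rnorm(G)[firm] + rnorm(T)[year] + rnorm(n)  # firm and year shocks
X <- cbind(1, x)
fit <- lm(y ~ x)
u <- residuals(fit)
bread <- solve(crossprod(X))
cluster_vcov <- function(id) {
    scores <- rowsum(X * u, id)  # sum the score contributions within clusters
    bread %*% crossprod(scores) %*% bread
}
white <- bread %*% crossprod(X * u) %*% bread
vc_double <- cluster_vcov(firm) + cluster_vcov(year) - white
sqrt(diag(vc_double))  # SEs clustered by firm and time
```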
Estimate OLS standard errors, White standard errors, standard errors clustered by group, by time, and by group and time. Compare the R output with M. Petersen’s benchmark results from Stata.

> ##OLS, White and clustering
> coeftest(fpm)

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.028359  1.0466   0.2954    
x           1.034833   0.028583 36.2041   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

> coeftest(fpm, vcov=function(x) vcovHC(x, method="white1", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.028361  1.0465   0.2954    
x           1.034833   0.028395 36.4440   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.066952  0.4433   0.6576    
x           1.034833   0.050550 20.4714   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="time", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.022189  1.3376   0.1811    
x           1.034833   0.031679 32.6666   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

> coeftest(fpm, vcov=function(x) vcovDC(x, type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.064580  0.4596   0.6458    
x           1.034833   0.052465 19.7243   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

As Giovanni interestingly pointed out to me (in a privately circulated draft paper), it seems that the Fama-MacBeth estimator is nothing more than what econometricians call the Mean Groups estimator, and 'plm' can readily estimate this. You only need to swap the 'group' and 'time' indices. (See pmg() call above.)

> ##Fama-MacBeth
> coeftest(fpmg)

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.031278   0.023356  1.3392   0.1806    
x           1.035586   0.033342 31.0599   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

As a last remark, it may be a good idea to introduce a type='HC5', implementing the exact Stata small-sample correction procedure, to allow users to benchmark R output against Stata results.
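For reference, the multiplier such a type would have to implement (my reading of Stata's vce(cluster) finite-sample adjustment) is c = G/(G-1) * (N-1)/(N-K), where G is the number of clusters, N the number of observations and K the number of estimated parameters:

```r
## Sketch of the correction a type='HC5' would need (my reading of Stata's
## vce(cluster) adjustment): scale the clustered vcov by
## c = G/(G-1) * (N-1)/(N-K).
stata_factor <- function(G, N, K) G / (G - 1) * (N - 1) / (N - K)
## e.g. Petersen's data clustered by firm: G = 500 firms, N = 5000 obs, K = 2
stata_factor(G = 500, N = 5000, K = 2)
```

Multiplying a clustered covariance matrix by this factor should bring the R estimates closer to the Stata benchmark.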


27 responses to “Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R”

  • Joe

    This is great, thanks!

  • Kevin Goulding

    I agree that it is always nice to be able to benchmark results from R against other statistics programs (stata, SAS, SPSS, etc.). Good work here!

  • iangow

    I was able to replicate the results of Mitchell Petersen’s Stata code (http://iangow.wordpress.com/2012/01/19/iv-regression-and-two-way-cluster-robust-standard-errors/). As discussed in that post, I based my code on that of Mahmood Arai and was pleasantly surprised that I could use the same routine for instrumental variable regressions.

  • iaw4

    alas, a simple program to do this by hand is about 5 times the speed of the generic panel plm version 😦

  • Chris

    Is coeftest able to handle clustered SEs on a non-index variable? In addition, is plm able to handle more than 2 indices, i.e. firm ID, country, time index?

    • landroni

      I queried Giovanni Millo, co-author of ‘plm’, and here’s what he had to say:
      “Yes it is, provided this IS declared as the index variable. Suppose you want to cluster on industry inside a financial dataset and the variable is called ‘indus’; then specify only one index name, index="indus", in the call to either plm() or pdata.frame(). ‘plm’ will add a second index, called “time”, which is meaningless here but won’t hurt anybody, and every procedure clustering on industry will be feasible.
      Handling more than two indices is in principle not difficult, but it is not implemented yet, as it requires a different approach to the interface than that of ‘plm’, where everything is designed for seamlessly handling datasets with 2 of them. I will probably add it “soon”.”

    • landroni

      Another reply coming from Giovanni:
      “The short answer to the “clustering at higher level” issue is:
      (assume you have a pdata.frame ‘mydata’ with indices “ind” and “tind”, while “clst” is the clustering variable)

      forget about the “true” panel structure, wipe it out and substitute it with:

      > mydata <- pdata.frame(mydata, index="clst")
      # do not specify "time" as second index! they are not compatible (duplicated years)

      you get a "fake" panel where you can cluster on "clst" as you wish. Notice, you still have the original index variables in 'mydata' for reverting.

      A longer explanation:

      the algebra is very simple, while the interface problem is nasty. Panel models objects in 'plm' bring a lot of info with them, including data used in estimation and indices. Therefore it is possible to apply a vcov() method to an estimated model getting robust SEs. Using a different clustering variable would mean finding it inside the model object; among the regressors, if it doesn't have to coincide with one of the indices! But it won't probably be a regressor; and a higher-level clustering variable is sure to be time-invariant at the individual index level, hence it would probably be discarded by the internals.

      Stata has it simpler because of its limited – and limiting – approach of "one dataset at a time"; hence it can afford to specify one variable in the cluster(.) option, being sure it will be compatible. If I remember correctly, Arai's solution also does everything inside a same procedure. The challenge is to have interoperable "pieces" (objects) where each one brings the necessary information with it, and any function knows what to do. Hence in 'plm' we should probably think of a way of "keeping" the clustering variable inside the model object — but this would break too much stuff…

      The above solution does not allow to cluster at higher level while adding individual effects at lower level. If one wants a RE or FE model with different clustering:
      – estimate FE, RE model according to original indexing
      – extract demeaned data with pmodel.response(), model.matrix(), merge with original dataframe, redefine "clst" as individual index
      – estimate pooling model on demeaned data (equivalent to FE or RE on original data)
      – do clustered inference according to new clustering variable
      (this is left as exercise ;^))"
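Giovanni's four-step recipe can be sketched in base R on simulated data (all names are mine; the by-hand within transformation stands in for pmodel.response()/model.matrix(), and 'clst' plays the higher-level cluster, e.g. industry):

```r
## Base-R sketch of the four-step recipe on simulated data (all names are
## mine). Note: the naive df of the demeaned regression ignores the degrees
## of freedom absorbed by the fixed effects.
set.seed(2)
G <- 60; T <- 5; n <- G * T
firm <- rep(1:G, each = T)
clst <- rep(rep(1:10, each = 6), each = T)  # 10 "industries" of 6 firms each
x <- rnorm(n)
y <- rnorm(G)[firm] + x + rnorm(n)          # firm fixed effects

## Steps 1-2: demean y and x by firm (the FE/within transformation)
dem <- function(v, id) v - ave(v, id)
ydm <- dem(y, firm); xdm <- dem(x, firm)

## Step 3: pooled OLS on demeaned data (no intercept) = the FE estimator
fe <- lm(ydm ~ xdm - 1)

## Step 4: cluster the sandwich on 'clst' instead of 'firm'
X <- cbind(xdm)
u <- residuals(fe)
bread  <- solve(crossprod(X))
scores <- rowsum(X * u, clst)
vc     <- bread %*% crossprod(scores) %*% bread
sqrt(diag(vc))  # SE of the FE slope, clustered at the higher level
```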

  • Christina

    Thanks a lot for the response. The last section was what I was trying to do, clustering at a higher level but keeping the individual effects at the lower level. I’ll try out this method!

    • landroni

      Christina, I was curious if you got anywhere with this approach. Did you manage to make it work?

      • Christina

        Thank you for checking in. Yes, it worked!

        Here is the code I used:

        mod <- lm(demY ~ 0 + demX) #pooled model using time-demeaned data; same result as original FE plm model
        clx(mod, 1, data$COUNTY)   #gives cluster-robust SEs at county level

        ^with the clx function derived from Arai's method for lm objects.

    • landroni

      Thank you, Christina, for sharing this here. If you haven’t noticed it yet, you may be interested in the StackOverflow link above which gives an example of “clustering at higher level” using plm.

  • Paolo

    2 questions:
    - According to the pmg documentation, indexes is a vector (of length one or two) indicating the (individual and time) indexes. As such I was expecting index=c("firmid","year") rather than index=c("year","firmid")…
    - I've tried to code the Fama-MacBeth regression myself using the step-by-step approach described at both http://www.jasonhsu.org/uploads/1/0/0/7/10075125/two_stage_fama_macbeth.pdf and https://en.wikipedia.org/wiki/Fama%E2%80%93MacBeth_regression on Petersen's data, but ended up with different results than yours. Are you using a different approach?

    #step1: 500 time-series regressions
    betas <- matrix(nrow=500, ncol=2)
    colnames(betas) <- c("Intercept", "beta")
    for (i in 1:500) {
        betas[i,] <- coef(lm(y~x, test[test$firmid==i,]))
    }
    test$betas <- rep(betas[,2], each=10)

    #step2: 10 cross-section regressions
    factor_premia <- matrix(nrow=10)
    colnames(factor_premia) <- "Ret_Premium"
    for (y in 1:10) {
        factor_premia[y] <- coef(lm(y~betas, test[test$year==y,]))[2]
    }

    #factor risk premia, stderr and t-stat
    factor_risk_premia <- mean(factor_premia)
    standard_error <- sd(factor_premia) / sqrt(10)
    t_stat <- factor_risk_premia / standard_error
    c(factor_risk_premia, standard_error, t_stat)

    [1] -0.03912278 0.03083767 -1.26866857

    • landroni

      I am not sure if anything is wrong with your code, and I’ve never tried to do Fama-Macbeth “by hand”, but the Fama-Macbeth estimates here are the same as those of Petersen, so I assume that the coding approach in this post is correct.

      according to pmg documentation indexes is a vector (of length one or two) indicating the (individual and time) indexes.

      Indeed. This is why we need to “trick” pmg() by swapping the order of the group and time indices. Hence to get Fama-Macbeth estimates we need to use index=c("year","firmid").

      According to ?pmg, “the standard Mean Groups estimator, [is] based on the average of individual time series regressions”. So it seems that using default setup (i.e. index=c(“firmid”,”year”)) we obtain a series of times series regressions for each group, and then take the average of the resulting estimates. By swapping the indexes, we get instead a series of cross-sectional regressions for each time period, and then take the average of the resulting estimates.

      For a different look at this topic, see also Accuracy of the different standard error estimates in R (based on Petersen’s simulation program).

      • Paolo

        Thanks for your feedback and the link.

        I have the feeling the methodology being used in pmg is different compared to what is usually done in finance for the Fama-MacBeth regression as per the definition given in Wikipedia.
        I will try to drop an email to the package maintainer of plm or Giovanni himself.

      • landroni

        Without having complete theoretical insight, it seems that theoretically Fama-Macbeth is indeed a fancy two-step setup, whereas empirically we can get away with a mere average of cross-section estimates.

        See also this draft paper by Giovanni Millo which discusses these issues: Robust Standard Error Estimators for Panel Models: A Unifying Approach

        “[A]s observed, Fama and MacBeth estimates are none other than a mean groups estimator where means are taken over time instead of, as is customary in panel time series econometrics, over individual observations. Therefore Petersen’s results can be replicated by swapping indices in the plm function pmg”

      • landroni


        On Petersen’s website I found this interesting programming note along with a Stata implementation of Fama-Macbeth. This seems very similar to what pmg() is doing:

        “Fama-MacBeth Standard Errors

        Stata does not contain a routine for estimating the coefficients and standard errors by Fama-MacBeth (that I know of), but I have written an ado file which you can download. The ado file fm.ado runs a cross-sectional regression for each year in the data set.”
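      For intuition, here is what such a routine looks like in base R on simulated data (all names are mine): one cross-sectional regression per period, then the time-series average. pmg() with swapped indices computes the same average.

```r
## Base-R sketch of a Fama-MacBeth routine on simulated data (all names are
## mine): one cross-sectional regression per period, then average the
## coefficients over time and use sd/sqrt(T) as the standard error.
set.seed(3)
G <- 100; T <- 10
dat <- data.frame(firm = rep(1:G, times = T),
                  year = rep(1:T, each = G))
dat$x <- rnorm(nrow(dat))
dat$y <- 1 + 2 * dat$x + rnorm(nrow(dat))

## First pass: T cross-sectional regressions, one per year
coefs <- t(sapply(split(dat, dat$year),
                  function(d) coef(lm(y ~ x, data = d))))

## Second pass: time-series average of the estimates
fm_est <- colMeans(coefs)
fm_se  <- apply(coefs, 2, sd) / sqrt(T)
cbind(Estimate = fm_est, Std.Error = fm_se, t = fm_est / fm_se)
```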

  • finnoob

    I am new to this world and a bit confused. By pmg() doing Fama-MacBeth regressions, do you mean that it does the two-pass approach as in Fama-MacBeth (1973) (https://en.wikipedia.org/wiki/Fama%E2%80%93MacBeth_regression)?

  • jimmy

    Hello, how can I get Newey-West (1987) corrected standard errors from fpmg <- pmg(y~x, test, index=c("year","firmid")) ##Fama-MacBeth ?

    • landroni

      It should work like coeftest(fpmg, vcov. = vcovNW), but unfortunately there is no method currently implemented. This might be addressed in a future release.
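      Until such a method exists, one hand-rolled workaround (a sketch, not a plm feature) is to apply a Bartlett-kernel (Newey-West) HAC variance to the time series of per-period coefficient estimates from the first-pass regressions:

```r
## Hand-rolled workaround (a sketch, not a plm feature): Bartlett-kernel
## (Newey-West) HAC variance of the mean of the per-period coefficient
## series gamma_t, with lag truncation L.
nw_var_mean <- function(g, L) {
    T <- length(g)
    u <- g - mean(g)
    s <- sum(u^2) / T                         # lag-0 term
    if (L > 0) for (l in 1:L) {
        w <- 1 - l / (L + 1)                  # Bartlett weight
        s <- s + 2 * w * sum(u[(l + 1):T] * u[1:(T - l)]) / T
    }
    s / T                                     # HAC variance of the mean
}
## usage: sqrt(nw_var_mean(gamma, L = 3)) would be the NW standard error of
## the Fama-MacBeth estimate (gamma = per-period slope estimates, 3 lags)
```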

  • Zabbak

    Hi Landroni,
    Recently I am new to R and have just come to understand the Fama-MacBeth method. You show us a good example; many thanks.

    But I am really confused:

    fpm <- plm(y ~ x, test, model='pooling', index=c('firmid', 'year'))
    fpmg <- pmg(y~x, test, index=c("year","firmid")) ##Fama-MacBeth

    For these two lines, I don't understand what index=c("year","firmid") and index=c('firmid', 'year') are doing. What role do the two index=() arguments play, why do we need such an index, and why are the positions opposite?

    I just want to use Fama-MacBeth to test the Fama three-factor model and the CAPM, but I don't know the suitable index order for my data, or why we need to specify one. Really confused.

    Many thanks for your help.

    • landroni

      The index argument is being passed directly to pdata.frame(), which is an intermediary step that you can do manually:

      test_fpm <- pdata.frame(test, index=c('firmid', 'year'))
      test_fpmg <- pdata.frame(test, index=c("year","firmid")) ##Fama-MacBeth

      In the first case, plm associates “individuals” with the ‘firmid’ var, and “time” with the ‘year’ var. In the second case we attempt to trick plm into doing the opposite (for the purpose of the Fama-MacBeth regressions), i.e. associate “individuals” with the ‘year’ var, and “time” with the ‘firmid’ var. This is then used internally by plm when estimating the requested regressions. See ?pdata.frame for more details and the plm vignette.

      • Zabbak

        Thank you landroni. I am not quite sure of your meaning, but I guess the first one is a general time-series regression and the second is Fama-MacBeth. But I am still confused: what is 'firmid' doing here, and what does "individual" mean here?

        Could you show me an example? If y is countries' GDP (say, GDP from 2001-2016 for America, China, England, etc.) and x is an explanatory variable over 2001-2016, with year as time (2001-2016), what would firmid be?

        Many thanks for your help.

  • Alfie

    I am currently conducting a Fama-MacBeth regression using the method described here. However, I run into a weird result:
    t test of coefficients:

    Estimate Std. Error t value Pr(>|t|)
    (Intercept) -0.0076154 0.0118524 -0.6425 0.5206

    I only get “NA” values for the x coefficients, and I have absolutely no idea why. Any help is highly appreciated.
    Regards, Alfie

    • landroni

      It is hard to say without a reproducible example. It is possible there is a linear combination and the `GOLD` coefficient is undefined. You may have better luck asking your question on http://stats.stackexchange.com, but only if you can provide a minimal reproducible example and can frame your question from a Statistics perspective…
