consistency and the linear probability model

an explainer about ordinary least squares regression and when it is an acceptable estimator

Published: August 31, 2019

Summary

A while back Twitter once again lost its collective mind and decided to rehash the logistic regression versus linear probability model debate for the umpteenth time. The genesis for this new round of chaos was Gomila (2019), a pre-print by Robin Gomila, a grad student in psychology at Princeton.

So I decided to learn some more about the linear probability model (LPM), which has been on my todo list since Michael Weylandt suggested it as an example for a paper I’m working on. In this post I’ll introduce the LPM, and then spend a bunch of time discussing when OLS is a consistent estimator.

Update: I turned this blog post into somewhat flippant, meme-heavy slides.

What is the linear probability model?

Suppose we have outcomes $Y \in \{0, 1\}$ and fixed covariate vectors $X$. The linear probability model is a model, that is, a set of probability distributions that might have produced our observed data. In particular, the linear probability model assumes that the data generating process looks like:

$$
P(Y = 1 \mid X) =
\begin{cases}
1 & \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k > 1 \\
\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k & \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k \in [0, 1] \\
0 & \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k < 0
\end{cases}
$$

Essentially we clip P(Y=1|X)=Xβ to [0,1] to make sure we get valid probabilities. We can contrast the LPM with the binomial GLM, where the assumption is that:

$$
P(Y = 1 \mid X) = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k))}
$$

At points throughout this post we will also be interested in Gaussian GLMs. In slightly different notation, we can write all of these models as follows:

$$
\begin{aligned}
Y_i \mid X_i &\sim \mathrm{Bernoulli}\big(\min(1, \max(0, X_i \beta))\big) && \text{linear probability model} \\
Y_i \mid X_i &\sim \mathrm{Bernoulli}\big(\mathrm{logit}^{-1}(X_i \beta)\big) && \text{logistic regression / binomial GLM} \\
Y_i \mid X_i &\sim \mathrm{Normal}(X_i \beta, \sigma^2) && \text{linear regression / Gaussian GLM}
\end{aligned}
$$
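
To make the three data generating processes concrete, here is a minimal simulation sketch in Python. The coefficients and the covariate distribution are made up; the point is just to show where the three models differ.

```python
import numpy as np

rng = np.random.default_rng(27)

n = 1_000
x = rng.uniform(-2, 2, size=n)
X = np.column_stack([np.ones(n), x])  # design matrix with an intercept
beta = np.array([0.4, 0.3])           # made-up coefficients
eta = X @ beta                        # the linear predictor X beta

# linear probability model: clip the linear predictor into [0, 1]
y_lpm = rng.binomial(1, np.clip(eta, 0, 1))

# logistic regression / binomial GLM: inverse logit of the linear predictor
y_logistic = rng.binomial(1, 1 / (1 + np.exp(-eta)))

# linear regression / Gaussian GLM: add Gaussian noise to the linear predictor
y_gaussian = rng.normal(loc=eta, scale=0.5)
```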

The OLS estimator

When people refer to the linear probability model, they are referring to using the Ordinary Least Squares estimator as an estimator for β, or using Xβ^OLS as an estimator for E(Y|X)=P(Y=1|X). The OLS estimator is:

$$
\hat{\beta}_{OLS} = (X^\top X)^{-1} X^\top Y.
$$
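
Continuing the simulation sketch above, computing the estimator is a couple of lines of linear algebra (solving the normal equations rather than forming the inverse explicitly):

```python
# OLS estimator: solve the normal equations (X'X) beta_hat = X'Y.
# Fit here to the binary outcomes simulated from the LPM above;
# any of the three outcome vectors works the same way.
beta_hat_ols = np.linalg.solve(X.T @ X, X.T @ y_lpm)
```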

Most people have seen the OLS estimator derived as the MLE of a Gaussian linear model. Here we have binary data, which is definitely non-Gaussian, and this is aesthetically somewhat unappealing. The big question is whether or not using β^OLS on binary data actually works.

A bare minimum for inference

When we do inference, we pretty much always want two things to happen. Given our model (the assumptions we are willing to make about the data generating process):

  1. our estimand should be identified, and
  2. our estimator should be consistent.

Identification means that each possible value of the estimand should correspond to a distinct distribution in our probability model. When we say that we want a consistent estimator, we mean that our estimator should recover the estimand exactly with infinite data. All the estimands in the LPM debate are identified (to my knowledge), so identification isn't the big issue here. But consistency matters a lot!

Gomila (2019) claims that β^OLS is unbiased and consistent for β, and attempts to demonstrate this in two ways: (1) analytically, and (2) via simulation.

And this is the point where I got pretty confused, because the big question is: consistent under what model? Depending on who you ask, we could conceivably be assuming that the data comes from:

  • a Gaussian GLM (linear regression),
  • the linear probability model, or
  • a binomial GLM (logistic regression).

Consistency of the OLS estimator

The easiest case is when we assume that a Gaussian GLM (linear regression model) holds. In this case, β^OLS is unbiased and consistent. My preferred reference for this is Rencher and Schaalje (2008).

When the linear probability model holds, β^OLS is in general biased and inconsistent (Horrace and Oaxaca 2003). However, Gomila (2019) claims that OLS is unbiased and consistent. Over Twitter DM I clarified that this claim is with respect to the linear probability model. Gomila referred me to Wooldridge (2001) for an analytic proof, and to his simulation results for empirical confirmation.

At this point the main result of Horrace and Oaxaca (2003) becomes germane. It goes like this: the OLS estimator is consistent and unbiased under the linear probability model if $X_i^\top \beta \in [0, 1]$ for all $i$; otherwise the OLS estimator is biased and inconsistent. In fact, Wooldridge (2001) makes the same point:

Unless the range of X is severely restricted, the linear probability model cannot be a good description of the population response probability P(Y=1|X).

Here are some simulations demonstrating this bias and inconsistency when the $X_i^\top \beta \in [0, 1]$ condition is violated. It's pretty clear that OLS is in general biased and inconsistent under the linear probability model. I was curious why this wasn't showing up in Gomila's simulations, so I took a look at his code and it turned out he wasn't simulating from diverse enough data generating processes.
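
For instance, here is a minimal sketch of a data generating process where the clipping actually binds. The coefficients and covariate distribution are made up, but OLS stays biased no matter how much data you throw at it:

```python
import numpy as np

rng = np.random.default_rng(27)
beta_true = np.array([0.5, 0.8])  # X beta escapes [0, 1] over much of the design

estimates = []
for _ in range(500):
    n = 5_000
    x = rng.uniform(-2, 2, size=n)
    X = np.column_stack([np.ones(n), x])
    p = np.clip(X @ beta_true, 0, 1)  # the LPM mean function
    y = rng.binomial(1, p)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

print("true beta:           ", beta_true)
print("average OLS estimate:", np.mean(estimates, axis=0))
# The average estimate stays away from beta_true no matter how large n gets,
# because E(Y | X) = min(1, max(0, X beta)) is not linear in X on this design.
```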

We discussed this over Twitter DM, and Gomila has since updated the code, but I believe the new simulations still do not seriously violate the $X_i^\top \beta \in [0, 1]$ condition. I briefly played around with the code but then it was 2 AM and I didn't understand DeclareDesign particularly well so I gave up. Many kudos to Gomila for posting his code for public review.

Anyway, the gist is that OLS is consistent under the linear probability model if $X_i^\top \beta$ is between zero and one for all $i$.

What if logistic regression is the true model?

Another reasonable question is: what happens to β^OLS under a binomial GLM? The story here is pretty much the same: β^OLS is not consistent for β, but it often does a decent job of estimating E(Y|X) anyway (Cox 1958; Battey, Cox, and Jackson 2019).

The intuition behind this comes from M-estimation, which we digress into momentarily. The idea is to observe that β^OLS is an M-estimator. To see this, recall that OLS is based on minimizing

$$
L(X, Y, \beta) = \frac{1}{2} \lVert Y - X \beta \rVert_2^2
$$

which has a global minimizer β0. The gradient,

$$
\nabla L(X, Y, \beta) = -(Y - X \beta)^\top X,
$$

should be zero at β0 under any model with linear expectation (provided that $X_i \neq 0$ for all $i$). So we take $\psi = \nabla L(X, Y, \beta_0)$ as the function in our estimating equation, since $E(\psi) = 0$. This lets us leverage standard results from M-estimation theory.

In particular, M-estimation theory tells us that β^OLS is consistent under any regular distribution F such that β0 is the unique solution to

$$
E_F \left[ (Y_i - X_i^\top \beta_0) \, X_i \right] = 0.
$$

We can use this to derive a sufficient condition for consistency; namely OLS is consistent for β0 if

$$
\begin{aligned}
0 &= E_F \left[ (Y_i - X_i^\top \beta_0) \, X_i \right] \\
&= E_F \left[ E_F \left[ (Y_i - X_i^\top \beta_0) \, X_i \mid X_i \right] \right] \\
&= E_F \left[ E_F \left[ Y_i - X_i^\top \beta_0 \mid X_i \right] X_i \right].
\end{aligned}
$$

So a sufficient condition for the consistency of OLS is that

$$
E_F(Y_i \mid X_i) = X_i^\top \beta_0.
$$

That is, if the expectation is linear in some model parameter β0, OLS is consistent for that parameter. If you are an economist you’ve probably seen this written as

$$
E_F(\varepsilon_i \mid X_i) = 0,
$$

where we take $\varepsilon_i = Y_i - E(Y_i \mid X_i)$.

The sufficient condition is also a necessary condition, and if it is violated β^OLS will generally be asymptotically biased.

Returning from our digression into M-estimation, we note the condition that $E_F(Y_i \mid X_i) = X_i^\top \beta_0$ is shockingly weaker than the assumption we used to derive β^OLS, where we used the fact that Y followed a Gaussian distribution. For consistency, we only need an assumption on the first moment of the distribution of Y, rather than the full parametric form of the model, which specifies every moment $E(Y^2), E(Y^3), E(Y^4), \dots$ of Y.

Anyway, the expectation of a binomial GLM is $\mathrm{logit}^{-1}(X_i \beta)$, and the expectation of the LPM is $\min(1, \max(0, X_i \beta))$. For certain $X_i$ and $\beta$, these expectations are very close to $X_i \beta$. In these cases, estimates for E(Y|X) based on β^OLS will do well.

Looking at the expectations of the two models side by side, we can see they aren't wildly different.
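
Here is a quick numerical sketch of that comparison. The coefficients are my own choices: the clipped line is picked to match the value and slope of the inverse logit at zero.

```python
import numpy as np

# Compare the logistic and (clipped) linear mean functions on a grid.
x = np.linspace(-4, 4, 17)
logistic_mean = 1 / (1 + np.exp(-x))      # binomial GLM expectation
lpm_mean = np.clip(0.5 + 0.25 * x, 0, 1)  # LPM expectation, matched at x = 0

for xi, m_logit, m_lpm in zip(x, logistic_mean, lpm_mean):
    print(f"x = {xi:5.1f}   logistic: {m_logit:.3f}   LPM: {m_lpm:.3f}")
```

The curves agree closely in the middle of the range and differ most where the clipped line hits 0 or 1 exactly.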

Note that, under the LPM, β^OLS can give us good estimates for β, but under logistic regression, we only get good estimates of E(Y|X).

So maybe don’t toss the LPM estimator out with the bath water. Sure, the thing is generally inconsistent and aesthetically offensive, but whatever, it works on occasion, and sometimes there will be other practical considerations that make this tradeoff reasonable.

Where the estimator comes from doesn’t matter

Bayes Twitter in particular had a number of fairly vocal LPM opponents, on the basis that β^OLS was derived under a model that can’t produce the observed data.

This might seem like a dealbreaker, but it doesn’t bother me. Where the estimator comes from doesn’t actually matter. If it has nice properties given the assumptions you are willing to make, you should use it! Estimators derived under unrealistic models often turn out to be good!

In a bit more detail, here’s how I think of model checking:

  1. There’s an estimand I want to know
  2. I make some assumptions about my data generating process
  3. I pick an estimator that has nice properties given this data generating process

The issue here is that my assumptions about the data generating process can be wrong. And if my modeling assumptions are wrong, then my estimator might not have the nice properties I want it to have, and this is bad.

The concern is that using β^OLS corresponds to making a bad modeling assumption, namely that using a Gaussian model for binary data isn't defensible.

That’s not really what’s going on though. Instead, we start by deriving an estimator under the linear regression model. Then, we show this estimator has nice properties under a new, different model. To do model checking, we need to test whether the new, different model holds. Whether or not the data comes from a linear regression is immaterial.

Nobody who likes using β^OLS is arguing that it’s reasonable to assume a Gaussian generative model for binary data. LPM proponents argue that the Gaussian modeling assumption is violated, but it isn’t violated in an important way. Look, they say, the key assumptions of a Gaussian model that we used to derive consistency results can still hold, even if some other assumptions get violated. This is exactly what the M-estimation approach formalizes.

What about uncertainty in our estimator?

So far we have established that Xβ^OLS is at least occasionally a consistent estimator for P(Y=1|X) under both the LPM and logistic regression.

In practice, we also care about the uncertainty in β^. We might want a confidence interval, or, God forbid, a p-value. So we should think about consistent estimators for Cov(β^). In the grand LPM debate, most people suggest using robust standard errors. For a while, it was unclear to me how these robust standard errors were derived and under what conditions they are consistent. The answer again comes from Boos and Stefanski (2013), and all we need for the consistency of robust standard errors for Cov(β^) is some moment conditions (the existence of (7.5) and (7.6) in the text), which linear regression, logistic regression, and the LPM all satisfy.
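
For concreteness, here is a minimal numpy sketch of the sandwich recipe in the OLS case (the HC0 flavor). This is my own translation of the $A(\theta_0)^{-1} B(\theta_0) \{A(\theta_0)^{-1}\}^\top$ form from the footnote, not code from Boos and Stefanski (2013).

```python
import numpy as np

def ols_with_robust_se(X, y):
    """OLS coefficients plus HC0 sandwich (robust) standard errors.

    The bread is (X'X)^{-1}, the meat is sum_i e_i^2 x_i x_i', and the
    estimated covariance of beta_hat is bread @ meat @ bread.
    """
    bread = np.linalg.inv(X.T @ X)
    beta_hat = bread @ X.T @ y
    resid = y - X @ beta_hat
    meat = (X * resid[:, None] ** 2).T @ X
    cov = bread @ meat @ bread
    return beta_hat, np.sqrt(np.diag(cov))
```

Software packages typically report small-sample-corrected variants (HC1, HC2, and so on), but the asymptotics are the same.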

It only took several weeks of misreading Boos and Stefanski (2013) and asking dumb questions on Twitter to figure this out. Thanks to Achim Zeileis, James E. Pustejovsky, and Cyrus Samii for answering those questions. Note that White (1980) is another canonical paper on robust standard errors, but it doesn't click for me like the M-estimation framework does.

Takeaways

This post exists because I wasn’t sure when β^OLS was a consistent estimator, and the rest of the LPM debate seemed like a lot of noise until I could figure that out. I have three takeaways.

  1. Properties of estimators are always with respect to models, and it’s hard to discuss which estimator is best if you don’t clarify modeling assumptions.

  2. If there are multiple reasonable estimators, fit them all. If they result in substantively different conclusions: congratulations! Your data is trash and you can move on to a new project!

  3. If you really care about getting efficient, consistent estimates under weak assumptions, you should be doing TMLE or Double ML or burning your CPU to a crisp with BART.

Anyway, this concludes the “I taught myself about the linear probability model and hope to never mention it again” period of my life. I look forward to my mentions being a trash fire.

References

Battey, H. S., D. R. Cox, and M. V. Jackson. 2019. “On the Linear in Probability Model for Binary Data.” Royal Society Open Science 6 (5): 190067. https://doi.org/10.1098/rsos.190067.
Boos, Dennis D, and L. A Stefanski. 2013. Essential Statistical Inference. Vol. 120. Springer Texts in Statistics. New York, NY: Springer New York. https://doi.org/10.1007/978-1-4614-4818-1.
Cox, D. R. 1958. “The Regression Analysis of Binary Sequences.” Journal of the Royal Statistical Society. Series B (Methodological) 20 (2): 215–42. http://www.jstor.org/stable/2983890.
Gomila, Robin. 2019. “Logistic or Linear? Estimating Causal Effects of Binary Outcomes Using Regression Analysis.” Preprint. PsyArXiv. https://doi.org/10.31234/osf.io/4gmbv.
Horrace, William C., and Ronald L. Oaxaca. 2003. “New Wine in Old Bottles: A Sequential Estimation Technique for the LPM.” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=383102.
Rencher, Alvin C., and G. Bruce Schaalje. 2008. Linear Models in Statistics. 2nd ed. Hoboken, N.J: Wiley-Interscience. http://www.utstat.toronto.edu/~brunner/books/LinearModelsInStatistics.pdf.
White, Halbert. 1980. “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity.” Econometrica 48 (4): 817. https://doi.org/10.2307/1912934.
Wooldridge, Jeffrey M. 2001. Econometric Analysis of Cross Section and Panel Data. https://mitpress.mit.edu/books/econometric-analysis-cross-section-and-panel-data-second-edition.

Footnotes

  1. An M-estimator θ^ is a solution to

    $$
    \sum_{i=1}^n \psi(Y_i, \hat\theta) = 0
    $$

    for some function ψ. Think of ψ as a score function or the derivative of a proper scoring rule. If the true value θ0 is the unique solution to

    $$
    E \left( \sum_{i=1}^n \psi(Y_i, \theta_0) \right) = 0
    $$

    then $\hat\theta \to \theta_0$. So it's relatively easy to check if an M-estimator is consistent. Additionally, under some regularity conditions, provided that

    $$
    A(\theta_0) = E \left( -\frac{\partial}{\partial \theta} \psi(Y_i, \theta_0) \right)
    $$

    and

    $$
    B(\theta_0) = E \left( \psi(Y_i, \theta_0) \, \psi(Y_i, \theta_0)^\top \right)
    $$

    exist for all $i = 1, \dots, n$, then $\hat\theta$ is also asymptotically normal with variance $A(\theta_0)^{-1} B(\theta_0) \{A(\theta_0)^{-1}\}^\top$. We also have that

    $$
    A(\hat\theta)^{-1} B(\hat\theta) \{A(\hat\theta)^{-1}\}^\top \to A(\theta_0)^{-1} B(\theta_0) \{A(\theta_0)^{-1}\}^\top
    $$

    so life is pretty great and we can get asymptotic estimates of uncertainty for θ^.↩︎

  2. In a lot of experimental settings that compare categorical treatments, you fit a fully saturated model that allocates a full parameter to the mean of each combination of treatments. β^OLS is consistent in this setting under both logistic regression and the LPM.

    Let’s prove this for logistic regression first. Suppose we have groups 1,...,k and the data comes from a logistic regression model. We represent the groups in the model via one hot coding, omitting an intercept for identifability. That is, we assume

    $$
    P(Y = 1 \mid X) = \frac{1}{1 + \exp(-(\alpha_1 X_1 + \dots + \alpha_k X_k))}
    $$

    where $X_j = 1$ if the observation comes from group $j$ and is zero otherwise. Note that, if $X_j = 1$, then $X_i = 0$ for all $i \neq j$. To verify that the mean function is correctly specified, and thus that OLS is consistent, we need $E(Y \mid X) = X^\top \beta$.

    Put $\beta_j = \frac{1}{1 + \exp(-\alpha_j)}$, which is the mean of the $j$th group. Recall that $X_j$ is fixed, so that $P(X_j = 1) = X_j$. Then:

    $$
    \begin{aligned}
    E(Y \mid X) &= P(Y = 1 \mid X) \\
    &= P(Y = 1 \mid X_1 = 1) P(X_1 = 1) + \dots + P(Y = 1 \mid X_k = 1) P(X_k = 1) \\
    &= P(Y = 1 \mid X_1 = 1) X_1 + \dots + P(Y = 1 \mid X_k = 1) X_k \\
    &= \frac{1}{1 + \exp(-\alpha_1)} X_1 + \dots + \frac{1}{1 + \exp(-\alpha_k)} X_k \\
    &= \beta_1 X_1 + \dots + \beta_k X_k \\
    &= X^\top \beta
    \end{aligned}
    $$

    Thus $\hat\beta_{OLS} \to \beta$. So the OLS estimates are consistent for the group means, and $(\hat\beta_{OLS})_j \to \frac{1}{1 + \exp(-\alpha_j)}$.
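
    As a quick sanity check, here is a simulation sketch with hypothetical group-level parameters. It's a numerical illustration of the claim above, not part of the proof.

    ```python
    import numpy as np

    rng = np.random.default_rng(27)

    alpha = np.array([1.0, -0.5, 2.0])      # hypothetical logistic parameters
    group_means = 1 / (1 + np.exp(-alpha))  # beta_j = logit^{-1}(alpha_j)

    n = 200_000
    groups = rng.integers(0, 3, size=n)
    X = np.eye(3)[groups]                   # one-hot coding, no intercept
    y = rng.binomial(1, group_means[groups])

    beta_hat_ols = np.linalg.solve(X.T @ X, X.T @ y)
    print("group means:", group_means)      # roughly [0.731, 0.378, 0.881]
    print("OLS fit:    ", beta_hat_ols)     # close to the group means
    ```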

    Okay, now we’re halfway done but we still need to show consistency under the LPM. We assume the model is

    $$
    P(Y = 1 \mid X) = \min(1, \max(0, \gamma_1 X_1 + \dots + \gamma_k X_k))
    $$

    Putting $\beta_j = \min(1, \max(0, \gamma_j))$, we repeat the same procedure almost verbatim and see:

    $$
    \begin{aligned}
    E(Y \mid X) &= P(Y = 1 \mid X) \\
    &= P(Y = 1 \mid X_1 = 1) P(X_1 = 1) + \dots + P(Y = 1 \mid X_k = 1) P(X_k = 1) \\
    &= P(Y = 1 \mid X_1 = 1) X_1 + \dots + P(Y = 1 \mid X_k = 1) X_k \\
    &= \min(1, \max(0, \gamma_1)) X_1 + \dots + \min(1, \max(0, \gamma_k)) X_k \\
    &= \beta_1 X_1 + \dots + \beta_k X_k \\
    &= X^\top \beta
    \end{aligned}
    $$

    Thus $\hat\beta_{OLS} \to \beta$. So the OLS estimates are again consistent for the group means $\beta_j = \min(1, \max(0, \gamma_j))$.↩︎

  3. Michael Weylandt suggested the following example as an illustration of this. Suppose we are going to compare just two groups, so $X_i \in \{0, 1\}$, and the true model is $P(Y = 1 \mid X) = \mathrm{logit}^{-1}(1 - 2 X_i)$. Consider the infinite data limit so we don't have to worry about estimation error.

    We know that $P(Y_i = 1 \mid X_i = 0) = 0.731$ and $P(Y_i = 1 \mid X_i = 1) = 0.269$. $\hat\beta_{\text{logistic MLE}} = \beta = (1, -2)$ by the consistency of the MLE. The OLS fit will yield $P(Y_i = 1 \mid X_i) = 0.731 - 0.462 X_i$. The predictions for $E(Y \mid X) = P(Y = 1 \mid X)$ agree, but $\hat\beta_{OLS} = (0.731, -0.462)$ while $\beta = (1, -2)$.
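
    The arithmetic is easy to check numerically; here is a tiny sketch:

    ```python
    import numpy as np

    p0 = 1 / (1 + np.exp(-1.0))  # P(Y = 1 | X = 0) = logit^{-1}(1)  ~ 0.731
    p1 = 1 / (1 + np.exp(1.0))   # P(Y = 1 | X = 1) = logit^{-1}(-1) ~ 0.269

    # In the infinite-data limit OLS recovers the two group means exactly,
    # so the intercept is p0 and the slope is the difference in means.
    print("OLS intercept:", round(p0, 3))       # 0.731
    print("OLS slope:    ", round(p1 - p0, 3))  # -0.462
    ```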

    This is a concrete application of the result in the previous footnote: β^OLS being consistent for group means only results in β^OLS being consistent for the model parameters β if the model parameters are themselves the group means.↩︎

  4. I.I.D. assumptions are often violated in real life, for example!

    My advisor made an interesting point about critiquing assumptions the other day. He said that it isn't enough to say “I don't think that assumption holds” when evaluating a model or an estimator. Instead, to show that the assumption is unacceptable, he claims you need to demonstrate that making a particular assumption results in some form of undesirable inference.↩︎

  5. Gomila also makes an argument against using β^logistic MLE based on some papers that show it handles interactions poorly. This sort of misses the forest for the trees in my opinion. GLMs are generically dubious models because few empirical conditional expectations are actually linear. And maybe linear main effects won’t kill you, but modeling interactions with linear terms amounts to fitting global hyperplanes for interactions. This is undesirable because interactions are almost always local, not global. If you want to model interactions carefully, you need something flexible like a GAM or kernels or whatnot.↩︎
