With the advent of new statistical software and tools, conducting a simple cross-sectional regression sounds quite easy. Right? Not really. Specifying an appropriate model is very important, and no tool apart from the econometrician’s brain can help her do that. Speaking of brains, let me narrate an interesting anecdote.

Paul Broca was a renowned French neurologist, surgeon, and anthropologist in the 19th century. He believed that brain weight correlates with intelligence. Based on the autopsies he had conducted in Paris hospitals, he found that female brains weigh on average 15% less than male brains, and concluded that men are more intelligent than women. Science has subsequently shown that his ideas about brain weight weren’t accurate (elephants’ and blue whales’ brains are several times heavier than human brains). But was his model correctly specified in the first place?

What I am hinting at is omitted variable bias. This is one of the most basic reasons why the results of a regression may be unreliable. The concept is simple enough: if the econometrician leaves out a relevant explanatory variable that is correlated with the included ones, the regression doesn’t give reliable results. In other words, the coefficients that one gets on the included explanatory variables are biased and inconsistent, and the model is said to have an endogeneity problem.

Allow me to demonstrate:

Suppose the real model is –

y = a + Bx + Cz + e

However, the econometrician forgets about z and instead runs –

y = a + Bx + e

And suppose z is related to x as follows, where u is uncorrelated with x:

z = v + Dx + u

Substituting for z in the real model shows what the shorter regression actually estimates:

y = a + Bx + C(v + Dx + u) + e

Or:

y = (a+Cv) + (B +CD)x + (e+Cu)

As we can clearly see, the coefficient on x is biased. The true coefficient is B, but what the model would throw up is (B + CD).
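A quick simulation can make the algebra above concrete. The sketch below is in Python and uses only the standard library. The values of a, B, C, v and D, the sample size, and the helper name ols_slope are all assumptions chosen for illustration. It generates data from the real model, runs the regression that forgets z, and compares the slope with B + CD.

```python
import random

def ols_slope(xs, ys):
    """Slope from a simple regression of ys on xs (intercept included)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

random.seed(0)
a, B, C = 1.0, 2.0, 3.0   # assumed true model: y = a + Bx + Cz + e
v, D = 0.5, 0.8           # assumed link:       z = v + Dx + u
n = 100_000

x = [random.gauss(0, 1) for _ in range(n)]
z = [v + D * xi + random.gauss(0, 1) for xi in x]
y = [a + B * xi + C * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

# Regression that omits z: its slope should sit near B + C*D, not near B.
short_slope = ols_slope(x, y)
print("true B:                ", B)
print("B + C*D:               ", B + C * D)
print("short-regression slope:", round(short_slope, 3))
```

With these assumed values the short regression should land close to B + CD rather than B, up to sampling noise.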

How does this relate to Paul Broca? Well the regression he ran was of the form:

Brain_Weight = a + B(Male) + n

Where Male was a dummy variable that took the value 1 if the observation belonged to a male and 0 if it belonged to a female. Here, B was the difference between male and female average brain weights, and he got a positive coefficient on it. However, the model suffered from omitted variable bias. For example, men tend to have more body mass than women. How would this affect the model? The new model now reads:

Brain_Weight = a + B(Male) + P(Body_Mass) + n

Using the earlier logic, we can conclude that the estimate of B from the shorter regression is biased. So what exactly was the bias on B, the key coefficient, in Broca’s specification? Writing B_hat for the estimate from the regression that omits Body_Mass:

E(B_hat) = B + P × Cov(Male, Body_Mass) / Var(Male)

The bias is most likely to be positive because:

  • P is positive: brain weight and body mass are positively correlated.
  • Males, on average, tend to have more body mass than females, so Cov(Male, Body_Mass) is positive.
  • Hence the bias term, P × Cov(Male, Body_Mass) / Var(Male), is positive, and the model would overestimate B.
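To see the sign argument at work, here is a small Python sketch on purely invented data. None of these numbers come from Broca’s autopsies. The body masses, the coefficient P_true, and the choice of B_true = 0 are assumptions made only to show the mechanics. With a dummy regressor, Cov(Male, Body_Mass) / Var(Male) is simply the male-female gap in average body mass, so the bias is P times that gap.

```python
import random

def mean(vals):
    return sum(vals) / len(vals)

def cov(xs, ys):
    """Population covariance (divides by n); cov(x, x) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

random.seed(1)
n = 50_000
B_true, P_true = 0.0, 10.0   # invented: no direct effect of Male, 10 g of brain per kg of body

male = [random.randint(0, 1) for _ in range(n)]
# Invented body masses in kg: males are 10 kg heavier on average.
body_mass = [60 + 10 * m + random.gauss(0, 8) for m in male]
brain = [700 + B_true * m + P_true * bm + random.gauss(0, 50)
         for m, bm in zip(male, body_mass)]

# Regression of brain weight on the dummy alone (Body_Mass omitted).
short_B = cov(male, brain) / cov(male, male)

# Bias term from the formula, and the equivalent gap in group means.
ratio = cov(male, body_mass) / cov(male, male)
gap = (mean([bm for m, bm in zip(male, body_mass) if m == 1])
       - mean([bm for m, bm in zip(male, body_mass) if m == 0]))

print("coefficient on Male, Body_Mass omitted:", round(short_B, 1))
print("B + P * Cov/Var:                       ", round(B_true + P_true * ratio, 1))
print("Cov/Var vs gap in mean body mass:      ", round(ratio, 3), round(gap, 3))
```

In this made-up world the dummy picks up a large positive coefficient even though the assumed direct effect of Male is zero, which is the overestimation described in the list above.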

Similarly, we can argue that Paul Broca left out many more relevant variables, such as Age_at_Death (men tend to die earlier than women).

 

Is it always a problem?

Omitting variables is not always a problem. It does not bias the coefficients when:

  • The omitted variable is uncorrelated with the included explanatory variables.
  • The omitted variable is irrelevant, i.e. it doesn’t determine the dependent variable.

More importantly, whether an omission matters depends on the underlying theoretical framework and the question that the econometrician is trying to answer.
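The first two cases can be checked with a simulation. This is a sketch in Python under assumed parameter values, and the function name short_regression_slope is made up for this example. It sets D = 0 (omitted variable uncorrelated with x) or C = 0 (omitted variable irrelevant) and looks at the slope from the regression that leaves z out.

```python
import random

def short_regression_slope(B, C, D, n=100_000, seed=0):
    """Simulate y = 1 + B*x + C*z + e with z = D*x + u,
    then return the slope from regressing y on x alone."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    z = [D * xi + rng.gauss(0, 1) for xi in x]
    y = [1 + B * xi + C * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

uncorrelated = short_regression_slope(B=2.0, C=3.0, D=0.0)  # z unrelated to x
irrelevant = short_regression_slope(B=2.0, C=0.0, D=0.8)    # z does not affect y
biased = short_regression_slope(B=2.0, C=3.0, D=0.8)        # both conditions fail

print("z uncorrelated with x:  ", round(uncorrelated, 3))
print("z irrelevant for y:     ", round(irrelevant, 3))
print("z relevant and related: ", round(biased, 3))
```

In the first two runs the slope should stay near the assumed B of 2. Only the third run drifts towards B + CD.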

A practical example

I have used the BWGHT dataset from Wooldridge to illustrate an example. Each observation records a baby’s weight at birth along with some explanatory variables. You can download the dataset from the link given in reference 2 below.

Now initially, I conduct a simple regression –

Ln(Weight_at_Birth) = a + B(Ln(Family_Income)) + e

> lm (df$lbwght ~ df$lfaminc)
Call:
lm(formula = df$lbwght ~ df$lfaminc)
Coefficients:
(Intercept)   df$lfaminc 
    4.69673      0.02061

This shows that a child’s weight at birth is positively correlated with family income. Fair enough, but is there an omitted variable bias? I go on to include the fatheduc variable, which measures the father’s years of education.

The new model becomes:

Ln(Weight_at_Birth) = a + B(Ln(Family_Income)) + J(Father’s_Education) + e

Which way would you guess the bias goes? We would expect fatheduc to be positively correlated with both the child’s weight at birth and family income. As a result, B in the first regression should be biased upward.

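> # x_more is not defined in the post; the next line is an assumed construction
> # that reproduces the coefficient names shown below
> x_more <- cbind(LnFamilyincome = df$lfaminc, Fatheredu = df$fatheduc)
> lm(df$lbwght ~ x_more)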
Call:
lm(formula = df$lbwght ~ x_more)
Coefficients:
(Intercept)  x_moreLnFamilyincome       x_moreFatheredu  
    4.672692              0.014738              0.003526 

And this is exactly what we find: the coefficient on family income falls from 0.02061 to 0.014738 once father’s education is included. (I have deliberately not focused on the standard errors here.)
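The size of such a drop is not arbitrary. When both regressions use the same observations, the short coefficient equals the long coefficient plus the coefficient on the added variable times the slope from regressing that variable on the included one. The Python sketch below checks this on simulated stand-in data, not on BWGHT. All coefficients are made up, and the two-regressor OLS formulas are written out by hand to keep the example dependency-free.

```python
import random

def centered_cross(us, ws):
    """Sum of cross-products of deviations from the means."""
    mu, mw = sum(us) / len(us), sum(ws) / len(ws)
    return sum((u - mu) * (w - mw) for u, w in zip(us, ws))

random.seed(2)
n = 2_000
x = [random.gauss(0, 1) for _ in range(n)]              # included regressor
z = [12 + 1.5 * xi + random.gauss(0, 2) for xi in x]    # regressor added later
y = [1 + 0.5 * xi + 0.3 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

sxx, szz, sxz = centered_cross(x, x), centered_cross(z, z), centered_cross(x, z)
sxy, szy = centered_cross(x, y), centered_cross(z, y)

short_b = sxy / sxx                       # y on x only
det = sxx * szz - sxz ** 2
long_b = (sxy * szz - szy * sxz) / det    # coefficient on x in y on (x, z)
long_j = (szy * sxx - sxy * sxz) / det    # coefficient on z in y on (x, z)
aux = sxz / sxx                           # slope of z on x

print("short coefficient:      ", short_b)
print("long + J * auxiliary:   ", long_b + long_j * aux)
```

The two printed numbers agree up to floating-point error. This is the in-sample counterpart of the (B + CD) expression derived earlier.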

Practical Considerations

The practical issue with omitted variable bias is that the econometrician may not be aware that a relevant variable is being omitted, or the data for an omitted variable may simply not exist. For example, when testing the impact of education on wages, one may want to include a variable to control for ability. However, it would be difficult in practice to get hold of such a variable.

There is no statistical test that tells you whether your model omits a relevant variable. The RESET test in Stata can check whether higher-order terms of already-included variables have been left out of your model, but it doesn’t account for external omitted variables. More on this is in reference 4.

Correctly specifying models is as much an art as a science. There are many more such issues that I would love to cover in the days to come. Until next time!

 

References:

  1. I must admit that I was introduced to Paul Broca’s story during my Econometrics lectures by Prof Vassilis Hajivassiliou at the LSE.
  2. BWGHT (2000), Wooldridge data sets.
    Available at: http://fmwww.bc.edu/ec-p/data/wooldridge/datasets.list.html
    (Accessed 15 September 2017)
  3. Gujarati D, Porter D, and Gunasekar S (2009), Basic Econometrics, 5th ed, New Delhi: Tata McGraw Hill, pp. 495-498.
  4. Omitted Variable Tests.
    Available at: http://personal.rhul.ac.uk/uhte/006/ec5040/Omitted%20Variable%20Tests.pdf
    (Accessed 17 September 2017)
  5. Schreider E (1966), Brain weight correlations calculated from original results of Paul Broca,
    American Journal of Physical Anthropology.
    Available at: http://onlinelibrary.wiley.com/doi/10.1002/ajpa.1330250207/abstract
    (Accessed 16 September 2017)