Category: Econometrics

Bootstrapping: An Introduction

“On first reading the bootstrap may seem a little like magic. But really it is not.” 

I have come across countless such quotes on the internet, and I was confused myself when I first encountered the concept. Bootstrapping is widely used in statistics and econometrics. The basic idea behind it is simple but can be confusing at first, so I have created a small presentation on the topic.
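To make the idea concrete, here is a minimal sketch in Python (chosen only for illustration; the presentation may take a different route). It estimates the standard error of a sample mean by resampling the observed data with replacement and compares the result with the textbook formula. The simulated data, the number of resamples, and the function name are assumptions made for this example.

```python
import random
import statistics

def bootstrap_se_of_mean(sample, n_resamples=2000, seed=0):
    """Estimate the standard error of the sample mean by resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Each bootstrap resample has the same size as the original sample.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(means)

# Hypothetical data: 200 draws from a normal distribution.
data_rng = random.Random(42)
data = [data_rng.gauss(10.0, 2.0) for _ in range(200)]

boot_se = bootstrap_se_of_mean(data)
analytic_se = statistics.stdev(data) / len(data) ** 0.5
print("bootstrap SE:", boot_se)
print("analytic SE: ", analytic_se)
```

The two numbers should be close, which is the sense in which the bootstrap is not magic: it simply uses the sample as a stand-in for the population.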

To refer to the presentation, click here

The Whole Nine Months: Fertility Rates as Predictors of Economic Growth

[This article was published by on 9th April 2018. It was written in collaboration with Aniruddha Ghosh, a classmate from LSE. To read the article on, please click here ]

[Featured on the Harvard Economics Review, click here ]


Nearly 10 years after the Great Recession of 2007-09, which brought monetary and real disturbances across the world economy, researchers have stumbled upon a new and rather ingenious business-cycle fact that may help predict upturns and downturns in an economy: the growth rate of conceptions as a predictor of economic growth.

While economists have uncovered such a relationship for the United States, a preliminary reading of Indian data suggests similar patterns. Of course, a more in-depth and rigorous analysis is needed to confirm the trends. Such trends could augment the much-needed high-frequency indicators that support policy making for the Indian economy.

It is a well-known fact that modern developed economies, as documented by most recent studies, have tended to show a positive association between fertility rates and economic growth: procyclical fertility. Patrick Galloway in his 1988 study on pre-industrial Europe highlighted the sharp responsiveness of fertility rates to the price of food grains. A higher price of food grains meant lower economic well-being, in turn negatively influencing fertility rates.

Moreover, the magnitude of the positive association between fertility rates and economic growth was consistent across most of the pre-industrial European economies. While cross-country studies report a negative relationship between per-capita income and fertility levels, fertility rates do tend to be procyclical with economic movements. Indeed, statistics from the US’s National Center for Health Statistics show how the number of births began to decline in 2008 as the fallout of the Great Recession became prominent, falling by over 15% after hitting a 50-year high in 2007.

In addition to these facts, economists have uncovered a new fact about the movement of fertility rates. According to a recently published working paper at the National Bureau of Economic Research (NBER), a leading economic research organisation in the US, fertility data not only moves in tandem with economic cycles but can even help predict the onset of a recession or an upturn. Studying quarterly economic and demographic data for the last three decades (1988-2015), researchers Kasey Buckles and Daniel Hungerman from the University of Notre Dame and Steven Lugauer from the University of Kentucky find that the growth rate of conceptions falls sharply during a recession and, importantly, that this fall actually begins a few quarters before the onset of a recession.

In the jargon of macroeconomics, the growth rate of conceptions is a leading indicator of the business cycle, with predictive power in foretelling an impending downturn. Technically, conception is a measure based on births, so it has a built-in nine-month lag attached to it.

Figure 1: The growth rate of conceptions falls sharply during recessions, and this fall typically begins a few quarters before the onset of a recession (US, 1988 Q1 to 2015 Q4). Source: Buckles, Hungerman, and Lugauer (2017)

This anticipatory nature of the growth rate of conceptions has been observed for all the major recessions in the last three decades, those of 1990, 2001, and 2008, and has been more pronounced in the most recent recession. The high-frequency conception data allows the authors to state a rather unconventional business-cycle fact, and we quote: “the growth rate of fertility declines prior to economic downturns and the decline occurs several quarters before recessions begin.” It is important to keep in mind that, in the US data, the fertility decline is driven solely by a fall in conceptions rather than by abortions or foetal deaths. The fall in conceptions is large: the growth rate of conceptions fell by nearly 5 percentage points at the onset of the Great Recession of 2007-09.

As mentioned above, the findings show that the growth rate of conceptions typically starts falling a few quarters before a recession and, since it is a measure based on births, has an implicit nine-month lag built into it. So what explains these associations? The hypothesis that the researchers put forward is a straightforward one: couples make forward-looking decisions about having children based on the economic outlook. If this hypothesis is true, then fertility behaviour in the US over the last three decades has been forward-looking and quick to respond to changes in the state of the economy. The explanation does depend on the assumption that almost all conceptions are ‘planned’, i.e. couples are making conscious choices about bearing children, as opposed to unplanned, i.e. children born due to lack of contraception, out of wedlock, etc. With unplanned pregnancies being quite common in the US (estimates put them at 40-50% of all pregnancies), there is a legitimate question as to whether other factors are driving this predictive relationship between the growth rate of conceptions and economic growth.

Nevertheless, if future economic conditions matter for current conception decisions, and if expectations are at least rational or forward-looking, movements in current conceptions may well be harbingers of future conditions.

Figure 2: Growth Rate of Conceptions and economic growth during the 2007-09 Great Recession. Source: Buckles, Hungerman, and Lugauer (2017)

That’s one part of the story. Interestingly, the researchers also find that the growth rate of conceptions tends to move together with common economic indicators like consumer confidence and durables purchases. In Figure 3, the fall in the growth rate of conceptions coincides with or even precedes falls in consumer confidence and durables purchases. The consumer confidence index and durables purchases are often regarded as high-frequency indicators that can foretell the prospects of the economy in the near term. In India, the Reserve Bank of India (RBI) and the Centre for Monitoring Indian Economy (CMIE) independently track consumer confidence, which is a useful aid in capturing expectations about the near-term economic situation.

Figure 3: Falls in conventional economic indicators, like consumer confidence and purchases of durables, and in the growth rate of conceptions tend to coincide during recessions. Source: Buckles, Hungerman, and Lugauer (2017)

What does the trend look like for India? We do a very preliminary exercise here and graph the growth of the birth rate against the GDP series for India. The graph points to some predictive behaviour, but of course this has to be subjected to more stringent and robust econometric evaluation. Also, we use birth rates, a different measure from the conception rates that Buckles et al. use. As a first pass, however, this should interest practitioners enough to look carefully at the association. That said, we think the hypothesis that Buckles et al. put forward may not hold tightly for India. With a low per capita income, a gloomy economic situation might step up conceptions, as households may want more working hands. Again, these are speculations, and only a careful analysis of the data will uncover the mechanisms in place.

See Figure 4 here.

Whether conception rates or birth rates should be the variable of interest can be debated, but a common challenge with both measures is that they are hard to measure in real time, even compared with other measures like consumer purchases. However, with the arrival of sophisticated data analytics and statistical techniques, it is quite possible to get predictions of these measures on a real-time basis. One possible way to track conceptions is to use purchases of goods that are especially likely to be bought by those who are attempting to conceive or who are newly pregnant. This data is usually tracked in real time by retail firms using scanner technology. In fact, it is already being used by retail firms for targeted marketing and advertising.

A 2012 New York Times piece outlined how Target Corporation could observe, track and study consumer purchases and predict whether a customer is pregnant. Most online portals place cookies on your devices to track your shopping habits. Therefore, tools are available to harness real-time data on conceptions. Of course, such tracking has to respect the contours of privacy, since these are very private matters. Buckles et al. explore a few options for tracking conceptions, and interested readers may wish to look at the relevant sections of the paper.

Yogi Berra (the legendary Yankees baseball player), famous for his Yogi-isms, once quipped, “It’s tough to make predictions, especially about the future.” It’s encouraging for policy practitioners to have such evidence of conception rates as predictors of economic cycles. To our knowledge, this study is one of a kind and should prompt researchers across the globe to look for such associations in their own countries. Policy institutions in India often cite the lack of high-frequency indicators that can aid the policy formulation process. Buckles et al. should encourage them to hunt for innovative indicators and develop tools that can be put to good use. With so much jargon and so many tools being thrown around under the banner of Big Data, it will be useful if researchers look at meaningful constructs of the kind that Buckles et al. do.





Generating OLS Results Manually via R

Statistical software has made it extremely easy for people to run regression analyses. Functions like lm in R or the reg command in Stata give quick and well-compiled results. With this ease, however, people often don’t know or forget how to actually conduct these analyses manually. In this article, we manually recreate regression results produced by the lm function in R.
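As a rough illustration of what "manually" means here, the sketch below computes the coefficients, standard errors, and t-values of a one-regressor OLS model from first principles. It is written in Python rather than R purely for illustration, the data are made up, and the function name `ols_simple` is hypothetical; the linked article may organise the calculation differently (for example, with matrix algebra).

```python
import math

def ols_simple(x, y):
    """OLS for y = intercept + slope * x, computed from sums of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual variance uses n - 2 degrees of freedom (two estimated parameters).
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(r ** 2 for r in residuals) / (n - 2)

    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1 / n + x_bar ** 2 / sxx))
    return {
        "intercept": intercept,
        "slope": slope,
        "se_intercept": se_intercept,
        "se_slope": se_slope,
        "t_intercept": intercept / se_intercept,
        "t_slope": slope / se_slope,
    }

# Hypothetical data, used only to exercise the function.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
result = ols_simple(x, y)
print(result)
```

The estimates, standard errors, and t-values are the same quantities that a regression summary table reports for each coefficient.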

Please click here to read the article!

Standard Deviation vs. Standard Error

Standard deviation and standard error are concepts that I have often mixed up and struggled to tell apart. As a result, I made a small presentation clarifying the difference between the two.
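A small simulation can make the distinction tangible. In the sketch below (Python, with made-up normal data; the function name and parameter values are assumptions for illustration), the standard deviation describes the spread of individual observations and stays roughly constant as the sample grows, while the standard error of the mean describes the precision of the estimated mean and shrinks with the square root of the sample size.

```python
import random
import statistics

rng = random.Random(1)
POPULATION_SD = 5.0  # hypothetical spread of individual observations

def sd_and_se(n):
    """Return the sample standard deviation and the standard error of the mean."""
    sample = [rng.gauss(50.0, POPULATION_SD) for _ in range(n)]
    sd = statistics.stdev(sample)   # spread of the data
    se = sd / n ** 0.5              # precision of the sample mean
    return sd, se

sd_small, se_small = sd_and_se(100)
sd_large, se_large = sd_and_se(10_000)

print("n = 100:    SD =", sd_small, " SE =", se_small)
print("n = 10000:  SD =", sd_large, " SE =", se_large)
```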

To read the presentation, please click here

Finite Populations: How do we think about them?

Consider how a normal regression output on STATA looks.

Sample Regression Output


We find that along with the coefficients, we get other values like standard errors, p values, confidence intervals, etc. So what is the purpose of these stats?

The standard assumptions in the stat-econ literature are as follows: i) the observed sample is a random, representative sample from the entire population; ii) we are interested in the true population parameters.

What does this mean? We assume that we pick a random, representative sample from a population. After running a regression, we get a particular coefficient on each of the x’s. We assume that in repeated sampling, we would get a distribution of coefficients for the x’s. In fact, one of the fundamental properties of a good estimator is unbiasedness: the expected value of the coefficient over its sampling distribution is equal to the true coefficient. Although we rarely do repeated sampling in practice, we do test the significance of the coefficients against a null hypothesis that they are equal to zero. Thus, in a normal regression framework, the population has the true coefficients and is quite distinct from the samples, which yield a distribution of coefficient estimates.
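The repeated-sampling thought experiment is easy to mimic on a computer. The following sketch (Python; the "true" intercept and slope, noise level, and sample sizes are all invented for illustration) draws many samples from the same process, estimates the slope in each, and checks that the estimates are centred on the true value while still varying from sample to sample.

```python
import random
import statistics

TRUE_INTERCEPT = 1.0   # hypothetical population parameters
TRUE_SLOPE = 2.0

def ols_slope(x, y):
    """Slope of a simple OLS regression of y on x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx

def slope_from_new_sample(rng, n=50):
    """Draw a fresh random sample and return its estimated slope."""
    x = [rng.uniform(0, 10) for _ in range(n)]
    y = [TRUE_INTERCEPT + TRUE_SLOPE * xi + rng.gauss(0, 3) for xi in x]
    return ols_slope(x, y)

rng = random.Random(7)
slopes = [slope_from_new_sample(rng) for _ in range(2000)]

mean_slope = statistics.fmean(slopes)   # close to TRUE_SLOPE (unbiasedness)
spread = statistics.stdev(slopes)       # what a standard error tries to estimate
print("mean of estimated slopes:", mean_slope)
print("spread of estimated slopes:", spread)
```

The spread of the estimates across samples is what the standard error in a single regression output is trying to approximate.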

So what happens when we do have the entire population?

This is a tricky question, as we usually assume that the population is infinite and that we can never observe all of it. Thus, with a finite population, like the 28 states of India, it becomes tricky to interpret the regression output. I will attempt to summarise two popular views in the stat-econ literature in this regard.


One view says that if you have the entire population, the p values and t statistics are irrelevant.  The finite population is considered to be a fixed set of elements. The coefficients derived are the true relationship since you have the entire population and not a sample of the population. Thus, the concepts of hypothesis testing and significance become meaningless for an entire population. This is because the stats and tests are only relevant for a sample, and not the entire population. The caveat here is that the model has to be correctly specified.

Super Population / Underlying Process

This view states that the observations in the population are simply a sample from an infinite population. For example, if we’re looking at the 28 states in India, we can interpret the set of observations at one time period as a sample; and the set of observations across all time as the infinite super population.

This becomes important, especially if you want to make inferences not just about relationships today but also if you want the relationships to hold for similar groups in the future.

Another related but slightly different view is that the observed outcomes in the population are the products of an underlying process. In that regard, what standard errors and other test statistics capture becomes relevant. According to Abadie et al. (2014), “there are for each unit missing potential outcomes for the treatment levels the unit was not exposed to.”

In a similar vein, Wallis and Roberts (1956) describe a population as “the totality of numbers that would result from indefinitely many repetitions of the same process of selecting objects, measuring or classifying them, and recording results.”

In this regard, test statistics become important in judging whether the estimated coefficients reflect a genuine relationship or could have arisen by chance.



  1. Abadie A, Athey S, Imbens G, Wooldridge J (2014), Finite Population and Causal Standard Errors, NBER Working Paper Series.
    Available at:
  2. Asali M (2012), Can I make a regression model with the whole population? [Msg 1], ResearchGate.
    Message posted:
  3. Frick R (1998), Interpreting statistical testing: Process and propensity, not population and random sampling, Behavior Research Methods, Instruments, & Computers. Vol 30 (3), pp: 527-535
  4. Hartley H, Sielken Jr R (1975), A “Super Population-Viewpoint” for finite population sampling, Biometrics, Vol. 31 (2), pp: 411-422
  5. Hidiroglou, Michael Arsene (1974), Estimation of regression parameters for finite populations , Iowa State University Digital Repository.
    Available at:
  6. January (2013), How to report data for an entire population? [Msg 1], CrossValidated.
    Message posted to:



Simple Guide to Getting Orthogonalized Impulse Response Functions

Vector autoregressions (VARs) are used in time series analysis when there may be interdependencies among multiple time series. For example, we may want to understand the relationship between GDP, the current account balance, and inflation. Running a normal OLS regression would be inappropriate because each variable affects the others: the OLS estimates would suffer from endogeneity and would be biased. In such scenarios, we use VAR methods.

Recently, I was working on some time series data that had a reverse causality problem. As a result, I had to use a VAR model to get orthogonalized impulse response functions (OIRFs) in order to understand the relationship between the variables. In the attached presentation, I describe the theory behind this orthogonalization as well as the steps to generate OIRFs in Stata.

To check the presentation, please click on this link

A to B or B to A? That is the Question: Be Careful about Reverse Causality

Douglas Adams, the author of The Hitchhiker’s Guide to the Galaxy, had said, “The complexities of cause and effect defy analysis.”  Human beings have, throughout history, often been stumped by these complexities.

In the Middle Ages, many Europeans believed that lice were a sign of good health. The reasoning was that sick people rarely had any lice on them. Today we know, of course, that lice are extremely sensitive to body temperature and would leave the bodies of sick individuals. Thus, it was not that fewer lice caused sickness, but that sickness caused fewer lice.

Even economics has suffered from these issues. Alfred Marshall, in the third edition of his Principles of Economics, stated the Giffen Paradox, the curious exception to the law of demand:
“There are however some exceptions. For instance, Mr Giffen has pointed out, a rise in the price of bread makes so large a drain on the resources of the poorer labouring families and raises so much the marginal utility of money to them, that they are forced to curtail their consumption of meat and the more expensive farinaceous foods: and, bread being still the cheapest food which they can get and will take, they consume more, and not less of it.”

However, for over a century, economists have struggled to find comprehensive empirical evidence in support of this hypothesis. The key issue, as several studies have pointed out, is that higher consumption could be leading to higher prices, which may give the illusion of an upward-sloping demand curve. Indeed, in a market economy, the price is the equilibrium of two equations: the demand equation and the supply equation. We must not forget that quantity and price are positively related along the supply curve.

Closer to the present, the issue continues to haunt us. In 2010, Harvard University economists Carmen Reinhart and Kenneth Rogoff published a paper titled ‘Growth in a Time of Debt’. One of their key findings was

“…median growth rates for countries with public debt over 90 percent of GDP are roughly one percent lower than otherwise; … Countries with debt-to-GDP ratios above 90 percent have a slightly negative average growth rate, in fact.”

Their research was extremely influential: it was used by many fiscal hawks in the West to argue for reining in debt levels. George Osborne, former Chancellor of the Exchequer in the UK, famously said, “As Rogoff and Reinhart demonstrate convincingly, all financial crises ultimately have their origins in one thing.”

Alas, other economists’ scrutiny of their work revealed glaring oversights. Technical errors (yes, there was one) aside, economists pointed out that this could simply be a case of reverse causality, i.e., that it was low growth that caused high debt levels instead (a possibility that Reinhart and Rogoff acknowledged).

So why is reverse causality an issue in econometrics? In econometrics, we try to avoid endogeneity, as it gives us biased or inconsistent estimates. Endogeneity means that an explanatory variable is correlated with the error term.

So how does reverse causality lead to this situation? Picture this:

I am using Telecom Regulatory Authority of India quarterly data which has price (Average Revenue per User) and quantity (Minutes of Usage) data.

Suppose I have the following equation:

MoU = a + B*ARPU + error_1  ….. (i)

I would expect a negative correlation between MoU and ARPU based on the law of demand (ignoring omitted variables for the moment). However, I forget that higher consumption of minutes leads to higher prices as well, i.e.:

ARPU = b + C*MoU + error_2 ……(ii)

So why is ARPU endogenous in the first equation? Simple: imagine a shock to error_1 that leads to higher MoU. Since MoU rises, equation (ii) implies that ARPU rises as well. As a result, a change in error_1 leads to a change in ARPU in equation (i). Thus, there is endogeneity.

Visually –
error_1 goes up -> MoU goes up -> ARPU goes up
Thus, error_1 and ARPU are correlated!
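The chain above can be checked with a small simulation. The sketch below (Python) solves equations (i) and (ii) jointly for simulated shocks and then regresses MoU on ARPU. All parameter values are hypothetical and are not estimated from the TRAI data; they are chosen so that the true demand slope B is negative. Because error_1 feeds into ARPU through equation (ii), the OLS slope comes out positive.

```python
import random
import statistics

# Hypothetical structural parameters, for illustration only.
a, B = 100.0, -0.5   # equation (i):  MoU  = a + B*ARPU + error_1, with B < 0
b, C = 10.0, 0.8     # equation (ii): ARPU = b + C*MoU  + error_2, with C > 0

def covariance(u, v):
    """Sample covariance of two equal-length lists."""
    u_bar, v_bar = statistics.fmean(u), statistics.fmean(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v)) / (len(u) - 1)

rng = random.Random(11)
n = 5000
error_1 = [rng.gauss(0, 10) for _ in range(n)]
error_2 = [rng.gauss(0, 1) for _ in range(n)]

# Substituting (i) into (ii) gives ARPU as a function of both shocks.
arpu = [(b + C * a + C * e1 + e2) / (1 - C * B) for e1, e2 in zip(error_1, error_2)]
mou = [a + B * p + e1 for p, e1 in zip(arpu, error_1)]

# OLS slope from regressing MoU on ARPU.
ols_slope = covariance(mou, arpu) / covariance(arpu, arpu)

# Correlation between the demand shock and the regressor.
corr_error_arpu = covariance(error_1, arpu) / (
    statistics.stdev(error_1) * statistics.stdev(arpu)
)

print("true B:", B)
print("OLS slope of MoU on ARPU:", ols_slope)
print("corr(error_1, ARPU):", corr_error_arpu)
```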

Here, I run a simple OLS regression of MoU on ARPU and get the following results:

Balanced Panel: n=22, T=28, N=616 
Residuals :    Min.  1st Qu.   Median  3rd Qu.     Max. 
              -102.730  -33.519  -12.179   26.099  250.750  
Coefficients :   Estimate Std. Error t-value  Pr(>|t|)   
 df_subset2$ARPU 1.371101   0.076506  17.921 < 2.2e-16 ***

So a simple OLS regression suggests that an increase in price is associated with an increase in minutes of usage per user. But we now know the reason behind this counter-intuitive result.

Similar to my results, Steven D. Levitt from the University of Chicago found that studies that look at the impact of greater police deployment on crime rates may wrongly get a positive correlation between the two. This is because higher crime may itself lead to greater police deployment in an area, and thus the results would be flawed.

In conclusion, it is very important to make sure that we’ve got the direction of causation right in our model. Otherwise, it is very likely that we are making incorrect inferences. However, in some cases, reverse causality or simultaneity in the data may be unavoidable. What do we do then? There is a technique known as instrumental variables estimation which is well suited to these kinds of situations. Steven D. Levitt used this technique to study the effect of greater police deployment on crime rates.
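To give a flavour of how instrumental variables estimation works, here is a minimal sketch in Python that reuses the MoU/ARPU setup. It assumes a hypothetical instrument, `cost_shock`, that shifts ARPU but affects MoU only through ARPU; no such variable appears in the TRAI data, and all parameter values are invented. The simple IV estimator divides the covariance of the instrument with the outcome by its covariance with the endogenous regressor.

```python
import random
import statistics

# Hypothetical structural parameters, for illustration only.
a, B = 100.0, -0.5        # demand:  MoU  = a + B*ARPU + error_1
b, C, G = 10.0, 0.8, 2.0  # pricing: ARPU = b + C*MoU + G*cost_shock + error_2

def covariance(u, v):
    """Sample covariance of two equal-length lists."""
    u_bar, v_bar = statistics.fmean(u), statistics.fmean(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v)) / (len(u) - 1)

rng = random.Random(5)
n = 20000
cost_shock = [rng.gauss(0, 5) for _ in range(n)]   # the assumed instrument
error_1 = [rng.gauss(0, 10) for _ in range(n)]
error_2 = [rng.gauss(0, 1) for _ in range(n)]

# Equilibrium values implied by the two equations.
arpu = [
    (b + C * a + G * z + C * e1 + e2) / (1 - C * B)
    for z, e1, e2 in zip(cost_shock, error_1, error_2)
]
mou = [a + B * p + e1 for p, e1 in zip(arpu, error_1)]

# OLS ignores the feedback from MoU to ARPU and is biased upwards.
ols_estimate = covariance(mou, arpu) / covariance(arpu, arpu)

# IV uses only the variation in ARPU that comes from the instrument.
iv_estimate = covariance(cost_shock, mou) / covariance(cost_shock, arpu)

print("true B:      ", B)
print("OLS estimate:", ols_estimate)
print("IV estimate: ", iv_estimate)
```

The IV estimate recovers the negative demand slope here only because the instrument is, by construction, unrelated to error_1. In real data that exclusion assumption has to be argued for, not simply assumed.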


  1. Cassidy J (2013), The Reinhart and Rogoff Controversy: A Summing Up, The New Yorker.
    Available at:
    (Accessed 20 October 2017)
  2. Europeans in the Middle Ages Believed Lice were a Sign of Good Health (2017), The Vintage News.
    Available at:
    (Accessed 20 October 2017)
  3. Jensen R and Miller N, Giffen Behavior and Subsistence Consumption, The American Economic Review,  Vol. 98, No. 4 (Sep., 2008), pp. 1553-1577
  4. Krugman P (2010), Reinhart and Rogoff are Confusing Me, The New York Times.
    Available at:
    (Accessed 20 October 2017)
  5. Krugman P (2013), Reinhart-Rogoff, Continued, The New York Times.
    Available at:
    (Accessed 20 October 2017)
  6. Performance Indicator Reports, Telecom Regulatory Authority of India.
    Available at:
    (Accessed 19 October 2017)
  7. Levitt S (1997), Using Electoral Cycles in Police Hiring to Estimate the Effect of Police on Crime, The American Economic Review, Volume 87, Issue 3 (Jun., 1997), 270-290.
  8. Lyons J (2013), George Osborne favourite “godfathers of austerity” economists agree to making error in research, Mirror
    Available at:
    (Accessed 20 October 2017)
  9. Reinhart C and Rogoff K (2010), Growth in a Time of Debt, working paper at National Bureau of Economic Research
    Available at:
    (Accessed 20 October 2017)
  10. Reinhart C and Rogoff K (2013), Reinhart-Rogoff Response to Critique, The Wall Street Journal.
    Available at:
    (Accessed 20 October 2017)
  11. Stigler G (1947), Notes on the History of the Giffen Paradox, The University of Chicago Press Journals, Vol. 55, No. 2 (Apr., 1947), pp. 152-156

Don’t Forget about the Losers: Why We Should Worry about Survivorship Bias

“Go to ABC University. Though the fee is a little expensive, you’re guaranteed a great job after that. Just look at Rajesh. He has got such a great job at XYZ Company after university.”

We often hear these kinds of statements in our day to day lives. We should always be wary of such statements and check if they are nothing but a case of survivorship bias.

Survivorship bias is a logical fallacy that results in people focusing on successful cases while overlooking unsuccessful cases, usually because of their lack of visibility. No one likes to focus on a loser, right?

Forget about going to university: don’t we also hear stories about how college drop-outs like Mark Zuckerberg and Steve Jobs went on to become extremely successful multi-millionaires? This article by The Atlantic beautifully illustrates the dangers of believing these seductive stories. Here’s an extract:

“Like any myth, this story has a kernel of truth: There are exceptional individuals whose hard work, determination, and intelligence make up for the lack of a college degree. If they could do it, one might think, why can’t everybody? Such a question ignores the outlier status of these exceptional drop-out entrepreneurs and innovators. Those who are able to achieve such success often rely on a set of skills already developed before they get to college….But what happens to young people without access to these important resources? For them, skipping college to pursue business success is like investing their savings in lottery tickets in the hopes they will be a multimillion-dollar winner… The reality is that the next college dropout will not be LeBron James, James Cameron, or Mark Zuckerberg. He will likely belong to the millions of college drop-outs you don’t hear the press singing about.”


In Economics and Finance

Survivorship bias can lead to severe flaws not just in our day to day logic but also in economic and financial analyses. Ajay Shah, a noted Indian economist, points out on his blog how people tend to over-estimate corporate earnings growth by using the current crop of Nifty 50 companies as a proxy for Indian corporations. His logic is simple: the index undergoes periodic revisions in which some companies are dropped and new ones are added. Since the new companies usually tend to be doing better than the ones that drop out, any analysis that looks at the latest crop of companies to analyze earnings over a long time period will tend to over-estimate earnings growth.

Similarly, studies have shown how mutual fund companies tend to overstate their returns over time as they drop their under-performing funds.


Econometric Models

We should also be careful when dealing with econometric models. Economists tend to prefer balanced panel data, as it is conducive to a clean analysis. A balanced panel is a data set in which each individual has the same number of observations through time. However, in the real world, we often get unbalanced panels. Data can be missing for various reasons, and though we are often tempted to balance the panel by deleting the individuals with incomplete data, we should be careful. In fact, unless the observations are missing entirely at random, we should not balance the panel: we may end up deleting valuable information.

Take, for example, a dataset of firms over a 40-year period. Some firms do not have observations over the entire period because they closed down in the interim. If we delete these firms, we will probably end up with selection bias, as the firms may possess certain characteristics that forced them to shut down, which our analysis would miss once they are deleted.
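A toy simulation shows how large the distortion can be. In the sketch below (Python), each hypothetical firm has its own average growth rate, and a firm exits the panel the first time its yearly growth falls below a threshold. The exit rule, the threshold, and all other numbers are assumptions made for illustration. Averaging growth over only the firms that survive every year (the "balanced" panel) gives a rosier picture than averaging over all observed firm-years.

```python
import random
import statistics

rng = random.Random(3)

N_FIRMS = 1000
N_YEARS = 10
EXIT_THRESHOLD = -0.10   # hypothetical: a firm exits after a year this bad

def simulate_firm():
    """Return the yearly growth rates observed for one firm until it exits."""
    quality = rng.gauss(0.05, 0.05)        # firm-specific average growth
    history = []
    for _ in range(N_YEARS):
        growth = rng.gauss(quality, 0.10)
        history.append(growth)
        if growth < EXIT_THRESHOLD:
            break                           # firm shuts down, no further data
    return history

firms = [simulate_firm() for _ in range(N_FIRMS)]

# "Balancing" the panel keeps only firms observed in every year.
survivors = [h for h in firms if len(h) == N_YEARS]

mean_all = statistics.fmean(g for h in firms for g in h)
mean_balanced = statistics.fmean(g for h in survivors for g in h)

print("firms surviving all years:", len(survivors), "of", N_FIRMS)
print("mean growth, all firm-years:     ", mean_all)
print("mean growth, balanced panel only:", mean_balanced)
```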

As an example, I used NYU Stern’s Swiss Railways data set, an unbalanced panel with 49 companies’ observations across 13 years. I estimated the following model:

Ln_Costs = a + B*(Network) + C*(Staff) + D*(Stops) + E*(Tunnel)

I wanted to test how firms’ costs were related to Network (total length of their railway network), Staff (number of employees), Stops (number of stations), and Tunnel (a dummy for whether their network had long tunnels).

Out of the 49 firms in question, only 37 were present for the entire period. Most of the other firms had exited before the final year of the data set.

I first conducted a balanced panel analysis and got the following results:

              Estimate Std. Error t-value  Pr(>|t|)
(Intercept) 1.0546e+01 1.1908e-01 88.5607 < 2.2e-16 ***
NETWORK     3.6950e-06 2.6079e-06  1.4168   0.15719
STAFF       2.2411e-03 4.6429e-04  4.8270  1.87e-06 ***
STOPS       9.6047e-03 3.7608e-03  2.5539   0.01096 *
TUNNEL      1.9544e-01 2.9842e-01  0.6549   0.51283


I then conducted an analysis on the unbalanced panel, using all the companies:

              Estimate Std. Error t-value  Pr(>|t|)
(Intercept) 1.0551e+01 1.0727e-01 98.3622 < 2.2e-16 ***
NETWORK     5.1834e-06 1.8577e-06  2.7902 0.0054354 **
STAFF       1.5593e-03 2.7549e-04  5.6603 2.345e-08 ***
STOPS       1.1440e-02 3.4306e-03  3.3348 0.0009062 ***
TUNNEL      2.7512e-01 2.4753e-01  1.1115 0.2668046


You can see how different the results are. NETWORK was not significant in the balanced panel, while it was significant and larger in the unbalanced panel, which suggests that network length plays a larger role in total costs than the balanced panel implies. The coefficient on STAFF fell (from about 0.0022 to 0.0016), while the coefficient on STOPS rose slightly (from about 0.0096 to 0.0114), and both were estimated more precisely. One possible reading is that the firms that ceased to exist differed systematically from the survivors, for example by having large networks relative to their staff, although these regressions alone cannot confirm that.

To conclude, we must always be careful before falling into the trap of survivorship bias – whether it’s in our daily lives or while conducting economic analyses.


References :

  1. Elton E, Gruber M, and Blake C (1996), Survivorship Bias and Mutual Fund Performance, The Review of Financial Studies.
    Available at:
    (Accessed 24 September 2017)
  2. Shah A (2017), Indian Corporations have Weak Earnings Growth, Ajay Shah’s Blog.
    Available at:
    (Accessed 15 September 2017)
  3. Shermer M (2014), How the Survivorship Bias Distorts Reality, Scientific American.
    Available at:
    (Accessed 29 September 2017)
  4. Survivorship Bias, Investopedia
    Available at:
    (Accessed 1 October 2017)
  5. Swiss Railways, Panel Data Sets, Prof W. Greene – NYU Stern.
    Available at:
    (Accessed 28 September 2017)
  6. Zimmer R (2013), The Myth of the Successful College Dropout: Why it Could Make Millions of Young Americans Poorer, The Atlantic.
    Available at:
    (Accessed 1 October 2017)

Regression 101: Don’t Forget to add all the Ingredients to Bake the Cake (Omitted Variable Bias)

With the advent of new statistical software and tools, conducting a simple cross-sectional regression sounds quite easy. Right? Not really. Specifying an appropriate model is very important and no tool apart from the econometrician’s brain can help her do that. Speaking of brain, let me narrate an interesting anecdote.

Paul Broca was a renowned French neurologist, surgeon, and anthropologist in the 19th century. He believed that brain weight was correlated with intelligence. Based on the autopsies he had conducted in Paris hospitals, he found that female brains weighed on average 15% less than male brains and concluded that men were more intelligent than women. Science has since shown that his ideas about brain weight weren’t accurate (elephants’ and blue whales’ brains are several times heavier than human brains). But was his model correctly specified in the first place?

What I am hinting at is omitted variable bias. This is one of the most basic reasons why the results of a regression may not be reliable. The concept is simple enough: if the econometrician forgets to include a relevant explanatory variable in the model, the regression doesn’t give reliable results. In other words, the coefficients that one gets on the explanatory variables are biased and inconsistent, and the model is said to have an endogeneity problem.

Allow me to demonstrate:

Suppose the real model is –

y = a + Bx + Cz + e

However, the econometrician forgets about z and instead runs –

y = a + Bx + e

And suppose z can be determined as a function of x in the form of:

z = v + Dx + u

Substituting the expression for z into the real model shows what the econometrician's regression actually estimates:

y = a + Bx + C(v + Dx + u) + e


y = (a+Cv) + (B +CD)x + (e+Cu)

As we can clearly see, the coefficient on x is biased. The true coefficient is B, but what the model would throw up is (B+CD). The bias CD vanishes only if C = 0 or D = 0.
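The algebra above can be checked with a small simulation. This is only a sketch on made-up data: the parameter values and the names a_true, B_true, C_true, v_true and D_true are my own illustrative assumptions, not anything estimated from a real dataset.

```r
# Simulated data only: all parameter values are illustrative assumptions.
set.seed(42)
n <- 100000

a_true <- 1; B_true <- 2; C_true <- 3   # real model: y = a + B*x + C*z + e
v_true <- 0.5; D_true <- 0.7            # auxiliary model: z = v + D*x + u

x <- rnorm(n)
z <- v_true + D_true * x + rnorm(n)
y <- a_true + B_true * x + C_true * z + rnorm(n)

long_fit  <- lm(y ~ x + z)   # correctly specified model
short_fit <- lm(y ~ x)       # model that forgets z

coef(long_fit)["x"]             # close to B = 2
coef(short_fit)["x"]            # close to B + C*D = 4.1
coef(short_fit)["(Intercept)"]  # close to a + C*v = 2.5
```

With a large sample the short regression settles near B + CD rather than B, so collecting more data does not make the problem go away.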

How does this relate to Paul Broca? Well the regression he ran was of the form:

Brain_Weight = a + B(Male) + n

Where Male was a dummy variable which took the value of 1 if the observation belonged to a male, and 0 if the observation belonged to a female. Here, B was the differential between male and female average brain weights, and he got a positive coefficient on it. However, the model suffered from omitted variable bias. For example, men tend to have more body mass than women. How would this affect the model? The new model now reads:

Brain_Weight = a + B(Male) + P(Body_Mass) + n

Using the earlier logic, we can conclude that the estimate of B from Broca’s original regression is biased. So what exactly was the bias on B, the key coefficient, in Broca’s specification?

E(B_hat) = B + P(Cov(Male, Body_Mass)/Var(Male))

The bias is most likely to be positive as:

  • P is positive: brain weight and body mass are positively correlated.
  • Cov(Male, Body_Mass) is positive: males, on average, tend to have more body mass than females.
  • Hence the bias term P(Cov(Male, Body_Mass)/Var(Male)) is positive, and the model would overestimate B.
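To make the bias formula concrete, here is a minimal sketch on hypothetical data. None of these numbers come from Broca's autopsies; the effect sizes and variable names are assumptions chosen purely for illustration. It shows that, within a sample, the coefficient from the short regression equals the coefficient from the long regression plus the estimated bias term.

```r
# Hypothetical simulated data; none of these numbers come from Broca's autopsies.
set.seed(1)
n <- 500
male <- rbinom(n, 1, 0.5)
body_mass <- 60 + 12 * male + rnorm(n, sd = 8)                       # kg
brain_weight <- 900 + 20 * male + 5 * body_mass + rnorm(n, sd = 50)  # grams

short_fit <- lm(brain_weight ~ male)              # Broca-style regression
long_fit  <- lm(brain_weight ~ male + body_mass)  # controls for body mass

B_hat <- coef(long_fit)["male"]
P_hat <- coef(long_fit)["body_mass"]
bias  <- P_hat * cov(male, body_mass) / var(male)

# In-sample identity: short coefficient = long coefficient + bias term.
all.equal(unname(coef(short_fit)["male"]), unname(B_hat + bias))
```

Because both P_hat and the covariance are positive in this setup, the bias term is positive and the short regression overstates the male differential.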

Similarly, we can argue that many more relevant variables like Age_at_Death were missed out by Paul Broca (men tend to die earlier than women).


Is it always a problem?

Omitting variables is not always a problem. The situations when it’s not a problem include:

  • The omitted variable is uncorrelated with the included explanatory variables.
  • The omitted variable is irrelevant or doesn’t determine the dependent variable.
  • More importantly, it depends on the underlying theoretical framework and question that the econometrician is trying to answer.
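The first two cases can be illustrated with simulated data. Again, this is only a sketch: the coefficients and variable names are assumptions made up for the example.

```r
# Simulated data only: coefficients below are illustrative assumptions.
set.seed(7)
n <- 100000
x <- rnorm(n)

# Case 1: the omitted variable matters for y but is uncorrelated with x.
z_uncorrelated <- rnorm(n)
y1 <- 1 + 2 * x + 3 * z_uncorrelated + rnorm(n)
coef_case1 <- coef(lm(y1 ~ x))["x"]   # close to the true value 2

# Case 2: the omitted variable is correlated with x but irrelevant for y.
z_irrelevant <- 0.7 * x + rnorm(n)
y2 <- 1 + 2 * x + 0 * z_irrelevant + rnorm(n)
coef_case2 <- coef(lm(y2 ~ x))["x"]   # close to the true value 2

c(case1 = unname(coef_case1), case2 = unname(coef_case2))
```

In both cases the coefficient on x stays close to its true value, although in the first case the omitted variable ends up in the error term and makes the estimate noisier.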

A practical example

I have used the BWGHT dataset from Wooldridge to illustrate an example. It has observations with the weight at birth of a baby and some explanatory variables. You can download the dataset here.

Now initially, I conduct a simple regression –

Ln(Weight_at_Birth) = a + B(Ln(Family_Income)) + e

> lm (df$lbwght ~ df$lfaminc)
lm(formula = df$lbwght ~ df$lfaminc)
(Intercept)   df$lfaminc 
    4.69673      0.02061

This shows that a child's weight at birth is positively correlated with the family's income (both in logs). Fair enough, but is there an omitted variable bias? I go on to include the fatheduc variable, which measures the father’s years of education.

The new model becomes:

Ln(Weight_at_Birth) = a + B(Ln(Family_Income)) + J(Father’s_Education) + e

Which way would you guess the bias goes? fatheduc is plausibly positively correlated with both the child’s weight at birth and family income. As a result, B in the first regression should be biased upward.

> lm(df$lbwght ~ x_more)
lm(formula = df$lbwght ~ x_more)
(Intercept)  x_moreLnFamilyincome       x_moreFatheredu  
    4.672692              0.014738              0.003526 

And this is exactly what we find. The coefficient on family income falls from 0.0206 to 0.0147 once father's education is included. (I have deliberately not focused on the standard errors here.)
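For readers who want to try the comparison themselves, here is one possible way to set it up. It is a sketch under assumptions, not the exact code behind the output above: it assumes the CRAN package wooldridge is installed (it ships the bwght data), and it restricts both regressions to rows where fatheduc is observed so that the two coefficients come from the same sample. The estimates may therefore differ slightly from the ones printed above.

```r
# Assumes the CRAN package 'wooldridge' is installed; it ships the bwght data.
if (requireNamespace("wooldridge", quietly = TRUE)) {
  data("bwght", package = "wooldridge")

  # Keep rows where fatheduc is observed so both models use the same sample.
  df <- subset(bwght, !is.na(fatheduc))

  short_fit <- lm(lbwght ~ lfaminc, data = df)             # omits fatheduc
  long_fit  <- lm(lbwght ~ lfaminc + fatheduc, data = df)  # includes fatheduc

  # Coefficient on log family income, with and without father's education.
  print(c(short = unname(coef(short_fit)["lfaminc"]),
          long  = unname(coef(long_fit)["lfaminc"])))
}
```

Passing the variables through the data argument, rather than as df$lbwght, keeps the coefficient names readable in the output.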

Practical Considerations

The practical issue with omitted variable bias is that the econometrician may not be aware that a relevant variable is being omitted, or the data for an omitted variable may simply not exist. For example, when testing the impact of education on wages, one may want to include a variable to control for ability. However, it would be practically difficult to get hold of such a variable.

There is no statistical test that can tell you whether your model omits a relevant external variable. The RESET test in Stata can check whether higher-order terms of the already included variables have been omitted from your model, but it doesn’t account for external omitted variables. More on this is here.
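The idea behind RESET can be sketched by hand in base R: refit the model with powers of the fitted values added, and test whether they are jointly significant. The data below are simulated and the coefficients are illustrative assumptions. The lmtest package also provides a resettest() function for the same purpose.

```r
# Simulated data only: the true model includes x^2, which the fitted model omits.
set.seed(123)
n <- 2000
x <- rnorm(n)
y <- 1 + 2 * x + 1.5 * x^2 + rnorm(n)

fit  <- lm(y ~ x)     # misspecified: leaves out x^2
yhat <- fitted(fit)

# RESET idea: add powers of the fitted values and test them jointly (F-test).
fit_aug <- lm(y ~ x + I(yhat^2) + I(yhat^3))
reset   <- anova(fit, fit_aug)
reset_p <- reset[2, "Pr(>F)"]
reset_p   # a very small p-value flags the omitted nonlinearity
```

Since the added terms are functions of x alone, a test built this way has no way of picking up a variable that is missing from the model altogether, such as ability in a wage regression.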

Correctly specifying models is as much an art as a science. There are many more such issues that I would love to cover in the days to come. Until next time!



  1. Must admit that I was introduced to Paul Broca’s story during my Econometrics lectures by Prof Vassilis Hajivassiliou at the LSE.
  2. BWGHT (2000), Wooldridge data sets.
    Available at:
    (Accessed 15 September 2017)
  3. Gujarati D, and Porter D, and Gunasekar S (2009), Basic Econometrics, 5th ed, New Delhi: Tata McGraw Hill, Pg: 495-498.
  4. Omitted Variable Tests.
    Available at:
    (Accessed 17 September 2017)
  5. Schreider E (1966), Brain weight Correlations calculated from original results of Paul Broca,
    American Journal of Physical Anthropology.
    Available at:
    (Accessed 16 September 2017)

© 2019 Sujan Bandyopadhyay
