Category: Econometrics

Generating OLS Results Manually via R

Statistical software has made it extremely easy to run regression analyses. Functions like lm in R or the regress command in Stata give quick and well-compiled results. With this ease, however, people often never learn, or simply forget, how to carry out these analyses by hand. In this article, we manually recreate the regression results produced by the lm function in R.
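As a rough illustration of what "manually" means here, the following is a minimal sketch in base R on simulated data. It is not the code from the linked article; the variable names and the true coefficients are assumptions made up for the example. It computes the coefficients from the normal equations and the classical standard errors, then compares them with lm.

```r
# Simulated data; names and true coefficients are illustrative assumptions
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(Intercept = 1, x1 = x1, x2 = x2)   # design matrix with a constant

# Coefficients: solve the normal equations (X'X) b = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Residual variance and classical (homoskedastic) standard errors
res    <- y - X %*% beta_hat
sigma2 <- sum(res^2) / (n - ncol(X))
se     <- sqrt(diag(sigma2 * solve(crossprod(X))))

# Compare with lm
fit <- lm(y ~ x1 + x2)
manual_vs_lm <- cbind(manual    = drop(beta_hat),
                      lm        = coef(fit),
                      manual_se = se,
                      lm_se     = summary(fit)$coefficients[, "Std. Error"])
print(manual_vs_lm)
```

The two sets of columns should agree up to numerical precision. lm itself uses a QR decomposition rather than inverting X'X, which is numerically safer but gives the same answer on well-behaved data.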

Please click here to read the article!

Standard Deviation vs. Standard Error

Standard deviation and standard error are concepts that I have often mixed up and struggled to tell apart. As a result, I made a small presentation clarifying the difference between the two.
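A small sketch in base R of the distinction, using a hypothetical normal population (mean 50, standard deviation 10) chosen purely for illustration: the standard deviation describes the spread of individual observations, while the standard error of the mean describes how much the sample mean would vary across repeated samples.

```r
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)   # one sample of 100 draws (hypothetical population)

sd_x <- sd(x)                    # spread of the individual observations
se_x <- sd_x / sqrt(length(x))   # estimated spread of the sample mean

# The standard error should track the SD of sample means across repeated samples
means <- replicate(5000, mean(rnorm(100, mean = 50, sd = 10)))
c(sd = sd_x, se = se_x, sd_of_means = sd(means))
```

The standard deviation stays near 10 however large the sample is, while the standard error shrinks with the square root of the sample size.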

To read the presentation, please click here

Finite Populations: How do we think about them?

Consider how a typical regression output in Stata looks.

Sample Regression Output


We find that along with the coefficients, we get other values like standard errors, p-values, confidence intervals, etc. So what is the purpose of these statistics?

The standard assumption in the stat-econ literature is as follows: (i) the observed sample is a random, representative sample from the entire population, and (ii) we are interested in the true population parameters.

What does this mean? We assume that we pick a random representative sample from a population. After running a regression, we get a particular coefficient on each of the x’s. We assume that in repeated sampling, we would get a distribution of coefficients for the x’s. In fact, one of the fundamental properties of a good estimator is unbiasedness: the expected value of the coefficient over its sampling distribution is equal to the true coefficient. Although we rarely do repeated sampling in practice, we do check for the significance of the coefficients against a null hypothesis that they are equal to zero. Thus, in a normal regression framework, the population has the true coefficients and is quite distinct from the samples, which have a distribution of coefficients.
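The repeated-sampling idea can be sketched with a simulation in base R. Everything here is an assumption for illustration: a large artificial population with a true slope of 0.5, from which we repeatedly draw samples of 100 and re-estimate the slope.

```r
set.seed(123)
N <- 100000                              # large hypothetical population
pop_x <- rnorm(N)
pop_y <- 2 + 0.5 * pop_x + rnorm(N)      # true slope assumed to be 0.5

one_draw <- function(n = 100) {
  idx <- sample.int(N, n)                # random sample from the population
  coef(lm(pop_y[idx] ~ pop_x[idx]))[2]   # estimated slope in this sample
}

slopes <- replicate(2000, one_draw())
c(mean_slope = mean(slopes), sd_slope = sd(slopes))
```

The average of the estimated slopes should sit close to the true value (unbiasedness), while their standard deviation is what a standard error from a single sample is trying to estimate.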

So what happens when we do have the entire population?

This is a tricky question, as we usually assume that the population is infinite and that we can never observe all of it. Thus, with a finite population, like the 28 states of India, it is not obvious how one should interpret the regression output. I will attempt to summarise two popular views in the stat-econ literature in this regard.


One view says that if you have the entire population, the p-values and t-statistics are irrelevant. The finite population is considered to be a fixed set of elements. The coefficients derived are the true relationship, since you have the entire population and not a sample of it. Thus, the concepts of hypothesis testing and significance become meaningless, because these statistics and tests are only relevant for a sample. The caveat here is that the model has to be correctly specified.

Super Population / Underlying Process

This view states that the observations in the population are simply a sample from an infinite population. For example, if we’re looking at the 28 states in India, we can interpret the set of observations at one time period as a sample; and the set of observations across all time as the infinite super population.

This becomes important, especially if you want to make inferences not just about relationships today but also if you want the relationships to hold for similar groups in the future.

Another related but slightly different view is that the observed outcomes in the population are the products of an underlying process. In that case, what standard errors and other test statistics capture becomes relevant. According to Abadie et al. (2014), “there are for each unit missing potential outcomes for the treatment levels the unit was not exposed to.”

In a similar vein, Wallis and Roberts (1956) speak of “the totality of numbers that would result from indefinitely many repetitions of the same process of selecting objects, measuring or classifying them, and recording results.”

In this regard, test statistics become important in understanding whether the estimated coefficients reflect a genuine relationship or are merely the result of a chance outcome.



  1. Abadie A, Athey S, Imbens G, Wooldridge J (2014), Finite Population and Causal Standard Errors, NBER Working Paper Series.
    Available at:
  2. Asali M (2012), Can I make a regression model with the whole population? [Msg 1], ResearchGate.
    Message posted:
  3. Frick R (1998), Interpreting statistical testing: Process propensity, not population and random sampling, Behavior Research Methods, Instruments, & Computers, Vol. 30 (3), pp: 527-535
  4. Hartley H, Sielken Jr R (1975), A “Super-Population Viewpoint” for finite population sampling, Biometrics, Vol. 31 (2), pp: 411-422
  5. Hidiroglou, Michael Arsene (1974), Estimation of regression parameters for finite populations , Iowa State University Digital Repository.
    Available at:
  6. January (2013), How to report data for an entire population? [Msg 1], CrossValidated.
    Message posted to:



Simple Guide to Getting Orthogonalized Impulse Response Functions

Vector autoregressions (VARs) are used in time series analysis when there may be interdependencies among multiple time series. For example, we may want to understand the relationship between GDP, the current account balance, and inflation rates. Running a normal OLS regression would be inappropriate, as each variable affects the others: the OLS estimates would suffer from endogeneity and would therefore be biased. In such scenarios, we use VAR methods.

Recently, I was working on some time series data that suffered from reverse causality. As a result, I had to use a VAR model to get orthogonalized impulse response functions (OIRFs) in order to understand the relationship between the variables. In the attached presentation, I describe the theory behind this orthogonalization as well as the steps to generate OIRFs in Stata.
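The presentation works in Stata; as a language-neutral illustration of the orthogonalization step itself, here is a minimal base R sketch for a VAR(1) with known parameters. The coefficient matrix A and the error covariance Sigma are hypothetical values picked for the example, not estimates from any data. The orthogonalized response at horizon h is A^h multiplied by the lower-triangular Cholesky factor of Sigma.

```r
# Hypothetical VAR(1): y_t = A y_{t-1} + u_t, with Cov(u_t) = Sigma
A <- matrix(c(0.5, 0.1,
              0.2, 0.3), nrow = 2, byrow = TRUE)
Sigma <- matrix(c(1.0, 0.4,
                  0.4, 0.5), nrow = 2, byrow = TRUE)

# Lower-triangular Cholesky factor P with P %*% t(P) = Sigma
# (chol() returns the upper-triangular factor, hence the transpose)
P <- t(chol(Sigma))

# Orthogonalized response at horizon h is A^h %*% P
horizons <- 0:5
oirf <- array(NA_real_, dim = c(2, 2, length(horizons)))
Ah <- diag(2)                       # A^0
for (i in seq_along(horizons)) {
  oirf[, , i] <- Ah %*% P           # row = responding variable, column = shock
  Ah <- A %*% Ah
}
oirf[, , 1]   # impact responses: the variable ordered first does not react to shock 2
```

Because P is lower triangular, the results depend on the ordering of the variables: the first variable is assumed not to respond contemporaneously to shocks in the second. That ordering is a modelling choice, not something the data decide.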

To check the presentation, please click on this link

A to B or B to A? That is the Question: Be Careful about Reverse Causality

Douglas Adams, the author of The Hitchhiker’s Guide to the Galaxy, had said, “The complexities of cause and effect defy analysis.”  Human beings have, throughout history, often been stumped by these complexities.

In the middle ages, many Europeans believed that lice and good health were correlated. The reasoning was that sick people rarely had any lice on them.  Today we know, of course, that lice are extremely sensitive to body temperature and would leave the body of sick individuals. Thus, it was not that fewer lice caused sickness, but that sickness caused fewer lice.

Even economics has suffered from these issues. Alfred Marshall, in the third edition of his Principles of Economics, stated the Giffen Paradox, the curious exception to the law of demand:
“There are however some exceptions. For instance, Mr Giffen has pointed out, a rise in the price of bread makes so large a drain on the resources of the poorer labouring families and raises so much the marginal utility of money to them, that they are forced to curtail their consumption of meat and the more expensive farinaceous foods: and, bread being still the cheapest food which they can get and will take, they consume more, and not less of it.”

However, for over a century, economists have struggled to find comprehensive empirical evidence in support of this hypothesis. The key issue, as several studies have pointed out, is that higher consumption could be leading to higher prices, which may give the illusion of an upward-sloping demand curve. Indeed, in a market economy, the price is the equilibrium of two equations: the demand equation and the supply equation. We must not forget that quantity and price are positively related along the supply curve.

Closer to the present, the issue continues to haunt us. In 2010, Harvard University economists Carmen Reinhart and Kenneth Rogoff published a paper titled ‘Growth in a Time of Debt’. One of their key findings was

“…median growth rates for countries with public debt over 90 percent of GDP are roughly one percent lower than otherwise; … Countries with debt-to-GDP ratios above 90 percent have a slightly negative average growth rate, in fact.”

Their research was extremely influential: it was used by many fiscal hawks in the West to argue for reining in debt levels. George Osborne, former Chancellor of the Exchequer in the UK, famously said, “As Rogoff and Reinhart demonstrate convincingly, all financial crises ultimately have their origins in one thing.”

Alas, research by other economists into their work showed glaring oversights. Technical errors (yes, there was one) aside, economists pointed out that this could simply be a case of reverse causality, i.e., that it was low growth that caused high debt levels instead (this was acknowledged by Reinhart and Rogoff).

So why is reverse causality an issue in econometrics? In econometrics, we try to avoid endogeneity, as it gives us biased and inconsistent estimates. Endogeneity means that an explanatory variable is correlated with the error term.

So how does reverse causality lead to this situation? Picture this:

I am using Telecom Regulatory Authority of India quarterly data which has price (Average Revenue per User) and quantity (Minutes of Usage) data.

Suppose I have the following equations:

MoU = a + B*ARPU + error_1  ….. (i)

I would expect a negative correlation between MoU and ARPU based on the law of demand (ignoring omitted variables for the moment). However, I forget that higher consumption of minutes leads to higher prices as well, i.e.:

ARPU = b + C*MoU + error_2 ……(ii)

So why is ARPU endogenous in the first equation? Simple: imagine a shock to error_1 that leads to higher MoU. Since MoU rises, equation (ii) implies that ARPU rises as well. As a result, a change in error_1 leads to a change in ARPU in equation (i). Thus, there is endogeneity.

Visually:
error_1 goes up -> MoU goes up -> ARPU goes up
Thus, error_1 and ARPU are correlated!

Here, I run a simple OLS regression of MoU on ARPU and get the following results:
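The chain above can be checked with a small simulation in base R. This is a sketch under assumed parameter values (a true demand slope B of -1, a feedback coefficient C of 0.5, and made-up intercepts and shock sizes), not the TRAI data. The two equations are solved jointly so that both hold at once.

```r
set.seed(2017)
n <- 5000
B <- -1     # assumed true demand slope: higher price, lower usage
C <- 0.5    # assumed feedback from usage to price
a <- 10; b <- 1            # assumed intercepts
e1 <- rnorm(n, sd = 2)     # demand shock (error_1)
e2 <- rnorm(n, sd = 0.5)   # price shock (error_2)

# Solve equations (i) and (ii) jointly (the reduced form)
ARPU <- (b + C * a + C * e1 + e2) / (1 - C * B)
MoU  <- a + B * ARPU + e1

cor(e1, ARPU)                    # positive: the regressor is correlated with the error
ols_slope <- unname(coef(lm(MoU ~ ARPU))[2])
ols_slope                        # far from B = -1, and of the wrong sign here
```

With these particular values the feedback is strong enough to flip the sign of the OLS slope, even though the true demand relationship is downward sloping.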

Balanced Panel: n=22, T=28, N=616 
Residuals :    Min.  1st Qu.   Median  3rd Qu.     Max. 
              -102.730  -33.519  -12.179   26.099  250.750  
Coefficients :   Estimate Std. Error t-value  Pr(>|t|)   
 df_subset2$ARPU 1.371101   0.076506  17.921 < 2.2e-16 ***

So a simple OLS regression suggests that an increase in price is associated with an increase in minutes of usage per user. But we now know the reason behind this counter-intuitive result.

Similar to my results, Steven D. Levitt from the University of Chicago found that studies looking at the impact of greater police deployment on crime rates may wrongly find a positive correlation between the two. This is because higher crime may itself lead to greater police deployment in an area, and thus the results would be flawed.

In conclusion, it is very important to know that we’ve got the direction of causation right in our model. Otherwise, it is very likely that we are making incorrect inferences. However, in some cases, data with reverse causality or simultaneity may be unavoidable. What do we do then? Instrumental variables estimation is a technique designed for these kinds of situations. Steven D. Levitt used it to study the effect of greater police deployment on crime rates.
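To give a flavour of how instrumental variables estimation works, here is a minimal two-stage least squares sketch in base R on simulated data. The instrument Z is hypothetical: a variable assumed to shift price without entering the demand equation. All parameter values are made up for the illustration, and this is not Levitt's specification.

```r
set.seed(7)
n <- 5000
B <- -1; C <- 0.5; G <- 1          # assumed structural parameters
a <- 10; b <- 1
Z  <- rnorm(n)                      # hypothetical instrument: shifts price, not demand
e1 <- rnorm(n, sd = 2)
e2 <- rnorm(n, sd = 0.5)

# Price now also depends on Z; solve the two equations jointly
ARPU <- (b + C * a + G * Z + C * e1 + e2) / (1 - C * B)
MoU  <- a + B * ARPU + e1

ols <- unname(coef(lm(MoU ~ ARPU))[2])

# Two-stage least squares by hand
ARPU_hat <- fitted(lm(ARPU ~ Z))                 # stage 1: price explained by the instrument
tsls <- unname(coef(lm(MoU ~ ARPU_hat))[2])      # stage 2: usage on predicted price

c(ols = ols, tsls = tsls, truth = B)
```

The second-stage coefficient is the IV estimate, but the standard errors printed by that second lm call are not valid, because they ignore that ARPU_hat is itself estimated. Dedicated IV routines correct for this.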


  1. Cassidy J (2013), The Reinhart and Rogoff Controversy: A Summing Up, The New Yorker.
    Available at:
    (Accessed 20 October 2017)
  2. Europeans in the Middle Ages Believed Lice were a Sign of Good Health (2017), The Vintage News.
    Available at:
    (Accessed 20 October 2017)
  3. Jensen R and Miller N, Giffen Behavior and Subsistence Consumption, The American Economic Review,  Vol. 98, No. 4 (Sep., 2008), pp. 1553-1577
  4. Krugman P (2010), Reinhart and Rogoff are Confusing Me, The New York Times.
    Available at:
    (Accessed 20 October 2017)
  5. Krugman P (2013), Reinhart-Rogoff, Continued, The New York Times.
    Available at:
    (Accessed 20 October 2017)
  6. Performance Indicator Reports, Telecom Regulatory Authority of India.
    Available at:
    (Accessed 19 October 2017)
  7. Levitt S (1997), Using Electoral Cycles in Police Hiring to Estimate the Effect of Police on Crime, The American Economic Review, Volume 87, Issue 3 (Jun., 1997), 270-290.
  8. Lyons J (2013), George Osborne favourite “godfathers of austerity” economists agree to making error in research, Mirror
    Available at:
    (Accessed 20 October 2017)
  9. Reinhart C and Rogoff K (2010), Growth in a Time of Debt, working paper at National Bureau of Economic Research
    Available at:
    (Accessed 20 October 2017)
  10. Reinhart C and Rogoff K (2013), Reinhart-Rogoff Response to Critique, The Wall Street Journal.
    Available at:
    (Accessed 20 October 2017)
  11. Stigler G (1947), Notes on the History of the Giffen Paradox, The University of Chicago Press Journals, Vol. 55, No. 2 (Apr., 1947), pp. 152-156

Don’t Forget about the Losers: Why We Should Worry about Survivorship Bias

“Go to ABC University. Though the fee is a little expensive, you’re guaranteed a great job after that. Just look at Rajesh. He has got such a great job at XYZ Company after university.”

We often hear these kinds of statements in our day to day lives. We should always be wary of such statements and check if they are nothing but a case of survivorship bias.

Survivorship bias is a logical fallacy that results in people focusing on successful cases while overlooking unsuccessful cases, usually because of their lack of visibility. No one likes to focus on a loser, right?

Forget about going to university: don’t we also hear stories about how college drop-outs like Mark Zuckerberg and Steve Jobs went on to become extremely successful? This article by The Atlantic beautifully illustrates the dangers in believing these seductive stories. Here’s an extract,

“Like any myth, this story has a kernel of truth: There are exceptional individuals whose hard work, determination, and intelligence make up for the lack of a college degree. If they could do it, one might think, why can’t everybody? Such a question ignores the outlier status of these exceptional drop-out entrepreneurs and innovators. Those who are able to achieve such success often rely on a set of skills already developed before they get to college….But what happens to young people without access to these important resources? For them, skipping college to pursue business success is like investing their savings in lottery tickets in the hopes they will be a multimillion-dollar winner… The reality is that the next college dropout will not be LeBron James, James Cameron, or Mark Zuckerberg. He will likely belong to the millions of college drop-outs you don’t hear the press singing about.”


In Economics and Finance

Survivorship bias can lead to severe flaws not just in our day-to-day logic but also in economic and financial analyses. Ajay Shah, a noted Indian economist, points out on his blog how people tend to over-estimate corporate earnings growth by using the current crop of Nifty 50 companies as a proxy for Indian corporations. His logic is simple: the index undergoes periodic revisions in which some companies are dropped and new ones are added. Since the new companies usually tend to be doing better than the ones that drop out, any analysis that uses the latest crop of companies to analyze earnings over a long time period will tend to over-estimate earnings growth.

Similarly, studies have shown how mutual fund companies tend to over-estimate their returns over time as they drop their under-performing funds.
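The index argument can be illustrated with a toy simulation in base R. The numbers are entirely made up: 500 firms that all start with earnings of 100 and draw a random growth rate over the period. Measuring growth only on the firms that are largest at the end overstates the growth of the corporate sector as a whole.

```r
set.seed(50)
n_firms <- 500
start  <- rep(100, n_firms)                        # assumption: identical starting earnings
growth <- rnorm(n_firms, mean = 0.05, sd = 0.20)   # hypothetical growth over the period
end    <- start * (1 + growth)

# "Today's index": the 50 firms with the largest earnings at the end of the period
index_today <- order(end, decreasing = TRUE)[1:50]

growth_all   <- mean(growth)                # growth of the whole corporate sector
growth_index <- mean(growth[index_today])   # growth measured on today's constituents only
c(all_firms = growth_all, todays_index = growth_index)
```

The gap between the two numbers is pure selection: the index constituents were chosen after the fact, on the very outcome being measured.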


Econometric Models

We should also be careful while dealing with econometric models. Economists tend to prefer balanced panel data, as it is conducive to a clean analysis. A balanced panel is a data set in which each individual has the same number of observations through time. However, in the real world, we often get unbalanced panels. Data can be missing for various reasons, and though we are often tempted to balance the panel by deleting the individuals with incomplete records, we should be careful. In fact, unless the causes behind the missing observations are entirely random, we should not balance the panel. We may end up deleting valuable information.

Suppose, for example, that we have a dataset of firms over a 40-year period. We see that some firms do not have observations over the entire period, as they closed down in the interim. If we delete these firms, we will probably end up with selection bias, as there may be certain characteristics these firms possess because of which they had to shut down; something we will miss in our analysis when we delete them.

As an example, I used NYU Stern’s Swiss Railways data set, an unbalanced panel with observations on 49 companies across 13 years. I estimated the following model:
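A stylised simulation in base R shows how non-random exit can distort a regression. The set-up is an assumption made for illustration (it is not the Swiss Railways data): the true slope is 2, and firms with the highest values of the outcome are the ones that shut down.

```r
set.seed(99)
n <- 2000
x <- rnorm(n)                       # firm characteristic (hypothetical)
y <- 1 + 2 * x + rnorm(n, sd = 2)   # outcome, e.g. log costs; true slope assumed to be 2

# Non-random exit: firms with the highest outcome values shut down
survived <- y < quantile(y, 0.75)

slope_all       <- unname(coef(lm(y ~ x))[2])
slope_survivors <- unname(coef(lm(y[survived] ~ x[survived]))[2])
c(all_firms = slope_all, survivors_only = slope_survivors)
```

Using only the survivors pulls the estimated slope towards zero, because exit depends on the outcome itself. If exit were unrelated to the outcome, dropping firms would cost precision but not introduce this bias.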

Ln_Costs = a + B*(Network) + C*(Staff) + D*(Stops) + E*(Tunnel)

I wanted to test how firms’ costs were related to Network (total length of their railway network), Staff (number of employees), Stops (number of stations), and Tunnel (a dummy for whether their network had long tunnels).

Out of the 49 firms in question, only 37 were present for the entire period. Most of the other firms had exited before the final year of the data set.

I first conducted a balanced panel analysis and got the following results:

              Estimate Std. Error t-value  Pr(>|t|)
(Intercept) 1.0546e+01 1.1908e-01 88.5607 < 2.2e-16 ***
NETWORK     3.6950e-06 2.6079e-06  1.4168   0.15719
STAFF       2.2411e-03 4.6429e-04  4.8270  1.87e-06 ***
STOPS       9.6047e-03 3.7608e-03  2.5539   0.01096 *
TUNNEL      1.9544e-01 2.9842e-01  0.6549   0.51283


I then conducted an analysis on the unbalanced panel, using all the companies:

              Estimate Std. Error t-value  Pr(>|t|)
(Intercept) 1.0551e+01 1.0727e-01 98.3622 < 2.2e-16 ***
NETWORK     5.1834e-06 1.8577e-06  2.7902 0.0054354 **
STAFF       1.5593e-03 2.7549e-04  5.6603 2.345e-08 ***
STOPS       1.1440e-02 3.4306e-03  3.3348 0.0009062 ***
TUNNEL      2.7512e-01 2.4753e-01  1.1115 0.2668046


You can see how different the results are. NETWORK was not significant in the balanced panel, while it is significant and larger (about 5.2e-06 against 3.7e-06) in the unbalanced panel, which suggests that it plays a larger role in total costs than the balanced panel implies. The coefficient on STAFF is smaller in the unbalanced panel (about 0.0016 against 0.0022), while the coefficient on STOPS is slightly larger (about 0.0114 against 0.0096). The results probably indicate that the firms that ceased to exist differed systematically from the survivors, for example by having large networks relative to the staff they employed.

To conclude, we must always be careful before falling into the trap of survivorship bias – whether it’s in our daily lives or while conducting economic analyses.


References :

  1. Elton E, Gruber M, and Black C (1996), Survivorship Bias and Mutual Fund Performance, The Review of Financial Studies.
    Available at:
    (Accessed 24 September 2017)
  2. Shah A (2017), Indian Corporations have Weak Earnings Growth, Ajay Shah’s Blog.
    Available at:
    (Accessed 15 September 2017)
  3. Shermer M (2014), How the Survivorship Bias Distorts Reality, Scientific American.
    Available at:
    (Accessed 29 September 2017)
  4. Survivorship Bias, Investopedia
    Available at:
    (Accessed 1 October 2017)
  5. Swiss Railways, Panel Data Sets, Prof W. Greene – NYU Stern.
    Available at:
    (Accessed 28 September 2017)
  6. Zimmer R (2013), The Myth of the Successful College Dropout: Why it Could Make Millions of Young Americans Poorer, The Atlantic.
    Available at:
    (Accessed 1 October 2017)

Regression 101: Don’t Forget to add all the Ingredients to Bake the Cake (Omitted Variable Bias)

With the advent of new statistical software and tools, conducting a simple cross-sectional regression sounds quite easy. Right? Not really. Specifying an appropriate model is very important and no tool apart from the econometrician’s brain can help her do that. Speaking of brain, let me narrate an interesting anecdote.

Paul Broca was a renowned French neurologist, surgeon, and anthropologist in the 19th century. He believed that brain weight correlates with intelligence. Based on the autopsies he had conducted in Paris hospitals, he found that female brains weigh on average about 15% less than male brains, and concluded that men are more intelligent than women. Science has subsequently shown that his ideas about brain weight weren’t accurate (elephants’ and blue whales’ brains are several times heavier than human brains), but was his model correctly specified in the first place?

What I am hinting at is the case of an omitted variable bias. This is one of the most basic reasons why the results of a regression may not be relevant. The concept is simple enough – while conducting a regression, if the econometrician forgets to include a relevant explanatory variable in the model then the regression doesn’t give reliable results. In other words, the coefficients that one gets on the explanatory variables are biased/inconsistent and the model is said to have an endogeneity problem.

Allow me to demonstrate:

Suppose the real model is –

y = a + Bx + Cz + e

However, the econometrician forgets about z and instead runs –

y = a + Bx + e

And suppose z can be determined as a function of x in the form of:

z = v + Dx + u

Then what the misspecified model actually estimates is:

y = a + Bx + C(v + Dx + u) + e

which simplifies to:

y = (a+Cv) + (B+CD)x + (e+Cu)

As we can clearly see, the coefficient on x is biased. The true coefficient is B, but what the model would throw up is (B+CD).
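The algebra can be verified numerically. Below is a minimal base R sketch on simulated data, with parameter values (B = 1, C = 2, D = 0.5, v = 1, a = 3) assumed purely for illustration. The slope from the regression that omits z should equal B + CD, and within a given sample it equals the long-regression coefficient on x plus the coefficient on z times the slope from regressing z on x.

```r
set.seed(10)
n <- 10000
B <- 1; C <- 2; D <- 0.5; v <- 1        # assumed true parameters
x <- rnorm(n)
z <- v + D * x + rnorm(n)               # z = v + Dx + u
y <- 3 + B * x + C * z + rnorm(n)       # y = a + Bx + Cz + e, with a = 3

short <- coef(lm(y ~ x))        # omits z
long  <- coef(lm(y ~ x + z))    # correctly specified
D_hat <- coef(lm(z ~ x))[2]     # auxiliary regression of z on x

c(short_slope = unname(short[2]),
  long_slope_plus_bias = unname(long["x"] + long["z"] * D_hat))
```

Both numbers should be close to B + CD = 2, and they match each other exactly up to rounding, since the decomposition is an in-sample identity of OLS.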

How does this relate to Paul Broca? Well the regression he ran was of the form:

Brain_Weight = a + B(Male) + n

Where Male was a dummy variable which took the value of 1 if the observation belonged to a male, and 0 if the observation belonged to a female. Here, B was the differential between male and female average brain weights, and he got a positive coefficient on it. However, the model suffered from omitted variable bias. For example, men tend to have more body mass than women. How would this affect the model? The new model now reads:

Brain_Weight = a + B(Male) + P(Body_Mass) + n

Using the earlier logic, we can conclude that the coefficient we get on Male in Broca’s specification is incorrect. So what exactly was the bias on B, the key coefficient?

E(B_hat) = B + P * Cov(Male, Body_Mass) / Var(Male)

where B_hat is the estimate of B from the regression that omits Body_Mass.

The bias is most likely to be positive as:

  • P is positive: brain weight and body mass are positively correlated.
  • Males, on average, tend to have more body mass than females, so Cov(Male, Body_Mass) is positive.
  • Hence the bias term P * Cov(Male, Body_Mass) / Var(Male) is positive, and the model would overestimate B.

Similarly, we can argue that many more relevant variables like Age_at_Death were missed out by Paul Broca (men tend to die earlier than women).


Is it always a problem?

Omitting variables is not always a problem. The situations when it’s not a problem include:

  • The omitted variable is uncorrelated with the included explanatory variables.
  • The omitted variable is irrelevant or doesn’t determine the dependent variable.
  • More importantly, it depends on the underlying theoretical framework and question that the econometrician is trying to answer.

A practical example

I have used the BWGHT dataset from Wooldridge to illustrate an example. It has observations with the weight at birth of a baby and some explanatory variables. You can download the dataset here.

Now initially, I conduct a simple regression –

Ln(Weight_at_Birth) = a + B*Ln(Family_Income) + e

> lm (df$lbwght ~ df$lfaminc)
lm(formula = df$lbwght ~ df$lfaminc)
(Intercept)   df$lfaminc 
    4.69673      0.02061

This shows that a child’s weight at birth is positively correlated with family income. Fair enough, but is there an omitted variable bias? I go on to include the fatheduc variable, which measures the father’s years of education.

The new model becomes:

Ln(Weight_at_Birth) = a + B*Ln(Family_Income) + J(Father’s_Education) + e

Which way would you guess the bias goes? We can say that fatheduc is positively correlated with both the child’s weight at birth and family income. As a result, B should be biased upward in the first regression.

> lm(df$lbwght ~ x_more)
lm(formula = df$lbwght ~ x_more)
(Intercept)  x_moreLnFamilyincome       x_moreFatheredu  
    4.672692              0.014738              0.003526 

And this is exactly what we find. We can clearly see that the coefficient on family income has fallen, from 0.0206 to 0.0147. (I have deliberately not focused on the standard errors here.)

Practical Considerations

The practical issue with omitted variable bias is that the econometrician may not be aware that a relevant variable is being omitted, or the data for an omitted variable may simply not exist. For example, when testing the impact of education on wages, one may want to include a variable to control for ability. However, it would be practically difficult to get hold of such a variable.

There is no statistical test to check whether your model has an omitted variable. While the RESET test in Stata can check whether higher-order forms of already included variables have been omitted from your model, it doesn’t account for external omitted variables. More on this is here.

Correctly specifying models is as much an art as a science. There are many more such issues that I would love to cover in the days to come. Until next time!
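To make concrete what a RESET-type check does, here is a minimal sketch of the idea in base R on simulated data. It is not the Stata command, and the data-generating process (a quadratic relationship fitted with a linear model) is an assumption for the example. The fitted model is augmented with powers of its own fitted values, and an F-test checks whether they add explanatory power.

```r
set.seed(5)
n <- 500
x <- runif(n, 0, 4)
y <- 1 + x + 0.5 * x^2 + rnorm(n)      # true relationship is quadratic (assumption)

m0   <- lm(y ~ x)                      # misspecified linear model
yhat <- fitted(m0)
m1   <- lm(y ~ x + I(yhat^2) + I(yhat^3))   # augment with powers of the fitted values

reset <- anova(m0, m1)                 # F-test that the added terms are jointly zero
reset
p_value <- reset[["Pr(>F)"]][2]
```

A small p-value signals that some nonlinear function of the included regressors is missing. As noted above, this says nothing about a genuinely external variable such as ability, which cannot be built from the variables already in the model.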



  1. Must admit that I was introduced to Paul Broca’s story during my Econometrics lectures by Prof Vassilis Hajivassiliou at the LSE.
  2. BWGHT (2000), Wooldridge data sets.
    Available at:
    (Accessed 15 September 2017)
  3. Gujarati D, and Porter D, and Gunasekar S (2009), Basic Econometrics, 5th ed, New Delhi: Tata McGraw Hill, Pg: 495-498.
  4. Omitted Variable Tests.
    Available at:
    (Accessed 17 September 2017)
  5. Schreider E (1966), Brain weight Correlations calculated from original results of Paul Broca,
    American Journal of Physical Anthropology.
    Available at:
    (Accessed 16 September 2017)

© 2018 Sujan Bandyopadhyay
