Consider how a typical regression output in Stata looks.
Along with the coefficients, we get other values such as standard errors, p-values and confidence intervals. So what is the purpose of these statistics?
The standard assumption in the stat-econ literature is as follows: i) the observed sample is a random, representative sample from the entire population, and ii) we are interested in the true population parameters.
What does this mean? We assume that we pick a random, representative sample from a population. After running a regression, we get a particular coefficient on each of the x’s. If we repeated the sampling many times, we would get a distribution of coefficients for each x. In fact, one of the fundamental properties of a good estimator is unbiasedness: the expected value of the coefficient over this sampling distribution is equal to the true coefficient. Although we rarely do repeated sampling in practice, we do test the significance of the coefficients against the null hypothesis that they are equal to zero. Thus, in a normal regression framework, the population has the true coefficients and is quite distinct from the samples, whose coefficients vary from one sample to the next.
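The idea of repeated sampling can be made concrete with a small simulation. The sketch below is only an illustration under made-up assumptions: a single regressor, a hypothetical true intercept of 1 and slope of 2, normally distributed noise, and samples of 50 observations. None of these numbers come from real data.

```python
import random
import statistics

TRUE_INTERCEPT = 1.0  # hypothetical population intercept (assumption)
TRUE_SLOPE = 2.0      # hypothetical population slope (assumption)


def ols_slope(xs, ys):
    """Slope of a simple regression of y on x (with an intercept)."""
    mx = statistics.fmean(xs)
    my = statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def draw_sample(rng, n):
    """Draw n observations from y = intercept + slope * x + noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + rng.gauss(0, 3) for x in xs]
    return xs, ys


rng = random.Random(0)

# Repeat the "draw a sample, run the regression" step many times.
slopes = [ols_slope(*draw_sample(rng, 50)) for _ in range(2000)]

mean_slope = statistics.fmean(slopes)  # centre of the sampling distribution
sd_slope = statistics.stdev(slopes)    # spread of the sampling distribution

print(f"true slope:               {TRUE_SLOPE}")
print(f"mean of estimated slopes: {mean_slope:.3f}")
print(f"sd of estimated slopes:   {sd_slope:.3f}")
```

Each individual sample gives a different slope, but the average over many samples should sit close to the true value, which is what unbiasedness means. The spread of the estimates is what a standard error tries to approximate from a single sample.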
So what happens when we do have the entire population?
This is a tricky question, as we usually assume that the population is infinite and that we can never observe all of it. Thus, in cases of a finite population, like the 28 states of India, it is not obvious how one should interpret the regression output. I will attempt to summarise two popular views in the stat-econ literature in this regard.
One view says that if you have the entire population, the p-values and t-statistics are irrelevant. The finite population is considered to be a fixed set of elements. The coefficients derived describe the true relationship, since you have the entire population and not a sample from it. Thus, the concepts of hypothesis testing and significance become meaningless for an entire population, because these statistics and tests are only relevant for a sample. The caveat here is that the model has to be correctly specified.
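A rough way to see this view in action is to treat a small data set as the whole population and look at what happens to sampling variability as the sample approaches the full population. The sketch below uses 28 made-up units (the numbers are invented for illustration, not data on Indian states) and draws subsets without replacement.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Slope of a simple regression of y on x (with an intercept)."""
    mx = statistics.fmean(xs)
    my = statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


rng = random.Random(1)
N = 28  # size of the finite population

# A fixed, fully observed population (made-up values, an assumption).
pop_x = [rng.uniform(0, 10) for _ in range(N)]
pop_y = [1.0 + 2.0 * x + rng.gauss(0, 3) for x in pop_x]

# With every unit observed, the slope is a fixed descriptive quantity.
population_slope = ols_slope(pop_x, pop_y)


def slope_spread(n, draws=1000):
    """SD of the slope across random subsets of size n, without replacement."""
    estimates = []
    for _ in range(draws):
        idx = sorted(rng.sample(range(N), n))
        estimates.append(
            ols_slope([pop_x[i] for i in idx], [pop_y[i] for i in idx])
        )
    return statistics.stdev(estimates)


spread_by_n = {n: slope_spread(n) for n in (10, 20, 28)}

print(f"population slope: {population_slope:.3f}")
for n, spread in spread_by_n.items():
    print(f"n = {n:2d}: sd of slope across subsets = {spread:.4f}")
```

When n equals N, every "sample" is the same 28 units, so the slope never changes and there is no sampling variability left to measure. This is the sense in which the first view treats standard errors as irrelevant once the whole population is in hand.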
Super Population / Underlying Process
This view states that the observations in the finite population are simply a sample from an infinite population. For example, if we’re looking at the 28 states in India, we can interpret the set of observations at one time period as a sample, and the set of observations across all time as the infinite super population.
This becomes important if you want to make inferences not just about relationships today, but also about whether those relationships will hold for similar groups in the future.
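The super population idea can also be sketched in code. Here the 28 units and their x values are held fixed, and each "time period" produces a fresh set of outcomes from the same hypothetical process. The true slope of 2 and the noise level are assumptions chosen purely for illustration.

```python
import random
import statistics

TRUE_SLOPE = 2.0  # hypothetical slope of the underlying process (assumption)
N_STATES = 28


def ols_slope(xs, ys):
    """Slope of a simple regression of y on x (with an intercept)."""
    mx = statistics.fmean(xs)
    my = statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


rng = random.Random(2)

# The same 28 units every period, with fixed characteristics (made up).
state_x = [rng.uniform(0, 10) for _ in range(N_STATES)]


def observe_one_period(rng):
    """One realisation of outcomes for all 28 units."""
    return [1.0 + TRUE_SLOPE * x + rng.gauss(0, 3) for x in state_x]


# Each period we observe the entire finite population and fit the regression.
period_slopes = [
    ols_slope(state_x, observe_one_period(rng)) for _ in range(2000)
]

mean_period_slope = statistics.fmean(period_slopes)
sd_period_slope = statistics.stdev(period_slopes)

print(f"mean slope across periods: {mean_period_slope:.3f}")
print(f"sd of slope across periods: {sd_period_slope:.3f}")
```

Even though all 28 units are observed in every period, the fitted slope still moves from one period to the next. Under this view, that period-to-period variation is what the standard error is meant to describe.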
Another related but slightly different view is that the observed outcomes in the population are the products of an underlying process. In that regard, what standard errors and other test statistics capture becomes relevant. According to Abadie et al. (2014), “there are for each unit missing potential outcomes for the treatment levels the unit was not exposed to.”
In a similar vein, Wallis and Roberts (1956) speak of “the totality of numbers that would result from indefinitely many repetitions of the same process of selecting objects, measuring or classifying them, and recording results.”
In this regard, test statistics become important in understanding whether the estimated coefficients reflect a true relationship or have arisen by chance.
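As a concrete illustration of the test statistics in question, the sketch below computes the slope, its conventional standard error and the t-statistic for a simple regression on 28 made-up observations. The data-generating numbers are assumptions, and the critical value 2.056 is the approximate two-sided 5% cutoff for a t distribution with 26 degrees of freedom.

```python
import math
import random
import statistics


def slope_with_se(xs, ys):
    """Return the OLS slope and its conventional standard error."""
    n = len(xs)
    mx = statistics.fmean(xs)
    my = statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares, with n - 2 degrees of freedom.
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


rng = random.Random(3)

# 28 made-up observations from a hypothetical process with slope 2.
xs = [rng.uniform(0, 10) for _ in range(28)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 3) for x in xs]

slope, se = slope_with_se(xs, ys)
t_stat = slope / se  # test of the null hypothesis that the slope is zero
T_CRIT = 2.056       # approx. two-sided 5% critical value, 26 df

print(f"slope = {slope:.3f}, se = {se:.3f}, t = {t_stat:.2f}")
print("reject H0: slope = 0" if abs(t_stat) > T_CRIT else "fail to reject H0")
```

Under the process view, a large t-statistic says the estimated slope is unlikely to be a chance product of one particular realisation of the process, even when all 28 units have been observed.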
- Abadie A, Athey S, Imbens G, Wooldridge J (2014), Finite Population and Causal Standard Errors, NBER Working Paper Series.
Available at: http://www.nber.org/papers/w20325.pdf
- Asali M (2012), Can I make a regression model with the whole population? [Msg 1], ResearchGate.
Message posted: https://www.researchgate.net/post/Can_I_make_a_regression_model_with_the_whole_population
- Frick R (1998), Interpreting statistical testing: Process and propensity, not population and random sampling, Behavior Research Methods, Instruments, & Computers, Vol. 30 (3), pp: 527-535
- Hartley H, Sielken Jr R (1975), A “Super Population-Viewpoint” for finite population sampling, Biometrics, Vol. 31 (2), pp: 411-422
- Hidiroglou, Michael Arsene (1974), Estimation of regression parameters for finite populations, Iowa State University Digital Repository.
Available at: http://lib.dr.iastate.edu/rtd/5146
- January (2013), How to report data for an entire population? [Msg 1], CrossValidated.
Message posted to: https://stats.stackexchange.com/questions/70296/how-to-report-data-for-an-entire-population