A Crash Course on Estimation — A Data Science Approach

Vahid Naghshin
11 min read · Nov 23, 2021


Estimation of population parameters is the foundation of statistical inference.

Statisticians usually wish to make an inference, or draw a conclusion, about a population, which is defined as all the possible observations of interest. The collection of observations taken from the population is called a sample, and the number of observations in it is the sample size, usually denoted by n. The sample is a subset of all possible observations, i.e., of the population. Measured characteristics of the sample are called statistics, such as the sample mean or the sample median. A characteristic of the population is called a population parameter. Population parameters are hidden because we do not have access to the whole population; measuring every possible observation is usually infeasible for both economic and practical reasons.

The basic method of collecting samples from an intended population is simple random sampling. This means that every observation in the population has the same chance of being selected for the sample. The aim is to collect the data so that no systematic bias creeps into the resulting sample. If the sample is not random, we cannot be confident that it is representative of the population we intend to capture.

Defining the population is the very first step in drawing a conclusion. The temporal and spatial limits of the population should be characterised, because they also limit the scope of the statistical inference. Under one treatment, population parameters such as the mean are assumed to be fixed quantities (for ergodic distributions) and therefore cannot be considered random variables. This treatment is known as the frequentist approach in statisticians' jargon. It contrasts with the Bayesian approach, where population parameters are treated as random variables. For the samples, however, it is a different story. Sample statistics are random variables because they are functions of the sampling process, and hence have probabilities attached to them. They therefore have a probability distribution, called the sampling distribution.

What do you mean by a good estimator?

A good estimator of a population parameter should have the following characteristics:

  1. It should be unbiased, meaning that the expected value of its sampling distribution equals the true value of the population parameter.
  2. It should be consistent: as the sample size increases, it should approach the true value of the population parameter.
  3. It should be efficient, meaning that it has the lowest variance among competing estimators. For example, the sample mean is a more efficient estimator of the population mean of a normally distributed variable than the sample median, even though both estimate the same population value (the mean and median of a normal distribution coincide). A small simulation sketch follows this list.
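
As an illustration of efficiency, the following minimal sketch (not from the original article) simulates many samples from a normal population with NumPy, assuming a toy setup of 10,000 samples of size 50, and compares the spread of the sample mean and the sample median:

```python
import numpy as np

rng = np.random.default_rng(42)      # assumed seed, for reproducibility
n_samples, n = 10_000, 50            # number of simulated samples and sample size (toy values)

# Draw many samples of size n from a normal population with mu = 0, sigma = 1
data = rng.normal(loc=0.0, scale=1.0, size=(n_samples, n))

means = data.mean(axis=1)
medians = np.median(data, axis=1)

# Both estimators target the same population value (0 here), but the sample
# mean varies less across samples, i.e. it is the more efficient estimator.
print("variance of sample means:  ", means.var(ddof=1))
print("variance of sample medians:", medians.var(ddof=1))
```

For normal data the variance of the sample median is larger by a factor approaching π/2 ≈ 1.57, which is exactly what "efficiency" describes.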

Common Parameters and Statistics

Consider a population of observations of a variable Y measured on all N sampling units in the population. We take a random sample of n observations (y1, y2, …, yi, …, yn) from that population. We are usually interested in two related characteristics of the population: its location or central tendency (where is the middle of the distribution, or what is a typical value?) and its spread (how different are the observations from one another?).

Centre (location) parameters

Estimators of the location of a distribution fall into three classes.

  1. The first is the L-estimator, where the observations are ordered from smallest to largest and a linear weighted combination of these order statistics is taken. The sample mean is an L-estimator in which every observation gets the same weight (1/n). Other common L-estimators are the median, the trimmed mean, and the Winsorized mean.
  2. The second is the M-estimator, where the weighting given to the observations changes gradually away from the middle of the sample, and a measure of variability is incorporated in the estimation procedure. M-estimators include the Huber M-estimator and the Hampel M-estimator, which use different functions to weight the observations.

  3. Finally, R-estimators are based on the ranks of the data rather than the observations themselves, and they form the basis of many rank-based “non-parametric” tests. The most common R-estimator is the Hodges–Lehmann estimator, which is the median of the averages of all possible pairs of observations. A short code sketch of several of these location estimators follows this list.
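
To make these classes concrete, here is a minimal sketch (not from the original article) using a made-up toy sample; it computes a few L-estimators with SciPy and the Hodges–Lehmann R-estimator by hand, and skips the M-estimators for brevity:

```python
import numpy as np
from itertools import combinations_with_replacement
from scipy import stats

y = np.array([3.1, 2.4, 5.7, 4.0, 3.8, 9.9, 4.4, 3.5, 4.2, 3.9])  # toy sample

# L-estimators: linear weighted combinations of the ordered observations
sample_mean = y.mean()                                    # equal weights 1/n
sample_median = np.median(y)
trimmed_mean = stats.trim_mean(y, proportiontocut=0.1)    # drop 10% of each tail
winsorized_mean = stats.mstats.winsorize(y, limits=[0.1, 0.1]).mean()  # replace tails instead

# R-estimator: Hodges-Lehmann = median of the averages of all possible pairs
walsh_averages = [(a + b) / 2 for a, b in combinations_with_replacement(y, 2)]
hodges_lehmann = np.median(walsh_averages)

print(sample_mean, sample_median, trimmed_mean, winsorized_mean, hodges_lehmann)
```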

Spread or variability

Various measures of spread are available to quantify how variable the observations in a sample are. The range, which is the difference between the smallest and largest values in the sample, is one indicator of spread. However, it is difficult to infer the range of the population from a sample, because the sample range changes (increases) dramatically as the sample size grows.

The sample variance is an important measure of spread in statistics. The variance is the average of the squared differences between the observed values and the mean of the observations. To express spread in the same units as the observations, the square root of the variance, called the standard deviation and denoted by a lowercase sigma (σ) for the population or s for a sample, is usually used.

The coefficient of variation, which is the ratio of the standard deviation to the mean of the observations, is used to compare populations with different means; it makes the comparison independent of units. Another spread measure is the median absolute deviation (MAD), which is the median of the absolute differences between the observed values and the sample median; it is less sensitive to outliers. The interquartile range is the difference between the first quartile (the observation with 0.25, or 25%, of the observations below it) and the third quartile (the observation with 0.25 of the observations above it). It is used in the construction of boxplots.
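
A minimal sketch of these spread measures, assuming the same made-up toy sample and NumPy/SciPy:

```python
import numpy as np
from scipy import stats

y = np.array([3.1, 2.4, 5.7, 4.0, 3.8, 9.9, 4.4, 3.5, 4.2, 3.9])  # toy sample

data_range = y.max() - y.min()              # range: grows with sample size, outlier-sensitive
variance = y.var(ddof=1)                    # sample variance (n - 1 in the denominator)
std_dev = y.std(ddof=1)                     # standard deviation: same units as the data
coef_var = std_dev / y.mean()               # coefficient of variation: unit-free
mad = stats.median_abs_deviation(y)         # median absolute deviation: robust to outliers
iqr = stats.iqr(y)                          # interquartile range: Q3 - Q1

print(data_range, variance, std_dev, coef_var, mad, iqr)
```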

Standard error and confidence interval for the mean

Having a point estimate of the population parameter is one step. Next, we should know how far our estimate is likely to be from the true value. The aim is to know whether our inference is robust to sampling variation. In other words, would we get a consistent estimate if we kept repeating the sampling procedure?

First, let's see what happens if a random variable follows a normal distribution with mean μ and standard deviation σ. In this case,

  • 50% of the population falls within μ ± 0.674σ
  • 95% of the population falls within μ ± 1.960σ
  • 99% of the population falls within μ ± 2.576σ.
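
These multipliers are just quantiles of the standard normal distribution; a quick check with SciPy (a sketch, not part of the original article):

```python
from scipy import stats

# Two-sided standard normal quantiles behind the 50%, 95% and 99% statements above
for coverage in (0.50, 0.95, 0.99):
    z = stats.norm.ppf(0.5 + coverage / 2)   # e.g. the 0.975 quantile for 95% coverage
    print(f"{coverage:.0%} of the population lies within ±{z:.3f} sigma of the mean")
```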

Usually, we deal with one set of observations, i.e., a single sample. If we keep drawing samples from a given population and calculate the mean of each sample (each sample consists of a set of observations), the probability distribution of the sample mean (which, remember, is a random variable) has three main characteristics:

  • The sampling distribution of the mean of normally distributed observations is also normal.
  • As the sample size increases, the probability distribution of the means of samples from any distribution approaches a normal distribution. This result is known as the Central Limit Theorem.
  • The expected value of the sampling distribution of the mean equals the population mean, and the sample mean converges to it as the sample size goes to infinity.
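
A small simulation sketch of these properties, assuming a deliberately non-normal (exponential) population and toy sizes of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)     # assumed seed
n_samples, n = 5_000, 40           # toy values

# Exponential population with mean 1 and standard deviation 1 (clearly non-normal)
samples = rng.exponential(scale=1.0, size=(n_samples, n))
sample_means = samples.mean(axis=1)

# By the Central Limit Theorem the sample means are approximately normal,
# centred near the population mean, with spread close to sigma / sqrt(n).
print("mean of the sample means:", sample_means.mean())       # close to 1.0
print("sd of the sample means:  ", sample_means.std(ddof=1))  # close to 1 / sqrt(40) ≈ 0.158
```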

The standard deviation of the sampling distribution of the mean is called the standard error. The standard error of the sample mean is defined as

σ_ȳ = σ / √n (standard error of the sample mean)

where σ in the denominator denotes the standard deviation of the population of observations and n is the sample size. We are rarely in a position to derive the standard error from multiple samples. Instead, we usually estimate it from a single sample as follows:

s_ȳ = s / √n (standard error estimated from a single sample)

where s is the sample estimate of the standard deviation of the original population and n is the sample size.

The standard error tells us how much variation to expect in the sample mean. If the standard error is large, the means of different samples will differ considerably, so the mean derived from any single sample may not be close to the true population mean.
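
As a tiny sketch (same toy sample as above), the single-sample estimate of the standard error is just s/√n:

```python
import numpy as np
from scipy import stats

y = np.array([3.1, 2.4, 5.7, 4.0, 3.8, 9.9, 4.4, 3.5, 4.2, 3.9])  # toy sample

s = y.std(ddof=1)                   # sample standard deviation
se_mean = s / np.sqrt(len(y))       # estimated standard error of the sample mean
print(se_mean, stats.sem(y))        # scipy.stats.sem gives the same value
```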

Confidence intervals for population mean

Now that we know what the standard error is, and that the sample mean follows a normal distribution, we can derive a confidence interval just as we would for any normally distributed random variable.

We need to pick the right multiplier of the standard error to obtain a confidence interval with a given probability. For example, for a 95% confidence interval we have

ȳ − 1.960 σ/√n ≤ μ ≤ ȳ + 1.960 σ/√n (95% CI)

In practice, we rarely know the true value of the standard deviation and need to estimate it from a single sample, as mentioned before. In that case, we cannot use the normal distribution to calculate the confidence interval. Instead, we should use the t distribution to calculate confidence intervals for the population mean in the common situation of not knowing the population standard deviation.

ȳ − t(0.05, n−1) s/√n ≤ μ ≤ ȳ + t(0.05, n−1) s/√n (95% CI from a single sample)

If the sample statistic does not follow a normal distribution, we can use other methods, such as resampling, to derive the confidence interval. One statistic whose sampling distribution is not normal is the variance, which follows a chi-squared distribution. As with the t statistic, there is a different chi-squared distribution for each sample size; this is reflected in the degrees of freedom (n − 1).
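
A minimal sketch of a t-based 95% confidence interval for the mean, assuming the same toy sample and SciPy's t distribution:

```python
import numpy as np
from scipy import stats

y = np.array([3.1, 2.4, 5.7, 4.0, 3.8, 9.9, 4.4, 3.5, 4.2, 3.9])  # toy sample
n = len(y)

y_bar = y.mean()
se = y.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)     # two-sided 95% critical value, n - 1 df

ci_low, ci_high = y_bar - t_crit * se, y_bar + t_crit * se
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")

# Equivalent one-liner: stats.t.interval(0.95, df=n - 1, loc=y_bar, scale=se)
```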

Methods for estimating parameters

Maximum Likelihood (ML)

The logic behind ML is deceptively simple: pick the value of the parameter that maximises the probability of observing the data at hand. Imagine that we have a sample of observations with sample mean ȳ. The likelihood function, assuming a normal distribution with a given standard deviation, gives the likelihood of observing the data for each possible value of μ. In general, for a parameter θ, the likelihood function is:

L(θ; y1, …, yn) = ∏ᵢ f(yᵢ; θ) (likelihood function)

where f(yᵢ; θ) is the probability distribution of yᵢ for a given value of θ. The ML estimator of θ is the value that maximises this likelihood function (in practice, usually its logarithm).
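
A minimal ML sketch (same toy sample, with σ assumed known as in the text): it minimises the negative log-likelihood of a normal model numerically and recovers the sample mean.

```python
import numpy as np
from scipy import optimize, stats

y = np.array([3.1, 2.4, 5.7, 4.0, 3.8, 9.9, 4.4, 3.5, 4.2, 3.9])  # toy sample
sigma = 1.0    # standard deviation assumed known, as in the text

def negative_log_likelihood(mu):
    # minus the log of the product of normal densities f(y_i; mu)
    return -np.sum(stats.norm.logpdf(y, loc=mu, scale=sigma))

result = optimize.minimize_scalar(negative_log_likelihood, bounds=(0.0, 10.0), method="bounded")
print("ML estimate of mu:", result.x)    # coincides with the sample mean
print("sample mean:      ", y.mean())
```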

Ordinary least square (OLS)

Another general approach to estimating parameters is ordinary least squares (OLS). OLS finds the parameter value that minimises the sum of squared differences between the observed data and the values predicted from the parameter.

SS(θ) = Σᵢ [yᵢ − f(θ)]² (OLS criterion for estimating θ)

For example, the OLS estimator of the mean is the value at which this sum of squares is minimised. OLS estimators are usually more straightforward to calculate than ML estimators, often having exact analytical solutions. The major application of OLS estimation is in estimating the parameters of linear models.
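
A minimal OLS sketch under the same toy sample: minimising the sum of squares for a single location parameter recovers the sample mean, and for a simple linear model the solution is available in closed form via the normal equations (here through np.linalg.lstsq; the x values are made up):

```python
import numpy as np
from scipy import optimize

y = np.array([3.1, 2.4, 5.7, 4.0, 3.8, 9.9, 4.4, 3.5, 4.2, 3.9])  # toy sample

# OLS for a single location parameter: minimise the sum of squared differences
sum_of_squares = lambda theta: np.sum((y - theta) ** 2)
result = optimize.minimize_scalar(sum_of_squares)
print("OLS estimate:", result.x, "sample mean:", y.mean())   # identical

# OLS for a simple linear model y = b0 + b1 * x has an exact solution
x = np.arange(len(y), dtype=float)                 # made-up predictor
X = np.column_stack([np.ones_like(x), x])          # design matrix with an intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # solves the normal equations
print("intercept and slope:", beta)
```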

ML vs OLS estimation

In ML estimation, we must assume a specific probability distribution for the observed data. When this assumption holds, the ML estimator is usually unbiased (at least for reasonable sample sizes) and has the lowest variance among competing estimators. The OLS estimator does not impose such a restriction on the distribution of the data for point estimation, and is also usually unbiased with minimum variance. However, for interval estimation and hypothesis testing, OLS estimators carry quite restrictive assumptions about normality and the pattern of variance. OLS is inappropriate for some common models where the response variable or the residuals are not normally distributed, e.g. binary and more general categorical data. Therefore, generalized linear models (GLMs, such as logistic regression and log-linear models) and nonlinear models are based on ML estimation.

Resampling methods for estimation

The methods we have discussed so far for deriving the confidence interval for the parameters rely on two assumptions:

  • The sampling distribution of the statistic is usually assumed to be normal. We have seen that this is not the case for the variance.
  • An exact formula for the standard error must be known. Such formulas are available only for some well-known statistics, such as the mean and variance, and only under specific conditions.

These conditions might hold for the mean, but not obviously for the median. To deal with this issue, we can rely on computer-intensive resampling methods. These methods are based on one idea: the best guess for the distribution of the population is the distribution of the observations in our sample. We briefly discuss two such methods here.

Bootstrap

The bootstrap method is simple. We approximate the sampling distribution by repeatedly resampling with replacement, usually with the same sample size as the original sample. Since the resampling is with replacement, the sample statistic differs from one resample to the next. The bootstrap estimate of a given statistic is simply the mean of this resampled sampling distribution, and the standard error of the statistic is the standard deviation of the bootstrapped estimates.

Likewise, a confidence interval can be derived from the bootstrapped samples. Two common methods are:

  1. The percentile method, where the confidence interval is read directly from the percentiles of the bootstrap distribution (as in the sketch after this list).
  2. A bias-corrected method. Since the bootstrap distribution is usually asymmetric for statistics other than the mean, such as the median, the percentile confidence interval can be biased. To rectify this, we first determine what percentage of the bootstrap estimates fall below the bootstrap mean, then convert that percentage to the corresponding standard normal percentile point, which we call z0. The 95% confidence interval is then given by the bootstrap percentiles corresponding to 2z0 − 1.96 and 2z0 + 1.96 on the standard normal scale.
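
A minimal bootstrap sketch for the median of the same toy sample, showing the bootstrap standard error and the percentile 95% interval (the bias-corrected version is not implemented here; SciPy also ships a ready-made scipy.stats.bootstrap):

```python
import numpy as np

rng = np.random.default_rng(1)     # assumed seed
y = np.array([3.1, 2.4, 5.7, 4.0, 3.8, 9.9, 4.4, 3.5, 4.2, 3.9])  # toy sample
n_boot = 10_000

# Resample with replacement, same size as the original sample, and recompute the median
boot_medians = np.array([
    np.median(rng.choice(y, size=len(y), replace=True)) for _ in range(n_boot)
])

boot_estimate = boot_medians.mean()                          # bootstrap estimate of the median
boot_se = boot_medians.std(ddof=1)                           # bootstrap standard error
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])   # percentile 95% CI

print(boot_estimate, boot_se, (ci_low, ci_high))
```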

Jackknife method

The jackknife is a historically earlier alternative to the bootstrap and is less computationally intensive. It can be implemented as follows. First, calculate the statistic from the full observed sample (denoted θ*), then from the sample with the first data point removed (θ*(−1)), then from the sample with the second data point removed (θ*(−2)), and so on. A pseudo-value for each observation in the original sample is then calculated as:

pseudo-value_i = n θ* − (n − 1) θ*(−i)

where θ*(−i) is the statistic calculated from the sample with observation i omitted. Each pseudo-value is simply a combination of two estimates of the statistic, one based on the whole sample and one based on the removal of a particular observation. The jackknife estimate of the parameter is simply the mean of the pseudo-values. The standard deviation of the jackknife estimate (the standard error of the estimate) is:

s.e.(θ) = √[ Σᵢ (pseudo-value_i − mean pseudo-value)² / (n(n − 1)) ] (standard error of the jackknife estimate)

Note that we have to assume that the pseudo-values are independent of each other for these calculations, whereas in reality they are not.
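
A minimal jackknife sketch for the mean of the same toy sample, following the pseudo-value formulas above:

```python
import numpy as np

y = np.array([3.1, 2.4, 5.7, 4.0, 3.8, 9.9, 4.4, 3.5, 4.2, 3.9])  # toy sample
n = len(y)

theta_full = y.mean()                                                   # statistic from the full sample
theta_minus_i = np.array([np.delete(y, i).mean() for i in range(n)])    # leave-one-out statistics

# Pseudo-values combine the full-sample estimate with each leave-one-out estimate
pseudo = n * theta_full - (n - 1) * theta_minus_i

jackknife_estimate = pseudo.mean()
jackknife_se = np.sqrt(pseudo.var(ddof=1) / n)    # treats pseudo-values as independent

print(jackknife_estimate, jackknife_se)
```

For the mean, the pseudo-values reduce to the observations themselves, so the jackknife estimate equals the sample mean and the jackknife standard error equals s/√n, which makes this a handy sanity check.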

Conclusion

In this article, we reviewed the fundamentals of estimation from a statistical point of view. We introduced different statistics and their sampling distributions, presented the notion of a confidence interval, and discussed different methods for deriving it.

