Part V: Sampling and the Normal Distribution
Sample Sums, Sample Means, Sample Variances
Sample Mean and Variance as Estimators
The Variance of the Sample Average and Sample Sum
Standardizing a Random Variable
The Sample Average and Sum for a Normal Distribution
The Normal Approximation to the Binomial Distribution
Let F denote any given distribution. A random sample of size n from F is a sequence X1, X2, ..., Xn of random variables such that
1. X1, X2, ..., Xn are independent, and
2. The distribution of each individual variable Xi is the given distribution F.
In statistics it is usually assumed that the data at hand are values of a random sample from some distribution. The distribution F is usually unknown, at least partially, and the goal is to make inferences about it from the data.
The term "random sample" has a slightly different meaning here than it does when speaking of a random sample from a population.
The values of a random sample U1, U2, ..., Un from the standard uniform distribution on the interval (0, 1) may be simulated easily. Simply punch the random number key on your calculator n times. Once the values of U1, U2, ..., Un are simulated, the values of a random sample X1, X2, ..., Xn from a distribution with cumulative distribution function FX can be simulated by letting each Xi be the smallest solution x of the inequality Ui ≤ FX(x). For example, a random sample from the uniform distribution on the interval (a, b) can be obtained by letting Xi = a + (b-a)Ui.
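To make the recipe concrete, here is a short Python sketch of it (NumPy's random generator plays the role of the calculator's random number key; the exponential distribution is included as a second, purely illustrative choice of FX):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5

    # Step 1: simulate U1, ..., Un from the uniform distribution on (0, 1).
    u = rng.random(n)

    # Step 2: for each Ui, take the smallest x with Ui <= FX(x).
    # Uniform on (a, b): FX(x) = (x - a)/(b - a), so x = a + (b - a)Ui.
    a, b = 2.0, 5.0
    x_uniform = a + (b - a) * u

    # Exponential with rate lam: FX(x) = 1 - exp(-lam*x), so x = -log(1 - Ui)/lam.
    lam = 1.5
    x_exponential = -np.log(1.0 - u) / lam

    print(x_uniform)
    print(x_exponential)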
Let X1, X2, ..., Xn be a random sample from a given distribution. Suppose this distribution has mean μ and variance σ². From the random sample, we derive four new random variables of great importance:

    the sample sum                 T = X1 + X2 + ... + Xn,
    the sample mean                X̄ = T/n,
    the sample variance            S² = [(X1 − X̄)² + (X2 − X̄)² + ... + (Xn − X̄)²]/(n − 1),
    the sample standard deviation  S = √S².
These are random variables. The last three should not be confused with the corresponding parameters μ, σ² and σ, which are not random variables.
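As a minimal illustration, the four statistics can be computed in Python from a simulated sample (the normal population and its parameters here are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=10.0, scale=2.0, size=25)    # a random sample of size n = 25

    n = x.size
    t = x.sum()                              # sample sum T
    xbar = t / n                             # sample mean
    s2 = ((x - xbar) ** 2).sum() / (n - 1)   # sample variance, divisor n - 1
    s = s2 ** 0.5                            # sample standard deviation

    # NumPy's var/std use divisor n by default; ddof=1 gives the n - 1 divisor.
    assert np.isclose(s2, x.var(ddof=1))
    assert np.isclose(s, x.std(ddof=1))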
The sample sum, mean, variance and standard deviation are random variables and have distributions derived from the distribution sampled. As such, they have expected values. One of the most important properties of the sample mean and variance is that they are unbiased estimators of the corresponding population parameters μ and σ². This means that

    E[X̄] = μ   and   E[S²] = σ².
If the population mean and variance are unknown, they can be estimated from the data by the sample mean and variance. "On average" the estimated values will be right. The sample standard deviation is not, in general, an unbiased estimator of the population standard deviation: since the square root is concave, E[S] < √E[S²] = σ unless S is constant, so S tends to underestimate σ slightly. However, it is still useful as an estimator of the population standard deviation.
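These claims are easy to check by simulation. In the sketch below the population is exponential with mean μ = 2 and variance σ² = 4 (an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(2)
    n, reps = 10, 200_000
    samples = rng.exponential(scale=2.0, size=(reps, n))   # mu = 2, sigma**2 = 4

    xbar = samples.mean(axis=1)
    s2 = samples.var(axis=1, ddof=1)
    s = np.sqrt(s2)

    print(xbar.mean())   # close to mu = 2:       the sample mean is unbiased
    print(s2.mean())     # close to sigma**2 = 4: the sample variance is unbiased
    print(s.mean())      # noticeably below sigma = 2: S is biased low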
The expected value of a sum of random variables is the sum of their expected values. Thus, for the sample sum,

    E[T] = E[X1] + E[X2] + ... + E[Xn] = nμ.

Because of the independence of the summands, the variances add in the same way:

    Var(T) = Var(X1) + Var(X2) + ... + Var(Xn) = nσ².

It follows from the last equation that the variance of the sample average is

    Var(X̄) = Var(T/n) = Var(T)/n² = σ²/n,

and its standard deviation is

    SD(X̄) = σ/√n.

The standard deviation of the sample sum is

    SD(T) = σ√n.
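A quick Monte Carlo confirmation of these two standard deviations (the normal population, n = 16 and σ = 3 are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(3)
    n, reps, sigma = 16, 100_000, 3.0
    samples = rng.normal(loc=0.0, scale=sigma, size=(reps, n))

    print(samples.mean(axis=1).std())   # close to sigma/sqrt(n) = 0.75
    print(samples.sum(axis=1).std())    # close to sigma*sqrt(n) = 12.0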
If X is a random variable with mean μ and standard deviation σ, the standardized value of X is the random variable

    Z = (X − μ)/σ.
The random variable Z is sometimes called the z-score of X. Its mean and variance are, respectively, 0 and 1. The standardized values of the sample sum and the sample average are algebraically equivalent because the sample sum is simply a multiple of the sample average. They are

    (X̄ − μ)/(σ/√n) = (T − nμ)/(σ√n).
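The equivalence is easy to verify numerically; in this sketch the sample is drawn from a normal population with arbitrarily chosen μ and σ:

    import numpy as np

    rng = np.random.default_rng(4)
    mu, sigma, n = 5.0, 2.0, 30
    x = rng.normal(mu, sigma, size=n)

    z_mean = (x.mean() - mu) / (sigma / np.sqrt(n))
    z_sum = (x.sum() - n * mu) / (sigma * np.sqrt(n))
    assert np.isclose(z_mean, z_sum)   # the two standardizations agree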
For comparison, the standardized value of a binomial random variable Y based on n trials and success probability θ is

    (Y − nθ)/√(nθ(1 − θ)).
When the samples X1, ..., Xn are from a normal distribution with mean μ and standard deviation σ, the sample average and the sample sum are both normally distributed. The sample average has the normal distribution with mean μ and standard deviation σ/√n. The sample sum is normally distributed with mean nμ and standard deviation σ√n. Their standardized value has the standard normal distribution with mean 0 and standard deviation 1.
The sample average is normally distributed if the samples are from a normal distribution. The most important theorem in probability is the central limit theorem, which asserts that the sample average is approximately normally distributed even when the samples are from a non-normal distribution, provided that the sample size is large enough. More precisely, as the sample size n gets large without bound, the distribution of the standardized sample average approaches the standard normal distribution. For any given numbers a and b, with a < b,

    P[a ≤ (X̄ − μ)/(σ/√n) ≤ b] → P[a ≤ Z ≤ b]   as n → ∞,

where Z has the standard normal distribution.
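The theorem can be watched in action by simulation. The sketch below samples from a markedly non-normal population (exponential with rate 1, so μ = σ = 1, an arbitrary choice) and compares a probability for the standardized sample average with the standard normal value; scipy.stats supplies the normal cumulative distribution function:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(5)
    mu = sigma = 1.0                  # exponential with rate 1: mu = sigma = 1
    n, reps = 50, 200_000

    xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    z = (xbar - mu) / (sigma / np.sqrt(n))

    a, b = -1.0, 1.0
    print(((a <= z) & (z <= b)).mean())   # simulated P[a <= Zn <= b]
    print(norm.cdf(b) - norm.cdf(a))      # standard normal value, about 0.6827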
A binomial random variable is a sample sum Y = X1 + X2 + ... + Xn, where the Xi are independent Bernoulli trials with success probability θ. Therefore, the central limit theorem applies and for large enough n the distribution of the standardized value of Y,

    (Y − nθ)/√(nθ(1 − θ)),
is approximately standard normal. As an illustration of how this might be useful, suppose a six-sided die is thrown 100 times and we want to know the probability that a 2 occurs more than 20 times. If Y is the number of occurrences of a 2 in the n = 100 trials, we are after P[Y > 20]. The success probability θ is 1/6. With some algebra and the central limit theorem,

    P[Y > 20] = P[(Y − nθ)/√(nθ(1 − θ)) > (20 − 100/6)/√(100(1/6)(5/6))] ≈ P[Z > 0.8944],
where Z has a standard normal distribution. From a table of the standard normal distribution we find that P[Z > 0.8944] = 0.1855. The exact value to four decimal places of the probability of more than 20 successes is 0.1519. One of the sources of inaccuracy is that we are trying to approximate a discrete distribution (the binomial) by a continuous distribution (the normal). The normal approximation can be improved by using what is known as the continuity correction. Observe that since Y is a whole number, the events [Y > 20] and [Y > 20.5] are the same. Replacing 20 by 20.5 in the calculations above results in an approximate value of P[Z > 1.0286] = 0.1518.
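The entire example can be reproduced numerically; a sketch using scipy.stats:

    import numpy as np
    from scipy.stats import binom, norm

    n, theta = 100, 1/6
    mean = n * theta
    sd = np.sqrt(n * theta * (1 - theta))

    exact = binom.sf(20, n, theta)            # P[Y > 20], exact binomial value
    crude = norm.sf((20 - mean) / sd)         # normal approximation, no correction
    corrected = norm.sf((20.5 - mean) / sd)   # with the continuity correction

    print(round(exact, 4))       # 0.1519
    print(round(crude, 4))       # about 0.1855
    print(round(corrected, 4))   # about 0.1518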