Sujay Kapadnis
Writing

Central Limit Theorem


Have you ever participated in a poll? or any process that may have collected data that represents you and also from other variety of people. Today’s topic will explain intution behind polling.

Understanding the Central Limit Theorem (CLT)

Consider the following scenario.

There’s a marathon race arranged in your country and you are the organizer of this grand event. Runners from all over the world are going to participate in it and have already arrived in your country.

On the day of the event, all the participants are being transported to the venue but one of the buses broke down on the way, and unfortunately, the bus is full of foreigners and hence can’t communicate with people nearby.

Now to resolve this matter, you yourself set off on the road with your team, and lucky for you, you immediately found a bus surrounded by a group of unhappy foreigners. Lucky for you, you get the information about those passengers including their weights and you notice that the mean weight of passengers on that bus is nearly 220 pounds (~99KG)(Anything is possible when you know statistics). There is no way that a random group of marathon runners could all be this heavy. Then you immediately inform your team to keep searching.

Congratulations! If you can grasp how someone who takes a quick look at the weights of passengers on a bus can infer that they are probably not on their way to the starting line of a marathon, then you now understand the basic idea of the central limit theorem.

The core principle of the central limit theorem is that a large, properly drawn sample will always represent the population from which it was drawn. Well, obviously it won’t be an exact replica of the population and will vary from it, but the probability that it will deviate massively from the population is very low.

There’s still a possibility that people weighing 220 pounds do run in a marathon and out of the total population, there might be almost 100 of them, but the likelihood that so many of them being assigned to the same bus is extremely low, hence you can confidently conclude that it’s not the bus you were looking for.

This is the basic intuition of CLT. The mean weight of a marathon runner is about 155 pounds, so there’s less than a 1 in 100 chance that the mean weight of those 60 passengers is 220.

Just now we had enough idea about the population data (the mean weight of a runner is 155 pounds) with its help we concluded that its the wrong bus. And of course, its inverse is also true and that’s how polling works.

In reality, it's impossible to complete a population dataset and that's why sufficiently large sample datasets are used based on which its parent population dataset is assumed. Here’s when CLT comes into the picture.

The central limit theorem tells us that a large sample will not typically deviate sharply from its underlying population. A mere poll of 2000 appropriately chosen people can tell a great deal about how the entire country is thinking.

The central limit theorem tells us that the sample means will be distributed roughly as a normal distribution around the population mean. The population from which the samples are being drawn does not have to have a normal distribution in order for the sample means to be distributed normally.

The larger the number of samples, the more closely the distribution will approximate the normal distribution. And the larger the size of each sample, the tighter that distribution will be.

Summary:-

  1. If you draw large, random samples from any population, the means of those samples will be distributed normally around the population mean.
  2. Most sample means will lie reasonably close to the population mean, the standard error is what defines “reasonably close.”
  3. The central limit theorem tells us the probability that a sample mean will lie within a certain distance of the population mean. It is relatively unlikely that a sample mean will lie more than two standard errors from the population mean, and extremely unlikely that it will lie three or more standard errors from the population mean.