Two Statistical Theories That Every Software Engineer Should Know

Guy Alster
9 min read · Dec 28, 2023


Introduction

I’ve always believed that Machine Learning and AI did to statistics what the iPhone did to the cellphone. In both cases, the newcomer took concepts that some would argue were mundane and turned them into a new source of excitement. In the case of statistics, there is no argument that it has always been one of the most important branches of mathematics. However, in the era of “Big Data”, its significance has grown even more. In today’s data-driven world, a basic understanding of probability theory and statistics is not just a valuable skill but a necessity for efficient and productive work, especially in engineering fields.

Most individuals with a STEM background have encountered basic statistics in their education. Even those outside this sphere are likely familiar with concepts such as mean, standard deviation, distributions, expectation, and percentiles. If these terms are new to you, don’t worry. There’s a wealth of resources available that explain these concepts in an accessible manner, even for those without a strong mathematical foundation.

In this article, I aim to delve into two pivotal statistical theories that are as essential to the field of statistics as lasagna and pizza are to Italian cuisine. The first is the Law of Large Numbers, and the second is the Central Limit Theorem. While each theory is valuable in its own right, their combined power has been harnessed by statisticians for decades in various applications, including surveys, data analysis, and estimations.

The Basics

Both the Central Limit Theorem (CLT) and the Law of Large Numbers (LLN) begin with a population or a dataset, from which we aim to measure certain statistics. Often, direct measurement of these statistics is impractical due to reasons such as the sheer size of the population or dataset. Two key statistics of interest are the mean and the standard deviation of a particular measurement related to the data.

  • Mean: The mean provides a measure of the central tendency of a dataset, which helps us understand the general behavior of the data or population.
  • Standard Deviation: This statistic assesses the spread of the data around the mean. It indicates how “reliable” or representative the mean is of the entire dataset. A higher standard deviation means the data is more spread out, making the mean less representative. Conversely, a smaller standard deviation suggests that the mean is a more accurate reflection of the dataset.

Illustrative Example:

Consider a scenario where a hundred people in a bar each have an annual salary of $50,000. Calculating the average salary in this case is straightforward: it’s $50,000, with a standard deviation of zero, indicating no variability. Now, imagine Bill Gates, with an annual salary of $1 billion, enters the bar. The average salary for the 101 people in the bar now skews to approximately $9,950,495, and the standard deviation jumps to roughly $99 million. This drastic change is due to the presence of a single outlier (Bill Gates). The example demonstrates how one outlier can dramatically distort our statistical measures, emphasizing the need for careful interpretation and an understanding of context.
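As a quick sanity check of these figures, here is a small Python sketch using only the standard library’s statistics module (population standard deviation via pstdev); nothing here goes beyond the numbers quoted above.

```python
import statistics

# 100 bar patrons, each earning $50,000 a year (figures from the example above).
salaries = [50_000] * 100
print(statistics.mean(salaries))    # 50000
print(statistics.pstdev(salaries))  # 0.0 -- no spread at all

# Bill Gates walks in with a $1B salary: a single outlier moves both statistics dramatically.
salaries.append(1_000_000_000)
print(round(statistics.mean(salaries)))    # 9,950,495
print(round(statistics.pstdev(salaries)))  # ~99,004,950 -- the outlier dominates the spread
```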

Importance of Sampling: Another crucial concept in both theories is sampling. When it’s impractical to measure the entire population, statisticians use sampling as a strategy. Sampling involves extracting smaller, manageable groups from the larger population to perform measurements. If these samples are sufficiently large and can be considered independent and identically distributed (IID), the process of repeated sampling and measurement can yield significant results. These results can then be used to estimate the statistics of the overall group, which is our primary objective.

The Law of Large Numbers

The Law of Large Numbers is the easier of the two to state. Given a population or dataset with some distribution and a mean, if you take a large enough sample of it, the mean of that sample tends to approach the mean of the overall population. Moreover, the larger the sample, the better its mean estimates the population mean. Key to applying this law is the representativeness of the sample: it must be randomly selected and not skewed or biased. If the sample is biased, the sample mean will not accurately estimate the population mean, regardless of the sample size.

It’s important to note that the LLN applies to independent and identically distributed random variables. This means each data point in the sample should be independent of others and drawn from the same distribution. In practical terms, this often involves random sampling methods to ensure that every member of the population has an equal chance of being included in the sample.

In essence, the Law of Large Numbers assures us that with a sufficiently large and well-chosen sample, we can reliably estimate characteristics of an entire population. This principle is foundational in many fields that rely on statistical analysis, from social sciences to engineering.
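To make this concrete, here is a minimal simulation sketch. The skewed “salary” population below is entirely made up for illustration; the point is only that progressively larger IID random samples produce sample means that hug the population mean more and more tightly.

```python
import random
import statistics

random.seed(42)
# A heavily skewed, made-up population of one million "salaries" with mean ~ $60k.
population = [random.expovariate(1 / 60_000) for _ in range(1_000_000)]
true_mean = statistics.mean(population)

# As the sample size grows, the sample mean converges toward the population mean.
for n in (10, 100, 1_000, 10_000, 100_000):
    sample = random.sample(population, n)
    print(f"n={n:>7,}  sample mean={statistics.mean(sample):>10,.0f}  population mean={true_mean:,.0f}")
```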

Practical Applications of the LLN

The Law of Large Numbers is essential because it provides a foundation for making reliable inferences from sample data to larger populations and hence is used in many fields. Some of its applications are:

  1. Statistical Sampling: The LLN is the basis for statistical sampling methods. It assures that as the sample size increases, the sample mean (or other statistics) will converge to the population mean. This principle is crucial in fields like survey research, market analysis, and opinion polling, where it’s often impractical to study an entire population.
  2. Financial Markets: In finance, the LLN underpins many risk assessment and pricing models. For example, it’s used in calculating the expected returns of investments over time. As the number of investment periods increases, the average return is expected to approach the expected value.
  3. Insurance and Actuarial Science: The LLN is fundamental in insurance and actuarial science for predicting future claims and setting premiums. It helps in understanding that as the number of policyholders increases, the average cost per policyholder becomes more predictable and closer to the expected value.
  4. Data Science and Machine Learning: In data science, the LLN supports the idea that models trained on larger datasets will more accurately reflect the underlying patterns and relationships in the population data, leading to more reliable and generalizable models.
  5. Medical Research: In clinical trials and epidemiological studies, the LLN is used to determine the effectiveness of treatments or the prevalence of diseases. Large sample sizes help ensure that the results are representative of the larger population.

The Central Limit Theorem

Like the LLN, this second theorem starts with a large population whose statistics we want to measure. We might not know the distribution of the values we are trying to measure, nor their mean or standard deviation. The CLT states that if we sample the population multiple times, such that each sample is IID and large enough (a common rule of thumb is at least 30 values), and we take the mean of each sample, then those means themselves approximate a Normal distribution (the bell-shaped curve). This is a significant and very powerful result: even though we may know nothing about the original population or the distribution of its values, we do know that the means of samples drawn from it form a well-known distribution whose statistics we can work with.
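The following sketch illustrates this with made-up data: the underlying population is exponential (heavily skewed, nothing like a bell curve), yet the means of repeated samples of size 50 cluster into an approximately normal shape centered on the population mean.

```python
import random
import statistics

random.seed(0)
# A made-up, skewed population, e.g. response times with mean ~100.
population = [random.expovariate(1 / 100) for _ in range(100_000)]

# Take many IID samples of size 50 and record the mean of each one.
sample_means = [statistics.mean(random.sample(population, 50)) for _ in range(10_000)]

# The sample means form an approximately normal distribution centered on the population mean,
# with spread close to sigma / sqrt(n) (the standard error): here roughly 100 / sqrt(50) ~ 14.
print(f"mean of sample means:  {statistics.mean(sample_means):.1f}")   # ~100
print(f"stdev of sample means: {statistics.stdev(sample_means):.1f}")  # ~14
```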

Two Practical Applications of the CLT

Hypothesis testing: A core tool in research and data analysis, allowing for objective decision-making based on statistical evidence. It is widely used in medicine, psychology, economics, and engineering to test theories and evaluate the effectiveness of changes. I won’t delve into all the details of how it is performed, but I will mention the general idea. In hypothesis testing, an experiment is conducted with the aim of assessing the strength of the evidence against the null hypothesis, which typically posits that the experiment will not result in any significant change or effect. You run the experiment and collect data. Based on the results, you determine whether the observed data deviates significantly from what would be expected under the null hypothesis. If the observed data is highly unlikely assuming the null hypothesis is true, the null hypothesis is rejected. This rejection leads to considering an alternative hypothesis that better aligns with the observed data.

The Central Limit Theorem plays a crucial role in hypothesis testing. It assures us that, for a sufficiently large sample size, the distribution of sample means approximates a normal distribution, regardless of the population’s original distribution. This property is crucial because it allows for the application of tests based on the normal distribution (like z-tests and t-tests) even when the population distribution is unknown or not normally distributed.

Example: You have an internet service and you would like to reduce the average latency of the service. Historical data shows that the average latency of the service is 120ms, with a known standard deviation of 20ms. Your objective is to determine whether a new optimization that you would like to add to your service can significantly reduce the latency.

Your null hypothesis postulates that the optimization will not change the mean latency. Your alternative hypothesis postulates that the optimization will reduce the latency.

You implement the optimization and start collecting data. You randomly select 50 different instances to measure latency and find that their average latency is 110ms. In addition, you decide on a significance level of 5%. This represents the threshold for determining the statistical significance of the test results; in other words, it is the probability of making a Type I error, i.e., rejecting the null hypothesis when it is actually true.

You calculate the z-statistic and get a p-value, which measures how likely it is to observe the data (or something more extreme) if the null hypothesis is correct. In our case, z = (110 - 120) / (20 / √50) ≈ -3.54, which corresponds to a one-sided p-value of roughly 0.0002. This means there is only about a 0.02% chance of observing a sample mean of 110ms or lower if the true mean latency is still 120ms, as stated by the null hypothesis.

Since the p-value is below the significance level, you conclude that the result is statistically significant and reject the null hypothesis. Hence you adopt the new optimization, concluding that it does indeed reduce latency.
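The whole calculation fits in a few lines. Here is a sketch of the z-test above using only the Python standard library (statistics.NormalDist for the normal CDF), with the numbers taken directly from the example.

```python
from math import sqrt
from statistics import NormalDist

mu_0, sigma = 120.0, 20.0    # historical mean latency and known standard deviation (ms)
n, sample_mean = 50, 110.0   # measurements collected after the optimization
alpha = 0.05                 # significance level

standard_error = sigma / sqrt(n)            # ~2.83 ms; the CLT justifies treating the mean this way
z = (sample_mean - mu_0) / standard_error   # ~ -3.54
p_value = NormalDist().cdf(z)               # probability of a sample mean this low (or lower) under H0

print(f"z = {z:.2f}, p-value = {p_value:.5f}")                 # p ~ 0.0002
print("reject H0" if p_value <= alpha else "fail to reject H0")
```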

Data Science & Machine Learning: In the context of DS and ML, the CLT is valuable in multiple ways.

  1. Statistical Inference in Data Science: Data scientists often work with samples of data rather than entire populations. The CLT allows them to make inferences about the population from which the sample was drawn. For instance, when estimating population parameters like means or proportions, the CLT assures that the distribution of these estimates will be approximately normal if the sample size is large enough, even if the population distribution is not. This enables the use of statistical tools like confidence intervals, which help quantify the precision and reliability of those estimates. For example, if a data scientist estimates the average user engagement time on a website, they can provide a range (a confidence interval) within which the true average likely falls (see the sketch after this list). Assuming a Normal distribution also simplifies the analysis of the data and the decision-making that follows.
  2. Assumption of Normality in Algorithms: Many machine learning algorithms, especially those involving statistical methods, operate under the assumption that the data are normally distributed. The CLT provides a theoretical foundation for this assumption, particularly when dealing with averages or sums of variables. This is crucial in algorithms that are sensitive to the distribution of the data, such as linear regression, logistic regression, and other parametric methods.
  3. Feature Engineering: Data scientists often create new features (variables) from existing data. The CLT can be applied when combining multiple variables or transforming them. For example, if a new feature is created as the average of several other variables, the CLT suggests that this new feature would tend to have a normal distribution, which might be beneficial for certain types of analysis.
  4. Model Evaluation and Validation: When evaluating machine learning models, especially through techniques like cross-validation, the CLT helps in understanding the distribution of evaluation metrics (like accuracy, precision, recall). This is important for assessing the reliability of these metrics and for comparing different models.
  5. Handling Large Datasets: In the era of big data, data scientists often work with very large datasets. The CLT is particularly useful here because it assures that, as the sample size increases, the sampling distribution of the mean (or sum) of a variable becomes increasingly normal. This property can simplify analysis and the application of statistical methods.
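As an example of the first point, here is a small sketch of a CLT-based confidence interval for a mean. The clt_confidence_interval helper and the engagement-time numbers are hypothetical, introduced only for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def clt_confidence_interval(sample, confidence=0.95):
    """CLT-based confidence interval for a population mean, given one large IID sample."""
    n = len(sample)
    m, s = mean(sample), stdev(sample)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # ~1.96 for a 95% interval
    margin = z * s / sqrt(n)                        # z critical value times the standard error
    return m - margin, m + margin

# Hypothetical engagement times in seconds; the data is made up purely for illustration.
engagement_seconds = [34.1, 51.7, 29.9, 45.0, 38.2, 61.3, 27.5, 49.8, 43.6, 55.4] * 20
low, high = clt_confidence_interval(engagement_seconds)
print(f"95% CI for mean engagement time: ({low:.1f}s, {high:.1f}s)")
```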

Summary

In conclusion, this article has explored two fundamental statistical theories crucial for software engineers in the era of Big Data: the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT). The LLN is essential for understanding that larger sample sizes lead to more accurate estimates of population parameters, making it a cornerstone in fields ranging from financial markets to medical research. The CLT, on the other hand, is pivotal in hypothesis testing and data analysis, providing a basis for assuming normal distribution in various statistical applications, even when dealing with non-normal population distributions. These theories not only facilitate a deeper understanding of data behavior but also enhance the precision and reliability of data analysis in software engineering. Embracing these concepts can significantly elevate the analytical capabilities of professionals in this field, leading to more informed and effective decision-making processes.
