Standard deviation is a crucial statistical measure that quantifies the amount of variation or dispersion in a dataset. A high standard deviation indicates a wide spread of data points, while a low standard deviation suggests data points clustered closely around the mean. This blog post will guide you through calculating the standard deviation of a list in Python, using both built-in functions and manual calculation for a deeper understanding.
Using Python’s statistics
Module
The easiest and most efficient way to calculate the standard deviation in Python is by leveraging the statistics
module. This module provides the stdev()
function, which directly computes the population standard deviation. For a sample standard deviation (used when your list is a sample of a larger population), use stdev()
with xbar=mean(data)
import statistics
= [2, 4, 4, 4, 5, 5, 7, 9]
data
= statistics.stdev(data)
population_std_dev print(f"Population Standard Deviation: {population_std_dev}")
import numpy as np
= np.std(data, ddof=1) # ddof=1 for sample standard deviation
sample_std_dev print(f"Sample Standard Deviation: {sample_std_dev}")
# Population Standard Deviation: 2.138089935299395
# Sample Standard Deviation: 2.138089935299395
This code snippet first imports the statistics
module. Then, it defines a sample list data
. The statistics.stdev()
function calculates the population standard deviation, which is then printed to the console. The example also demonstrates finding sample standard deviation using numpy. Remember that the statistics
module’s stdev
calculates the population standard deviation, while many statistical analyses will require you to use the sample standard deviation.
Manual Calculation of Standard Deviation
While using the statistics
module is recommended for its simplicity and efficiency, understanding the underlying calculation provides valuable insight. The standard deviation is calculated in two main steps:
Calculate the mean: Sum all the numbers in the list and divide by the total count.
Calculate the variance: For each number, subtract the mean, square the result, and sum these squared differences. Divide this sum by the number of data points (for population standard deviation) or by the number of data points minus 1 (for sample standard deviation).
Calculate the standard deviation: Take the square root of the variance.
Let’s implement this manually:
import math
= [2, 4, 4, 4, 5, 5, 7, 9]
data
= len(data)
n = sum(data) / n
mean
= sum((x - mean) ** 2 for x in data) / n #Population variance
variance
#Sample Variance
= sum((x-mean)**2 for x in data) / (n-1)
sample_variance
= math.sqrt(variance) #Population Standard Deviation
std_dev = math.sqrt(sample_variance) #Sample Standard Deviation
sample_std_dev
print(f"Manually calculated Population Standard Deviation: {std_dev}")
print(f"Manually calculated Sample Standard Deviation: {sample_std_dev}")
# Manually calculated Population Standard Deviation: 2.0
# Manually calculated Sample Standard Deviation: 2.138089935299395
This code demonstrates the manual calculation, mirroring the steps outlined above. Note the difference in how the variance is calculated for the population and sample standard deviation.
Handling Large Datasets with NumPy
For extremely large datasets, NumPy offers significant performance advantages:
import numpy as np
= np.array([2, 4, 4, 4, 5, 5, 7, 9])
data
= np.std(data) #population std
population_std_dev = np.std(data, ddof=1) #sample std
sample_std_dev
print(f"NumPy Population Standard Deviation: {population_std_dev}")
print(f"NumPy Sample Standard Deviation: {sample_std_dev}")
# NumPy Population Standard Deviation: 2.0
# NumPy Sample Standard Deviation: 2.138089935299395
NumPy’s vectorized operations make calculations much faster, especially when dealing with substantial amounts of data. Remember that ddof=1
in np.std
is crucial for obtaining the sample standard deviation.