Confidence Intervals: Statistical Fundamentals and Python Visualization
Introduction:
Confidence intervals play a pivotal role in statistical inference, providing a range of values that is likely to contain the true population parameter. This article covers the core statistical principles behind confidence intervals and then walks through Python code that visualizes them.
Statistical Foundations:
1. Understanding the Standard Normal Distribution:
At the heart of confidence intervals lies the standard normal distribution (\(Z\)). Let’s begin by examining the Probability Density Function (PDF) and Cumulative Distribution Function (CDF):
\[ f(x) = \frac{1}{\sqrt{2\pi}} \cdot e^{-\frac{x^2}{2}} \]
\[ F(x) = \int_{-\infty}^{x} f(t) \, dt \]
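As a quick sanity check on the two formulas above, the following sketch evaluates the PDF directly from its closed form and obtains the CDF by numerical integration, then compares both with scipy.stats.norm. The helper names std_normal_pdf and std_normal_cdf are illustrative and not part of any library.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm


def std_normal_pdf(x):
    """f(x) = exp(-x^2 / 2) / sqrt(2 * pi), the standard normal density."""
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)


def std_normal_cdf(x):
    """F(x): integrate the density from -infinity up to x."""
    value, _abs_error = quad(std_normal_pdf, -np.inf, x)
    return value


pdf_at_1 = std_normal_pdf(1.0)   # approximately 0.2420
cdf_at_1 = std_normal_cdf(1.0)   # approximately 0.8413

# Both hand-rolled values should agree with SciPy's implementation
print(pdf_at_1, norm.pdf(1.0))
print(cdf_at_1, norm.cdf(1.0))
```

In practice norm.pdf and norm.cdf are the better choice; the manual versions are only here to show that the formulas and the library calls describe the same functions.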
2. Z-Scores, Percentiles, and Significance Level (\(\alpha\)):
Z-scores are vital for quantifying deviations from the mean and constructing confidence intervals:
\[ Z = \frac{x - \mu}{\sigma} \]
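Percentiles are tied to Z-scores through the CDF: the percentile of a value is the cumulative probability up to that value. The significance level (\(\alpha\)) is the total probability assigned to the extreme regions outside the confidence interval; for a two-sided interval it is split equally between the two tails (\(\alpha/2\) in each).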
\[ P(X \leq x) = \text{CDF}(x) \]
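A small worked example may help tie these definitions together. The population values below (mean 100, standard deviation 15) and the observation 130 are made-up numbers chosen for illustration only.

```python
from scipy.stats import norm

# Hypothetical population parameters and a single observation
mu, sigma = 100.0, 15.0
x = 130.0

# Z-score: distance from the mean in units of standard deviations
z = (x - mu) / sigma              # 2.0

# Percentile: cumulative probability up to z, i.e. CDF(z)
percentile = norm.cdf(z)          # approximately 0.9772

# The inverse CDF (ppf) maps the percentile back to the Z-score
z_back = norm.ppf(percentile)

# With alpha = 0.05, each tail holds alpha / 2 = 0.025.
# norm.sf(z) is the upper-tail probability 1 - CDF(z).
alpha = 0.05
in_upper_tail = norm.sf(z) < alpha / 2

print(f"z = {z}, percentile = {percentile:.4f}, in upper tail: {in_upper_tail}")
```

Here the observation sits two standard deviations above the mean, which places it inside the upper 2.5% tail used for a 95% two-sided interval.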
Confidence Intervals:
3. Confidence Interval Bounds and \(\alpha/2\):
For a confidence level \(1 - \alpha\), the confidence interval for a population mean (\(\mu\)) is built from the critical value \(Z_{\alpha/2}\), the Z-score that leaves probability \(\alpha/2\) in the upper tail (and, by symmetry, \(-Z_{\alpha/2}\) leaves \(\alpha/2\) in the lower tail):
\[ \text{Lower Bound} = \bar{x} - Z_{\alpha/2} \cdot \frac{s}{\sqrt{n}} \]
\[ \text{Upper Bound} = \bar{x} + Z_{\alpha/2} \cdot \frac{s}{\sqrt{n}} \]
Here, \(s\) is the sample standard deviation, \(n\) is the sample size, and \(\frac{s}{\sqrt{n}}\) is the standard error, which measures how much the sample mean (\(\bar{x}\)) varies from one sample to the next.
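The two bounds can be computed in a few lines. The sketch below is a minimal illustration, assuming a reasonably large sample so that the normal critical value is appropriate; the function name z_confidence_interval and the simulated data are hypothetical and not part of the original article.

```python
import numpy as np
from scipy.stats import norm


def z_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) bounds for the mean using the normal critical value."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    x_bar = sample.mean()
    s = sample.std(ddof=1)                 # sample standard deviation
    standard_error = s / np.sqrt(n)
    alpha = 1 - confidence
    z_crit = norm.ppf(1 - alpha / 2)       # Z_{alpha/2}
    return x_bar - z_crit * standard_error, x_bar + z_crit * standard_error


# Simulated data for illustration only (assumed true mean 50, sd 10)
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=200)

lower, upper = z_confidence_interval(sample, confidence=0.95)
lower_99, upper_99 = z_confidence_interval(sample, confidence=0.99)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
print(f"99% CI: ({lower_99:.2f}, {upper_99:.2f})")
```

Raising the confidence level widens the interval, because a larger critical value is needed to capture more of the distribution.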
Python Visualization Code:
Now, let’s transition to the Python code implementation to visually represent these statistical concepts.
Import Libraries
import numpy as np
import plotly.graph_objects as go
from scipy.stats import norm
Import necessary libraries for numerical operations (numpy), interactive plotting (plotly.graph_objects), and standard normal distribution functions (scipy.stats.norm).
Generate \(x\)-axis Values
# Generate a range of values for the x-axis
x = np.linspace(-5, 5, 1000)
Create a range of \(x\)-axis values from -5 to 5 using numpy.linspace.
Calculate PDF and CDF
# Calculate PDF and CDF
y_pdf = norm.pdf(x, 0, 1)
y_cdf = norm.cdf(x, 0, 1)
Calculate the Probability Density Function (PDF) and Cumulative Distribution Function (CDF) of the standard normal distribution.
Plot PDF and CDF
# Plot PDF and CDF
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y_pdf, mode='lines', name='PDF', line=dict(color='black')))
fig.add_trace(go.Scatter(x=x, y=y_cdf, mode='lines', name='CDF', line=dict(color='green'), yaxis='y2'))
Create a plot using plotly.graph_objects with two traces for PDF and CDF.
Calculate Z-scores for Percentiles
# Calculate the critical z-scores that bound the central 95%, 99%, and 90% of the distribution
z_score_95th_percentile1 = norm.ppf((1-0.95)/2, 0, 1) # probability point function (ppf)
z_score_95th_percentile2 = norm.ppf(1-(1-0.95)/2, 0, 1)
z_score_99th_percentile1 = norm.ppf((1-0.99)/2, 0, 1)
z_score_99th_percentile2 = norm.ppf(1-(1-0.99)/2, 0, 1)
z_score_90th_percentile1 = norm.ppf((1-0.90)/2, 0, 1)
z_score_90th_percentile2 = norm.ppf(1-(1-0.90)/2, 0, 1)
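For each confidence level (95%, 99%, and 90%), calculate the pair of Z-scores that leave \(\alpha/2\) in each tail. Despite the variable names, these are not the 95th, 99th, and 90th percentiles; for the 95% level, for example, they are the 2.5th and 97.5th percentiles.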
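To see what these calls return, the following sketch tabulates the critical values for the three confidence levels and confirms that the area between each pair equals the confidence level. The dictionary name critical_values is an illustrative choice.

```python
from scipy.stats import norm

critical_values = {}
for confidence in (0.90, 0.95, 0.99):
    alpha = 1 - confidence
    lower = norm.ppf(alpha / 2)          # leaves alpha/2 in the lower tail
    upper = norm.ppf(1 - alpha / 2)      # leaves alpha/2 in the upper tail
    central_area = norm.cdf(upper) - norm.cdf(lower)
    critical_values[confidence] = (lower, upper, central_area)
    print(f"{confidence:.0%}: [{lower:.3f}, {upper:.3f}], "
          f"central area = {central_area:.4f}")
```

The critical values come out at roughly ±1.645, ±1.960, and ±2.576 for the 90%, 95%, and 99% levels respectively.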
Add Confidence Interval Lines
# Add vertical lines at the critical z-scores for the 95%, 99%, and 90% confidence levels
critical_lines = [
    (z_score_95th_percentile1, z_score_95th_percentile2, 'red', '95% confidence level'),
    (z_score_99th_percentile1, z_score_99th_percentile2, 'blue', '99% confidence level'),
    (z_score_90th_percentile1, z_score_90th_percentile2, 'orange', '90% confidence level'),
]
for z_lower, z_upper, color, label in critical_lines:
    for z_value in (z_lower, z_upper):
        # Each line runs from the x-axis up to the PDF curve at that z-score
        fig.add_shape(
            type='line',
            x0=z_value,
            x1=z_value,
            y0=0,
            y1=norm.pdf(z_value, 0, 1),
            line=dict(color=color, dash='solid'),
            name=label,
        )
Add vertical lines at the critical Z-scores that leave \(\alpha/2\) in each tail. The region between each pair of same-colored lines is the central area \(1 - \alpha\) that corresponds to the confidence level.
Update Layout and Display Plot
# Update layout
fig.update_layout(
xaxis_title='Z-score',
yaxis_title='Probability Density Function (PDF)',
yaxis2=dict(title='Cumulative Probability', overlaying='y', side='right', color='green'),
title='Standard Normal Distribution (Z-Distribution)',
xaxis=dict(
tickmode='linear',
dtick=1
),
showlegend=False
)
# Display the plot
fig.show()
Conclusion:
Mastering the statistical intricacies of confidence intervals involves navigating through PDFs, CDFs, Z-scores, and significance levels. The Python visualization code presented here bridges theory and practice, providing a tangible representation of confidence intervals in the context of the standard normal distribution.
The t-Distribution:
The following code plots the probability density function (PDF) and cumulative distribution function (CDF) of a t-distribution with a specified number of degrees of freedom (df). The t-distribution is used when the population standard deviation is unknown and must be estimated from the sample, which matters most when the sample size is small. Its tails are heavier than those of the normal distribution, and the degrees of freedom parameter (\(n - 1\) when estimating a single mean) controls how much heavier: as df grows, the t-distribution approaches the standard normal.
In the plot, vertical lines mark the critical t-scores that bound the central 95%, 99%, and 90% of the distribution, which makes the heavier tails visible. This distribution is especially valuable when the key assumption of a z-test, a known population standard deviation, is not met.
import numpy as np
import plotly.graph_objects as go
from scipy.stats import t # Import t-distribution instead of normal distribution
# Generate a range of values for the x-axis
x = np.linspace(-5, 5, 1000)
# Degrees of freedom for the t-distribution
degrees_of_freedom = 15 # You can adjust this value
# Calculate the corresponding y-values using the probability density function (pdf) of the t-distribution
y_pdf = t.pdf(x, degrees_of_freedom)
y_cdf = t.cdf(x, degrees_of_freedom) # Calculate the cumulative distribution function (CDF)
# Plot the t-distribution PDF
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y_pdf, mode='lines', name='PDF', line=dict(color='black')))
# Plot the cumulative distribution function CDF on the second y-axis
fig.add_trace(go.Scatter(x=x, y=y_cdf, mode='lines', name='CDF', line=dict(color='green'), yaxis='y2'))
# Calculate the critical t-scores that bound the central 95%, 99%, and 90% of the distribution
t_score_95th_percentile1 = t.ppf((1-0.95)/2, degrees_of_freedom)
t_score_95th_percentile2 = t.ppf(1-(1-0.95)/2, degrees_of_freedom)
t_score_99th_percentile1 = t.ppf((1-0.99)/2, degrees_of_freedom)
t_score_99th_percentile2 = t.ppf(1-(1-0.99)/2, degrees_of_freedom)
t_score_90th_percentile1 = t.ppf((1-0.90)/2, degrees_of_freedom)
t_score_90th_percentile2 = t.ppf(1-(1-0.90)/2, degrees_of_freedom)
# Add vertical lines at the critical t-scores for the 95%, 99%, and 90% confidence levels
critical_lines = [
    (t_score_95th_percentile1, t_score_95th_percentile2, 'red', '95% confidence level'),
    (t_score_99th_percentile1, t_score_99th_percentile2, 'blue', '99% confidence level'),
    (t_score_90th_percentile1, t_score_90th_percentile2, 'orange', '90% confidence level'),
]
for t_lower, t_upper, color, label in critical_lines:
    for t_value in (t_lower, t_upper):
        # Each line runs from the x-axis up to the PDF curve at that t-score
        fig.add_shape(
            type='line',
            x0=t_value,
            x1=t_value,
            y0=0,
            y1=t.pdf(t_value, degrees_of_freedom),
            line=dict(color=color, dash='solid'),
            name=label,
        )
fig.update_layout(
xaxis_title='t-score',
yaxis_title='Probability Density Function (PDF)',
yaxis2=dict(title='Cumulative Probability', overlaying='y', side='right', color='green'),
title=f't-Distribution, {degrees_of_freedom} df',
xaxis=dict(
tickmode='linear',
dtick=1
),
showlegend=False
)
# Display the plot
fig.show()
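To connect the t-distribution back to interval estimation, the sketch below compares the 95% critical values of the two distributions and builds a t-based confidence interval for a small sample. The function name t_confidence_interval and the simulated sample of 16 observations (giving 15 degrees of freedom, as in the plot) are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm, t


def t_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) bounds for the mean using the t critical value."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    df = n - 1                                   # degrees of freedom
    x_bar = sample.mean()
    standard_error = sample.std(ddof=1) / np.sqrt(n)
    t_crit = t.ppf(1 - (1 - confidence) / 2, df)
    return x_bar - t_crit * standard_error, x_bar + t_crit * standard_error


# Critical values for a 95% interval: t with 15 df versus standard normal
t_crit_15 = t.ppf(0.975, 15)     # approximately 2.131
z_crit = norm.ppf(0.975)         # approximately 1.960
print(f"t critical (15 df): {t_crit_15:.3f}, z critical: {z_crit:.3f}")

# With many degrees of freedom the two nearly coincide
t_crit_1000 = t.ppf(0.975, 1000)

# Simulated small sample for illustration only (n = 16, so df = 15)
rng = np.random.default_rng(1)
sample = rng.normal(loc=50, scale=10, size=16)
lower, upper = t_confidence_interval(sample)
print(f"95% t-interval: ({lower:.2f}, {upper:.2f})")
```

Because the t critical value is larger than the normal one at the same confidence level, the t-based interval is wider, reflecting the extra uncertainty from estimating the standard deviation on a small sample.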