Confidence Intervals: Statistical Fundamentals and Python Visualization

Introduction:

Confidence intervals play a pivotal role in statistical inference: they give a range of values that, at a stated confidence level, is expected to contain the true population parameter. This article covers the core statistical principles behind confidence intervals and then walks through Python code for visualizing them.

Statistical Foundations:

1. Understanding the Standard Normal Distribution:

At the heart of confidence intervals lies the standard normal distribution (\(Z\)), the normal distribution with mean 0 and standard deviation 1. Let’s begin with its Probability Density Function (PDF) and Cumulative Distribution Function (CDF):

\[ f(x) = \frac{1}{\sqrt{2\pi}} \cdot e^{-\frac{x^2}{2}} \]

\[ F(x) = \int_{-\infty}^{x} f(t) \, dt \]
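As a quick check of the two formulas above, the sketch below evaluates them directly with Python's standard library. The function names are illustrative, and the CDF uses the well-known error-function form of the integral rather than numerical integration.

```python
import math
from statistics import NormalDist


def standard_normal_pdf(x: float) -> float:
    """f(x) = exp(-x^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


def standard_normal_cdf(x: float) -> float:
    """F(x), written with the error function instead of an explicit integral."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Cross-check against the standard library's NormalDist (mean 0, sd 1).
reference = NormalDist(mu=0.0, sigma=1.0)
for value in (-1.96, 0.0, 1.0):
    assert math.isclose(standard_normal_pdf(value), reference.pdf(value), rel_tol=1e-9)
    assert math.isclose(standard_normal_cdf(value), reference.cdf(value), rel_tol=1e-9)

print(f"f(0) = {standard_normal_pdf(0.0):.4f}")  # f(0) = 0.3989
print(f"F(0) = {standard_normal_cdf(0.0):.4f}")  # F(0) = 0.5000
```

The peak of the density sits at \(x = 0\), and exactly half of the probability lies below it, which is why \(F(0) = 0.5\).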

2. Z-Scores, Percentiles, and Significance Level (\(\alpha\)):

A Z-score measures how many standard deviations (\(\sigma\)) a value \(x\) lies from the mean (\(\mu\)), and it is the building block for constructing confidence intervals:

\[ Z = \frac{x - \mu}{\sigma} \]

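Percentiles are tied to Z-scores through the CDF: the percentile of a value is the cumulative probability of observing a value at or below it. The significance level (\(\alpha\)) is the total probability assigned to the extreme values outside the confidence interval; for a two-sided interval it is split evenly between the two tails (\(\alpha/2\) each).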

\[ P(X \leq x) = \text{CDF}(x) \]
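To make the link between Z-scores and percentiles concrete, here is a small sketch using scipy.stats.norm. The mean, standard deviation, and observed value are hypothetical numbers chosen for illustration, not data from this article.

```python
from scipy.stats import norm

# Hypothetical example values, not taken from the article.
mu = 100.0     # population mean
sigma = 15.0   # population standard deviation
x = 130.0      # observed value

# Z-score: how many standard deviations x lies from the mean.
z = (x - mu) / sigma

# Percentile of x: cumulative probability P(X <= x) from the standard normal CDF.
percentile = norm.cdf(z)

# Two-tailed probability of a value at least this far from the mean in either direction.
two_tailed = 2 * norm.sf(abs(z))

print(f"z = {z:.2f}")                                 # z = 2.00
print(f"P(X <= x) = {percentile:.4f}")                # P(X <= x) = 0.9772
print(f"two-tailed probability = {two_tailed:.4f}")   # two-tailed probability = 0.0455
```

A value two standard deviations above the mean sits at roughly the 97.7th percentile, and only about 4.6% of values lie that far from the mean in either direction.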

Confidence Intervals:

3. Confidence Interval Bounds and \(\alpha/2\):

For a confidence level of \(1 - \alpha\), the confidence interval for a population mean (\(\mu\)) is centered on the sample mean (\(\bar{x}\)). The critical value \(Z_{\alpha/2}\) is chosen so that a probability of \(\alpha/2\) lies in each tail:

\[ \text{Lower Bound} = \bar{x} - Z_{\alpha/2} \cdot \frac{s}{\sqrt{n}} \]

\[ \text{Upper Bound} = \bar{x} + Z_{\alpha/2} \cdot \frac{s}{\sqrt{n}} \]

Here, \(s\) is the sample standard deviation, \(n\) is the sample size, and \(\frac{s}{\sqrt{n}}\) is the standard error, which measures how much the sample mean (\(\bar{x}\)) varies from sample to sample. Using a Z critical value with \(s\) is an approximation that works well for large samples.
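The two bounds can be computed in a few lines. The sketch below is one possible implementation: the helper name z_confidence_interval and the sample values are made up for illustration, and the sample is deliberately tiny so the arithmetic is easy to follow (for a sample this small, the Z-based interval is only a rough approximation).

```python
import numpy as np
from scipy.stats import norm


def z_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) for the mean using the Z-based formula."""
    data = np.asarray(sample, dtype=float)
    n = data.size
    x_bar = data.mean()
    s = data.std(ddof=1)                    # sample standard deviation
    standard_error = s / np.sqrt(n)
    alpha = 1.0 - confidence
    z_crit = norm.ppf(1.0 - alpha / 2.0)    # Z_{alpha/2}
    margin = z_crit * standard_error
    return x_bar - margin, x_bar + margin


# Hypothetical measurements, made up for illustration.
sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 5.1, 4.9]
lower, upper = z_confidence_interval(sample, confidence=0.95)
print(f"95% CI for the mean: ({lower:.3f}, {upper:.3f})")
```

Raising the confidence level (for example to 0.99) increases the critical value and therefore widens the interval.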

Python Visualization Code:

Now, let’s transition to the Python code implementation to visually represent these statistical concepts.

Import Libraries

import numpy as np
import plotly.graph_objects as go
from scipy.stats import norm

Import necessary libraries for numerical operations (numpy), interactive plotting (plotly.graph_objects), and standard normal distribution functions (scipy.stats.norm).

Generate \(x\)-axis Values

# Generate a range of values for the x-axis
x = np.linspace(-5, 5, 1000)

Create a range of \(x\)-axis values from -5 to 5 using numpy.linspace.

Calculate PDF and CDF

# Calculate PDF and CDF
y_pdf = norm.pdf(x, 0, 1)
y_cdf = norm.cdf(x, 0, 1)

Calculate the Probability Density Function (PDF) and Cumulative Distribution Function (CDF) of the standard normal distribution.

Plot PDF and CDF

# Plot PDF and CDF
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y_pdf, mode='lines', name='PDF', line=dict(color='black')))
fig.add_trace(go.Scatter(x=x, y=y_cdf, mode='lines', name='CDF', line=dict(color='green'), yaxis='y2'))

Create a plot using plotly.graph_objects with two traces for PDF and CDF.

Calculate Z-scores for Percentiles

# Calculate the two-tailed critical z-scores for the 95%, 99%, and 90% confidence levels
z_score_95th_percentile1 = norm.ppf((1-0.95)/2, 0, 1) # percent point function (ppf), the inverse of the CDF
z_score_95th_percentile2 = norm.ppf(1-(1-0.95)/2, 0, 1)

z_score_99th_percentile1 = norm.ppf((1-0.99)/2, 0, 1)
z_score_99th_percentile2 = norm.ppf(1-(1-0.99)/2, 0, 1)

z_score_90th_percentile1 = norm.ppf((1-0.90)/2, 0, 1)
z_score_90th_percentile2 = norm.ppf(1-(1-0.90)/2, 0, 1)

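Calculate the pair of Z-scores that cut off \(\alpha/2\) in each tail for the 95%, 99%, and 90% confidence levels, using the percent point function (norm.ppf), which is the inverse of the CDF.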
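For reference, the same critical values can be produced in a loop and printed. This is a compact sketch of the calculation above; the dictionary name critical_values is an illustrative choice.

```python
from scipy.stats import norm

critical_values = {}
for confidence in (0.90, 0.95, 0.99):
    alpha = 1.0 - confidence
    lower = norm.ppf(alpha / 2.0)         # cuts off alpha/2 in the left tail
    upper = norm.ppf(1.0 - alpha / 2.0)   # cuts off alpha/2 in the right tail
    critical_values[confidence] = (lower, upper)
    print(f"{confidence:.0%} confidence: alpha/2 = {alpha / 2:.3f}, "
          f"critical values = ({lower:+.3f}, {upper:+.3f})")
```

The upper critical values come out at about 1.645, 1.960, and 2.576, the familiar multipliers for 90%, 95%, and 99% intervals, and the lower values are their mirror images because the distribution is symmetric.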

Add Confidence Interval Lines

# Add vertical lines at the critical z-scores for the 95%, 99%, and 90% confidence levels
fig.add_shape(go.layout.Shape(
    type='line',
    x0=z_score_95th_percentile1,
    x1=z_score_95th_percentile1,
    y0=0,
    y1=norm.pdf(z_score_95th_percentile1, 0, 1),
    line=dict(color='red', dash='solid'),
    name='95% Confidence Level'
))

fig.add_shape(go.layout.Shape(
    type='line',
    x0=z_score_95th_percentile2,
    x1=z_score_95th_percentile2,
    y0=0,
    y1=norm.pdf(z_score_95th_percentile2, 0, 1),
    line=dict(color='red', dash='solid'),
))

fig.add_shape(go.layout.Shape(
    type='line',
    x0=z_score_99th_percentile1,
    x1=z_score_99th_percentile1,
    y0=0,
    y1=norm.pdf(z_score_99th_percentile1, 0, 1),
    line=dict(color='blue', dash='solid'),
    name='99% Confidence Level'
))

fig.add_shape(go.layout.Shape(
    type='line',
    x0=z_score_99th_percentile2,
    x1=z_score_99th_percentile2,
    y0=0,
    y1=norm.pdf(z_score_99th_percentile2, 0, 1),
    line=dict(color='blue', dash='solid'),
))

fig.add_shape(go.layout.Shape(
    type='line',
    x0=z_score_90th_percentile1,
    x1=z_score_90th_percentile1,
    y0=0,
    y1=norm.pdf(z_score_90th_percentile1, 0, 1),
    line=dict(color='orange', dash='solid'),
    name='90% Confidence Level'
))

fig.add_shape(go.layout.Shape(
    type='line',
    x0=z_score_90th_percentile2,
    x1=z_score_90th_percentile2,
    y0=0,
    y1=norm.pdf(z_score_90th_percentile2, 0, 1),
    line=dict(color='orange', dash='solid'),
))

Add vertical lines to the plot at the Z-scores that cut off \(\alpha/2\) in each tail. Each same-colored pair marks the bounds of the central region containing 90% (orange), 95% (red), or 99% (blue) of the distribution.
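The six shape definitions follow the same pattern, so they can also be generated in a loop. The sketch below is an alternative arrangement, not the article's original code: it builds the shapes as plain dictionaries, which plotly accepts for layout shapes, and the names levels and shapes are illustrative.

```python
from scipy.stats import norm

# Confidence levels and the colors used for them in the plot.
levels = {0.90: "orange", 0.95: "red", 0.99: "blue"}

shapes = []
for confidence, color in levels.items():
    alpha = 1.0 - confidence
    for tail_probability in (alpha / 2.0, 1.0 - alpha / 2.0):
        z = float(norm.ppf(tail_probability))
        shapes.append(dict(
            type="line",
            x0=z, x1=z,                      # vertical line at the critical value
            y0=0.0, y1=float(norm.pdf(z)),   # from the axis up to the PDF curve
            line=dict(color=color, dash="solid"),
        ))

# With a plotly figure `fig`, the whole list can be attached in one call:
# fig.update_layout(shapes=shapes)
print(len(shapes), "line shapes built")
```

Adding another confidence level then only requires one more entry in the levels dictionary.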

Update Layout and Display Plot

# Update layout
fig.update_layout(
    xaxis_title='Z-score',
    yaxis_title='Probability Density Function (PDF)',
    yaxis2=dict(title='Cumulative Probability', overlaying='y', side='right', color='green'),
    title='Standard Normal Distribution (Z-Distribution)',
    xaxis=dict(
        tickmode='linear',
        dtick=1
    ),
    showlegend=False
)

# Display the plot
fig.show()

Conclusion:

Mastering the statistical intricacies of confidence intervals involves navigating through PDFs, CDFs, Z-scores, and significance levels. The Python visualization code presented here bridges theory and practice, providing a tangible representation of confidence intervals in the context of the standard normal distribution.

t-distribution

The following code plots the probability density function (PDF) and cumulative distribution function (CDF) of a t-distribution with a specified number of degrees of freedom (df). The t-distribution is commonly used to estimate a population mean when the sample size is small or the population standard deviation is unknown, which makes it valuable when the z-test's assumption of a known population standard deviation does not hold. Its tails are heavier than those of a normal distribution, reflecting the extra uncertainty that comes from estimating the standard deviation from a small sample. The degrees of freedom parameter controls how heavy the tails are: for a one-sample mean it equals \(n - 1\), and as it grows the t-distribution approaches the standard normal. In the plot, vertical lines mark the critical t-scores for the 95%, 99%, and 90% confidence levels, showing how far into the tails they fall.
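Before plotting, it can help to see the heavier tails numerically. The short sketch below compares the 95% critical value of the standard normal with that of the t-distribution at a few degrees of freedom; the specific df values are arbitrary choices for illustration.

```python
from scipy.stats import norm, t

confidence = 0.95
alpha = 1.0 - confidence

# Upper critical value of the standard normal for comparison.
z_crit = norm.ppf(1.0 - alpha / 2.0)
print(f"normal    : {z_crit:.3f}")

# Upper critical values of the t-distribution for increasing degrees of freedom.
t_crits = {}
for df in (5, 15, 30, 100):
    t_crits[df] = t.ppf(1.0 - alpha / 2.0, df)
    print(f"t, df={df:>3} : {t_crits[df]:.3f}")
```

The t critical values are always larger than the normal one and shrink toward it as the degrees of freedom increase, so t-based intervals are wider for small samples.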

import numpy as np
import plotly.graph_objects as go
from scipy.stats import t  # Import t-distribution instead of normal distribution

# Generate a range of values for the x-axis
x = np.linspace(-5, 5, 1000)

# Degrees of freedom for the t-distribution
degrees_of_freedom = 15  # You can adjust this value

# Calculate the corresponding y-values using the probability density function (pdf) of the t-distribution
y_pdf = t.pdf(x, degrees_of_freedom)
y_cdf = t.cdf(x, degrees_of_freedom)  # Calculate the cumulative distribution function (CDF)

# Plot the t-distribution PDF
fig = go.Figure()

fig.add_trace(go.Scatter(x=x, y=y_pdf, mode='lines', name='PDF', line=dict(color='black')))

# Plot the cumulative distribution function CDF on the second y-axis
fig.add_trace(go.Scatter(x=x, y=y_cdf, mode='lines', name='CDF', line=dict(color='green'), yaxis='y2'))

# Calculate the two-tailed critical t-scores for the 95%, 99%, and 90% confidence levels
t_score_95th_percentile1 = t.ppf((1-0.95)/2, degrees_of_freedom)
t_score_95th_percentile2 = t.ppf(1-(1-0.95)/2, degrees_of_freedom)

t_score_99th_percentile1 = t.ppf((1-0.99)/2, degrees_of_freedom)
t_score_99th_percentile2 = t.ppf(1-(1-0.99)/2, degrees_of_freedom)

t_score_90th_percentile1 = t.ppf((1-0.90)/2, degrees_of_freedom)
t_score_90th_percentile2 = t.ppf(1-(1-0.90)/2, degrees_of_freedom)

# Add vertical lines at the critical t-scores for the 95%, 99%, and 90% confidence levels
fig.add_shape(go.layout.Shape(
    type='line',
    x0=t_score_95th_percentile1,
    x1=t_score_95th_percentile1,
    y0=0,
    y1=t.pdf(t_score_95th_percentile1, degrees_of_freedom),
    line=dict(color='red', dash='solid'),
    name='95% Confidence Level'
))

fig.add_shape(go.layout.Shape(
    type='line',
    x0=t_score_95th_percentile2,
    x1=t_score_95th_percentile2,
    y0=0,
    y1=t.pdf(t_score_95th_percentile2, degrees_of_freedom),
    line=dict(color='red', dash='solid'),
))

fig.add_shape(go.layout.Shape(
    type='line',
    x0=t_score_99th_percentile1,
    x1=t_score_99th_percentile1,
    y0=0,
    y1=t.pdf(t_score_99th_percentile1, degrees_of_freedom),
    line=dict(color='blue', dash='solid'),
    name='99% Confidence Level'
))

fig.add_shape(go.layout.Shape(
    type='line',
    x0=t_score_99th_percentile2,
    x1=t_score_99th_percentile2,
    y0=0,
    y1=t.pdf(t_score_99th_percentile2, degrees_of_freedom),
    line=dict(color='blue', dash='solid'),
))

fig.add_shape(go.layout.Shape(
    type='line',
    x0=t_score_90th_percentile1,
    x1=t_score_90th_percentile1,
    y0=0,
    y1=t.pdf(t_score_90th_percentile1, degrees_of_freedom),
    line=dict(color='orange', dash='solid'),
    name='90% Confidence Level'
))

fig.add_shape(go.layout.Shape(
    type='line',
    x0=t_score_90th_percentile2,
    x1=t_score_90th_percentile2,
    y0=0,
    y1=t.pdf(t_score_90th_percentile2, degrees_of_freedom),
    line=dict(color='orange', dash='solid'),
))

fig.update_layout(
    xaxis_title='t-score',
    yaxis_title='Probability Density Function (PDF)',
    yaxis2=dict(title='Cumulative Probability', overlaying='y', side='right', color='green'),
    title=f't-Distribution, {degrees_of_freedom} df',
    xaxis=dict(
        tickmode='linear',
        dtick=1
    ),
    showlegend=False
)

# Display the plot
fig.show()
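
The critical t-scores marked in the plot can be used in the same interval formula as before, with the t critical value taking the place of \(Z_{\alpha/2}\). The sketch below assumes a one-sample setting with \(n - 1\) degrees of freedom; the helper name t_confidence_interval and the sample values are hypothetical.

```python
import numpy as np
from scipy.stats import t


def t_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) for the mean using a t critical value with n - 1 df."""
    data = np.asarray(sample, dtype=float)
    n = data.size
    x_bar = data.mean()
    standard_error = data.std(ddof=1) / np.sqrt(n)   # s / sqrt(n)
    alpha = 1.0 - confidence
    t_crit = t.ppf(1.0 - alpha / 2.0, n - 1)         # critical t-score
    margin = t_crit * standard_error
    return x_bar - margin, x_bar + margin


# Hypothetical measurements, made up for illustration.
sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 5.1, 4.9]
lower, upper = t_confidence_interval(sample, confidence=0.95)
print(f"95% t-based CI for the mean: ({lower:.3f}, {upper:.3f})")
```

With only eight observations, the t-based interval is noticeably wider than a Z-based interval built from the same data, which is the practical effect of the heavier tails.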