Understanding Data Binning in Python with NumPy and Pandas
Introduction
Data binning groups numeric values into a small number of intervals (bins), which smooths noisy data and makes a distribution easier to summarize. This article walks through a Python script that uses NumPy and Pandas to implement two types of data binning: equal-width and equal-depth.
Generating Random Data
Let’s start by generating a random dataset using NumPy:
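import numpy as np
import pandas as pd  # used later to tabulate the bins as DataFrames
# 20 random integers; the high bound is exclusive, so values fall in 1..1199
data = np.random.randint(low=1, high=1200, size=20)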
This dataset of 20 integers between 1 and 1199 (the high bound is exclusive) is used to demonstrate both equal-width and equal-depth binning. Because no seed is set, the values change on every run.
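If reproducible output is wanted, one option is NumPy's seeded generator API. The sketch below is an addition to the article's script rather than part of it; the seed value 42 and the names rng, seeded_data and repeat are arbitrary choices made for illustration.

```python
import numpy as np

# Assumption: the seed 42 is arbitrary; any fixed integer works.
rng = np.random.default_rng(seed=42)

# integers() excludes `high` by default, matching np.random.randint.
seeded_data = rng.integers(low=1, high=1200, size=20)
print(seeded_data)

# Re-creating the generator with the same seed reproduces the same values.
repeat = np.random.default_rng(seed=42).integers(low=1, high=1200, size=20)
print(np.array_equal(seeded_data, repeat))
```

With a fixed seed, the bin tables and histograms later in the article would stay the same from run to run.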
Sorting the Data
Before binning, let’s print the dataset in its original order and in sorted order:
print(data)
print(np.sort(data))
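Sorting matters most for equal-depth binning, whose bin boundaries are read from fixed positions in the sorted array; equal-width binning needs only the minimum and maximum. The sorted view also makes the distribution of values easier to read. Note that np.sort returns a sorted copy and leaves data itself unchanged.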
Equal-Width Binning
Equal-width binning divides the range of the dataset into intervals of equal width: with k bins, each interval spans (max - min) / k. Here’s a detailed breakdown of the steps involved:
# Calculate Bin Boundaries
num_bins_width = 5
bin_boundaries_width = np.linspace(min(data), max(data), num_bins_width + 1, endpoint=True)
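# Digitize Data
# With right=False the maximum value gets index num_bins_width + 1,
# so clip the indices to keep it in the last bin.
bin_indices = np.clip(np.digitize(data, bin_boundaries_width, right=False), 1, num_bins_width)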
# Initialize Bins List
equal_width_bins = []
for i in range(1, num_bins_width + 1):
    if i not in bin_indices:
        # Empty bin: no values, so no statistics
        bin_mean = np.nan
        bin_median = np.nan
        bin_boundary = []
        bin_values = []
    else:
        bin_values = np.sort(data[bin_indices == i])
        bin_mean = np.round(np.mean(bin_values), 2)
        bin_median = np.median(bin_values)
        # Smoothing by bin boundaries: replace each value with the closer of
        # the bin's smallest and largest value
        bin_boundary = np.where(bin_values - np.min(bin_values) < np.max(bin_values) - bin_values,
                                np.min(bin_values), np.max(bin_values))
    equal_width_bins.append({
        'Bin': i,
        'Interval': (bin_boundaries_width[i - 1], bin_boundaries_width[i]),
        'Data_Values': bin_values,
        'Bin_Mean': bin_mean,
        'Bin_Median': bin_median,
        'Bin_Boundary_Smoothing': bin_boundary
    })
df_equal_width_bins = pd.DataFrame(equal_width_bins)
df_equal_width_bins
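The Bin_Boundary_Smoothing column replaces every value in a bin with whichever of the bin's smallest or largest value lies closer. A minimal sketch of that rule on one hand-picked bin may make it easier to follow; the helper name smooth_by_boundaries and the sample values are illustrative and do not come from the script.

```python
import numpy as np

def smooth_by_boundaries(bin_values):
    """Replace each value with the closer of the bin's min and max.

    Ties go to the max, mirroring the strict `<` comparison in the script.
    """
    bin_values = np.sort(np.asarray(bin_values))
    lo, hi = bin_values.min(), bin_values.max()
    return np.where(bin_values - lo < hi - bin_values, lo, hi)

example_bin = [4, 8, 9, 15]  # illustrative values, not from the random dataset
smoothed = smooth_by_boundaries(example_bin)
print(smoothed)  # [ 4  4  4 15]
```

Here 8 and 9 are both closer to 4 than to 15, so they collapse to the lower boundary, while 15 stays where it is.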
Equal-Depth Binning
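Equal-depth (also called equal-frequency) binning aims to create intervals that each hold the same number of data points. The boundaries are read from the sorted data at every bin_size_depth-th position. Here’s the corresponding code: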
# Determine Bin Boundaries
num_bins_depth = 5
bin_size_depth = int(np.round(len(data) / num_bins_depth, 0))
sorted_data = np.sort(data)
bin_boundaries_depth = [sorted_data[i * bin_size_depth] for i in range(num_bins_depth)] + [max(data)]
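# Digitize Data
# With right=False the maximum value gets index num_bins_depth + 1,
# so clip the indices to keep it in the last bin.
bin_indices = np.clip(np.digitize(data, bin_boundaries_depth, right=False), 1, num_bins_depth)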
# Initialize Bins List
equal_depth_bins = []
for i in range(1, num_bins_depth + 1):
    if i not in bin_indices:
        # Empty bin: no values, so no statistics
        bin_mean = np.nan
        bin_median = np.nan
        bin_boundary = []
        bin_values = []
    else:
        bin_values = np.sort(data[bin_indices == i])
        bin_mean = np.round(np.mean(bin_values), 2)
        bin_median = np.median(bin_values)
        # Smoothing by bin boundaries: replace each value with the closer of
        # the bin's smallest and largest value
        bin_boundary = np.where(bin_values - np.min(bin_values) < np.max(bin_values) - bin_values,
                                np.min(bin_values), np.max(bin_values))
    equal_depth_bins.append({
        'Bin': i,
        'Interval': (bin_boundaries_depth[i - 1], bin_boundaries_depth[i]),
        'Data_Values': bin_values,
        'Bin_Mean': bin_mean,
        'Bin_Median': bin_median,
        'Bin_Boundary_Smoothing': bin_boundary
    })
df_equal_depth_bins = pd.DataFrame(equal_depth_bins)
df_equal_depth_bins
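Because np.digitize assigns bins by value, repeated values that sit on a boundary all land in the same bin, so the bins can end up uneven. As an alternative sketch, and not the approach used in the script above, the sorted array can be split by position with np.array_split, which keeps bin sizes within one element of each other. The sample values below are illustrative.

```python
import numpy as np

sample = np.array([5, 7, 7, 7, 12, 18, 18, 25, 31, 40])  # illustrative data with ties
num_bins = 5

# Split by position in the sorted array, not by value.
depth_bins = np.array_split(np.sort(sample), num_bins)

for i, values in enumerate(depth_bins, start=1):
    print(i, values, "mean:", np.round(values.mean(), 2))
```

The trade-off is that equal values may be separated into neighbouring bins, so the intervals are no longer defined purely by value.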
Summary
The fundamental difference between equal-width and equal-depth binning lies in how the intervals are defined. Equal-width ensures consistent data value ranges, while equal-depth maintains a consistent number of data points per interval.
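The contrast is easiest to see on skewed data. The sketch below uses pandas' built-in pd.cut (equal-width) and pd.qcut (quantile-based, that is, equal-depth) helpers instead of the manual loops above; the sample series is illustrative.

```python
import pandas as pd

# Illustrative, skewed sample: most values are small, one is very large.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width: 2 intervals covering equal spans of the value range.
width_counts = pd.cut(values, bins=2).value_counts().sort_index()

# Equal-depth: 2 intervals holding the same number of values.
depth_counts = pd.qcut(values, q=2).value_counts().sort_index()

print(width_counts.tolist())  # [9, 1]
print(depth_counts.tolist())  # [5, 5]
```

With equal-width bins the single outlier occupies a bin of its own, whereas equal-depth bins stay balanced at the cost of very different interval widths.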
Visualizing the Bins
Matplotlib is a widely used Python plotting library. The following snippet draws one histogram per binning method, with dashed red lines marking the bin boundaries, to show how the data is distributed across the bins.
import matplotlib.pyplot as plt
# Plot histograms
plt.figure(figsize=(12, 6))
# Equal-width binning
plt.subplot(1, 2, 1)
plt.hist(data, bins=bin_boundaries_width, edgecolor='black')
plt.title('Equal-Width Binning')
plt.xlabel('Data Values')
plt.ylabel('Frequency')
for bin_boundary in bin_boundaries_width:
    plt.axvline(bin_boundary, color='r', linestyle='dashed', linewidth=1)
plt.tight_layout()
# Equal-depth binning
plt.subplot(1, 2, 2)
plt.hist(data, bins=bin_boundaries_depth, edgecolor='black')
plt.title('Equal-Depth Binning')
plt.xlabel('Data Values')
plt.ylabel('Frequency')
for bin_boundary in bin_boundaries_depth:
    plt.axvline(bin_boundary, color='r', linestyle='dashed', linewidth=1)
plt.tight_layout()
plt.show()