Imputing Missing Data with the K-Nearest Neighbors Algorithm
Introduction
Missing data is a common challenge in real-world datasets, and handling it appropriately is crucial for building robust machine learning models. In this blog post, we’ll explore how to use the K-Nearest Neighbors (KNN) algorithm to impute missing values in a dataset. We’ll implement this using Python and popular libraries such as NumPy, Pandas, Seaborn, and Scikit-Learn.
Understanding the Problem
Let’s start by loading our dataset and examining its structure:
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
# Load the dataset
import gdown
gdown.download('https://github.com/stedy/Machine-Learning-with-R-datasets/raw/master/usedcars.csv','file.csv',quiet=True)
data = pd.read_csv('file.csv')
data_orig = data.copy()
# Display the first few rows of the dataset
data.head()
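Before going further, it is worth a quick look at the column types, since the imputation approach below treats every column it predicts as categorical. A minimal check (the exact columns depend on the version of the usedcars.csv file):
# Inspect column names, dtypes, and non-null counts
data.info()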
Introducing Missing Data
To simulate missing data, we’ll randomly set a fraction of cells to be missing. In this example, we set 5% of the data as missing:
# Set a random seed for reproducibility
np.random.seed(42)
# Define the fraction of cells with missing data (e.g., 0.05 for 5% missing data)
missing_fraction = 0.05
# Calculate the number of cells to be set as missing
total_cells = data.size
num_missing = int(total_cells * missing_fraction)
# Generate random indices for missing data cells
missing_indices = np.random.choice(total_cells, num_missing, replace=False)
# Determine the row and column indices from the flattened DataFrame
num_rows, num_columns = data.shape
row_indices = missing_indices // num_columns
column_indices = missing_indices % num_columns
# Introduce missing data
for row, col in zip(row_indices, column_indices):
    data.iat[row, col] = np.nan
# Display the first 30 rows of the dataset with missing values
data.head(30)
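Before imputing, we can also count how many values ended up missing in each column; roughly 5% of all cells are now blank, spread across the columns. A quick check, assuming the code above has been run:
# Count missing values per column
data.isna().sum()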
Imputing Missing Data with K-Nearest Neighbors
Now, let’s use the KNN algorithm to impute the missing values. For each column that has missing entries, we one-hot encode the remaining columns, fit a KNN classifier on the rows where that column is observed, and predict the missing entries:
data_missing = data.copy()  # keep the version with missing values for later comparison
knn_classifier = KNeighborsClassifier(n_neighbors=3)
# Iterate through each column with missing values
for column in data.columns:
    missing_mask = data[column].isnull()
    if missing_mask.any():
        # One-hot encode the remaining columns so they can serve as features
        X_all = pd.get_dummies(data.drop(columns=[column]))
        # Targets are the observed values of the current column
        y = data.loc[~missing_mask, column]
        # Fit the KNN classifier on the rows where the target is present,
        # filling any NaNs left in the feature columns with a sentinel value
        knn_classifier.fit(X_all.loc[~missing_mask].fillna(-1), y)
        # Extract the feature rows whose target is missing
        X_miss = X_all.loc[missing_mask]
        # Predict missing values and fill them in
        imputed_values = knn_classifier.predict(X_miss.fillna(-1))
        data.loc[missing_mask, column] = imputed_values
# Display the first 30 rows of the dataset after imputation
data.head(30)
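One side effect worth noting: inserting NaN into integer columns upcasts them to float64, and they stay float after imputation. If the original dtypes matter, they can be restored from data_orig; a small sketch:
# Restore the original column dtypes now that no values are missing
data = data.astype(data_orig.dtypes.to_dict())
data.dtypes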
Visualizing Imputation Results
We can verify the result with a heatmap of the remaining missing values; after imputation, the map should show no missing cells:
# Visualize missing values after imputation
sns.heatmap(data.isna(), vmin=0, vmax=1)
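For contrast, the same heatmap drawn on the pre-imputation copy (data_missing) shows the roughly 5% of cells that were blanked out:
# Missing values before imputation, for comparison
sns.heatmap(data_missing.isna(), vmin=0, vmax=1)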
Comparing Imputed Values with Original Data
Finally, let’s compare the imputed values with the original data:
# Extract original, missing, and imputed values
missing_data = [data_missing.iloc[row, col] for row, col in zip(row_indices, column_indices)]
filled_data = [data.iloc[row, col] for row, col in zip(row_indices, column_indices)]
orig_data = [data_orig.iloc[row, col] for row, col in zip(row_indices, column_indices)]
cols = [data_orig.columns[col] for col in column_indices]
# Create a DataFrame to display the comparison
comparison_df = pd.DataFrame(zip(cols, orig_data, missing_data, filled_data),
                             columns=['Feature', 'Original', 'Missing', 'Imputed'])
Now, comparison_df contains a comparison between the original, missing, and imputed values for the affected cells. This can provide insight into the effectiveness of the imputation process.
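One simple, if rough, way to quantify this is the fraction of imputed cells that exactly match the original value (exact matches are most meaningful for the categorical columns; numeric columns will rarely match exactly):
# Fraction of imputed cells that reproduce the original value exactly
(comparison_df['Original'] == comparison_df['Imputed']).mean()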
In conclusion, using the K-Nearest Neighbors algorithm to impute missing data is a powerful technique that leverages the similarity between data points to fill in the gaps. It’s important to experiment with different values of n_neighbors (see the sketch below) and to handle categorical features appropriately for optimal results.
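Following that advice, here is a minimal sketch of such an experiment. It assumes the variables defined earlier (data_missing, data_orig, row_indices, column_indices) are still in scope, wraps the column-by-column imputation loop from above into a helper, and reports the exact-match rate against the original data for a few values of n_neighbors:
def impute_with_knn(df, n_neighbors):
    """Re-run the column-by-column KNN imputation on a copy of df."""
    df = df.copy()
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    for column in df.columns:
        missing_mask = df[column].isnull()
        if missing_mask.any():
            X_all = pd.get_dummies(df.drop(columns=[column]))
            y = df.loc[~missing_mask, column]
            knn.fit(X_all.loc[~missing_mask].fillna(-1), y)
            df.loc[missing_mask, column] = knn.predict(X_all.loc[missing_mask].fillna(-1))
    return df

# Compare exact-match rates for a few neighborhood sizes
for k in [1, 3, 5, 7]:
    imputed = impute_with_knn(data_missing, n_neighbors=k)
    matches = [imputed.iloc[row, col] == data_orig.iloc[row, col]
               for row, col in zip(row_indices, column_indices)]
    print(f"n_neighbors={k}: exact-match rate = {np.mean(matches):.2%}")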