Imputing Missing Data with the K-Nearest Neighbors Algorithm
Introduction
Missing data is a common challenge in real-world datasets, and handling it appropriately is crucial for building robust machine learning models. In this blog post, we’ll explore how to use the K-Nearest Neighbors (KNN) algorithm to impute missing values in a dataset. We’ll implement this using Python and popular libraries such as NumPy, Pandas, Seaborn, and Scikit-Learn.
Understanding the Problem
Let’s start by loading our dataset and examining its structure:
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
# Load the dataset
import gdown
gdown.download('https://github.com/stedy/Machine-Learning-with-R-datasets/raw/master/usedcars.csv','file.csv',quiet=True)
data = pd.read_csv('file.csv')
data_orig = data.copy()
# Display the first few rows of the dataset
data.head()
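Before going further, it is worth a quick look at the column types, since the imputation approach below treats every column it predicts as categorical. A minimal check (the exact columns depend on the version of the usedcars.csv file):
# Inspect column names, dtypes, and non-null counts
data.info()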
Introducing Missing Data
To simulate missing data, we’ll randomly set a fraction of cells to be missing. In this example, we set 5% of the data as missing:
# Set a random seed for reproducibility
np.random.seed(42)
# Define the fraction of cells with missing data (e.g., 0.05 for 5% missing data)
missing_fraction = 0.05
# Calculate the number of cells to be set as missing
total_cells = data.size
num_missing = int(total_cells * missing_fraction)
# Generate random indices for missing data cells
missing_indices = np.random.choice(total_cells, num_missing, replace=False)
# Determine the row and column indices from the flattened DataFrame
num_rows, num_columns = data.shape
row_indices = missing_indices // num_columns
column_indices = missing_indices % num_columns
# Introduce missing data
for row, col in zip(row_indices, column_indices):
    data.iat[row, col] = np.nan
# Display the first 30 rows of the dataset with missing values
data.head(30)
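Before imputing, we can also count how many values ended up missing in each column; roughly 5% of all cells are now blank, spread across the columns. A quick check, assuming the code above has been run:
# Count missing values per column
data.isna().sum()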
Imputing Missing Data with K-Nearest Neighbors
Now, let’s use the KNN algorithm to impute the missing values. For each column that has missing entries, we one-hot encode the remaining columns, fit a KNN classifier on the rows where that column is observed, and predict the missing entries:
data_missing = data.copy()  # keep the version with missing values for later comparison
knn_classifier = KNeighborsClassifier(n_neighbors=3)
# Iterate through each column with missing values
for column in data.columns:
    missing_mask = data[column].isnull()
    if missing_mask.any():
        # One-hot encode the remaining columns so they can serve as features
        X_all = pd.get_dummies(data.drop(columns=[column]))
        # Targets are the observed values of the current column
        y = data.loc[~missing_mask, column]
        # Fit the KNN classifier on the rows where the target is present,
        # filling any NaNs left in the feature columns with a sentinel value
        knn_classifier.fit(X_all.loc[~missing_mask].fillna(-1), y)
        # Extract the feature rows whose target is missing
        X_miss = X_all.loc[missing_mask]
        # Predict missing values and fill them in
        imputed_values = knn_classifier.predict(X_miss.fillna(-1))
        data.loc[missing_mask, column] = imputed_values
# Display the first 30 rows of the dataset after imputation
data.head(30)
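One side effect worth noting: inserting NaN into integer columns upcasts them to float64, and they stay float after imputation. If the original dtypes matter, they can be restored from data_orig; a small sketch:
# Restore the original column dtypes now that no values are missing
data = data.astype(data_orig.dtypes.to_dict())
data.dtypes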
Visualizing Imputation Results
We can verify the result with a heatmap of the remaining missing values; after imputation, the map should show no missing cells:
# Visualize missing values after imputation
sns.heatmap(data.isna(), vmin=0, vmax=1)
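For contrast, the same heatmap drawn on the pre-imputation copy (data_missing) shows the roughly 5% of cells that were blanked out:
# Missing values before imputation, for comparison
sns.heatmap(data_missing.isna(), vmin=0, vmax=1)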
Comparing Imputed Values with Original Data
Finally, let’s compare the imputed values with the original data:
# Extract original, missing, and imputed values
missing_data = [data_missing.iloc[row, col] for row, col in zip(row_indices, column_indices)]
filled_data = [data.iloc[row, col] for row, col in zip(row_indices, column_indices)]
orig_data = [data_orig.iloc[row, col] for row, col in zip(row_indices, column_indices)]
cols = [data_orig.columns[col] for col in column_indices]
# Create a DataFrame to display the comparison
comparison_df = pd.DataFrame(zip(cols, orig_data, missing_data, filled_data),
                             columns=['Feature', 'Original', 'Missing', 'Imputed'])
Now, comparison_df contains a comparison between the original, missing, and imputed values for the affected cells. This can provide insight into the effectiveness of the imputation process.
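One simple, if rough, way to quantify this is the fraction of imputed cells that exactly match the original value (exact matches are most meaningful for the categorical columns; numeric columns will rarely match exactly):
# Fraction of imputed cells that reproduce the original value exactly
(comparison_df['Original'] == comparison_df['Imputed']).mean()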
In conclusion, using the K-Nearest Neighbors algorithm to impute missing data is a powerful technique that leverages the similarity between data points to fill in the gaps. It’s important to experiment with different values of n_neighbors (see the sketch below) and to handle categorical features appropriately for optimal results.
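Following that advice, here is a minimal sketch of such an experiment. It assumes the variables defined earlier (data_missing, data_orig, row_indices, column_indices) are still in scope, wraps the column-by-column imputation loop from above into a helper, and reports the exact-match rate against the original data for a few values of n_neighbors:
def impute_with_knn(df, n_neighbors):
    """Re-run the column-by-column KNN imputation on a copy of df."""
    df = df.copy()
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    for column in df.columns:
        missing_mask = df[column].isnull()
        if missing_mask.any():
            X_all = pd.get_dummies(df.drop(columns=[column]))
            y = df.loc[~missing_mask, column]
            knn.fit(X_all.loc[~missing_mask].fillna(-1), y)
            df.loc[missing_mask, column] = knn.predict(X_all.loc[missing_mask].fillna(-1))
    return df

# Compare exact-match rates for a few neighborhood sizes
for k in [1, 3, 5, 7]:
    imputed = impute_with_knn(data_missing, n_neighbors=k)
    matches = [imputed.iloc[row, col] == data_orig.iloc[row, col]
               for row, col in zip(row_indices, column_indices)]
    print(f"n_neighbors={k}: exact-match rate = {np.mean(matches):.2%}")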