How Does the K-Nearest Neighbors (KNN) Algorithm Work in Python?


Today I will describe one of the most intuitive classification algorithms, one that is fascinating in its simplicity: K-Nearest Neighbors, known as KNN.

KNN is based on a simple yet powerful idea: “Tell me who you’re with, and I’ll tell you who you are.” In practical terms, it classifies a new data point according to the classes of the k nearest points in the training set, most commonly using Euclidean distance as the metric.

The Nearest Neighbors

Imagine you have a dataset that represents various animals in a zoo, with information like weight, height, and age. When a new animal is added, KNN checks which k animals are closest in terms of Euclidean distance and uses this information to classify it. It’s a simple but highly effective approach.
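To make this concrete, here is a minimal from-scratch sketch of the procedure: compute the distances, take the k closest points, and let them vote. The zoo measurements and labels below are invented purely for illustration:

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, new_point, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))
    # Labels of the k closest training points
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # Majority vote among those k neighbors
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical animals: [weight (kg), height (cm), age (years)]
X_train = np.array([[4.0, 30, 2], [5.5, 35, 3], [300, 160, 10], [320, 170, 12]])
y_train = np.array([0, 0, 1, 1])  # 0 = small animal, 1 = large animal

print(knn_classify(X_train, y_train, np.array([310, 165, 11])))  # -> 1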

Euclidean Distance

Euclidean distance represents the “straight line” between two points in Euclidean space. Mathematically, the distance between two points P(x1, y1) and Q(x2, y2) is calculated as:

d(P, Q) = √((x2 − x1)² + (y2 − y1)²)

This method easily extends to multi-dimensional spaces, making it suitable for datasets with many features.
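As a quick sanity check, the same calculation in Python works for any number of dimensions; the coordinates here are arbitrary:

import numpy as np

p = np.array([1, 2, 3])
q = np.array([4, 6, 8])

# Square root of the sum of squared coordinate differences
distance = np.sqrt(np.sum((q - p) ** 2))
print(distance)  # sqrt(3² + 4² + 5²) ≈ 7.07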

Implementation in Python

Let’s see how to put KNN into practice using scikit-learn, a Python library that simplifies the implementation of machine learning algorithms.

# Import the necessary libraries
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Create a small example dataset
X = np.array([[1, 2], [2, 3], [3, 4], [6, 7], [7, 8]])
y = np.array([0, 0, 0, 1, 1])

# Initialize the KNN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Predict the class of a new entry
new_entry = np.array([[5, 5]])
prediction = knn.predict(new_entry)

print("Predicted class:", prediction[0])  # Output: 0 or 1, depending on the nearest neighbors

In just a few lines of code, we created, trained, and used a KNN classifier. This example illustrates how simple it is to implement this algorithm.
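If you want to see which training points drove the decision, scikit-learn's classifier also exposes the kneighbors() method. Continuing from the example above:

# Inspect the 3 nearest neighbors of the new entry
distances, indices = knn.kneighbors(new_entry)
print("Neighbor indices:", indices[0])
print("Neighbor distances:", distances[0])
print("Neighbor classes:", y[indices[0]])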

Advantages and Limitations of KNN

After exploring how KNN works, it’s important to examine its advantages and limitations.

Advantages of KNN

– Simplicity: Easy to implement and understand.
– No Preliminary Assumptions: Being a non-parametric algorithm, it doesn’t make assumptions about the data distribution.
– Multi-class Adaptability: Easily handles classification problems with multiple classes.
– Versatility: Can be used for both classification and regression problems, as sketched below.
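To illustrate the regression case, scikit-learn provides KNeighborsRegressor, which averages the target values of the k nearest neighbors; the data below is invented:

from sklearn.neighbors import KNeighborsRegressor
import numpy as np

# Hypothetical data: one feature, continuous target
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1])

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X, y)

# The prediction is the mean target of the 2 nearest training points
print(reg.predict([[3.5]]))  # ~ (2.9 + 4.2) / 2 = 3.55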

Limitations of KNN

– Computational Efficiency: It can be slow on large datasets as it requires calculating the distance to all points in the set.
– Sensitivity to Outliers: Outliers can negatively affect predictions.
– Curse of Dimensionality: In high-dimensional spaces, Euclidean distance can become less meaningful.
– Choosing the Value of k: Finding the optimal value of k often requires experimentation and validation, as in the sketch after this list.
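A common way to tune k is cross-validation. Here is a minimal sketch using scikit-learn's GridSearchCV on a synthetic dataset; the range of k values tried is an arbitrary choice:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class dataset, just to have something to tune on
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Try odd values of k from 1 to 19 with 5-fold cross-validation
param_grid = {"n_neighbors": range(1, 20, 2)}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("Best k:", search.best_params_["n_neighbors"])
print("Best cross-validated accuracy:", search.best_score_)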

In summary, KNN is a powerful algorithm, but it is essential to understand when and how to use it effectively.

Practical Example: Flower Species Classification with the Iris Dataset

To further illustrate the use of KNN, let’s consider the famous Iris dataset, which contains 150 flower samples divided into three species: Setosa, Versicolor, and Virginica. Each sample includes four features: sepal length and width, and petal length and width.

KNN Implementation in Python

Here’s how to implement KNN to classify flower species:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Create the KNN model with k=3
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"Model accuracy: {accuracy * 100:.2f}%")

– Dataset: Using the Iris dataset with `load_iris()`.
– Data Preparation: Splitting the data into a training set and test set with `train_test_split()`.
– Model Creation: Initializing the KNN classifier with `n_neighbors=3`.
– Model Training: Training with `fit()`.
– Prediction and Evaluation: Predicting the classes and calculating accuracy with `accuracy_score()`.

This example demonstrates how KNN can be used to solve real-world classification problems.
