
KNN algorithm and its implementation in Python

Shashank Shanu

3 years ago

Table of Contents
  • Introduction
  • Basic idea behind K-Nearest neighbours
  • How does KNN work?
  • Intuition behind the KNN algorithm
  • How to choose the value of “k”
  • Some pros and cons of the KNN algorithm
                Pros
                Cons
  • KNN implementation in Python

Introduction

          I hope you enjoyed my previous articles. Today I will provide an end-to-end explanation and implementation of one of the most popular machine learning algorithms: K-Nearest Neighbours, also known as KNN.
It is one of the simplest and most widely used machine learning algorithms. KNN is a non-parametric, lazy learning algorithm based on feature similarity.
When we say KNN is non-parametric, we mean that it does not make any assumptions about the underlying data distribution.
KNN is also a lazy algorithm (as opposed to an eager algorithm). Does that mean KNN does nothing? Not quite. It means that KNN does not use the training data points to build a generalized model; there is no explicit training phase, or it is very minimal. The training phase is therefore very fast, and new data points can be classified as soon as they arrive.
KNN can also be used for regression by taking the average (or median) of the values of the k nearest neighbours.
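As a quick illustration of that idea, here is a minimal sketch of KNN regression using sklearn's KNeighborsRegressor, which averages the target values of the nearest neighbours. The tiny dataset below is made up purely for illustration.
# Minimal sketch of KNN regression (toy data, for illustration only)
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # single feature
y = np.array([1.2, 1.9, 3.1, 4.2, 5.0])             # continuous target
reg = KNeighborsRegressor(n_neighbors=3)   # prediction = mean of the 3 nearest targets
reg.fit(X, y)
print(reg.predict([[2.5]]))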

Basic idea behind K-Nearest neighbours

“If it walks like a duck and quacks like a duck, then it’s probably a duck.”
  • The KNN classification rule is to assign to a test sample the majority class label of its K nearest training samples.
  • In practice, K is usually chosen to be odd so as to avoid ties. Here K represents the number of nearest neighbours considered.
  • The K=1 rule is generally called the “nearest neighbour classification rule”.
By now you should have some idea about KNN. Let me explain how the algorithm works.

How does KNN work?

Step 1: Choose the number of neighbours k. Let’s say we choose k = 5.
[Figure: choosing the value of k for a new data point]
Step 2: Take the K nearest neighbours of the new data point according to a distance metric; Euclidean distance is the most commonly used.
[Figure: the K nearest neighbours of the new data point]
Step 3: Finally, among these K neighbours, count the number of data points in each category and assign the new data point to the category with the greatest number of neighbours.
[Figure: assigning the new data point to the majority class]
These are the basic steps we follow when applying the K-Nearest Neighbours algorithm.
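To make these steps concrete, here is a minimal from-scratch sketch (NumPy only, on a tiny made-up dataset) that follows them literally: compute the Euclidean distances, take the k nearest points, and return the majority class among them.
# From-scratch sketch of the three steps above (toy data, for illustration only)
import numpy as np
from collections import Counter
def knn_predict(X_train, y_train, x_new, k=5):
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]          # indices of the k nearest neighbours
    # Step 3: majority vote among the labels of the k neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))   # predicts class 0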

Intuition behind the KNN algorithm

          When a new data point arrives, we calculate its distance (say, the Euclidean distance) to each existing data point. Then, based on the value of K (say k = 5), we select the 5 points with the smallest distances and look at which class each of those points belongs to. The class that appears most often among the neighbours is chosen as the class of the new data point.
Note: KNN can also be used for regression problems, but it is mainly used for classification.

How to choose the value of “k”

          The choice of the value of k (the number of neighbours) determines how well the model generalizes to future data. Balancing overfitting against underfitting the training data is a problem known as the “bias-variance trade-off”.
Choosing a large value of K reduces the variance caused by noisy data, but biases the learner so that it runs the risk of ignoring small but important patterns.
If K is too small, our model becomes sensitive to noisy data points.
If K is too large, the neighbourhood may include points from other classes.
So, to avoid these problems, we have to choose an optimal number of neighbours so that our model performs well on new data points.
There are many different methods for choosing the value of “k”. Here I will give you a brief idea of the “elbow method”.
[Figure: elbow method]
In this method, we plot the error (for example, the misclassification rate on held-out data) against different values of k.
The value of K is chosen where the error or misclassification rate is at its minimum, as shown in the figure above.
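In practice, a simple way to do this is to train KNN for a range of k values and plot the error on a held-out split, then pick the k where the curve is lowest or flattens out. A rough sketch, using the same breast cancer dataset as the implementation section below:
# Rough sketch: evaluate KNN for several values of k and plot the test error
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
ks = range(1, 21)
errors = []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    errors.append(1 - knn.score(X_test, y_test))   # misclassification rate on the test split
plt.plot(ks, errors, marker='o')
plt.xlabel('k (number of neighbours)')
plt.ylabel('error rate')
plt.show()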

Some pros and cons of the KNN algorithm

Pros:

  • KNN does not make assumptions about data.
  • It is a simple algorithm and very easy to understand.
  • It can be used for both classification and regression problems.

Cons:

  • Computational complexity — All of the training data points must be present in memory in order to calculate the closest K neighbours.
  • KNN is very sensitive to irrelevant features.
  • It is also sensitive to the scale of the features we feed it, since we are computing distances to the closest K points (see the scaling sketch right after this list).
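Because KNN works on raw distances, a common remedy for this scale sensitivity is to standardize the features before fitting. Here is a quick sketch using StandardScaler in a pipeline; note that the implementation section below does not do this, it is shown here only as an illustration.
# Sketch: standardize features before KNN so large-scale features do not dominate the distance
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy with scaled features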
Now that you have learned the theoretical aspects of the K-Nearest Neighbours algorithm, let’s implement it in Python and see it in practice.

KNN implementation in Python

         Here, in this example of KNN implementation in Python, we’ll be using the built-in breast cancer dataset from the sklearn.datasets module.
# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import seaborn as sns
sns.set()
This dataset consists of data related to tumours and classifies them into two categories (malignant and benign). It has around 30 features. In the real world, we would look at the correlations between the features and select the subset that plays the greatest role in determining whether a tumour is malignant or benign.
But for simplicity, we will just pick a couple of them ('mean area' and 'mean compactness'). We also need to encode the categorical target so the model can interpret it (here, malignant = 0 and benign = 1).
# Loading dataset and creating dependent and independent variables.
breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
X = X[['mean area', 'mean compactness']]
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)
y = pd.get_dummies(y, drop_first=True)
While building the model we need both a training and a testing dataset, so we split our data with sklearn's train_test_split method, keeping aside 25% of the samples (the default) for testing.
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
The sklearn library gives us a convenient layer of abstraction: to use the KNN algorithm we simply create an instance of KNeighborsClassifier. By default, KNeighborsClassifier looks for the 5 nearest neighbours; the n_neighbors parameter changes that. Here we also explicitly tell the classifier to use Euclidean distance for determining the proximity between neighbouring points.
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train, y_train)
Now that our model is trained, we use it to predict whether a tumour is benign or malignant based on its mean area and mean compactness.
y_pred = knn.predict(X_test)
With the help of visualization libraries, we can compare the actual labels of the samples in the testing dataset (first plot) with the predictions made by our trained model (second plot).
sns.scatterplot(
    x='mean area',
    y='mean compactness',
    hue='benign',
    data=X_test.join(y_test, how='outer')
)
[Figure: scatter plot of the test samples coloured by actual class]
plt.scatter(
    X_test['mean area'],
    X_test['mean compactness'],
    c=y_pred,
    cmap='coolwarm',
    alpha=0.7
)
Sklearn also provides other metrics for evaluating our model, such as the confusion matrix and the accuracy score. The numbers on the diagonal of the confusion matrix correspond to correct predictions, whereas the off-diagonal entries are the false positives and false negatives.
confusion_matrix(y_test, y_pred)
Output:
array([[45,  9],
       [ 3, 86]], dtype=int64)
accuracy_score(y_test, y_pred)
Output:
0.916083916083916
Here, we can see the model gives us an accuracy of about 91.6%, which is quite good.
I hope you enjoyed reading this article and now understand the KNN algorithm and its implementation in Python.
For more such blogs/courses on data science, machine learning, artificial intelligence and emerging new technologies do visit us at InsideAIML.
Thanks for reading…
Happy Programming…
