
Shashank Shanu

2 years ago

- Introduction
- Basic idea behind K-Nearest Neighbours
- How KNN works
- Intuition behind the KNN algorithm
- How to choose the “k” value
- Some pros and cons of the KNN algorithm
  - Pros
  - Cons
- KNN implementation in Python

I hope you enjoyed my previous articles. Today I will give you an end-to-end
explanation and implementation of one of the most popular machine learning
algorithms: K-Nearest Neighbours, also known as KNN.

It is one of the simplest and most widely used machine learning algorithms.
KNN is a non-parametric, lazy learning algorithm based on feature similarity.

When we say KNN is non-parametric, we mean that it does not make any
assumptions about the underlying data distribution.

KNN is also a lazy algorithm (as opposed to an eager algorithm). Does that
mean that KNN does nothing? Not quite. It means that it does not use the
training data points to build a generalized model ahead of time. In other
words, there is no explicit training phase, or it is very minimal: the training
data is simply stored. Training is therefore fast, and new data points can be
classified as soon as they arrive, although the distance computations then
happen at prediction time.

KNN can also be used for regression, by taking the mean or median of the
target values of the k nearest neighbours.
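As a minimal sketch of the regression variant, here is a plain-Python version. The function name `knn_regress`, the use of Euclidean distance, and the toy data are my own assumptions for illustration, not part of any library.

```python
import math
from statistics import mean, median

def knn_regress(train_points, train_values, query, k=3, reducer=mean):
    """Predict a numeric value for `query` from its k nearest training points."""
    # Rank training indices by Euclidean distance to the query
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    # Combine the target values of the k closest points
    return reducer([train_values[i] for i in order[:k]])

# Toy one-feature data, invented for illustration
train_points = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_values = [10.0, 20.0, 60.0, 100.0]

print(knn_regress(train_points, train_values, (2.2,), k=3))                  # 30.0
print(knn_regress(train_points, train_values, (2.2,), k=3, reducer=median))  # 20.0
```

The three nearest points to 2.2 have target values 20, 60 and 10, so the mean is 30 and the median is 20. The median is less affected by a single unusual neighbour.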

- The KNN classification rule assigns to a test sample the majority class label of its K nearest training samples.
- In practice, **K** is usually chosen to be odd, so as to avoid ties in two-class problems. Here K is the number of neighbours consulted, not a number of clusters.
- The **K=1** rule is generally called the **“nearest neighbour classification rule”**.

By now you should have some idea of what KNN is. Let me explain how the
algorithm works.

When a new data point arrives, we calculate its distance to every training
point using a distance metric such as Euclidean distance. Then, according to
the value of K (say k = 5), we select the 5 training points with the smallest
distances and look at which class each of them belongs to. The class held by
the majority of these neighbours is assigned to the new data point.
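The steps above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions (the helper name `knn_predict` and the toy points are made up), not the way sklearn implements it.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k smallest distances
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those neighbours
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data, invented for illustration
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
                (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (2.0, 2.0), k=3))  # A
print(knn_predict(train_points, train_labels, (6.0, 6.5), k=3))  # B
```

Note that all the work happens inside the prediction call, which is exactly what makes KNN a lazy algorithm.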

The choice of k (the number of neighbours) determines how well the model
generalizes to future data. The balance between overfitting and underfitting
the training data is a problem known as the **“bias-variance tradeoff”**.

A large value of “K” reduces the impact, or variance, caused by noisy data,
but it biases the learner so that it runs the risk of ignoring small but
important patterns.

If K is too small, our model becomes
sensitive to noisy data points.

If K is too large, the neighbourhood starts to include points from the
other classes.

So, to avoid these problems, we have to choose an optimal value of K, so that
our model performs well on new data points.

There are many different methods for choosing the number of neighbours “k”.
Here I will give you a brief idea of the **“elbow method”**.

In this method, we plot the mean error against the number of neighbours K.

The value of “K” is chosen where the error, or misclassification rate, is at
its minimum, as shown in the above figure.
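The idea of comparing the error for several values of K can be sketched as follows. I am assuming a leave-one-out error estimate and a made-up toy dataset here; the helper names are hypothetical, and in practice you would usually measure the error on a held-out set or with cross-validation.

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k):
    """Majority vote among the k training points closest to `query`."""
    ranked = sorted(zip(points, labels), key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_error(points, labels, k):
    """Fraction of points misclassified when each is predicted from all the others."""
    mistakes = 0
    for i, (point, label) in enumerate(zip(points, labels)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(rest_points, rest_labels, point, k) != label:
            mistakes += 1
    return mistakes / len(points)

# Toy two-class data, invented for illustration; (2.0, 2.0) is a deliberately noisy "B"
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (2.5, 2.5), (3.0, 1.0), (2.0, 2.0),
          (6.0, 6.0), (6.5, 7.0), (7.0, 6.5), (7.5, 7.5), (8.0, 6.0), (7.0, 7.0)]
labels = ["A"] * 5 + ["B"] * 7

candidate_ks = [1, 3, 5, 7]
errors = {k: leave_one_out_error(points, labels, k) for k in candidate_ks}
best_k = min(errors, key=errors.get)  # smallest k wins if several share the minimum

for k in candidate_ks:
    print(f"k={k}: error={errors[k]:.2f}")
print("chosen k:", best_k)
```

Plotting `errors` against `candidate_ks` would give the kind of curve the elbow method reads the value of K from.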

Pros

- KNN does not make assumptions about the data.

- It is a simple algorithm and very easy to understand.

- It can be used for both classification and regression problems.

Cons

- Computational cost: all of the training data points must be kept in memory, and the distance to each of them must be computed, in order to find the closest K neighbours.

- KNN is very sensitive to irrelevant features.

- It is also sensitive to the scale of the data we feed it, since we are computing distances to the closest K points.

You have now seen the theoretical aspects of the K-Nearest Neighbours algorithm. Let’s implement it in Python and see how it works in practice.
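Before moving to the real dataset, the sensitivity to feature scale mentioned among the cons is worth seeing once in plain Python. The rows below are made up, the helper names are hypothetical, and min-max scaling is just one possible choice of rescaling.

```python
import math

def nearest_index(points, query):
    """Index of the training point closest to `query` by Euclidean distance."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], query))

def min_max_scale(rows):
    """Rescale every feature (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((value - lo) / (hi - lo) for value, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

# Made-up rows: (a feature in the hundreds, a feature between 0 and 1)
train = [(500.0, 0.10), (520.0, 0.90), (900.0, 0.12)]
query = (515.0, 0.11)

# Unscaled: the large-valued feature dominates the distance
print(nearest_index(train, query))                # 1

# Scaled together here for brevity; in practice the min and max
# would be computed from the training rows only
scaled = min_max_scale(train + [query])
print(nearest_index(scaled[:-1], scaled[-1]))     # 0
```

Without scaling, the query is matched to row 1 even though its second feature is almost identical to row 0. After scaling, both features contribute comparably and row 0 becomes the nearest neighbour.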

In this example of a KNN implementation in Python, we’ll be using the
built-in breast cancer dataset from the **sklearn.datasets** module.

```
# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import seaborn as sns
sns.set()
```

This dataset contains measurements of tumours and classifies each tumour into
one of two categories (**malignant and benign**). It has 30 features. In the
real world, we would look at the correlations between the features and select
the subset that plays the greatest role in determining whether a tumour is
malignant or benign.

But for simplicity, we will pick a couple at random. We also need to encode
the categorical labels so that the model can interpret them (let’s say,
malignant = 0 and benign = 1).

```
# Loading dataset and creating dependent and independent variables.
breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
X = X[['mean area', 'mean compactness']]
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)
# Encode the labels as a single 'benign' column (malignant = 0, benign = 1)
y = pd.get_dummies(y, drop_first=True, dtype=int)
```

To build a model we need a training and a testing dataset. Using sklearn’s
train_test_split function, we split our dataset and, with the default
settings, keep aside 25% of the samples in the original dataset for testing.

```
# Splitting dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
```
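To make the 25% hold-out concrete, here is a rough sketch of what a shuffled split does, written with the standard library only. It is not sklearn's implementation and will not reproduce the same rows; the function name `holdout_split` is hypothetical, and rounding the test size up is my assumption, chosen because it matches the 143 test samples that appear in the confusion matrix further down.

```python
import math
import random

def holdout_split(n_samples, test_fraction=0.25, seed=42):
    """Shuffle the sample indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * n_samples)  # test size rounded up (assumption)
    return indices[n_test:], indices[:n_test]

# The breast cancer dataset has 569 samples
train_idx, test_idx = holdout_split(569)
print(len(train_idx), len(test_idx))  # 426 143
```

Fixing the seed, as random_state=42 does in the call above, makes the split repeatable from run to run.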

The sklearn library provides us with a layer of abstraction over the
algorithm. So, in order to make use of KNN, we create an instance of
KNeighborsClassifier. By default, KNeighborsClassifier looks for the 5 nearest
neighbours, and this number can be changed through n_neighbors. Here we also
explicitly tell the classifier to use Euclidean distance for determining the
proximity between neighbouring points; the default metric (Minkowski with
p=2) is equivalent.

```
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
# y_train is a one-column DataFrame; flatten it to the 1-D array sklearn expects
knn.fit(X_train, y_train.values.ravel())
```

Now that our model is trained, we use it to predict whether a tumour is
**benign** or **malignant** based on its mean compactness and area.

`y_pred = knn.predict(X_test)`

With the help of visualization libraries, we compare the predictions made by
our trained model with the true labels of the samples in the testing dataset.

```
sns.scatterplot(
    x='mean area',
    y='mean compactness',
    hue='benign',
    data=X_test.join(y_test, how='outer')
)
```

```
plt.scatter(
    X_test['mean area'],
    X_test['mean compactness'],
    c=y_pred,
    cmap='coolwarm',
    alpha=0.7
)
```

Another way of evaluating our model is to compute the confusion matrix or the
accuracy score, both available in sklearn. The numbers on the diagonal of the
confusion matrix correspond to correct predictions, whereas the off-diagonal
numbers are the false positives and false negatives.

`confusion_matrix(y_test, y_pred)`

```
array([[45,  9],
       [ 3, 86]], dtype=int64)
```

`accuracy_score(y_test, y_pred)`

`0.916083916083916`

Here, we can see that the model gives us an accuracy of about 91.6%, which is
quite good.
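The accuracy can be recomputed by hand from the confusion matrix, which is a useful sanity check. The sketch below uses sklearn's convention that rows are true classes and columns are predicted classes; reading row 0 as malignant and row 1 as benign is my assumption, based on the encoding chosen earlier.

```python
# Confusion matrix reported above (rows: true class, columns: predicted class)
matrix = [[45, 9],
          [3, 86]]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal entries
accuracy = correct / total

# Per-class recall: correct predictions divided by the row total
recall = [matrix[i][i] / sum(matrix[i]) for i in range(len(matrix))]

print(total, correct)                 # 143 131
print(round(accuracy, 4))             # 0.9161
print([round(r, 3) for r in recall])  # [0.833, 0.966]
```

So 131 of the 143 test samples are classified correctly, and under the assumed encoding the model misses more malignant tumours (9 of 54) than benign ones (3 of 89), which the single accuracy number hides.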

I hope you enjoyed reading this article and that you now understand the
**KNN algorithm and its implementation in Python.**

For more such blogs/courses on data science, machine
learning, artificial intelligence and emerging new technologies do visit us at InsideAIML.

Thanks for reading…

Happy Programming…