REDO: Intro to Machine Learning Using K-Nearest Neighbor


Written by Sophie Moore
Published on May 25, 2022

The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. HOLD UP… Let’s dive a little deeper.


Breaking it down

A supervised machine learning algorithm (as opposed to an unsupervised machine learning algorithm) is one that relies on labeled input data to learn a function that produces an appropriate output when given new unlabeled data.

Let’s imagine a computer as a child 🧒 and we are its supervisor (e.g., a parent, guardian, or teacher), and we want the child (computer) to learn what a dog 🐶 looks like. We will show the child a lot of different pictures, some of which are dogs and the rest of which could be pictures of anything else (cats 😼, pigs 🐷, cows 🐮, etc.).

When we see a dog, we shout “dog!” and when it’s not a dog, we shout “no, not a dog!” After we have repeated this process with the child, we show them an image and ask them “dog?” and they will correctly (most of the time at least) answer by saying “dog!” or “no, not a dog!” depending on what the picture is. That is supervised machine learning.

Supervised machine learning algorithms are used to solve classification or regression problems.

A classification problem has a discrete value as its output. For example, “likes mantu” and “does not like mantu” are discrete. There is no middle ground: someone either likes it or they don’t. The analogy above of teaching a child to identify a dog is another example of a classification problem.

Mantu is a delicious Afghan dish. If you haven’t tried it, YOU MUST. Better yet, make it yourself at home with this Mantu Recipe.

Classification data comes down to a predictor (or set of predictors) and a label. For example, we might be trying to predict whether someone likes mantu (1) or not (0) based on their age (the predictor), as in the sketch below.

It is standard practice to represent the output (label) of a classification algorithm as an integer such as 1, -1, or 0. In this instance, these numbers are purely representational. Doing any type of mathematical operation on them would be pointless. Let’s think about it: what is “likes mantu” + “does not like mantu”? Exactly. We can’t add them, so we shouldn’t try.
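For illustration only, here is a sketch of what that classification data might look like in Python. Every age and label below is hypothetical:

```python
# Hypothetical data: each row is (age, likes_mantu),
# where 1 means "likes mantu" and 0 means "does not like mantu".
# The 1s and 0s are purely representational -- adding them is meaningless.
classification_data = [
    (18, 1),
    (25, 1),
    (33, 0),
    (47, 1),
    (52, 0),
]
```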

A regression problem has a real number (a floating-point value) as its output. For example, we could use height and weight data to estimate someone’s weight given their height.

Data used in a regression analysis looks similar to classification data, as in the sketch below. We have an independent variable (or set of independent variables) and a dependent variable (the thing we are trying to guess given our independent variables). For instance, we could say height is the independent variable and weight is the dependent variable.
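And a matching sketch of regression data, again with made-up numbers:

```python
# Hypothetical data: each row is (height_cm, weight_kg).
# Height is the independent variable (predictor);
# weight is the dependent variable (the real number we want to estimate).
regression_data = [
    (152.0, 51.3),
    (165.0, 61.8),
    (171.0, 68.4),
    (180.0, 79.1),
    (188.0, 87.6),
]
```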

Also, each row is typically called an example, observation, or data point, while each column (not including the label/dependent variable) is often called a predictor, independent variable, or feature.

An unsupervised machine learning algorithm makes use of input data without any labels. To simplify: there is no teacher (label) telling the child (computer) when it is right or when it has made a mistake so that it can self-correct.

Unlike supervised learning, which tries to learn a function that will allow us to make predictions given some new unlabeled data, unsupervised learning tries to learn the basic structure of the data to give us more insight into it.

How K-Nearest Neighbor works

The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near each other.

The apple does not fall far from the tree.

[Image: similar data points typically exist close to one another]

Did you notice in the graph above that, most of the time, similar data points are close to one another? The KNN algorithm depends on this assumption being true enough for the algorithm to be useful. KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) with some math we learned in childhood: calculating the distance between points on a graph.

Just In Case: If you need a quick refresher on how to calculate the distance between two points, review the article “Distance Between Two Points“. Make sure you read the entire article and watch the video, because we will be using this knowledge when we build KNN from scratch.
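To save you a click, here is a minimal sketch of that distance math in Python. The function name is my own; the formula is the standard straight-line (Euclidean) distance, which is one common way KNN measures similarity:

```python
import math

def euclidean_distance(point1, point2):
    """Straight-line distance between two points with the same number of dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point1, point2)))

# A 3-4-5 right triangle: the distance from (1, 2) to (4, 6) is 5.0
print(euclidean_distance((1, 2), (4, 6)))  # 5.0
```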

High-Level Overview: The K-Nearest Neighbor Algorithm

  1. Load the data
  2. Initialize K to your chosen number of neighbors
  3. For each example in the data:
     1. Calculate the distance between the query example and the current example from the data
     2. Add the distance and the index of the example to an ordered collection
  4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances
  5. Pick the first K entries from the sorted collection
  6. Get the labels of the selected K entries
  7. If regression, return the mean of the K labels
  8. If classification, return the mode of the K labels
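The upcoming tutorial will build this in JavaScript and TensorFlow.js, but as a preview, here is a minimal Python sketch of the steps above, reusing the Euclidean distance function from earlier. The function names and the toy mantu data are mine, purely for illustration:

```python
import math
from collections import Counter

def euclidean_distance(point1, point2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point1, point2)))

def knn(data, query, k, mode="classification"):
    """data: list of (features, label) pairs; query: a tuple of features."""
    # Steps 3.1-3.2: compute the distance from the query to every example,
    # keeping the example's index alongside its distance.
    distances = [
        (euclidean_distance(features, query), index)
        for index, (features, _) in enumerate(data)
    ]
    # Step 4: sort by distance, smallest first.
    distances.sort()
    # Steps 5-6: take the first K entries and look up their labels.
    k_labels = [data[index][1] for _, index in distances[:k]]
    # Step 7: regression returns the mean of the K labels.
    if mode == "regression":
        return sum(k_labels) / k
    # Step 8: classification returns the mode (most common label).
    return Counter(k_labels).most_common(1)[0][0]

# Does a hypothetical 30-year-old like mantu? Using the made-up data from earlier:
data = [((18,), 1), ((25,), 1), ((33,), 0), ((47,), 1), ((52,), 0)]
print(knn(data, (30,), k=3))  # 1 -- two of the three nearest neighbors like mantu
```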

To finish up

After talking to the Lead Data Science Instructor @ Galvanize in SF, I realized my last article didn’t do much justice in explaining the basics of machine learning. I wanted a redo, so I created this article with the addition of an explanation of K-Nearest Neighbor. If you learned something from this post, please like and share it on your networks. At the end of the week, I will be releasing a tutorial on implementing K-Nearest Neighbor using JavaScript and TensorFlow.js.
