Machine Learning (ML) is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on models and inference instead.
I’ve made another blog post that I believe should give you a better introduction to machine learning. Please take a look at "REDO: Intro to Machine Learning to use K-nearest neighbor".
If you are following along with my blog, then you know that I am giving a workshop at #DEVWEEK2019. My topic of choice is JavaScript and machine learning using TensorFlow.js. To help you prepare for the workshop, I am releasing a few articles that cover the basics. This tutorial goes over some of the terminology used in the machine learning world.
Before we can get started we need to define the problem we are going to solve. Let us say you work in car insurance and your boss asks…
If it hails 29 times this year in San Francisco, how much will the damage cost?
Well, let’s break down the problem…
If the number of times it hails changes, we will probably see a change in the damage cost.
That means if it hails 15 times in San Francisco next year, the damage would be a lot less than if it were to hail 32 times.
The independent variable is the number of times it hails per year, and the damage cost is the dependent variable.
(independent variable) hail per year++ === (dependent variable) damage cost++
(independent variable) hail per year-- === (dependent variable) damage cost--
These variables have special names in machine learning.
A feature is the independent variable. The exact definition from Wikipedia:
In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon being observed.
To learn more about features please visit the wiki: Machine Learning Features
A label is the dependent variable. This is the value we expect to predict, which in our case is the damage cost.
Suppose you have to categorize your friends by their height and weight into groups such as FIT and NON-FIT. Here the height and weight of each person are the features, and the group names (FIT, NON-FIT) are the labels.
To learn more about labels please visit the Quora question: Machine Learning Label
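To make the two terms concrete, here is a minimal JavaScript sketch of how the hail example could be laid out as features and labels. The field names (`hailPerYear`, `damageCost`) and every number are made up purely for illustration; they are not real insurance data.

```javascript
// Hypothetical training rows: all names and numbers are invented for illustration.
const rows = [
  { hailPerYear: 15, damageCost: 150000 },
  { hailPerYear: 22, damageCost: 220000 },
  { hailPerYear: 32, damageCost: 320000 },
];

// Features: the independent variable(s), one small array per row.
const features = rows.map((row) => [row.hailPerYear]);

// Labels: the dependent variable we want to predict, one value per row.
const labels = rows.map((row) => row.damageCost);

console.log(features);
console.log(labels);
```

Each feature row is an array so that more features (for example, average hailstone size) could be added later without changing the overall shape.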
We almost always have to combine multiple sources of data to get the exact information we are looking for. For software engineers, this can mean making multiple API requests, querying multiple tables in a relational database, or reading multiple sections of objects in a NoSQL database. There are more ways to gather information, but the important thing is to shape the data into a structure that will work with your machine learning program.
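As a rough sketch of that shaping step, the snippet below joins two hypothetical sources by year into a single array of rows. The source names, field names, and values are all assumptions made for this example; in practice the two arrays might come from an API response and a database query.

```javascript
// Hypothetical source 1: e.g. rows from a weather API (invented values).
const hailReports = [
  { year: 2016, hailPerYear: 15 },
  { year: 2017, hailPerYear: 22 },
  { year: 2018, hailPerYear: 32 },
];

// Hypothetical source 2: e.g. rows from a claims table (invented values).
const claims = [
  { year: 2016, damageCost: 150000 },
  { year: 2018, damageCost: 320000 },
];

// Index the second source by year so each lookup is cheap.
const costByYear = new Map(claims.map((c) => [c.year, c.damageCost]));

// Keep only the years present in both sources, one combined row per year.
const dataset = hailReports
  .filter((r) => costByYear.has(r.year))
  .map((r) => ({
    year: r.year,
    hailPerYear: r.hailPerYear,
    damageCost: costByYear.get(r.year),
  }));

console.log(dataset);
```

Dropping the year with no matching claim is only one option; depending on the problem, you might instead fill in the missing value or go looking for another source.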
The next part is to research your data sources. Your research can draw on many different sources to obtain the right information. Data points can come from a wide variety of places, such as APIs, spreadsheets, web scraping, and so on.
Look for features that support your label.
After you have collected your data sources for both your label and features, it is time to decide what kind of output you would like to predict. There are many different outputs to choose from, but for the sake of this tutorial we will discuss two common types of output: classification and regression.
Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Classification predictive modeling is the task of approximating a mapping function from input variables (features) to discrete output variables (labels).
The value of our labels belongs to a discrete set. This means only a few outputs are available; think of it as true or false.
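Here is a small sketch of classification in plain JavaScript, using a k-nearest-neighbor vote. It is one possible approach among many, and the training values (height in cm, weight in kg) and the function names `distance` and `classify` are invented for illustration.

```javascript
// Hypothetical labeled examples: [heightCm, weightKg] -> label (invented values).
const training = [
  { features: [170, 65], label: "FIT" },
  { features: [180, 75], label: "FIT" },
  { features: [175, 70], label: "FIT" },
  { features: [165, 95], label: "NON-FIT" },
  { features: [170, 105], label: "NON-FIT" },
  { features: [160, 90], label: "NON-FIT" },
];

// Straight-line (Euclidean) distance between two feature arrays.
function distance(a, b) {
  return Math.sqrt(a.reduce((sum, value, i) => sum + (value - b[i]) ** 2, 0));
}

// Predict the label of `point` by majority vote among its k nearest neighbors.
function classify(point, k = 3) {
  const nearest = training
    .map((example) => ({
      label: example.label,
      d: distance(point, example.features),
    }))
    .sort((a, b) => a.d - b.d)
    .slice(0, k);

  // Count how many of the nearest neighbors carry each label.
  const votes = {};
  for (const { label } of nearest) {
    votes[label] = (votes[label] || 0) + 1;
  }

  // Return the label with the most votes.
  return Object.keys(votes).reduce((best, label) =>
    votes[label] > votes[best] ? label : best
  );
}

console.log(classify([172, 68]));
console.log(classify([168, 98]));
```

Whatever point you pass in, the output is always one of the known labels. That is what makes it a discrete set.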
Regression predictive modeling is the task of approximating a mapping function from input variables to a continuous output variable. A continuous output variable is a real-value, such as an integer or floating point value. These are often quantities, such as amounts and sizes. For example, a house may be predicted to sell for a specific dollar value, perhaps in the range of $100,000 to $200,000.
The value of our labels belongs to a continuous set. This means the output will be in a range of values rather than a discrete set.
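For regression, a minimal sketch is a straight line fitted by ordinary least squares. The data below is invented and deliberately perfectly linear so the result is easy to check by hand; real damage costs would be much noisier, and the helper names `mean`, `fitLine`, and `predict` are assumptions for this example.

```javascript
// Hypothetical history: hail events per year -> damage cost in dollars (invented).
const hailPerYear = [10, 15, 20, 32];
const damageCost = [100000, 150000, 200000, 320000];

// Average of an array of numbers.
const mean = (values) => values.reduce((sum, v) => sum + v, 0) / values.length;

// Ordinary least squares for one feature: cost ≈ slope * hail + intercept.
function fitLine(xs, ys) {
  const meanX = mean(xs);
  const meanY = mean(ys);
  let numerator = 0;
  let denominator = 0;
  for (let i = 0; i < xs.length; i++) {
    numerator += (xs[i] - meanX) * (ys[i] - meanY);
    denominator += (xs[i] - meanX) ** 2;
  }
  const slope = numerator / denominator;
  return { slope, intercept: meanY - slope * meanX };
}

const { slope, intercept } = fitLine(hailPerYear, damageCost);

// The prediction is a quantity on a continuous scale, not a category.
const predict = (hail) => slope * hail + intercept;

console.log(predict(29)); // about 290000 with this made-up data
```

Unlike the classification case, the prediction can be any number along the line, including values that never appeared in the training data.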
Fundamentally, classification is about predicting a label and regression is about predicting a quantity.
Don’t forget that “Features” are categories of data points that affect the value of the “Label”.
Datasets almost always need to be cleaned up and reformatted.
Regression is used with continuous values; classification is used with discrete values.
Many different algorithms exist, and each has its pros and cons. Research a few algorithms and create a model with each one so you can see which has the best performance and predictions.
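One simple way to compare models is to score each one on rows it did not see during training. The sketch below uses mean absolute error as the score; the two "models" are hypothetical stand-ins (plain functions) and the test rows are invented.

```javascript
// Hypothetical held-out rows the models did not see during training (invented).
const testSet = [
  { hailPerYear: 12, damageCost: 125000 },
  { hailPerYear: 25, damageCost: 240000 },
  { hailPerYear: 30, damageCost: 310000 },
];

// Two hypothetical, already-trained models: each maps a feature to a prediction.
const models = {
  linear: (hail) => 10000 * hail,
  constantAverage: () => 200000,
};

// Mean absolute error: average gap between the prediction and the true label.
function meanAbsoluteError(model, rows) {
  const total = rows.reduce(
    (sum, row) => sum + Math.abs(model(row.hailPerYear) - row.damageCost),
    0
  );
  return total / rows.length;
}

// Score every model, then sort so the lowest error comes first.
const scores = Object.entries(models).map(([name, model]) => ({
  name,
  error: meanAbsoluteError(model, testSet),
}));
scores.sort((a, b) => a.error - b.error);

console.log(scores[0].name); // the model with the lowest error on this test set
```

Mean absolute error is only one possible metric, and it suits regression; for classification you would compare something like the share of correct predictions instead.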
Models relate the value of “Features” to the value of “Labels”.
Want to come to DEVELOPERWEEK in SF BAY AREA (#DEVWEEK2019)?
I am giving away 25 free OPEN tickets to #DEVWEEK2019. They are available on a first-come, first-served basis, so if you are in town and want to learn more about machine learning, come to my workshop. We will do a quick review of these subjects and dive into using JavaScript to perform machine learning. I will be releasing the tickets through our newsletter, so please subscribe below.