Linear Regression - Starting Phase of Machine Learning.
10 months ago
Linear Regression in Machine Learning
A step-by-step explanation of the Linear Regression algorithm.
I hope you are well and staying safe at your place. As we all know, the
COVID-19 pandemic arrived and does not seem to want to leave our lives.
While the whole world is fighting to get rid of this pandemic, I thought,
why not share some of the things I know so that many people may benefit?
So let's start without wasting much time.
Before going deep into the Linear Regression algorithm, let us first
understand what regression is.
Regression is a statistical technique that shows an algebraic relationship
between two or more variables.
Based on this relationship (a statistical one rather than an exact function),
one can estimate the value of a variable, given the values of the other variables.
Usually, correlation is used to check whether there is any relationship
between two variables. If a relationship is found, regression is used to
quantify that relationship so that it can then be used for prediction.
Some examples are:
Predict the rainfall in cm for a month.
Predict the stock price for the next day.
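As a rough sketch of the correlation check mentioned above, the snippet below computes the Pearson correlation coefficient with NumPy. The experience and salary numbers are made up purely for illustration; a value close to +1 or -1 hints at a strong linear relationship, which is when fitting a regression line is worthwhile.

```python
import numpy as np

# Hypothetical data (not from the article): years of experience and salary in thousands.
experience = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
salary = np.array([30.0, 35.0, 41.0, 44.0, 52.0, 55.0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry
# is the Pearson correlation between the two variables.
r = np.corrcoef(experience, salary)[0, 1]
print(f"correlation = {r:.3f}")
```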
Now that you have an idea of what regression is, let's move forward and
see what types of regression there are.
In this article I will explain Linear Regression, and later I will try to
take you through the other types of regression.
What is a Linear Regression?
Linear Regression is one of the most fundamental algorithms in the Machine
Learning world, and it comes under supervised learning. Basically, it performs
a regression task. Regression models predict a dependent (target) value based
on independent variables. It is mostly used for finding out the relationship
between variables and for forecasting. Different regression models differ in
the kind of relationship they consider between the dependent and independent
variables, and in the number of independent variables being used.
Graph For Linear Regression
Linear regression performs the task of predicting a dependent variable value (y)
based on a given independent variable (x). So, this regression technique finds a
linear relationship between x (input) and y (output). Hence the name Linear
Regression.
In the figure above, X (input) is the work experience and Y (output) is the
salary of a person. The regression line is the best fit line for our model.
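Linear Regression may be further divided into:
1) Simple Linear Regression / Univariate Linear Regression
2) Multivariate Linear Regression (several independent variables; often also called Multiple Linear Regression)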
Simple Linear Regression / Univariate Linear Regression
When we try to find a relationship between a dependent variable (Y) and one
independent variable (X), it is known as Simple Linear Regression / Univariate
Linear Regression.
The mathematical equation can be given as:
Y = β0 + β1*x
where:
Y is the response or the target variable
x is the independent feature
β1 is the coefficient of x
β0 is the intercept
β0 and β1 are the model coefficients. To create a model, we must
"learn" the values of these coefficients. Once we have the values
of these coefficients, we can use the model to predict the response Y for any new x.
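To make the equation concrete, here is a minimal sketch of prediction once the coefficients are known. The function name predict and the coefficient values are hypothetical, picked only so the arithmetic is easy to follow.

```python
def predict(x, beta0, beta1):
    """Return the predicted response Y = beta0 + beta1 * x."""
    return beta0 + beta1 * x

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = 2.0, 0.5

y_hat = predict(10.0, beta0, beta1)
print(y_hat)  # 2.0 + 0.5 * 10.0 = 7.0
```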
The main aim of regression is to obtain a line that best fits the data. The
best fit line is the one for which the total prediction error (over all data
points) is as small as possible. The error is the vertical distance between a
point and the regression line.
Suppose we have a dataset that contains information about the relationship between
'number of hours studied' and 'marks obtained'. Many students have been
observed, and their hours of study and marks have been recorded. This will be our
training data. The goal is to design a model that can predict the marks given the
number of hours studied. Using the training data, a regression line is obtained
that gives the minimum error. This linear equation is then used for any new
data. That is, if we give the number of hours studied by a student as input, our
model should predict their marks with minimum error.
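The hours-versus-marks idea can be sketched in a few lines with NumPy. The five data points below are invented for illustration, and np.polyfit is just one convenient way to get a least-squares line; the helper predict_marks is a hypothetical name.

```python
import numpy as np

# Hypothetical training data: hours studied and marks obtained.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
marks = np.array([52.0, 58.0, 61.0, 67.0, 72.0])

# np.polyfit with deg=1 fits a straight line by least squares and returns
# the coefficients from the highest power down: [slope, intercept].
slope, intercept = np.polyfit(hours, marks, deg=1)

def predict_marks(hours_studied):
    """Predict marks for a given number of study hours using the fitted line."""
    return intercept + slope * hours_studied

print(f"marks = {intercept:.2f} + {slope:.2f} * hours")
print(f"predicted marks for 6 hours: {predict_marks(6.0):.1f}")
```

For this toy data the fitted line works out to roughly marks = 47.3 + 4.9 * hours, so six hours of study gives a prediction of about 76.7.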
Next, let's learn how to estimate the model coefficients.
The coefficients are estimated using the least-squares
criterion, i.e., the best fit line is the one that minimizes
the sum of squared residuals (or "sum of squared errors").
Let's understand the intuition.
Have a quick look at the plot. Consider each point, and note
that each of them has a coordinate of the form (X, Y). Now draw an imaginary
line between each point and the current "best-fit" line. We'll call
the distance between each point and the current best-fit line E. To get a
quick image of what we're trying to visualize, take a look at the picture
below.
Let's understand what elements are present in the diagram:
The red points are the observed values of x and y.
The blue line is the least squares line.
The green lines are the residuals, which are the distances between the
observed values and the least squares line.
Earlier, we labelled each green line as having a distance E, and each red
point as having a coordinate (X, Y). We can then define our best fit line as
the line having the property that:
E1^2 + E2^2 + ... + En^2 is as small as possible.
So how do we find this line?
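The least-squares line approximating the set of points
(X1, Y1), (X2, Y2), ..., (Xn, Yn) has the equation:
Y = b0 + b1*x
This is basically just another representation of the
standard equation for a line:
Y = mx + c
Here b0 is the intercept (c) and b1 is the slope (m).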
So how do we calculate the model coefficients?
The values b0 and b1 must be chosen so that they minimize the error. If the
sum of squared errors is taken as the metric to evaluate the model, then the
goal is to obtain the line that minimizes this error. The error formula is
given as:
Error = Σ (actual_output_i - predicted_output_i)^2, summed over all n data points
NOTE: If we don't square the errors, the positive
and negative errors will cancel each other out.
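The note about squaring is easy to see numerically. In the sketch below (toy data and the helper name errors are my own assumptions), a flat line that ignores x entirely and the least-squares line both have raw errors that sum to zero, yet their sums of squared errors are very different.

```python
import numpy as np

# Hypothetical data: hours studied and marks obtained.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 61.0, 67.0, 72.0])

def errors(b0, b1):
    """Return (sum of raw errors, sum of squared errors) for the line b0 + b1 * x."""
    residuals = y - (b0 + b1 * x)
    return float(residuals.sum()), float((residuals ** 2).sum())

# A flat line through the mean of y (62.0) that ignores x completely.
raw_flat, sq_flat = errors(62.0, 0.0)
# The least-squares line for this particular data (b0 = 47.3, b1 = 4.9).
raw_fit, sq_fit = errors(47.3, 4.9)

print(f"flat line:   raw sum = {raw_flat:.2f}, squared sum = {sq_flat:.2f}")
print(f"fitted line: raw sum = {raw_fit:.2f}, squared sum = {sq_fit:.2f}")
```

Judged by the raw sum, both lines look perfect. Only the squared sum (242 for the flat line against about 1.9 for the fitted one) shows which line is actually closer to the points.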
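For a model with one independent variable (say x), the coefficients that
minimize this error are:
b1 = Σ (x_i - mean(x)) * (y_i - mean(y)) / Σ (x_i - mean(x))^2
b0 = mean(y) - b1 * mean(x)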
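For the one-variable case, the least-squares estimates can be computed directly from the means of x and y. The sketch below is one way to write that in NumPy; the function name fit_simple_linear_regression and the sample data are assumptions for illustration.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Return (b0, b1) minimizing the sum of squared errors of y ~ b0 + b1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: sum of products of deviations divided by sum of squared x deviations.
    b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: makes the line pass through the point (mean of x, mean of y).
    b0 = y_mean - b1 * x_mean
    return float(b0), float(b1)

# Hypothetical data: hours studied and marks obtained.
b0, b1 = fit_simple_linear_regression([1, 2, 3, 4, 5], [52, 58, 61, 67, 72])
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

Note that this version assumes x is not constant; if every x value were the same, the denominator would be zero and no unique line would exist.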
Some of the assumptions to consider whenever we are dealing with a regression task:
The regression model is linear in terms of the coefficients and the error term.
The mean of the residuals is zero.
The error terms are not correlated with each other, i.e. given one error value, we cannot predict the next error value.
The independent variables (X) are not determined by the dependent variable (Y) and are uncorrelated with the error term. This is known as exogeneity. In layman's terms, the error term should not be predictable from the values of the independent variables.
The error terms have a constant variance, i.e. homoscedasticity.
No multicollinearity, i.e. the independent variables should not be strongly correlated with each other. If there is multicollinearity, the coefficient estimates of the OLS (ordinary least squares) model become less reliable.
The error terms are normally distributed.
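A few of these assumptions can be given a rough numerical sanity check on the residuals of a fitted line. The sketch below uses synthetic data that I generate with a known linear trend plus random noise; it is not a substitute for proper diagnostic tests, only a first look.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3 + 2x plus random noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, size=x.size)

# Fit a least-squares line and compute the residuals.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Mean of the residuals: zero (up to rounding) for a least-squares fit with an intercept.
mean_residual = residuals.mean()

# Correlation between residuals and x: also zero by construction on the
# training data, so a non-zero value would point to a fitting bug.
resid_x_corr = np.corrcoef(residuals, x)[0, 1]

# Lag-1 correlation between consecutive residuals: a rough hint about
# whether neighbouring errors are correlated with each other.
lag1_corr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

print(f"mean residual        = {mean_residual:.2e}")
print(f"corr(residuals, x)   = {resid_x_corr:.2e}")
print(f"lag-1 residual corr  = {lag1_corr:.3f}")
```

The first two numbers are guaranteed to be near zero on the training data, so they mostly confirm the fit was done correctly. The lag-1 value is the informative one here: a value far from zero would suggest the error terms are correlated.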
The general equation of a straight line is:
Y = mx + c
This means that if we have the values of m and c, we can predict
the value of y for any x. During the construction of a Linear
Regression model, the computer tries to calculate the values of m and c that
give the best-fitting straight line.
But the question arises:
How do we know this is the best fit line?
The best fit line is obtained by minimizing the error/residual.
The residual is the distance between the actual Y and the predicted Y,
as shown below:
Mathematically, the residual is:
r = actual y - predicted y
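Hence, the sum of the squares of the residuals can be written as:
S(m, b) = Σ (y_i - (m*x_i + b))^2
Here c is written as b.
As we can see, the sum of squared residuals is a function of both m and b,
so differentiating partially with respect to m and b gives:
∂S/∂m = -2 Σ x_i * (y_i - (m*x_i + b))
∂S/∂b = -2 Σ (y_i - (m*x_i + b))
For the best fit line, the error/residual should be at a minimum. The minimum
of a function occurs where the derivative is 0. So, equating both derivatives
to 0, we get:
m * Σ x_i^2 + b * Σ x_i = Σ x_i*y_i
m * Σ x_i + n*b = Σ y_i
Solving these two equations gives the values of m and b.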
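Setting both partial derivatives to zero leaves two linear equations in m and b, which can be solved directly. The sketch below does this with np.linalg.solve on made-up data and then checks that both derivatives really are (numerically) zero at the solution.

```python
import numpy as np

# Hypothetical data: hours studied and marks obtained.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 61.0, 67.0, 72.0])
n = x.size

# Equating the partial derivatives to zero gives:
#   m * sum(x^2) + b * sum(x) = sum(x * y)
#   m * sum(x)   + b * n      = sum(y)
A = np.array([[np.sum(x ** 2), np.sum(x)],
              [np.sum(x), float(n)]])
rhs = np.array([np.sum(x * y), np.sum(y)])
m, b = np.linalg.solve(A, rhs)

# Evaluate both partial derivatives of the squared-residual sum at (m, b).
residuals = y - (m * x + b)
dS_dm = -2.0 * np.sum(x * residuals)
dS_db = -2.0 * np.sum(residuals)

print(f"m = {m:.2f}, b = {b:.2f}")
print(f"dS/dm = {dS_dm:.2e}, dS/db = {dS_db:.2e}")
```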
Ideally, if we have an equation with one dependent and one independent
variable, the minimum will look as follows:
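To get a feel for the shape of this minimum without a plot, one can evaluate the sum of squared residuals over a range of candidate slopes. In the sketch below the data is hypothetical and the intercept is held fixed at its least-squares value (47.3 for this data), so only the slope m varies.

```python
import numpy as np

# Hypothetical data: hours studied and marks obtained.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 61.0, 67.0, 72.0])
b = 47.3  # intercept held fixed at its least-squares value for this data

# Candidate slopes from 0 to 10 in steps of 0.1.
slopes = np.linspace(0.0, 10.0, 101)

# Sum of squared residuals for each candidate slope.
sse = np.array([np.sum((y - (m * x + b)) ** 2) for m in slopes])

# The curve is a parabola in m; its lowest point marks the best slope.
best_index = int(np.argmin(sse))
best_m = float(slopes[best_index])
print(f"best slope on the grid = {best_m:.1f}, SSE = {sse[best_index]:.2f}")
```

The error rises on both sides of the best slope, which is the bowl-shaped curve one would expect to see when the error is plotted against m.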