The core of linear regression model is the slope intercept formula of a line, shown below. This formula would of course be for a single variable linear model. If more variables were used in the model, the formula below expands for each additional variable. The image below uses the terminology of independent variable for the variable for which the value and dependent variable for the value being predict. To predict the dependent variable, coefficients must be found for each independent variable. This is where regression gets complicated!
Next all of the dependent variables are treated in a similar manner however, these produce that 5 X 5 matrix that was noted earlier. The goal here is to reduce all of the observations down to that minimum number of equations needed, so, from 10 observations down to 5 equations. (See Excel file for detailed calculations)
In the final step, the co-factors for each of the variables in the co-factor matrix are divided by the determinant creating a matrix of ratios for each variable co-factor. This new ratio is the multiplied by the original target table using matrix multiplication and the sum of that result is the coefficient for that variable. The color coding below shows how matrix multiplication works with the top left hand value in yellow being multiplied by the first value in the target matrix (Yellow x Yellow), the next cell (Orange), multiplied by the 2nd cell in the target matrix, etc. (Note, for matrix multiplication, the number of columns in the first matrix must match the number of rows in the second matrix).
You might recall the process to the right from a high school or college Algebra class. With more than one equation, it is possible to solve for the value of a variable. In fact, as long as there are at least as many equations as the number of unknown values, this process can be repeated until all unknown variables have been calculated.
As demonstrated to the right, this is done by multiplying both equations by a value that will make the coefficients of a matching variable equal. In this case the X variable was targeted. By multiplying the first equation by 5 and the second equation by 2 each equation now has the value of 10X as the first component. Simply subtracting the second formula from the first will produce an equation with only a Y variable which can now be calculated from the information that is available. Once the value for Y has been determined, it can be substituted back into either equation in order to calculate the value of X.
This is the heart of Linear Regression, except, it's rare to have such a small number of variables in a linear regression. Therefore an approach is needed to handle more variables and one already exists called Matrix Algebra...stay with me here, it's not as hard as it sounds (and the attached Excel file fully demonstrates)!
One really important concept to keep in mind is just like when using two equations to calculate two unknown variables, Linear Regression will require at least as many observations as variables in order to calculate a coefficient for each variable. In the sample data set, there are 10 observations with 5 variables so only 5 actual equations will be needed. A matrix is just another way of displaying the values of the variables. To complete the regression, as there are five variables, 5 equations will be needed, or more clearly stated all of the variability in the observations will be reduced down to a 5 X 5 matrix.
By multiplying each variable by the predicted amount, the Amount to the left is what we are trying to predict and represents the amount a customer purchased. This reduces all of the information available about the amount purchased, down to a 5 X 1 Matrix as shown to the right. (See the Excel file for detailed calculations).
Data Science me!
Linear regression is a method of predicting continuous values, including variables such as incomes, sales amounts, weights or other variables where there will be a distribution of values in a range.
A single variable linear regression model is easiest to understand. Think of filling up your car with gas. There are a number of units or fractions thereof, in gallons and a price per gallon. In a linear model there's the value of the variable and there's a coefficient value. In the gasoline example, 2 gallons of gas would be the variable and if gas is selling for $3.50 per gallon, 3.5 is the coefficient.
While this description is for a single variable regression model, linear regression can be performed using many variables. The important concept to understand is that the linear relationship of each variable needs to hold. That is as the variable value increases, the coefficient value is fixed and the predicted value should increase proportionally.
The Linear Regression demonstrated uses the sites standard data set, but in this case we'll be trying to predict how much a customer will buy and only using those customers that actual made a purchase, so, 10 observations rather than the entire 20 observation data set.
Y Intercept
Predicted Value or dependent variable (red line)
The next step is a process of multiplying the various values in the matrix by other values in the matrix to create a number that captures all of the relationships. Think of this step to being synonymous to the 5 and 2 in the Algebraic formula above. Co-factor computation for a 2 X 2 matrix is easy, 3 X 3 becomes a little more time consuming, but a 5 X 5 matrix would produce a tremendous number of calculations. Fortunately, there are sites like the following where a matrix can be entered and the cofactors produced. http://comnuan.com/cmnn01013/
Actual Value (blue dots)
First of all, are now you as thankful as I am for cheap high powered computers and software made for doing regression. This is just 4 variables, imagine doing a regression with 10s of variables!
This post has already gotten long and very complicated, so, the application of the results are in another post. To see how accurate the model is, go here:
Luckily, Excel has a Linear Regression function built in that is very easy to run. As you can see below, the Excel Linear Regression Output matches the manual calculated output to the decimal!
The Determinant is a single number that summarizes all of the values in the original Y x Y Matrix. The determinant is also an area where the math gets extensive very quickly. The Determinant for a 4 x 4 matrix is calculated in the Excel file. The yellow highlights below give you an idea of what is going on. All of the highlighted cells are multiplied to produce a number which is then added or subtracted. The number of individual calculations can be determined as it is a factor, so for a 4 X 4 matrix, the total calculations would be 4! or 4 * 3 * 2 * 1 = 24. The 5 variables in the data set being used would produce 120 individual calculations...and a lot of time to produce in this format. Luckily there is a function in Excel that will calculate this. Note that in the file the determinant calculation is demonstrated and matches Excel's calculation of the determinant. For the 5 x 5 matrix, the Excel function is used.