Data Science me!
The result is that no matter how negative a value is, the result will be greater than 0 and no matter how large the number is on the positive side the result will not be larger than 1. In effect, the Sigmoid function forces the value to be between 0 and 1 and works great to represent the concept of a binary outcome (yes/no, or in our case, purchase/no purchase). This provides a way of estimating a result between 0 and 1.
So far, this gets us from coefficient values for each variable to a probability between 0 and 1 for the likelihood of a result of 0 or 1. Notice that the result of the Sigmoid function are seldom 0 or 1. In the Excel spreadsheet attached, the customers that did not buy are ordered at the top of the list, so, the 5 customers shown above are customers that did not make a purchase and the Sigmoid Function output or Probability of a 1 result are all below .5.
Here's the Excel Solver. When you open the Excel file, you can use the Solver to do the iterating to identify the best coefficients. Here, you set the objective to cell L32, which is the cell that aggregates all of the Deviations. So, it's the total of all of the deviations of each observation. And, set the objective to minimize that value.
This formula will show a very small number when the prediction is close to real value and a very high number when the prediction is far away from the real value. Below, are two of the observations from the Excel file. If you look at the first row, by the response of 0 this customer did not make a purchase and by looking at the Probability of Y = 1 (a purchase), the model is showing a .09 or a 9% likelihood that the customer made a purchase, so in effect a prediction that the customer did not make a purchase. The Deviation row shows a very small value of .2. The second observation here shows a very high Deviation value of 3.27. Looking at the Response column, the value of 1 indicates this customer did make a purchase, but, looking at the Probability Y = 1, indicates that the model predicted that this customer did not purchase, only a 19.478% chance that they bought. The model is therefore predicting this result very poorly, resulting in a high Deviation.
So far we've discussed how to calculate the value of the dependent variable from the independent variables and the coefficients, but how are the coefficients estimated? Again, not the same was as Linear Regression where Matrix Math is used to calculate the coefficients directly. For Logistic, that approach doesn't work. Instead the model will iterate through estimates of the coefficients until it finds the most optimal combination through what is called Maximum Likelihood. This requires the introduction of another concept called Deviation, the formula of which appears below (image taken from Wikipedia).
This formula is the Sigmoid function that produces the curve shown above that forms the foundation of the logistic algorithm. FYI, e in the formula is the natural log which is 2.718281828 (which you will see used below).
If you look closely at the structure of the Excel file, the individual calculation components are identified. The logit column is the variable value times the coefficient discovered for each variable. The Odds Y column is the exponentiated value of the logit. EXP(logit). The Probability Y = 1 column is Odds Y/(1 + Odds Y). Note the results in the Probability Y = 1 column is the same as the value in the Sigmoid Function column (.09487). This is to allow for an understanding of the individual components of the Sigmoid function, the Sigmoid function value is what we want and its our estimate of our 0 or 1 result.
Next, you can set the Cells it is allowed to change. Here it's set to B2 to B6, which are the coefficient values. When the solver starts those values will be set to 0 and the solver will keep changing coefficient values until it finds an optimized result.
Event will not occur
One last thing, before running the Solver, click on the reset coefficients button. This will clear out the prior model results and allow the solver to find new results.
The image generally associated with Logistic Regression is shown to the left as an S curve and the logistic model works with the probability of the predicted event, either 0 that it will not happen or 1 that it will happen. From 0 to 1 the likelihood that the given event will occur, increases. As you will see below, the formula is going to force estimated outcomes to be between 0 and 1.
To illustrate what the formula does, the set of numbers to the right were entered into the formula. The chart above (The Sigmoid Function) was created with values between -6 and +6 result in values between 0 and 1 (without actually reaching 0 or 1, i.e. .997527 at the upper end, and .002473 at the lower end).
However, the table to the right was expanded to show the effect of two extreme values of -100 and +1,000,000. Note that the result will not go above 1, no matter how large the input is and it won't go below 0 not matter how small the number is. This in effect gives a way to force results of the model to fall on a continuum between 0 and 1. FYI, -100 was selected at the low end as Excel doesn't deal well with incredibly small numbers. The output from -100 is a negative exponent, the value of which would be: .00000000000000000000000000000000000000000000372 (a very small number, but, still above 0)
To the left, if you look at the formula in Cell N11 in the Excel Spreadsheet, the sigmoid function is applied to the first observation, giving a result of .09487. Keep in mind, this is a range from 0 to 1 so, a .09487 is just slightly above 0, so, we would assume this is a zero and the customer did not make a purchase. (Look closely at the formula, that 2.718281828 is the e value, the natural log that was discussed above.)
So, how does that help us calculate the Coefficient values? Well, it gives us a number to optimize in the aggregate. Add up all of the Deviation values across the model and if you can find a combination of coefficients that will cause that Deviation value to be the smallest, it should be the best coefficient values for the model, or the Maximum Likelihood result! Now, this would take a long time to do manually as you'd have to continue changing coefficient values until the smallest deviation number was found. Luckily Excel has a Solver tool! (FYI, you may have to enable it in your version of Excel to get the spreadsheet to work).
The first thing to understand about Logistic Regression is how different it is from Linear. If you want to review the Linear Regression model first, it may help with your understanding. While the math behind a Linear model involves matrix algebra and is complex (unless you happen to be really good at matrix algebra) there's only one possible answer. You run the variables through the math and it identifies THE coefficient for each variable. In contrast, a Logistic model must iterate through a series of estimates of coefficient values for each variable until it reaches an optimized result. There's not a individual calculation that will give one result.
Logistic Regression which is used to predict Binary or Dichotomous outcomes (Yes/No) is probably the first algorithm that most people learn about after they've learned how to do Linear Regression. But be careful leaping to the conclusion that it operates in the same manner. In the case of our example, we'll be predicting whether a customer makes a purchase or does not make a purchase, but not the amount of the purchase. So a 0 will be a customer that makes no purchase, a 1 will be a customer that makes a purchase.
Deviation
Finally there's a constraint added to make sure Excel doesn't run forever and find coefficient values that don't make any sense. It's set so the constant, something similar to the Y intercept in a linear model, can't go below a certain value.
This post only goes through the functioning of the Logistic model. Feel free to download the Excel file below and take a look at how it works for yourself. One thing that was not covered here is, how are Logistic models judged and one thing that should be obvious from the discussion, R Squared, the common value used to judge Linear models, just won't work in this context. R Squared is a way of looking at the difference between a value and the value a model predicted to capture the variance.
In the case of Logistic, the prediction is the likelihood of a given observation belonging to one of two outcomes. The concept of variance doesn't work. You could compare the percent likelihood of the result with the actual result, but that doesn't make a lot of sense. As a result, other methods have been developed for judging the effectiveness of Logistic models. Those will be covered in another post though, so, please stay tuned.
Purchase/No Purchase
Event will occur
Below is the output of the models predictions compared to the Sigmoid curve. Anything above a .5 on the left side scale would be interpreted to be a purchase prediction and everything below would be interpreted to be a non-purchase prediction.