Let's simplify a little here and leave out the logarithmic component of the formula. Without the log, the formula is really just the ratio of one outcome to the total minus the ratio of the other outcome to the total. The tree above is built recursively: all four predictor variables are run through the formula, and the most predictive one, in this case Income, is selected (see the tree). The formula is then run on each of the branches with the three remaining variables, and the process continues until all variables have been used.
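As a rough sketch of that simplified score (my own illustration in Python; the function and argument names are invented, not the spreadsheet's), the no-log version reduces to the difference between the two outcome proportions:

```python
def no_log_score(purchases: int, non_purchases: int) -> float:
    """Simplified split score without the logarithm: the ratio of one
    outcome to the total minus the ratio of the other outcome to the total."""
    total = purchases + non_purchases
    return purchases / total - non_purchases / total

print(no_log_score(6, 1))  #  0.71...  strongly predicts "purchase"
print(no_log_score(3, 3))  #  0.0      no predictive value
print(no_log_score(0, 5))  # -1.0      strongly predicts "no purchase"
```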
Splitting on high vs. low income gives the highest "information gain" of all the variables, so Income (H/L) is used first.
The algorithm then identifies, from the variables that remain, the one that provides the highest information gain on each branch. For high income that is Gender (F/M); for low income, it is Age (Y/O).
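A minimal sketch of that greedy recursion, assuming the standard ID3-style entropy and information gain and made-up row/field names:

```python
from math import log2

# rows: a list of dicts like {"Income": "High", ..., "Purchase": "Yes"}

def entropy(rows, target="Purchase"):
    """Entropy of the purchase outcome over a set of rows."""
    total = len(rows)
    score = 0.0
    for outcome in ("Yes", "No"):
        p = sum(1 for r in rows if r[target] == outcome) / total
        if p > 0:
            score -= p * log2(p)
    return score

def information_gain(rows, variable):
    """Parent entropy minus the weighted entropy of each branch."""
    remainder = 0.0
    for value in {r[variable] for r in rows}:
        subset = [r for r in rows if r[variable] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

def build_tree(rows, variables):
    """Pick the highest-gain variable, split on it, and recurse on each
    branch with the remaining variables, as described above."""
    if not variables or entropy(rows) == 0:
        return rows  # leaf: the node is pure or no variables remain
    best = max(variables, key=lambda v: information_gain(rows, v))
    remaining = [v for v in variables if v != best]
    return (best, {value: build_tree([r for r in rows if r[best] == value], remaining)
                   for value in {r[best] for r in rows}})
```

Calling `build_tree(rows, ["Income", "Gender", "Age", "Education"])` on data like the post's would reproduce the order described above whenever Income has the highest gain at the root.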
The Excel file below demonstrates how the algorithm works on the simple data set included. Feel free to break it open to see how decision trees work.
Log: The chart below includes the same information, but with the addition of the orange curve, which is the formula with the logarithmic function applied. Both ends of the curve are still the most predictive, but now all of the numbers are positive. The most important concept to keep in mind at this stage is that it isn't important whether the variable is predicting a purchase or not; what matters is which variable has the most predictive capacity. Ultimately, the formula is just a ratio of one outcome to another, with a logarithm added to equalize the two ends of the spectrum and make variable comparison easier.
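A quick way to see that curve numerically, assuming the orange curve is the standard two-class entropy (which matches the "all positive" description):

```python
from math import log2

def entropy(p):
    """Two-class entropy: the no-log ratio with the logarithm applied.
    Never negative; lowest at the two perfectly predictive ends."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p(purchase)={p:4}  entropy={entropy(p):.3f}")
# 0.0 at the pure ends, 1.0 at the 50/50 middle: the variable whose
# split drives entropy lowest carries the most predictive capacity.
```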
Decision trees are one of the first algorithms encountered by data scientists in training because they are both effective and relatively easy to understand. This is a decision tree. To interpret a decision tree, you follow a path down to the last row. In this tree, if a customer has high income, proceed down the left branch to Gender. If the customer is female, she will likely purchase, and that prediction is correct about 86% of the time, 6/(6+1).
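For readers who would rather reproduce the tree in code than in Excel, here is a hedged sketch using scikit-learn; the rows are invented, and only the four predictor names come from this post:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical customers with the four predictors from the post
data = pd.DataFrame({
    "Income":    ["High", "High", "High", "Low", "Low", "Low"],
    "Gender":    ["F", "F", "M", "F", "M", "M"],
    "Age":       ["Young", "Old", "Young", "Old", "Young", "Old"],
    "Education": ["College", "HS", "College", "HS", "College", "HS"],
    "Purchase":  ["Yes", "Yes", "No", "No", "Yes", "No"],
})

# One-hot encode the categorical predictors and fit an entropy-based tree
X = pd.get_dummies(data.drop(columns="Purchase"))
y = data["Purchase"]
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Prints the fitted splits in the same follow-the-branch form used above
print(export_text(tree, feature_names=list(X.columns)))
```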
Decision trees can be built using several different mathematical formulas. The formula used in this model relies on information gain, computed from entropy, to identify which variables to use. Our data set has four variables, Income, Gender, Age, and Education level, and the formula below is applied to each option on all four variables, with the variable producing the largest difference being selected. Of the variables available, this formula selects the one with the greatest impact on what is being predicted.
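Written out in standard notation (the usual two-class entropy and information-gain definitions, which is what the description here matches):

```latex
\mathrm{Entropy}(S) = -\,p_{\text{yes}} \log_2 p_{\text{yes}} \;-\; p_{\text{no}} \log_2 p_{\text{no}}

\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) \;-\; \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
```

Here S is the set of customers at a node, A is a candidate variable, and S_v is the subset of S where A takes the value v; the variable with the largest gain is selected.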
The predictive accuracy is shown at the bottom of the tree. In this case, out of 7 predicted purchases, 6 are correct and 1 is incorrect.
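Written out, that is:

```latex
\text{accuracy} = \frac{6}{6 + 1} = \frac{6}{7} \approx 0.857
```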
No Log: What about that logarithmic function (LOG in the formula above)? The two charts below show the effect of the logarithm. On the left, the chart shows all of the possible combinations of a variable: at the far left, if a variable has a given value, the customer will never purchase; at the far right, the customer will always purchase. When looking at this variable, the two ends of the line are the most predictive, so a variable showing either of these qualities would be highly predictive. The closer to the middle of the line, or the more equally weighted the variable is around purchase/no purchase, the less predictive value the variable provides.