May 2014 Newsletter

What type of Predictive modelling should you use in your company?

Decision Trees vs Logistic Regression Models

Data today is being generated by every device imaginable. It comes from multiple sources at an alarming velocity and volume, and with great differences in variety and veracity. It is estimated that 2.5 quintillion bytes of data are created each day, so much that 90 percent of the data in the world today was created in the last two years alone.

With all this data, structured and unstructured, many companies are struggling to get the necessary insights from their data. They are rushing to implement BI tools (rear-view monitoring) and Predictive Analytics. When building Predictive Analytics models (Churn, Upsell, Cross-Sell, Revenue and Price Optimization models), it's important to understand how accurate the model will be. Remember that Predictive Analytics is all about probabilities: the probability of a customer buying a certain product or service, or of attriting.

The more accurate your predictive model is, the better the prediction.

 

One of the questions we often receive from our clients and prospects concerns the effectiveness of decision trees (rule-based models) vs. logistic regression models in churn analytics.

The prevailing opinion appears to be that logistic regression models produce more robust and accurate findings, as decision trees have a greater tendency to overfit the data.

Rather than relying on anecdotal observations, we compiled studies from various fields that compare the two approaches.

 
Telecommunications / Mobile phones
Owczarczuk (2010) tested the usefulness of decision trees vs. logistic models for predicting churn among the clients of a Polish cellular telecommunication company. The study deals with prepaid clients, who are far more likely to churn, are less stable, and about whom much less is known, using 1,381 potential variables derived from the clients' usage. The models were tested for stability across time for all percentiles of the lift curve, with the test sample collected six months after the model was estimated. The main finding is that linear models, especially logistic regression, are a very good choice for modeling churn of prepaid clients. Decision trees are unstable in high percentiles of the lift curve, and their use is not recommended.
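
For readers unfamiliar with the lift curve mentioned above, here is a minimal sketch of how lift at a given percentile can be computed. It is an illustration only, not the method used in the study: the function name lift_at_percentile and the sample scores and outcomes are hypothetical. Lift is taken here as the churn rate among the highest-scored customers divided by the churn rate of the whole sample.

```python
from typing import Sequence


def lift_at_percentile(
    scores: Sequence[float], churned: Sequence[int], top_fraction: float
) -> float:
    """Churn rate in the top-scored fraction divided by the overall churn rate."""
    if len(scores) != len(churned) or not scores:
        raise ValueError("scores and churned must be non-empty and equally long")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")

    overall_rate = sum(churned) / len(churned)
    if overall_rate == 0:
        raise ValueError("the sample contains no churners")

    # Rank customers from highest to lowest predicted churn score.
    ranked = sorted(zip(scores, churned), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * top_fraction))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    return top_rate / overall_rate


# Hypothetical model scores and observed outcomes (1 = churned, 0 = stayed).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
churned = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

lift_top_20 = lift_at_percentile(scores, churned, 0.2)
lift_all = lift_at_percentile(scores, churned, 1.0)
print(f"Lift in top 20%: {lift_top_20:.2f}")
print(f"Lift over the whole sample: {lift_all:.2f}")
```

A simple stability check in the same spirit is to compute these lift values once on the estimation sample and again on a sample collected later, then compare them percentile by percentile.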

Health Care
In another study, Long, Griffith, Selker, and D'Agostino (1993) compared the performance of logistic regression with decision-tree induction in classifying patients as having acute cardiac ischemia. The comparison used a database of 5,773 patients. Both the ability to classify cases and the ability to estimate the probability of ischemia were compared on the default tree generated by simulation, on a tree optimized on the learning set by increased pruning of overspecified branches, and on a tree incorporating clinical considerations. The logistic regression models were superior to the original decision tree models. The improved decision tree models came closer in performance, but logistic regression still performed better than any of the decision tree models.

Financial Institutions / Banking
Nie, Rowe, Zhang, Tian, and Shi (2011) studied the performance of logistic regression models and decision trees in a churn prediction model using credit card data collected from a real Chinese bank. The contribution of four variable categories (customer information, card information, risk information, and transaction activity information) is examined. The paper analyzes a process for handling variables when data is obtained from a database instead of a survey. Rather than entering all 135 variables into the model directly, it selects variables on the basis of both correlation and economic sense. In addition to the accuracy of the analytic results, the paper designs a misclassification cost measurement that takes the two types of error and the economic sense into account, which is better suited to evaluating a credit card churn prediction model. The authors conclude that logistic regression models perform better than decision tree models.
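
To illustrate why a misclassification cost can rank models differently from plain accuracy, here is a minimal sketch. It is not the cost measurement designed in the paper, whose formula is not reproduced in this newsletter. The function names, the unit costs (10 for a missed churner, 1 for a false alarm), and the two sets of predictions are all hypothetical assumptions.

```python
from typing import Sequence


def misclassification_cost(
    actual: Sequence[int],
    predicted: Sequence[int],
    cost_false_negative: float,
    cost_false_positive: float,
) -> float:
    """Total cost of the two error types (1 = churner, 0 = non-churner)."""
    # False negative: a real churner the model failed to flag.
    false_negatives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    # False positive: a loyal customer the model flagged as a churner.
    false_positives = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return false_negatives * cost_false_negative + false_positives * cost_false_positive


def accuracy(actual: Sequence[int], predicted: Sequence[int]) -> float:
    """Share of customers classified correctly."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)


# Hypothetical outcomes and predictions from two hypothetical models.
actual = [1, 1, 0, 0, 1, 0]
model_a = [1, 0, 0, 0, 1, 0]  # misses one churner
model_b = [1, 1, 1, 1, 1, 0]  # raises two false alarms

# Assumed unit costs: losing a customer costs 10, a needless retention offer costs 1.
cost_a = misclassification_cost(actual, model_a, 10.0, 1.0)
cost_b = misclassification_cost(actual, model_b, 10.0, 1.0)

print(f"Model A: accuracy {accuracy(actual, model_a):.2f}, cost {cost_a}")
print(f"Model B: accuracy {accuracy(actual, model_b):.2f}, cost {cost_b}")
```

Under these assumed costs, the model with the higher accuracy is also the more expensive one, which is the kind of situation a cost-based measure is meant to expose.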

In all three of these studies, logistic regression models performed better than decision tree models.

With logistic regression models, one can increase the accuracy of the predictive model without relying on untested assumptions. Variables are validated empirically every time the data is refreshed and the models are run.

This is why our predictive models are so accurate and why each of our projects provides a significant return on investment.



 

References: 
  • Long, W.J., Griffith, J.L., Selker, H.P., D'Agostino, R.B. 1993. A comparison of logistic regression to decision-tree induction in a medical domain, Computers and Biomedical Research, 26, 74-97. 
  • Nie, G., Rowe, W., Zhang, L., Tian, Y., Shi, Y. 2011. Credit card churn forecasting by logistic regression and decision tree, Expert Systems with Applications, 38, 15273-15285. 
  • Owczarczuk, M. 2010. Churn models for prepaid customers in the cellular telecommunication industry using large data marts, Expert Systems with Applications, 37, 4710-4712. 

 


To learn more, please visit us at www.tdtanalytics.com 
or contact
+1 (416) 900-0360 Ext 10  

