The Case for Linear Regression vs. Logistic Regression
What is a logistic regression?
A logistic regression is a way to predict the probability of something happening. It answers questions such as how likely a customer is to cancel an account or to use a coupon. This kind of analysis is very common in academia, but after 10 years of doing analyses at hundreds of companies, in dozens of industries, I have never found a case where it made sense for business operations to use the logistic model directly. In almost all cases, the linear model is better than the logistic model.
What is a linear regression?
A linear regression has a dependent variable (or outcome) that is continuous. In other words, the dependent variable can take any one of an infinite number of possible values, such as dollars spent or days until cancellation. Logistic regression, by contrast, has a dependent variable with only a limited number of possible values, often just two, such as canceled or did not cancel.
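To make the distinction concrete, here is a minimal Python sketch of what each kind of model returns. The coefficients and the input value are made up for illustration and do not come from any real analysis; the function names are hypothetical.

```python
import math

def linear_prediction(intercept, slope, x):
    """Linear model: the output is an unbounded continuous number."""
    return intercept + slope * x

def logistic_prediction(intercept, slope, x):
    """Logistic model: the same kind of linear score, squeezed into (0, 1)."""
    score = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients, chosen only for illustration.
spend = linear_prediction(intercept=20.0, slope=1.5, x=10)        # a dollar amount
p_coupon = logistic_prediction(intercept=-2.0, slope=0.3, x=10)   # a probability

print(f"predicted spend: {spend:.2f}")
print(f"predicted probability of coupon use: {p_coupon:.3f}")
```

The first number can be acted on directly as an amount. The second is always between 0 and 1 and only says how likely the event is.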
Why you shouldn’t use logistic regression.
Let’s take the coupon example to get to the first reason you should never use logistic regression. Do you just want to know whether the customer will use the coupon, or do you actually want to know how much more the customer will spend if they use it? In the churn example, it may be somewhat useful to know that a customer might cancel an account, but if you don’t know when the customer will cancel, you can’t really do much about it. For example, if two customers both have a 60% probability of churn, but one is expected to cancel in the next day and the other in 30 days, would you not want to focus your attention on the customer who is about to leave immediately?
That’s the primary reason you shouldn’t use logistic regression. It is also why I urge customers to always predict a number that directly affects how they will act: not information for the sake of information, but information that leads to ROI.
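The two-customer churn example can be sketched in a few lines of Python. The customer records and field names below are hypothetical; the 60% probability and the 1-day and 30-day horizons come from the example above.

```python
# Two hypothetical customers with the same churn probability
# but very different expected times until they cancel.
customers = [
    {"name": "A", "churn_probability": 0.60, "expected_days_to_churn": 30},
    {"name": "B", "churn_probability": 0.60, "expected_days_to_churn": 1},
]

# Ranking by probability alone cannot separate the two customers:
# the sort is a tie, so they simply stay in their original order.
by_probability = sorted(customers, key=lambda c: -c["churn_probability"])

# Ranking by expected days to churn puts the urgent case first.
by_urgency = sorted(customers, key=lambda c: c["expected_days_to_churn"])

print([c["name"] for c in by_probability])
print([c["name"] for c in by_urgency])
```

Only the second ranking tells a retention team whom to call today, which is the sense in which a predicted number is easier to act on than a predicted probability.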
More reasons you shouldn’t use logistic regression.
Still not convinced? Here are three additional reasons you should never use logistic regression. Let’s take an example where you are trying to predict whether a customer will cancel an account after a customer support problem. How do you define whether churn happened? Say we are looking at a customer support interaction and analyzing whether the person canceled the account soon afterward. In a logistic regression analysis, we would come up with some magical cutoff point, say, 30 days. Anyone who canceled within 30 days would be considered a case of churn related to that customer complaint, while a cancellation after 30 days wouldn’t be considered churn.
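One way to see what the cutoff does to the data is to label a few customers by hand. This is a minimal sketch: the customers and their day counts are invented for illustration, and treating "within 30 days" as "30 days or fewer" is my assumption.

```python
CUTOFF_DAYS = 30  # the arbitrary cutoff from the example above

# Hypothetical days between the support interaction and cancellation.
# None means the customer has not canceled.
days_to_cancel = {"A": 29, "B": 31, "C": 3, "D": None}

def churn_label(days, cutoff=CUTOFF_DAYS):
    """Binary label a logistic model would be trained on (1 = churn)."""
    return int(days is not None and days <= cutoff)

labels = {name: churn_label(days) for name, days in days_to_cancel.items()}
print(labels)
```

Customers A and B canceled two days apart yet receive opposite labels, while B, who did cancel, is labeled the same as D, who never canceled. The raw number of days keeps the information the label throws away.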