(or “Why Knowing the Business Beats Knowing Regression”)
Last week, we used a linear regression model to accurately convert temperatures from Celsius to Fahrenheit.
But what happens when you have to estimate something a bit more complex?

“This house is 2338 sq. ft…so, how much should it cost?”
Obviously, we’ll need to account for more than one factor to model this. Luckily, we can do exactly that, using Multiple Regression!
What is Multiple Regression?
Multiple linear regression is similar to simple linear regression. The main difference is that multiple regression takes more than one variable into account when making predictions.
By adding more factors to the regression model, we can often improve its ability to explain our variable of interest.
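To make that concrete, here is a quick sketch using R's built-in mtcars dataset (not our accident data): moving from one predictor to two is just a matter of adding a term to the formula.

```r
# Simple regression: predict fuel economy (mpg) from weight alone
simple_fit <- lm(mpg ~ wt, data = mtcars)

# Multiple regression: add horsepower as a second predictor
multi_fit <- lm(mpg ~ wt + hp, data = mtcars)

# The extra predictor improves the model's explanatory power
summary(simple_fit)$r.squared  # ~0.75
summary(multi_fit)$r.squared   # ~0.83
```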
What Causes Accidents? (Revisited)
Last week, we modeled how the number of streetlights on a road reduced car accidents. But our model only explained SOME of what we observed (R² = 0.27). Clearly, other factors affect the number of accidents on our roads.
For example:
- It probably makes a big difference how many cars travel that stretch of road, so we should add traffic volume to our model
- Accidents are also more likely to occur on winding roads, so the number of bends may help explain our totals
- and so on, until we have our full model.
Which Variables Matter?
Multiple regression is so powerful because it provides a full accounting of how much each variable affects our outcome.
When we call:
summary(
  lm(data = Auto_Accidents, formula = Accidents ~ Streetlights_per_Mile + Traffic + Bends_per_Mile + Shoulder_Width)
)
we are given the coefficient estimate for each variable, along with a test of whether there is strong evidence that the coefficient differs from zero (i.e., whether the variable is statistically significant).
##
## Call:
## lm(formula = Accidents ~ Streetlights_per_Mile + Traffic + Bends_per_Mile +
## Shoulder_Width, data = Auto_Accidents)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9543 -1.4113 -0.0527 1.3809 7.7558
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.652e+01 2.722e-01 60.694 <2e-16 ***
## Streetlights_per_Mile -1.819e-01 1.277e-02 -14.242 <2e-16 ***
## Traffic -4.185e-04 1.679e-05 -24.923 <2e-16 ***
## Bends_per_Mile -2.278e-02 3.822e-02 -0.596 0.5512
## Shoulder_Width -1.175e-01 5.332e-02 -2.204 0.0277 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.077 on 995 degrees of freedom
## Multiple R-squared: 0.5661, Adjusted R-squared: 0.5644
## F-statistic: 324.6 on 4 and 995 DF, p-value: < 2.2e-16
Each variable is assigned a p-value, which assesses how likely we would be to see an effect this large if the variable’s true coefficient were actually zero (i.e., if it had no effect on the outcome).
In the case of our data, we find that the number of streetlights, the overall traffic level, and shoulder width all serve as potentially valid predictors of how dangerous a stretch of road will be (the number of bends, by contrast, is not statistically significant).
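Rather than reading significance off the printed summary, you can also pull the coefficient table out programmatically and filter on the p-value column. A sketch, again using the built-in mtcars data in place of our accident data:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# coef(summary(...)) returns the coefficient table as a matrix
coefs <- coef(summary(fit))

# Keep only the terms whose p-value clears the usual 0.05 threshold
significant <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
significant
```

This is handy when you have many candidate predictors and want to scan for the ones with strong evidence behind them.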
Wait, What About the Streetlights?
If you’re paying close attention, you’ve noticed that the coefficient for Streetlights_per_Mile shrank in magnitude from 0.302 to 0.182 when we moved from a simple regression to multiple regression. Why is that?
Essentially, our newly added variables explained some of the variation in crash rates that had previously been attributed to streetlight density alone. Once these new factors are accounted for, we end up with a better estimator: our R² jumped from 0.27 (simple linear regression) to 0.56 (multiple regression).
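You can watch this same shrinkage happen in miniature with the built-in mtcars data: the weight coefficient soaks up some of horsepower's effect until horsepower is added to the model.

```r
# Weight alone picks up some of horsepower's effect on fuel economy...
coef(lm(mpg ~ wt, data = mtcars))["wt"]       # about -5.3

# ...but its coefficient shrinks in magnitude once horsepower enters the model
coef(lm(mpg ~ wt + hp, data = mtcars))["wt"]  # about -3.9
```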
Multiple Regression in Daily Life
One practical lesson you can take away from multiple regression is:
Know the factors that drive your business.
Just because you are lucky enough to have a dataset you can run a regression on doesn’t mean you should. Or, at least, you should consider what is MISSING from that data before plowing ahead.
The Case of the Missing Variable
Early in my career, I was part of a project to determine why a major retailer was experiencing a surge in credit card chargebacks.
We had plenty of data on these transactions, so we sliced and diced (and even pulverized) the data for weeks, but nothing popped out.
The answer finally came once we visited stores.
While observing an associate ringing someone up, we witnessed the application get “stuck” after the card was swiped. The associate then re-swiped the card, believing the transaction hadn’t gone through. Aha!
We now had our explanation for why customers were disputing charges: they were inadvertently being double-charged.
The omitted variable: Wi-Fi
The ultimate culprit in this story was bad Wi-Fi connectivity within certain stores. If the Wi-Fi cut out during a transaction, the transaction would usually be retried, and both transactions would post once the Wi-Fi connection was re-established.
We would then end up with someone being charged TWICE. No wonder they charged it back!

“You forgot the ‘FREE’ part…”
If our initial dataset had included a “WIFI_STRENGTH” field, it would have been very easy to identify the source of our troubles. But recognizing that this variable was necessary was actually the biggest hurdle in the analysis.
Summing Up
Multiple regression is an incredibly powerful tool for making decisions. Done correctly, it can distill huge volumes of information into a format that can both highlight critical variables AND predict the impact of changes.
But, this relies on us selecting the correct variables to model. Much like correlation, there is no guarantee that our model is correct, even when we get what appear to be highly significant results.
Beware of recommendations from analysts who are completely walled off from the business. They may know regression techniques very well, yet remain oblivious to the critical variables that are MISSING from the dataset. And that mistake can completely invalidate your results.
Next week, we will discuss another type of regression technique specifically for determining how things change over time, using time series analysis.
===========================
R code used to generate plots:
- Multiple Regression Model
library(data.table)
set.seed(060124)

# Simulate 1,000 road segments with a Poisson-distributed accident count
Auto_Accidents <- rpois(1000, 10) |> as.data.table()
colnames(Auto_Accidents) <- "Accidents"

# Construct predictors that correlate with the accident count (plus noise)
Auto_Accidents[, Streetlights_per_Mile := abs(20 - Accidents + rnorm(1000, 0, 5))]
Auto_Accidents[, Traffic := abs(1000 * (20 - Accidents + rnorm(1000, 0, 3)))]
Auto_Accidents[, Bends_per_Mile := rpois(1000, 3)]
Auto_Accidents[, Shoulder_Width := rpois(1000, 5 / Accidents) + rnorm(1000, 3, 1)]

summary(
  lm(data = Auto_Accidents, formula = Accidents ~ Streetlights_per_Mile + Traffic + Bends_per_Mile + Shoulder_Width)
)