
(or “Why NULLs Matter”)
Some of the most difficult decisions are the ones where our picture has missing pieces.

Because missing data can have an outsized effect on your results, it is critical to know what to do when you see holes in your data. Otherwise, you risk making bad decisions.
When Data Goes Missing
Data can be missing for many reasons.
The data might have been corrupted, or a respondent may have refused to answer. Maybe a response was deemed invalid and dropped as “bad” data. It could even be that someone spilled coffee on that page of notes, rendering them unreadable.
Often, it’s useful to boil all of these possibilities down to two basic flavors:
- Missing randomly
- Missing systematically
This distinction is important because it tells you how to deal with your missing data.
Missing Randomly
The coffee spill splotching our results would likely be an example of data “missing randomly”.
Here, a random set of observations gets removed (wherever the coffee fell) - there is nothing driving the loss of these observations other than pure random chance.
This case is easy to deal with: because there is no underlying pattern to what went missing, the data that remains acts much like a random sample. That means our estimates should still converge to the true answer, eventually. The missing data poses no real problem other than leaving us a smaller sample size (and therefore noisier estimates).
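A quick simulation (with made-up numbers) illustrates the point: deleting observations purely at random shrinks the sample, but leaves the average essentially untouched.

```r
set.seed(42)

# A made-up "population" of 100,000 measurements
population <- rnorm(1e5, mean = 100, sd = 15)

# Spill coffee on ~30% of them, chosen completely at random
observed <- population[runif(1e5) > 0.30]

mean(population)  # close to 100
mean(observed)    # also close to 100 -- no bias, just fewer rows
```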
Unfortunately, most of the time, data IS missing for a reason.
Missing Systematically
If, instead, our missing data tends to have something in common, it is missing systematically. An example would be someone refusing to answer a survey question about their income because they don’t want to divulge how rich they are.
In this case, the loss of data is limited to certain situations, and it will bias the set of observations that remain in our dataset.
As you might guess, making inferences on this type of data can quickly become problematic.
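To see the contrast, here is the same kind of simulation (again, with made-up numbers), except now high values are far more likely to go missing:

```r
set.seed(42)

# Made-up incomes: log-normal, so there is a long right tail
income <- rlnorm(1e5, meanlog = 10.5, sdlog = 0.8)

# Suppose the top 10% of earners usually refuse to answer
p_answer <- ifelse(income > quantile(income, 0.90), 0.20, 0.95)
answered <- income[runif(1e5) < p_answer]

mean(income)    # the true average income
mean(answered)  # noticeably lower -- the remaining sample is biased
```

Unlike the coffee-spill case, no amount of extra data from the same biased process will fix this; the estimate converges to the wrong number.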
Missing Data in Daily Life
One practical lesson that you can take away is:
Always correct missing data FIRST
People often see NULL values in their data and just keep going. Why spend a bunch of time worrying about a small number of missing observations?
Unfortunately, these missing values can make a big difference in your outcome.
How Many Customers Do You See?
Let’s say you want to get the average revenue of a set of customers:
| CUSTOMER | DATE | DOLLAR_AMOUNT |
|:---:|:---:|:---:|
| 1 | 2025-07-31 | 48.65 |
| 2 | 2025-08-02 | NA |
| 3 | 2025-08-01 | 97.98 |
| 4 | 2025-08-03 | 300.35 |
| 5 | 2025-07-31 | 384.46 |
| 6 | 2025-08-02 | NA |
| 7 | 2025-07-31 | NA |
Here, only 4 of the 7 customers have a recorded transaction amount.
So, should you divide by 4, or by 7? Choose wisely - your decision will have a big effect on the final result.
Average revenue/customer (4 customers) = $207.86
Average revenue/customer (7 customers) = $118.78
If you’re struggling to decide which is right, you’re not alone. Statistics has no magical tool to tell us which is correct; it all comes down to the individual problem.
Our observations could be missing because those customers didn’t purchase anything. OR, maybe a database issue caused all transaction amounts under $20 to be blanked out. Until we know the cause of the missing data, we can’t know how to calculate our result.
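In R, the two options are one line each (the dollar amounts below are copied from the table above):

```r
# Revenue column from the table; NA marks the customers with no recorded amount
rev <- c(48.65, NA, 97.98, 300.35, 384.46, NA, NA)

# Option 1: drop the NAs and divide by the 4 observed customers
round(mean(rev, na.rm = TRUE), 2)           # 207.86

# Option 2: treat NA as "no purchase" -- replace with 0, divide by all 7
round(mean(ifelse(is.na(rev), 0, rev)), 2)  # 118.78
```

Neither line is “the right answer” on its own; the correct denominator depends entirely on what those NAs mean.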
The Truth about Working with Data
It may sound surprising, but USING data is often quite simple. Generally, our results just boil down to simple math. (Everyone reading this post will know how to calculate an average, for instance).
What often gets complicated is correctly setting up the math problem.
In the example above, calculating the average is easy, once we know which rows to include. But answering THAT question isn’t so simple - you have to know how your data is generated, and what missing data indicates.
This type of problem is often why data scientists spend most of their time cleaning data before they can actually use it, and why missing data should be one of the first things you look at when doing an analysis.
Summing Up
There is a very clear bias in most data analyses - we focus on the things we have data for. But what about the things we didn’t measure?
Always consider what data you DON’T have. An incomplete dataset is often just a form of non-representative sample - and since bad samples are a great way to get bogus results, we should always be extremely careful when drawing conclusions from them.
The issue is particularly vexing because it hides in plain sight. You have to go seek out what is missing, maybe based on nothing more than a hunch. But if you fail to account for what’s missing, you may come to a decision that is badly wrong.
Often it’s what you DON’T have that matters.
Up Next: Making the Right Choice
Next week, we’ll wrap up Summer of Stats 2.0 covering why ethics is baked into every data project. Read on to learn how to be a Force for Good!
===========================
R code used to generate the table above:
- A Simple Average
library(knitr)
library(kableExtra)
library(formattable)
library(lubridate)
library(data.table)

set.seed(080625)
today_date <- Sys.Date()

### Create synthetic data with randomly missing values
avg_rev <- data.table(
  "CUSTOMER" = seq(1, 7),
  "DATE" = today_date - runif(7, 0, 5),
  "DOLLAR_AMOUNT" = runif(7, 0, 550) |> round(2)
)

# Blank out any amount containing the digit 2 (a stand-in for random deletion)
avg_rev[, DOLLAR_AMOUNT := ifelse(DOLLAR_AMOUNT %like% "2", NA, DOLLAR_AMOUNT)]

### Display the data as a styled HTML table
options(kableExtra.html.bsTable = TRUE)
kable(avg_rev, align = "c") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"),
                            full_width = FALSE)