
(or “Why Surveys Lie”)
When we’re faced with a tough question, sampling is one of our greatest super-powers.

The “S” is actually for “Sampling”…
We’ve talked at length about why - the random sample forms the basis of nearly all statistics. But what happens when our sample ISN’T random?
Things become much trickier, and we encounter another popular method for misleading people.
How to Lie (Recap)
As a reminder, there are 3 common ways to lie with statistics:
- Ignoring variability
- Using a non-representative sample ← YOU ARE HERE
- Choosing a bad estimator
Let’s talk today about what happens when people abuse the assumption of random sampling.
Sampling in a Perfect World
Last summer, I detailed an example of estimating the height of a population using a random sample of 100 people.
This sample returned an estimate that quickly converged on the actual population mean:

The beauty of a random sample is clear - the more data we see, the better our estimate of the population tends to become.
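A quick base-R sketch of the same idea, using simulated heights (the numbers mirror the simulation in the appendix):

```r
# With a truly random sample, bigger samples give better estimates
set.seed(42)
population <- rnorm(1e5, mean = 178.4, sd = 7.59)  # simulated heights (cm)

draw_mean <- function(n) mean(sample(population, n))

draw_mean(10)     # a noisy estimate
draw_mean(10000)  # lands very close to mean(population)
```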
Sampling in an Imperfect World
By contrast, when we do a poor job of sampling, there is no guarantee that our answer will be any good.
Let’s see what happens when we collect height data from a non-representative sample - here, we sampled ONLY basketball players (who tend to be much taller than average):

Our sample doesn’t represent the population particularly well, and so our statistic never arrives at the true answer, no matter how much data we collect.
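The same sketch, restricted to the tallest 10% (our stand-in for basketball players), shows the gap that extra data can’t close:

```r
# A biased sample converges - but to the wrong answer
set.seed(42)
population <- rnorm(1e5, mean = 178.4, sd = 7.59)  # simulated heights (cm)
tall_only  <- population[population > quantile(population, 0.90)]

# Even a huge biased sample stays far above the true mean
mean(sample(tall_only, 5000, replace = TRUE)) - mean(population)  # roughly +13 cm
```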
Sampling in the Real World
Now for the bad news: in the real world, getting a truly random sample is generally difficult or impossible.
It might be as simple as collecting your data in the wrong place (a basketball tournament), at the wrong time (during the day, when many adults are at work), or with the wrong approach (an internet poll, which may miss older people).
These decisions seem incredibly minor at first glance, but could be enough to invalidate our assumption of random sampling.
Bad Samples in Daily Life
One practical lesson that you can take away is:
Always ask how the data was collected
Many times, we are forced to work with a “pretty random” sample, with some imperfections.
But, as we’ll see, people can take advantage of this wiggle room to impose their preferred answer on the data.
Some Really Great Rocks
Many people base their purchase decisions on user reviews for products (I certainly do). The catch is that online reviews are not a perfectly random sample of all purchases.
For example, Amazon has significant issues with fake reviews - by offering incentives for users to leave a glowing review of a given product, sellers can artificially inflate (i.e., lie about) how good their product really is.
Since our review scores no longer come from a random sample, there is no guarantee the average review score represents what customers really think.
What we end up with is an endless sea of products with 4.5 stars - many of which may or may not be any good.
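Here’s a toy illustration of how that works - the review counts and star distribution below are entirely made up:

```r
# 1,000 honest ratings plus 300 incentivized 5-star "reviews"
# pull the displayed average well above what real customers think
set.seed(42)
honest <- sample(1:5, 1000, replace = TRUE,
                 prob = c(0.10, 0.15, 0.20, 0.25, 0.30))
fake   <- rep(5, 300)

mean(honest)           # what buyers actually experienced
mean(c(honest, fake))  # what the product page displays
```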

One of my hobbies: reviewing rocks on Amazon!
Why is Everyone Getting Smarter?
A more interesting example of the impact of a bad sample popped up in a recent conversation with a family member.
The talk revolved around college admissions, and I was shocked to hear how high the average SAT score had become at our local state college (OSU). I did some digging, and confirmed the surprising rise in scores:
| Year | Avg SAT Score |
|---|---|
| 2015 | 1280 |
| 2024 | 1400 |
However, things aren’t always as they seem.
Something else happened during this time - in 2020, OSU dropped the requirement to submit a standardized test score. In practice, that means these “average” SAT scores no longer represent ALL incoming students.
Think about it: if your score were BELOW average, would you choose to submit it?
This creates a dynamic where lower-scoring students opt to not include their score (as it may hurt their case for admission). As a result, the lower scores magically disappear from the distribution, and we see the average score drift upwards.
So, someone could plausibly say: “Dropping test score requirements actually improved the quality of students!”, and have data to back it up (because average SAT scores rose).
However, all of it would be complete bunk - that statistic is based on a non-random sample containing clear bias.
(Post-script: all of this will change again in 2026, when OSU reinstates its test score requirement)
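The selection effect is easy to simulate. The numbers below are made up, but the mechanism is the real one:

```r
# If only students scoring above the class average submit a score,
# the reported average drifts upward - even though the class itself
# hasn't changed at all
set.seed(42)
all_scores <- rnorm(5000, mean = 1280, sd = 120)
submitted  <- all_scores[all_scores > mean(all_scores)]

mean(all_scores)  # true average of ALL incoming students (~1280)
mean(submitted)   # the inflated average that gets reported (~1375)
```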
Summing Up
Samples are incredibly powerful tools for many problems. They are also vulnerable to manipulation.
It pays to be very careful any time you rely on a sample of data. Why? Because if someone can control which data are captured, they can also control the result.
Bad samples come in all shapes and sizes. Even if your dataset is HUGE, that is no guarantee of accuracy. As our basketball example showed, when a sample isn’t representative, the answer never gets better, no matter how much data we collect.
As the saying goes: “Garbage in, Garbage out”.
Up Next: Bad Statistics
Next week, we’ll wrap up the series on lying by exploring how people can bend the truth by using the wrong estimator.
See how changing your statistic can completely change the story your data tells.
===========================
R code used to generate plots:
- Animated Plots
library(data.table)
library(ggplot2)
library(gganimate)
set.seed(060124)
### Generate 10000 sample height records, using mean = 178.4, std = 7.59
simulated_pop <- cbind(1:10000, rnorm(10000, 178.4, 7.59)) |> as.data.table()
simulated_mean <- simulated_pop[,mean(V2)] |> as.data.table()
### Pull a random sample
good_sample <- simulated_pop[sample(.N,100)]
# Add calculations for cumulative mean
good_sample[, cum_V2 := cumsum(V2)]              # running total of heights
good_sample[, Row_Index := .I]                   # sample size so far
good_sample[, cum_avg_V2 := cum_V2 / Row_Index]  # running average
### Pull a non-random sample from a basketball league
# (assume these people all are in top 10% for height...)
bad_sample <- simulated_pop[V2 > quantile(simulated_pop$V2, probs = .90)][sample(.N,100)]
# Add calculations for cumulative mean
bad_sample[, cum_V2 := cumsum(V2)]              # running total of heights
bad_sample[, Row_Index := .I]                   # sample size so far
bad_sample[, cum_avg_V2 := cum_V2 / Row_Index]  # running average
### Plot good sample
good_sample_convergence <- ggplot(good_sample, aes(x=V2)) +
geom_histogram(data = simulated_pop, aes(x=V2), fill="grey", position = "identity") +
theme_minimal() +
theme(axis.title.y=element_blank(),
axis.title.x=element_blank(),
axis.ticks.x=element_blank(),
plot.title = element_text(size = 20, face = "bold")) +
ggtitle("Estimate of Average Height (Random Sample)") +
geom_vline(xintercept = simulated_pop[, mean(V2)], lwd = 1, col = "darkred") +
geom_label(data = simulated_mean, aes(x = V1, y = 1000, label = paste0(round(V1, digits = 2), " cm") ), size = 5, fill = "lightyellow") +
geom_point(data = good_sample, mapping = aes(y = 810, x = cum_avg_V2),
size = 20, color = 'red', shape = '|', alpha = .4) +
geom_label(data = good_sample, aes(x = cum_avg_V2, y = 800, label = round(cum_avg_V2, digits = 1)), size = 5, fill = "lightyellow") +
geom_label(data = good_sample, aes(x = 160, y = 800, label = paste("Sample Size: ", Row_Index)), size = 8, fill = "white") +
geom_text(data = simulated_mean, aes(x = 200, y = -25, label = "Summer of Stats"), col="grey80", size = 4) +
transition_reveal(Row_Index, keep_last = FALSE)
animate(good_sample_convergence, duration = 8, end_pause = 30)
### Plot bad sample
bad_sample_convergence <- ggplot(bad_sample, aes(x=V2)) +
geom_histogram(data = simulated_pop, aes(x=V2), fill="grey", position = "identity") +
theme_minimal() +
theme(axis.title.y=element_blank(),
axis.title.x=element_blank(),
axis.ticks.x=element_blank(),
plot.title = element_text(size = 20, face = "bold")) +
ggtitle("Estimate of Average Height (Basketball Sample)") +
geom_vline(xintercept = simulated_pop[, mean(V2)], lwd = 1, col = "darkred") +
geom_label(data = simulated_mean, aes(x = V1, y = 1000, label = paste0(round(V1, digits = 2), " cm") ), size = 5, fill = "lightyellow") +
geom_point(data = bad_sample, mapping = aes(y = 810, x = cum_avg_V2),
size = 20, color = 'red', shape = '|', alpha = .4) +
geom_label(data = bad_sample, aes(x = cum_avg_V2, y = 800, label = round(cum_avg_V2, digits = 1)), size = 5, fill = "lightyellow") +
geom_label(data = bad_sample, aes(x = 160, y = 800, label = paste("Sample Size: ", Row_Index)), size = 8, fill = "white") +
geom_text(data = simulated_mean, aes(x = 200, y = -25, label = "Summer of Stats"), col="grey80", size = 4) +
transition_reveal(Row_Index, keep_last = FALSE)
animate(bad_sample_convergence, duration = 8, end_pause = 30)