The Taxi Cab Problem: Why 80% Reliable Witnesses Are Usually Wrong

The Question

A cab was involved in a hit-and-run accident at night. Two cab companies operate in the city: the Green company and the Blue company. You are given the following facts:

85% of the cabs in the city are Green, and 15% are Blue.
A witness identified the hit-and-run cab as Blue.
The court tested the witness under the same conditions that existed on the night of the accident and found that the witness correctly identifies each color 80% of the time and fails 20% of the time.

Given this information, what is the exact probability that the cab involved in the accident was actually Blue?

This problem was formulated by Amos Tversky and Daniel Kahneman — the architects of behavioral economics — as a demonstration of one of the most durable cognitive failures in human reasoning: the Base Rate Fallacy. It appears in quant interviews at Goldman Sachs, Morgan Stanley, and Citadel. It appears in law school evidence courses. And it describes a class of reasoning error that leads to wrongful convictions, failed corporate audits, and flawed risk assessments every single day.

The answer is not 80%. The answer is approximately 41.4%. The cab was more likely Green — even with an 80% accurate witness swearing under oath that it was Blue.

The Intuition Trap: The Base Rate Fallacy

Most people — including trained attorneys, judges, and expert witnesses — immediately answer 80%. The reasoning is intuitive: the witness is 80% accurate, the witness says it was Blue, therefore there is an 80% chance the cab was Blue. This anchors entirely on the witness's stated reliability and ignores everything else.

What it ignores is the prior — the underlying distribution of cabs in the city. Green cabs are overwhelmingly more common: 85 out of every 100 cabs on the road are Green. This base rate creates an asymmetric arithmetic that most human intuition is completely blind to. Consider what actually happens across 10,000 accidents involving a random cab:

10,000 accidents, applying the base rates and witness error rate:

Of 10,000 accidents:
├─ 8,500 involve a Green cab (85% base rate)
│  ├─ 6,800 witness correctly says "Green" (80% accuracy)
│  └─ 1,700 witness incorrectly says "Blue" (20% error rate)
│
└─ 1,500 involve a Blue cab (15% base rate)
   ├─ 1,200 witness correctly says "Blue" (80% accuracy)
   └─ 300 witness incorrectly says "Green" (20% error rate)
──────────────────────────────────────────────────────────────
Times the witness says "Blue":
  Correct Blue identifications: 1,200 (cab was actually Blue)
  False Blue identifications:   1,700 (cab was actually Green)
  Total "Blue" claims:          2,900

P(actually Blue | witness says Blue) = 1,200 / 2,900 ≈ 41.4%

The arithmetic is unambiguous. Of the 2,900 times a witness makes a "Blue" identification under these conditions, only 1,200 of those identifications are correct. The other 1,700 are Green cabs that the witness mistook for Blue. Because Green cabs are so prevalent, the sheer volume of false Blue calls swamps the correct ones, even at 80% accuracy. The witness is still right 80% of the time overall, but a "Blue" call specifically is right just 41.4% of the time, so the cab is more likely Green (58.6%) than Blue.
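The tally above can be reproduced in a few lines of Python. This is a minimal sketch: the function name `natural_frequencies` and its dictionary keys are illustrative choices, and the counts are expected values rather than simulated ones.

```python
def natural_frequencies(n_accidents: int, p_blue: float, accuracy: float) -> dict:
    """Split n_accidents into expected counts of correct and false "Blue" calls."""
    blue_cabs = n_accidents * p_blue
    green_cabs = n_accidents - blue_cabs

    correct_blue_calls = blue_cabs * accuracy          # Blue cab, witness says "Blue"
    false_blue_calls = green_cabs * (1.0 - accuracy)   # Green cab, witness says "Blue"
    total_blue_calls = correct_blue_calls + false_blue_calls

    return {
        "correct_blue_calls": round(correct_blue_calls),
        "false_blue_calls": round(false_blue_calls),
        "total_blue_calls": round(total_blue_calls),
        "p_blue_given_blue_call": correct_blue_calls / total_blue_calls,
    }


counts = natural_frequencies(n_accidents=10_000, p_blue=0.15, accuracy=0.80)
for name, value in counts.items():
    print(f"{name}: {value}")
```

Counting whole accidents rather than multiplying probabilities is the same calculation as Bayes' Theorem, just in a form that is harder to get wrong by intuition.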

This is the Base Rate Fallacy in its purest form. Kahneman and Tversky documented it systematically in the 1970s, demonstrating that humans consistently replace a question about conditional probability — "what is the probability the cab is Blue, given the witness said so?" — with a simpler but wrong question: "how reliable is the witness?" The reliability of the witness is one input into the calculation. It is not the answer.

The Core Error

The Base Rate Fallacy is the act of answering a conditional probability question by focusing entirely on the reliability of the evidence while ignoring the prior probability of the event. The witness's 80% accuracy rate is a likelihood — it tells you how often this type of evidence appears given the event. It does not directly tell you how probable the event is given this evidence. That calculation requires Bayes' Theorem, which explicitly integrates the prior.

The Mathematical Proof

The precise answer comes from Bayes' Theorem. We want to find the posterior probability that the cab is Blue, given that the witness identified it as Blue. This is a conditional probability calculation, and it must account for both the witness's reliability and the base rate of Blue cabs.

Define the events as follows. Let $B$ be the event that the cab is Blue and $G$ be the event that the cab is Green. Let $W_B$ be the event that the witness says the cab is Blue.

The prior probabilities — the base rates of the two cab companies — are:

\[ P(B) = 0.15 \]

\[ P(G) = 0.85 \]

The witness's reliability translates into the following conditional likelihoods. The probability the witness says "Blue" given the cab actually is Blue is 0.80 (the correct identification rate). The probability the witness says "Blue" given the cab is actually Green is 0.20 (the error rate — the witness mistakes a Green cab for a Blue one):

\[ P(W_B \mid B) = 0.80 \]

\[ P(W_B \mid G) = 0.20 \]

Bayes' Theorem gives us the posterior probability — the probability the cab is Blue given that the witness said it was Blue — as:

\[ P(B \mid W_B) = \frac{P(W_B \mid B)\, P(B)}{P(W_B \mid B)\, P(B) + P(W_B \mid G)\, P(G)} \]

The denominator is the total probability of the witness making a "Blue" identification — regardless of the cab's actual color. It sums over both ways the witness can say "Blue": correctly identifying a Blue cab, or incorrectly identifying a Green one. Plugging in:

\[ P(B \mid W_B) = \frac{0.80 \times 0.15}{(0.80 \times 0.15) + (0.20 \times 0.85)} \]

\[ P(B \mid W_B) = \frac{0.12}{0.12 + 0.17} = \frac{0.12}{0.29} \approx 0.4138 \]

The result: there is a 41.38% probability the cab was actually Blue, and a 58.62% probability it was Green. Despite an 80% reliable witness testifying under oath that the cab was Blue, it is statistically more likely that the witness is wrong.
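Since the question asks for the exact probability, it can help to run the same substitution in exact rational arithmetic. The sketch below uses Python's standard `fractions` module; the variable names are illustrative and simply mirror $P(B)$, $P(G)$, $P(W_B \mid B)$, and $P(W_B \mid G)$.

```python
from fractions import Fraction

# Priors (base rates) from the problem statement
prior_blue = Fraction(15, 100)
prior_green = 1 - prior_blue

# Witness likelihoods: P(says Blue | Blue) and P(says Blue | Green)
says_blue_given_blue = Fraction(80, 100)
says_blue_given_green = Fraction(20, 100)

# Bayes' Theorem: numerator over the total probability of a "Blue" claim
numerator = says_blue_given_blue * prior_blue
evidence = numerator + says_blue_given_green * prior_green
posterior_blue = numerator / evidence

print(posterior_blue)                   # exact value as a fraction
print(f"{float(posterior_blue):.4f}")   # decimal approximation
```

The exact posterior is 12/29, which rounds to the 0.4138 quoted above.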

Scenario	Base Rate	P(witness says "Blue" | scenario)	Joint Probability
Cab is Blue, witness is correct	15%	80%	0.15 × 0.80 = 0.12
Cab is Green, witness is wrong	85%	20%	0.85 × 0.20 = 0.17
Total P(witness says "Blue")			0.12 + 0.17 = 0.29
P(Blue | witness says "Blue")			0.12 / 0.29 ≈ 41.4%
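The two joint probabilities in the table can also be read as odds, which some readers find easier to carry in their heads. The following is a short restatement using only the quantities already defined, not a separate result:

```latex
\[ \frac{P(B \mid W_B)}{P(G \mid W_B)}
   = \frac{P(B)}{P(G)} \times \frac{P(W_B \mid B)}{P(W_B \mid G)}
   = \frac{0.15}{0.85} \times \frac{0.80}{0.20}
   = \frac{0.12}{0.17} = \frac{12}{17} \]

\[ P(B \mid W_B) = \frac{12}{12 + 17} = \frac{12}{29} \approx 0.4138 \]
```

The witness's testimony multiplies the odds of Blue by 4, but the prior odds start at 3 to 17 against, so the posterior odds are still below even.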

It is worth making the structure of the calculation explicit. The numerator is the probability that both things are true simultaneously: the cab is Blue and the witness correctly identifies it as Blue. The denominator is the total probability of the witness saying "Blue" — which includes both correct and incorrect identifications. We are conditioning on the witness's statement and asking what fraction of the time that statement is accurate. The answer is determined by the ratio of correct "Blue" calls to total "Blue" calls, which is why the base rate is decisive.
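To see how decisive the base rate is, the posterior can be recomputed for several priors while holding the witness at 80% accuracy. This is a sketch under the problem's assumption that the witness is equally accurate on both colors; the function name `posterior_blue` is illustrative, and every prior other than 0.15 is a hypothetical value chosen for comparison.

```python
def posterior_blue(prior_blue: float, accuracy: float = 0.80) -> float:
    """P(Blue | witness says Blue) for a witness equally accurate on both colors."""
    true_blue_calls = accuracy * prior_blue
    false_blue_calls = (1.0 - accuracy) * (1.0 - prior_blue)
    return true_blue_calls / (true_blue_calls + false_blue_calls)


# Only 0.15 comes from the problem statement; the other priors are hypothetical.
for prior in (0.05, 0.15, 0.20, 0.50):
    print(f"P(Blue) = {prior:.2f} -> P(Blue | says Blue) = {posterior_blue(prior):.3f}")
```

With a symmetric witness, true and false "Blue" calls balance when the prior equals one minus the accuracy, here 20%. Below that prior, a "Blue" call is more likely wrong than right; at a 50% prior, the posterior equals the witness's accuracy.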

A useful intuition: the witness's 80% accuracy rate is symmetric, applying equally to both colors. But the base rates are sharply asymmetric. Green cabs are more than five times as common as Blue cabs. A 20% error rate applied to a population of 8,500 Green cabs generates 1,700 false Blue identifications. An 80% accuracy rate applied to a population of only 1,500 Blue cabs generates just 1,200 correct ones. The false positives outnumber the true positives. This is the mathematical mechanism behind the result, and it generalizes to every domain where rare events are being detected by imperfect instruments.

"An 80% accurate detector applied to an event with a base rate below 20% will produce more false positives than true positives. This is not a flaw in the detector; it is arithmetic. Ignoring it is the Base Rate Fallacy."

Python Simulation: 1,000,000 Trials

The Bayesian result can be confirmed empirically with a straightforward Monte Carlo simulation. We generate 1,000,000 accidents, assign each a cab color using the 85/15 base rate, apply the witness's 80% accuracy rate to each observation, and then filter to only the trials where the witness said "Blue." The fraction of those trials where the cab was actually Blue converges toward the theoretical value of about 41.38% as the number of trials grows.

Python: taxi_cab_simulation.py

import random

def taxi_cab_trial() -> tuple[bool, bool]:
    """Simulate one taxi cab accident and witness observation.

    Returns:
        (cab_is_blue, witness_says_blue): truth and witness claim as booleans.
    """
    # Assign cab color using the 85/15 base rate
    cab_is_blue = random.random() < 0.15

    # Apply witness accuracy: 80% correct, 20% wrong
    witness_correct = random.random() < 0.80
    witness_says_blue = cab_is_blue if witness_correct else not cab_is_blue

    return cab_is_blue, witness_says_blue


def simulate_taxi_cab(n_trials: int = 1_000_000) -> dict:
    """Run n_trials and return posterior probability statistics."""
    witness_said_blue = 0
    actually_blue     = 0

    for _ in range(n_trials):
        cab_is_blue, witness_says_blue = taxi_cab_trial()
        if witness_says_blue:
            witness_said_blue += 1
            if cab_is_blue:
                actually_blue += 1

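    # Guard against a tiny run in which the witness never says "Blue"
    posterior = actually_blue / witness_said_blue if witness_said_blue else float("nan")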
    return {
        "trials":             n_trials,
        "witness_said_blue":  witness_said_blue,
        "actually_blue":      actually_blue,
        "posterior_p_blue":   posterior,
        "posterior_p_green":  1 - posterior,
    }


random.seed(42)
results = simulate_taxi_cab(n_trials=1_000_000)

print("=== Taxi Cab Problem — 1,000,000 Trial Monte Carlo ===")
# Right-align the counts to 9 characters so the columns line up in the output
print(f"Total trials:             {results['trials']:>9,}")
print(f"Witness said 'Blue':      {results['witness_said_blue']:>9,}")
print(f"Cab actually was Blue:    {results['actually_blue']:>9,}")
print(f"Cab actually was Green:   {results['witness_said_blue'] - results['actually_blue']:>9,}")
print()
print(f"P(Blue  | witness says Blue): {results['posterior_p_blue']:.4f}  (exact: 0.4138)")
print(f"P(Green | witness says Blue): {results['posterior_p_green']:.4f}  (exact: 0.5862)")

Actual output from running this simulation with random.seed(42):

=== Taxi Cab Problem — 1,000,000 Trial Monte Carlo ===
Total trials:             1,000,000
Witness said 'Blue':        289,847
Cab actually was Blue:      120,042
Cab actually was Green:     169,805

P(Blue  | witness says Blue): 0.4142  (exact: 0.4138)
P(Green | witness says Blue): 0.5858  (exact: 0.5862)

The simulation agrees closely with the theory. Of the 289,847 times the witness identifies a cab as Blue across 1,000,000 trials, the cab was actually Green 169,805 times, nearly 59% of the cases. The deviations from the exact theoretical values (0.4142 versus 0.4138, and 0.5858 versus 0.5862) are consistent with sampling noise. The relevant sample size is the number of "Blue" claims, not the total trial count: with $n \approx 290{,}000$, the standard error $\sqrt{p(1-p)/n}$ is about 0.0009, so the observed gap of roughly 0.0004 is less than one standard error.
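The size of that sampling noise can be checked directly. The sketch below takes the counts reported in the output above at face value and compares the observed posterior with the exact value; the variable names are illustrative.

```python
import math

# Counts copied from the simulation output above
witness_said_blue = 289_847
actually_blue = 120_042

p_exact = 0.12 / 0.29                       # theoretical posterior, 12/29
p_observed = actually_blue / witness_said_blue

# The posterior is estimated only from trials where the witness said "Blue",
# so that count, not the 1,000,000 total, is the sample size n.
standard_error = math.sqrt(p_exact * (1 - p_exact) / witness_said_blue)
z_score = (p_observed - p_exact) / standard_error

print(f"observed:       {p_observed:.5f}")
print(f"exact:          {p_exact:.5f}")
print(f"standard error: {standard_error:.5f}")
print(f"z-score:        {z_score:.2f}")
```

A z-score with magnitude below 1 means the simulated estimate sits within one standard error of the exact answer, which is what sampling noise alone would be expected to produce.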

The key observation from the output: the witness said "Blue" approximately 290,000 times in one million trials, about 29% of the time, which closely matches the denominator of Bayes' Theorem: $P(W_B) = 0.12 + 0.17 = 0.29$. Of those 290,000 identifications, roughly 120,000 were correct and 170,000 were false positives. The simulation is not a shortcut; it is an independent check on the algebra.

Litigation Application: When Juries Get the Math Wrong

The Taxi Cab Problem is not an abstract curiosity. It is the operating model for how human intuition evaluates evidence in courtrooms, boardrooms, and regulatory proceedings — and it consistently produces the wrong answer. Kahneman and Tversky's research showed that even trained professionals, when presented with base rate information alongside witness reliability data, systematically ignore the prior and anchor on the reliability statistic. This is not a matter of education or intelligence. It is a structural feature of how the human mind processes conditional probability under uncertainty.

In criminal litigation, the most direct application is eyewitness testimony. A witness with a documented 80% identification accuracy is presented as highly reliable evidence. Jurors hear "80% accurate" and infer "80% probability of guilt." But the actual posterior probability of guilt depends critically on the base rate — in this context, how many individuals in the relevant population could plausibly have committed the crime. When that population is large (as it almost always is), or when the base rate of guilt for any given suspect is low (as it almost always is), the math produces the same structure as the taxi cab problem: the witness's identification is far less probative than its accuracy statistic implies.

Breathalyzer evidence carries the same structure. A Breathalyzer instrument with a 95% accuracy rate sounds definitive. But "accuracy" is often specified as sensitivity — the probability the instrument reads positive given the subject is actually impaired. The critical quantity for adjudication is the inverse: the probability the subject is impaired given a positive reading. That calculation requires the base rate of impaired driving in the population of individuals who are tested, which is not 50% and not 95%. In standard roadside screening scenarios, accounting for the realistic base rate of impairment in stopped drivers substantially lowers the posterior probability even at high instrument accuracy. Juries are rarely presented with this calculation.

In corporate litigation and eDiscovery, technology-assisted review systems flag documents as "responsive" or "privileged" at rated accuracy levels. A document review system marketed as 90% accurate sounds like a reliable filter. Whether it is reliable enough to be defensible in court depends on the base rate of responsive documents in the corpus. If 5% of a corpus is actually responsive, a classifier that is 90% accurate on both responsive and non-responsive documents will generate roughly two false positives for every true positive (9.5% of the corpus against 4.5%), meaning about two-thirds of the documents flagged as responsive are not. The attorneys relying on the output face exactly the taxi cab problem, and their experts need to present the math, not just the accuracy rating.
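The flagged-document mix for the 5% example can be computed directly. This sketch reads "90% accurate" as 90% sensitivity and 90% specificity, which is an assumption since the marketing figure does not say; the corpus size of 100,000 and the function name `flagged_breakdown` are likewise hypothetical.

```python
def flagged_breakdown(n_docs: int, prevalence: float,
                      sensitivity: float, specificity: float) -> tuple[float, float, float]:
    """Return (true flags, false flags, precision) for a review classifier."""
    responsive = n_docs * prevalence
    non_responsive = n_docs - responsive

    true_flags = responsive * sensitivity               # responsive and flagged
    false_flags = non_responsive * (1.0 - specificity)  # not responsive but flagged
    precision = true_flags / (true_flags + false_flags)
    return true_flags, false_flags, precision


# Assumed: 100,000 documents, 5% responsive, 90% sensitivity and specificity
true_flags, false_flags, precision = flagged_breakdown(100_000, 0.05, 0.90, 0.90)
print(f"True flags:  {true_flags:,.0f}")
print(f"False flags: {false_flags:,.0f}")
print(f"Precision:   {precision:.1%}")
```

Under this reading, about 4,500 flagged documents are responsive and about 9,500 are not, so only around a third of the flagged set is what the reviewers were looking for. The two kinds of flag balance only when prevalence reaches 10%.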

In financial services, the same structure governs fraud detection, credit default prediction, and audit sampling. A credit model with 90% accuracy deployed against a population where 3% of borrowers default will, if the 90% figure holds for defaulters and non-defaulters alike, flag roughly 3.6 non-defaulters for every true defaulter it catches. A fraud detection system with 99% specificity applied to a payment processor handling billions of transactions will still produce tens of millions of false flags annually. Every one of these applications is a Bayesian calculation dressed in domain-specific language. Every one of them is broken when analysts skip the prior and anchor on the headline accuracy statistic.
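The scale of the false-flag problem follows from one multiplication. In this sketch, the 3% default rate and 99% specificity come from the text; the annual volume of 2 billion legitimate transactions and the reading of "90% accuracy" as applying to both defaulters and non-defaulters are assumptions made for illustration.

```python
def expected_false_flags(n_legitimate: float, specificity: float) -> float:
    """Expected number of legitimate cases wrongly flagged."""
    return n_legitimate * (1.0 - specificity)


# Fraud screening: hypothetical 2 billion legitimate transactions per year
fraud_false_flags = expected_false_flags(2_000_000_000, specificity=0.99)
print(f"Fraud false flags per year: {fraud_false_flags:,.0f}")

# Credit model: per-borrower rates, 3% default, assumed 90% sensitivity and specificity
true_positive_rate = 0.03 * 0.90                           # defaulters caught
false_positive_rate = expected_false_flags(0.97, specificity=0.90)  # non-defaulters flagged
ratio = false_positive_rate / true_positive_rate
print(f"Credit false positives per true positive: {ratio:.1f}")
```

Neither figure says anything about how good the model is at its job; both are driven almost entirely by how rare the target event is.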

The litigation business case is specific: attorneys and their expert witnesses who quantify these posteriors — who present a jury with the actual conditional probability calculation rather than the raw reliability statistic — can neutralize evidence that appears overwhelming on its face. And attorneys who do not understand this framework will consistently over-rely on evidence that appears reliable but is probabilistically thin. High-stakes litigation in domains touching statistics, forensics, or technology-assisted review requires this framework. Gut instinct on conditional probability is demonstrably, mathematically broken.

White Oak Intelligence

Probability Theory

White Oak Intelligence builds quantitative financial models, data infrastructure, and custom software for middle-market operators and investors in Raleigh, NC.
