Variance Testing in Forecasting: When Your Model Stops Being Reliable

Why MAPE Misleads

Mean Absolute Percentage Error is the default metric for forecast evaluation in most business contexts. It is easy to explain: if your MAPE is 8%, your model is wrong by 8% on average. That simplicity is also its critical flaw.

MAPE is undefined when actuals are zero, which happens constantly in revenue series with seasonal gaps, new product launches, or promotional periods. More subtly, it penalizes over-forecasts more severely than under-forecasts by construction: as long as forecasts are non-negative, an under-forecast can contribute at most 100% error (a forecast of zero), while an over-forecast is unbounded and can contribute 200% or more. This asymmetry means MAPE-optimized models systematically bias toward underestimating demand, a direction that is rarely operationally preferable.
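
The asymmetry is easy to check numerically. The sketch below uses plain Python and a hypothetical helper, `ape`, to compare absolute percentage errors for a single actual of 100 when the forecast misses low versus high. The specific values are illustrative only.

```python
def ape(actual: float, forecast: float) -> float:
    """Absolute percentage error for a single observation, in percent."""
    return abs(actual - forecast) / abs(actual) * 100


actual = 100.0

# Under-forecasts: a non-negative forecast can miss by at most the full actual.
under = [ape(actual, f) for f in (50.0, 10.0, 0.0)]

# Over-forecasts: nothing caps the error as the forecast grows.
over = [ape(actual, f) for f in (150.0, 300.0, 1000.0)]

print("under-forecast APEs:", under)  # never exceeds 100
print("over-forecast APEs: ", over)   # grows without bound
```

A miss of 50 units in either direction scores the same, but the worst possible under-forecast stops at 100% while the over-forecast side keeps growing.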

The Core Problem

A model can have a low MAPE and still be useless in practice. If it is consistently wrong in the same direction, if its errors correlate with past errors, or if it performs worse than a naive benchmark, those failures are invisible in a single-metric MAPE report.
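
To make this concrete, here is a minimal sketch with made-up data: a series that grows by one unit per period and a hypothetical model that always forecasts 4% low. The MAPE looks respectable, yet every error has the same sign and a no-change naive forecast is several times more accurate.

```python
# Hypothetical series: steady growth of 1 unit per period.
actuals = [100.0 + t for t in range(24)]

# Hypothetical model: always 4% below the actual (a consistent under-forecast).
forecasts = [0.96 * y for y in actuals]

n = len(actuals)
mape = sum(abs(y - f) / abs(y) for y, f in zip(actuals, forecasts)) / n * 100

# Mean error measures bias; a positive value means under-forecasting.
mean_error = sum(y - f for y, f in zip(actuals, forecasts)) / n

# No-change naive benchmark: predict the previous actual.
# Both MAEs are computed over periods 1..n-1, where the naive forecast exists.
model_mae = sum(abs(y - f) for y, f in zip(actuals[1:], forecasts[1:])) / (n - 1)
naive_mae = sum(abs(actuals[t] - actuals[t - 1]) for t in range(1, n)) / (n - 1)

print(f"MAPE: {mape:.2f}%")
print(f"mean error (bias): {mean_error:.2f}")
print(f"model MAE: {model_mae:.2f} vs naive MAE: {naive_mae:.2f}")
```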

The Four-Metric Framework

A rigorous forecast evaluation requires at minimum four metrics, each measuring a different failure mode. Used together, they reveal whether a model is accurate in magnitude, unbiased, better than a naive baseline, and not systematically gaming a particular error measure.

| Metric | What It Measures | Key Property |
| --- | --- | --- |
| MAPE | Mean percentage error magnitude | Intuitive but unstable at low actuals |
| RMSE | Root mean squared error | Penalizes large errors; same units as the series |
| MASE | Mean absolute scaled error vs. seasonal naïve | Scale-free; MASE > 1.0 means worse than naïve |
| Theil's U | RMSE ratio vs. no-change naïve | U > 1.0 means model is worse than doing nothing |
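
For reference, the four metrics can be written out as below, with y_t the actual, f_t the forecast, n the number of evaluated periods, and m the seasonal period. This is one common set of definitions consistent with the table; other variants exist, notably for Theil's U and for where the MASE scaling term is computed.

```latex
\begin{aligned}
\text{MAPE} &= \frac{100}{n} \sum_{t=1}^{n} \frac{|y_t - f_t|}{|y_t|} \\
\text{RMSE} &= \sqrt{\frac{1}{n} \sum_{t=1}^{n} (y_t - f_t)^2} \\
\text{MASE} &= \frac{\frac{1}{n} \sum_{t=1}^{n} |y_t - f_t|}{\frac{1}{n-m} \sum_{t=m+1}^{n} |y_t - y_{t-m}|} \\
U &= \frac{\text{RMSE}}{\sqrt{\frac{1}{n-1} \sum_{t=2}^{n} (y_t - y_{t-1})^2}}
\end{aligned}
```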

Python Implementation

The function below computes all four metrics from arrays of actuals and forecasts. MASE uses a seasonal naïve benchmark with a configurable seasonal_period; for monthly data, the default of 12 compares each actual to the value from the same month one year prior. When the series is no longer than one full season, it falls back to a one-step naïve benchmark. Note that the benchmark errors are computed on the same actuals being evaluated, not on a separate training history.

Python forecast_metrics.py
import numpy as np
from typing import Dict

def compute_forecast_metrics(
    actuals: np.ndarray,
    forecasts: np.ndarray,
    seasonal_period: int = 12,
    epsilon: float = 1e-8
) -> Dict[str, float]:

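    """Compute MAPE (%), RMSE, MASE, and Theil's U for aligned series."""
    actuals = np.asarray(actuals, dtype=float)
    forecasts = np.asarray(forecasts, dtype=float)
    if actuals.shape != forecasts.shape or actuals.size < 2:
        raise ValueError("actuals and forecasts must be the same shape with at least 2 points")

    errors = actuals - forecasts
    abs_errors = np.abs(errors)

    # MAPE: skip near-zero actuals to avoid division instability;
    # returns NaN if every actual is near zero
    mask = np.abs(actuals) > epsilon
    mape = np.mean(abs_errors[mask] / np.abs(actuals[mask])) * 100 if mask.any() else float('nan')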

    # RMSE
    rmse = np.sqrt(np.mean(errors ** 2))

    # MASE — seasonal naïve benchmark
    if len(actuals) > seasonal_period:
        naive_errors = np.abs(actuals[seasonal_period:] - actuals[:-seasonal_period])
    else:
        naive_errors = np.abs(np.diff(actuals))  # one-step naïve fallback

    naive_mae = np.mean(naive_errors)
    mase = np.mean(abs_errors) / (naive_mae + epsilon)

    # Theil's U — compare model RMSE to no-change naïve RMSE
    naive_rmse = np.sqrt(np.mean((actuals[1:] - actuals[:-1]) ** 2))
    theil_u = rmse / (naive_rmse + epsilon)

    return {
        'mape':    round(float(mape), 4),
        'rmse':    round(float(rmse), 4),
        'mase':    round(float(mase), 4),
        'theil_u': round(float(theil_u), 4),
    }
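
A common variant of MASE takes the scaling term from the training history rather than from the evaluation window, so the benchmark does not depend on the period being scored. The sketch below illustrates that variant in plain Python. `mase_in_sample` and its parameter names are hypothetical and not part of the function above, and the quarterly data is made up.

```python
from typing import Sequence


def mase_in_sample(
    train: Sequence[float],
    actuals: Sequence[float],
    forecasts: Sequence[float],
    seasonal_period: int = 12,
) -> float:
    """MASE with the seasonal naive MAE computed on the training history."""
    if len(actuals) != len(forecasts) or len(actuals) == 0:
        raise ValueError("actuals and forecasts must be non-empty and equal length")
    if len(train) <= seasonal_period:
        raise ValueError("training history must be longer than one seasonal period")

    # Scale: in-sample MAE of the seasonal naive forecast y[t] = y[t - m].
    m = seasonal_period
    naive_errors = [abs(train[t] - train[t - m]) for t in range(m, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("seasonal naive error is zero on the history; MASE is undefined")

    # Out-of-sample MAE of the model being evaluated.
    mae = sum(abs(y - f) for y, f in zip(actuals, forecasts)) / len(actuals)
    return mae / scale


# Hypothetical quarterly data: two years of history, one year scored.
train = [10.0, 20.0, 30.0, 40.0, 12.0, 22.0, 32.0, 42.0]
actuals = [14.0, 24.0, 34.0, 44.0]
forecasts = [13.0, 25.0, 33.0, 45.0]
print(mase_in_sample(train, actuals, forecasts, seasonal_period=4))
```

In this example the model's errors average half the size of the seasonal naive errors on the history, so the score comes out below 1.0.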

Residual Analysis and the Ljung-Box Test

A well-specified forecast model should produce residuals that are white noise: random, uncorrelated, and centered near zero. If residuals show autocorrelation — if this period's error predicts next period's error — the model is leaving systematic information on the table. That pattern is detectable and exploitable, which means the model is not doing its job.

The Ljung-Box test is the standard statistical tool for detecting residual autocorrelation. Its null hypothesis is that the residual autocorrelations at lags 1 through k are jointly zero, as they would be for white noise. A p-value below 0.05 rejects that hypothesis at the 5% level and is strong evidence that the model has structural problems that cannot be patched by recalibration alone.
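
To show what the test statistic measures, the sketch below computes the Ljung-Box Q statistic by hand in plain Python: Q = n(n+2) times the sum over lags k of the squared sample autocorrelation divided by (n - k). The function name `ljung_box_q` and the alternating residual series are hypothetical. In practice a library routine should be used, since it also supplies the p-value.

```python
from typing import Sequence


def ljung_box_q(residuals: Sequence[float], lags: int) -> float:
    """Ljung-Box Q statistic over lags 1..lags."""
    n = len(residuals)
    if not 0 < lags < n:
        raise ValueError("lags must be between 1 and len(residuals) - 1")

    mean = sum(residuals) / n
    centered = [e - mean for e in residuals]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("residuals are constant; autocorrelation is undefined")

    total = 0.0
    for k in range(1, lags + 1):
        # Sample autocorrelation at lag k.
        rho_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total


# Hypothetical residuals that flip sign every period: strongly autocorrelated.
alternating = [1.0, -1.0] * 20
q = ljung_box_q(alternating, lags=5)

# Under the null, Q is approximately chi-square with 5 degrees of freedom;
# the 5% critical value for 5 degrees of freedom is about 11.07.
print(f"Q = {q:.2f}, exceeds 5% critical value: {q > 11.07}")
```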

Python forecast_metrics.py
import numpy as np
from typing import Dict
from statsmodels.stats.diagnostic import acorr_ljungbox

def residual_analysis(
    actuals: np.ndarray,
    forecasts: np.ndarray,
    lags: int = 10,
    significance: float = 0.05
) -> Dict:

    residuals = actuals - forecasts
    lb_result = acorr_ljungbox(residuals, lags=[lags], return_df=True)
    lb_stat  = float(lb_result['lb_stat'].iloc[-1])
    lb_pval  = float(lb_result['lb_pvalue'].iloc[-1])
    autocorrelated = lb_pval < significance

    residual_mean    = float(np.mean(residuals))
    residual_std     = float(np.std(residuals))
    max_abs_residual = float(np.max(np.abs(residuals)))

    if autocorrelated and abs(residual_mean) > residual_std * 0.5:
        diagnosis = "RETRAIN: systematic bias with autocorrelation"
    elif autocorrelated:
        diagnosis = "RETRAIN: autocorrelated residuals indicate model misspecification"
    elif abs(residual_mean) > residual_std * 0.5:
        diagnosis = "RECALIBRATE: bias without autocorrelation"
    else:
        diagnosis = "PASS: residuals appear well-behaved"

    return {
        'ljung_box_stat':   round(lb_stat, 4),
        'ljung_box_pvalue': round(lb_pval, 4),
        'autocorrelated':   autocorrelated,
        'residual_mean':    round(residual_mean, 4),
        'residual_std':     round(residual_std, 4),
        'max_abs_residual': round(max_abs_residual, 4),
        'diagnosis':        diagnosis,
    }

Retrain vs. Recalibrate Decision Table

Not every model failure requires a full retrain. Retraining means rebuilding the model from scratch on a new or expanded dataset — a significant undertaking for complex models. Recalibration means adjusting existing parameters, updating intercepts, or applying a bias correction factor. Knowing which intervention is appropriate requires reading the diagnostic signals together.
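
As one illustration of recalibration, the sketch below applies an additive bias correction: it estimates the mean residual over a recent window and shifts new forecasts by that amount. The function name, the window contents, and the choice of an additive rather than multiplicative correction are all assumptions made for illustration.

```python
from typing import List, Sequence


def additive_bias_correction(
    recent_actuals: Sequence[float],
    recent_forecasts: Sequence[float],
    new_forecasts: Sequence[float],
) -> List[float]:
    """Shift new forecasts by the mean residual (actual - forecast) of a recent window."""
    if len(recent_actuals) != len(recent_forecasts) or len(recent_actuals) == 0:
        raise ValueError("recent actuals and forecasts must be non-empty and equal length")

    bias = sum(y - f for y, f in zip(recent_actuals, recent_forecasts)) / len(recent_actuals)

    # A positive bias means the model has been under-forecasting, so forecasts move up.
    return [f + bias for f in new_forecasts]


# Hypothetical recent window in which the model ran about 5 units low.
recent_actuals = [105.0, 98.0, 110.0, 103.0]
recent_forecasts = [100.0, 94.0, 104.0, 98.0]
corrected = additive_bias_correction(recent_actuals, recent_forecasts, [101.0, 107.0])
print(corrected)
```

A shift like this only addresses a stable offset. If the residuals are also autocorrelated, the correction leaves the underlying pattern in place.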

| Diagnostic Signal | Recommended Action | Rationale |
| --- | --- | --- |
| MASE > 1.0 | Retrain | Model underperforms a naïve baseline, which is a structural failure |
| Autocorrelated + bias | Retrain | Model is missing a systematic component; recalibration cannot fix this |
| Non-autocorrelated + bias | Recalibrate | Model structure is correct; apply bias correction or update intercept |
| All metrics passing | Monitor | Continue scheduled evaluation; no intervention needed |
| Theil's U > 1.0 despite low MAPE | Retrain | Model exploits MAPE asymmetry; real-world performance is worse than reported |
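
The table can be encoded as a small decision function. The sketch below is one possible reading, not a definitive rule set: `recommend_action` is a hypothetical name, the order in which signals are checked is an assumption, autocorrelation without bias is treated as a retrain signal to match the diagnosis strings in residual_analysis, and the "despite low MAPE" qualifier on Theil's U is simplified to a plain threshold.

```python
def recommend_action(
    mase: float,
    theil_u: float,
    autocorrelated: bool,
    biased: bool,
) -> str:
    """Map diagnostic signals to 'Retrain', 'Recalibrate', or 'Monitor'."""
    # Retrain triggers are checked first (assumed priority order).
    if mase > 1.0:
        return "Retrain"      # worse than the seasonal naive baseline
    if theil_u > 1.0:
        return "Retrain"      # worse than the no-change forecast
    if autocorrelated:
        return "Retrain"      # residuals still contain structure
    if biased:
        return "Recalibrate"  # stable offset without autocorrelation
    return "Monitor"


# Hypothetical diagnostic readings.
print(recommend_action(mase=0.8, theil_u=0.9, autocorrelated=False, biased=True))
print(recommend_action(mase=1.2, theil_u=0.9, autocorrelated=False, biased=False))
```

Here `biased` stands for whatever bias rule is in use, for example the half-standard-deviation check on the residual mean in residual_analysis.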

"A forecast model that passes its MAPE target while underperforming a naïve benchmark is not a model that works — it is a model that has learned to game a poorly chosen metric."

Business Impact

What Rigorous Forecast Validation Delivers

Fewer Costly Surprises

Forecasts that detect their own degradation before it compounds into operational decisions prevent the inventory overages, cash flow shortfalls, and staffing errors that follow from using a model past its useful life.

Defensible Model Governance

A documented validation framework — metrics, thresholds, and a structured retrain trigger — satisfies the model governance requirements that PE-backed companies and regulated entities increasingly face from lenders and auditors.

Better Capital Decisions

When a board or lender asks how confident you are in your revenue forecast, the answer should not be "the model said so." A four-metric report backed by residual analysis is the kind of evidence that supports a capital decision.