Mean Absolute Percentage Error is the default metric for forecast evaluation in most business contexts. It is easy to explain: if your MAPE is 8%, your model is wrong by 8% on average. That simplicity is also its critical flaw.
MAPE is undefined when actuals are zero, which happens constantly in revenue series with seasonal gaps, new product launches, or promotional periods. More subtly, it penalizes over-forecasts more severely than under-forecasts by construction. For a non-negative forecast, an under-forecast can contribute at most 100% error (the worst case is forecasting zero), while an over-forecast has no ceiling: forecasting 300 against an actual of 100 is a 200% error. This asymmetry means MAPE-optimized models systematically bias toward underestimating demand, a direction that is rarely operationally preferable.
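The asymmetry is easy to check numerically. The sketch below is illustrative only: the `mape` helper and the toy numbers are assumptions made for this example, not part of any production pipeline. It shows the 100% cap on under-forecast errors, the lack of a cap on over-forecast errors, and which constant forecast minimizes MAPE on a small skewed series.

```python
import numpy as np

def mape(actuals: np.ndarray, forecasts: np.ndarray) -> float:
    """Mean absolute percentage error, in percent (actuals must be non-zero)."""
    return float(np.mean(np.abs(actuals - forecasts) / np.abs(actuals)) * 100)

actual = np.array([100.0])

# Under-forecasts are capped at 100% as long as the forecast is non-negative.
under = [mape(actual, np.array([f])) for f in (50.0, 0.0)]

# Over-forecasts have no ceiling.
over = [mape(actual, np.array([f])) for f in (150.0, 300.0)]

# Which constant forecast minimizes MAPE on a skewed toy series?
toy_actuals = np.array([50.0, 100.0, 200.0])
candidates = np.arange(1.0, 301.0)  # candidate constant forecasts 1..300
scores = [mape(toy_actuals, np.full_like(toy_actuals, c)) for c in candidates]
best_constant = float(candidates[int(np.argmin(scores))])

print("under-forecast errors:", under)
print("over-forecast errors:", over)
print("MAPE-optimal constant:", best_constant)
print("median:", float(np.median(toy_actuals)), "mean:", float(np.mean(toy_actuals)))
```

On this toy series the MAPE-minimizing constant is 50, below both the median (100) and the mean (about 116.7), which is the downward pull described above.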
A model can have a low MAPE and still be useless in practice. If it is consistently wrong in the same direction, if its errors correlate with past errors, or if it performs worse than a naive benchmark, those failures are invisible in a single-metric MAPE report.
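A toy case makes the point concrete. The series and the "model" below are hypothetical, chosen only to illustrate the failure: a forecast that is always 4% low reports a respectable 4% MAPE, yet simply repeating last period's actual is far more accurate on the same data.

```python
import numpy as np

# Toy series: steady growth of 2 units per period from a base of 1000.
actuals = 1000.0 + 2.0 * np.arange(24)

# Hypothetical model: always 4% below the actual value.
model_forecasts = actuals * 0.96

# No-change naive forecast: last period's actual (defined from period 2 onward).
naive_forecasts = actuals[:-1]

model_mape = float(np.mean(np.abs(actuals - model_forecasts) / actuals) * 100)
naive_mape = float(np.mean(np.abs(actuals[1:] - naive_forecasts) / actuals[1:]) * 100)

# A positive mean error means the model under-forecasts persistently.
mean_error = float(np.mean(actuals - model_forecasts))

print(f"model MAPE: {model_mape:.2f}%")
print(f"naive MAPE: {naive_mape:.2f}%")
print(f"model mean error: {mean_error:.2f}")
```

A MAPE-only report shows the 4% figure and nothing else; the one-directional error and the loss to the naive benchmark only appear once they are measured separately.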
A rigorous forecast evaluation requires at minimum four metrics, each measuring a different failure mode. Used together, they reveal whether a model is accurate in magnitude, unbiased, better than a naive baseline, and not systematically gaming a particular error measure.
| Metric | What It Measures | Key Property |
|---|---|---|
| MAPE | Mean percentage error magnitude | Intuitive but unstable at low actuals |
| RMSE | Root mean squared error | Penalizes large errors; same units as the series |
| MASE | Mean absolute scaled error vs. seasonal naïve | Scale-free; MASE > 1.0 means worse than naïve |
| Theil's U | RMSE ratio vs. no-change naïve | U > 1.0 means model is worse than doing nothing |
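For reference, one common way to write the four metrics is given below. The notation is an assumption of this sketch: $y_t$ is the actual, $\hat{y}_t$ the forecast, $n$ the number of evaluated periods, $m$ the seasonal period, and $\varepsilon$ a small threshold used to drop near-zero actuals from MAPE. Textbook MASE usually scales by the naïve error on the training data; the form here scales by the naïve error on the evaluated series itself, and the $U$ shown is a simplified RMSE ratio rather than Theil's original formulation.

```latex
\begin{aligned}
e_t &= y_t - \hat{y}_t \\
\mathrm{MAPE} &= \frac{100}{|T|} \sum_{t \in T} \frac{|e_t|}{|y_t|},
  \qquad T = \{\, t : |y_t| > \varepsilon \,\} \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n} \sum_{t=1}^{n} e_t^{2}} \\
\mathrm{MASE} &= \frac{\frac{1}{n} \sum_{t=1}^{n} |e_t|}
  {\frac{1}{n-m} \sum_{t=m+1}^{n} |y_t - y_{t-m}|} \\
U &= \frac{\mathrm{RMSE}}
  {\sqrt{\frac{1}{n-1} \sum_{t=2}^{n} (y_t - y_{t-1})^{2}}}
\end{aligned}
```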
The function below computes all four metrics from arrays of actuals and forecasts. MASE uses a seasonal naïve benchmark with a configurable `seasonal_period`: for monthly data the default of 12 compares each value to the value from the same month one year prior. When the series is no longer than one seasonal period, it falls back to a one-step naïve benchmark.
```python
import numpy as np
from typing import Dict


def compute_forecast_metrics(
    actuals: np.ndarray,
    forecasts: np.ndarray,
    seasonal_period: int = 12,
    epsilon: float = 1e-8,
) -> Dict[str, float]:
    # Accept lists or arrays; the naïve benchmarks need at least two points
    actuals = np.asarray(actuals, dtype=float)
    forecasts = np.asarray(forecasts, dtype=float)
    if actuals.shape != forecasts.shape or actuals.size < 2:
        raise ValueError("actuals and forecasts must have the same length (>= 2)")

    errors = actuals - forecasts
    abs_errors = np.abs(errors)

    # MAPE: skip near-zero actuals to avoid division instability;
    # report NaN if every actual is near zero
    mask = np.abs(actuals) > epsilon
    mape = np.mean(abs_errors[mask] / np.abs(actuals[mask])) * 100 if mask.any() else np.nan

    # RMSE
    rmse = np.sqrt(np.mean(errors ** 2))

    # MASE: seasonal naïve benchmark
    if len(actuals) > seasonal_period:
        naive_errors = np.abs(actuals[seasonal_period:] - actuals[:-seasonal_period])
    else:
        naive_errors = np.abs(np.diff(actuals))  # one-step naïve fallback
    naive_mae = np.mean(naive_errors)
    mase = np.mean(abs_errors) / (naive_mae + epsilon)

    # Theil's U: compare model RMSE to no-change naïve RMSE
    naive_rmse = np.sqrt(np.mean((actuals[1:] - actuals[:-1]) ** 2))
    theil_u = rmse / (naive_rmse + epsilon)

    return {
        'mape': round(float(mape), 4),
        'rmse': round(float(rmse), 4),
        'mase': round(float(mase), 4),
        'theil_u': round(float(theil_u), 4),
    }
```
A well-specified forecast model should produce residuals that are white noise: random, uncorrelated, and centered near zero. If residuals show autocorrelation — if this period's error predicts next period's error — the model is leaving systematic information on the table. That pattern is detectable and exploitable, which means the model is not doing its job.
The Ljung-Box test is the standard statistical tool for detecting residual autocorrelation. It tests the null hypothesis that the residual autocorrelations up to lag k are jointly zero, as they would be for white noise. A p-value below 0.05 rejects that hypothesis and is strong evidence that the model has structural problems that cannot be patched by recalibration alone.
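For readers who want the statistic itself, the standard form is shown below. The symbols are notation chosen for this sketch: $e_t$ are the residuals, $\bar{e}$ their mean, $n$ the number of residuals, $\hat{\rho}_j$ the sample autocorrelation at lag $j$, and $k$ the maximum lag tested. Under the null hypothesis $Q$ is approximately chi-squared with $k$ degrees of freedom; when the residuals come from a fitted ARMA($p$, $q$) model, the degrees of freedom are usually reduced to $k - p - q$.

```latex
\begin{aligned}
\hat{\rho}_j &= \frac{\sum_{t=j+1}^{n} (e_t - \bar{e})(e_{t-j} - \bar{e})}
  {\sum_{t=1}^{n} (e_t - \bar{e})^{2}} \\
Q &= n(n+2) \sum_{j=1}^{k} \frac{\hat{\rho}_j^{\,2}}{n - j}
\end{aligned}
```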
```python
import numpy as np
from typing import Dict
from statsmodels.stats.diagnostic import acorr_ljungbox


def residual_analysis(
    actuals: np.ndarray,
    forecasts: np.ndarray,
    lags: int = 10,
    significance: float = 0.05
) -> Dict:
    residuals = np.asarray(actuals, dtype=float) - np.asarray(forecasts, dtype=float)

    # Ljung-Box test at a single maximum lag; the series must be longer than `lags`
    lb_result = acorr_ljungbox(residuals, lags=[lags], return_df=True)
    lb_stat = float(lb_result['lb_stat'].iloc[-1])
    lb_pval = float(lb_result['lb_pvalue'].iloc[-1])
    autocorrelated = lb_pval < significance

    residual_mean = float(np.mean(residuals))
    residual_std = float(np.std(residuals))
    max_abs_residual = float(np.max(np.abs(residuals)))

    # Heuristic: treat a mean residual larger than half a standard deviation as bias
    biased = abs(residual_mean) > residual_std * 0.5

    if autocorrelated and biased:
        diagnosis = "RETRAIN: systematic bias with autocorrelation"
    elif autocorrelated:
        diagnosis = "RETRAIN: autocorrelated residuals indicate model misspecification"
    elif biased:
        diagnosis = "RECALIBRATE: bias without autocorrelation"
    else:
        diagnosis = "PASS: residuals appear well-behaved"

    return {
        'ljung_box_stat': round(lb_stat, 4),
        'ljung_box_pvalue': round(lb_pval, 4),
        'autocorrelated': autocorrelated,
        'residual_mean': round(residual_mean, 4),
        'residual_std': round(residual_std, 4),
        'max_abs_residual': round(max_abs_residual, 4),
        'diagnosis': diagnosis,
    }
```
Not every model failure requires a full retrain. Retraining means rebuilding the model from scratch on a new or expanded dataset — a significant undertaking for complex models. Recalibration means adjusting existing parameters, updating intercepts, or applying a bias correction factor. Knowing which intervention is appropriate requires reading the diagnostic signals together.
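As a rough illustration of the lightest form of recalibration, the sketch below applies an additive bias correction: it shifts new forecasts by the mean residual observed on recent history. The function name `apply_bias_correction` and the sample numbers are hypothetical, and an additive shift is only one option; a multiplicative factor or an intercept update may suit some models better.

```python
import numpy as np

def apply_bias_correction(
    past_actuals: np.ndarray,
    past_forecasts: np.ndarray,
    new_forecasts: np.ndarray,
) -> np.ndarray:
    """Shift new forecasts by the mean residual observed on past data."""
    past_actuals = np.asarray(past_actuals, dtype=float)
    past_forecasts = np.asarray(past_forecasts, dtype=float)
    # Positive mean residual: the model has been under-forecasting.
    mean_residual = np.mean(past_actuals - past_forecasts)
    return np.asarray(new_forecasts, dtype=float) + mean_residual

# Hypothetical history in which the model under-forecasts by 5, 4, 6, and 5 units.
past_actuals = np.array([105.0, 98.0, 112.0, 101.0])
past_forecasts = np.array([100.0, 94.0, 106.0, 96.0])

corrected = apply_bias_correction(past_actuals, past_forecasts, np.array([110.0, 120.0]))
print(corrected)
```

The mean residual here is 5, so each new forecast moves up by 5 units. A correction like this only makes sense when the residuals are not autocorrelated; if they are, the shift hides a structural problem rather than fixing it.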
| Diagnostic Signal | Recommended Action | Rationale |
|---|---|---|
| MASE > 1.0 | Retrain | Model underperforms a naïve baseline — structural failure |
| Autocorrelated + bias | Retrain | Model is missing a systematic component; recalibration cannot fix this |
| Non-autocorrelated + bias | Recalibrate | Model structure is correct; apply bias correction or update intercept |
| All metrics passing | Monitor | Continue scheduled evaluation; no intervention needed |
| Theil's U > 1.0 despite low MAPE | Retrain | Model exploits MAPE asymmetry; real-world performance is worse than reported |
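The table can be turned into a small rule for automated reporting. The sketch below is one possible encoding, not a definitive policy: the function name `recommend_action`, the order in which the checks are applied, and the choice to treat any Theil's U above 1.0 or any autocorrelation as a retrain signal are assumptions made for illustration.

```python
def recommend_action(
    mase: float,
    theil_u: float,
    autocorrelated: bool,
    biased: bool,
) -> str:
    """Map diagnostic signals to one of: 'retrain', 'recalibrate', 'monitor'."""
    # Worse than a naive benchmark on either scaled metric: structural failure.
    if mase > 1.0 or theil_u > 1.0:
        return "retrain"
    # Autocorrelated residuals point to a missing systematic component.
    if autocorrelated:
        return "retrain"
    # Bias without autocorrelation: structure looks sound, so correct the level.
    if biased:
        return "recalibrate"
    return "monitor"

# Hypothetical diagnostic readings for three models.
print(recommend_action(mase=1.3, theil_u=0.9, autocorrelated=False, biased=False))
print(recommend_action(mase=0.8, theil_u=0.7, autocorrelated=False, biased=True))
print(recommend_action(mase=0.8, theil_u=0.7, autocorrelated=False, biased=False))
```

Checking the benchmark comparisons first means a model that loses to the naïve forecast is never sent for a simple bias correction, whatever its residuals look like.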
"A forecast model that passes its MAPE target while underperforming a naïve benchmark is not a model that works — it is a model that has learned to game a poorly chosen metric."
A forecasting process that detects its own degradation before it compounds into operational decisions prevents the inventory overages, cash flow shortfalls, and staffing errors that follow from using a model past its useful life.
A documented validation framework — metrics, thresholds, and a structured retrain trigger — satisfies the model governance requirements that PE-backed companies and regulated entities increasingly face from lenders and auditors.
When a board or lender asks how confident you are in your revenue forecast, the answer should not be "the model said so." A residual analysis and four-metric report is the kind of evidence that supports a capital decision.