M/05 — Selection bias decomposition
Decomposing the IS-OOS Sharpe gap
A variance budget that splits in-sample to out-of-sample Sharpe degradation into selection bias, parameter-choice noise, and residual skill — across ten deep walk-forward crypto corpora.
The mathematics
Suppose a strategy family is indexed by a parameter k = 1, …, K. For each k you observe an in-sample Sharpe S_IS(k) and an out-of-sample Sharpe S_OOS(k). The standard practice — pick the in-sample maximum and hope for the best out-of-sample — defines the IS-OOS gap as

Δ = E[S_IS(k*) − S_OOS(k*)],  where k* = argmax_k S_IS(k).
Decompose Δ additively. Write each S_IS(k) as a true skill term plus a parameter-choice deviation plus noise:

S_IS(k) = μ(k) + δ(k) + ε(k),  with ε(k) ~ N(0, σ²).
Then the gap splits into three terms:

Δ = E[ε(k*)] + E[δ(k*)] + E[μ(k*) − S_OOS(k*)],

that is, selection bias plus parameter-choice noise plus residual skill.
The first term is purely mechanical. For K independent N(0, σ²) draws, Mills’ bound gives the expected maximum

E[max_k ε(k)] ≈ σ·√(2 log K),

which is tight to leading order. The second term is the within-grid parameter dispersion inherited by the chosen k*. Only the third term — residual skill — should depend on the strategy actually doing something useful out-of-sample.
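A few lines of simulation make the null concrete. The sketch below is illustrative only (it is not part of the companion repository): draw K iid N(0, σ²) in-sample Sharpes per trial, take the in-sample winner, and compare its average Sharpe with the σ·√(2 log K) value. Under the null the winner's expected OOS Sharpe is zero, so the whole gap is the selection term.

import numpy as np

rng = np.random.default_rng(42)
K, sigma, n_trials = 400, 1.0, 10_000

# No-skill null: every in-sample Sharpe is pure estimation noise.
eps = rng.normal(0.0, sigma, size=(n_trials, K))
winner_is_sharpe = eps.max(axis=1)   # IS Sharpe of the in-sample "best" parameter

# Under the null the winner's true (and expected OOS) Sharpe is 0,
# so the IS-OOS gap equals this selection-bias term.
print("mean IS Sharpe of winner:", round(winner_is_sharpe.mean(), 3))
# Leading-order bound; the simulated mean sits somewhat below it at finite K.
print("sigma * sqrt(2 log K)  :", round(sigma * np.sqrt(2 * np.log(K)), 3))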
The variance budget
The companion repository runs a two-way decomposition of OOS-Sharpe variance over a (strategy × window) panel, observing a battery of proprietary perturbations per cell — the specific perturbation set is part of Daru Finance’s consulting work and is not enumerated here. The panel-level identity is

Var[S_OOS] = Vparam + Vstrategy + Vwindow + Vfinite + R,

where Vparam = E[Var_t S_OOS(s, w, t)] is the within-cell variance across perturbations (knife-edge sensitivity), Vstrategy and Vwindow are the between-strategy and between-window main effects (variance of group means), Vfinite = (1 + ½·Sh²)/n is the analytic finite-sample noise floor, and R is the unexplained interaction.
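To make the identity concrete, here is a minimal sketch of how such a budget can be computed on a long-format panel. This is an illustrative reimplementation, not the repository’s decompose_oos_variance; the column names follow the load_metrics layout shown at the end of the page, and each row’s own OOS Sharpe and trade count stand in for Sh and n in the finite-sample floor.

import pandas as pd

def variance_budget(df: pd.DataFrame) -> dict:
    """Two-way variance budget over a panel with columns
    strategy, window, perturbation, S_OOS, n_trades (assumed layout)."""
    total = df["S_OOS"].var(ddof=0)

    # Within-cell variance across perturbations: knife-edge sensitivity.
    v_param = df.groupby(["strategy", "window"])["S_OOS"].var(ddof=0).mean()

    # Between-strategy and between-window main effects.
    grand = df["S_OOS"].mean()
    v_strategy = ((df.groupby("strategy")["S_OOS"].mean() - grand) ** 2).mean()
    v_window = ((df.groupby("window")["S_OOS"].mean() - grand) ** 2).mean()

    # Analytic finite-sample floor, (1 + 0.5*Sh^2)/n, evaluated per row.
    v_finite = ((1 + 0.5 * df["S_OOS"] ** 2) / df["n_trades"]).mean()

    residual = total - (v_param + v_strategy + v_window + v_finite)
    parts = {"v_param": v_param, "v_strategy": v_strategy,
             "v_window": v_window, "v_finite": v_finite, "residual": residual}
    return {name: value / total for name, value in parts.items()}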
Worked example
Take BTC_30m_27W from the deep-WFO crypto corpus: 538k OOS rows, 27 walk-forward windows. Mean OOS Sharpe across the surviving population is −0.696. Mean Vparam is 0.243. Live-proxy profitable rate (last two windows after the funnel filter): 8.4%.
Now plug into the null. With a parameter grid of K = 400 combinations and σ = 1 (typical IS Sharpe noise on 500 trades, q ≈ 0.2), the expected-max bound is

σ·√(2 log K) = √(2 log 400) ≈ 3.46.
That is to say: under no skill at all, picking the in-sample winner from a grid of 400 produces an expected IS Sharpe near +3.5 — and an OOS expectation near zero. The empirical gap is comfortably reproduced without invoking real edge. The interactive demo below recomputes this every time you move K or σ.
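The plug-in itself is one line of arithmetic, evaluating the same bound:

import math

K, sigma = 400, 1.0
print(sigma * math.sqrt(2 * math.log(K)))   # expected-max bound under the no-skill null, ≈ 3.46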
Demo — predicted IS-OOS gap under a no-skill null
For a parameter grid of size K with iid N(0, σ²) IS Sharpes, the expected maximum is ≈ σ·√(2 log K). Move the sliders — the bar shows the predicted gap, decomposed into selection bias, parameter noise, and residual skill.
Under the null hypothesis of no skill, the entire IS-OOS gap is mechanical: scaling with √(log K) (selection) and σ (parameter noise). The residual skill term sits at zero.
At K = 400 and σ = 1, the expected-max bound is ≈ 3.46 — the entire IS-OOS gap a strategy population reports under the null can be reproduced without invoking any real edge.
Empirical decomposition
On the canonical 10-asset corpus (ETH, BTC, LTC, TRX, XRP, LINK, ZEC, DOGE, BCH, AVAX — all 30m crypto with ≥17 walk-forward windows; 7.27M OOS rows; 289,374 live-proxy candidates), the pooled variance decomposition is:
- Vparam — 18.6% (parameter-choice noise across perturbations).
- Vstrategy — 16.8% (between-strategy main effect).
- Vfinite — 3.1% (analytic finite-sample floor).
- Vwindow — 1.2% (between-window means / regime proxy).
- Residual / interaction — 60.4%.
Mean IS-OOS Sharpe degradation across this corpus is 0.84. The residual-skill term is statistically indistinguishable from zero — the gap is mechanical.
Synthetic null
To check that the decomposition is not an artefact of the panel structure, the same pipeline runs on synthetic data with three planted classes (robust / fragile / noise) and a deterministic RNG. The variance shape and the decile lift are reproduced on the synthetic panel — confirming that the decomposition identifies mechanical structure rather than anomalies peculiar to the real corpus.
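The planted-class generator lives in the repository; the class means and perturbation spreads below are hypothetical, but the sketch shows the idea — three classes that differ in edge and in knife-edge sensitivity, generated from a seeded RNG so the run is reproducible.

import numpy as np
import pandas as pd

def synthetic_panel(n_per_class=50, n_windows=20, n_perturb=8, seed=7) -> pd.DataFrame:
    """Plant three classes: robust (edge, low perturbation spread),
    fragile (same edge, high perturbation spread), noise (no edge)."""
    rng = np.random.default_rng(seed)               # deterministic RNG
    classes = {"robust": (0.8, 0.1), "fragile": (0.8, 0.8), "noise": (0.0, 0.4)}
    rows = []
    for cls, (mu, sigma_perturb) in classes.items():
        for s in range(n_per_class):
            for w in range(n_windows):
                window_shift = rng.normal(0.0, 0.3)             # regime noise per cell
                s_oos = mu + window_shift + rng.normal(0.0, sigma_perturb, n_perturb)
                for t, x in enumerate(s_oos):
                    rows.append((f"{cls}_{s}", cls, w, t, x))
    return pd.DataFrame(rows, columns=["strategy", "cls", "window", "perturbation", "S_OOS"])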
Why this matters for systematic strategies
Two practical consequences. First, any positive-edge claim from in-sample maximisation over a parameter grid of size K must clear σ·√(2 log K) before it should be taken seriously — and σ is rarely smaller than 1 on bar-level crypto Sharpe estimates. Second, Vparam is fully observable in-sample (it is computed from the IS run alone, with no OOS leakage), so it can be used as a pre-filter at strategy-selection time. The deep-WFO empirics suggest the lowest-Vparam decile beats the highest by a factor of 2.2 in live-proxy profitability, with a monotonic trend across deciles.
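A sketch of how both consequences can be wired into a selection step follows; it is illustrative only, with assumed column names (strategy, perturbation, S_IS) rather than the repository’s API.

import numpy as np
import pandas as pd

def prefilter(is_runs: pd.DataFrame, k_grid: int, sigma: float = 1.0,
              keep_quantile: float = 0.10) -> pd.DataFrame:
    """is_runs: in-sample perturbation results only; no OOS data is touched."""
    hurdle = sigma * np.sqrt(2 * np.log(k_grid))    # selection-bias hurdle for a grid of size k_grid

    per_strategy = is_runs.groupby("strategy")["S_IS"].agg(
        v_param=lambda x: x.var(ddof=0),            # knife-edge sensitivity, observable in-sample
        best_is="max",
    )
    stable = per_strategy["v_param"] <= per_strategy["v_param"].quantile(keep_quantile)
    clears_hurdle = per_strategy["best_is"] > hurdle
    return per_strategy[stable & clears_hurdle]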
The deflated-Sharpe and PBO frameworks of Bailey-López de Prado attack the same problem from the test-statistic side. This decomposition is complementary: it works at the population level rather than per-strategy, and it produces a usable in-sample predictor (Vparam) of out-of-sample profitability.
Reproducibility
DaruFinance / strategy-overfitting
Python — open source reference implementation
Minimal invocation
from strategy_overfitting import (
load_metrics, decompose_oos_variance, param_vs_live_lift
)
# Walk-forward output: long-format DataFrame with
# columns: asset, strategy, window, perturbation, S_IS, S_OOS, n_trades
df = load_metrics(parquet_root="/data/strategies", min_trades=20, sharpe_clip=5)
# Two-way decomposition over (strategy x window) x perturbations
shares = decompose_oos_variance(df)
shares["share_v_param"] # 0.186 - parameter-choice noise
shares["share_v_strategy"] # 0.168 - between-strategy main effect
shares["share_v_window"] # 0.012 - regime / window
shares["share_v_finite"] # 0.031 - analytic finite-sample floor
shares["share_residual"] # 0.604 - interaction + unexplained
# Predictive validity: rank strategies by V_param (in-sample) and
# measure live-proxy profitable rate by decile.
lift = param_vs_live_lift(df, n_deciles=10)
lift.D1_pct, lift.D10_pct # e.g. 11.7, 5.3
References
- [1] Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2014). The probability of backtest overfitting. Journal of Computational Finance 20(4), 39–69.
- [2] Harvey, C. R. & Liu, Y. (2014). Backtesting. Journal of Portfolio Management 42(1), 13–28.
- [3] Lo, A. W. & MacKinlay, A. C. (1990). Data-snooping biases in tests of financial asset pricing models. Review of Financial Studies 3(3), 431–467.
- [4] López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley, ch. 11–12.