Research review · methods backbone · study 00

Backtest Overfitting & the Deflated Sharpe Ratio

Across ~92,500 real strategies in crypto, equities and forex, the best looks like a star but is a coin flip once you count the trials

If you try many strategy configurations and keep the best, the winner's backtest Sharpe is inflated by selection and tells you almost nothing. This study builds and verifies the quantitative apparatus that prices that inflation (the False Strategy Theorem, the Deflated Sharpe Ratio, the Probability of Backtest Overfitting and effective-trial counting), then turns it on three real, fully-costed corpora spanning roughly 92,500 strategies: 50,000 in crypto, 22,500 in US equities and 20,000 in forex. In every one of those asset classes the single best strategy fails to clear the multiple-testing null. The calculator below lets you reproduce the core mechanic in the browser.

source

Deflated Sharpe Ratio (2014)

Bailey & López de Prado · J. Portfolio Management 40(5)

Code & data

lopez-de-prado-work-review / projects / 00

the claim

What López de Prado says

  • If you try many strategy configurations and keep the best, the winner's backtest Sharpe is inflated by selection and tells you almost nothing, so any reported Sharpe must be discounted for the number of trials.
  • He built the apparatus to price that inflation: the False Strategy Theorem (the expected maximum Sharpe of N skill-less trials, via the Euler–Mascheroni expansion), the Deflated Sharpe Ratio, the Probability of Backtest Overfitting (CSCV), and effective-trial counting.
  • His demonstration: a pure random walk swept over thousands of parameter nodes produces an annualised Sharpe above 1 from noise alone. In his words, "any perseverant researcher will always be able to find a backtest with a desired Sharpe ratio."

our result

What we found

  • Run on three real, fully-costed corpora totalling ~92,500 strategies (50,000 crypto, 22,500 US equity, 20,000 forex), the single best strategy in every asset class lands below its False-Strategy-Theorem null (crypto 2.00 vs 2.72, equity 3.21 vs 10.89, forex 1.78 vs 7.70).
  • Deflated Sharpe of the corpus best is 0.029 in crypto and effectively 0.000 in equities and forex, all far under the 0.95 bar; nothing survives deflation.
  • Effective-trial counting confirms it: the nominal trial counts collapse to ~434 / 39 / 26 independent bets. The illusion is not a crypto quirk; it holds across markets, and the DSR is the gate every later study must clear.

The result in three lines

~92,500 strategies · 3 asset classes

The best is below the null everywhere

Three real, fully-costed corpora: 50,000 crypto strategies (20 pairs), 22,500 US-equity strategies (9 ETFs) and 20,000 forex strategies (8 majors). In every asset class the single corpus best lands below the False-Strategy-Theorem null: crypto 2.00 vs 2.72, equity 3.21 vs 10.89, forex 1.78 vs 7.70. What you'd get by luck is higher than what was found.

DSR < 0.95

Nothing survives deflation

The Deflated Sharpe Ratio of the corpus best is 0.029 in crypto and effectively 0.000 in equities and forex, all far below the 0.95 bar. Even the strongest strategy in each market is statistically indistinguishable from a lucky draw once selection is honestly accounted for.

Effective-N: 434 · 39 · 26

Multi-market, DSR-gated

Pooled, the 50,000 crypto trials are worth ~434 independent bets, the 22,500 equity trials ~39, and the 20,000 forex trials ~26 (eigenvalue participation ratio). The conclusion is gated on the Deflated Sharpe Ratio across all three asset classes, not on a single backtest.

At scale, ~92,500 real strategies across three asset classes

In crypto, equities and forex alike, the corpus best is below its null

Fig. 1: The multiple-testing illusion across three asset classes. Left: in crypto, US equity and forex the single corpus best annualised Sharpe (crypto 2.00, equity 3.21, forex 1.78) sits below the False-Strategy-Theorem null (2.72, 10.89, 7.70). Right: the nominal trial counts (50,000 / 22,500 / 20,000) collapse to a few hundred or fewer effective independent bets (434 / 39 / 26). None clears the deflated-significance bar.

The harness is built and self-tested on a clean grid (below), but the real test is at scale, and in more than one asset class. Three corpora carry the verdict: 50,000 deep walk-forward-optimised crypto strategies across 20 perp pairs, 22,500 US-equity strategies across 9 ETFs, and 20,000 forex strategies across 8 majors, roughly 92,500 strategies in all. Every one is realistically costed: crypto carries fee, slippage and funding; equity and forex run through a realism layer with time-of-day spreads, commission, FX rollover/triple-Wednesday swap with weekend force-flat, and equity short borrow.

In all three the single corpus best lands below the False Strategy Theorem's expected maximum for skill-less trials, and clears the 0.95 Deflated-Sharpe bar in none of them. Equity's high best Sharpe of 3.21 is long-beta index-ETF drift mined by correlated trials, still far below its null of 10.89 (DSR 0.000), with the 22,500 trials worth only about 39 independent bets. The illusion is not a crypto quirk; it holds across asset classes.

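| market | instruments | n strat | best SR | E[max] null | DSR | median PBO | eff. trials |
|---|---|---|---|---|---|---|---|
| Crypto | 20 pairs | 50,000 | 2.00 | 2.72 | 0.029 | 0.28 | 434 |
| US Equity | 9 ETFs | 22,500 | 3.21 | 10.89 | 0.000 | 0.00 | 39 |
| Forex | 8 majors | 20,000 | 1.78 | 7.70 | 0.000 | 0.005 | 26 |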

best SR and E[max] null are annualised; DSR uses the nominal trial count with the empirical trial-Sharpe dispersion (conservative). In every market the single best is below its False-Strategy-Theorem null and the Deflated Sharpe is far under the 0.95 bar.

Crypto in detail, 50,000 strategies, 20 pairs

The crypto corpus is 2,500 deep walk-forward-optimised strategies on each of 20 perp pairs, 50,000 strategies in all, structurally diverse and read from production per-trade ledgers that are already costed (5 bp fee, 3 bp slippage, 1 bp funding).

The corpus best annualises to a Sharpe of 2.00. That looks like a star. But the False Strategy Theorem says the expected maximum Sharpe of 50,000 skill-less trials is 2.72, higher than what was found. The Deflated Sharpe Ratio of that corpus best is 0.029, against a 0.95 bar. The 50,000 strategies carry only 434 independent bets, and the median per-pair probability of backtest overfitting is 0.28. Even the strongest strategy out of 50,000 is a coin flip once you count the trials.

Fig. 2: The 50,000-strategy crypto annualised-Sharpe distribution. The corpus best (Sharpe 2.00) sits to the left of the False-Strategy-Theorem null bar (2.72): the expected maximum of 50,000 skill-less trials is higher than the best strategy actually found.

| metric | value | note |
|---|---|---|
| strategies | 50,000 | 2,500 × 20 pairs |
| corpus best Sharpe | 2.00 | annualised, net of costs |
| E[max] null (N=50k) | 2.72 | False Strategy Theorem |
| Deflated Sharpe | 0.029 | bar is 0.95 |
| median per-pair PBO | 0.28 | CSCV |
| effective trials | 434 | of 50,000 |

Fig. 3: Per-pair probability of backtest overfitting (left) and effective number of independent trials (right). PBO clusters around its 0.28 median, and the nominal 2,500 trials per pair collapse to a few hundred effective bets; pooled across the corpus, 50,000 trials are worth about 434.

Effective-trial counting is what makes the verdict honest at this scale. Strategies inside a pair are correlated, so the 2,500 nominal trials per pair are worth only a few hundred independent bets; pooled across all 20 pairs the 50,000 strategies collapse to 434. Counting all 50,000 as independent raises the null and so makes the discovered best look easier to find by luck; the headline null (N=50k) and DSR use that nominal count, which is the conservative choice.

The numbers below are per pair, straight from the production ledgers. No pair's best clears its own null after deflation.

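| pair | T (bars) | best SR | PBO | eff. trials |
|---|---|---|---|---|
| AAVE | 1,772 | 0.96 | 0.25 | 327 |
| ALGO | 626 | 2.00 | 0.27 | 211 |
| APE | 1,251 | 0.84 | 0.47 | 274 |
| ARB | 626 | 1.60 | 0.50 | 166 |
| ATOM | 626 | 1.88 | 0.32 | 180 |
| AVAX | 1,772 | 1.17 | 0.35 | 332 |
| BCH | 2,085 | 0.84 | 0.14 | 411 |
| BNB | 1,563 | 0.81 | 0.15 | 186 |
| BTC | 2,817 | 0.62 | 0.19 | 234 |
| DOGE | 2,189 | 0.80 | 0.29 | 357 |
| DOT | 626 | 1.92 | 0.48 | 193 |
| ETC | 626 | 1.21 | 0.62 | 173 |
| ETH | 2,921 | 0.90 | 0.22 | 339 |
| LINK | 2,398 | 0.99 | 0.19 | 510 |
| LTC | 2,816 | 0.65 | 0.13 | 445 |
| NEAR | 1,772 | 0.90 | 0.39 | 367 |
| SOL | 1,457 | 1.25 | 0.24 | 427 |
| TRX | 2,607 | 0.77 | 0.27 | 292 |
| UNI | 1,772 | 1.17 | 0.36 | 444 |
| XLM | 626 | 1.75 | 0.49 | 193 |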

Best SR is the per-pair best in-sample annualised Sharpe; eff. trials is the eigenvalue participation ratio of the trial-return correlation matrix. Pooled corpus: best 2.00 vs null 2.72, DSR 0.029, 434 effective trials.

Equities and forex, the same picture at scale

Fig. 4: US equity, 22,500 strategies across 9 ETFs. The corpus best (Sharpe 3.21) is long-beta index-ETF drift mined by correlated trials, yet it sits far below the False-Strategy-Theorem null (10.89): DSR 0.000, only ~39 effective independent bets out of 22,500.

Fig. 5: Forex, 20,000 strategies across 8 majors. The corpus best (Sharpe 1.78) sits well below the False-Strategy-Theorem null (7.70): DSR 0.000, median PBO 0.005, only ~26 effective independent bets out of 20,000.

A controlled illustration

The same mechanism, isolated on a clean small grid: one structural family and 136 trials per instrument, so the only thing varying is the selection, not the strategy idea.

Before the ~92,500-strategy multi-market run above, the apparatus was built and verified on a deliberately clean, small grid: a single dual moving-average crossover swept over 136 (fast, slow) lookbacks per instrument, across crypto, equities and forex. This is the controlled illustration of the same effect the corpus shows at scale, small enough to reason about every trial, and to watch the winner sit at its own luck bar. From here on, a result is only "real" if it clears the Deflated Sharpe Ratio after honest trial-counting.
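To make the grid concrete, here is a minimal Python sketch of a dual moving-average crossover swept over 136 (fast, slow) pairs. Everything specific in it is an assumption for illustration: the 17-value lookback ladder is hypothetical (chosen only because 17 choose 2 equals 136), the long/short position rule is a guess at the simplest form of the strategy, and the function names are not taken from the study's harness. The 7 bp cost per unit of turnover is the crypto figure quoted in the Method section.

```python
import itertools
import random

# Hypothetical lookback ladder: 17 values, so C(17, 2) = 136 (fast, slow) pairs.
LOOKBACKS = [5, 8, 10, 13, 15, 20, 25, 30, 40, 50, 60, 80, 100, 120, 150, 200, 250]
GRID = list(itertools.combinations(LOOKBACKS, 2))  # every pair has fast < slow


def crossover_net_returns(prices, fast, slow, cost_per_unit=0.0007):
    """Per-bar net returns of a long/short dual-MA crossover (assumed rule).

    The signal at bar t uses prices up to and including t and is applied to
    the return from t to t+1, so there is no look-ahead. Costs are charged
    on the absolute change in position (turnover).
    """
    net = []
    prev_pos = 0.0
    for t in range(slow - 1, len(prices) - 1):
        fast_ma = sum(prices[t - fast + 1:t + 1]) / fast
        slow_ma = sum(prices[t - slow + 1:t + 1]) / slow
        pos = 1.0 if fast_ma > slow_ma else -1.0
        gross = pos * (prices[t + 1] / prices[t] - 1.0)
        cost = cost_per_unit * abs(pos - prev_pos)
        net.append(gross - cost)
        prev_pos = pos
    return net


# Skill-less test series: a driftless multiplicative random walk.
rng = random.Random(0)
prices = [100.0]
for _ in range(499):
    prices.append(prices[-1] * (1.0 + rng.gauss(0.0, 0.01)))

net = crossover_net_returns(prices, fast=10, slow=50)
print(len(GRID), len(net))
```

Each of the 136 net-return series produced this way is one trial; the statistics discussed in the rest of this section operate on that set of series.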

The bar a discovery must clear

The False Strategy Theorem gives the expected maximum Sharpe of N skill-less trials on T observations. Any target Sharpe is reachable by luck with enough trials, so the only honest question is whether a backtest beats its own luck bar. The calculator implements the theorem directly (the inverse-normal-CDF expansion with the Euler–Mascheroni correction) and pins a second panel to the study's real per-market medians. Drag N and T and watch the bar rise; load the 50,000-strategy corpus to see the best of 50k land below its null (E[max] ≈ 2.72), or the small N=136 grid to see the same effect on the controlled illustration.
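The luck bar can be sketched in a few lines of standard-library Python. Two inputs here are assumptions rather than facts stated on this page: skill-less per-bar Sharpe estimates are taken to have standard deviation 1/sqrt(T), and annualisation uses 252 periods per year. The function and parameter names are illustrative. Under those assumptions the sketch returns roughly 1.63 for N = 50,000 and T = 1,700, in line with the calculator's default reading.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant


def expected_max_z(n_trials):
    """Approximate expected maximum of n_trials independent standard normals."""
    if n_trials < 2:
        raise ValueError("the expansion needs at least two trials")
    inv = NormalDist().inv_cdf
    return ((1.0 - EULER_GAMMA) * inv(1.0 - 1.0 / n_trials)
            + EULER_GAMMA * inv(1.0 - 1.0 / (n_trials * math.e)))


def luck_bar(n_trials, n_obs, periods_per_year=252.0, sharpe_std=None):
    """Annualised expected maximum Sharpe of n_trials skill-less strategies.

    sharpe_std is the per-bar dispersion of trial Sharpe ratios; when it is
    not supplied, the null value 1/sqrt(n_obs) is assumed.
    """
    if sharpe_std is None:
        sharpe_std = 1.0 / math.sqrt(n_obs)
    return math.sqrt(periods_per_year) * sharpe_std * expected_max_z(n_trials)


bar = luck_bar(50_000, 1_700)
print(round(bar, 3))  # about 1.63 under the stated assumptions
```

Passing an empirical `sharpe_std` instead of the default moves the bar in proportion, which is why a corpus with widely dispersed trial Sharpes can carry a much higher null than the default setting suggests.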

Demo: False Strategy Theorem / Deflated Sharpe

Pick how many configurations you tried (N) and how many observations you have (T). The calculator returns the Sharpe you should expect to stumble on by luck alone, which is the bar a real edge must clear. Then test your own backtest Sharpe against it.

[Interactive calculator: E[max SR] under the null. Inputs: N, number of trials (log slider, 1 to 100k); T, observations (bars); your annualised Sharpe. Outputs: luck bar E[max SR] (annualised); your Sharpe vs bar; DSR-style p(edge is real); bar chart of your backtest against the luck bar. Default state: N = 50,000, T = 1,700, Sharpe 2.00, luck bar 1.632, margin +0.368, p = 0.83, clears the bar.]

At scale: single corpus best vs its own null, by asset class
| market | best SR | E[max] null | DSR | PBO |
|---|---|---|---|---|
| Crypto (50k · 20 pairs) | 2.001 | 2.721 | 0.03 | 0.28 |
| US Equity (22.5k · 9 ETFs) | 3.206 | 10.886 | 0.00 | 0.00 |
| Forex (20k · 8 majors) | 1.779 | 7.701 | 0.00 | 0.01 |

Across ~92,500 real, fully-costed strategies in three asset classes, the single corpus best Sharpe (red) sits below its luck bar (amber) everywhere (crypto 2.00 vs 2.72, equity 3.21 vs 10.89, forex 1.78 vs 7.70), and the deflated Sharpe never approaches the 0.95 bar. Load the corpus (N=50k) to trace the crypto case in the chart above.

At N=50,000 trials and T=1,700 observations, the luck bar is 1.632 annualised. Your Sharpe of 2.00 clears it, which is necessary but not sufficient; the full DSR also docks for skew, kurtosis and correlated trials.

The harness was self-tested against Monte Carlo: the formula matches simulation to about 1e-3 and converges (N=10: 0.050 vs 0.048; N=100: 0.080 vs 0.080; N=1000: 0.103 vs 0.103). On a constructed example a skill-less winner deflates to DSR 0.32 while a genuine per-bar edge deflates to 0.87, so the statistic does its job.
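A self-test of this kind can be sketched as follows. This is not the study's harness: it assumes T = 1,000 observations (the page does not state the T behind the quoted figures, but 1,000 gives formula values close to them under a 1/sqrt(T) dispersion), and it simulates the per-bar Sharpe estimates directly as normal draws instead of simulating full return paths. All names are illustrative.

```python
import math
import random
from statistics import NormalDist, fmean

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant


def fst_expected_max(n_trials, n_obs):
    """False-Strategy-Theorem expected max per-bar Sharpe under the null."""
    inv = NormalDist().inv_cdf
    z = ((1.0 - EULER_GAMMA) * inv(1.0 - 1.0 / n_trials)
         + EULER_GAMMA * inv(1.0 - 1.0 / (n_trials * math.e)))
    return z / math.sqrt(n_obs)


def simulated_expected_max(n_trials, n_obs, reps=1000, seed=0):
    """Monte Carlo mean of the best of n_trials skill-less Sharpe estimates.

    Simplification: each Sharpe estimate is drawn as N(0, 1/n_obs).
    """
    rng = random.Random(seed)
    sd = 1.0 / math.sqrt(n_obs)
    return fmean(max(rng.gauss(0.0, sd) for _ in range(n_trials))
                 for _ in range(reps))


results = {n: (fst_expected_max(n, 1000), simulated_expected_max(n, 1000))
           for n in (10, 100, 1000)}
for n, (formula, simulated) in results.items():
    print(f"N={n}: formula {formula:.3f} vs simulation {simulated:.3f}")
```

The expansion is asymptotic, so the gap between formula and simulation is widest at small N and narrows as N grows.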

Fig. 6: Controlled MA-grid illustration (136 trials). Best in-sample Sharpe (observed) versus the expected maximum Sharpe of skill-less trials (the False-Strategy-Theorem null). In every market the observed best is at or below the null: the apparent edge is what 136 trials buys you by luck.

In the small grid too, the best is at or below luck

In all three markets the median best-of-136 in-sample Sharpe sits below the False-Strategy-Theorem expectation for skill-less trials. Forex is the starkest: a best-of-136 median Sharpe of 0.53 against a null of 1.06, so you would expect to do better than that by chance. Thin trends plus the spread make forex's deflated significance the lowest of the three.

Nothing survives deflation

The Deflated Sharpe Ratio is the probability that the winner's true Sharpe exceeds the expected-max-under-the-null benchmark, after correcting for the number of trials, their dispersion, and the skew and kurtosis of returns. Median DSR is 0.10–0.44 everywhere, against a 0.95 significance bar. Alongside it, the Probability of Backtest Overfitting reaches 0.70 in crypto: the best in-sample configuration is more likely than not to be a below-median performer out of sample.
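As a rough sketch of that definition, the Deflated Sharpe Ratio can be written as the probabilistic Sharpe ratio evaluated at the luck-bar benchmark. The version below works in per-bar units, takes the benchmark as an input, and defaults to normal returns (skew 0, kurtosis 3); the function name, the 252-period annualisation and the example inputs are assumptions for illustration. Fed the calculator's default state (Sharpe 2.00, luck bar 1.632, T = 1,700) it gives about 0.83.

```python
import math
from statistics import NormalDist


def deflated_sharpe(sr, sr_null, n_obs, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds the benchmark sr_null.

    sr and sr_null are per-bar (non-annualised) Sharpe ratios; skew and kurt
    are the skewness and (non-excess) kurtosis of the per-bar returns.
    """
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr * sr)
    z = (sr - sr_null) * math.sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)


ann = math.sqrt(252.0)  # assumed annualisation factor
p = deflated_sharpe(2.00 / ann, 1.632 / ann, 1_700)
print(round(p, 2))  # about 0.83 with these inputs
```

A winner that sits below its benchmark always scores under 0.5 on this measure, which is why none of the three corpus bests can come near the 0.95 bar.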

Fig. 7: Controlled MA-grid illustration (136 trials). Per-instrument Deflated Sharpe Ratio (left) and Probability of Backtest Overfitting (right). No instrument's DSR approaches the 0.95 bar; crypto's PBO median of 0.70 says the in-sample winner usually lands in the bottom half out of sample.

Fig. 8: Controlled MA-grid illustration (136 trials). In-sample versus out-of-sample Sharpe across the trial grid. The configuration you would have picked in-sample is not the one that wins out-of-sample, the textbook signature of selection on noise.

In-sample is not out-of-sample

Plotting each trial's in-sample Sharpe against its out-of-sample Sharpe makes the overfitting visible: the best in-sample point is nowhere near the best out-of-sample point. The ranking you would have selected on does not carry over.

136 trials ≈ 3 independent bets

The grid feels like 136 tests, but the strategies are roughly 55% pairwise-correlated, so the effective number of independent trials (the eigenvalue participation ratio of the trial-return correlation matrix) is about three. This is exactly why the null must be fed the effective count, not the nominal one: counting all 136 as independent would overstate how easy the best draw was to find.
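The arithmetic behind "about three" can be checked with a small sketch. The participation ratio is (Σλ)² / Σλ² over the eigenvalues λ of the correlation matrix; for a symmetric matrix Σλ is the trace and Σλ² is the sum of squared entries, so no eigendecomposition is needed. The equicorrelated matrix below (every off-diagonal entry 0.55) is a simplifying assumption standing in for the real trial-return correlations, and the function names are illustrative.

```python
def participation_ratio(corr):
    """Effective number of independent trials: (sum of eigenvalues)^2 / sum of eigenvalues^2.

    Uses trace(C) = sum(lambda) and sum of squared entries = sum(lambda^2),
    which hold for any symmetric matrix C.
    """
    n = len(corr)
    trace = sum(corr[i][i] for i in range(n))
    sum_sq = sum(c * c for row in corr for c in row)
    return trace * trace / sum_sq


def equicorrelated(n, rho):
    """n x n correlation matrix with every pairwise correlation equal to rho."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


n_eff = participation_ratio(equicorrelated(136, 0.55))
print(round(n_eff, 2))  # 3.25
```

With zero correlation the same function returns the nominal count, and with perfect correlation it returns one, so the measure interpolates between "136 tests" and "one test run 136 times".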

Fig. 9: Controlled MA-grid illustration (136 trials). Nominal versus effective trials per instrument. The 136-point grid collapses to about three independent bets once trial correlation is accounted for.

The verdict expressed in years

The Deflated Sharpe Ratio prices the inflation as a probability; the Minimum Track Record Length (MinTRL) and Minimum Backtest Length (MinBTL) price it in time. MinTRL is the track length that would be needed for the best Sharpe to clear its multiple-testing null with confidence; MinBTL is the track length below which a skill-less search of that many trials would be expected to throw up a Sharpe as high as the one observed, by luck alone.

MinTRL is infinite in all three markets: the best Sharpe is already at or below its null, so no finite track makes it credible against that bar. MinBTL is the same deflated verdict said in years. The crypto search needs about 4.5 years of track to justify 50,000 trials but ran on roughly 2.5; forex needs about 5.1 years and ran on roughly 3.7. The searches are simply too wide for the track they had.
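Both quantities can be sketched from their commonly stated closed forms in the Bailey and López de Prado line of papers; whether the study's harness uses exactly these expressions is an assumption. MinBTL in years is the squared ratio of the expected maximum of N standard normals to the annualised best Sharpe, and MinTRL (in observations) diverges when the Sharpe does not exceed its benchmark. Fed the nominal trial counts and annualised best Sharpes quoted above, the MinBTL sketch lands close to the 4.5 and 5.1 years cited. Function names are illustrative.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant


def min_btl_years(n_trials, sr_annual):
    """Track length (years) below which n_trials skill-less tries are
    expected to produce an annualised Sharpe as high as sr_annual."""
    inv = NormalDist().inv_cdf
    z = ((1.0 - EULER_GAMMA) * inv(1.0 - 1.0 / n_trials)
         + EULER_GAMMA * inv(1.0 - 1.0 / (n_trials * math.e)))
    return (z / sr_annual) ** 2


def min_trl_obs(sr, sr_null, skew=0.0, kurt=3.0, confidence=0.95):
    """Observations needed for a per-bar Sharpe sr to beat sr_null at the
    given confidence; infinite when sr is at or below sr_null."""
    if sr <= sr_null:
        return math.inf
    z_alpha = NormalDist().inv_cdf(confidence)
    spread = 1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr * sr
    return 1.0 + spread * (z_alpha / (sr - sr_null)) ** 2


print(round(min_btl_years(50_000, 2.00), 2))  # crypto, nominal N
print(round(min_btl_years(20_000, 1.78), 2))  # forex, nominal N
print(min_trl_obs(2.00, 2.72))                # best below its null
```

The last call returns infinity because the ordering of Sharpe and benchmark is unit-free; for a finite MinTRL both arguments should be per-bar values.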

Fig. 10: Observed track length versus the track length required to trust the best of each search (log scale, years). Crypto and forex fall short of their nominal-N MinBTL (about 4.5 years against 2.5, and 5.1 against 3.7); only equity, on nearly two decades of data, clears its bar, yet its best still fails the deflated test because its null is so high. MinTRL is infinite in every market.

| market · best instrument | best SR | E[max] null | track (yrs) | MinTRL (yrs) | MinBTL (yrs) |
|---|---|---|---|---|---|
| Crypto · ALGO | 2.00 | 2.72 | 2.48 | infinite | 4.49 |
| US Equity · QQQ | 3.21 | 10.89 | 19.35 | infinite | 1.60 |
| Forex · USDJPY | 1.78 | 7.70 | 3.71 | infinite | 5.12 |

Sharpe ratios are annualised; MinBTL uses the nominal trial count. MinTRL is infinite wherever the best Sharpe is at or below its null. MinBTL is the deflated-Sharpe verdict expressed in years of required track: crypto and forex ran well short of theirs.

By-market summary, controlled MA-grid illustration (136 trials)

Median values per market for the small controlled grid, net of costs, on information-driven bars. best SR is the best in-sample annualised Sharpe; E[max] null is the False-Strategy-Theorem expectation for skill-less trials; DSR is the Deflated Sharpe Ratio; PBO is the Probability of Backtest Overfitting; the last column is nominal → effective trials.

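| market | best SR | E[max] null | DSR | PBO | trials |
|---|---|---|---|---|---|
| Crypto (8 perps) | 0.62 | 0.66 | 0.44 | 0.70 | 136 → 3 |
| Equities (7 ETFs) | 0.31 | 0.44 | 0.32 | 0.38 | 136 → 3 |
| Forex (8 majors) | 0.53 | 1.06 | 0.10 | 0.37 | 136 → 3 |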

The cross-market consistency is the point of the illustration: the multiple-testing effect is not a crypto quirk; equities and forex show it too on the clean grid. And the ~92,500-strategy run at the top of this page is the same lesson at scale: real, fully-costed searches in all three asset classes where even the single best strategy out of tens of thousands does not clear its null.

Method

  • One structural family per instrument: a dual moving-average crossover swept over a grid of (fast, slow) lookbacks, 136 trials each, so the only thing varying is the selection, not the strategy idea.
  • Information-driven bars: dollar bars for crypto and equities (equities within regular hours), tick bars for forex, with realistic non-zero costs applied to turnover (crypto 7 bp, equities 2 bp, forex 1 bp per unit).
  • Every trial's net per-bar return series feeds the harness, which returns the best in-sample annualised Sharpe, the Deflated Sharpe Ratio, the Probability of Backtest Overfitting (CSCV, 924 splits) and the effective number of independent trials.
  • The False Strategy Theorem supplies the null: the expected maximum Sharpe of N skill-less trials. A discovered Sharpe is only interesting if it clears that bar, and then the DSR docks it further for skew, kurtosis and correlated trials.
  • Effective-trial counting via the eigenvalue participation ratio of the trial-return correlation matrix: correlated trials are not independent, so the null is fed the effective count, not the nominal one.
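The CSCV step in the list above can be sketched as follows, assuming the usual combinatorially symmetric cross-validation recipe: split the bars into 12 blocks, take every choice of 6 blocks as in-sample (924 splits), pick the in-sample winner, and record whether its out-of-sample rank falls in the bottom half. This is a minimal standard-library version on synthetic noise, not the study's harness; the names, the 20 × 240 test matrix and the per-bar Sharpe definition are illustrative assumptions.

```python
import itertools
import math
import random


def pbo_cscv(trial_returns, n_blocks=12):
    """Probability of Backtest Overfitting via CSCV.

    trial_returns: list of trials, each an equal-length list of per-bar returns.
    Returns (pbo, number_of_splits).
    """
    n_trials = len(trial_returns)
    size = len(trial_returns[0]) // n_blocks  # bars per block; remainder is dropped
    # Per-trial, per-block sums and sums of squares, so each split is cheap.
    sums = [[sum(r[b * size:(b + 1) * size]) for b in range(n_blocks)]
            for r in trial_returns]
    sqs = [[sum(x * x for x in r[b * size:(b + 1) * size]) for b in range(n_blocks)]
           for r in trial_returns]

    def sharpe(trial, blocks):
        n = size * len(blocks)
        mean = sum(sums[trial][b] for b in blocks) / n
        var = sum(sqs[trial][b] for b in blocks) / n - mean * mean
        return mean / math.sqrt(var) if var > 0 else 0.0

    logits = []
    for in_sample in itertools.combinations(range(n_blocks), n_blocks // 2):
        out_sample = [b for b in range(n_blocks) if b not in in_sample]
        is_sr = [sharpe(i, in_sample) for i in range(n_trials)]
        oos_sr = [sharpe(i, out_sample) for i in range(n_trials)]
        winner = max(range(n_trials), key=lambda i: is_sr[i])
        # Out-of-sample rank of the in-sample winner: 1 = worst, n_trials = best.
        rank = sum(1 for v in oos_sr if v <= oos_sr[winner])
        omega = rank / (n_trials + 1)
        logits.append(math.log(omega / (1.0 - omega)))
    pbo = sum(1 for x in logits if x <= 0.0) / len(logits)
    return pbo, len(logits)


rng = random.Random(1)
noise = [[rng.gauss(0.0, 0.01) for _ in range(240)] for _ in range(20)]
pbo, n_splits = pbo_cscv(noise)
print(n_splits, round(pbo, 2))
```

A logit at or below zero means the in-sample winner finished at or below the out-of-sample median, so the PBO is simply the share of splits where that happens.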

Notes & limitations

Both runs deliberately find nothing, and that is the finding. The controlled grid (one structural family, 136 trials per instrument) isolates the mechanism; the ~92,500-strategy multi-market run shows it survives contact with real, fully-costed searches of thousands of correlated trials per instrument in crypto, equities and forex. The CSCV uses twelve blocks (924 combinations). The DSR uses the nominal trial count with the empirical trial-Sharpe dispersion, which is conservative: feeding it the effective count would raise the DSR slightly, and the "not significant" verdict holds either way.

Reproducibility

The harness, its self-tests, and the at-scale run are collected in the companion repository, project 00 of lopez-de-prado-work-review. The calculator on this page is self-contained: the False-Strategy-Theorem formula and the per-market medians are encoded in the component, so the mechanic it illustrates reproduces exactly on every load.

References

The primary sources for the apparatus reviewed here:

See also

The selection-discipline theme that runs through this study is developed narratively in The edge is in the process, and the broader body of work is at Research.

Backtest Overfitting & the Deflated Sharpe Ratio | Daru Finance