Research review · methods backbone · study 00
Backtest Overfitting & the Deflated Sharpe Ratio
Across ~92,500 real strategies in crypto, equities and forex, the best looks like a star but is a coin flip once you count the trials
If you try many strategy configurations and keep the best, the winner's backtest Sharpe is inflated by selection and tells you almost nothing. This study builds and verifies the quantitative apparatus that prices that inflation (the False Strategy Theorem, the Deflated Sharpe Ratio, the Probability of Backtest Overfitting and effective-trial counting), then turns it on three real, fully-costed corpora spanning roughly 92,500 strategies: 50,000 in crypto, 22,500 in US equities and 20,000 in forex. In every one of those asset classes the single best strategy fails to clear the multiple-testing null, and the calculator below lets you reproduce the core mechanic in the browser.
Deflated Sharpe Ratio (2014)
Bailey & López de Prado · J. Portfolio Management 40(5)
Code & data
lopez-de-prado-work-review / projects / 00
the claim
What López de Prado says
- If you try many strategy configurations and keep the best, the winner's backtest Sharpe is inflated by selection and tells you almost nothing, so any reported Sharpe must be discounted for the number of trials.
- He built the apparatus to price that inflation: the False Strategy Theorem (the expected maximum Sharpe of N skill-less trials, via the Euler–Mascheroni expansion), the Deflated Sharpe Ratio, the Probability of Backtest Overfitting (CSCV), and effective-trial counting.
- His demonstration: a pure random walk swept over thousands of parameter nodes produces an annualised Sharpe above 1 from noise alone. In his words, "any perseverant researcher will always be able to find a backtest with a desired Sharpe ratio."
our result
What we found
- Across three real, fully-costed corpora totalling ~92,500 strategies (50,000 crypto, 22,500 US equity, 20,000 forex), the single best strategy in every asset class lands below its False-Strategy-Theorem null (crypto 2.00 vs 2.72, equity 3.21 vs 10.89, forex 1.78 vs 7.70).
- Deflated Sharpe of the corpus best is 0.029 in crypto and effectively 0.000 in equities and forex, all far under the 0.95 bar; nothing survives deflation.
- Effective-trial counting confirms it: the nominal trial counts collapse to ~434 / 39 / 26 independent bets in crypto, equities and forex respectively. The illusion is not a crypto quirk: it holds across markets, and the DSR is the gate every later study must clear.
The result in three lines
~92,500 strategies · 3 asset classes
The best is below the null everywhere
Three real, fully-costed corpora: 50,000 crypto strategies (20 pairs), 22,500 US-equity strategies (9 ETFs) and 20,000 forex strategies (8 majors). In every asset class the single corpus best lands below the False-Strategy-Theorem null: crypto 2.00 vs 2.72, equity 3.21 vs 10.89, forex 1.78 vs 7.70. What you'd get by luck is higher than what was found.
DSR < 0.95
Nothing survives deflation
The Deflated Sharpe Ratio of the corpus best is 0.029 in crypto and effectively 0.000 in equities and forex, all far below the 0.95 bar. Even the strongest strategy in each market is statistically indistinguishable from a lucky draw once selection is honestly accounted for.
Effective-N: 434 · 39 · 26
Multi-market, DSR-gated
Pooled, the 50,000 crypto trials are worth ~434 independent bets, the 22,500 equity trials ~39, and the 20,000 forex trials ~26 (eigenvalue participation ratio). The conclusion is gated on the Deflated Sharpe Ratio across all three asset classes, not on a single backtest.
At scale, ~92,500 real strategies across three asset classes
In crypto, equities and forex alike, the corpus best is below its null
The harness is built and self-tested on a clean grid (below), but the real test is at scale, and in more than one asset class. Three corpora carry the verdict: 50,000 deep walk-forward-optimised crypto strategies across 20 perp pairs, 22,500 US-equity strategies across 9 ETFs, and 20,000 forex strategies across 8 majors, roughly 92,500 strategies in all. Every one is realistically costed: crypto carries fee, slippage and funding; equity and forex run through a realism layer with time-of-day spreads, commission, FX rollover/triple-Wednesday swap with weekend force-flat, and equity short borrow.
In all three, the single corpus best lands below the False Strategy Theorem's expected maximum for skill-less trials, and none clears the 0.95 Deflated-Sharpe bar. Equity's high best Sharpe of 3.21 is long-beta index-ETF drift mined by correlated trials; it is still far below its null of 10.89 (DSR 0.000), with the 22,500 trials worth only about 39 independent bets. The illusion is not a crypto quirk; it holds across asset classes.
| market | n strat | best SR | E[max] null | DSR | median PBO | eff. trials |
|---|---|---|---|---|---|---|
| Crypto (20 pairs) | 50,000 | 2.00 | 2.72 | 0.029 | 0.28 | 434 |
| US Equity (9 ETFs) | 22,500 | 3.21 | 10.89 | 0.000 | 0.00 | 39 |
| Forex (8 majors) | 20,000 | 1.78 | 7.70 | 0.000 | 0.005 | 26 |
best SR and E[max] null are annualised; DSR uses the nominal trial count with the empirical trial-Sharpe dispersion (conservative). In every market the single best is below its False-Strategy-Theorem null and the Deflated Sharpe is far under the 0.95 bar.
Crypto in detail, 50,000 strategies, 20 pairs
The crypto corpus is 2,500 deep walk-forward-optimised strategies on each of 20 perp pairs, 50,000 strategies in all, structurally diverse and read from production per-trade ledgers that are already costed (5 bp fee, 3 bp slippage, 1 bp funding).
The corpus best annualises to a Sharpe of 2.00. That looks like a star. But the False Strategy Theorem says the expected maximum Sharpe of 50,000 skill-less trials is 2.72, higher than what was found. The Deflated Sharpe Ratio of that corpus best is 0.029, against a 0.95 bar. The 50,000 strategies carry only 434 independent bets, and the median per-pair probability of backtest overfitting is 0.28. Even the strongest strategy out of 50,000 is a coin flip once you count the trials.
| metric | value | note |
|---|---|---|
| strategies | 50,000 | 2,500 × 20 pairs |
| corpus best Sharpe | 2.00 | annualised, net of costs |
| E[max] null (N=50k) | 2.72 | False Strategy Theorem |
| Deflated Sharpe | 0.029 | bar is 0.95 |
| median per-pair PBO | 0.28 | CSCV |
| effective trials | 434 | of 50,000 |
Effective-trial counting is what makes the verdict honest at this scale. Strategies inside a pair are correlated, so the 2,500 nominal trials per pair are worth only a few hundred independent bets; pooled across all 20 pairs the 50,000 strategies collapse to 434. The null is fed that effective count; counting all 50,000 as independent would only make the discovered best look easier to find by luck.
The numbers below are per pair, straight from the production ledgers. No pair's best clears its own null after deflation.
| pair | T (bars) | best SR | PBO | eff. trials |
|---|---|---|---|---|
| AAVE | 1,772 | 0.96 | 0.25 | 327 |
| ALGO | 626 | 2.00 | 0.27 | 211 |
| APE | 1,251 | 0.84 | 0.47 | 274 |
| ARB | 626 | 1.60 | 0.50 | 166 |
| ATOM | 626 | 1.88 | 0.32 | 180 |
| AVAX | 1,772 | 1.17 | 0.35 | 332 |
| BCH | 2,085 | 0.84 | 0.14 | 411 |
| BNB | 1,563 | 0.81 | 0.15 | 186 |
| BTC | 2,817 | 0.62 | 0.19 | 234 |
| DOGE | 2,189 | 0.80 | 0.29 | 357 |
| DOT | 626 | 1.92 | 0.48 | 193 |
| ETC | 626 | 1.21 | 0.62 | 173 |
| ETH | 2,921 | 0.90 | 0.22 | 339 |
| LINK | 2,398 | 0.99 | 0.19 | 510 |
| LTC | 2,816 | 0.65 | 0.13 | 445 |
| NEAR | 1,772 | 0.90 | 0.39 | 367 |
| SOL | 1,457 | 1.25 | 0.24 | 427 |
| TRX | 2,607 | 0.77 | 0.27 | 292 |
| UNI | 1,772 | 1.17 | 0.36 | 444 |
| XLM | 626 | 1.75 | 0.49 | 193 |
Best SR is the per-pair best in-sample annualised Sharpe; eff. trials is the eigenvalue participation ratio of the trial-return correlation matrix. Pooled corpus: best 2.00 vs null 2.72, DSR 0.029, 434 effective trials.
Equities and forex, the same picture at scale
A controlled illustration
The same mechanism, isolated on a clean small grid: one structural family and 136 trials per instrument, so the only thing varying is the selection, not the strategy idea.
Before the ~92,500-strategy multi-market run above, the apparatus was built and verified on a deliberately clean, small grid: a single dual moving-average crossover swept over 136 (fast, slow) lookbacks per instrument, across crypto, equities and forex. This is the controlled illustration of the same effect the corpus shows at scale, small enough to reason about every trial, and to watch the winner sit at its own luck bar. From here on, a result is only "real" if it clears the Deflated Sharpe Ratio after honest trial-counting.
The bar a discovery must clear
The False Strategy Theorem gives the expected maximum Sharpe of N skill-less trials on T observations. Any target Sharpe is reachable by luck with enough trials, so the only honest question is whether a backtest beats its own luck bar. The calculator implements the theorem directly (the inverse-normal-CDF expansion with the Euler–Mascheroni correction) and pins a second panel to the study's real per-market medians. Drag N and T and watch the bar rise; load the 50,000-strategy corpus to see the best of 50k land below its null (E[max] ≈ 2.72), or the small N=136 grid to see the same effect on the controlled illustration.
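As a rough sketch of what such a calculator computes, the expansion can be written in a few lines of Python. The function names, the 252-periods-per-year annualisation, and the assumption that skill-less trial Sharpes have a per-period standard deviation of 1/sqrt(T) are illustrative choices made here, not details confirmed by the study's code. The at-scale tables are described as using the empirical trial-Sharpe dispersion, so their nulls need not match this simplified bar.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant


def expected_max_z(n_trials: int) -> float:
    """Approximate expected maximum of n_trials independent N(0, 1) draws."""
    if n_trials < 2:
        raise ValueError("need at least two trials")
    inv = NormalDist().inv_cdf
    return ((1.0 - EULER_GAMMA) * inv(1.0 - 1.0 / n_trials)
            + EULER_GAMMA * inv(1.0 - 1.0 / (n_trials * math.e)))


def luck_bar(n_trials: int, n_obs: int, periods_per_year: float = 252.0) -> float:
    """Annualised Sharpe expected from the best of n_trials skill-less trials.

    Assumption (not from the study): each trial's per-period Sharpe estimate
    has standard deviation 1/sqrt(n_obs), and annualisation is sqrt(periods).
    """
    per_period = expected_max_z(n_trials) / math.sqrt(n_obs)
    return per_period * math.sqrt(periods_per_year)


if __name__ == "__main__":
    for n in (10, 136, 50_000):
        print(n, round(luck_bar(n, 1_700), 3))
```

Under these assumptions, `luck_bar(50_000, 1_700)` comes out close to 1.63, and the bar rises with N and falls with T.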
Demo: False Strategy Theorem / Deflated Sharpe
Pick how many configurations you tried (N) and how many observations you have (T). The calculator returns the Sharpe you should expect to stumble on by luck alone: the bar a real edge must clear. Then test your own backtest Sharpe against it.
| market | best SR | E[max] null | DSR | PBO |
|---|---|---|---|---|
| Crypto (50k · 20 pairs) | 2.001 | 2.721 | 0.03 | 0.28 |
| US Equity (22.5k · 9 ETFs) | 3.206 | 10.886 | 0.00 | 0.00 |
| Forex (20k · 8 majors) | 1.779 | 7.701 | 0.00 | 0.01 |
Across ~92,500 real, fully-costed strategies in three asset classes, the single corpus best Sharpe (red) sits below its luck bar (amber) everywhere (crypto 2.00 vs 2.72, equity 3.21 vs 10.89, forex 1.78 vs 7.70), and the deflated Sharpe never approaches the 0.95 bar. Load the corpus (N=50k) to trace the crypto case in the chart above.
Example calculator readout: at N=50,000 trials and T=1,700 observations, the luck bar is 1.632 annualised. A Sharpe of 2.00 clears that bar, which is necessary but not sufficient; the full DSR also docks for skew, kurtosis and correlated trials.
The harness was self-tested against Monte Carlo: the formula matches simulation to about 1e-3 and converges (N=10: 0.050 vs 0.048; N=100: 0.080 vs 0.080; N=1000: 0.103 vs 0.103). On a constructed example a skill-less winner deflates to DSR 0.32 while a genuine per-bar edge deflates to 0.87, so the statistic does its job.
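A self-test of this kind can be sketched as follows. This is not the harness itself: the function names, the seed, the repetition count and the trial-Sharpe dispersion of 1/sqrt(1000) are assumptions chosen only for illustration, since the harness's actual settings are not stated here.

```python
import math
import random
from statistics import NormalDist, fmean

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant


def fst_expected_max(n_trials: int, sharpe_std: float) -> float:
    """Closed-form approximation to E[max] of n_trials N(0, sharpe_std^2) draws."""
    inv = NormalDist().inv_cdf
    z = ((1.0 - EULER_GAMMA) * inv(1.0 - 1.0 / n_trials)
         + EULER_GAMMA * inv(1.0 - 1.0 / (n_trials * math.e)))
    return sharpe_std * z


def simulated_expected_max(n_trials: int, sharpe_std: float,
                           n_reps: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate: average, over n_reps, of the best of n_trials draws."""
    rng = random.Random(seed)
    maxima = [max(rng.gauss(0.0, sharpe_std) for _ in range(n_trials))
              for _ in range(n_reps)]
    return fmean(maxima)


# Hypothetical dispersion: skill-less per-bar Sharpes estimated on 1,000 bars.
SHARPE_STD = 1.0 / math.sqrt(1000)

results = {n: (fst_expected_max(n, SHARPE_STD),
               simulated_expected_max(n, SHARPE_STD))
           for n in (10, 100, 1000)}

for n, (formula, simulated) in results.items():
    print(f"N={n}: formula {formula:.3f} vs simulation {simulated:.3f}")
```

With that dispersion the two columns should land close to the figures quoted above, with the largest gap at small N, where the asymptotic expansion is least accurate.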
In the small grid too, the best is at or below luck
In all three markets the median best-of-136 in-sample Sharpe sits below the False-Strategy-Theorem expectation for skill-less trials. Forex is the starkest, with a best-of-136 median Sharpe of 0.53 against a null of 1.06: you would expect to do better than that by chance. Thin trends plus the spread make forex's deflated significance the lowest of the three.
Nothing survives deflation
The Deflated Sharpe Ratio is the probability that the winner's true Sharpe exceeds the expected-max-under-the-null benchmark, after correcting for the number of trials, their dispersion, and the skew and kurtosis of returns. Median DSR is 0.10–0.44 everywhere, against a 0.95 significance bar. Alongside it, the Probability of Backtest Overfitting reaches 0.70 in crypto: the best in-sample configuration is more likely than not to be a below-median performer out of sample.
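The definition above can be sketched in code. The formula follows the probabilistic-Sharpe form given in Bailey & López de Prado (2014) as best understood here; the function names and the example inputs are hypothetical, and all Sharpe ratios in the sketch are per-period (not annualised).

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant


def null_benchmark(n_trials: int, trial_sharpe_std: float) -> float:
    """Expected maximum per-period Sharpe of n_trials skill-less trials."""
    inv = NormalDist().inv_cdf
    z = ((1.0 - EULER_GAMMA) * inv(1.0 - 1.0 / n_trials)
         + EULER_GAMMA * inv(1.0 - 1.0 / (n_trials * math.e)))
    return trial_sharpe_std * z


def deflated_sharpe(sr: float, sr_benchmark: float, n_obs: int,
                    skew: float = 0.0, kurt: float = 3.0) -> float:
    """Probability that the true Sharpe exceeds sr_benchmark.

    sr and sr_benchmark are per-period; kurt is raw kurtosis (3 for a normal).
    """
    variance_term = 1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2
    if variance_term <= 0.0:
        raise ValueError("skew/kurtosis inputs give a non-positive variance term")
    z = (sr - sr_benchmark) * math.sqrt(n_obs - 1) / math.sqrt(variance_term)
    return NormalDist().cdf(z)


if __name__ == "__main__":
    # Hypothetical inputs: 100 trials whose per-period Sharpes have a standard
    # deviation of 0.03, and a winner with per-period Sharpe 0.10 on 1,000 bars.
    benchmark = null_benchmark(100, 0.03)
    print(round(benchmark, 4))
    print(round(deflated_sharpe(0.10, benchmark, 1000, skew=-0.5, kurt=5.0), 3))
```

A winner sitting exactly at its benchmark scores 0.5, which is why a best Sharpe at or below the null can never approach the 0.95 bar.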
In-sample is not out-of-sample
Plotting each trial's in-sample Sharpe against its out-of-sample Sharpe makes the overfitting visible: the best in-sample point is nowhere near the best out-of-sample point. The ranking you would have selected on does not carry over.
136 trials ≈ 3 independent bets
The grid feels like 136 tests, but the strategies are roughly 55% pairwise-correlated, so the effective number of independent trials (the eigenvalue participation ratio of the trial-return correlation matrix) is about three. This is exactly why the null must be fed the effective count, not the nominal one: counting all 136 as independent would overstate how easy the best draw was to find by luck.
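The arithmetic behind "about three" can be checked with a small sketch. The participation ratio is (Σλ)² / Σλ², and for a symmetric matrix Σλ is the trace while Σλ² is the sum of squared entries, so no eigen-decomposition is needed. The equicorrelated matrix below is a simplifying assumption standing in for the study's actual correlation matrix, and the function names are hypothetical.

```python
def participation_ratio(corr: list[list[float]]) -> float:
    """Eigenvalue participation ratio (sum of eigenvalues)^2 / sum of squares.

    For a symmetric matrix, sum(eigenvalues) equals the trace and
    sum(eigenvalues^2) equals the sum of squared entries.
    """
    n = len(corr)
    trace = sum(corr[i][i] for i in range(n))
    sum_sq = sum(c * c for row in corr for c in row)
    return trace ** 2 / sum_sq


def equicorrelated(n: int, rho: float) -> list[list[float]]:
    """Correlation matrix with every off-diagonal entry equal to rho."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # 136 trials, all pairwise correlations set to 0.55 (simplifying assumption).
    print(round(participation_ratio(equicorrelated(136, 0.55)), 2))  # 3.25
```

In the equicorrelated case this reduces to N / (1 + (N − 1)ρ²), which is why even moderate pairwise correlation collapses a large grid to a handful of independent bets.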
The verdict expressed in years
The Deflated Sharpe Ratio prices the inflation as a probability; the Minimum Track Record Length (MinTRL) and Minimum Backtest Length (MinBTL) price it in time. MinTRL is the track length that would be needed for the best Sharpe to clear its multiple-testing null with confidence; MinBTL is the track length below which a skill-less search of that many trials would be expected to throw up a Sharpe as high as the one observed, by luck alone.
MinTRL is infinite in all three markets: the best Sharpe is already at or below its null, so no finite track makes it credible against that bar. MinBTL is the same deflated verdict said in years. The crypto search needs about 4.5 years of track to justify 50,000 trials but ran on roughly 2.5; forex needs about 5.1 years and ran on roughly 3.7. The searches are simply too wide for the track they had.
| market | best SR | E[max] null | track (yrs) | MinTRL (yrs) | MinBTL (yrs) |
|---|---|---|---|---|---|
| Crypto (ALGO) | 2.00 | 2.72 | 2.48 | infinite | 4.49 |
| US Equity (QQQ) | 3.21 | 10.89 | 19.35 | infinite | 1.60 |
| Forex (USDJPY) | 1.78 | 7.70 | 3.71 | infinite | 5.12 |
Sharpe ratios are annualised; MinBTL uses the nominal trial count. MinTRL is infinite wherever the best Sharpe is at or below its null. MinBTL is the deflated-Sharpe verdict expressed in years of required track: crypto and forex ran well short of theirs.
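Both lengths can be sketched in a few lines. The MinBTL form used here, (E[max of N standard normals] / annualised Sharpe)², is an assumption based on the approximation in Bailey et al. (2014); it appears to reproduce the MinBTL column above to rounding. The MinTRL form follows the probabilistic-Sharpe track-length formula with the null as benchmark. The function names are hypothetical.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant


def expected_max_z(n_trials: int) -> float:
    """Approximate expected maximum of n_trials independent N(0, 1) draws."""
    inv = NormalDist().inv_cdf
    return ((1.0 - EULER_GAMMA) * inv(1.0 - 1.0 / n_trials)
            + EULER_GAMMA * inv(1.0 - 1.0 / (n_trials * math.e)))


def min_backtest_length_years(n_trials: int, annual_sharpe: float) -> float:
    """Years of track below which n_trials skill-less trials are expected
    to produce an annualised Sharpe of annual_sharpe by luck alone."""
    return (expected_max_z(n_trials) / annual_sharpe) ** 2


def min_track_record_length(sr: float, sr_benchmark: float, skew: float = 0.0,
                            kurt: float = 3.0, confidence: float = 0.95) -> float:
    """Observations needed for per-period sr to exceed sr_benchmark at the
    given confidence; infinite when sr is at or below the benchmark."""
    if sr <= sr_benchmark:
        return math.inf
    z_alpha = NormalDist().inv_cdf(confidence)
    variance_term = 1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2
    return 1.0 + variance_term * (z_alpha / (sr - sr_benchmark)) ** 2


if __name__ == "__main__":
    # Nominal trial counts and best annualised Sharpes from the table above.
    for market, n, sr in (("crypto", 50_000, 2.00),
                          ("equity", 22_500, 3.21),
                          ("forex", 20_000, 1.78)):
        print(market, round(min_backtest_length_years(n, sr), 2))
    # Best Sharpe below its null: no finite track length suffices.
    print(min_track_record_length(2.00 / math.sqrt(252), 2.72 / math.sqrt(252)))
```

The loop should print values near 4.49, 1.60 and 5.12 years, and the final line prints `inf`, matching the infinite MinTRL entries.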
By-market summary, controlled MA-grid illustration (136 trials)
Median values per market for the small controlled grid, net of costs, on information-driven bars. best SR is the best in-sample annualised Sharpe; E[max] null is the False-Strategy-Theorem expectation for skill-less trials; DSR is the Deflated Sharpe Ratio; PBO is the Probability of Backtest Overfitting; the last column is nominal → effective trials.
| market | best SR | E[max] null | DSR | PBO | trials |
|---|---|---|---|---|---|
| Crypto (8 perps) | 0.62 | 0.66 | 0.44 | 0.70 | 136 → 3 |
| Equities (7 ETFs) | 0.31 | 0.44 | 0.32 | 0.38 | 136 → 3 |
| Forex (8 majors) | 0.53 | 1.06 | 0.10 | 0.37 | 136 → 3 |
The cross-market consistency is the point of the illustration: the multiple-testing effect is not a crypto quirk; equities and forex show it too on the clean grid. And the ~92,500-strategy run at the top of this page is the same lesson at scale: real, fully-costed searches in all three asset classes where even the single best strategy out of tens of thousands does not clear its null.
Method
- One structural family per instrument: a dual moving-average crossover swept over a grid of (fast, slow) lookbacks, 136 trials each, so the only thing varying is the selection, not the strategy idea.
- Information-driven bars: dollar bars for crypto and equities (equities within regular hours), tick bars for forex, with realistic non-zero costs applied to turnover (crypto 7 bp, equities 2 bp, forex 1 bp per unit).
- Every trial's net per-bar return series feeds the harness, which returns the best in-sample annualised Sharpe, the Deflated Sharpe Ratio, the Probability of Backtest Overfitting (CSCV, 924 splits) and the effective number of independent trials.
- The False Strategy Theorem supplies the null: the expected maximum Sharpe of N skill-less trials. A discovered Sharpe is only interesting if it clears that bar, and then the DSR docks it further for skew, kurtosis and correlated trials.
- Effective-trial counting via the eigenvalue participation ratio of the trial-return correlation matrix: correlated trials are not independent, so the null is fed the effective count, not the nominal one.
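The CSCV step in the list above can be sketched as follows. This is a minimal pure-Python illustration of combinatorially symmetric cross-validation, not the study's harness: the function names, the rank convention and the toy data are assumptions. With twelve blocks the loop visits C(12, 6) = 924 splits; the demo uses six blocks to stay fast.

```python
import math
import random
from itertools import combinations
from statistics import fmean, pstdev


def sharpe(returns: list[float]) -> float:
    """Per-period Sharpe ratio; zero if the series has no variance."""
    sd = pstdev(returns)
    return fmean(returns) / sd if sd > 0 else 0.0


def pbo_cscv(trial_returns: list[list[float]], n_blocks: int = 12) -> float:
    """Share of splits where the in-sample winner ranks at or below the
    out-of-sample median. trial_returns holds one equal-length series per trial."""
    block = len(trial_returns[0]) // n_blocks
    if block < 2 or len(trial_returns) < 2 or n_blocks % 2:
        raise ValueError("need >=2 trials, an even block count, >=2 obs per block")
    # Cut every trial into contiguous blocks (any remainder is dropped).
    blocks = [[series[b * block:(b + 1) * block] for b in range(n_blocks)]
              for series in trial_returns]
    logits = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        oos_ids = [b for b in range(n_blocks) if b not in is_ids]
        is_sr = [sharpe([r for b in is_ids for r in t[b]]) for t in blocks]
        oos_sr = [sharpe([r for b in oos_ids for r in t[b]]) for t in blocks]
        best = max(range(len(is_sr)), key=is_sr.__getitem__)
        # Relative out-of-sample rank of the in-sample winner, kept inside (0, 1).
        omega = sum(s <= oos_sr[best] for s in oos_sr) / (len(oos_sr) + 1)
        logits.append(math.log(omega / (1.0 - omega)))
    return sum(x <= 0.0 for x in logits) / len(logits)


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy data: ten pure-noise trials of 600 bars each.
    noise = [[rng.gauss(0.0, 1.0) for _ in range(600)] for _ in range(10)]
    print(math.comb(12, 6))              # 924 splits for twelve blocks
    print(pbo_cscv(noise, n_blocks=6))   # PBO of a skill-less grid
```

A PBO near or above 0.5 says the in-sample ranking carries little or no information out of sample; a trial with a genuine, persistent edge drives it toward zero.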
Notes & limitations
Both runs deliberately find nothing, and that is the finding. The controlled grid (one structural family, 136 trials per instrument) isolates the mechanism; the ~92,500-strategy multi-market run shows it survives contact with real, fully-costed searches of thousands of correlated trials per instrument in crypto, equities and forex. The CSCV uses twelve blocks (924 combinations). The DSR uses the nominal trial count with the empirical trial-Sharpe dispersion, which is conservative: feeding it the effective count would raise the DSR slightly, and the "not significant" verdict holds either way.
Reproducibility
The harness, its self-tests, and the at-scale run are collected in the companion repository, project 00 of lopez-de-prado-work-review. The calculator on this page is self-contained: the False-Strategy-Theorem formula and the per-market medians are encoded in the component, so the mechanic it illustrates reproduces exactly on every load.
References
The primary sources for the apparatus reviewed here:
- Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. Journal of Portfolio Management, 40(5).
- Bailey, D. H., Borwein, J., López de Prado, M., & Zhu, Q. J. (2017). The Probability of Backtest Overfitting. Journal of Computational Finance, 20(4).
- Bailey, D. H., Borwein, J., López de Prado, M., & Zhu, Q. J. (2014). Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance. Notices of the AMS, 61(5).
- López de Prado, M. (2021). The False Strategy Theorem: A Financial Application of Experimental Mathematics. American Mathematical Monthly, 128(9).
- López de Prado, M. (2019). A Data Science Solution to the Multiple-Testing Crisis in Financial Research. Journal of Financial Data Science, 1(1).
See also
The selection-discipline theme that runs through this study is developed narratively in The edge is in the process, and the broader body of work is at Research.

