Research review · research process · study 11
Meta-Strategy Organization
Run research like a factory assembly line of separable, specialized stations with one shared validation gate, and log every trial, because hiding the number of attempts is what turns selection into a fake edge
This is the capstone of the program. López de Prado argues that quant research should be organized like a factory assembly line of specialized, separable stations rather than a lone craftsman, bound by two principles: separability (no one designs a bet and also grades it) and mandatory disclosure (every trial from every station is logged, so a reported winner can be deflated by the true number of attempts N). The anti-pattern he warns against is the lone “Sisyphus” quant who finds strategies by backtesting one idea over and over and reports only the best. The honest evidence for the whole thesis is this program's own scorecard: roughly 98,000 strategy configurations evaluated across the studies, gated on the same deflated bar. The null holds broadly across the naive and single-series regimes, where almost nothing clears. But the gate is not a blanket no: three disciplined regimes do survive deflation, namely cross-sectional ranking with a gradient-boosted model on every horizon, the Ornstein–Uhlenbeck trading rule on its true mean-reverting spread, and large-universe risk parity. They approach but do not beat the best static archetype (about PF 1.17), and the fully-disclosed end-to-end pipeline still deploys nothing in a held-out window. The playground below shows exactly how much free Sharpe selection buys when N is hidden.
Working paper
write-up in review, not yet posted
Code & data
lopez-de-prado-work-review / projects / 16
the claim
What López de Prado says
- Quant research should run like a factory assembly line of specialized, separable stations (data curators, feature analysts, strategists, backtesters and a validation station, deployment and sizing, portfolio oversight), not like a lone craftsman who does everything.
- Two principles bind the line. Separability: no one designs a bet and also grades it, because the person who builds a strategy and judges it will judge it kindly. Mandatory disclosure: every trial from every station is logged, so when a winner is reported the validation station knows the true number of trials N and can deflate the result.
- The warning is the lone Sisyphus quant who discovers strategies by repeatedly backtesting one idea and reporting only the best. The reason it fails is mathematical, not a matter of willpower: try enough configurations, report only the winner, and the winner looks strong even with zero real edge. Hiding N is what makes the number meaningless; disclosing N is the only thing that lets anyone correct for it.
our result
What we found
- The lone backtester earns nothing out of sample. Across 1,208 walk-forward windows on 39 instruments, picking the single best in-sample strategy each quarter and deploying it gave a pooled realized out-of-sample Sharpe of negative 0.02 (95 percent band negative 0.21 to positive 0.18). The picks had a median in-sample Sharpe of 3.09 annualized and the mean in-sample-to-out-of-sample decay was 2.61 Sharpe: the winner's curse erases the whole apparent edge. Out-of-sample Sharpe was negative for 82 percent of instruments and the median hit rate was 0.40, worse than a coin flip. The disclosed process, refusing to deploy any single strategy that fails deflation, deployed nothing.
- Diversification buys deflated edge only where the market has structure. Combining many weakly-correlated strategies lowered the median in-shortlist correlation to 0.11 and raised the effective number of independent bets to about 17, but the pooled fixed-policy sleeve still realized an out-of-sample Sharpe of negative 0.64. Only 3 of 39 instruments cleared the 0.95 portfolio-level Deflated Sharpe bar, and all three are broad equity index funds (the Nasdaq-100 proxy at OOS Sharpe 2.50 and DSR 1.00, the S&P-500 proxy at 1.96, the small-cap proxy at 0.85). Every crypto, FX, sector, and volatility sleeve was negative. A portfolio of noise is still noise.
- The program's own best result falls below the skill-less expectation. The probability of backtest overfitting, measured program-wide by Combinatorially-Symmetric Cross-Validation, was 0.21 (0.34 crypto, 0.11 FX, 0.00 equity). Across 985,570 eligible configurations the observed best annualized Sharpe is 3.21, while the expected maximum Sharpe of purely skill-less trials at this scale is 5.22 to 5.61. So the program's best result, impressive in isolation, sits below what selection on noise alone would produce once the search is disclosed; the skill-less curve crosses 3.21 at only about 200 trials. That gap, disclosed plainly, is the thesis.
- But the null is not absolute: discipline survives where naive search does not. Three regimes clear the deflation gate honestly. Cross-sectional ranking with a gradient-boosted model clears the Deflated Sharpe bar on 10 of 10 holding horizons (probability of backtest overfitting 0.10, in-sample-to-out-of-sample rank correlation 0.78, median out-of-sample profit factor about 1.12); it is the program's first DSR-surviving machine-learning edge. The Ornstein–Uhlenbeck optimal trading rule, run on its true regime (the residual spread of a cointegrated pair rather than a single series), clears DSR on 3 of 5 pairs at a median profit factor of 2.11 and beats a Bollinger-band control on 4 of 5. And large-universe hierarchical risk parity shows a real out-of-sample variance advantage over equal weight that grows with breadth (HRP-to-1/N variance ratio 0.87 at 25 names, 0.45 at 200). Each of these approaches but does not beat the best static structural archetype at about PF 1.17, and the fully-disclosed end-to-end pipeline still deploys nothing in a held-out window, so the honest reading is that disciplined ML matches the static edge without dominating it.
The result in three lines
E1 · 1,208 windows · 39 instruments
The lone backtester earns nothing out of sample
Picking the single best in-sample strategy each quarter and deploying it gave a pooled realized out-of-sample Sharpe of negative 0.02 (95 percent band negative 0.21 to positive 0.18). The picks had a median in-sample Sharpe of 3.09 and the mean decay was 2.61 Sharpe: the winner's curse erases the whole edge. Out-of-sample Sharpe was negative on 82 percent of instruments; the median hit rate was 0.40. The disclosed process, refusing any single strategy that fails deflation, deployed nothing in all 1,208 windows.
E2 · 39 instruments · ~17 effective bets
Diversification pays only where the market has structure
Combining weakly-correlated bets cut the median in-shortlist correlation to 0.11 and raised the effective bet count to about 17, but the pooled fixed-policy sleeve still realized an out-of-sample Sharpe of negative 0.64. Only 3 of 39 instruments cleared the 0.95 portfolio Deflated Sharpe bar, and all three are broad equity index funds (QQQ at OOS 2.50 / DSR 1.00, SPY 1.96, IWM 0.85). Every crypto, FX, sector, and volatility sleeve was negative.
E4 · 985,570 trials · best 3.21
The program's best is below the skill-less expectation
Program-wide backtest overfitting was 0.21 (0.34 crypto, 0.11 FX, 0.00 equity). Across 985,570 eligible configurations the observed best annualized Sharpe is 3.21, while pure selection on noise at this scale would be expected to produce 5.22 to 5.61. The best result sits below the skill-less expectation once the search is disclosed; the curve crosses 3.21 at only about 200 trials. That gap is the disclosure effect, and the thesis.
Discipline · 3 regimes clear DSR
But the null is not absolute: discipline survives
The gate is not a blanket no. Cross-sectional ranking with a gradient-boosted model clears the Deflated Sharpe bar on 10 of 10 horizons (PBO 0.10, IS→OOS rank ρ 0.78, median PF ~1.12), the program's first DSR-surviving ML edge. OU on its true mean-reverting spread clears DSR on 3 of 5 cointegrated pairs (median PF 2.11). Large-universe risk parity's variance advantage over 1/N grows with breadth (0.87 → 0.45 from 25 to 200 names). Each approaches but does not beat the best static archetype (~PF 1.17), and the fully-disclosed pipeline still deploys nothing held out.
Disclose your trials
Drag the number of configurations N a researcher quietly searched. The curve is the expected maximum Sharpe of skill-less strategies; with no real edge, the best of N still climbs with N. The lone-quant marker sits at 2,500 silent trials. This is a pedagogical mechanic, not a tradeable claim.
The point of the playground is that selection masquerades as skill unless every trial is disclosed. Slide N up and the expected maximum Sharpe of a no-edge search climbs with it; by 2,500 trials it is near 1.8, a number that would look like a serious strategy. A lone quant who reports only the best of those trials and hides N hands you that figure as if it were skill. Disclosing N is the one thing that lets the Deflated Sharpe Ratio subtract this benchmark and judge the deflated winner against the 0.95 bar. That disclosure discipline is exactly what the assembly line institutionalizes.
Demo: disclose your trials
Drag the number of configurations N a researcher quietly searched. The curve is the expected maximum Sharpe of skill-less strategies (False Strategy Theorem, T = 1000 observations per trial): with zero real edge by construction, the best of N backtests still climbs with N. A lone quant who reports only the best and hides N hands you that number as if it were skill. Disclosing N is the one thing that lets the Deflated Sharpe Ratio subtract this benchmark.
N is the true number of configurations tried, including every silent re-run.
The single best of N backtests, annualised: the one number that gets published.
Searching … skill-less configurations and reporting only the best buys this much annualised Sharpe from pure selection: 1.76. Until N is disclosed, there is no way to tell it from skill; once it is, the Deflated Sharpe Ratio subtracts the benchmark and the winner deflates toward the 0.95 chance bar.
The Sisyphus trap, quantified
The False Strategy Theorem gives the expected maximum Sharpe of N skill-less trials, each estimated over a track of T observations. With zero true edge by construction, the best of N backtests still climbs steadily with N, and that climb is pure selection. At 2,500 silent trials, the size of this program's per-instrument crypto corpus, selection alone buys about 1.8 annualised Sharpe. If the quant never tells you N, you have no way to know the 1.8 is a mirage. Disclosing N is exactly what lets the Deflated Sharpe Ratio subtract this benchmark and reveal the winner as luck.
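The curve described above can be sketched in a few lines. This is a minimal illustration, assuming the standard closed-form approximation from Bailey & López de Prado and an annualisation factor of 252 observations per year; the page does not state the factor, so treat `periods_per_year` as an assumption. The function name `expected_max_sharpe` is illustrative, not taken from the project code.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials: int, n_obs: int = 1000, periods_per_year: int = 252) -> float:
    """Annualised expected maximum Sharpe of n_trials skill-less backtests.

    Assumes each trial's Sharpe estimate has dispersion ~ 1/sqrt(n_obs)
    and that 252 observations make a year (an assumption, see lead-in).
    """
    if n_trials < 2:
        raise ValueError("need at least two trials")
    inv = NormalDist().inv_cdf
    # Closed-form approximation to the expected maximum of n_trials
    # independent standard normal draws.
    z_max = (1 - EULER_GAMMA) * inv(1 - 1 / n_trials) + EULER_GAMMA * inv(1 - 1 / (n_trials * e))
    sharpe_std = sqrt(1 / n_obs)  # per-observation dispersion of a skill-less Sharpe
    return z_max * sharpe_std * sqrt(periods_per_year)


for n in (10, 100, 2_500, 1_000_000):
    print(f"N={n:>9,}  E[max Sharpe] = {expected_max_sharpe(n):.2f}")
```

Under these assumptions the output should land close to the anchors tabulated further down the page (for example about 1.76 at N = 2,500), which is the sense in which selection alone "buys" Sharpe.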
Six separable stations, one shared gate
The schematic lays out the six stations and maps each completed study to the station it stress-tested: data curators (information-driven bars, fractional differentiation), feature analysts (structural breaks and entropy, microstructural features, causal factors), strategists (triple-barrier and meta-labeling, trend-scanning labels), backtesters and validation (purged cross-validation, ensembles and importance, and the overfitting and Deflated Sharpe harness itself), deployment and sizing (bet sizing, optimal trading rules), and portfolio oversight (portfolio construction). No single study did everything, and no study graded its own work, because the validation station was shared and applied identically. The disclosure loop logs every trial so the gate can deflate by the true N.
The program scorecard: the gate held broadly, but discipline cleared it
Configurations evaluated versus configurations clearing the deflation gate, per study. Across the naive and single-series studies the line evaluated roughly 98,000 configurations and almost none cleared; the 92,500-strategy overfitting corpus clears in 0 of 3 markets. But the gate is not a blanket no. Three disciplined regimes survive deflation honestly: cross-sectional ranking with a gradient-boosted model clears DSR on 10 of 10 horizons (the program's first DSR-surviving machine-learning edge), the Ornstein–Uhlenbeck rule on its true mean-reverting spread clears 3 of 5 cointegrated pairs, and large-universe risk parity's out-of-sample variance advantage over equal weight grows with breadth. Each approaches but does not beat the best static archetype near PF 1.17. The factory ran end to end: the gate said no to the naive search and yes only to discipline, which is the discipline working as intended.
The program as an assembly line, a meta-analysis
Every study, gated on one deflated bar
Every number below is read from the studies' own result tables. The verdict is two-sided: the naive and single-series studies, roughly 98,000 configurations, clear almost nothing, while three disciplined regimes do survive the same deflated bar (cross-sectional ranking with a gradient-boosted model, the Ornstein–Uhlenbeck rule on its true mean-reverting spread, and large-universe risk parity). The data-representation studies (information-driven bars, fractional differentiation) and the leakage study are statistical-property work, cost-free by nature, so they carry no strategy-level deflation gate and are excluded from the trial total and marked as such.
| study | station | trials | clear DSR>0.95 | claim reproduced |
|---|---|---|---|---|
| Backtest overfitting & DSR | Validation | 92,500 | 0 / 3 markets | Yes: best Sharpe below the null E[max] in all 3 markets; DSR 0.00–0.03 |
| Information-driven bars | Data representation | n/a | n/a | Yes: excess kurtosis collapses toward Gaussian in all 3 markets (property study) |
| Fractional differentiation | Data representation | n/a | n/a | Yes: FFD keeps ~0.98 memory while buying stationarity (property study) |
| Triple-barrier + meta-labeling | Labeling | 1,134 | 0 / 42 | Partly: mechanics reproduce; meta-label adds no deflated edge |
| Purged CV / CPCV leakage | Validation | n/a | n/a | Yes: naive k-fold inflates the score; purging removes it (property study) |
| Trend-scanning labels | Labeling | 378 | 19 / 42 | Partly: DSR hits exist but the labeller ties its fixed-horizon control (19 vs 18 of 42) |
| Structural breaks & entropy | Features | 108 | 0 | Yes as features, but not tradeable; best DSR 0.002 |
| Microstructural features | Features | 144 | 0 | Partly: estimators cheap, but nothing tradeable after costs |
| Cross-sectional ML (LightGBM ranking) | Model / strategy | 10 horizons | 10 / 10 | Yes: the program's first DSR-surviving ML edge; PBO 0.10, IS→OOS rank ρ 0.78, median PF ~1.12; approaches but does not beat the best static archetype |
| Ensembles & feature importance | Model / importance | 1,280 | 0 / 40 | Yes on the science: bagging gap ~5× smaller, MDA less biased; no deflated edge |
| Bet sizing | Sizing | 1,344 | 0 / 42 | Partly: sizing changes turnover, not deflated performance (PBO ~0.49) |
| Optimal trading rules (OU on a cointegrated spread) | Sizing / exits | 5 pairs | 3 / 5 | Yes in its true regime: on the mean-reverting residual spread the OU rule clears DSR on 3 of 5 pairs (median PF 2.11) and beats the band control on 4 of 5; the single-series mesh still ties its IS-tuned control |
| Portfolio construction (HRP / large universe) | Allocation | 5 sizes | ranking | Yes on ranking: HRP's out-of-sample variance advantage over 1/N grows with breadth (HRP/1N variance 0.87 at N=25 → 0.45 at N=200); 0 clear an absolute DSR on this net-noisy panel |
| Causal factor investing | Features / validation | 9 | 0 | Yes: MC reproduces the bias; no real factor survives adjustment + deflation (best DSR 0.45) |
“Trials” counts strategy configurations evaluated; property studies carry no strategy deflation gate and are excluded from the ~98,000 naive-regime total. The only synthetic data anywhere in the program is the labelled Monte-Carlo that validates the expected-max-Sharpe formula; everything else is real market data. The disciplined regimes that clear the gate (cross-sectional ranking on 10 of 10 horizons, OU on its true cointegrated spread on 3 of 5 pairs, and large-universe risk parity) approach but do not beat the best static archetype near PF 1.17; the older single-series trend-scan and OU-mesh bests still tie their own in-sample-tuned controls.
Four experiments on the program's own corpus
About 1.20 million strategy configurations, across 42 instruments, put on trial
The scorecard above is the program reading its own result tables. The four experiments below go further: they re-run the organizational thesis as a controlled, after-cost experiment on the program's own corpus of real strategy results, about 1.20 million strategy configurations across 42 instruments spanning crypto, equities, and foreign exchange, with every realized stream net of costs and purged of look-ahead by construction. Each experiment answers one question the thesis raises, and where the result is negative it is reported as negative.
E1 · Sisyphus versus the disclosed assembly line
Across 1,208 walk-forward windows on 39 instruments (one year in-sample, one quarter out-of-sample, stepped quarterly), we ran two research processes side by side on the same candidate pool. The lone backtester picks the single best in-sample strategy each window and deploys it; the disclosed assembly line gates the pool on the False Strategy Theorem null for the full number of trials searched and deploys only strategies that clear the Deflated Sharpe bar.
The lone backtester's picks had a median in-sample Sharpe of 3.09 annualized and a pooled realized out-of-sample Sharpe of negative 0.02 (95 percent band negative 0.21 to positive 0.18). The mean in-sample-to-out-of-sample decay was 2.61 annualized Sharpe: the winner's curse erases the entire apparent edge. Realized out-of-sample Sharpe was negative for 82 percent of instruments, and the median hit rate was 0.40, worse than a coin flip. The disclosed process, insisting that any single strategy clear the disclosed-trial null, deployed nothing in any of the 1,208 windows; the best single strategy's Deflated Sharpe was effectively zero every window (per-window maximum 0.0001), and relaxing the bar from 0.95 to 0.60 changed nothing. Refusing to deploy is not a failure of the discipline; it is the discipline correctly reporting that no single backtest survives once you admit how many were tried.
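The mechanics of the lone backtester's walk-forward loop can be illustrated on synthetic noise. The sketch below is not the program's experiment: it assumes 200 made-up no-edge strategies, a 252-day in-sample window, a 63-day out-of-sample window stepped one quarter at a time, and NumPy as a dependency. The function names are hypothetical.

```python
import numpy as np


def annualised_sharpe(returns, periods_per_year=252):
    """Annualised Sharpe of a 1-D stream, or of each column of a 2-D matrix."""
    return np.sqrt(periods_per_year) * returns.mean(axis=0) / returns.std(axis=0, ddof=1)


def sisyphus_walk_forward(returns, is_len=252, oos_len=63):
    """Each window: pick the best in-sample column, then hold it out of sample."""
    is_picks, oos_stream = [], []
    start = 0
    while start + is_len + oos_len <= returns.shape[0]:
        in_sample = returns[start:start + is_len]
        out_sample = returns[start + is_len:start + is_len + oos_len]
        is_sharpes = annualised_sharpe(in_sample)
        best = int(np.argmax(is_sharpes))          # the only strategy that gets reported
        is_picks.append(is_sharpes[best])
        oos_stream.append(out_sample[:, best])     # what that pick realizes next quarter
        start += oos_len                           # step forward one quarter
    pooled_oos = annualised_sharpe(np.concatenate(oos_stream))
    return float(np.median(is_picks)), float(pooled_oos)


rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.01, size=(252 * 11, 200))  # 11 years, 200 strategies, zero edge
median_is, pooled_oos = sisyphus_walk_forward(noise)
print(f"median in-sample Sharpe of picks: {median_is:.2f}")
print(f"pooled out-of-sample Sharpe:      {pooled_oos:.2f}")
```

Because every column is pure noise, any in-sample Sharpe the picks show is selection, and the pooled out-of-sample Sharpe should hover around zero; that is the same shape as the E1 result, reproduced here with no real data at all.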
E2 · The meta-strategy portfolio
López de Prado's constructive answer is not one great strategy but many weak, weakly-correlated ones combined, deflating the sleeve rather than each part. We built diversified shortlists per instrument and window, allocated four ways, and deflated the sleeve against a portfolio-level null.
The diversification was real: the median in-shortlist absolute correlation fell to 0.11 and the effective number of independent bets rose to about 17. But combining weak bets mostly combined noise. The pooled fixed-policy sleeve had a realized out-of-sample Sharpe of negative 0.64, and only 3 of 39 instruments cleared the 0.95 Deflated Sharpe bar at the portfolio level, all three of them broad equity index funds. Every crypto, foreign-exchange, sector, and volatility sleeve was negative. Diversification lowers correlation and raises the effective bet count exactly as advertised; whether that buys deflated edge depends entirely on whether the market carries persistent structure after costs. A portfolio of noise is still noise.
E2 · the only sleeves that cleared
Three broad equity indices, and nothing else
Of 39 instruments, only three cleared the 0.95 portfolio-level Deflated Sharpe bar, and all three are broad equity index funds. Every crypto, foreign-exchange, sector, and volatility sleeve had a negative realized out-of-sample Sharpe and a portfolio Deflated Sharpe at or near zero.
| sleeve | what it is | OOS Sharpe (ann.) | portfolio DSR |
|---|---|---|---|
| QQQ | Nasdaq-100 index proxy | 2.50 | 1.00 |
| SPY | S&P 500 index proxy | 1.96 | 1.00 |
| IWM | small-cap (Russell 2000) proxy | 0.85 | 1.00 |
Realized out-of-sample Sharpe of the best deflated sleeve, with the portfolio-level Deflated Sharpe Ratio. The pooled fixed-policy sleeve across all 39 instruments was negative 0.64; the median portfolio Deflated Sharpe was essentially zero. The edge that survives is the market, not the search.
E3 · Program-level probability of backtest overfitting
The Probability of Backtest Overfitting asks how often the best in-sample strategy lands in the bottom half out of sample under Combinatorially-Symmetric Cross-Validation. A value near 0.5 means in-sample ranking carries no out-of-sample information. We ran it on the real net-daily-PnL matrix of each instrument, sampling up to 1,500 activity-filtered strategies per instrument with a fixed seed.
The program-wide probability of backtest overfitting was 0.21. By market it was 0.34 for crypto, 0.11 for foreign exchange, and 0.00 for equities. Equity index rankings are stable out of sample; crypto rankings are close to a coin flip; foreign exchange sits in between. The ordering reproduces the program's earlier per-corpus study on a smaller grid, with the larger and more diverse daily corpus here showing somewhat more overfitting room in crypto and foreign exchange, exactly what one expects when the search space grows.
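For readers who want the diagnostic in concrete form, here is a minimal sketch of Combinatorially-Symmetric Cross-Validation on a T × N matrix of per-period strategy returns. It assumes NumPy, eight equal time blocks, and Sharpe as the performance metric; the block count, the function names, and the synthetic inputs are illustrative choices, not the program's settings.

```python
from itertools import combinations

import numpy as np


def sharpe(block):
    """Per-column (per-strategy) Sharpe of a T x N return matrix."""
    return block.mean(axis=0) / block.std(axis=0, ddof=1)


def pbo_cscv(returns, n_blocks=8):
    """Share of symmetric splits where the in-sample winner lands in the OOS bottom half."""
    n_rows, n_strats = returns.shape
    usable = n_rows - n_rows % n_blocks                  # drop a ragged tail
    blocks = np.split(returns[:usable], n_blocks)
    logits = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        oos_ids = [b for b in range(n_blocks) if b not in is_ids]
        in_sample = np.vstack([blocks[b] for b in is_ids])
        out_sample = np.vstack([blocks[b] for b in oos_ids])
        winner = int(np.argmax(sharpe(in_sample)))       # best strategy in sample
        oos_sharpes = sharpe(out_sample)
        # Relative out-of-sample rank of the winner, strictly inside (0, 1).
        rank = int((oos_sharpes <= oos_sharpes[winner]).sum())
        omega = rank / (n_strats + 1)
        logits.append(np.log(omega / (1 - omega)))
    return float(np.mean(np.array(logits) <= 0))         # logit <= 0 means bottom half


rng = np.random.default_rng(1)
noise = rng.standard_normal((800, 50))                   # no strategy has an edge
skilled = noise.copy()
skilled[:, 0] += 1.0                                     # one strategy with a large true edge
print("PBO, pure noise:     ", pbo_cscv(noise))
print("PBO, one real winner:", pbo_cscv(skilled))
```

With one strategy that genuinely dominates, the in-sample winner stays on top out of sample and the estimate collapses to zero; on pure noise the in-sample ranking carries no information, which is the regime the text describes as a value near 0.5.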
E3 · overfitting by market
The same diagnostic separates the markets cleanly
Probability of backtest overfitting by market, pooled across instruments via Combinatorially-Symmetric Cross-Validation on the real net-daily-PnL matrices. A value near 0.5 means in-sample ranking carries no out-of-sample information.
| market | PBO | reading |
|---|---|---|
| Equities | 0.00 | rankings stable out of sample |
| Foreign exchange | 0.11 | in between |
| Crypto | 0.34 | close to a coin flip |
| Program-wide | 0.21 | pooled across all instruments |
Program-wide probability of backtest overfitting 0.21 (median per instrument 0.22), with a per-instrument range from 0.00 to 0.63. Backtest overfitting is real and strongly market-dependent.
E4 · The program's best result, against the skill-less expectation
The False Strategy Theorem says the expected maximum Sharpe of N skill-less trials grows with N and with the dispersion of trial Sharpes. We measured all three from the real corpus: the trial count, the empirical dispersion of per-strategy Sharpes, and the effective number of independent trials from the eigenvalue participation ratio of the real correlation structure.
The program searched 985,570 eligible configurations. The real correlation structure is loose (median absolute correlation 0.05), so the effective number of independent trials is 186,139, about 19 percent of nominal. The observed best annualized Sharpe across the entire corpus is 3.21, on the Nasdaq-100 proxy. The expected maximum Sharpe of purely skill-less trials at this scale is 5.61 under the nominal count and 5.22 under the effective count. This is the single most striking number in the study: the program's own best result, impressive in isolation, sits below what selection on noise alone would be expected to produce once the roughly one million trials are disclosed. The skill-less curve crosses 3.21 at only about 200 trials. With a search this large, a Sharpe of 3.21 is not even keeping up with chance. This is the thesis in one figure.
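The effective trial count can be sketched as follows, assuming the usual definition of the eigenvalue participation ratio, (Σλ)² / Σλ², applied to the correlation matrix of one instrument's strategy returns. The page does not spell the definition out, so this is an assumption, as are the function name and the NumPy dependency.

```python
import numpy as np


def effective_trials(returns):
    """Eigenvalue participation ratio of the strategy correlation matrix.

    returns: T x N matrix, one column per strategy configuration.
    Gives N when the columns are uncorrelated and 1 when they are identical.
    """
    corr = np.corrcoef(returns, rowvar=False)        # N x N correlation of the columns
    eigenvalues = np.linalg.eigvalsh(corr)           # real eigenvalues of a symmetric matrix
    return float(eigenvalues.sum() ** 2 / (eigenvalues ** 2).sum())


rng = np.random.default_rng(0)
independent = rng.standard_normal((5000, 20))        # 20 unrelated strategies
clones = np.repeat(independent[:, :1], 5, axis=1)    # 5 copies of the same strategy
print("independent:", round(effective_trials(independent), 2))
print("clones:     ", round(effective_trials(clones), 2))
```

Per the note under the table, the program computes this per instrument and sums across instruments; the sketch covers only the per-instrument step.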
E4 · the program best versus the null
Observed best 3.21, expected from noise alone 5.22 to 5.61
The ingredients of the program-scale expected-maximum-Sharpe comparison, all measured from the real corpus. The observed best falls below the skill-less expectation at both the nominal and the effective trial counts.
| quantity | value | note |
|---|---|---|
| Eligible configurations searched | 985,570 | 1.20M in total |
| Effective independent trials | 186,139 | ~19% of nominal; median absolute correlation 0.05 |
| Observed best Sharpe (ann.) | 3.21 | Nasdaq-100 proxy |
| E[max Sharpe], nominal N | 5.61 | skill-less expectation |
| E[max Sharpe], effective N | 5.22 | skill-less expectation |
| Trials to reach 3.21 by chance | ~200 | where the skill-less curve crosses the observed best |
Effective N is measured per instrument from the eigenvalue participation ratio and summed across instruments, treating instruments as independent blocks, a conservative-low estimate of cross-instrument redundancy. The observed best of 3.21 independently reproduces the program's earlier best-of-corpus figure.
The whole assembly line, chained and run once
The first four experiments dissect the program one station at a time. The capstone's final test does the opposite: it chains every station into a single disclosed system and runs it once, end to end, on a window of time that was never touched during tuning. Information-driven (dollar) bars feed triple-barrier labels; sample-uniqueness weights down-weight overlapping events; a bagged-tree meta-model with a meta-label gate decides when to act; a bet-sizing rule scales the position; Hierarchical Risk Parity allocates across whatever sleeves qualify; and a Deflated Sharpe Ratio deploy gate, set at 0.95 against the disclosed trial count, has the final say. Six instruments, 162 configurations searched in total, the last 20 percent of each series held out, 7 basis points of cost per side.
The disclosed pipeline deployed nothing. Not one instrument's tuning Deflated Sharpe cleared the 0.95 bar, so the gate held cash and realized exactly zero. The Sisyphus pick, the single best in-sample configuration, deployed blindly, lost money out of sample, at an annualized Sharpe of -0.83 and a profit factor of 0.96. The only positive realized stream over the held-out window was plain buy-and-hold, at +1.81. The thesis comes down to a single decision here, and the disclosed gate makes it correctly: it refuses to ship the lone backtester's losing pick.
E5 · one disclosed pipeline, scored once on held-out time
The gate ships nothing, the lone pick loses, only buy-and-hold is positive
Pooled realized result on the held-out out-of-time window across six instruments. The DSR-gated pipeline is the disclosed system; the Sisyphus pick is the lone backtester's best in-sample configuration deployed blindly; buy-and-hold is the passive benchmark.
| process | OOS Sharpe (ann.) | OOS PF | deploys? |
|---|---|---|---|
| assembled pipeline (DSR-gated) | 0.00 | n/a | no (held cash) |
| Sisyphus best-in-sample pick | -0.83 | 0.96 | deploys blindly |
| buy and hold | +1.81 | 1.06 | n/a |
Out-of-sample is the final 20 percent of each series, never used in tuning; cost is 7 basis points per side; the deploy gate is a Deflated Sharpe Ratio of 0.95 against the 162 disclosed configurations. No instrument cleared the gate, so the pipeline deployed zero sleeves and realized zero.
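The two bookkeeping quantities in this table, after-cost returns and profit factor, can be sketched with the standard library alone. The conventions below are assumptions rather than the project's code: the position is taken to be set before each period's return, cost is charged in proportion to the absolute change in position (so a flip from long to short pays two sides), and the helper names are hypothetical.

```python
def net_returns(positions, asset_returns, cost_per_side=0.0007):
    """Position-weighted returns minus a proportional cost on every position change.

    cost_per_side=0.0007 is 7 basis points, the figure quoted for the held-out test.
    """
    net, previous = [], 0.0
    for position, ret in zip(positions, asset_returns):
        turnover = abs(position - previous)          # one full entry or exit is one side
        net.append(position * ret - cost_per_side * turnover)
        previous = position
    return net


def profit_factor(returns):
    """Gross gains divided by gross losses; above 1.0 means the stream made money."""
    gains = sum(r for r in returns if r > 0)
    losses = -sum(r for r in returns if r < 0)
    return float("inf") if losses == 0 else gains / losses


positions = [1, 1, 0, -1]                            # long, long, flat, short
asset_returns = [0.010, -0.005, 0.020, 0.004]
stream = net_returns(positions, asset_returns)
print([round(r, 4) for r in stream])
print("profit factor:", round(profit_factor(stream), 3))
```

A profit factor below 1.0, like the 0.96 reported for the Sisyphus pick, means gross losses exceeded gross gains once costs were charged.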
Reconciling the cuts: the disclosure effect, made explicit
Two true statements look, at first, like a contradiction. The program's per-instrument scorecard reports single-series survivors that cleared a high Deflated Sharpe bar: 19 of 42 instruments in the trend-scanning study and 3 of 42 in the optimal-trading-rules study. Yet the program's single best result has a Deflated Sharpe below 0.03 against the full corpus, and E1's disclosed process deployed nothing.
Both are correct, and the gap between them is the disclosure effect. The per-instrument figure deflates each survivor against that instrument's own, small trial count; the program-best figure deflates the single best result against the full roughly one-million-trial null. The same result can clear a small local null and fail a large global one. That is precisely the effect the assembly line is built to surface: disclose every trial, deflate the winner against the true N, and a number that looked like skill is revealed as selection.
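A small worked example makes the local-versus-global gap concrete. It is a sketch under stated assumptions: the Deflated Sharpe Ratio formula follows Bailey & López de Prado (2014), Sharpe is per observation, the trial dispersion is set to the skill-less value 1/sqrt(T), returns are treated as Gaussian, and the numbers (a per-observation Sharpe of 0.12, a local search of 9 trials, a global search of 1,000,000) are made up for illustration.

```python
from math import e, sqrt
from statistics import NormalDist

GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
NORMAL = NormalDist()


def null_sharpe(n_trials, sharpe_std):
    """Expected maximum per-observation Sharpe of n_trials skill-less trials."""
    z = (1 - GAMMA) * NORMAL.inv_cdf(1 - 1 / n_trials) + GAMMA * NORMAL.inv_cdf(1 - 1 / (n_trials * e))
    return sharpe_std * z


def deflated_sharpe(sharpe, n_obs, n_trials, sharpe_std, skew=0.0, kurt=3.0):
    """Probability that the observed Sharpe exceeds the selection-adjusted null."""
    benchmark = null_sharpe(n_trials, sharpe_std)
    denom = sqrt(1 - skew * sharpe + (kurt - 1) / 4 * sharpe ** 2)
    return NORMAL.cdf((sharpe - benchmark) * sqrt(n_obs - 1) / denom)


n_obs = 1000
sharpe_std = sqrt(1 / n_obs)        # assumption: skill-less dispersion, not measured
observed = 0.12                     # hypothetical per-observation Sharpe of one winner

local = deflated_sharpe(observed, n_obs, n_trials=9, sharpe_std=sharpe_std)
full = deflated_sharpe(observed, n_obs, n_trials=1_000_000, sharpe_std=sharpe_std)
print(f"DSR against a local null of 9 trials:       {local:.3f}")
print(f"DSR against a global null of 1,000,000:     {full:.3f}")
```

The observed Sharpe is identical in both calls; only the disclosed N changes, and with it the benchmark the winner has to beat.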
Disclosure also has a constructive side, and the program shows it. Two regimes do clear the bar without leaning on a small local null: the Ornstein–Uhlenbeck rule deflated on its true mean-reverting spread (3 of 5 cointegrated pairs, beating a band control on 4 of 5), and cross-sectional ranking with a gradient-boosted model, which clears on all 10 horizons at a probability of backtest overfitting of 0.10, the program's first DSR-surviving machine-learning edge. Both approach the best static archetype near PF 1.17 rather than beating it, so the honest reading is not “nothing works” but “discipline matches the static edge, and the naive Sisyphus search does not.”
The cost of non-disclosure
Expected maximum Sharpe of skill-less trials versus N
The annualised expected maximum Sharpe of N skill-less trials (False Strategy Theorem, T = 1000 observations per trial). The whole point: with zero real edge, the best of N backtests still climbs steadily with N. This is the benchmark a disclosed N lets the Deflated Sharpe Ratio subtract.
| trials N | E[max Sharpe], annualised | note |
|---|---|---|
| 10 | 0.79 | |
| 100 | 1.27 | |
| 1,000 | 1.63 | |
| 2,500 | 1.76 | lone quant, best of 2,500 |
| 10,000 | 1.94 | |
| 100,000 | 2.20 | |
| 1,000,000 | 2.44 | |
Analytic values from the False Strategy Theorem at T = 1000. Validated against a Monte-Carlo of skill-less strategies to within Monte-Carlo error (maximum absolute difference 0.0009 in per-observation Sharpe across N from 10 to 5,000); the Sharpe inner loop is verified bit-identical to an independent reference (max difference 0.0 over 200 random matrices).
Method
- Expected-max-Sharpe curve: the False Strategy Theorem (Bailey & López de Prado) gives E[max SR] for N independent skill-less trials, each with a track of T = 1000 observations and per-trial Sharpe variance ~1/T. The curve is evaluated analytically across N from 2 to 1,000,000 and reported annualised.
- Monte-Carlo validation: for several values of N we draw thousands of independent streams of zero-mean unit-variance returns (no edge by construction), take the maximum sample Sharpe in each universe, and average. This labelled synthetic experiment is the only synthetic data anywhere in the program. It agrees with the analytic formula to within Monte-Carlo error at every N tested (max |Δ| 0.0009 in per-observation Sharpe across N from 10 to 5,000).
- Numerical verification: the Sharpe-of-a-matrix inner loop was implemented twice, once in plain array code and once as a compiled kernel, and the two were verified bit-identical (max difference 0.0 over 200 random input matrices). In this workload the random draw, not the Sharpe computation, dominates run time, so the compiled kernel is correctness insurance rather than a speed win.
- Meta-analysis scorecard: each completed study in the program is mapped to the assembly-line station it stress-tested, and its own result table supplies the configurations evaluated and the count clearing the Deflated Sharpe gate (DSR > 0.95), with Probability of Backtest Overfitting and effective-number-of-trials as supporting diagnostics. The same validation harness was applied identically across every study. Property studies (bars, fractional differentiation, leakage) carry no strategy deflation gate and are excluded from the trial total.
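The Monte-Carlo validation step can be sketched as below. This is an illustration under assumptions, not the project's harness: it uses NumPy, a single trial count (N = 100), 300 simulated universes, and a fixed seed, all chosen here for speed; Sharpe is per observation to match the tolerance quoted above.

```python
from math import e, sqrt
from statistics import NormalDist

import numpy as np

GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def analytic_max_sharpe(n_trials, n_obs):
    """Closed-form expected maximum per-observation Sharpe of skill-less trials."""
    inv = NormalDist().inv_cdf
    z = (1 - GAMMA) * inv(1 - 1 / n_trials) + GAMMA * inv(1 - 1 / (n_trials * e))
    return z / sqrt(n_obs)


def monte_carlo_max_sharpe(n_trials, n_obs, n_universes, seed=0):
    """Average, over simulated universes, of the best sample Sharpe among no-edge streams."""
    rng = np.random.default_rng(seed)
    maxima = np.empty(n_universes)
    for u in range(n_universes):
        streams = rng.standard_normal((n_obs, n_trials))   # zero mean, unit variance
        sharpes = streams.mean(axis=0) / streams.std(axis=0, ddof=1)
        maxima[u] = sharpes.max()                           # the one that would be reported
    return float(maxima.mean())


analytic = analytic_max_sharpe(100, 1000)
simulated = monte_carlo_max_sharpe(100, 1000, n_universes=300)
print(f"analytic  E[max SR] per observation: {analytic:.4f}")
print(f"simulated E[max SR] per observation: {simulated:.4f}")
```

Looping over universes keeps memory small, since only one T × N matrix is alive at a time; that matters more than speed here because, as the verification bullet notes, the random draw dominates run time.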
Notes & honest assessment
This is a process and discipline result, not a tradeable edge, and it is framed that way throughout. The expected-max-Sharpe formula shows precisely how much free Sharpe selection buys when N is hidden, and it matches a skill-less Monte-Carlo to within Monte-Carlo error. The program is itself the worked example of the assembly line: it ran every station, logged every trial, and gated everything on the same deflated bar. The honest verdict is two-sided. The null holds broadly across the naive and single-series regimes, where the gate rejected almost all of roughly 98,000 configurations. But the gate is not a blanket no: three disciplined regimes survive deflation (cross-sectional ranking with a gradient-boosted model on every horizon, the Ornstein–Uhlenbeck rule on its true mean-reverting spread, and large-universe risk parity). Each approaches but does not beat the best static archetype near PF 1.17, while the fully-disclosed end-to-end pipeline still deploys nothing in a held-out window. The assembly line plus mandatory disclosure is what separates honest research from the Sisyphus trap, and the strongest evidence for it is a program that disclosed its N and let the numbers say no to the naive search and a measured yes only to discipline. The playground is a pedagogical mechanic anchored to the real curve, not a claim that any number on it is achievable edge.
Reproducibility
The expected-max-Sharpe curve, the Monte-Carlo validation, the bit-identical parity check, the meta-analysis scorecard and the figures are collected in project 16 of lopez-de-prado-work-review. The playground on this page is self-contained: the real curve anchors are encoded in the component, so the mechanic it illustrates reproduces exactly on every load.
References
The primary sources for the framework synthesized here:
- López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. (Chapter 1, the research assembly line; and the multiple-testing discipline throughout.)
- Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. Journal of Portfolio Management, 40(5).
- Bailey, D. H., Borwein, J., López de Prado, M., & Zhu, Q. J. (2017). The Probability of Backtest Overfitting. Journal of Computational Finance, 20(4).
- López de Prado, M. (2020). Machine Learning for Asset Managers. Cambridge Elements in Quantitative Finance.
See also
The shared validation gate is the subject of Backtest Overfitting & the Deflated Sharpe Ratio; the causal-graph station is Causal Factor Investing; the same selection-discipline theme runs through The edge is in the process; and the broader body of work is at Research.

