Research review · research process · study 11

Meta-Strategy Organization

Run research like a factory assembly line of separable, specialized stations with one shared validation gate, and log every trial, because hiding the number of attempts is what turns selection into a fake edge

This is the capstone of the program. López de Prado argues that quant research should be organized like a factory assembly line of specialized, separable stations rather than around a lone craftsman, bound by two principles: separability (no one designs a bet and also grades it) and mandatory disclosure (every trial from every station is logged, so a reported winner can be deflated by the true number of attempts N). The anti-pattern he warns against is the lone “Sisyphus” quant who finds strategies by backtesting one idea over and over and reports only the best. The honest evidence for the whole thesis is this program's own scorecard: roughly 98,000 strategy configurations evaluated across the studies, gated on the same deflated bar. The null holds broadly across the naive and single-series regimes: almost nothing clears. But the gate is not a blanket no. Three disciplined regimes do survive deflation: cross-sectional ranking with a gradient-boosted model on every horizon, the Ornstein–Uhlenbeck trading rule on its true mean-reverting spread, and large-universe risk parity. They approach but do not beat the best static archetype (profit factor, PF, about 1.17), and the fully disclosed end-to-end pipeline still deploys nothing in a held-out window. The playground below shows exactly how much free Sharpe selection buys when N is hidden.

Working paper

write-up in review, not yet posted

Code & data

lopez-de-prado-work-review / projects / 16

the claim

What López de Prado says

Quant research should run like a factory assembly line of specialized, separable stations (data curators, feature analysts, strategists, backtesters and a validation station, deployment and sizing, portfolio oversight), not like a lone craftsman who does everything.
Two principles bind the line. Separability: no one designs a bet and also grades it, because the person who builds a strategy and judges it will judge it kindly. Mandatory disclosure: every trial from every station is logged, so when a winner is reported the validation station knows the true number of trials N and can deflate the result.
The warning is the lone Sisyphus quant who discovers strategies by repeatedly backtesting one idea and reporting only the best. The reason it fails is mathematical, not a matter of willpower: try enough configurations, report only the winner, and the winner looks strong even with zero real edge. Hiding N is what makes the number meaningless; disclosing N is the only thing that lets anyone correct for it.

our result

What we found

The lone backtester earns nothing out of sample. Across 1,208 walk-forward windows on 39 instruments, picking the single best in-sample strategy each quarter and deploying it gave a pooled realized out-of-sample Sharpe of negative 0.02 (95 percent band negative 0.21 to positive 0.18). The picks had a median in-sample Sharpe of 3.09 annualized and a mean in-sample-to-out-of-sample decay of 2.61 Sharpe: the winner's curse erases the whole apparent edge. Out-of-sample Sharpe was negative for 82 percent of instruments and the median hit rate was 0.40, worse than a coin flip. The disclosed process, which refuses to deploy any single strategy that fails deflation, deployed nothing.
Diversification buys deflated edge only where the market has structure. Combining many weakly-correlated strategies lowered the median in-shortlist correlation to 0.11 and raised the effective number of independent bets to about 17, but the pooled fixed-policy sleeve still realized an out-of-sample Sharpe of negative 0.64. Only 3 of 39 instruments cleared the 0.95 portfolio-level Deflated Sharpe bar, and all three are broad equity index funds (the Nasdaq-100 proxy at OOS Sharpe 2.50 and DSR 1.00, the S&P-500 proxy at 1.96, the small-cap proxy at 0.85). Every crypto, FX, sector, and volatility sleeve was negative. A portfolio of noise is still noise.
The program's own best result falls below the skill-less expectation. The probability of backtest overfitting, measured program-wide by Combinatorially-Symmetric Cross-Validation, was 0.21 (0.34 crypto, 0.11 FX, 0.00 equity). Across 985,570 eligible configurations the observed best annualized Sharpe is 3.21, while the expected maximum Sharpe of purely skill-less trials at this scale is 5.22 to 5.61. So the program's best result, impressive in isolation, sits below what selection on noise alone would produce once the search is disclosed; the skill-less curve crosses 3.21 at only about 200 trials. That gap, disclosed plainly, is the thesis.
But the null is not absolute: discipline survives where naive search does not. Three regimes clear the deflation gate honestly. Cross-sectional ranking with a gradient-boosted model clears the Deflated Sharpe bar on 10 of 10 holding horizons (probability of backtest overfitting 0.10, in-sample-to-out-of-sample rank correlation 0.78, median out-of-sample profit factor about 1.12); it is the program's first DSR-surviving machine-learning edge. The Ornstein–Uhlenbeck optimal trading rule, run on its true regime (the residual spread of a cointegrated pair rather than a single series), clears DSR on 3 of 5 pairs at a median profit factor of 2.11 and beats a Bollinger-band control on 4 of 5. And large-universe hierarchical risk parity (HRP) shows a real out-of-sample variance advantage over equal weight that grows with breadth (HRP-to-1/N variance ratio 0.87 at 25 names, 0.45 at 200). Each of these approaches but does not beat the best static structural archetype at about PF 1.17, and the fully disclosed end-to-end pipeline still deploys nothing in a held-out window, so the honest reading is that disciplined ML matches the static edge without dominating it.

The result in three lines

E1 · 1,208 windows · 39 instruments

The lone backtester earns nothing out of sample

Picking the single best in-sample strategy each quarter and deploying it gave a pooled realized out-of-sample Sharpe of negative 0.02 (95 percent band negative 0.21 to positive 0.18). The picks had a median in-sample Sharpe of 3.09 and a mean decay of 2.61 Sharpe: the winner's curse erases the whole edge. Out-of-sample Sharpe was negative on 82 percent of instruments; the median hit rate was 0.40. The disclosed process, refusing any single strategy that fails deflation, deployed nothing in all 1,208 windows.

E2 · 39 instruments · ~17 effective bets

Diversification pays only where the market has structure

Combining weakly-correlated bets cut the median in-shortlist correlation to 0.11 and raised the effective bet count to about 17, but the pooled fixed-policy sleeve still realized an out-of-sample Sharpe of negative 0.64. Only 3 of 39 instruments cleared the 0.95 portfolio Deflated Sharpe bar, and all three are broad equity index funds (QQQ at OOS 2.50 / DSR 1.00, SPY 1.96, IWM 0.85). Every crypto, FX, sector, and volatility sleeve was negative.

E4 · 985,570 trials · best 3.21

The program's best is below the skill-less expectation

Program-wide backtest overfitting was 0.21 (0.34 crypto, 0.11 FX, 0.00 equity). Across 985,570 eligible configurations the observed best annualized Sharpe is 3.21, while pure selection on noise at this scale would be expected to produce 5.22 to 5.61. The best result sits below the skill-less expectation once the search is disclosed; the curve crosses 3.21 at only about 200 trials. That gap is the disclosure effect, and the thesis.

Discipline · 3 regimes clear DSR

But the null is not absolute: discipline survives

The gate is not a blanket no. Cross-sectional ranking with a gradient-boosted model clears the Deflated Sharpe bar on 10 of 10 horizons (PBO 0.10, IS→OOS rank ρ 0.78, median PF ~1.12); it is the program's first DSR-surviving ML edge. OU on its true mean-reverting spread clears DSR on 3 of 5 cointegrated pairs (median PF 2.11). Large-universe risk parity's variance advantage over 1/N grows with breadth (variance ratio 0.87 → 0.45 from 25 to 200 names). Each approaches but does not beat the best static archetype (~PF 1.17), and the fully disclosed pipeline still deploys nothing in the held-out window.

Disclose your trials

Drag the number of configurations N a researcher quietly searched. The curve is the expected maximum Sharpe of skill-less strategies; with no real edge, the best of N still climbs with N. The lone-quant marker sits at 2,500 silent trials. This is a pedagogical mechanic, not a tradeable claim.

The point of the playground is that selection masquerades as skill unless every trial is disclosed. Slide N up and the expected maximum Sharpe of a no-edge search climbs with it; by 2,500 trials it is near 1.8, a number that would look like a serious strategy. A lone quant who reports only the best of those trials and hides N hands you that figure as if it were skill. Disclosing N is the one thing that lets the Deflated Sharpe Ratio subtract this benchmark and deflate the winner toward the 0.95 chance bar. That disclosure discipline is exactly what the assembly line institutionalizes.

Demo: disclose your trials

[Interactive demo: DSR at T = 1000. Controls are the number of trials searched N (10 to 1e6, counting every silent re-run) and the best Sharpe reported (0.5 to 3.5, annualised, the one number that gets published). Readouts are the expected max Sharpe with no skill, the significance gap, the naive read, and the honest (deflated) read. In the captured state both the best Sharpe reported and the no-skill expected max read 1.76.]

Fig. 1: The Sisyphus trap. With zero true edge by construction, the best of N backtests still climbs with N; selection alone buys about 1.8 annualised Sharpe at 2,500 silent trials. Monte-Carlo points of skill-less strategies sit on the analytic curve. Disclosing N lets the Deflated Sharpe Ratio subtract this benchmark.

The Sisyphus trap, quantified

The False Strategy Theorem gives the expected maximum Sharpe of N skill-less trials, each estimated over a track of T observations. With zero true edge by construction, the best of N backtests still climbs steadily with N, and that climb is pure selection. At 2,500 silent trials, the size of this program's per-instrument crypto corpus, selection alone buys about 1.8 annualised Sharpe. If the quant never tells you N, you have no way to know the 1.8 is a mirage. Disclosing N is exactly what lets the Deflated Sharpe Ratio subtract this benchmark and reveal the winner as luck.
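For reference, the approximation usually quoted for this expected maximum (Bailey and López de Prado) can be written as below. This is a sketch under stated assumptions: trial Sharpes are independent with mean zero and variance V, Φ⁻¹ is the standard normal quantile function, and γ ≈ 0.5772 is the Euler-Mascheroni constant. The symbol names are ours, not necessarily the study's.

```latex
\mathbb{E}\left[\max_{n \le N} \widehat{SR}_n\right]
\approx \sqrt{V}\,\left[(1-\gamma)\,\Phi^{-1}\!\left(1-\frac{1}{N}\right)
+ \gamma\,\Phi^{-1}\!\left(1-\frac{1}{N e}\right)\right]
```

As a worked check at N = 2,500 with V ≈ 1/T and T = 1000: the bracket is about 3.51 and the square root of V is about 0.0316, so the per-observation expected maximum is about 0.111. Annualising by the square root of 252 (an assumed convention, chosen because it reproduces the figure quoted in the text) gives about 1.76.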

Six separable stations, one shared gate

The schematic lays out the six stations and maps each completed study to the station it stress-tested: data curators (information-driven bars, fractional differentiation), feature analysts (structural breaks and entropy, microstructural features, causal factors), strategists (triple-barrier and meta-labeling, trend-scanning labels), backtesters and validation (purged cross-validation, ensembles and importance, and the overfitting and Deflated Sharpe harness itself), deployment and sizing (bet sizing, optimal trading rules), and portfolio oversight (portfolio construction). No single study did everything, and no study graded its own work, because the validation station was shared and applied identically. The disclosure loop logs every trial so the gate can deflate by the true N.

Fig. 2: Six specialized, separable stations feeding one shared validation gate, with each completed study mapped to the station it stress-tested and a disclosure loop that logs every trial so the gate can deflate by the true N. The opposite is the lone quant who hides N and overfits by construction.

Fig. 3: Configurations evaluated versus configurations clearing the deflation gate, per study. The naive and single-series search clears almost nothing; three disciplined regimes (cross-sectional ranking, OU on its true spread, large-universe risk parity) do clear, but approach rather than beat the best static archetype.

The program scorecard: the gate held broadly, but discipline cleared it

The scorecard sets configurations evaluated against configurations clearing the deflation gate, per study. Across the naive and single-series studies the line evaluated roughly 98,000 configurations and almost none cleared; the 92,500-strategy overfitting corpus clears in 0 of 3 markets. But the gate is not a blanket no. Three disciplined regimes survive deflation honestly: cross-sectional ranking with a gradient-boosted model clears DSR on 10 of 10 horizons (the program's first DSR-surviving machine-learning edge), the Ornstein–Uhlenbeck rule on its true mean-reverting spread clears 3 of 5 cointegrated pairs, and large-universe risk parity's out-of-sample variance advantage over equal weight grows with breadth. Each approaches but does not beat the best static archetype near PF 1.17. The factory ran end to end, and the gate said no to the naive search and yes only to discipline, which is the process working as intended.

The program as an assembly line, a meta-analysis

Every study, gated on one deflated bar

Every number below is read from the studies' own result tables. The verdict is two-sided: the naive and single-series studies (roughly 98,000 configurations) clear almost nothing, while three disciplined regimes do survive the same deflated bar (cross-sectional ranking with a gradient-boosted model, the Ornstein–Uhlenbeck rule on its true mean-reverting spread, and large-universe risk parity). The data-representation studies (information-driven bars, fractional differentiation) and the leakage study are statistical-property work, cost-free by nature, so they carry no strategy-level deflation gate; they are excluded from the trial total and marked n/a in the gate column.

study	station	trials	clear DSR>0.95	claim reproduced
Backtest overfitting & DSR	Validation	92,500	0 / 3 markets	Yes, best Sharpe below the null E[max] in all 3 markets; DSR 0.00–0.03
Information-driven bars	Data representation	n/a	n/a	Yes, excess kurtosis collapses toward Gaussian in all 3 markets (property study)
Fractional differentiation	Data representation	n/a	n/a	Yes, FFD keeps ~0.98 memory while buying stationarity (property study)
Triple-barrier + meta-labeling	Labeling	1,134	0 / 42	Partly, mechanics reproduce; meta-label adds no deflated edge
Purged CV / CPCV leakage	Validation	n/a	n/a	Yes, naive k-fold inflates the score; purging removes it (property study)
Trend-scanning labels	Labeling	378	19 / 42	Partly, DSR hits exist but the labeller ties its fixed-horizon control (19 vs 18 of 42)
Structural breaks & entropy	Features	108	0	Yes as features, but not tradeable; best DSR 0.002
Microstructural features	Features	144	0	Partly, estimators cheap, but nothing tradeable after costs
Cross-sectional ML (LightGBM ranking)	Model / strategy	10 horizons	10 / 10	Yes, the program's first DSR-surviving ML edge; PBO 0.10, IS→OOS rank ρ 0.78, median PF ~1.12; approaches but does not beat the best static archetype
Ensembles & feature importance	Model / importance	1,280	0 / 40	Yes on the science, bagging gap ~5× smaller, MDA less biased; no deflated edge
Bet sizing	Sizing	1,344	0 / 42	Partly, sizing changes turnover, not deflated performance (PBO ~0.49)
Optimal trading rules (OU on a cointegrated spread)	Sizing / exits	5 pairs	3 / 5	Yes in its true regime, on the mean-reverting residual spread the OU rule clears DSR on 3 of 5 pairs (median PF 2.11) and beats the band control on 4 of 5; the single-series mesh still ties its IS-tuned control
Portfolio construction (HRP / large universe)	Allocation	5 sizes	ranking	Yes on ranking, HRP's out-of-sample variance advantage over 1/N grows with breadth (HRP/1N variance 0.87 at N=25 → 0.45 at N=200); 0 clear an absolute DSR on this net-noisy panel
Causal factor investing	Features / validation	9	0	Yes, MC reproduces the bias; no real factor survives adjustment + deflation (best DSR 0.45)

“Trials” counts strategy configurations evaluated; property studies carry no strategy deflation gate and are excluded from the ~98,000 naive-regime total. The only synthetic data anywhere in the program is the labelled Monte-Carlo that validates the expected-max-Sharpe formula; everything else is real market data. The disciplined regimes that clear the gate (cross-sectional ranking on 10 of 10 horizons, OU on its true cointegrated spread on 3 of 5 pairs, and large-universe risk parity) approach but do not beat the best static archetype near PF 1.17; the older single-series trend-scan and OU-mesh bests still tie their own in-sample-tuned controls.

Four experiments on the program's own corpus

About 1.20 million strategy configurations, across 42 instruments, put on trial

The scorecard above is the program reading its own result tables. The four experiments below go further: they re-run the organizational thesis as a controlled, after-cost experiment on the program's own corpus of real strategy results, about 1.20 million strategy configurations across 42 instruments spanning crypto, equities, and foreign exchange, with every realized stream net of costs and purged of look-ahead by construction. Each experiment answers one question the thesis raises, and where the result is negative it is reported as negative.

Fig. 4: Sisyphus versus the disclosed assembly line. Left: every instrument's best in-sample pick falls far below the no-decay line, the winner's curse made visible. Right: the lone backtester's pooled realized out-of-sample equity round-trips back to about zero while the disclosed discipline, refusing to deploy any single strategy that fails deflation, holds flat.

E1 · Sisyphus versus the disclosed assembly line

Across 1,208 walk-forward windows on 39 instruments (one year in-sample, one quarter out-of-sample, stepped quarterly), we ran two research processes side by side on the same candidate pool. The lone backtester picks the single best in-sample strategy each window and deploys it; the disclosed assembly line gates the pool on the False Strategy Theorem null for the full number of trials searched and deploys only strategies that clear the Deflated Sharpe bar.

The lone backtester's picks had a median in-sample Sharpe of 3.09 annualized and a pooled realized out-of-sample Sharpe of negative 0.02 (95 percent band negative 0.21 to positive 0.18). The mean in-sample-to-out-of-sample decay was 2.61 annualized Sharpe: the winner's curse erases the entire apparent edge. Realized out-of-sample Sharpe was negative for 82 percent of instruments, and the median hit rate was 0.40, worse than a coin flip. The disclosed process, insisting that any single strategy clear the disclosed-trial null, deployed nothing in any of the 1,208 windows; the best single strategy's Deflated Sharpe was effectively zero every window (per-window maximum 0.0001), and relaxing the bar from 0.95 to 0.60 changed nothing. Refusing to deploy is not a failure of the discipline; it is the discipline correctly reporting that no single backtest survives once you admit how many were tried.
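The lone-backtester half of this experiment can be sketched as follows. This is a minimal illustration on pure noise, not the study's code: the window lengths of 252 and 63 observations stand in for "one year" and "one quarter" of daily bars (an assumption), and `walk_forward_windows`, `lone_backtester`, and `make_noise_streams` are hypothetical names.

```python
import math
import random

DAYS_PER_YEAR = 252     # assumption: daily bars, one year in-sample
DAYS_PER_QUARTER = 63   # assumption: one quarter out-of-sample


def sharpe(returns):
    """Per-observation Sharpe (mean / population std); 0.0 for a flat series."""
    n = len(returns)
    mean = math.fsum(returns) / n
    var = math.fsum((r - mean) ** 2 for r in returns) / n
    return mean / math.sqrt(var) if var > 0 else 0.0


def walk_forward_windows(n_obs, is_len=DAYS_PER_YEAR, oos_len=DAYS_PER_QUARTER):
    """Yield (in-sample, out-of-sample) index ranges, stepped by one OOS length."""
    start = 0
    while start + is_len + oos_len <= n_obs:
        split = start + is_len
        yield range(start, split), range(split, split + oos_len)
        start += oos_len


def lone_backtester(streams):
    """Per window, deploy the best in-sample stream; return (IS, OOS) Sharpe pairs."""
    pairs = []
    for is_idx, oos_idx in walk_forward_windows(len(streams[0])):
        is_sr = [sharpe([s[t] for t in is_idx]) for s in streams]
        best = max(range(len(streams)), key=is_sr.__getitem__)
        oos_sr = sharpe([streams[best][t] for t in oos_idx])
        pairs.append((is_sr[best], oos_sr))
    return pairs


def make_noise_streams(n_strategies, n_obs, seed=0):
    """Zero-edge return streams: every strategy is i.i.d. standard normal noise."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_strategies)]
```

Running `lone_backtester(make_noise_streams(100, 252 + 63 * 20))` yields 20 windows in which the selected in-sample Sharpe is reliably positive while the out-of-sample Sharpe of the same pick scatters around zero, which is the decay pattern the paragraph above describes. Multiply by the square root of 252 to compare with annualised figures.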

E2 · The meta-strategy portfolio

López de Prado's constructive answer is not one great strategy but many weak, weakly-correlated ones combined, deflating the sleeve rather than each part. We built diversified shortlists per instrument and window, allocated four ways, and deflated the sleeve against a portfolio-level null.

The diversification was real: the median in-shortlist absolute correlation fell to 0.11 and the effective number of independent bets rose to about 17. But combining weak bets mostly combined noise. The pooled fixed-policy sleeve had a realized out-of-sample Sharpe of negative 0.64, and only 3 of 39 instruments cleared the 0.95 Deflated Sharpe bar at the portfolio level; all three are broad equity index funds. Every crypto, foreign-exchange, sector, and volatility sleeve was negative. Diversification lowers correlation and raises the effective bet count exactly as advertised; whether that buys deflated edge depends entirely on whether the market carries persistent structure after costs. A portfolio of noise is still noise.

Fig. 5: The meta-strategy portfolio. Diversified sleeves reach low correlation across all markets, but only broad equity indices turn that into positive out-of-sample Sharpe.

Fig. 5b: Portfolio-level Deflated Sharpe by instrument; only the three equity index sleeves clear the 0.95 bar.

E2 · the only sleeves that cleared

Three broad equity indices, and nothing else

Of 39 instruments, only three cleared the 0.95 portfolio-level Deflated Sharpe bar, and all three are broad equity index funds. Every crypto, foreign-exchange, sector, and volatility sleeve had a negative realized out-of-sample Sharpe and a portfolio Deflated Sharpe at or near zero.

sleeve	what it is	OOS Sharpe (ann.)	portfolio DSR
QQQ	Nasdaq-100 index proxy	2.50	1.00
SPY	S&P 500 index proxy	1.96	1.00
IWM	small-cap (Russell 2000) proxy	0.85	1.00

Realized out-of-sample Sharpe of the best deflated sleeve, with the portfolio-level Deflated Sharpe Ratio. The pooled fixed-policy sleeve across all 39 instruments was negative 0.64; the median portfolio Deflated Sharpe was essentially zero. The edge that survives is the market, not the search.

Fig. 6: Program probability of backtest overfitting. The cross-validation out-of-sample logit distribution by market: equity mass sits far to the right (overfitting near zero), while crypto and foreign exchange straddle the boundary, giving a program-wide overfitting probability of 0.21.

E3 · Program-level probability of backtest overfitting

The Probability of Backtest Overfitting asks how often the best in-sample strategy lands in the bottom half out of sample under Combinatorially-Symmetric Cross-Validation. A value near 0.5 means in-sample ranking carries no out-of-sample information. We ran it on the real net-daily-PnL matrix of each instrument, sampling up to 1,500 activity-filtered strategies per instrument with a fixed seed.

The program-wide probability of backtest overfitting was 0.21. By market it was 0.34 for crypto, 0.11 for foreign exchange, and 0.00 for equities. Equity index rankings are stable out of sample; crypto rankings are close to a coin flip; foreign exchange sits in between. The ordering reproduces the program's earlier per-corpus study on a smaller grid, with the larger and more diverse daily corpus here showing somewhat more overfitting room in crypto and foreign exchange, exactly what one expects when the search space grows.
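The diagnostic itself can be sketched in a few lines. This is a simplified illustration of Combinatorially-Symmetric Cross-Validation, not the study's harness: it assumes Sharpe as the ranking metric and eight equal time blocks, omits the activity filter and sampling described above, and uses the hypothetical name `pbo_cscv`.

```python
import itertools
import math


def sharpe(returns):
    """Per-observation Sharpe (mean / population std); 0.0 for a flat series."""
    n = len(returns)
    mean = math.fsum(returns) / n
    var = math.fsum((r - mean) ** 2 for r in returns) / n
    return mean / math.sqrt(var) if var > 0 else 0.0


def pbo_cscv(pnl, n_blocks=8):
    """Probability of backtest overfitting from a PnL matrix.

    pnl is a list of T rows, each a list of N strategy returns for one period.
    Every way of choosing half the time blocks as in-sample is evaluated; the
    other half is out-of-sample. PBO is the share of splits where the best
    in-sample strategy ranks at or below the out-of-sample median (logit <= 0).
    """
    n_strat = len(pnl[0])
    size = len(pnl) // n_blocks  # trailing rows beyond n_blocks * size are dropped
    blocks = [range(b * size, (b + 1) * size) for b in range(n_blocks)]
    logits = []
    for is_ids in itertools.combinations(range(n_blocks), n_blocks // 2):
        is_rows = [t for b in is_ids for t in blocks[b]]
        oos_rows = [t for b in range(n_blocks) if b not in is_ids for t in blocks[b]]
        is_sr = [sharpe([pnl[t][j] for t in is_rows]) for j in range(n_strat)]
        oos_sr = [sharpe([pnl[t][j] for t in oos_rows]) for j in range(n_strat)]
        best = max(range(n_strat), key=is_sr.__getitem__)
        rank = sum(s <= oos_sr[best] for s in oos_sr)  # 1 = worst, N = best
        w = rank / (n_strat + 1)                       # relative rank in (0, 1)
        logits.append(math.log(w / (1.0 - w)))
    return sum(x <= 0 for x in logits) / len(logits)
```

With one strategy that genuinely dominates in every block the function returns 0.0, matching the equity reading above; on a matrix of pure noise it moves toward 0.5, where in-sample ranking carries no out-of-sample information.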

E3 · overfitting by market

The same diagnostic separates the markets cleanly

Probability of backtest overfitting by market, pooled across instruments via Combinatorially-Symmetric Cross-Validation on the real net-daily-PnL matrices. A value near 0.5 means in-sample ranking carries no out-of-sample information.

market	PBO	reading
Equities	0.00	rankings stable out of sample
Foreign exchange	0.11	in between
Crypto	0.34	close to a coin flip
Program-wide	0.21	pooled across all instruments

Program-wide probability of backtest overfitting 0.21 (median per instrument 0.22), with a per-instrument range from 0.00 to 0.63. Backtest overfitting is real and strongly market-dependent.

E4 · The program's best result, against the skill-less expectation

The False Strategy Theorem says the expected maximum Sharpe of N skill-less trials grows with N and with the dispersion of trial Sharpes. We measured the inputs from the real corpus: the trial count, the empirical dispersion of per-strategy Sharpes, and the effective number of independent trials from the eigenvalue participation ratio of the real correlation structure.

The program searched 985,570 eligible configurations. The real correlation structure is loose (median absolute correlation 0.05), so the effective number of independent trials is 186,139, about 19 percent of nominal. The observed best annualized Sharpe across the entire corpus is 3.21, on the Nasdaq-100 proxy. The expected maximum Sharpe of purely skill-less trials at this scale is 5.61 under the nominal count and 5.22 under the effective count. This is the single most striking number in the study: the program's own best result, impressive in isolation, sits below what selection on noise alone would be expected to produce once the roughly one million trials are disclosed. The skill-less curve crosses 3.21 at only about 200 trials. With a search this large, a Sharpe of 3.21 is not even keeping up with chance. This is the thesis in one figure.

Fig. 7: Expected maximum Sharpe versus number of trials, built from the real dispersion of strategy Sharpes and the real effective number of independent trials. The observed best of 3.21 sits below the skill-less expectation at the program's nominal and effective trial counts; the curve crosses 3.21 at only about 200 trials.

E4 · the program best versus the null

Observed best 3.21, expected from noise alone 5.22 to 5.61

The ingredients of the program-scale expected-maximum-Sharpe comparison, all measured from the real corpus. The observed best falls below the skill-less expectation at both the nominal and the effective trial counts.

quantity	value	note
Eligible configurations searched	985,570	1.20M in total
Effective independent trials	186,139	~19% of nominal; median \|corr\| 0.05
Observed best Sharpe (ann.)	3.21	Nasdaq-100 proxy
E[max Sharpe], nominal N	5.61	skill-less expectation
E[max Sharpe], effective N	5.22	skill-less expectation
Trials to reach 3.21 by chance	~200	where the skill-less curve crosses the observed best

Effective N is measured per instrument from the eigenvalue participation ratio and summed across instruments, treating instruments as independent blocks, a conservative-low estimate of cross-instrument redundancy. The observed best of 3.21 independently reproduces the program's earlier best-of-corpus figure.
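A minimal sketch of the effective-N measurement follows. It assumes the participation ratio takes its usual form, the squared sum of eigenvalues divided by the sum of squared eigenvalues; the source does not spell out the formula, so treat that as an assumption. The helper names are hypothetical.

```python
def participation_ratio(corr):
    """Effective number of independent series from a correlation matrix.

    Computes (sum of eigenvalues)^2 / (sum of squared eigenvalues). For a
    symmetric matrix the trace equals the eigenvalue sum and the squared
    Frobenius norm equals the sum of squared eigenvalues, so no
    eigendecomposition is needed.
    """
    n = len(corr)
    trace = sum(corr[i][i] for i in range(n))
    frobenius_sq = sum(c * c for row in corr for c in row)
    return trace * trace / frobenius_sq


def program_effective_n(per_instrument_corrs):
    """Sum per-instrument effective N, treating instruments as independent blocks."""
    return sum(participation_ratio(c) for c in per_instrument_corrs)


def equicorrelation(n, rho):
    """Illustrative n x n correlation matrix with a constant off-diagonal rho."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]
```

Uncorrelated trials give an effective N equal to the nominal count, perfectly correlated trials collapse to 1, and any cross-instrument correlation that the block-sum ignores would lower the true figure further.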

Fig. 8: The research assembly line chained as one disclosed system and run once on a never-touched out-of-time window. Top: nine stations from dollar bars to the Deflated Sharpe deploy gate; 162 configurations disclosed, zero sleeves cleared the gate. Bottom: realized out-of-time equity; the gated pipeline holds cash (flat), the Sisyphus pick drifts negative, and only buy-and-hold rises.

The whole assembly line, chained and run once

The first four experiments dissect the program one station at a time. The capstone's final test does the opposite: it chains every station into a single disclosed system and runs it once, end to end, on a window of time that was never touched during tuning. Information-driven (dollar) bars feed triple-barrier labels; sample-uniqueness weights down-weight overlapping events; a bagged-tree meta-model with a meta-label gate decides when to act; a bet-sizing rule scales the position; Hierarchical Risk Parity allocates across whatever sleeves qualify; and a Deflated Sharpe Ratio deploy gate, set at 0.95 against the disclosed trial count, has the final say. The run covers six instruments and 162 configurations searched in total, with the last 20 percent of each series held out and 7 basis points of cost per side.

The disclosed pipeline deployed nothing. Not one instrument's tuning Deflated Sharpe cleared the 0.95 bar, so the gate held cash and realized exactly zero. The Sisyphus pick (the single best in-sample configuration, deployed blindly) lost money out of sample, at an annualized Sharpe of -0.83 and a profit factor of 0.96. The only positive realized stream over the held-out window was plain buy-and-hold, at +1.81. The thesis comes down to a single decision here, and the disclosed gate makes it correctly: it refuses to ship the lone backtester's losing pick.

E5 · one disclosed pipeline, scored once on held-out time

The gate ships nothing, the lone pick loses, only buy-and-hold is positive

Pooled realized result on the held-out out-of-time window across six instruments. The DSR-gated pipeline is the disclosed system; the Sisyphus pick is the lone backtester's best in-sample configuration deployed blindly; buy-and-hold is the passive benchmark.

process	OOS Sharpe (ann.)	OOS PF	deploys?
assembled pipeline (DSR-gated)	0.00	n/a	NO, held cash
Sisyphus best-in-sample pick	-0.83	0.96	deploys blindly
buy and hold	+1.81	1.06	n/a

Out-of-sample is the final 20 percent of each series, never used in tuning; cost is 7 basis points per side; the deploy gate is a Deflated Sharpe Ratio of 0.95 against the 162 disclosed configurations. No instrument cleared the gate, so the pipeline deployed zero sleeves and realized zero.

Reconciling the cuts: the disclosure effect, made explicit

Two true statements look, at first, like a contradiction. The program's per-instrument scorecard reports single-series survivors that cleared a high Deflated Sharpe bar: 19 of 42 instruments in the trend-scanning study and 3 of 42 in the optimal-trading-rules study. Yet the program's single best result has a Deflated Sharpe below 0.03 against the full corpus, and E1's disclosed process deployed nothing.

Both are correct, and the gap between them is the disclosure effect. The per-instrument figure deflates each survivor against that instrument's own, small trial count; the program-best figure deflates the single best result against the full roughly one-million-trial null. The same result can clear a small local null and fail a large global one. That is precisely the effect the assembly line is built to surface: disclose every trial, deflate the winner against the true N, and a number that looked like skill is revealed as selection.
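The local-versus-global effect can be illustrated with a short sketch of the Deflated Sharpe Ratio in its commonly published form. Everything numeric here is hypothetical: a per-observation Sharpe of 0.11 over T = 1000 observations, trial-Sharpe variance of 1/T, and normal returns (skewness 0, kurtosis 3). The function names are ours, and the study's harness may differ in detail.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, sharpe_variance):
    """Expected maximum Sharpe of n_trials skill-less trials (per observation)."""
    z1 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    z2 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1.0 - EULER_GAMMA) * z1 + EULER_GAMMA * z2)


def deflated_sharpe(sr, n_trials, n_obs, sharpe_variance, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds the skill-less benchmark.

    sr is the observed per-observation Sharpe of the selected strategy; the
    benchmark is the expected maximum over the disclosed number of trials.
    """
    sr0 = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr * sr)
    return _STD_NORMAL.cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)


# Hypothetical winner, deflated against a small local N and a large global N.
T = 1000
local = deflated_sharpe(0.11, n_trials=10, n_obs=T, sharpe_variance=1.0 / T)
global_ = deflated_sharpe(0.11, n_trials=1_000_000, n_obs=T, sharpe_variance=1.0 / T)
```

Under these assumptions `local` clears the 0.95 bar while `global_` falls far below it: the observed Sharpe is identical, and only the disclosed trial count changes.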

Disclosure also has a constructive side, and the program shows it. Two regimes do clear the bar without leaning on a small local null: the Ornstein–Uhlenbeck rule deflated on its true mean-reverting spread (3 of 5 cointegrated pairs, beating a band control on 4 of 5), and cross-sectional ranking with a gradient-boosted model, which clears on all 10 horizons at a probability of backtest overfitting of 0.10, the program's first DSR-surviving machine-learning edge. Both approach the best static archetype near PF 1.17 rather than beating it, so the honest reading is not “nothing works” but “discipline matches the static edge, and the naive Sisyphus search does not.”

The cost of non-disclosure

Expected maximum Sharpe of skill-less trials versus N

The annualised expected maximum Sharpe of N skill-less trials (False Strategy Theorem, T = 1000 observations per trial). The whole point: with zero real edge, the best of N backtests still climbs steadily with N. This is the benchmark a disclosed N lets the Deflated Sharpe Ratio subtract.

trials N	E[max Sharpe], annualised	note
10	0.79
100	1.27
1,000	1.63
2,500	1.76	lone quant, best of 2,500
10,000	1.94
100,000	2.20
1,000,000	2.44

Analytic values from the False Strategy Theorem at T = 1000. Validated against a Monte-Carlo of skill-less strategies to within Monte-Carlo error (maximum absolute difference 0.0009 in per-observation Sharpe across N from 10 to 5,000); the Sharpe inner loop is verified bit-identical to an independent reference (max difference 0.0 over 200 random matrices).
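The table can be reproduced with a short sketch, assuming the approximation commonly attributed to Bailey and López de Prado, a per-trial Sharpe variance of 1/T, and annualisation by the square root of 252 periods per year. The annualisation factor is not stated in the text; it is an assumption that happens to match the table to two decimals. A deliberately small Monte-Carlo check is included, far smaller than the one described in the note above.

```python
import math
import random
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
PERIODS_PER_YEAR = 252  # assumption: daily observations
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, n_obs):
    """Analytic per-observation expected max Sharpe of skill-less trials."""
    z1 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    z2 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    bracket = (1.0 - EULER_GAMMA) * z1 + EULER_GAMMA * z2
    return math.sqrt(1.0 / n_obs) * bracket  # per-trial Sharpe variance ~ 1/T


def annualised(sharpe_per_obs):
    return sharpe_per_obs * math.sqrt(PERIODS_PER_YEAR)


def monte_carlo_max_sharpe(n_trials, n_obs, n_universes, seed=0):
    """Average best-of-N sample Sharpe over universes of pure noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_universes):
        best = -math.inf
        for _ in range(n_trials):
            xs = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]  # zero edge by construction
            mean = math.fsum(xs) / n_obs
            var = math.fsum((x - mean) ** 2 for x in xs) / n_obs
            best = max(best, mean / math.sqrt(var))
        total += best
    return total / n_universes


for n in (10, 100, 1_000, 2_500, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,d}  {annualised(expected_max_sharpe(n, 1000)):.2f}")
```

Calling `monte_carlo_max_sharpe(20, 200, 200)` and comparing it with `expected_max_sharpe(20, 200)` gives a quick sanity check that the simulated maximum sits close to the analytic curve; tighter agreement needs many more universes than this sketch draws.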

Method

Expected-max-Sharpe curve: the False Strategy Theorem (Bailey & López de Prado) gives E[max SR] for N independent skill-less trials, each with a track record of T = 1000 observations and a per-trial Sharpe variance of approximately 1/T. The curve is evaluated analytically across N from 2 to 1,000,000 and reported annualised.
Monte-Carlo validation: for several values of N we draw thousands of independent streams of zero-mean, unit-variance returns (no edge by construction), group them into universes of N streams, take the maximum sample Sharpe in each universe, and average across universes. This labelled synthetic experiment is the only synthetic data anywhere in the program. It agrees with the analytic formula to within Monte-Carlo error at every N tested (max |Δ| 0.0009 in per-observation Sharpe across N from 10 to 5,000).
Numerical verification: the Sharpe-of-a-matrix inner loop was implemented twice, once in plain array code and once as a compiled kernel, and the two were verified bit-identical (max difference 0.0 over 200 random input matrices). In this workload the random draw, not the Sharpe computation, dominates run time, so the compiled kernel is correctness insurance rather than a speed win.
Meta-analysis scorecard: each completed study in the program is mapped to the assembly-line station it stress-tested, and its own result table supplies the configurations evaluated and the count clearing the Deflated Sharpe gate (DSR > 0.95), with Probability of Backtest Overfitting and effective-number-of-trials as supporting diagnostics. The same validation harness was applied identically across every study. Property studies (bars, fractional differentiation, leakage) carry no strategy deflation gate and are excluded from the trial total.
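The Monte-Carlo validation step can be sketched in a few lines of standard-library Python. This is a deliberately small illustration rather than the program's harness: the sizes (N = 10 trials, T = 100 observations, 300 universes), the fixed seed, and all function names are assumptions chosen so that it runs in about a second, not the T = 1000 setting reported above.

```python
import random
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329


def analytic_expected_max(n_trials: int, n_obs: int) -> float:
    """Per-observation E[max SR] for skill-less trials, assuming variance 1/n_obs."""
    nd = NormalDist()
    z1 = nd.inv_cdf(1.0 - 1.0 / n_trials)
    z2 = nd.inv_cdf(1.0 - 1.0 / (n_trials * e))
    return sqrt(1.0 / n_obs) * ((1.0 - EULER_GAMMA) * z1 + EULER_GAMMA * z2)


def sample_sharpe(returns: list) -> float:
    """Per-observation sample Sharpe: mean over sample standard deviation."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / sqrt(var)


def monte_carlo_expected_max(n_trials: int, n_obs: int,
                             n_universes: int, seed: int = 0) -> float:
    """Average, over universes, of the best Sharpe among n_trials zero-edge streams."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_universes):
        total += max(
            sample_sharpe([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
            for _ in range(n_trials)
        )
    return total / n_universes


mc_value = monte_carlo_expected_max(n_trials=10, n_obs=100, n_universes=300)
fst_value = analytic_expected_max(n_trials=10, n_obs=100)
print(f"Monte-Carlo: {mc_value:.4f}  analytic: {fst_value:.4f}")
```

Every stream has zero mean by construction, so any positive average maximum is pure selection. Tightening the agreement is a matter of raising the number of universes, at the cost of run time.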

Notes & honest assessment

This is a process and discipline result, not a tradeable edge, and it is framed that way throughout. The expected-max-Sharpe formula shows precisely how much free Sharpe selection buys when N is hidden, and it matches a skill-less Monte-Carlo to within Monte-Carlo error. The program is itself the worked example of the assembly line: it ran every station, logged every trial, and gated everything on the same deflated bar. The honest verdict is two-sided. The null holds broadly across the naive and single-series regimes, where the gate rejected almost all of roughly 98,000 configurations. But the gate is not a blanket no. Three disciplined regimes survive deflation: cross-sectional ranking with a gradient-boosted model on every horizon, the Ornstein–Uhlenbeck rule on its true mean-reverting spread, and large-universe risk parity. Each approaches but does not beat the best static archetype near PF 1.17, and the fully-disclosed end-to-end pipeline still deploys nothing in a held-out window. The assembly line plus mandatory disclosure is what separates honest research from the Sisyphus trap, and the strongest evidence for it is a program that disclosed its N and let the numbers say no to the naive search and a measured yes only to discipline. The playground is a pedagogical mechanic anchored to the real curve, not a claim that any number on it is achievable edge.

Reproducibility

The expected-max-Sharpe curve, the Monte-Carlo validation, the bit-identical parity check, the meta-analysis scorecard and the figures are collected in project 16 of lopez-de-prado-work-review. The playground on this page is self-contained: the real curve anchors are encoded in the component, so the mechanic it illustrates reproduces exactly on every load.

Cite

Cite as

Gatto, D. V. (2026). Meta-Strategy Organization: The Research Assembly Line and Mandatory Disclosure. Working paper. Review and synthesis of López de Prado's organizational thesis.

@techreport{gatto2026metastrategy,
  author      = {Gatto, Daniel V.},
  title       = {Meta-Strategy Organization: The Research Assembly
                 Line and Mandatory Disclosure},
  year        = {2026},
  type        = {Working paper},
  note        = {Review and synthesis of Lopez de Prado's
                 organizational thesis}
}

References

The primary sources for the framework synthesized here:

López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. (Chapter 1, the research assembly line; and the multiple-testing discipline throughout.)
Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. Journal of Portfolio Management, 40(5).
Bailey, D. H., Borwein, J., López de Prado, M., & Zhu, Q. J. (2017). The Probability of Backtest Overfitting. Journal of Computational Finance, 20(4).
López de Prado, M. (2020). Machine Learning for Asset Managers. Cambridge Elements in Quantitative Finance.