Research review · López de Prado · labeling & cross-validation
Labeling & Cross-Validation
How you label and how you validate can manufacture phantom edge or dissolve it. Three multi-market, costed, DSR-gated studies on getting both right.
Two decisions sit upstream of every financial machine-learning result: how you turn raw prices into a learnable target (the label), and how you hold out data to score the model (the cross-validation). Get them wrong and a backtest manufactures edge that was never there; get them right and a lot of apparent edge dissolves. This page reproduces three of López de Prado's methods across the same three real, fully costed markets (crypto, US equities and forex; 42 instruments in total) and reports them against one honest bar, the Deflated Sharpe Ratio (DSR). The interactive explorer below lets you watch standard k-fold leak and see why a single backtest split barely means anything.
Advances in Financial ML, Ch. 3 & 7
López de Prado (2018) · labeling & cross-validation
Code & data, three reproductions
github.com/DaruFinance/lopez-de-prado-work-review · projects 03 / 04 / 05
Cross-validation
the claim
What López de Prado says
Standard k-fold assumes IID observations, but financial labels are built from a window of future bars, so overlapping labels share information. Drop a test fold mid-series and the neighbouring train points reach into it: the model trains on the answers and the score is optimistically biased. His fixes: purge overlapping labels, embargo a band after the test fold, and use Combinatorial Purged CV (CPCV) for a full distribution of out-of-sample (OOS) paths.
our result
What we found
The leak is real and directionally correct, but small with a sensible model: under 2 pp even at large overlap, and cleanly visible in AUC but not in accuracy. We make explicit a law López de Prado (LdP) states implicitly: inflation tracks H / fold-size, where H is the label overlap in bars, so short-horizon labels on long equity histories barely leak. The bigger lesson is CPCV's: the OOS-path distribution spans ~15 pp, ~50× the leakage bias, so a single-split backtest is one near-meaningless draw.
Triple-barrier & meta-labeling
the claim
What López de Prado says
Label by which of three barriers a vol-scaled bet touches first (profit-take / stop-loss / max-hold). Then meta-labeling: a primary model fixes the side, and a secondary classifier predicts whether the bet is profitable, deciding whether to act and the size, raising precision and converting a high-recall, low-precision primary into something tradeable.
our result
What we found
Meta-labeling does exactly what he says: it lifts precision and raw profit factor in 38/42 instruments and flips a cost-losing primary toward break-even. But on a vanilla EMA-crossover primary it is a precision filter, not an alpha source: 0 of 42 instruments clear DSR > 0.95. A filter can only concentrate edge that already exists.
Trend-scanning labels
the claim
What López de Prado says
A fixed holding horizon is arbitrary and throws away information. For each observation, find the most statistically significant local forward trend (the OLS slope t-value maximised over a band of look-forward horizons), take its sign as the label, and carry its significance as a sample weight, letting the data pick where the trend is clearest.
our result
What we found
A modest, real improvement on profit factor (beats fixed-horizon in 28/42, p=0.022) but not significant once deflated: DSR>0.95 survivors are essentially tied (19 vs 18). The transferable effect is the look-forward band, not the labeller: short (5,30)-bar horizons collapse everywhere while (10,60)/(20,120) hold. A triple-barrier meta-overlay adds no deflated value here.
Three studies, three lines
Cross-validation
Standard k-fold leaks; purging removes it
Ordinary k-fold scores the same financial ML model optimistically because overlapping labels share future bars. The bias shows up in AUC where accuracy can't see it, and it scales with the H/fold-size ratio, up to +1.69 pp AUC in short-history crypto, flat in long-history equities. Purged k-fold + embargo removes it. And CPCV shows the deeper problem: its out-of-sample distribution spans ~15 pp, roughly 50× the leakage bias, so one split is one arbitrary point.
Meta-labeling
A precision filter, not alpha
A secondary model that decides whether to act on a primary signal lifts raw profit factor in 38/42 instruments and DSR in 39/42, exactly the precision lift López de Prado describes. But 0 of 42 clear DSR>0.95 for either the primary or the meta variant. Meta-labeling can only concentrate edge the primary already has; on a vanilla crossover there is nothing to concentrate into a deflated win.
Trend-scanning
The edge is the horizon, not the labeller
Data-chosen trend labels beat fixed-horizon labels on profit factor in 28/42 instruments (sign-test p=0.022), but not on the metric that counts: DSR>0.95 survivors are 19 (trend-scan) vs 18 (fixed) vs 19 (triple-barrier), a tie. What actually matters is the look-forward band: short (5,30)-bar windows collapse everywhere; (10,60)/(20,120) hold across all three markets.
The shared discipline
All three studies run on the same footing: real 1-minute data turned into information-driven bars (dollar bars for crypto and equities, tick bars for forex), every position change costed (crypto 7 bp, equities 2 bp / realism spread, forex 1 bp / realism spread), every forward-looking object confined to the label, every secondary model trained under purged k-fold CV so overlapping labels cannot leak, and every verdict read off the Deflated Sharpe Ratio after honest trial-counting, never a single backtest number.
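As a minimal sketch of what "every position change costed" can look like, the function below charges a per-unit cost on the absolute change in position at each bar. The proportional-to-turnover charge, the convention that `positions[t]` is held during bar `t`, and the function name are assumptions for illustration; the studies' realism spread layer is not modelled here.

```python
def net_returns(positions, bar_returns, cost_bp):
    """Per-bar net P&L: position times bar return, minus cost on position changes.

    Assumption: positions[t] is the position held during bar t, decided from
    information available before that bar. cost_bp is charged per unit of
    turnover, so a flip from +1 to -1 pays twice.
    """
    cost = cost_bp / 1e4          # basis points -> fraction
    net = []
    prev = 0.0                    # start flat
    for pos, ret in zip(positions, bar_returns):
        turnover = abs(pos - prev)
        net.append(pos * ret - cost * turnover)
        prev = pos
    return net


# Example with the 7 bp crypto cost quoted in the text.
example = net_returns([1, 1, -1, 0], [0.01, -0.02, 0.005, 0.0], cost_bp=7)
```

Holding a position costs nothing in this sketch; only entries, exits and flips are charged.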
Standard k-fold leaks in finance, and a single split barely means anything
Financial labels are built from a window of future bars, so two labels whose windows overlap share information. Drop a test fold in the middle of the series and the neighbouring training points reach into it: the model is trained on the answers, and the cross-validated score comes out optimistically biased.
López de Prado's fixes: purge every training observation whose label window overlaps the test fold, embargo a small band right after it, and use CPCV to recover a whole distribution of out-of-sample paths instead of one.
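The purge and embargo steps can be sketched as index arithmetic. This is a minimal illustration, not the repository's implementation: it assumes the label at bar `i` is built from bars `i` to `i + horizon`, that folds are contiguous, and that the embargo is a fraction of the series length; `purged_kfold` is a hypothetical name.

```python
def purged_kfold(n, k, horizon, embargo_frac=0.01):
    """Yield (train_idx, test_idx) for k contiguous folds, purged and embargoed.

    Assumption: the label at bar i uses bars i .. i + horizon.
    """
    embargo = int(n * embargo_frac)
    bounds = [j * n // k for j in range(k + 1)]
    for j in range(k):
        start, stop = bounds[j], bounds[j + 1]      # test fold is [start, stop)
        test = list(range(start, stop))
        # Before the fold: keep only bars whose label window ends before `start`.
        left = range(0, max(start - horizon, 0))
        # After the fold: test labels reach to stop - 1 + horizon, so skip those
        # bars, then skip a further embargo band.
        right = range(min(stop + horizon + embargo, n), n)
        yield list(left) + list(right), test


for train_idx, test_idx in purged_kfold(n=1000, k=5, horizon=10):
    pass  # fit on train_idx, score on test_idx
```

With `horizon=0` and `embargo_frac=0` this reduces to ordinary contiguous k-fold, which makes the leaky baseline easy to compare against.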
The reproduction confirms all three claims, and sharpens one. The leak is real but it is small and subtle, a few tenths of a percentage point at typical horizons, which is exactly what makes it dangerous: it survives naive sanity checks. It is visible in AUC, which uses the full predicted probability, and effectively invisible in accuracy, whose fold-to-fold noise (±0.5–1 pp) swamps it. Run a leakage audit on a smooth, probability-based metric, not on a thresholded one.
The contribution beyond the textbook is the scaling law: leakage magnitude tracks the ratio H / fold-size, not H alone. A test fold is contaminated only in a band of width ~H at each of its two boundaries, so for n bars split into k folds the contaminated fraction is ~2H/(n/k). Short-history crypto and forex (fold ≈ 1.4k bars) leak materially: AUC inflation rises to +1.69 pp (crypto) and +1.07 pp (forex) at large overlap. Long-history equities (fold ≈ 7k bars) stay flat: even H=150 is only ~2% of a fold (~4% counting both boundaries). Equities are not immune; their long histories keep the boundary fraction tiny. Purge always (it is free and correct), but expect the size of the correction to scale with H/fold-size.
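The boundary arithmetic is easy to check by hand. The sketch below evaluates the ~2H/(n/k) approximation for the two fold sizes quoted above; the function name and the cap at 1.0 are illustrative assumptions.

```python
def contaminated_fraction(horizon, fold_size):
    """Approximate share of a test fold within `horizon` bars of either boundary."""
    return min(1.0, 2 * horizon / fold_size)


# Fold sizes quoted in the text: ~1.4k bars (crypto, forex), ~7k bars (equities).
for market, fold_size in {"crypto/forex": 1400, "equities": 7000}.items():
    frac = contaminated_fraction(150, fold_size)
    print(f"{market}: {frac:.1%} of a fold at H=150")
```

At H=150 this gives roughly a fifth of a short-history fold against about 4% of a long-history one, which is the gap the scaling law describes.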
Watch it leak
The explorer encodes the study's measured AUC-inflation grid. Slide the label overlap H and read the leaky-minus-purged gap per asset class; the lower panel shows the CPCV out-of-sample cloud for BTCUSDT, against the single split that pretends to be the backtest.
Demo, k-fold leakage explorer
Slide the label overlap H, the number of future bars two neighbouring labels share, and watch standard k-fold score the same model optimistically against a purged + embargoed split. The gap is real, and it grows with the H/fold-size ratio: it takes off in short-history crypto and forex, and stays flat in long-history equities.
Inflation is the AUC a leaky standard k-fold reports above the clean purged + embargo split, in percentage points (measured, from the study’s grid). When the split is purged correctly the leak is removed at source: the “clean” reading is the honest one.
At H=50 the standard k-fold over-scores crypto by +0.53 pp of AUC and forex by +0.45 pp, while equities show no upward bias (-0.47 pp): that is the H/fold-size law at work, not market immunity. And whatever the leak, the CPCV cloud below spans ~15 pp of out-of-sample accuracy, roughly 50× the leakage bias: one split is one arbitrary point.
| AUC inflation (pp) · H = | 1 | 5 | 10 | 25 | 50 | 100 | 150 |
|---|---|---|---|---|---|---|---|
| crypto | 0.03 | 0.24 | 0.19 | 0.21 | 0.53 | 0.36 | 1.69 |
| equity | 0.02 | 0.04 | -0.01 | -0.09 | -0.47 | -0.46 | -0.03 |
| forex | 0.12 | 0.17 | -0.00 | 0.27 | 0.45 | 1.07 | 0.85 |
AUC inflation = leaky standard k-fold − purged+embargo, percentage points, mean over each market's instruments. Positive = the leaky split scores the same model too high. Crypto and forex rise with H; equities stay flat by the H/fold-size law. Embargo, once labels are correctly purged, is a small second-order safeguard; keep it at ~1%.
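The CPCV cloud mentioned above comes from a combinatorial split scheme: cut the series into N contiguous groups, hold out every combination of k groups as the test set, and recombine the held-out pieces into full out-of-sample paths. A minimal sketch of the bookkeeping follows. The group counts are hypothetical, and purging and embargo would still have to be applied at every train/test boundary.

```python
from itertools import combinations
from math import comb


def cpcv_splits(n_groups, n_test_groups):
    """Yield (train_groups, test_groups) for every combination of test groups."""
    groups = range(n_groups)
    for test in combinations(groups, n_test_groups):
        train = tuple(g for g in groups if g not in test)
        yield train, test


def cpcv_counts(n_groups, n_test_groups):
    """Number of splits C(N, k) and of complete OOS paths (k / N) * C(N, k)."""
    n_splits = comb(n_groups, n_test_groups)
    n_paths = n_splits * n_test_groups // n_groups
    return n_splits, n_paths


print(cpcv_counts(6, 2))  # 6 groups, 2 held out per split
```

Each group lands in the test set the same number of times, which is what lets the held-out predictions be stitched into several complete paths rather than one.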
Code & data
lopez-de-prado-work-review / projects / 04_cross_validation
A precision filter that rescues a loser to break-even, but never to a deflated edge
A primary signal fixes the side of every bet (here a structural EMA crossover). Triple-barrier labeling, profit-take, stop-loss and a max-holding vertical, scored by first touch on full intrabar OHLC, gives the realised outcome. A secondary classifier then predicts P(the bet is profitable) and decides whether to act and how big; it can veto and size, but never flip the side.
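First-touch scoring on intrabar OHLC can be sketched as below. This is an illustration under stated assumptions, not the study's kernel: barrier widths are passed in as fractional returns (the volatility scaling is left to the caller), a timeout is labelled 0, and when both barriers fall inside the same bar the stop is counted first as a conservative tie-break.

```python
def triple_barrier(high, low, close, entry, side, pt, sl, max_hold):
    """Label one bet opened at the close of bar `entry`.

    side: +1 long, -1 short. pt / sl: barrier widths as fractional returns.
    Returns (label, exit_bar): +1 profit-take, -1 stop-loss, 0 vertical barrier.
    """
    p0 = close[entry]
    last = min(entry + max_hold, len(close) - 1)
    for t in range(entry + 1, last + 1):
        # Returns at the bar's extremes, measured in the direction of the bet.
        r_hi = side * (high[t] / p0 - 1)
        r_lo = side * (low[t] / p0 - 1)
        best, worst = max(r_hi, r_lo), min(r_hi, r_lo)
        if worst <= -sl:          # assumption: stop wins an intrabar tie
            return -1, t
        if best >= pt:
            return 1, t
    return 0, last                # neither horizontal barrier touched in time


close = [100, 100, 100, 100]
high = [100, 101, 103, 100]
low = [100, 99.5, 100, 100]
label, exit_bar = triple_barrier(high, low, close, entry=0, side=1,
                                 pt=0.02, sl=0.01, max_hold=3)
```

A close-only scan would miss touches that happen inside a bar and reverse before the close, which is why the highs and lows are used.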
López de Prado's claim is that this converts a high-recall, low-precision primary into something tradeable. It does: measured across 42 instruments, the precision lift is real and near-universal.
The meta-layer does exactly what the textbook says: it raises the fraction of acted bets that are profitable above the unconditional rate, lifts profit factor in 38/42 instruments and DSR in 39/42, and roughly doubles the count of instruments with PF>1 (from 12 to 24). The equity curves are the picture: the primary bleeds steadily from costs while the meta line stays near-flat by vetoing the bad bets.
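The veto-and-size rule can be sketched in a few lines. The 0.5 threshold and the linear sizing are hypothetical choices made for illustration; the study's secondary model and sizing rule are not reproduced here. The one property taken from the text is that the gate can veto and size but never flips the side.

```python
def meta_gate(side, p_profit, threshold=0.5):
    """Turn a primary side and a secondary probability into a signed position.

    Vetoes the bet when p_profit <= threshold; otherwise sizes it linearly
    from 0 at the threshold to full size at p_profit = 1.
    """
    if p_profit <= threshold:
        return 0.0
    size = (p_profit - threshold) / (1 - threshold)
    return side * size            # size >= 0, so the sign of `side` is preserved


def precision(outcomes, acted):
    """Fraction of acted bets with a positive outcome."""
    taken = [o for o, a in zip(outcomes, acted) if a]
    return sum(1 for o in taken if o > 0) / len(taken) if taken else float("nan")


position = meta_gate(side=-1, p_profit=0.75)   # a half-size short
```

Comparing `precision` over all primary bets with `precision` over the acted subset gives the lift discussed in this section.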
And yet 0 of 42 instruments clear DSR>0.95, for the primary or the meta variant. The reading is the point of the study: “raw improvement over a bad primary” and “survives deflation” are different bars, and only the second counts. Meta-labeling is a precision filter, not an alpha source. It can only keep the better subset of the bets the primary already proposes; if those bets carry no real edge net of costs, and a vanilla EMA crossover on liquid markets does not, there is nothing to concentrate into a deflated win. The best single result (SPY, meta DSR 0.76) rests on just 6 acted trades and is flagged as a small-sample artifact, left visible rather than hidden. Equities come closest (highest precision lift, median meta DSR 0.095) but still fall short.
| market | PF prim→meta | SR prim→meta | DSR prim→meta | DSR>0.95 | PBO |
|---|---|---|---|---|---|
| Crypto (27 perps) | 0.92 → 1.03 | −0.61 → +0.24 | 0.000 → 0.017 | 0 | 0.41 |
| US Equity (7 ETFs) | 0.75 → 1.03 | −0.68 → +0.31 | 0.000 → 0.095 | 0 | 0.38 |
| Forex (8 majors) | 0.85 → 0.94 | −1.13 → −0.27 | 0.000 → 0.000 | 0 | 0.19 |
By-market medians, net of costs; SR annualised; DSR is the headline. Meta lifts PF and SR everywhere, but the DSR>0.95 survivor count is 0 in every market: the precision filter has no real edge to concentrate.
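For readers who want the deflation step spelled out, here is a minimal sketch of the Deflated Sharpe Ratio following Bailey & López de Prado (2014). The inputs in the example are hypothetical, the Sharpe ratio is per-period rather than annualised, and the study's own trial-counting and variance estimates are not reproduced.

```python
from math import e, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649015329


def expected_max_sharpe(sr_variance, n_trials):
    """Expected maximum Sharpe among n_trials (>= 2) skill-less trials."""
    a = _STD_NORMAL.inv_cdf(1 - 1 / n_trials)
    b = _STD_NORMAL.inv_cdf(1 - 1 / (n_trials * e))
    return sqrt(sr_variance) * ((1 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(sr, n_obs, skew, kurt, sr_variance, n_trials):
    """Probability that the true Sharpe exceeds the expected-max benchmark.

    sr: per-period Sharpe; kurt: raw kurtosis (3 for a normal distribution);
    sr_variance: variance of the Sharpe estimates across the trials searched.
    """
    sr0 = expected_max_sharpe(sr_variance, n_trials)
    denom = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return _STD_NORMAL.cdf((sr - sr0) * sqrt(n_obs - 1) / denom)


# Hypothetical numbers: the same Sharpe deflates further as trials grow.
few = deflated_sharpe(0.1, 1000, 0.0, 3.0, sr_variance=0.001, n_trials=10)
many = deflated_sharpe(0.1, 1000, 0.0, 3.0, sr_variance=0.001, n_trials=100)
```

The benchmark rises with both the number of trials and their dispersion, which is why honest trial-counting changes the verdict even when the raw Sharpe does not move.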
Correction: meta-labeling a primary that already has an edge
The conclusion above, “a precision filter, not alpha”, was measured on an edgeless primary: a plain EMA crossover with no established edge net of costs. That is precisely the case López de Prado warns against. His meta-labeling has a stated precondition: the primary must already have an edge (high recall, mediocre precision), and the secondary's job is to raise precision by vetoing its worst bets. A filter can only concentrate edge that already exists; on a non-edge there is nothing to concentrate. So the blanket null above is a statement about the primary, not about the method.
To test the method as specified, we replaced the vanilla crossover with a proven structural order-flow / open-interest edge (a closed, pre-validated strategy, reused read-only through its verified engine) and re-ran the identical apparatus: triple-barrier outcomes, a tree secondary on causal features, purged walk-forward, per-fill costs, DSR as the headline. The only thing that changed is the one thing that mattered: the primary now carries a real edge.
With the precondition met, the secondary amplifies the edge rather than rescuing a loser. On a focused two-pair sample, profit factor rises from 1.26 to 1.79 and mean per-trade P&L from 47 to 148 bp out-of-sample, net of per-fill costs. The gate keeps 64% of the edge's bets and vetoes the low-confidence tail, lifting the profitable-bet rate from 53.0% to 63.6%, exactly the precision lift the textbook describes, now acting on bets that actually carry edge, so it shows up in P&L rather than only in a confusion matrix.
The deflated metric moves in the right direction too: DSR rises from 0.64 to 0.78. But it still sits under the program's 0.95 publication bar. This is a focused two-pair demonstration; the point is the direction and mechanism of the correction, which overturns the blanket “not alpha” claim, and it is not a new deflation-passing trophy. The lift holds on both pairs individually (PF 1.32→1.49 and 1.23→1.96).
| arm (OOS, 2-pair) | trades | PF | per-trade bp | annual SR | DSR |
|---|---|---|---|---|---|
| primary (edge alone) | 166 | 1.26 | 47.3 | 10.3 | 0.636 |
| meta (gate + size by p) | 107 | 1.79 | 148.1 | 20.5 | 0.778 |
Out-of-sample, net of per-fill costs, two-pair sample. PF and per-trade P&L lift sharply once the primary has a real edge; DSR rises 0.636→0.778 but does not clear the 0.95 bar. The headline is the primary-versus-meta comparison under the same deflation, not the absolute DSR level.
But it washes out at scale
The two-pair magnitudes do not survive breadth. Running the same precondition-met meta-gate across 6 closed structural edges × 26 perp pairs (154 of 156 configurations produced trades) and counting the trial family honestly, the lift largely washes out. Median profit factor moves only 0.967→0.986 (median lift +0.02, against the two-pair sample's +0.53) and median per-trade P&L moves −5.9→−1.2 bp (median lift +4.7 bp, against +100 bp). The direction of the correction survives broadly; the headline magnitude does not generalise.
And almost nothing clears deflation at scale: only 1 of 154 meta-gated configurations clears DSR>0.95 (best single sleeve: funding squeeze on one pair at meta DSR 0.983), and the median meta DSR across sleeves is just 0.1. Pooling the gated sleeves into one weakly-correlated meta-strategy, deflated against the dispersion of the 156 configurations actually searched, gives a pooled DSR of 0.0 (benchmark 0.35, pooled PF 0.98, 14,725 out-of-sample trades). It does not clear the 0.95 bar.
The honest synthesis: the blanket “not alpha” claim was a statement about the edgeless primary, not about meta-labeling. With López de Prado's precondition met, the mechanism is real and directional: on a focused sample the secondary genuinely amplifies a primary that already has an edge, roughly tripling per-trade P&L and lifting profit factor from 1.26 to 1.79.
Yet it does not survive scale plus deflation as a tradeable edge: the focused magnitude does not generalise across 156 configurations, and the pooled meta-strategy deflates to zero. The corrected reading is the one López de Prado actually makes, with the precondition restored to the front of the sentence: meta-labeling improves an edge you already have; it cannot manufacture one you do not, and improving an edge in direction is not the same as clearing the deflation bar at scale.
Code & data
lopez-de-prado-work-review / projects / 03_meta_labeling
A marginally better label, but the edge is in the horizon, not the labeller
Instead of a fixed holding horizon, trend-scanning regresses log-price on time over every forward window in a band and labels each observation with the sign of the most statistically significant local trend, that is, the slope at the horizon where the OLS t-value is largest in magnitude. The data picks the horizon at which the trend is clearest, and the significance becomes a sample weight.
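The scan described above can be sketched in plain Python. This is an illustrative version only: the study runs it as a Numba kernel, and the function names, the inclusive window convention and the handling of windows that run off the end of the series are assumptions.

```python
from math import sqrt


def slope_tvalue(y):
    """t-value of the OLS slope of y regressed on 0..n-1 with an intercept."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    beta = sxy / sxx
    alpha = y_mean - beta * x_mean
    sse = sum((v - alpha - beta * i) ** 2 for i, v in enumerate(y))
    if sse == 0:                  # perfect fit: significance is unbounded
        return float("inf") if beta > 0 else float("-inf") if beta < 0 else 0.0
    return beta / sqrt(sse / (n - 2) / sxx)


def trend_scan_label(log_price, t, horizons):
    """Return (label, weight, horizon) for bar t.

    label is the sign of the slope at the horizon with the largest |t-value|;
    weight is that |t-value|. Windows that do not fit in the data are skipped.
    """
    best_t, best_h = 0.0, None
    for h in horizons:
        window = log_price[t:t + h + 1]
        if h < 2 or len(window) < h + 1:
            continue
        tv = slope_tvalue(window)
        if abs(tv) > abs(best_t):
            best_t, best_h = tv, h
    label = (best_t > 0) - (best_t < 0)
    return label, abs(best_t), best_h
```

Passing `range(10, 61)` as `horizons` corresponds to a (10,60) look-forward band of the kind compared in this section.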
The head-to-head: does this make a better supervised target than the fixed-horizon labeller it is meant to beat, and than a triple-barrier meta-overlay, feeding one identical secondary model?
Trend-scanning is a modest, real improvement on profit factor: it beats fixed-horizon in 28/42 instruments (sign-test p=0.022). The data-chosen horizon is genuine: the winning L* ranges 24–111 bars and tracks the band, not a corner. But on the metric that counts it is a tie: DSR>0.95 survivors are 19 (trend-scan) vs 18 (fixed-horizon), and the trend-scan DSR win rate (21/42, p=0.56) is not significant. A triple-barrier meta-overlay adds no deflated value here and is significantly worse on DSR (one-sided p=0.9995 for it being better): its symmetric barriers almost never bind, so the meta-model acts ~97% of the time and merely inflates trial dispersion.
The transferable finding is the robustness curve: the edge depends on a long-enough look-forward window, not the labeller. The (10,60) and (20,120) bands are robustly strong (median DSR 0.83–1.00, PF>1) across all three markets. The short (5,30) band collapses (equities go to DSR 0 / PF 0.79, crypto halves) because short windows are dominated by microstructure noise and per-turnover cost. (A technical note worth flagging: at these high event counts DSR saturates to {0,1} per instrument, so the honest summary is the survivor count and the sign test, not the median DSR.)
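The sign-test figures can be checked with an exact binomial tail. The sketch assumes a one-sided test against a fair coin with ties excluded, an assumption that matches the quoted p-values for 28/42 and 21/42.

```python
from math import comb


def sign_test_p(wins, n):
    """One-sided exact p-value: P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


print(round(sign_test_p(28, 42), 3))  # profit-factor wins
print(round(sign_test_p(21, 42), 2))  # DSR wins
```

A two-sided version would double the smaller tail, so the profit-factor result would read roughly 0.044 under that convention.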
| market | PF fix→TS | SR_ann (TS) | DSR fix→TS | DSR>0.95 fix/TS | PBO (TS) |
|---|---|---|---|---|---|
| Crypto (27 perps) | 1.22 → 1.36 | 4.88 | 0.508 → 0.376 | 11 / 12 | 0.46 |
| US Equity (7 ETFs) | 1.86 → 1.89 | 4.03 | 0.004 → 0.340 | 3 / 3 | 0.06 |
| Forex (8 majors) | 1.50 → 1.46 | 7.99 | 0.753 → 0.853 | 4 / 4 | 0.47 |
By-market medians (deepening run), net of realistic costs; SR annualised. Trend-scan lifts PF over fixed-horizon, but the DSR>0.95 survivor counts are essentially tied (19 vs 18 across all 42).
| trend-scan DSR · look-forward band | (5,30) | (10,60) | (20,120) |
|---|---|---|---|
| Crypto | 0.497 | 0.833 | 0.882 |
| US Equity | 0.000 | 0.981 | 1.000 |
| Forex | 0.972 | 0.997 | 1.000 |
Trend-scan DSR deflated within each band over its three quantile trials. The short (5,30) band collapses; the wider bands hold across all three markets. The edge is in the horizon length, not the labeller.
Code & data
lopez-de-prado-work-review / projects / 05_trend_scanning
The through-line
Three studies, one lesson. How you validate decides whether you can even see a result honestly: standard k-fold quietly inflates the score, and even a clean single split is one draw from a distribution that spans ~50× the leakage it corrects. How you label decides what edge there is to find: a precision filter that vetoes bad bets and a smarter horizon for the trend are both real improvements in the raw numbers, but neither is enough to manufacture a deflated edge on its own.
The common thread is the bar. Across 42 instruments in three asset classes, fully costed and leakage-controlled, the methods do exactly what the literature says, and then the Deflated Sharpe Ratio dissolves most of the apparent gain. That gap, between “improves the backtest” and “survives deflation,” is the whole point of getting labeling and cross-validation right.
Method, in brief
- Real 1-minute data → information-driven bars (dollar bars for crypto/equities within regular hours, tick bars for forex); ~20k bars per instrument.
- Every position change costed (crypto 7 bp; equities/forex a time-of-day spread + commission realism layer or 2 bp/1 bp); triple-barrier scans use full intrabar OHLC, never close-only.
- Causal features only; the single forward-looking object is the label. Secondary models trained under purged k-fold CV with a 1% embargo so overlapping labels cannot leak.
- Verdicts read off the Deflated Sharpe Ratio after honest trial-counting, with PBO and effective-N as supporting deflation diagnostics, never a single backtest number.
- Hot loops (label construction, triple-barrier first-touch, multi-horizon OLS-t scan) implemented as Numba kernels and verified bit-identical against independent references; the model fit, correctly, remains the bottleneck.
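The first bullet's bar construction can be sketched as follows for dollar bars. This is a simplified illustration, not the repository's kernel: it approximates each minute's traded value as close × volume, uses a fixed hypothetical threshold, and drops the trailing partial bar; the regular-hours filter and the forex tick bars are not shown.

```python
def dollar_bars(minutes, threshold):
    """Aggregate 1-minute (open, high, low, close, volume) rows into dollar bars.

    A bar closes as soon as the accumulated traded value reaches `threshold`.
    """
    bars, current, dollars = [], None, 0.0
    for o, h, l, c, v in minutes:
        if current is None:
            current = [o, h, l, c]                 # start a new bar
        else:
            current[1] = max(current[1], h)        # running high
            current[2] = min(current[2], l)        # running low
            current[3] = c                         # latest close
        dollars += c * v                           # approximate traded value
        if dollars >= threshold:
            bars.append(tuple(current))
            current, dollars = None, 0.0
    return bars


rows = [(10, 11, 9, 10, 100), (10, 12, 10, 10, 100),
        (10, 10, 10, 10, 100), (10, 10, 8, 10, 100), (10, 10, 10, 10, 50)]
bars = dollar_bars(rows, threshold=2000)
```

Bars built this way arrive faster when activity is high and slower when it is low, which is the sense in which they are information-driven.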
Cite
References
The primary sources for the methods reproduced here:
- López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Ch. 3 (labeling & meta-labeling), Ch. 4 (sample weights), Ch. 7 (cross-validation in finance) and Ch. 12 (backtesting through cross-validation / CPCV).
- López de Prado, M. (2020). Machine Learning for Asset Managers. Cambridge University Press. Ch. 5 (trend-scanning labels).
- Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. Journal of Portfolio Management, 40(5).
- Bailey, D. H., Borwein, J., López de Prado, M., & Zhu, Q. J. (2017). The Probability of Backtest Overfitting. Journal of Computational Finance, 20(4).
See also
These studies share the Deflated-Sharpe gate developed in Backtest Overfitting & the Deflated Sharpe Ratio, and the information-driven bars from Information-Driven Bars. The broader body of work is at Research.

