Research review · cross-sectional machine learning · study 15
Cross-Sectional Machine Learning
Run the identical apparatus in machine learning's strongest regime, ranking the whole cross-section each bar instead of forecasting one series' direction, and the program's first edge survives deflation, landing right alongside the best static archetype without beating it
Every machine-learning result earlier in this review lived in ML's weakest regime: forecasting a single series' direction. It failed deflation, exactly as the literature predicts. The fair test of the “nothing clears deflated significance” through-line is to run the identical purged-cross-validation, Deflated-Sharpe-Ratio, Probability-of-Backtest-Overfitting and effective-N apparatus in ML's strongest regime, cross-sectional return prediction, where each bar you rank the whole universe and rotate capital market-neutral into the best names. The verdict flips, partially. The cross-sectional edge survives deflation, but it approaches rather than beats the best static carry/momentum archetype. That nuance is the result worth publishing.
Advances in Financial ML, Ch. 7 & 8
López de Prado (2018) · purged CV & feature importance
Code & data
lopez-de-prado-work-review / projects / 15
the claim
What López de Prado says
- Machine learning adds value when it predicts the cross-section of returns under proper validation, and far less when it is asked to forecast a single series' direction; the cross-section is the defensible regime.
- Any backtested edge must clear the Deflated Sharpe Ratio against the number and dispersion of trials, and the Probability of Backtest Overfitting must be low for an in-sample ranking to mean anything out of sample.
- On tabular financial data, gradient-boosted trees tie or beat deep neural nets more often than the reverse.
our result
What we found
- The regime split is exactly as described. Single-series ML is a coin-flip or worse out of sample (PBO 0.62), and none of the 25,162 strategies survives deflation. Cross-sectional ML has a low PBO (0.10), strong in-sample-to-out-of-sample rank persistence (rho +0.78), and clears the Deflated Sharpe bar on every horizon for the reference model (lgbm, 10 of 10).
- The edge is real but bounded: median out-of-sample profit factor 1.123, net of a full per-fill cost model. It reaches but does not exceed the best static structural archetype (about PF 1.17). The honest headline is that ML edge ≈ static edge, neither dominates.
- Trees won the same-task head-to-head: lgbm, catboost and xgb clear or approach the deflation bar (23 of 30 tree combos), while two neural families run on the identical panel, label, walk-forward and cost model lose money net of costs and clear nothing. The tabular-data finding is confirmed on our own data, not just cited.
The result in three lines
Regime flip · same apparatus
The strongest ML regime clears deflation
Single-series direction forecasting (ML weakest) posts PBO 0.62, and 0 of 25,162 strategies survive deflation. Swap only the prediction target to cross-sectional ranking (ML strongest) and the reference tree clears the Deflated Sharpe bar on 10 of 10 horizons, with PBO 0.10 and IS→OOS rank persistence rho +0.78.
Edge ≈ static · neither dominates
Real, costed, deflation-surviving, but bounded
The rotating long-short portfolio earns a median out-of-sample profit factor of 1.123 across ten horizons, net of full per-fill costs. That reaches the best static carry/momentum archetype (about PF 1.17) by a different, less crowded route, but does not beat it. The value of ML here is parity, not a free lunch.
Trees beat nets · 23/30 vs 0
Same task, same costs, trees win
Three gradient-boosted-tree families clear or approach the False-Strategy-Theorem deflation bar (lgbm 10/10, catboost 9/10, xgb 5/10, 23 of 30 combos). A feed-forward and a recurrent neural net on the identical panel, label, walk-forward and cost model both land below break-even after costs (PF ≤ 1, negative Sharpe) and clear nothing.
The regime flip, single-series vs cross-sectional
Swap only the prediction target, and the program's first edge survives deflation
Same models, same triple-barrier labeling, same purged walk-forward, same cost model. The only thing that changes is the prediction target. Single-series direction asks one noisy series whether it goes up or down; the program's audit put the effective number of independent bets at about 40 out of 25,162 nominal strategies, and the in-sample best lands in the out-of-sample bottom half 62% of the time. Deflation kills all of it.
The cross-section asks a much easier, more stable question, which names will out-perform which, and aggregates across a wide universe each bar, which averages down idiosyncratic noise and yields a ranking that persists out of sample (rho +0.78). That persistence is exactly what lets it clear the deflation bar. The flip is the literature's prediction, observed.
| regime | PBO | DSR survival | IS→OOS rho | median OOS PF | verdict |
|---|---|---|---|---|---|
| single-series (ML weakest): pooled band, 25,162 strategies | 0.62 | 0 of 25,162 | n/a | ~1.0 | fails deflation |
| cross-sectional (ML strongest): gradient-boosted trees, 10 horizons | 0.10 | 10 of 10 horizons | +0.78 | 1.123 | survives deflation |
PBO via CSCV; effective-N via eigenvalue participation ratio; DSR via the False-Strategy-Theorem benchmark, all from the program's deflation toolkit. Headline PBO and rank-persistence are the canonical per-window figures from the dedicated rotation engine (the correct trial axis). Effective-N is measured on different axes (single-series: about 40 independent bets among 25,162 strategies; cross-sectional: independent horizon-bets among the 10 horizons), so the two are not directly comparable. The best static structural carry/momentum benchmark sits at about PF 1.17.
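The effective-N figure rests on the eigenvalue participation ratio of the strategy correlation matrix. The following is a minimal sketch of that ratio, not the program's toolkit: the function names and the toy series are hypothetical. It relies on a standard identity for symmetric matrices (the eigenvalue sum is the trace, and the sum of squared eigenvalues is the sum of squared entries), so no eigendecomposition is needed.

```python
from math import sqrt


def correlation_matrix(series):
    """Pearson correlation matrix of equal-length return series.

    `series` is a list with one list of returns per strategy.
    Constant (zero-variance) series are not handled in this sketch.
    """
    n_obs = len(series[0])
    centered = []
    for s in series:
        mean = sum(s) / n_obs
        centered.append([x - mean for x in s])
    norms = [sqrt(sum(x * x for x in c)) for c in centered]
    size = len(series)
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j]))
            / (norms[i] * norms[j])
            for j in range(size)
        ]
        for i in range(size)
    ]


def effective_n(corr):
    """Participation ratio: (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    For a symmetric matrix this equals trace^2 / (sum of squared entries).
    """
    trace = sum(corr[i][i] for i in range(len(corr)))
    frobenius_sq = sum(x * x for row in corr for x in row)
    return trace * trace / frobenius_sq


# Toy data (hypothetical): three mutually uncorrelated strategies,
# then three copies of the same strategy.
independent = [[1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
clones = [[1, -1, 1, -1]] * 3

n_independent = effective_n(correlation_matrix(independent))
n_clones = effective_n(correlation_matrix(clones))
```

Uncorrelated strategies count as one bet each, while exact copies collapse to a single bet. That collapse is the mechanism by which a large nominal strategy count can shrink to a few dozen effective bets.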
The cross-sectional edge, horizon by horizon
One rotating long-short portfolio per forecast horizon
Each bar, rank the alive universe by a causal per-name tree score, go long the top quantile and short the bottom, market-neutral, with the bet-sizing knobs (quantile, long-only, tilt, rebalance throttle) tuned in-sample per walk-forward window. Ten horizons were searched; they are part of the same trial family the deflation bar accounts for. The edge is strongest at short to medium horizons and decays gently as the holding period grows, consistent with a momentum-and-reversal cross-section rather than one lucky horizon. The reference model (lgbm) clears the rigorous per-horizon Deflated Sharpe (skew- and kurtosis-adjusted, deflated against the ten-horizon family) on all ten.
| H (bars) | OOS PF | OOS Sharpe (ann.) | DSR | survives |
|---|---|---|---|---|
| 4 | 1.148 | 3.23 | ~1.00 | yes |
| 6 | 1.127 | 2.79 | ~1.00 | yes |
| 8 | 1.120 | 2.61 | ~1.00 | yes |
| 12 | 1.158 | 2.66 | ~1.00 | yes |
| 24 | 1.149 | 3.62 | ~1.00 | yes |
| 48 | 1.111 | 2.62 | ~1.00 | yes |
| 72 | 1.140 | 3.36 | ~1.00 | yes |
| 120 | 1.107 | 2.56 | ~1.00 | yes |
| 168 | 1.095 | 2.39 | ~1.00 | yes |
| 336 | 1.089 | 2.19 | ~1.00 | yes |
Annualized Sharpe uses the engine's own bar-frequency annualization and is indicative; rotation returns are autocorrelated within a holding period, so the effective number of independent observations, and the honest Sharpe, are lower than the raw bar count implies. The deflation verdict and the profit factor do not depend on that annualization choice. Each horizon is evaluated over about 42,000 out-of-sample bars.
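For readers who want to check the table's two performance columns, here is a minimal sketch of how a profit factor and a bar-frequency annualized Sharpe are conventionally computed. The engine's exact annualization is not stated in this review, so `bars_per_year` and the toy return series are assumptions; only the profit-factor column is taken from the table.

```python
from math import sqrt
from statistics import fmean, median, stdev


def profit_factor(returns):
    """Gross gains divided by gross losses; inf if there are no losing bars."""
    gains = sum(r for r in returns if r > 0)
    losses = -sum(r for r in returns if r < 0)
    return gains / losses if losses > 0 else float("inf")


def annualized_sharpe(returns, bars_per_year):
    """Per-bar Sharpe scaled by sqrt(bars per year).

    The scaling assumes independent bars, so it overstates the Sharpe
    when returns are autocorrelated within a holding period.
    """
    return fmean(returns) / stdev(returns) * sqrt(bars_per_year)


# OOS PF column from the horizon table (H = 4 ... 336 bars).
pf_by_horizon = [1.148, 1.127, 1.120, 1.158, 1.149,
                 1.111, 1.140, 1.107, 1.095, 1.089]
median_pf = median(pf_by_horizon)  # midpoint of 1.120 and 1.127

# Hypothetical net per-bar returns, for illustration only.
toy_returns = [0.02, -0.01, 0.03, -0.02]
toy_pf = profit_factor(toy_returns)
toy_sharpe = annualized_sharpe(toy_returns, bars_per_year=8760)  # assumed hourly bars
```

On the rounded table entries the median comes out at about 1.1235, consistent with the headline 1.123. The profit factor needs no annualization, which is why it does not inherit the caveat attached to the Sharpe column.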
The deflation verdict, family by family
The cross-sectional survival is not a single-model artefact. The whole sweep is treated as one multiple-testing family. Under the False Strategy Theorem, the expected maximum Sharpe a skill-less searcher would post across N trials is computed from the dispersion of the realized per-bar out-of-sample Sharpe estimates, and a (family, horizon) combo “clears” only if its out-of-sample Sharpe exceeds that bar. At the headline trial count N=40 the bar is an annualized Sharpe of about 1.76.
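The bar described above can be sketched with the usual closed-form approximation for the expected maximum of N independent skill-less Sharpe estimates. This is an illustration under stated assumptions, not the program's code: the function name is hypothetical, and the cross-trial Sharpe variance that produces the quoted 1.76 is not given in this review, so the example uses a unit variance.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_MASCHERONI = 0.5772156649015329


def expected_max_sharpe(n_trials, sharpe_variance):
    """Approximate expected maximum Sharpe across skill-less trials.

    n_trials        -- number of (effectively independent) trials, at least 2
    sharpe_variance -- cross-trial variance of the Sharpe estimates, in the
                       same units (e.g. annualized) as the returned bar
    """
    if n_trials < 2:
        raise ValueError("need at least two trials")
    inv = NormalDist().inv_cdf
    g = EULER_MASCHERONI
    quantile_mix = (1 - g) * inv(1 - 1 / n_trials) + g * inv(1 - 1 / (n_trials * e))
    return sqrt(sharpe_variance) * quantile_mix


# Unit variance is an assumption made purely to show the scaling.
bar_unit = expected_max_sharpe(40, 1.0)
bar_wider_search = expected_max_sharpe(400, 1.0)
```

With N=40 the bar sits a little over two cross-trial standard deviations above zero, and it rises slowly as the search widens. The bar scales linearly with the dispersion of the realized Sharpe estimates, which is why that dispersion is measured rather than assumed.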
In all, 23 of 30 tree combos clear a genuine multiple-testing bar, versus 0 of 25,162 single-series strategies. The three families were run end-to-end on the same survivorship-honest panel and the same purged walk-forward, so the survival is corroborated across independent tree implementations, not asserted from one. (A PBO recomputed across only the ten horizon columns of one model reads 0.62 by construction, because the horizons are near-identical bets of the same model; that cross-check is reported clearly labelled and is not the headline. The canonical per-window PBO is 0.10.)
| family | horizons clearing (of 10) | median OOS PF | verdict |
|---|---|---|---|
| lgbm | 10 of 10 | 1.123 | clears |
| catboost | 9 of 10 | 1.125 | clears (longest falls below) |
| xgb | 5 of 10 | 1.072 | partial |
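The PBO figures quoted in this section come from combinatorially symmetric cross-validation (CSCV). The sketch below illustrates the procedure in outline only. The function name is hypothetical, the performance metric defaults to the mean return for simplicity (a Sharpe function can be passed instead), and ties with the out-of-sample median are counted as overfit, which is a convention assumed here.

```python
from itertools import combinations
from math import log
from statistics import fmean


def pbo_cscv(returns, n_blocks=4, metric=fmean):
    """Probability of Backtest Overfitting via CSCV (illustrative sketch).

    returns  -- list of strategies, each a list of per-bar returns (equal length)
    n_blocks -- even number of contiguous time blocks; trailing bars that do
                not fill a block are dropped
    metric   -- performance measure applied to a list of returns
    """
    n_strats = len(returns)
    block_len = len(returns[0]) // n_blocks
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]
    logits = []
    # Every way of choosing half the blocks as in-sample; the rest is out-of-sample.
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        is_idx = [t for b in is_ids for t in blocks[b]]
        oos_idx = [t for b in range(n_blocks) if b not in is_ids for t in blocks[b]]
        is_perf = [metric([r[t] for t in is_idx]) for r in returns]
        oos_perf = [metric([r[t] for t in oos_idx]) for r in returns]
        best = max(range(n_strats), key=is_perf.__getitem__)
        # OOS rank of the in-sample winner: 1 = worst, n_strats = best.
        rank = 1 + sum(p < oos_perf[best] for p in oos_perf)
        omega = rank / (n_strats + 1)
        logits.append(log(omega / (1 - omega)))
    # Share of splits where the in-sample winner is at or below the OOS median.
    return sum(lam <= 0 for lam in logits) / len(logits)


# Toy cases (hypothetical data): a strategy that dominates everywhere,
# and a pair whose fortunes reverse between the two halves.
dominant = [[0.02] * 8, [0.01] * 8, [0.0] * 8]
reversing = [[1, 1, -1, -1], [-1, -1, 1, 1]]

pbo_dominant = pbo_cscv(dominant, n_blocks=4)
pbo_reversing = pbo_cscv(reversing, n_blocks=2)
```

A strategy that wins in every block gives a PBO of 0, and a pair that swaps places between halves gives a PBO of 1. Real panels fall between these extremes, and what counts as a trial (windows, horizons, or strategies) changes the reading, which is why the trial axis is stated alongside each PBO quoted here.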
Trees versus neural networks on the identical task
The fair counter-question is whether a neural network, given the same panel, the same forward cross-sectional rank label, the same purged walk-forward, the same rotation sizing search and the same per-fill cost model, does any better. The literature's repeated finding is that it does not. We tested this directly rather than asserting it: a per-bar feed-forward network (mlp) and a per-pair recurrent network (lstm) were run on a managed accelerator (the full panel exceeds a local card) and scored from their real out-of-sample ledgers against the same False-Strategy-Theorem bar the trees faced.
This is a clean, same-task confirmation of the literature's tabular finding on our own data. The neural families are not a near-miss that better tuning would rescue here: they are below break-even after costs, while the trees clear a genuine deflation haircut. Trees beat the nets on the cross-sectional rotation, under identical costs and validation.
| family | H | OOS PF | OOS Sharpe (ann.) | DSR | verdict |
|---|---|---|---|---|---|
| mlp | 24 | 0.984 | −0.40 | 0.13 | below bar |
| mlp | 72 | 0.982 | −0.44 | 0.11 | below bar |
| mlp | 168 | 0.983 | −0.42 | 0.12 | below bar |
| lstm | 6 | 0.940 | −1.51 | n/a | below bar |
Each neural row is scored from its real per-bar out-of-sample ledger on exactly the apparatus the trees used, including the same False-Strategy bar. The remaining neural families (a gated-recurrent set, a temporal-convolution set and a cross-pair attention set) did not produce a usable ledger on this pass and are recorded as deferred, with no number attached, rather than estimated.
Multi-market, does the edge hold outside crypto?
The edge replicates in US equities, but its deflated significance scales with breadth and turnover
The crypto cross-section is a wide panel of hundreds of liquid instruments. To check whether the edge is a property of that one market or of the rotation idea, we ran the identical recipe on a bounded, liquid universe of large-cap US stocks, about 240 names, a median of roughly 245 live each bar, adjusted daily bars over roughly eleven years. Each bar we rank the live universe by the same causal per-name tree score, rotate long the top quantile and short the bottom (market-neutral), tune the rotation knobs in-sample inside each purged walk-forward window, and charge realistic per-fill costs (median 3.4 bp/side). The model family is the gradient-boosted tree that won in crypto.
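The ranking-and-rotation step and the per-side cost charge can be sketched as follows. This is a simplified illustration, not the engine: the function names, the equal weighting within each leg, the 25% quantile and the tickers are all hypothetical, and the real cost model is per-fill rather than a flat rate. The 3.4 bp/side figure is the median quoted above.

```python
def rotation_weights(scores, quantile=0.25):
    """Equal-weight long the top quantile and short the bottom quantile.

    scores -- dict of name -> model score for the names alive this bar
    Returns a dict of name -> weight that sums to zero (market-neutral)
    with gross exposure 1.
    """
    ranked = sorted(scores, key=scores.get)  # lowest score first
    k = max(1, int(len(ranked) * quantile))
    if 2 * k > len(ranked):
        raise ValueError("quantile too large for the live universe")
    weights = {name: 0.0 for name in ranked}
    for name in ranked[:k]:
        weights[name] = -0.5 / k  # short leg
    for name in ranked[-k:]:
        weights[name] = 0.5 / k   # long leg
    return weights


def rebalance_cost(old_weights, new_weights, cost_bp_per_side=3.4):
    """Cost of moving between two weight sets, as a fraction of capital."""
    names = set(old_weights) | set(new_weights)
    turnover = sum(
        abs(new_weights.get(n, 0.0) - old_weights.get(n, 0.0)) for n in names
    )
    return turnover * cost_bp_per_side / 1e4


# Hypothetical scores for eight hypothetical names.
scores = {"A": 0.9, "B": 0.1, "C": 0.5, "D": -0.3,
          "E": 0.7, "F": 0.2, "G": -0.8, "H": 0.4}
weights = rotation_weights(scores)
entry_cost = rebalance_cost({}, weights)  # cost of entering from flat
```

Entering from flat turns over the full gross exposure, so the charge is one side's cost on all capital deployed. Subsequent rebalances cost less when the ranking persists, which is the role a rebalance throttle plays in the sizing search.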
The raw edge survives the move to equities, but it is thinner. The profit factor is in the same neighbourhood as crypto: a median out-of-sample profit factor of about 1.105 across the five horizons (crypto about 1.123), peaking near 1.15 at the two-week horizon, with the medium horizons strongest and the very short and very long horizons weakest. The annualized Sharpe ratios, however, are far lower than crypto's (roughly 0.3 to 0.7 versus crypto's roughly 2.7), because a daily equity rotation has far fewer rebalances and a much smaller effective sample than an hourly rotation over a much larger panel.
| H (days) | OOS PF | OOS Sharpe (ann.) | DSR | survives |
|---|---|---|---|---|
| 1 | 1.096 | 0.44 | 0.75 | no |
| 5 | 1.105 | 0.53 | 0.83 | no |
| 10 | 1.149 | 0.74 | 0.94 | no |
| 21 | 1.132 | 0.66 | 0.91 | no |
| 63 | 1.060 | 0.29 | 0.59 | no |
Survivorship. The equity universe is a fixed list of current large-cap members, so names that were large-cap earlier but have since left the index (delistings, takeovers, demotions) are absent. This is a mild upward bias on the long leg; the result is survivorship-aware but not survivorship-free, and should be read as a rough upper bound rather than a clean estimate. Later listings enter the cross-section only once they trade (within-live-span gating), so there is no look-ahead onto pre-listing dates. A fully survivorship-free run on a point-in-time membership history is the natural follow-up.
The cross-sectional ML edge partially replicates in US equities: the sign and magnitude of the profit factor carry over almost exactly, but the statistical significance does not, because daily large-cap rotation simply has less independent information per unit time than a wide hourly crypto panel. This sharpens rather than overturns the headline. The cross-sectional regime is genuinely the stronger ML regime in both markets: it produces a real, costed, positive edge where single-series direction forecasting produces noise. Whether that edge clears a multiple-testing haircut depends on the breadth and rebalance frequency of the panel. The widest, fastest panel clears it; daily large-cap equities fall just short.
Method
- Single-series leg: the program's pooled band corpus of 25,162 direction-forecasting strategies (triple-barrier labels, purged walk-forward, per-fill costs), scored through the program's deflation toolkit (PBO via CSCV, effective-N via the eigenvalue participation ratio, DSR via the False Strategy Theorem). Verdict: PBO 0.62, effective-N about 40, zero deflation survivors.
- Cross-sectional leg: a wide survivorship-honest perpetual-swap universe of 787 pairs. Each bar, rank the alive universe by a causal per-name gradient-boosted-tree score and rotate a market-neutral long-short portfolio into the top and bottom quantiles, with the bet-sizing knobs tuned in-sample per purged walk-forward window. Ten forecast horizons (4 to 336 bars), full per-fill cost model. Real data only.
- Deflation accounting: the whole cross-sectional sweep is one multiple-testing family. The False-Strategy-Theorem expected-max skill-less Sharpe is computed from the dispersion of realized per-bar OOS Sharpe across the 30 completed tree combos; a combo clears only if its OOS Sharpe exceeds that bar. The reference model additionally carries a rigorous per-horizon DSR (skew- and kurtosis-adjusted, deflated against the ten-horizon family).
- Tree families: lgbm, catboost and xgb ran end-to-end on the full panel. Random forest and extra-trees were cut as impractically slow and memory-heavy at full width; the linear families were stopped by the memory guard partway through the search. These carry no result and are recorded as deferred.
- Neural families: a per-bar feed-forward network (mlp) and a per-pair recurrent network (lstm) were run on a managed accelerator on the identical panel, label, walk-forward, sizing search and cost model, then scored from their real OOS ledgers against the same False-Strategy bar. The gated-recurrent, temporal-convolution and cross-pair attention sets did not produce a usable ledger and are deferred with no number.
- Multi-market check: the unmodified cross-sectional engine on a bounded, liquid universe of about 240 large-cap US names, adjusted daily bars over roughly eleven years (median about 245 live per bar), five horizons (one day to one quarter), realistic per-fill costs (median 3.4 bp/side). Survivorship-aware (current members, within-live-span gating); a thin bonus FX-and-sector smoke run is banked separately.
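The skew- and kurtosis-adjusted Deflated Sharpe Ratio named in the deflation-accounting step can be sketched as below, following the published formula of Bailey and López de Prado (2014). The function name and the example inputs are hypothetical, and the program's own implementation may differ in detail. Inputs are per-period (non-annualized) Sharpe ratios, and `kurtosis` is the raw kurtosis (3 for a normal distribution).

```python
from math import sqrt
from statistics import NormalDist


def deflated_sharpe(sharpe, benchmark_sharpe, n_obs, skew=0.0, kurtosis=3.0):
    """Probability that the true Sharpe exceeds the deflation benchmark.

    sharpe           -- observed per-period Sharpe of the selected strategy
    benchmark_sharpe -- expected maximum Sharpe of a skill-less search,
                        in the same per-period units
    n_obs            -- number of return observations
    skew, kurtosis   -- sample skewness and raw kurtosis of the returns
    """
    # Standard error of the Sharpe estimate under non-normal returns.
    denom = sqrt(1 - skew * sharpe + (kurtosis - 1) / 4 * sharpe ** 2)
    z = (sharpe - benchmark_sharpe) * sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)


# Hypothetical inputs, for illustration only.
at_the_bar = deflated_sharpe(0.05, 0.05, n_obs=1000)
above_the_bar = deflated_sharpe(0.05, 0.02, n_obs=1000)
above_with_left_tail = deflated_sharpe(0.05, 0.02, n_obs=1000, skew=-1.0, kurtosis=6.0)
```

A strategy sitting exactly on the benchmark scores 0.5, and negative skew or fat tails pull the score down for the same observed Sharpe. The score also grows with the number of observations, so a short or autocorrelated sample earns a lower DSR for the same profit factor.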
Notes & honest assessment
This closes the program's only structural gap (it had no cross-sectional coverage) and delivers its first deflation-surviving ML edge. The honest headline is kept throughout: the cross-sectional ML edge approaches but does not beat the best static structural archetype (about PF 1.17). The publishable claim is therefore stronger and more defensible than “nothing works”: machine learning earns a real, deflation-surviving cross-sectional edge, but it lands at roughly the same place a well-built static carry/momentum strategy already sits. ML edge ≈ static edge; neither dominates. The value of ML here is parity through a different, less crowded route, not a free lunch over structure. Where a statistic is recomputed on a different trial axis than the dedicated engine, the canonical per-window figure is the headline and the recompute is labelled a cross-check; the two are never conflated.
Reproducibility
The single-series and cross-sectional results, the family-level deflation table, the neural-family ledgers, the US-equity leg, the figures and the deflation toolkit are collected in project 15 of lopez-de-prado-work-review. The analysis scripts recompute DSR, PBO and effective-N from the banked ledgers and are idempotent and robust to missing inputs; they read ledgers and train nothing.
References
The primary sources for the work reviewed here:
- López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. (Ch. 7 purged cross-validation; Ch. 8 feature importance.)
- Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. Journal of Portfolio Management, 40(5).
- Bailey, D. H., Borwein, J., López de Prado, M., & Zhu, Q. J. (2017). The Probability of Backtest Overfitting. Journal of Computational Finance, 20(4).
- López de Prado, M. (2019). A Data Science Solution to the Multiple-Testing Crisis in Financial Research. Journal of Financial Data Science, 1(1).
See also
The companion study on multiple-testing deflation is Backtest Overfitting & the Deflated Sharpe Ratio, the labeling and purged-cross-validation machinery is in Labeling & Cross-Validation, the selection-discipline theme runs through The edge is in the process, and the broader body of work is at Research.

