
Research review · cross-sectional machine learning · study 15

Cross-Sectional Machine Learning

Run the identical apparatus in machine learning's strongest regime, ranking the whole cross-section each bar instead of forecasting one series' direction, and the program's first edge survives deflation, landing right alongside the best static archetype without beating it

Every machine-learning result earlier in this review lived in ML's weakest regime: forecasting a single series' direction. It failed deflation, exactly as the literature predicts. The fair test of the “nothing clears deflated significance” through-line is to run the identical purged-cross-validation, Deflated-Sharpe-Ratio, Probability-of-Backtest-Overfitting and effective-N apparatus in ML's strongest regime, cross-sectional return prediction, where each bar you rank the whole universe and rotate capital market-neutral into the best names. The verdict flips, partially. The cross-sectional edge survives deflation, but it approaches rather than beats the best static carry/momentum archetype. That nuance is the result worth publishing.

source

Advances in Financial ML, Ch. 7 & 8

López de Prado (2018) · purged CV & feature importance

Code & data

lopez-de-prado-work-review / projects / 15

the claim

What López de Prado says

  • Machine learning adds value when it predicts the cross-section of returns under proper validation, and far less when it is asked to forecast a single series' direction; the cross-section is the defensible regime.
  • Any backtested edge must clear the Deflated Sharpe Ratio against the number and dispersion of trials, and the Probability of Backtest Overfitting must be low for an in-sample ranking to mean anything out of sample.
  • On tabular financial data, gradient-boosted trees tie or beat deep neural nets more often than the reverse.

our result

What we found

  • The regime split is exactly as described. Single-series ML is a coin-flip or worse out of sample (PBO 0.62), and none of the ~25,162 strategies survives deflation. Cross-sectional ML has a low PBO (0.10), strong in-sample-to-out-of-sample rank persistence (rho +0.78), and clears the Deflated Sharpe bar on every horizon for the reference model (lgbm, 10 of 10).
  • The edge is real but bounded: median out-of-sample profit factor 1.123, net of a full per-fill cost model. It reaches but does not exceed the best static structural archetype (about PF 1.17). The honest headline is that ML edge ≈ static edge; neither dominates.
  • Trees won the same-task head-to-head: lgbm, catboost and xgb clear or approach the deflation bar (23 of 30 tree combos), while two neural families run on the identical panel, label, walk-forward and cost model lose money net of costs and clear nothing. The tabular-data finding is confirmed on our own data, not just cited.

The result in three lines

Regime flip · same apparatus

The strongest ML regime clears deflation

Single-series direction forecasting (ML weakest) posts PBO 0.62 and 0 of ~25,162 strategies survive deflation. Swap only the prediction target to cross-sectional ranking (ML strongest) and the reference tree clears the Deflated Sharpe bar on 10 of 10 horizons, with PBO 0.10 and IS→OOS rank persistence rho +0.78.

Edge ≈ static · neither dominates

Real, costed, deflation-surviving, but bounded

The rotating long-short portfolio earns a median out-of-sample profit factor of 1.123 across ten horizons, net of full per-fill costs. That reaches the best static carry/momentum archetype (about PF 1.17) by a different, less crowded route, but does not beat it. The value of ML here is parity, not a free lunch.

Trees beat nets · 23/30 vs 0

Same task, same costs, trees win

Three gradient-boosted-tree families clear or approach the False-Strategy-Theorem deflation bar (lgbm 10/10, catboost 9/10, xgb 5/10, 23 of 30 combos). A feed-forward and a recurrent neural net on the identical panel, label, walk-forward and cost model both land below break-even after costs (PF ≤ 1, negative Sharpe) and clear nothing.

The regime flip, single-series vs cross-sectional

Swap only the prediction target, and the program's first edge survives deflation

Fig. 1: The same purged-CV, DSR, PBO and effective-N apparatus on both regimes. Single-series direction forecasting (ML weakest) has high overfitting probability (PBO 0.62) and zero deflation survivors out of ~25,162 strategies; the cross-sectional rotation (ML strongest) has a low PBO (0.10) and clears the Deflated Sharpe bar on every horizon for the reference tree.

Same models, same triple-barrier labeling, same purged walk-forward, same cost model. The only thing that changes is the prediction target. Single-series direction asks one noisy series whether it goes up or down; the program's audit put the effective number of independent bets at about 40 out of 25,162 nominal strategies, and the in-sample best lands in the out-of-sample bottom half 62% of the time. Deflation kills all of it.
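The "in-sample best lands in the out-of-sample bottom half" statistic is what combinatorially symmetric cross-validation (CSCV) measures. A minimal sketch of that calculation follows, assuming a plain per-period Sharpe as the performance metric; the function names, the block count and the toy data are illustrative assumptions, not the program's toolkit.

```python
from itertools import combinations
from math import log
from statistics import mean, pstdev


def sharpe(returns):
    """Per-period Sharpe ratio; 0.0 when the series has no dispersion."""
    sd = pstdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def pbo_cscv(returns_by_strategy, n_blocks=4):
    """Probability of Backtest Overfitting via CSCV (illustrative sketch).

    returns_by_strategy: equal-length return series, one list per strategy.
    n_blocks: even number of contiguous time blocks to recombine.
    """
    if n_blocks < 2 or n_blocks % 2:
        raise ValueError("n_blocks must be an even number >= 2")
    n_strat = len(returns_by_strategy)
    size = len(returns_by_strategy[0]) // n_blocks
    blocks = [range(i * size, (i + 1) * size) for i in range(n_blocks)]
    logits = []
    # Every way of choosing half the blocks as in-sample; the rest is out-of-sample.
    for in_sample in combinations(range(n_blocks), n_blocks // 2):
        is_idx = [t for b in in_sample for t in blocks[b]]
        oos_idx = [t for b in range(n_blocks) if b not in in_sample
                   for t in blocks[b]]
        is_perf = [sharpe([r[t] for t in is_idx]) for r in returns_by_strategy]
        oos_perf = [sharpe([r[t] for t in oos_idx]) for r in returns_by_strategy]
        best = max(range(n_strat), key=is_perf.__getitem__)
        # Relative OOS rank of the in-sample winner, strictly inside (0, 1).
        rank = sum(p <= oos_perf[best] for p in oos_perf)
        omega = rank / (n_strat + 1)
        logits.append(log(omega / (1 - omega)))
    # PBO: share of splits where the winner sits at or below the OOS median.
    return sum(x <= 0 for x in logits) / len(logits)
```

A strategy that wins in every block yields a PBO of 0; strategies whose ranking flips between blocks push it toward 1.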

The cross-section asks a much easier, more stable question, which names will out-perform which, and aggregates across a wide universe each bar, which averages down idiosyncratic noise and yields a ranking that persists out of sample (rho +0.78). That persistence is exactly what lets it clear the deflation bar. The flip is the literature's prediction, observed.
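Rank persistence can be sketched as a rank correlation between in-sample and out-of-sample scores of the same configurations. The block below assumes the reported rho is a Spearman correlation, which the text does not state explicitly; the helper names and the sample scores are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(is_scores, oos_scores):
    """Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(is_scores), average_ranks(oos_scores)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical in-sample and out-of-sample scores for four configurations.
print(spearman_rho([0.1, 0.4, 0.2, 0.9], [1.0, 3.0, 2.0, 8.0]))
```

A value near +1 means the in-sample ordering carries over out of sample; a value near zero means the in-sample ranking is uninformative.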

| regime | PBO | DSR survival | IS→OOS rho | median OOS PF | verdict |
|---|---|---|---|---|---|
| single-series (ML weakest): pooled band, 25,162 strategies | 0.62 | 0 of 25,162 | n/a | ~1.0 | fails deflation |
| cross-sectional (ML strongest): gradient-boosted trees, 10 horizons | 0.10 | 10 of 10 horizons | +0.78 | 1.123 | survives deflation |

PBO via CSCV; effective-N via eigenvalue participation ratio; DSR via the False-Strategy-Theorem benchmark, all from the program's deflation toolkit. Headline PBO and rank-persistence are the canonical per-window figures from the dedicated rotation engine (the correct trial axis). Effective-N is measured on different axes (single-series: about 40 independent of 25,162; cross-sectional: independent horizon-bets of 10), so the two are not directly comparable. The best static structural carry/momentum benchmark sits at about PF 1.17.
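The eigenvalue participation ratio has a compact form, (Σλ)² / Σλ², taken over the eigenvalues of the strategies' correlation matrix. As a hedged sketch: for a symmetric matrix the eigenvalue sum equals the trace and the sum of squared eigenvalues equals the sum of squared entries, so no eigendecomposition is needed. The function name and toy matrices are assumptions for illustration.

```python
def effective_n(corr):
    """Participation ratio (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    corr: symmetric correlation matrix as a list of rows.
    Uses trace(C) = sum(lambda) and sum of squared entries = sum(lambda^2),
    both of which hold for any symmetric matrix.
    """
    trace = sum(corr[i][i] for i in range(len(corr)))
    sum_sq = sum(c * c for row in corr for c in row)
    return trace * trace / sum_sq


# Four uncorrelated strategies count as four independent bets.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Four perfectly correlated strategies collapse to a single bet.
clones = [[1.0] * 4 for _ in range(4)]
print(effective_n(identity), effective_n(clones))
```

On this measure a nominal count of thousands of strategies can shrink to a few dozen independent bets when most of them are close variants of one another.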

The cross-sectional edge, horizon by horizon

One rotating long-short portfolio per forecast horizon

Each bar, rank the alive universe by a causal per-name tree score, go long the top quantile and short the bottom, market-neutral, with the bet-sizing knobs (quantile, long-only, tilt, rebalance throttle) tuned in-sample per walk-forward window. Ten horizons were searched; they are part of the same trial family the deflation bar accounts for. The edge is strongest at short to medium horizons and decays gently as the holding period grows, consistent with a momentum-and-reversal cross-section rather than one lucky horizon. The reference model (lgbm) clears the rigorous per-horizon Deflated Sharpe (skew- and kurtosis-adjusted, deflated against the ten-horizon family) on all ten.
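One bar of the rotation can be sketched as follows: rank the live names by score, go long the top quantile and short the bottom with equal weights, so that net exposure is zero. This is a minimal sketch under assumed conventions (equal weighting, unit gross exposure, a fixed quantile); the tuned knobs described above (long-only, tilt, rebalance throttle) are not modelled, and all names are hypothetical.

```python
def rotation_weights(scores, quantile=0.2):
    """Dollar-neutral long-short weights for one bar.

    scores: dict mapping each live name to its model score.
    quantile: fraction of the universe held on each side.
    """
    names = sorted(scores, key=scores.get)  # ascending by score
    k = max(1, int(len(names) * quantile))
    if 2 * k > len(names):
        raise ValueError("quantile too large for this universe")
    weights = {name: 0.0 for name in names}
    for name in names[:k]:       # lowest scores: short leg
        weights[name] = -0.5 / k
    for name in names[-k:]:      # highest scores: long leg
        weights[name] = 0.5 / k
    return weights


# Hypothetical ten-name universe with scores 0..9.
universe = {f"asset_{i}": float(i) for i in range(10)}
w = rotation_weights(universe, quantile=0.2)
print({name: x for name, x in w.items() if x != 0.0})
```

Each leg sums to half the gross exposure, so the portfolio carries no net market exposure by construction.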

Fig. 2: Cross-sectional lgbm rotation, per horizon. Out-of-sample profit factor ranges 1.089–1.158 with median 1.123; the edge peaks at short-to-medium horizons (H=12 and H=24) and decays gently to the longest horizon (H=336). All ten horizons clear the rigorous Deflated Sharpe bar (DSR > 0.95).

| H (bars) | OOS PF | OOS Sharpe (ann.) | DSR | survives |
|---|---|---|---|---|
| 4 | 1.148 | 3.23 | ~1.00 | yes |
| 6 | 1.127 | 2.79 | ~1.00 | yes |
| 8 | 1.120 | 2.61 | ~1.00 | yes |
| 12 | 1.158 | 2.66 | ~1.00 | yes |
| 24 | 1.149 | 3.62 | ~1.00 | yes |
| 48 | 1.111 | 2.62 | ~1.00 | yes |
| 72 | 1.140 | 3.36 | ~1.00 | yes |
| 120 | 1.107 | 2.56 | ~1.00 | yes |
| 168 | 1.095 | 2.39 | ~1.00 | yes |
| 336 | 1.089 | 2.19 | ~1.00 | yes |

Annualized Sharpe uses the engine's own bar-frequency annualization and is indicative; rotation returns are autocorrelated within a holding period, so the effective number of independent observations, and the honest Sharpe, are lower than the raw bar count implies. The deflation verdict and the profit factor do not depend on that annualization choice. Each horizon is evaluated over about 42,000 out-of-sample bars.
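To make the annualization point concrete, here is a small sketch of the two statistics. Profit factor is gross profit over gross loss and carries no time scale, while the annualized Sharpe scales with the square root of the assumed number of bars per year. The ledger values and the bars-per-year figures are hypothetical, and the engine's own annualization may differ.

```python
from math import sqrt
from statistics import mean, stdev


def profit_factor(pnl):
    """Gross profits divided by gross losses over a ledger of per-bar P&L."""
    gains = sum(x for x in pnl if x > 0)
    losses = -sum(x for x in pnl if x < 0)
    return float("inf") if losses == 0 else gains / losses


def annualized_sharpe(pnl, bars_per_year):
    """Per-bar Sharpe scaled by sqrt(bars_per_year); assumes i.i.d. bars."""
    return mean(pnl) / stdev(pnl) * sqrt(bars_per_year)


ledger = [2.0, -1.0, 1.5, -1.5, 1.0]  # hypothetical net per-bar P&L
print(profit_factor(ledger))
# Same ledger, two annualization conventions: only the Sharpe changes.
print(annualized_sharpe(ledger, 8760), annualized_sharpe(ledger, 365))
```

The square-root scaling assumes independent bars, which is why autocorrelated rotation returns make the raw annualized figure optimistic.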

Fig. 3: Out-of-sample profit factor and Sharpe by tree family and horizon, against the expected-max skill-less Sharpe bar (False Strategy Theorem, N=40). lgbm clears all ten horizons, catboost nine of ten (only the longest falls below) and xgb five of ten: 23 of 30 tree (family, horizon) combos clear a genuine multiple-testing bar.

The deflation verdict, family by family

The cross-sectional survival is not a single-model artefact. The whole sweep is treated as one multiple-testing family. Under the False Strategy Theorem, the expected maximum Sharpe a skill-less searcher would post across N trials is computed from the dispersion of the realized per-bar out-of-sample Sharpe estimates, and a (family, horizon) combo “clears” only if its out-of-sample Sharpe exceeds that bar. At the headline trial count N=40 the bar is an annualized Sharpe of about 1.76.
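The expected-maximum bar has a standard closed-form approximation in the False Strategy Theorem: the cross-trial Sharpe dispersion times a blend of two normal quantiles weighted by the Euler-Mascheroni constant. The sketch below assumes that approximation; the function name and the unit dispersion in the example are illustrative, not the program's measured inputs.

```python
from math import e
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate expected maximum Sharpe of n_trials skill-less strategies.

    sharpe_std: cross-trial standard deviation of the Sharpe estimates,
    in the same units (per-bar or annualized) as the returned bar.
    """
    if n_trials < 2:
        raise ValueError("need at least two trials")
    z = NormalDist().inv_cdf
    return sharpe_std * (
        (1 - EULER_GAMMA) * z(1 - 1 / n_trials)
        + EULER_GAMMA * z(1 - 1 / (n_trials * e))
    )


# The bar rises with the number of trials, which is why the trial count matters.
for n in (10, 40, 1000):
    print(n, round(expected_max_sharpe(n, sharpe_std=1.0), 3))
```

A combo clears only if its own out-of-sample Sharpe exceeds this bar, expressed in the same units.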

That is 23 of 30 tree combos clearing a genuine multiple-testing bar, versus 0 of about 25,162 single-series strategies. The three families were run end-to-end on the same survivorship-honest panel and the same purged walk-forward, so the survival is corroborated across independent tree implementations, not asserted from one. (A PBO recomputed across only the ten horizon columns of one model reads 0.62 by construction, because the horizons are near-identical bets of the same model; that cross-check is reported, clearly labelled, and is not the headline. The canonical per-window PBO is 0.10.)

| family | horizons clearing (of 10) | median OOS PF | verdict |
|---|---|---|---|
| lgbm | 10 of 10 | 1.123 | clears |
| catboost | 9 of 10 | 1.125 | clears (longest falls below) |
| xgb | 5 of 10 | 1.072 | partial |

Trees versus neural networks on the identical task

The fair counter-question is whether a neural network, given the same panel, the same forward cross-sectional rank label, the same purged walk-forward, the same rotation sizing search and the same per-fill cost model, does any better. The literature's repeated finding is that it does not. We tested this directly rather than asserting it: a per-bar feed-forward network (mlp) and a per-pair recurrent network (lstm) were run on a managed accelerator (the full panel exceeds a local card) and scored from their real out-of-sample ledgers against the same False-Strategy-Theorem bar the trees faced.

This is a clean, same-task confirmation of the literature's tabular finding on our own data. The neural families are not a near-miss that better tuning would rescue here: they are below break-even after costs, while the trees clear a genuine deflation haircut. Trees beat the nets on the cross-sectional rotation, under identical costs and validation.

Fig. 4: Neural families on the identical cross-sectional task. The feed-forward network (mlp) posts an out-of-sample profit factor of about 0.98 with a small negative annualized Sharpe; the recurrent network (lstm) is weaker still (PF 0.94, clearly negative Sharpe). Neither clears the deflation bar the three tree families clear, and neither approaches the trees' median PF of about 1.12.

| family | H | OOS PF | OOS Sharpe (ann.) | DSR | verdict |
|---|---|---|---|---|---|
| mlp | 24 | 0.984 | −0.40 | 0.13 | below bar |
| mlp | 72 | 0.982 | −0.44 | 0.11 | below bar |
| mlp | 168 | 0.983 | −0.42 | 0.12 | below bar |
| lstm | 6 | 0.940 | −1.51 | n/a | below bar |

Each neural row is scored from its real per-bar out-of-sample ledger on exactly the apparatus the trees used, including the same False-Strategy bar. The remaining neural families (a gated-recurrent set, a temporal-convolution set and a cross-pair attention set) did not produce a usable ledger on this pass and are recorded as deferred, with no number attached, rather than estimated.

Multi-market, does the edge hold outside crypto?

The edge replicates in US equities, but its deflated significance scales with breadth and turnover

Fig. 5: US-equity daily rotation (about 240 large-cap names, lgbm). The out-of-sample profit factor tracks crypto (median about 1.105, peaking near 1.15 at the two-week horizon), but zero of five horizons reach a Deflated Sharpe above 0.95; the best come close (about 0.94 at two weeks, about 0.91 at one month).

The crypto cross-section is a wide panel of hundreds of liquid instruments. To check whether the edge is a property of that one market or of the rotation idea, we ran the identical recipe on a bounded, liquid universe of large-cap US stocks (about 240 names, a median of roughly 245 live each bar), using adjusted daily bars over roughly eleven years. Each bar we rank the live universe by the same causal per-name tree score, rotate long the top quantile and short the bottom (market-neutral), tune the rotation knobs in-sample inside each purged walk-forward window, and charge realistic per-fill costs (median 3.4 bp/side). The model family is the gradient-boosted tree that won in crypto.

The raw edge survives the move to equities, but it is thinner. The profit factor is in the same neighbourhood as crypto: a median out-of-sample profit factor of about 1.105 across the five horizons (crypto about 1.123), peaking near 1.15 at the two-week horizon, with the medium horizons strongest and the very short and very long horizons weakest. The annualized Sharpe ratios, however, are far lower than crypto's (roughly 0.3 to 0.7 versus roughly 2.7), because a daily equity rotation has far fewer rebalances and a much smaller effective sample than an hourly rotation over a much larger panel.

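| H (days) | OOS PF | OOS Sharpe (ann.) | DSR | survives |
|---|---|---|---|---|
| 1 | 1.096 | 0.44 | 0.75 | no |
| 5 | 1.105 | 0.53 | 0.83 | no |
| 10 | 1.149 | 0.74 | 0.94 | no |
| 21 | 1.132 | 0.66 | 0.91 | no |
| 63 | 1.060 | 0.29 | 0.59 | no |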

Survivorship. The equity universe is a fixed list of current large-cap members, so names that were large-cap earlier but have since left the index (delistings, takeovers, demotions) are absent. This is a mild upward bias on the long leg; the result is survivorship-aware but not survivorship-free, and should be read as an approximate upper bound rather than a clean estimate. Later listings enter the cross-section only once they trade (within-live-span gating), so there is no look-ahead onto pre-listing dates. A fully survivorship-free run on a point-in-time membership history is the natural follow-up.

The cross-sectional ML edge partially replicates in US equities: the sign and magnitude of the profit factor carry over almost exactly, but the statistical significance does not, because daily large-cap rotation simply has less independent information per unit time than a wide hourly crypto panel. This sharpens rather than overturns the headline. The cross-sectional regime is genuinely the stronger ML regime in both markets: it produces a real, costed, positive edge where single-series direction forecasting produces noise. Whether that edge clears a multiple-testing haircut depends on the breadth and rebalance frequency of the panel: the widest, fastest panel clears it; daily large-cap equities fall just short.

Method

  • Single-series leg: the program's pooled band corpus of 25,162 direction-forecasting strategies (triple-barrier labels, purged walk-forward, per-fill costs), scored through the program's deflation toolkit (PBO via CSCV, effective-N via the eigenvalue participation ratio, DSR via the False Strategy Theorem). Verdict: PBO 0.62, effective-N about 40, zero deflation survivors.
  • Cross-sectional leg: a wide survivorship-honest perpetual-swap universe of 787 pairs. Each bar, rank the alive universe by a causal per-name gradient-boosted-tree score and rotate a market-neutral long-short portfolio into the top and bottom quantiles, with the bet-sizing knobs tuned in-sample per purged walk-forward window. Ten forecast horizons (4 to 336 bars), full per-fill cost model. Real data only.
  • Deflation accounting: the whole cross-sectional sweep is one multiple-testing family. The False-Strategy-Theorem expected-max skill-less Sharpe is computed from the dispersion of realized per-bar OOS Sharpe across the 30 completed tree combos; a combo clears only if its OOS Sharpe exceeds that bar. The reference model additionally carries a rigorous per-horizon DSR (skew- and kurtosis-adjusted, deflated against the ten-horizon family).
  • Tree families: lgbm, catboost and xgb ran end-to-end on the full panel. Random forest and extra-trees were cut as impractically slow and memory-heavy at full width; the linear families were stopped by the memory guard partway through the search. These carry no result and are recorded as deferred.
  • Neural families: a per-bar feed-forward network (mlp) and a per-pair recurrent network (lstm) were run on a managed accelerator on the identical panel, label, walk-forward, sizing search and cost model, then scored from their real OOS ledgers against the same False-Strategy bar. The gated-recurrent, temporal-convolution and cross-pair attention sets did not produce a usable ledger and are deferred with no number.
  • Multi-market check: the unmodified cross-sectional engine on a bounded, liquid universe of about 240 large-cap US names, adjusted daily bars over roughly eleven years (median about 245 live per bar), five horizons (one day to one quarter), realistic per-fill costs (median 3.4 bp/side). Survivorship-aware (current members, within-live-span gating); a thin bonus FX-and-sector smoke run is banked separately.
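The skew- and kurtosis-adjusted Deflated Sharpe Ratio mentioned in the deflation-accounting step can be sketched from its commonly stated form: the probability that the observed Sharpe exceeds the deflation benchmark, given the sample length and the return distribution's higher moments. This is a hedged illustration, not the program's toolkit; it assumes per-period (non-annualized) Sharpe ratios and non-excess kurtosis (3 for a normal distribution), and the example inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def deflated_sharpe(sr, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds sr_benchmark.

    sr, sr_benchmark: per-period Sharpe ratios (not annualized).
    n_obs: number of return observations behind sr.
    skew, kurt: sample skewness and non-excess kurtosis of the returns.
    """
    # Standard error term of the Sharpe estimate under non-normal returns.
    denom = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    z = (sr - sr_benchmark) * sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)


# Hypothetical inputs: a per-bar Sharpe of 0.05 against a deflation
# benchmark of 0.02 over 42,000 bars, with fat-tailed, left-skewed returns.
print(deflated_sharpe(0.05, 0.02, 42_000, skew=-0.5, kurt=6.0))
```

A value above 0.95 is the survival threshold used in the tables. Because the statistic grows with the square root of the observation count, a short daily sample can fall short of the threshold even when the edge per trade is similar.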

Notes & honest assessment

This closes the program's only structural gap (it had no cross-sectional coverage) and delivers its first deflation-surviving ML edge. The honest headline is kept throughout: the cross-sectional ML edge approaches but does not beat the best static structural archetype (about PF 1.17). The publishable claim is therefore stronger and more defensible than “nothing works”: machine learning earns a real, deflation-surviving cross-sectional edge, but it lands at roughly the same place a well-built static carry/momentum strategy already sits. ML edge ≈ static edge; neither dominates. The value of ML here is parity through a different, less crowded route, not a free lunch over structure. Where a statistic is recomputed on a different trial axis than the dedicated engine, the canonical per-window figure is the headline and the recompute is labelled a cross-check; the two are never conflated.

Reproducibility

The single-series and cross-sectional results, the family-level deflation table, the neural-family ledgers, the US-equity leg, the figures and the deflation toolkit are collected in project 15 of lopez-de-prado-work-review. The analysis scripts recompute DSR, PBO and effective-N from the banked ledgers and are idempotent and robust to missing inputs; they read ledgers and train nothing.

See also

The companion study on multiple-testing deflation is Backtest Overfitting & the Deflated Sharpe Ratio; the labeling and purged-cross-validation machinery is in Labeling & Cross-Validation; the selection-discipline theme runs through The edge is in the process; and the broader body of work is at Research.

Cross-Sectional Machine Learning: the program's first deflation-surviving ML edge | Daru Finance