Research review · ensembles & feature importance · study 09
Ensembles & Feature Importance
Across 40 instruments in crypto, equities and forex, bagging generalizes ~4.8× better than boosting, and still produces no tradeable edge
López de Prado makes three sharp, testable claims about machine learning on financial data: bagging generalizes better than boosting on noisy, overlapping labels; in-sample MDI importance is substitution-biased while out-of-sample MDA is not; and a classifier should be tuned by a probabilistic loss, not accuracy. This study puts all three to a multi-market, fully-costed, leakage-controlled test on 40 instruments (27 crypto perps, 5 US-equity ETFs and 8 forex majors), with purged cross-validation and a Deflated-Sharpe gate. All three replicate cleanly. None of them buys a tradeable edge.
Advances in Financial ML, Ch. 6 & 8
López de Prado (2018) · ensembles & feature importance
Code & data
lopez-de-prado-work-review / projects / 09
the claim
What López de Prado says
- Bagging generalizes better than boosting on financial data: labels are noisy and serially overlapping, so boosting chases that noise by re-weighting hard (often mislabeled) cases and overfits, whereas bagging averages decorrelated trees and is more robust.
- MDI feature importance is biased: it is in-sample and mechanically inflated for correlated features (the substitution effect). Use out-of-sample, permutation-based MDA instead, and cluster correlated features and permute whole clusters to remove the artefact.
- Tune classifiers by a probabilistic loss (log-loss), not accuracy.
our result
What we found
- All three claims replicate cleanly across 40 instruments in three markets: bagging's IS→OOS generalization gap is ~4.8× smaller than boosting's (on 100% of instruments, p = 1.8e-12; boosting drives OOS log-loss past coin-flip); MDA is essentially bias-free (ρ 0.07 vs MDI 0.49); and log-loss tuning beats accuracy tuning (p = 8.3e-6).
- Two honest refinements: on this 10-feature set clustered-MDA buys selection stability, not de-biasing (MDA alone is the de-biasing tool), and the selected feature set is only barely more stable than random.
- The honest verdict: 0 of 40 instruments clear DSR > 0.95. Better generalization is necessary but not sufficient for edge; this is a clean methodology win with zero tradeable alpha.
The result in three lines
40 instruments · 3 markets
Bagging generalizes ~4.8× better than boosting
Built the textbook way, a RandomForest keeps its in-sample-to-out-of-sample log-loss gap ~4.8× tighter than HistGradientBoosting, and does so on 100% of the 40 instruments (Wilcoxon p = 1.8e-12). Boosting drives in-sample log-loss to ~0.25 by memorizing the noisy, overlapping labels, then pays for it with an out-of-sample log-loss past coin-flip (ln2 ≈ 0.693). The first prescription holds unambiguously.
MDI ρ 0.49 · MDA ρ 0.07
MDI is substitution-biased; MDA is not
Mean-Decrease-Impurity importance is rank-correlated with how correlated a feature is to the rest (ρ ≈ 0.49): this is the substitution artefact. The out-of-sample permutation measure, MDA scored by log-loss, shows essentially none (ρ ≈ 0.07; lower than MDI on 88% of instruments, p = 1.2e-6). Several MDI-favored features get negative MDA: permuting them improves out-of-sample log-loss. The twist: on this 10-feature set, clustered-MDA buys selection stability, not de-biasing.
0 / 40 clear DSR > 0.95
Better generalization is not edge
The honest verdict. Neither ensemble produces a Deflated Sharpe Ratio anywhere near the 0.95 publishable bar on any of the 40 instruments, and the median bet Sharpe is negative for both. Bagging wins the generalization contest decisively, yet the DSR contest not at all (paired p = 0.74). A clean methods replication with a null edge result; that separation is the point.
Claim 1: bagging vs boosting on 40 instruments
Boosting memorizes the train set, then pays for it out-of-sample
The first prescription holds unambiguously. A RandomForest built the textbook way keeps its in-sample-to-out-of-sample log-loss gap about 4.8× tighter than HistGradientBoosting, and it does so on 100% of the 40 instruments across all three markets (paired Wilcoxon p = 1.8e-12).
The mechanism is exactly the textbook one. On the two genuinely noisy markets, crypto and forex, boosting's extra capacity is spent fitting the noisy, serially-overlapping labels: its in-sample log-loss collapses to ~0.25, far below the coin-flip benchmark, but its out-of-sample log-loss is worse than a coin flip (~0.81 > ln 2 ≈ 0.693). The bagged forest never memorizes: it holds in-sample at ~0.57 and out-of-sample at ~0.68, staying at or under coin-flip out of sample. In equities, in-sample log-loss is low for both models for a different reason: the label is class-imbalanced (base rate ~0.17), which is trivially easy to predict in-sample.
| metric | value | note |
|---|---|---|
| bagging IS→OOS gap | 0.114 | RandomForest, NLL-tuned |
| boosting IS→OOS gap | 0.539 | HistGradientBoosting |
| gap ratio | 4.8× | boosting overfits more |
| RF < HGB | 100% | of 40 instruments |
| paired Wilcoxon | 1.8e-12 | p-value |
| DSR > 0.95 | 0 / 40 | either ensemble |
| market | RF IS | RF OOS | HGB IS | HGB OOS |
|---|---|---|---|---|
| Crypto (27 perps) | 0.569 | 0.680 | 0.255 | 0.809 |
| Equities (5 ETFs) | 0.134 | 0.557 | 0.153 | 0.659 |
| Forex (8 majors) | 0.569 | 0.677 | 0.277 | 0.808 |
Median log-loss per market, NLL-tuned. Boosting's out-of-sample log-loss is worse than a coin flip (ln 2 ≈ 0.693) in crypto and forex; bagging's stays at or below it in every market.
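To make the coin-flip benchmark concrete, the short sketch below computes binary log-loss by hand, confirms that a constant 0.5 forecast scores ln 2, and derives the in-sample-to-out-of-sample gaps from the table's medians. The function name and dictionary layout are illustrative, not taken from the project's code. Note that a gap of medians is not the same statistic as the per-instrument gap figures quoted elsewhere on this page, so the ratios will not match the headline exactly.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# A forecaster that always says 0.5 scores ln 2, whatever the labels are.
coin_flip = binary_log_loss([0, 1, 1, 0], [0.5] * 4)

# Median (in-sample, out-of-sample) log-loss per market, copied from the table.
table = {
    "crypto":   {"RF": (0.569, 0.680), "HGB": (0.255, 0.809)},
    "equities": {"RF": (0.134, 0.557), "HGB": (0.153, 0.659)},
    "forex":    {"RF": (0.569, 0.677), "HGB": (0.277, 0.808)},
}

# Overfit gap = out-of-sample minus in-sample log-loss (gap of medians).
gaps = {
    market: {model: round(oos - ins, 3) for model, (ins, oos) in row.items()}
    for market, row in table.items()
}

# Markets where boosting's out-of-sample log-loss is worse than a coin flip.
worse_than_coin = [m for m, row in table.items() if row["HGB"][1] > coin_flip]

print(gaps)
print(worse_than_coin)
```

On these medians the boosted model is past coin-flip in crypto and forex but not in equities, which is what the caption states.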
Explore the overfit
The same effect, one instrument or one market at a time. Every number is the real per-instrument panel result, NLL-tuned, purged-CV, costed.
Pick a market to see the median of its panel, or drill into a single instrument. The left chart contrasts in-sample and out-of-sample log-loss for the bagged forest and the boosted trees, against the coin-flip line; the right chart shows the substitution bias of the three importance measures. Boosting's in-sample bar sits far below coin-flip (it has memorized the labels), while its out-of-sample bar typically runs past it.
Demo: bagging vs boosting overfit explorer
Boosting drives in-sample log-loss far below coin-flip (it memorizes the train set) yet its out-of-sample log-loss runs past coin-flip. The bagged forest keeps in-sample and out-of-sample close. Right panel: only MDA's importance is free of substitution bias.
Crypto: median over 27 instruments. Boosting's in-sample log-loss is 0.255 (well below the coin-flip level of ln 2 ≈ 0.693: it has memorized the labels), but its out-of-sample log-loss is 0.808, past coin-flip. The bagged forest holds in-sample at 0.569 and out-of-sample at 0.680, a gap 4.9× smaller. Substitution bias ρ(importance, |corr|): MDI 0.491, MDA 0.115, clustered-MDA 0.552; only MDA is near zero.
The panel reproduces bit-identically across runs (random_state = 0 throughout, BLAS/OMP pinned to one thread per process). The non-sklearn hot loops (the sequential-bootstrap draw and the co-event count / average label uniqueness) are JIT-compiled and verified bit-identical against pure-Python references.
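For readers unfamiliar with the two quantities named above, here is a minimal pure-Python sketch of a co-event count and average label uniqueness in the AFML sense. It is an illustration under stated assumptions, not the project's reference implementation: the function names are hypothetical, and label spans are assumed to be inclusive `(start_bar, end_bar)` index pairs.

```python
def co_event_count(spans, n_bars):
    """c[t] = number of labels whose inclusive [start, end] span covers bar t."""
    counts = [0] * n_bars
    for start, end in spans:
        for t in range(start, end + 1):
            counts[t] += 1
    return counts

def average_uniqueness(spans, n_bars):
    """For each label, the mean of 1 / c[t] over the bars its span covers."""
    counts = co_event_count(spans, n_bars)
    result = []
    for start, end in spans:
        inverse = [1.0 / counts[t] for t in range(start, end + 1)]
        result.append(sum(inverse) / len(inverse))
    return result

# Toy example: labels 0 and 1 overlap on bars 2-3; label 2 stands alone.
spans = [(0, 3), (2, 5), (6, 7)]
counts = co_event_count(spans, n_bars=8)
uniq = average_uniqueness(spans, n_bars=8)
mean_uniq = sum(uniq) / len(uniq)

print(counts)     # [1, 1, 2, 2, 1, 1, 1, 1]
print(uniq)       # [0.75, 0.75, 1.0]
print(mean_uniq)  # about 0.833
```

A panel-average uniqueness like `mean_uniq` is the kind of number that could be passed as a bagging subsample fraction, which is how the Method section describes setting the forest's max_samples.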
Claim 2: MDI is biased, MDA is not
Mean-Decrease-Impurity is computed in-sample and is mechanically inflated for correlated features: a feature's MDI importance ends up rank-correlated with how correlated that feature is to the rest (ρ ≈ 0.49, highest in forex at 0.67). The momentum block and the vol block split the impurity credit and inflate each other. The out-of-sample permutation measure, MDA scored by log-loss, shows essentially none of this (ρ ≈ 0.07; lower than MDI on 88% of instruments, p = 1.2e-6). Strikingly, several features MDI ranks highly earn negative MDA: permuting them improves out-of-sample log-loss, so the model was over-relying on noise. MDI keeps them; MDA flags them as harmful.
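The bias statistic used here, the Spearman ρ between a method's importance vector and each feature's mean absolute correlation with the others, can be sketched in a few lines of NumPy. This is an illustrative reconstruction from the definition on this page; the helper names and the toy data are assumptions, and the rank helper assumes no ties.

```python
import numpy as np

def rank(a):
    """Ordinal ranks 0..n-1; assumes no ties (true for continuous toy data)."""
    order = np.argsort(a)
    r = np.empty(len(a), dtype=float)
    r[order] = np.arange(len(a))
    return r

def spearman(a, b):
    """Spearman rho as the Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(rank(a), rank(b))[0, 1])

def mean_abs_corr(X):
    """Each feature's mean |Pearson correlation| with every other feature."""
    c = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(c, 0.0)  # exclude self-correlation
    return c.sum(axis=0) / (X.shape[1] - 1)

def substitution_bias(importance, X):
    """Rank correlation between importance and how 'crowded' each feature is."""
    return spearman(np.asarray(importance, dtype=float), mean_abs_corr(X))

# Toy data: three features share a common driver, three are independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
X = np.hstack([base + 0.3 * rng.normal(size=(500, 3)), rng.normal(size=(500, 3))])

crowd = mean_abs_corr(X)
fully_biased = substitution_bias(2.0 * crowd + 0.1, X)  # importance tracks |corr|
anti_biased = substitution_bias(-crowd, X)              # importance opposes |corr|
print(fully_biased, anti_biased)
```

An importance measure that is free of the substitution artefact should land near zero on this statistic, which is the sense in which the MDA figure of about 0.07 is read above.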
The honest twist is that clustered-MDA does not remove the bias on this feature set. With only 10 features collapsing into ~5 clusters, the correlated features land in the same cluster, so whole-cluster permutation still attributes large importance to that correlated cluster, and expanding it back to features re-introduces the |corr| association (ρ ≈ 0.52, statistically indistinguishable from MDI, p = 0.22). What clustering does buy is selection stability: collapsing substitutes raises the top-3 Jaccard across CPCV paths from MDA's 0.26 to 0.46 (p = 1.8e-12). MDA's own selection is genuinely above the random baseline (0.20) on 98% of instruments, but it is only barely stable, so one should not over-interpret “the top features” on this task.
| method | bias ρ | stability | note |
|---|---|---|---|
| MDI | 0.485 | 0.582 | in-sample baseline, biased |
| MDA | 0.073 | 0.263 | out-of-sample permutation, unbiased |
| clustered-MDA | 0.517 | 0.463 | stability, not de-biasing |
| random baseline | n/a | 0.201 | 3 of 10 features at random |
Median over the 40-instrument panel. Bias ρ is the Spearman correlation between a method's importance and each feature's mean |correlation|; stability is the mean top-3 Jaccard across the 15 CPCV paths. The lesson: with a small, moderately-correlated feature set, MDA, not clustered-MDA, is the de-biasing tool; clustering matters more when there are many tightly-correlated features to absorb.
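The random-baseline stability in the table can be checked by brute force. The sketch below enumerates every 3-of-10 subset to get the expected Jaccard overlap of two independent random selections, and shows the mean pairwise top-k Jaccard on a small made-up set of selections. The function names and the example feature picks are illustrative only.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection over size of the union of two selections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def mean_pairwise_jaccard(selections):
    """Average Jaccard over all unordered pairs of selections."""
    pairs = list(combinations(selections, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def random_baseline(n_features=10, k=3):
    """Expected Jaccard between two independent uniform k-subsets of n features."""
    subsets = list(combinations(range(n_features), k))
    reference = subsets[0]  # by symmetry, fixing one subset is enough
    return sum(jaccard(reference, s) for s in subsets) / len(subsets)

baseline = random_baseline()
print(round(baseline, 3))  # 0.201

# Hypothetical top-3 picks from three resampling paths.
paths = [("side", "mom6", "rsi"), ("side", "mom6", "vol"), ("ofi", "mom6", "rsi")]
stability = mean_pairwise_jaccard(paths)
print(round(stability, 3))  # 0.4
```

The enumerated baseline agrees with the 0.201 in the table, which puts MDA's 0.263 in context: above chance, but not by much.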
The fix, measured the right way
Score importance per cluster, not per feature
The previous panel measured clustered-MDA the way it usually gets misused: collapse correlated features into clusters, then expand the scores back to individual features. Done that way it still tracks |corr|. But that is not what López de Prado's clustered feature importance is for. The fix is to read the score at the cluster level: group correlated features, permute or sum the whole block, and never re-attribute to members. On the same BTCUSDT triple-barrier task (351 events, 10 features collapsing into 6 clusters by 1−|ρ| Ward linkage), that is exactly what cures the substitution effect.
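A minimal sketch of what "permute the whole block and read the score at the cluster level" can look like, assuming NumPy and a generic `predict` callable that returns P(y = 1). Everything here is illustrative: the toy model, the cluster names and the data are assumptions, and a real run would score a fitted model on purged out-of-fold rows rather than on the rows it was built from.

```python
import numpy as np

def log_loss(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def clustered_mda(predict, X, y, clusters, n_repeats=20, seed=0):
    """Mean increase in log-loss when all columns of a cluster are shuffled together.

    Scores stay at cluster level; nothing is re-attributed to member features.
    """
    rng = np.random.default_rng(seed)
    base = log_loss(y, predict(X))
    scores = {}
    for name, cols in clusters.items():
        deltas = []
        for _ in range(n_repeats):
            Xp = X.copy()
            perm = rng.permutation(len(X))
            Xp[:, cols] = X[perm][:, cols]  # one shared row shuffle for the block
            deltas.append(log_loss(y, predict(Xp)) - base)
        scores[name] = float(np.mean(deltas))
    return scores

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: columns 0 and 1 are near-duplicates (substitutes); column 2 is noise.
rng = np.random.default_rng(1)
n = 2000
driver = rng.normal(size=n)
X = np.column_stack([
    driver + 0.1 * rng.normal(size=n),
    driver + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])
y = (rng.random(n) < sigmoid(X[:, 0] + X[:, 1])).astype(float)

# Hypothetical stand-in for a fitted model: uses both substitutes, ignores noise.
predict = lambda data: sigmoid(data[:, 0] + data[:, 1])

scores = clustered_mda(predict, X, y, {"C_substitutes": [0, 1], "C_noise": [2]})
print(scores)
```

Shuffling the two substitutes with one shared permutation removes their joint signal at once, so the block gets the credit that a per-feature permutation would split or hide; the noise cluster scores zero because the toy model never reads it.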
The correlated momentum/side block (C1: side, mom6, mom12, rsi) looks only moderate per feature: its best single member earns flat MDI 0.148. Scored as one cluster it is the single dominant driver at 0.385, about 2.6× its best member: the impurity credit the substitution effect had been splitting four ways is reunited. De-dilution is concrete: the Gini of the importance vector rises from 0.205 (flat) to 0.337 (clustered). And cluster-level ranking is far steadier across purged folds: clustered-MDA fold-stability is ~0.30 versus ~0.12 flat, about 2.5×, so clustering works on the axis it is built for.
| cluster | clustered MDI | best member MDI | clustered MDA |
|---|---|---|---|
| C1 | 0.385 | 0.148 | 0.074 |
| C5 | 0.213 | 0.113 | 0.032 |
| C2 | 0.153 | 0.153 | 0.024 |
| C3 | 0.094 | 0.094 | 0.004 |
| C4 | 0.086 | 0.086 | 0.009 |
| C6 | 0.069 | 0.069 | -0.000 |
Cluster-level importance, BTCUSDT (neg-log-loss-tuned RF, purged CV). Clustered MDI sums within-cluster impurity; clustered MDA permutes the whole block. For singleton clusters (C2–C4, C6) the cluster score equals its lone member. The substitution effect is visible only where a cluster holds several correlated features (C1, C5): there the cluster score exceeds its best member, reuniting the split credit. C1 is the single dominant driver at 0.385, about 2.6× its best individual feature.
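The de-dilution figures can be recomputed from the table itself. The sketch below uses a standard sorted-rank formula for the Gini coefficient (an assumption about which Gini definition was used; it happens to reproduce the quoted clustered value) and the clustered-to-best-member ratio for C1. The flat per-feature Gini of 0.205 cannot be recomputed here because the ten per-feature MDI values are not listed.

```python
def gini(values):
    """Gini coefficient of non-negative values via the sorted-rank formula."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n

# Columns copied from the cluster-level table above.
clustered_mdi = {"C1": 0.385, "C5": 0.213, "C2": 0.153,
                 "C3": 0.094, "C4": 0.086, "C6": 0.069}
best_member_mdi = {"C1": 0.148, "C5": 0.113, "C2": 0.153,
                   "C3": 0.094, "C4": 0.086, "C6": 0.069}

clustered_gini = gini(clustered_mdi.values())
c1_ratio = clustered_mdi["C1"] / best_member_mdi["C1"]

# Clusters whose score exceeds their best member hold several correlated features.
multi_member = [c for c in clustered_mdi if clustered_mdi[c] > best_member_mdi[c]]

print(round(clustered_gini, 3))  # 0.337
print(round(c1_ratio, 1))        # 2.6
print(multi_member)              # ['C1', 'C5']
```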
Claim 3: tune by log-loss, not accuracy
The methodological control runs throughout: each model is tuned both by negative log-loss and by accuracy. Tuning by log-loss yields a significantly smaller overfit gap (pooled mean 0.331 vs 0.421, Wilcoxon p = 8.3e-6), and the effect holds per-model for both families (RF p = 4.4e-4, HGB p = 1.3e-4). The two objectives pick the same configuration only ~60% of the time for RF and ~52% for HGB, so on roughly half the instruments the choice of metric flips the selected model, and when it does, the log-loss choice generalizes better.
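Why the two objectives can disagree is easy to see on a toy case: accuracy only looks at which side of 0.5 a probability falls on, while log-loss also charges for confidence on the wrong side. The labels and the two probability vectors below are made up for illustration; they are not taken from the panel.

```python
import math

def log_loss(y_true, p_pred):
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, p_pred)
    ) / len(y_true)

def accuracy(y_true, p_pred, threshold=0.5):
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, p_pred))
    return hits / len(y_true)

y = [1, 0, 1, 1, 0, 0, 1, 0]

# Two hypothetical configurations that make exactly the same hard calls.
modest        = [0.65, 0.35, 0.60, 0.40, 0.35, 0.60, 0.65, 0.40]
overconfident = [0.99, 0.01, 0.98, 0.02, 0.01, 0.97, 0.99, 0.03]

acc_modest, acc_over = accuracy(y, modest), accuracy(y, overconfident)
nll_modest, nll_over = log_loss(y, modest), log_loss(y, overconfident)

print(acc_modest, acc_over)  # identical accuracy
print(nll_modest, nll_over)  # very different log-loss
```

Accuracy cannot separate the two configurations, while log-loss puts the overconfident one past the coin-flip level of ln 2 because of its two confident misses. When predicted probabilities size the bets, as they do in this study, that difference matters.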
…and yet no edge (the honest part)
This is the nuance the prescriptions do not cover: better generalization is necessary, not sufficient, for edge. On this EMA-crossover meta-labeling task neither ensemble produces a Deflated Sharpe Ratio anywhere near the 0.95 bar. The median bet Sharpe is negative for both, and once the Sharpe is deflated against the pooled grid search the DSR collapses to ~0. Bagging wins the generalization contest decisively and the DSR contest not at all (paired p = 0.74). Across all 40 instruments, 0 clear DSR > 0.95; the single near-miss is USDJPY (HGB DSR 0.92), with EURUSD next (RF DSR 0.56). Both are forex, where the costed task is least adversarial. A clean methods replication with a null edge result, and that separation is exactly the contribution.
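For reference, a compact sketch of the Deflated Sharpe Ratio following the formula in Bailey & López de Prado (2014), using only the standard library. The input values in the example are hypothetical placeholders, not results from this study; only the trial count of 14 is taken from this page. All Sharpe quantities are per-observation (not annualized), and `n_trials` must be at least 2.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()

def expected_max_sharpe(sr_std, n_trials):
    """Expected best Sharpe among n_trials skill-less trials (the SR0 benchmark)."""
    a = _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sr_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)

def deflated_sharpe(sr, sr_std, n_trials, n_obs, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds SR0, given sample length and moments.

    sr      observed per-observation Sharpe of the selected strategy
    sr_std  standard deviation of Sharpe estimates across the trials
    kurt    raw kurtosis (3.0 for a normal distribution)
    """
    sr0 = expected_max_sharpe(sr_std, n_trials)
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    return _NORMAL.cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)

# Hypothetical inputs for illustration only.
dsr_14 = deflated_sharpe(sr=0.02, sr_std=0.015, n_trials=14, n_obs=2000)
dsr_2 = deflated_sharpe(sr=0.02, sr_std=0.015, n_trials=2, n_obs=2000)
dsr_neg = deflated_sharpe(sr=-0.01, sr_std=0.015, n_trials=14, n_obs=2000)

print(dsr_14, dsr_2, dsr_neg)
```

Two properties are visible in the example: more trials raise the benchmark and lower the DSR, and a negative observed Sharpe leaves the DSR far below 0.5, which is the regime the panel's negative median bet Sharpe falls into.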
Method
- One fixed structural task: a primary EMA(20/60) crossover sets the side; a triple-barrier (profit-take 1.5σ, stop 1.0σ, max-hold 50 bars, σ = causal EWMA vol) scanned on full intrabar OHLC gives the outcome; the meta-label is y = 1[net P&L > 0] after realistic round-trip cost. Ten causal features (side, vol, ma_gap, mom3/6/12, rsi, vol_ratio, ofi, range_atr). This is a model/importance study, not a label sweep.
- Two ensembles, built faithfully. Bagging is a RandomForest built the AFML way: low max_features to decorrelate trees, regularized leaves under weighting, label-uniqueness sample weights, and max_samples set to average label uniqueness (the sequential-bootstrap analogue). Boosting is HistGradientBoosting over a small grid of iterations, learning rate, leaf nodes and L2.
- All cross-validation is purged k-fold with embargo, with each label's span mapped from bar space to event space so train rows overlapping a test fold are dropped. Importance stability is measured across CPCV (C(6,2) = 15 paths).
- Each model is hyper-tuned twice, by negative log-loss (headline, AFML 9.4) and by accuracy (control), to compare the resulting overfit gaps. The purged out-of-fold P(profit) sizes each bet; costed P&L feeds a per-bar return series whose Sharpe is deflated against the dispersion of the pooled RF+HGB grid.
- Importance three ways: MDI from a single full-sample fit (the biased in-sample baseline), MDA as purged out-of-fold permutation scored by log-loss, and clustered-MDA permuting whole correlation clusters. Substitution bias is the Spearman ρ between a method's importance and each feature's mean |correlation|; selection stability is the mean pairwise top-3 Jaccard across CPCV paths against a random-selection baseline.
- Information-driven bars throughout (dollar bars for crypto and equities, count bars for forex), with non-zero costs on every bet (crypto flat 7 bp/side; equities and forex a causal time-of-day half-spread plus commission/swap, ≈1–2 bp/side). Two ETFs (XLF, XLV) were dropped by a data-driven minimum-class floor, not by hand.
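The purging and embargo step described in the list above can be sketched as follows. This is a simplified illustration, not the project's splitter: it assumes events are ordered by start bar, that each event's label span is given as inclusive bar indices `t0[i]..t1[i]`, and that folds are contiguous blocks of events. The function name and the toy spans are hypothetical.

```python
import numpy as np

def purged_kfold(t0, t1, n_splits=5, embargo=0):
    """Yield (train, test) event indices with purging and an embargo.

    A train event is dropped if its label span intersects the test window,
    or if it starts within `embargo` bars after the test window ends.
    """
    t0, t1 = np.asarray(t0), np.asarray(t1)
    idx = np.arange(len(t0))
    for test in np.array_split(idx, n_splits):
        lo, hi = t0[test].min(), t1[test].max()       # test window in bar space
        overlaps = (t0 <= hi) & (t1 >= lo)            # span intersects the window
        embargoed = (t0 > hi) & (t0 <= hi + embargo)  # starts just after the window
        yield idx[~overlaps & ~embargoed], test

# Toy events: one every 5 bars, each label spanning 12 bars, so neighbours overlap.
t0 = np.arange(0, 100, 5)
t1 = t0 + 12
splits = list(purged_kfold(t0, t1, n_splits=4, embargo=10))

for train, test in splits:
    print(len(train), len(test))
```

Because the spans overlap, each fold loses more training events than the test fold alone would remove; that loss is the price of keeping information from the test window out of training.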
Notes & limitations
The study deliberately separates “generalizes better / less biased” (which all replicate) from “edge” (which does not appear), and the honest verdict, 0/40 above the 0.95 bar, is the integrity of the work. Equities are the weakest leg: their EMA-crossover bets clear the cost-and-barrier threshold only ~17–23% of the time, so their low in-sample log-loss is partly a class-imbalance artefact and two ETFs failed a minimum-class floor outright. The RandomForest's max_samples is set to average label uniqueness as a first-order analogue of the true sequential bootstrap (faithful to the intent, not the exact procedure). The Deflated-Sharpe trial count is a modest 14 configurations, so the deflation is, if anything, generous, and the DSR is still ~0.
Reproducibility
Every feature and label uses only data up to the current bar; every bet is net of realistic per-side cost; every fold is purged and embargoed. The harness, its self-tests and the full multi-market panel are collected in project 09 of lopez-de-prado-work-review. The explorer on this page is self-contained: the per-instrument panel numbers are encoded in the component, so the mechanic it illustrates reproduces exactly on every load.
References
The primary sources for the prescriptions reviewed here:
- López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. (Ch. 6 ensembles, Ch. 8 feature importance, Ch. 9 hyper-parameter tuning.)
- López de Prado, M. (2020). Machine Learning for Asset Managers. Cambridge University Press. (Ch. 6 clustered feature importance.)
- Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. Journal of Portfolio Management, 40(5).
See also
The multiple-testing apparatus this study's Deflated-Sharpe gate relies on is reviewed in Backtest Overfitting & the Deflated Sharpe Ratio, the selection-discipline theme is developed in The edge is in the process, and the broader body of work is at Research.

