Research review · ensembles & feature importance · study 09

Ensembles & Feature Importance

Across 40 instruments in crypto, equities and forex, bagging generalizes ~4.8× better than boosting, and still produces no tradeable edge

López de Prado makes three sharp, testable claims about machine learning on financial data: bagging generalizes better than boosting on noisy, overlapping labels; the in-sample MDI importance is substitution-biased while out-of-sample MDA is not; and a classifier should be tuned by a probabilistic loss, not accuracy. This study puts all three to a multi-market, fully-costed, leakage-controlled test on 40 instruments (27 crypto perps, 5 US-equity ETFs and 8 forex majors), with purged cross-validation and a Deflated-Sharpe gate. All three replicate cleanly. None of them buys a tradeable edge.

source

Advances in Financial ML, Ch. 6 & 8

López de Prado (2018) · ensembles & feature importance

Code & data

lopez-de-prado-work-review / projects / 09

the claim

What López de Prado says

Bagging generalizes better than boosting on financial data: labels are noisy and serially overlapping, so boosting chases that noise by re-weighting hard (often mislabeled) cases and overfits, whereas bagging averages decorrelated trees and is more robust.
MDI feature importance is biased: it is computed in-sample and mechanically inflated for correlated features (the substitution effect). Use out-of-sample, permutation-based MDA instead, and cluster correlated features and permute whole clusters to remove the artefact.
Tune classifiers by a probabilistic loss (log-loss), not accuracy.

our result

What we found

All three claims replicate cleanly across 40 instruments in three markets: bagging's IS→OOS generalization gap is ~4.8× smaller than boosting's (on 100% of instruments, p = 1.8e-12; boosting drives OOS log-loss past coin-flip); MDA is essentially bias-free (ρ 0.07 vs MDI 0.49); and log-loss tuning beats accuracy tuning (p = 8.3e-6).
Two honest refinements: on this 10-feature set clustered-MDA buys selection stability, not de-biasing (MDA alone is the de-biasing tool), and the selected feature set is only barely more stable than random.
The honest verdict: 0 of 40 instruments clear DSR > 0.95. Better generalization is necessary but not sufficient for edge; this is a clean methodology win with zero tradeable alpha.

The result in three lines

40 instruments · 3 markets

Bagging generalizes ~4.8× better than boosting

Built the textbook way, a RandomForest keeps its in-sample-to-out-of-sample log-loss gap ~4.8× tighter than HistGradientBoosting, and does so on 100% of the 40 instruments (Wilcoxon p = 1.8e-12). Boosting drives in-sample log-loss to ~0.25 by memorizing the noisy, overlapping labels, then pays for it with an out-of-sample log-loss past coin-flip (ln2 ≈ 0.693). The first prescription holds unambiguously.

MDI ρ 0.49 · MDA ρ 0.07

MDI is substitution-biased; MDA is not

Mean-Decrease-Impurity's importance is rank-correlated with how correlated a feature is to the rest (ρ ≈ 0.49): this is the substitution artefact. The out-of-sample permutation measure, MDA scored by log-loss, shows essentially none (ρ ≈ 0.07; lower than MDI on 88% of instruments, p = 1.2e-6). Several MDI-favored features get negative MDA: permuting them improves out-of-sample log-loss. The twist: on this 10-feature set, clustered-MDA buys selection stability, not de-biasing.

0 / 40 clear DSR > 0.95

Better generalization is not edge

The honest verdict. Neither ensemble produces a Deflated Sharpe Ratio anywhere near the 0.95 publishable bar on any of the 40 instruments, and the median bet Sharpe is negative for both. Bagging wins the generalization contest decisively yet the DSR contest not at all (paired p = 0.74). A clean methods replication with a null edge result; that separation is the point.

Claim 1: bagging vs boosting, on 40 instruments

Boosting memorizes the train set, then pays for it out-of-sample

Fig. 1: Bagging beats boosting on every instrument. Left: each point is one instrument's boosting (HGB) IS→OOS log-loss gap against its bagging (RF) gap; all 40 sit above the diagonal, so boosting overfits more everywhere (Wilcoxon p = 1.8e-12). Right: in-sample vs out-of-sample log-loss by market. Boosting drives in-sample log-loss to ~0.25 (well below the coin-flip ceiling ln2 ≈ 0.693) yet its out-of-sample log-loss runs past it (~0.81 in crypto and forex). Bagging keeps both close (~0.57 → ~0.68).

The first prescription holds unambiguously. A RandomForest built the textbook way keeps its in-sample-to-out-of-sample log-loss gap about 4.8× tighter than HistGradientBoosting, and it does so on 100% of the 40 instruments across all three markets (paired Wilcoxon p = 1.8e-12).

The mechanism is exactly the textbook one. On the two genuinely noisy markets, crypto and forex, boosting's extra capacity is spent fitting the noisy, serially-overlapping labels: its in-sample log-loss collapses to ~0.25, far below the coin-flip ceiling, but its out-of-sample log-loss is worse than a coin flip (~0.81 > ln2 ≈ 0.693). The bagged forest never memorizes: it holds in-sample at ~0.57 and out-of-sample at ~0.68, staying at or under coin-flip out of sample. In equities, in-sample log-loss is low for both models for a different reason: the label is class-imbalanced (base rate ~0.17), which the forest can trivially predict in-sample.
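To make the coin-flip ceiling concrete, here is a minimal sketch of binary log-loss. The labels and probabilities are illustrative, not taken from the study's panel, and `log_loss` is a hand-rolled helper rather than the study's code. It shows that a constant 0.5 forecast scores exactly ln 2, and that a forecaster can be right more often than not and still score worse than that if its wrong calls are overconfident.

```python
from math import log

def log_loss(y_true, p_pred, eps=1e-12):
    """Mean binary negative log-likelihood; p_pred is the forecast P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)

# Illustrative balanced labels (hypothetical data).
y = [1, 0, 1, 0, 1, 0, 1, 0]

# An uninformative forecast: always 0.5.
coin_flip = [0.5] * len(y)

# An overconfident forecast: correct on 5 of 8 cases, badly wrong on 3.
overconfident = [0.95, 0.05, 0.95, 0.05, 0.95, 0.95, 0.05, 0.95]

coin_flip_loss = log_loss(y, coin_flip)
overconfident_loss = log_loss(y, overconfident)

print("coin flip     :", round(coin_flip_loss, 3))
print("overconfident :", round(overconfident_loss, 3))
print("past coin-flip:", overconfident_loss > log(2))
```

The three confident misses dominate the average, which is the same arithmetic that pushes a memorizing model's out-of-sample log-loss past ln 2.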

metric	value	note
bagging IS→OOS gap	0.114	RandomForest, NLL-tuned
boosting IS→OOS gap	0.539	HistGradientBoosting
gap ratio	4.8×	boosting overfits more
RF < HGB	100%	of 40 instruments
paired Wilcoxon	1.8e-12	p-value
DSR > 0.95	0 / 40	either ensemble

market	RF IS	RF OOS	HGB IS	HGB OOS
Crypto (27 perps)	0.569	0.680	0.255	0.809
Equities (5 ETFs)	0.134	0.557	0.153	0.659
Forex (8 majors)	0.569	0.677	0.277	0.808

Median log-loss per market, NLL-tuned. Boosting's out-of-sample log-loss is past coin-flip (ln2 ≈ 0.693) in crypto and forex, i.e. worse than a coin flip. Bagging's stays at or below it everywhere.
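The gap arithmetic behind the table can be checked in a few lines. This sketch only re-derives gaps from the market medians printed in the table; the gap of medians need not equal the median of per-instrument gaps reported elsewhere on the page, so small differences from the headline figures are expected.

```python
from math import log

# Median log-loss per market, copied from the table (NLL-tuned).
medians = {
    "crypto":   {"rf_is": 0.569, "rf_oos": 0.680, "hgb_is": 0.255, "hgb_oos": 0.809},
    "equities": {"rf_is": 0.134, "rf_oos": 0.557, "hgb_is": 0.153, "hgb_oos": 0.659},
    "forex":    {"rf_is": 0.569, "rf_oos": 0.677, "hgb_is": 0.277, "hgb_oos": 0.808},
}
COIN_FLIP = log(2)  # log-loss of a constant 0.5 forecast

def summarize(row):
    """IS->OOS gap per model, their ratio, and the coin-flip check for HGB."""
    rf_gap = row["rf_oos"] - row["rf_is"]
    hgb_gap = row["hgb_oos"] - row["hgb_is"]
    return {
        "rf_gap": rf_gap,
        "hgb_gap": hgb_gap,
        "ratio": hgb_gap / rf_gap,
        "hgb_past_coin_flip": row["hgb_oos"] > COIN_FLIP,
    }

summary = {market: summarize(row) for market, row in medians.items()}
for market, s in summary.items():
    print(market, round(s["rf_gap"], 3), round(s["hgb_gap"], 3),
          round(s["ratio"], 1), s["hgb_past_coin_flip"])
```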

Explore the overfit

The same effect, one instrument or one market at a time. Every number is the real per-instrument panel result, NLL-tuned, purged-CV, costed.

Pick a market to see the median of its panel, or drill into a single instrument. The left chart contrasts in-sample and out-of-sample log-loss for the bagged forest and the boosted trees, against the coin-flip line; the right chart shows the substitution bias of the three importance measures. Boosting's in-sample bar sits far below coin-flip (it has memorized the labels), while its out-of-sample bar typically runs past it.

Demo: bagging vs boosting overfit explorer

Boosting drives in-sample log-loss far below coin-flip (it memorizes the train set) yet its out-of-sample log-loss runs past coin-flip. The bagged forest keeps in-sample and out-of-sample close. Right panel: only MDA's importance is free of substitution bias.

NLL-tuned · purged-CV · costed

metric	value
RF (bagging) IS→OOS gap	0.113
HGB (boosting) IS→OOS gap	0.550
boosting / bagging gap ratio	4.9×
HGB out-of-sample vs coin-flip	0.808 > ln2, worse than a coin flip

Crypto: median over 27 instruments. Boosting's in-sample log-loss is 0.255 (well below coin-flip ln2 ≈ 0.693; it has memorized the labels), but its out-of-sample log-loss is 0.808, past coin-flip. The bagged forest holds in-sample at 0.569 and out-of-sample at 0.680, a gap 4.9× smaller. Substitution bias ρ(importance, |corr|): MDI 0.491, MDA 0.115, clustered-MDA 0.552; only MDA is near zero.

The panel reproduces bit-identically across runs (random_state = 0 throughout, BLAS/OMP pinned to one thread per process). The non-sklearn hot loops (the sequential-bootstrap draw and the co-event count / average label uniqueness) are JIT-compiled and verified bit-identical against pure-Python references.

Fig. 2: MDI (in-sample, biased) versus MDA (out-of-sample permutation) per feature, one instrument per market. Where MDA goes negative (e.g. mom6/mom12/vol in equities and forex), permuting the feature improves out-of-sample log-loss: the model was over-relying on it. MDI ranks those same features highly.

Claim 2: MDI is biased, MDA is not

Mean-Decrease-Impurity is computed in-sample and is mechanically inflated for correlated features: its importance ends up rank-correlated with how correlated each feature is to the rest (ρ ≈ 0.49, highest in forex at 0.67). The momentum block and the vol block split the impurity credit and inflate each other. The out-of-sample permutation measure, MDA scored by log-loss, shows essentially none of this (ρ ≈ 0.07; lower than MDI on 88% of instruments, p = 1.2e-6). Strikingly, several features MDI ranks highly earn negative MDA: permuting them improves out-of-sample log-loss, so the model was over-relying on noise. MDI keeps them; MDA flags them as harmful.
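The mechanics of MDA scored by log-loss can be sketched without any ML library. Everything here is an assumption made for illustration: `predict` is a stand-in for an already-fitted classifier that uses column 0 and ignores column 1, the data are synthetic, and a single held-out set replaces the purged out-of-fold predictions the study uses.

```python
import random
from math import exp, log

def log_loss(y_true, p_pred, eps=1e-12):
    """Mean binary negative log-likelihood."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)

def predict(rows):
    """Hypothetical fitted model: logistic in column 0, ignores column 1."""
    return [1 / (1 + exp(-2.0 * r[0])) for r in rows]

def mda(rows, y, col, n_repeats=20, seed=0):
    """Mean increase in held-out log-loss when column `col` is shuffled."""
    rng = random.Random(seed)
    base = log_loss(y, predict(rows))
    deltas = []
    for _ in range(n_repeats):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)  # break the link between this column and y
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
        deltas.append(log_loss(y, predict(permuted)) - base)
    return sum(deltas) / n_repeats

# Synthetic held-out data: y depends on column 0 only.
rng = random.Random(0)
rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(400)]
y = [1 if r[0] + rng.gauss(0, 1) > 0 else 0 for r in rows]

mda_informative = mda(rows, y, col=0)
mda_ignored = mda(rows, y, col=1)
print("MDA column 0:", round(mda_informative, 3))
print("MDA column 1:", mda_ignored)
```

A positive value means shuffling the column hurts held-out log-loss. A negative value, as reported for several MDI-favored features, would mean shuffling helps, which cannot happen in this toy because the stand-in model ignores the noise column entirely.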

The honest twist is that clustered-MDA does not remove the bias on this feature set. With only 10 features collapsing into ~5 clusters, the correlated features land in the same cluster, so whole-cluster permutation still attributes large importance to that correlated cluster, and expanding it back to features re-introduces the |corr| association (ρ ≈ 0.52, statistically indistinguishable from MDI, p = 0.22). What clustering does buy is selection stability: collapsing substitutes raises the top-3 Jaccard across CPCV paths from MDA's 0.26 to 0.46 (p = 1.8e-12). MDA's own selection is genuinely above the random baseline (0.20) on 98% of instruments, but only barely stable; one should not over-interpret “the top features” on this task.
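The stability metric and its random baseline are easy to reproduce. The sketch below computes the mean pairwise Jaccard over a list of top-3 sets, plus the exact expected Jaccard of two independent random 3-of-10 selections; the three example paths are hypothetical, and the function names are illustrative rather than the study's own.

```python
from itertools import combinations
from math import comb

def jaccard(a, b):
    """Size of the intersection over size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def selection_stability(top_sets):
    """Mean pairwise Jaccard across the per-path top-k feature sets."""
    pairs = list(combinations(top_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def random_baseline(n_features=10, k=3):
    """Expected Jaccard of two independent, uniformly random k-subsets."""
    total = comb(n_features, k)
    expected = 0.0
    for overlap in range(k + 1):
        # Hypergeometric probability that the two subsets share `overlap` items.
        prob = comb(k, overlap) * comb(n_features - k, k - overlap) / total
        expected += prob * overlap / (2 * k - overlap)
    return expected

# Hypothetical top-3 selections on three paths.
paths = [
    {"vol", "mom6", "rsi"},
    {"vol", "mom6", "ofi"},
    {"vol", "ma_gap", "ofi"},
]
stability = selection_stability(paths)
baseline = random_baseline(10, 3)
print(round(stability, 3))
print(round(baseline, 3))  # 0.201
```

The exact baseline, 24.1/120 ≈ 0.2008, matches the 0.20 random baseline quoted in the text.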

Fig. 3: Left: substitution bias ρ(importance, |corr|) across instruments; MDI and clustered-MDA center near +0.5, and only MDA straddles zero. Right: top-3 selection stability (mean Jaccard across the 15 CPCV paths). MDI's 0.58 is inflated by the same bias that makes it untrustworthy; clustering raises MDA's 0.26 to 0.46; MDA itself is only just above the 0.20 random baseline.

method	bias ρ	stability	note
MDI	0.485	0.582	in-sample baseline, biased
MDA	0.073	0.263	out-of-sample permutation, unbiased
clustered-MDA	0.517	0.463	stability, not de-biasing
random baseline	n/a	0.201	3 of 10 features at random

Median over the 40-instrument panel. Bias ρ is the Spearman correlation between a method's importance and each feature's mean |correlation|; stability is the mean top-3 Jaccard across the 15 CPCV paths. The lesson: with a small, moderately-correlated feature set, MDA, not clustered-MDA, is the de-biasing tool; clustering matters more when there are many tightly-correlated features to absorb.
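As a sketch of the bias statistic defined in the caption, the code below computes each feature's mean absolute off-diagonal correlation and then the Spearman ρ between that and an importance vector. The 4-feature correlation matrix and both importance vectors are hypothetical, chosen only to show one importance profile that tracks |corr| and one that does not.

```python
def ranks(values):
    """1-based ranks, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rho = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def mean_abs_corr(corr):
    """Per-feature mean |correlation| with every other feature."""
    n = len(corr)
    return [sum(abs(corr[i][j]) for j in range(n) if j != i) / (n - 1)
            for i in range(n)]

# Hypothetical correlation matrix: features 0-2 form a correlated block.
corr = [
    [1.0, 0.8, 0.7, 0.1],
    [0.8, 1.0, 0.6, 0.0],
    [0.7, 0.6, 1.0, -0.2],
    [0.1, 0.0, -0.2, 1.0],
]
crowding = mean_abs_corr(corr)

mdi_like = [0.35, 0.25, 0.30, 0.10]  # hypothetical: tracks the correlated block
mda_like = [0.02, 0.05, 0.01, 0.04]  # hypothetical: unrelated to the block

bias_mdi = spearman(mdi_like, crowding)
bias_mda = spearman(mda_like, crowding)
print(round(bias_mdi, 2), round(bias_mda, 2))
```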

The fix, measured the right way

Score importance per cluster, not per feature

Fig. 5: Clustered feature importance on BTCUSDT. Left: Ward dendrogram on 1−|corr|; the side/mom6/mom12/rsi block and the vol/range_atr block merge tightly. Middle: clustered MDI (sum within cluster) and clustered MDA (permute the whole block); the dot marks the best single member's flat MDI. C1 jumps from 0.148 (best member) to 0.385 (cluster). Right: rank stability across purged folds; clustering lifts MDI from 0.89 to 0.97 and MDA from 0.12 to 0.30.

The previous panel measured clustered-MDA the way it usually gets misused: collapse correlated features into clusters, then expand the scores back to individual features. Done that way it still tracks |corr|. But that is not what López de Prado's clustered feature importance is for. The fix is to read the score at the cluster level: group correlated features, permute or sum the whole block, and never re-attribute to members. On the same BTCUSDT triple-barrier task (351 events, 10 features collapsing into 6 clusters by 1−|ρ| Ward linkage), that is exactly what cures the substitution effect.

The correlated momentum/side block (C1: side, mom6, mom12, rsi) looks only moderate per feature: its best single member earns flat MDI 0.148. Scored as one cluster it is the single dominant driver at 0.385, about 2.6× its best member: the impurity credit the substitution effect had been splitting four ways is reunited. De-dilution is concrete: the Gini of the importance vector rises from 0.205 (flat) to 0.337 (clustered). And cluster-level ranking is far steadier across purged folds: clustered-MDA fold-stability is ~0.30 versus ~0.12 flat, about 2.5×, so clustering works on the axis it is built for.

cluster	clustered MDI	best member MDI	clustered MDA
C1	0.385	0.148	0.074
C5	0.213	0.113	0.032
C2	0.153	0.153	0.024
C3	0.094	0.094	0.004
C4	0.086	0.086	0.009
C6	0.069	0.069	-0.000

Cluster-level importance, BTCUSDT (neg-log-loss-tuned RF, purged CV). Clustered MDI sums within-cluster impurity; clustered MDA permutes the whole block. For singleton clusters (C2–C4, C6) the cluster score equals its lone member. The substitution effect is visible only where a cluster holds several correlated features (C1, C5): there the cluster score exceeds its best member, reuniting the split credit. C1 is the single dominant driver at 0.385, about 2.6× its best individual feature.
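Two of the quoted figures can be re-derived directly from the clustered-MDI column of the table. The page does not say which Gini estimator was used; the standard sorted-rank form below is an assumption, and it happens to reproduce the quoted 0.337. The flat (per-feature) Gini of 0.205 cannot be checked the same way because the ten individual MDI values are not listed.

```python
def gini(values):
    """Gini coefficient of a non-negative vector (sorted-rank formula)."""
    xs = sorted(values)
    n = len(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * sum(xs)) - (n + 1) / n

# Clustered MDI per cluster, copied from the table.
clustered_mdi = {
    "C1": 0.385, "C5": 0.213, "C2": 0.153,
    "C3": 0.094, "C4": 0.086, "C6": 0.069,
}
best_member_c1 = 0.148  # flat MDI of C1's best single feature, from the table

clustered_gini = gini(clustered_mdi.values())
c1_uplift = clustered_mdi["C1"] / best_member_c1

print(round(clustered_gini, 3))  # 0.337
print(round(c1_uplift, 1))       # 2.6
```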

Fig. 4: Overfit gap when tuning by log-loss (y) versus accuracy (x), per instrument. Most points sit below the diagonal: log-loss tuning overfits less (AFML 9.4). The effect is strongest where the two objectives disagree on the chosen configuration.

Claim 3: tune by log-loss, not accuracy

The methodological control runs throughout: each model is tuned both by negative log-loss and by accuracy. Tuning by log-loss yields a significantly smaller overfit gap (pooled mean 0.331 vs 0.421, Wilcoxon p = 8.3e-6), and the effect holds per-model for both families (RF p = 4.4e-4, HGB p = 1.3e-4). The two objectives pick the same configuration only ~60% of the time for RF and ~52% for HGB, so on roughly 40–50% of instruments the choice of metric flips the selected model, and when it does, the log-loss choice generalizes better.
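A toy example shows how the two objectives can disagree on the same validation predictions. The two candidate configurations and their probabilities are hypothetical: "confident" is right more often but catastrophically sure when wrong, while "calibrated" is right less often but never extreme.

```python
from math import log

def log_loss(y_true, p_pred, eps=1e-12):
    """Mean binary negative log-likelihood."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)

def accuracy(y_true, p_pred):
    """Share of cases where thresholding at 0.5 matches the label."""
    hits = sum(int(p >= 0.5) == y for y, p in zip(y_true, p_pred))
    return hits / len(y_true)

y = [1, 1, 1, 1, 0, 0, 0, 0]  # hypothetical validation labels

candidates = {
    # 7 of 8 correct, but the single miss is made with near-certainty.
    "confident":  [0.999, 0.999, 0.999, 0.999, 0.001, 0.001, 0.001, 0.999],
    # 6 of 8 correct, always with moderate probabilities.
    "calibrated": [0.7, 0.7, 0.7, 0.3, 0.3, 0.3, 0.3, 0.7],
}

best_by_accuracy = max(candidates, key=lambda k: accuracy(y, candidates[k]))
best_by_log_loss = min(candidates, key=lambda k: log_loss(y, candidates[k]))

for name, p in candidates.items():
    print(name, accuracy(y, p), round(log_loss(y, p), 3))
print("accuracy picks:", best_by_accuracy)
print("log-loss picks:", best_by_log_loss)
```

Because the purged out-of-fold probability later sizes each bet, a configuration that is confidently wrong is costly in a way accuracy cannot see, which is the usual argument for selecting on log-loss.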

…and yet no edge (the honest part)

This is the nuance the prescriptions do not cover: better generalization is necessary, not sufficient, for edge. On this EMA-crossover meta-labeling task neither ensemble produces a Deflated Sharpe Ratio anywhere near the 0.95 bar: the median bet Sharpe is negative for both, and once the Sharpe is deflated against the pooled grid search the DSR collapses to ~0. Bagging wins the generalization contest decisively and the DSR contest not at all (paired p = 0.74). Across all 40 instruments, 0 clear DSR > 0.95; the single near-miss is USDJPY (HGB DSR 0.92), with EURUSD next (RF DSR 0.56), both in forex, where the costed task is least adversarial. A clean methods replication with a null edge result, and that separation is exactly the contribution.
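For readers unfamiliar with the gate, here is a minimal sketch of the Deflated Sharpe Ratio following the Bailey and López de Prado (2014) formula. It is not the study's harness: the function names are illustrative, the Sharpe is per-period (not annualized), and the numeric inputs in the usage lines are hypothetical apart from the 14-trial count mentioned in the limitations note.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_max_sharpe(sr_std, n_trials):
    """Expected maximum Sharpe across n_trials (> 1) skill-less trials."""
    nd = NormalDist()
    return sr_std * ((1 - EULER_GAMMA) * nd.inv_cdf(1 - 1 / n_trials)
                     + EULER_GAMMA * nd.inv_cdf(1 - 1 / (n_trials * e)))

def deflated_sharpe(sr, sr_std, n_trials, n_obs, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds the selection-bias benchmark.

    sr      per-period Sharpe of the selected strategy
    sr_std  dispersion of Sharpe estimates across the trials
    n_obs   number of return observations
    skew, kurt  sample skewness and (non-excess) kurtosis of returns
    """
    sr0 = expected_max_sharpe(sr_std, n_trials)
    denom = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return NormalDist().cdf((sr - sr0) * sqrt(n_obs - 1) / denom)

# Hypothetical inputs: 14 trials, 1000 observations, trial dispersion 0.05.
dsr_flat = deflated_sharpe(sr=0.00, sr_std=0.05, n_trials=14, n_obs=1000)
dsr_strong = deflated_sharpe(sr=0.30, sr_std=0.05, n_trials=14, n_obs=1000)
print(round(dsr_flat, 4), round(dsr_strong, 4))
```

A zero or negative Sharpe sits below the benchmark set by the best of 14 trials, so its DSR falls far short of 0.95 regardless of sample length.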

Method

One fixed structural task: a primary EMA(20/60) crossover sets the side; a triple-barrier (profit-take 1.5σ, stop 1.0σ, max-hold 50 bars, σ = causal EWMA vol) scanned on full intrabar OHLC gives the outcome; the meta-label is y = 1[net P&L > 0] after realistic round-trip cost. Ten causal features (side, vol, ma_gap, mom3/6/12, rsi, vol_ratio, ofi, range_atr). This is a model/importance study, not a label sweep.
Two ensembles, built faithfully: bagging is a RandomForest the AFML way, with low max_features to decorrelate trees, regularized leaves under weighting, label-uniqueness sample-weights, and max_samples set to average label uniqueness (the sequential-bootstrap analogue). Boosting is HistGradientBoosting over a small grid of iterations, learning rate, leaf nodes and L2.
All cross-validation is purged k-fold with embargo, with each label's span mapped from bar space to event space so train rows overlapping a test fold are dropped. Importance stability is measured across CPCV (C(6,2) = 15 paths).
Each model is hyper-tuned twice, by negative log-loss (headline, AFML 9.4) and by accuracy (control), to compare the resulting overfit gaps. The purged out-of-fold P(profit) sizes each bet; costed P&L feeds a per-bar return series whose Sharpe is deflated against the dispersion of the pooled RF+HGB grid.
Importance three ways: MDI from a single full-sample fit (the biased in-sample baseline), MDA as purged out-of-fold permutation scored by log-loss, and clustered-MDA permuting whole correlation clusters. Substitution bias is the Spearman ρ between a method's importance and each feature's mean |correlation|; selection stability is the mean pairwise top-3 Jaccard across CPCV paths against a random-selection baseline.
Information-driven bars throughout (dollar bars for crypto and equities, count bars for forex), with non-zero costs on every bet (crypto flat 7 bp/side; equities and forex a causal time-of-day half-spread plus commission/swap, ≈1–2 bp/side). Two ETFs (XLF, XLV) were dropped by a data-driven minimum-class floor, not by hand.
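The purged, embargoed split described in the method can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: `purged_kfold` is a hypothetical name, events are assumed sorted by start bar, each event carries an inclusive (start_bar, end_bar) label span, test folds are contiguous blocks of events, and the embargo is counted in bars after the test block ends.

```python
def purged_kfold(spans, n_splits=5, embargo=0):
    """Yield (train_idx, test_idx) with overlapping train events removed.

    spans    list of (start_bar, end_bar) per event, sorted by start_bar
    embargo  bars after the test block in which train events may not start
    """
    n = len(spans)
    sizes = [n // n_splits + (1 if i < n % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        test_lo = min(spans[i][0] for i in test_idx)
        test_hi = max(spans[i][1] for i in test_idx)
        train_idx = []
        for i in range(n):
            if start <= i < start + size:
                continue  # event belongs to the test fold
            s, e = spans[i]
            overlaps = s <= test_hi and e >= test_lo        # purge
            embargoed = test_hi < s <= test_hi + embargo    # embargo
            if not overlaps and not embargoed:
                train_idx.append(i)
        yield train_idx, test_idx
        start += size

# Hypothetical events: one every 3 bars, each label spanning 6 bars.
spans = [(t, t + 5) for t in range(0, 60, 3)]
folds = list(purged_kfold(spans, n_splits=4, embargo=3))

train_idx, test_idx = folds[1]
print("test :", test_idx)
print("train:", train_idx)
```

In the second fold, events 4 and 10 are purged because their label spans reach into the test block, and event 11 is embargoed because it starts within 3 bars of the block's end.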

Notes & limitations

The study deliberately separates “generalizes better / less biased” (which all replicate) from “edge” (which does not appear), and the honest verdict (0/40 above the 0.95 bar) is the integrity of the work. Equities are the weakest leg: their EMA-crossover bets clear the cost-and-barrier threshold only ~17–23% of the time, so their low in-sample log-loss is partly a class-imbalance artefact, and two ETFs failed a minimum-class floor outright. The RandomForest's max_samples is set to average label uniqueness as a first-order analogue of the true sequential bootstrap (faithful to the intent, not the exact procedure). The Deflated-Sharpe trial count is a modest 14 configurations, so the deflation is, if anything, generous, and the DSR is still ~0.
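Average label uniqueness, the quantity used for max_samples, can be computed from the label spans alone. The sketch below is a plain-Python illustration with hypothetical spans, not the study's JIT-compiled loop: it counts how many labels are live at each bar, then averages the reciprocal of that count over each label's lifetime.

```python
def average_uniqueness(spans):
    """Per-event average uniqueness from inclusive (start_bar, end_bar) spans."""
    n_bars = max(end for _, end in spans) + 1
    concurrency = [0] * n_bars  # co-event count: labels live at each bar
    for start, end in spans:
        for t in range(start, end + 1):
            concurrency[t] += 1
    per_event = []
    for start, end in spans:
        life = range(start, end + 1)
        per_event.append(sum(1 / concurrency[t] for t in life) / len(life))
    return per_event

# Hypothetical overlapping labels, 4 bars each.
spans = [(0, 3), (2, 5), (4, 7)]
uniqueness = average_uniqueness(spans)
mean_uniqueness = sum(uniqueness) / len(uniqueness)

print(uniqueness)
print(round(mean_uniqueness, 3))
```

The middle label overlaps a neighbour at every bar, so it is only half unique; the panel mean of about 0.667 is the kind of fraction that would be passed as max_samples so that each tree draws fewer, less redundant rows.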

Reproducibility

Every feature and label uses only data up to the current bar; every bet is net of realistic per-side cost; every fold is purged and embargoed. The harness, its self-tests and the full multi-market panel are collected in project 09 of lopez-de-prado-work-review. The explorer on this page is self-contained: the per-instrument panel numbers are encoded in the component, so the mechanic it illustrates reproduces exactly on every load.

Cite

Cite as

Gatto, D. V. (2026). Ensembles and Feature Importance: A Multi-Market Replication of Bagging-versus-Boosting and MDI/MDA Prescriptions. Working paper.

@techreport{gatto2026ensembles,
  author      = {Gatto, Daniel V.},
  title       = {Ensembles and Feature Importance: A Multi-Market
                 Replication of Bagging-versus-Boosting and MDI/MDA
                 Prescriptions},
  year        = {2026},
  type        = {Working paper},
  note        = {Review of Lopez de Prado's AFML Ch.6/8/9 and ML4AM Ch.6}
}

References

The primary sources for the prescriptions reviewed here:

López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. (Ch.6 ensembles, Ch.8 feature importance, Ch.9 hyper-parameter tuning.)
López de Prado, M. (2020). Machine Learning for Asset Managers. Cambridge University Press. (Ch.6 clustered feature importance.)
Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. Journal of Portfolio Management, 40(5).