Research review · sample weighting · study 10
Sample Uniqueness & the Sequential Bootstrap
Overlapping triple-barrier labels share future returns, so a row is not an independent observation. We measure how badly that overstates the sample, confirm that the sequential bootstrap raises uniqueness, and ask whether fixing it actually helps the model.
When you label a bar with a triple barrier, the label of bar t is decided by the price path over the whole interval until the first barrier is touched. Two labels whose intervals overlap are driven by some of the same future returns, so they are not independent, and the one-row-one-observation assumption behind bootstrapping, bagging and sample-size counting is false. López de Prado measures this with concurrency, average uniqueness and effective sample size, and proposes the sequential bootstrap to draw less-redundant samples. We reproduce all of it on real bars across three markets, then ask the question that actually matters: does the correction improve a model's generalisation, or just its bookkeeping? The playground below lets you watch the effective sample size collapse as labels overlap.
Advances in Financial ML, Ch. 4
López de Prado (2018) · sample weights & uniqueness
Code & data
lopez-de-prado-work-review / projects / 15
the claim
What López de Prado says
- A triple-barrier label resolves over an interval, so labels with overlapping intervals share future returns and are not IID. Treating each row as one independent observation overstates the information you hold and undermines bootstrapping, bagging and sample-size counting, all of which assume independent rows.
- Concurrency (how many labels are live at a bar), average uniqueness (the mean of 1/concurrency over a label's life) and effective sample size (the sum of uniqueness, far below the row count) quantify the problem. The sequential bootstrap draws rows favouring low overlap with rows already drawn, so a bagged model trains on less-redundant, more-unique data.
our result
What we found
- Overlap is severe in every market: mean average uniqueness is 0.44 (crypto), 0.41 (equities) and 0.41 (forex). A median of ~15,000 labels per instrument is worth only ~6,000–6,600 truly independent observations: effective N is ≈41–44% of the raw count.
- The sequential bootstrap raises drawn-sample uniqueness in all 42 of 42 instruments (median lift ≈+0.016 to +0.018). The mechanical claim reproduces universally; the absolute lift is modest at this overlap level.
- Honest verdict: the corrections shrink the in-sample-to-out-of-sample overfit gap (40/42 instruments) and stabilise feature importances (34/42) exactly as theory predicts, but they do NOT improve out-of-sample discrimination (naive bagging wins log-loss in all 42, AUC ≈0.62 naive vs ≈0.56 corrected). This is a robustness treatment, not a performance edge; a methodological result, not a money result.
The result in three lines
Overlap · 42 instruments
Effective N is ~41–44% of the row count
Across 27 crypto perps, 7 US-equity ETFs and 8 forex majors (~20,000 bars each), mean average uniqueness sits at 0.44 / 0.41 / 0.41 by market. A median of ~15,000 overlapping labels carries only ~6,000–6,600 truly independent observations. A workflow that counted rows as IID would assume more than twice the information actually present.
Sequential bootstrap · 42/42
The bootstrap claim reproduces everywhere
A sequential-bootstrap sample has higher average uniqueness than a standard IID bootstrap in every one of the 42 instruments, with a median lift of ≈+0.016 to +0.018. The mechanic is exactly as advertised; the lift is modest because at a mean concurrency near 2.3 there is only so much redundancy to remove. Every kernel (concurrency, uniqueness, sequential draw, CUSUM, weights) is verified bit-identical against an independent reference.
Overfit ↓, AUC ↓
Robustness, not performance
The corrections do what the theory says at the level of the sample: the overfit gap shrinks in 40/42 instruments and feature-importance stability improves in 34/42. Raw out-of-sample discrimination, however, is slightly worse: naive bagging has lower OOS log-loss in all 42 and higher AUC (≈0.62 vs ≈0.56). On this single-series labelling problem the correction trades a little signal for honesty. Methodological result, no tradable sleeve.
Watch the effective sample size collapse
Drag the average label span. As labels stay live longer they overlap more, concurrency climbs, average uniqueness falls, and the effective sample size drops far below the raw row count. The second tab contrasts the sequential bootstrap against the standard IID bootstrap. The curves are illustrative closed forms pinned to the real multi-market run.
The point of the playground is that non-independence is not a corner case: it is the generic state of any interval-resolved label, and it scales smoothly with how long the label takes to resolve. At the realistic operating point (an average span of ~3 bars on a CUSUM event clock) mean concurrency is ≈2.3 and average uniqueness ≈0.44, so the effective sample size is under half the row count. Widen the barriers and lengthen the holds and uniqueness falls monotonically toward zero, exactly as the dose-response figure shows. It is a mechanical illustration of an information-accounting effect, not a tradable signal.
Demo: overlap, uniqueness & the sequential bootstrap
Drag the label horizon to stretch every triple-barrier span across the timeline. Overlapping spans share future returns, so each label's uniqueness weight (the mean of 1/concurrency over its life) drops. Then draw a bootstrap sample one label at a time: the sequential draw steers away from labels overlapping what it already holds, so its average uniqueness climbs above the naive IID bootstrap. A mechanical illustration of an information-accounting effect, not a tradable signal.
Label horizon: how many bars a triple-barrier label stays live before its first touch. Wider barriers and longer holds push it up.
Average uniqueness is the mean of 1/concurrency over a label's life: a label that is mostly alone scores near 1, a label buried in overlap scores near 0. Summing uniqueness across all labels gives the effective sample size, the number of truly independent observations, which sits far below the raw row count. Counting rows as if they were independent overstates how much information you actually hold.
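The three quantities defined above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the project's Numba kernels: the function names, the convention that a label is live on bars `t0[i]` through `t1[i]` inclusive, and the toy intervals are all assumptions made for the example.

```python
import numpy as np


def concurrency(t0, t1, n_bars):
    """Number of labels live at each bar (label i is live on t0[i]..t1[i] inclusive)."""
    delta = np.zeros(n_bars + 1, dtype=np.int64)
    np.add.at(delta, t0, 1)        # a label switches on at its start bar
    np.add.at(delta, t1 + 1, -1)   # and off one bar after its first touch
    return np.cumsum(delta)[:n_bars]


def average_uniqueness(t0, t1, n_bars):
    """Mean of 1/concurrency over each label's life."""
    c = concurrency(t0, t1, n_bars)
    inv = np.zeros(n_bars, dtype=float)
    live = c > 0
    inv[live] = 1.0 / c[live]
    # Prefix sums turn each label's interval sum into two lookups.
    prefix = np.concatenate(([0.0], np.cumsum(inv)))
    return (prefix[t1 + 1] - prefix[t0]) / (t1 - t0 + 1)


def effective_sample_size(t0, t1, n_bars):
    """Sum of average uniqueness over all labels."""
    return float(average_uniqueness(t0, t1, n_bars).sum())


# Toy example (hypothetical intervals): two overlapping labels and one isolated label.
t0 = np.array([0, 1, 5])
t1 = np.array([2, 3, 5])
u = average_uniqueness(t0, t1, n_bars=6)      # 2/3, 2/3 and 1
eff_n = effective_sample_size(t0, t1, n_bars=6)  # 7/3 from 3 rows
```

The two overlapping labels each score 2/3 and the isolated one scores 1, so three rows are worth 7/3 effective observations. The same functions reproduce the two closed-form limits: disjoint unit intervals give uniqueness 1, and N identical intervals give 1/N.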
Overlap is severe and effective sample size collapses
Across every instrument in all three markets, mean average uniqueness sits between 0.40 and 0.49. The information content of the labels is roughly 41 to 44 percent of what the raw row count implies. A naive workflow that treated 15,000 overlapping labels as 15,000 independent observations would be assuming more than twice the information actually present.
Effective sample size is the sum of average uniqueness across labels: a median of ≈15,000 labels per instrument is worth only ≈6,000–6,600 truly independent observations. The effect is slightly worse in equities and forex than in crypto, tracking their marginally higher mean concurrency.
Half your sample is double-counted
Plotting effective sample size against the naive row count makes the cost concrete: every market sits at roughly 41 to 44 percent of the diagonal. A median ≈15,000-label instrument carries only ≈6,000–6,600 independent observations. Equities and forex sit marginally lower than crypto, tracking their slightly higher mean concurrency. Any sample-size, confidence-interval or bootstrap calculation that counts rows as IID is off by more than a factor of two.
Wider barriers make it worse
The overlap is driven by interval length. Widening the barriers lengthens the holding intervals, raises concurrency and pushes average uniqueness monotonically down in every market: doubling the barrier width roughly halves uniqueness (≈0.44 → ≈0.20), and doubling it again cuts it to ≈0.07. Whatever the label design, the lesson is the same: the longer a label takes to resolve, the less independent it is.
The sequential bootstrap raises uniqueness, as advertised
In all 42 of 42 instruments without exception, a sequential-bootstrap sample has higher average uniqueness than a standard IID bootstrap sample. The median lift is about +0.016 to +0.018 in uniqueness terms. The mechanical claim reproduces cleanly and universally. The absolute size of the lift is modest here because at a mean concurrency near 2.3 there is only so much redundancy to remove, but the direction never fails.
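A minimal sketch of the textbook (quadratic) sequential draw follows, assuming the same inclusive `[t0, t1]` interval convention. The function names are hypothetical and this is not the incremental formulation the study uses. At each step, every candidate label is scored by the average uniqueness it would have if added to the labels already drawn, and the next draw is made with probability proportional to that score.

```python
import numpy as np


def draw_probabilities(drawn_conc, t0, t1):
    """Draw probability of each label given the concurrency of labels already drawn."""
    u = np.empty(len(t0))
    for i in range(len(t0)):
        # Uniqueness of candidate i if it joined the sample: it adds 1 to
        # the concurrency on every bar of its own interval.
        u[i] = np.mean(1.0 / (drawn_conc[t0[i]:t1[i] + 1] + 1.0))
    return u / u.sum()


def sequential_bootstrap(t0, t1, n_bars, n_draws, rng):
    """Draw n_draws label indices with replacement, favouring low overlap."""
    drawn_conc = np.zeros(n_bars)  # concurrency contributed by drawn labels
    sample = []
    for _ in range(n_draws):
        p = draw_probabilities(drawn_conc, t0, t1)
        i = int(rng.choice(len(t0), p=p))
        sample.append(i)
        drawn_conc[t0[i]:t1[i] + 1] += 1.0
    return np.array(sample)


# Toy example (hypothetical intervals): labels on bars 0-2, 2-3 and 4-5.
t0 = np.array([0, 2, 4])
t1 = np.array([2, 3, 5])
conc_after_first = np.zeros(6)
conc_after_first[2:4] += 1.0  # suppose the middle label was drawn first
p_next = draw_probabilities(conc_after_first, t0, t1)  # 5/14, 3/14, 6/14

sample = sequential_bootstrap(t0, t1, n_bars=6, n_draws=3,
                              rng=np.random.default_rng(0))
```

Once the middle label is in the sample, its own redraw probability falls to 3/14, the partially overlapping first label sits at 5/14, and the disjoint third label rises to 6/14. That tilt away from overlap is what lifts drawn-sample uniqueness above a uniform IID draw.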
Does it help the model? Robustness, not discrimination
This is where the result is more interesting than the textbook. The corrected ensemble (sequential-bootstrap bags scaled to mean uniqueness, trees weighted by return attribution) does exactly what the theory says at the level of the sample. The in-sample-to-out-of-sample log-loss gap shrinks in 40 of 42 instruments, and cross-fold feature-importance stability improves in 34 of 42. The model generalises closer to how it trains and explains itself more reproducibly.
But raw out-of-sample discrimination is slightly worse: naive bagging had lower out-of-sample log-loss in all 42 instruments, and higher AUC on average (about 0.62 naive versus 0.56 corrected). Smaller, uniqueness-scaled, reweighted bags see less effective data, so the corrected ensemble captures a bit less signal. On this single-series, weak-signal labelling problem the correction is a robustness and honesty treatment, not a free performance boost.
And the model's self-explanation stabilises
The second half of the robustness story is reproducibility. The cross-fold rank correlation of feature importances is higher with the corrections in 34 of 42 instruments, most strongly in crypto and forex, so the model's account of why it predicts what it predicts is more stable across folds. Removing redundant, over-counted observations stops a handful of repeated rows from dominating an importance ranking. It is the same lesson as the overfit gap: the corrections buy honesty and stability, not extra discrimination.
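One way to score cross-fold stability is the mean pairwise rank correlation of the per-fold importance vectors. The sketch below assumes a Spearman-style correlation (Pearson correlation of ranks) and no tied importances. The source does not state which rank statistic the study uses, so treat the function name and the choice of statistic as assumptions.

```python
from itertools import combinations

import numpy as np


def _ranks(x):
    """Ranks 0..n-1 of a vector; assumes no ties among the importances."""
    return np.argsort(np.argsort(x)).astype(float)


def importance_stability(importances):
    """Mean pairwise rank correlation across folds.

    importances: array-like of shape (n_folds, n_features), n_folds >= 2.
    """
    ranks = np.array([_ranks(row) for row in np.asarray(importances, dtype=float)])
    corrs = [np.corrcoef(ranks[a], ranks[b])[0, 1]
             for a, b in combinations(range(len(ranks)), 2)]
    return float(np.mean(corrs))


# Hypothetical importances for 4 features over 3 folds.
same_order = [[0.4, 0.3, 0.2, 0.1],
              [0.5, 0.25, 0.15, 0.1],
              [0.35, 0.3, 0.25, 0.1]]
stable = importance_stability(same_order)  # every fold ranks features identically

reversed_pair = [[0.4, 0.3, 0.2, 0.1],
                 [0.1, 0.2, 0.3, 0.4]]
unstable = importance_stability(reversed_pair)  # the two folds rank in opposite order
```

A score near 1 means the folds agree on which features matter; a score near 0 or below means the ranking is largely an artefact of the particular fold.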
From uniqueness to deflation: the honest false-discovery penalty
The downstream model benefit was nuanced, but uniqueness has one payoff that is unambiguous and quantitative: it fixes the count you feed into a significance test. A Sharpe ratio's confidence depends on the number of independent observations behind it. Feed in the raw row count and you are claiming more than twice the evidence you hold, so the test statistic, its standard error and any multiple-testing deflation all come out flattering.
Pinning per-observation Sharpe at 0.025 and holding the selection trials fixed, the only thing that moves between the nominal and effective columns below is the sample size. Treating ~15,000 overlapping labels as independent overstates the Sharpe significance statistic by about 1.50× to 1.57× and understates the Sharpe standard error by the same factor. On the honest count the Deflated Sharpe Ratio drops by about 0.07, enough to push a borderline strategy from acceptable to marginal. The effective sample size (the sum of label uniqueness, about 41 to 44 percent of N) is simply the honest count.
This is the bridge from the labelling chapter to the deflation chapter: uniqueness matters for trial-counting and false-discovery control, not for raw out-of-sample accuracy. The minimum track-record length to clear 95 percent probabilistic Sharpe at this level is ≈4,331 independent observations. That bar is fixed, so a track record that looks long enough on a row count can still fall short on the honest one.
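The track-record figure can be checked with the standard minimum track-record length formula of Bailey and López de Prado. The sketch below assumes Gaussian returns (skewness 0, kurtosis 3) and a zero benchmark Sharpe. The zero benchmark is an assumption of this example, chosen because it lands on the figure quoted above; it is not something the source states.

```python
from statistics import NormalDist


def min_track_record_length(sr, sr_benchmark=0.0, skew=0.0, kurt=3.0,
                            confidence=0.95):
    """Observations needed for the probabilistic Sharpe ratio to reach `confidence`.

    sr and sr_benchmark are per-observation (non-annualised) Sharpe ratios.
    """
    z = NormalDist().inv_cdf(confidence)
    variance_term = 1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2
    return 1.0 + variance_term * (z / (sr - sr_benchmark)) ** 2


# Per-observation Sharpe of 0.025, as in the text.
min_trl = min_track_record_length(0.025)  # about 4,331 observations
```

Because the requirement is stated in independent observations, it should be compared against the effective N of each market, not the nominal row count.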
| market | nominal N | effective N | SE inflation | DSR nominal | DSR effective | DSR drop |
|---|---|---|---|---|---|---|
| crypto | 14,986 | 6,639 | 1.50× | 0.720 | 0.651 | −0.069 |
| equities | 14,742 | 6,111 | 1.55× | 0.718 | 0.645 | −0.073 |
| forex | 14,892 | 6,070 | 1.57× | 0.719 | 0.645 | −0.075 |
Per-observation Sharpe fixed at 0.025; selection trials and trial-Sharpe variance held fixed; Gaussian returns. Only the sample size moves between the nominal and effective columns. SE inflation is the ratio of the Sharpe standard error on the effective count to that on the nominal count (identical to the significance-statistic overstatement); the DSR drop is the fall in Deflated Sharpe Ratio from using the honest effective count.
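The SE-inflation column follows directly from the two sample sizes. As a hedged check, assuming the usual Gaussian standard error of a Sharpe estimate, sqrt((1 + SR²/2) / (N − 1)), the ratio between the effective and nominal counts reduces to roughly sqrt(N / N_eff). The helper names below are illustrative.

```python
import math


def sharpe_standard_error(sr, n):
    """Standard error of a per-observation Sharpe estimate under Gaussian returns."""
    return math.sqrt((1.0 + 0.5 * sr ** 2) / (n - 1))


def se_inflation(sr, n_nominal, n_effective):
    """How much wider the honest standard error is than the nominal one."""
    return sharpe_standard_error(sr, n_effective) / sharpe_standard_error(sr, n_nominal)


# Nominal and effective N taken from the table above.
markets = {
    "crypto": (14_986, 6_639),
    "equities": (14_742, 6_111),
    "forex": (14_892, 6_070),
}
inflation = {name: round(se_inflation(0.025, n, n_eff), 2)
             for name, (n, n_eff) in markets.items()}
# crypto 1.5, equities 1.55, forex 1.57
```

The significance statistic SR / SE is overstated by the same factor, because the Sharpe estimate itself is held fixed.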
Per-market and per-instrument detail
The numbers, by market and by instrument
| market | instruments | median N | uniqueness | effective N | effN / N | seq lift | gap naive | gap corr. |
|---|---|---|---|---|---|---|---|---|
| crypto | 27 | 14,986 | 0.443 | 6,645 | 44.3% | +0.016 | 0.0111 | 0.0071 |
| equities | 7 | 14,742 | 0.415 | 6,114 | 41.5% | +0.018 | 0.0163 | 0.0107 |
| forex | 8 | 14,892 | 0.408 | 6,048 | 40.8% | +0.017 | 0.0116 | 0.0080 |
Per-market medians. Average uniqueness equals the effective-N ratio by construction (effN = Σ uniqueness). The sequential-bootstrap lift is the median of (sequential − standard) drawn-sample uniqueness; the overfit gap is the median in-sample-to-out-of-sample log-loss gap, naive vs corrected.
| market | instrument | uniqueness | effective N | concurrency | seq lift | AUC naive | AUC corr. |
|---|---|---|---|---|---|---|---|
| crypto | BTCUSDT | 0.450 | 6,795 | 2.27 | +0.014 | 0.607 | 0.557 |
| crypto | ETHUSDT | 0.452 | 6,770 | 2.25 | +0.015 | 0.613 | 0.568 |
| crypto | SOLUSDT | 0.447 | 6,766 | 2.29 | +0.016 | 0.615 | 0.565 |
| equities | SPY | 0.485 | 7,198 | 2.17 | +0.023 | 0.691 | 0.652 |
| equities | QQQ | 0.490 | 7,220 | 2.14 | +0.022 | 0.686 | 0.651 |
| equities | XLK | 0.399 | 5,851 | 2.74 | +0.018 | 0.590 | 0.517 |
| forex | EURUSD | 0.406 | 6,030 | 2.57 | +0.015 | 0.592 | 0.532 |
| forex | GBPUSD | 0.405 | 6,029 | 2.60 | +0.019 | 0.592 | 0.535 |
| forex | USDJPY | 0.399 | 5,922 | 2.64 | +0.018 | 0.595 | 0.531 |
Representative instruments across the three markets (full per-instrument detail in the repository). Concurrency is the mean number of labels live at a bar; AUC is out-of-sample, naive vs corrected.
Method
- Data: real only, full available history per instrument. 27 Binance USD-margined perpetual futures (1-minute base bars aggregated to dollar bars), 7 liquid US-equity ETFs (SPY, QQQ, IWM, XLK, XLF, XLE, XLV; regular-trading-hours dollar bars), and 8 forex majors (tick bars, since spot FX has no traded volume). Each instrument is reduced to ~20,000 bars so the markets are compared on a common footing.
- Events are sampled with a symmetric CUSUM filter (an event fires whenever the cumulative log-return since the last event, in either direction, crosses a volatility-scaled threshold). Each event is labelled with a triple barrier using full intrabar high/low, never close-only, so the first-touch interval [t0, t1] is the real holding span. The binary label is whether the side-signed gross return at first touch was positive.
- Everything is causal: the label of bar t0 resolves forward in time but is attributed to t0 with its true interval, and every feature at t0 uses only information available at or before t0.
- Information accounting: concurrency c_t is the number of labels live at bar t; average uniqueness of a label is the mean of 1/c_t over its interval; effective sample size is the sum of average uniqueness over all labels. The barrier-width sweep widens the barriers and re-measures uniqueness and concurrency.
- Model question: bagged decision trees under purged k-fold cross-validation with an embargo, so no training label leaks into a test fold. Naive = standard IID bagging, full-size bags, unweighted trees. Corrected = each bag drawn by the sequential bootstrap, bag size scaled to mean average uniqueness, and trees weighted by return-attribution sample weights. We score out-of-sample log-loss and AUC, the in-sample-to-out-of-sample log-loss gap, and cross-fold feature-importance stability.
- Verification: the concurrency scan, average-uniqueness scan, CUSUM filter and sequential-bootstrap draw are Numba-compiled and each is bit-identical against an independent pure-Python reference (max |Δ| = 0). The sequential bootstrap uses an incremental formulation that reproduces the textbook quadratic version's draw sequence exactly while keeping peak memory under 3 GB; two closed-form synthetic checks (disjoint unit intervals → uniqueness 1; N identical intervals → uniqueness 1/N) validate the estimator and are never reported as results.
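The event clock described in the list above can be sketched in the style of the symmetric CUSUM filter from Advances in Financial Machine Learning. This is a plain-Python illustration, not the Numba kernel. The function name is hypothetical, and the caller is assumed to supply the volatility-scaled threshold as either a scalar or a per-bar array.

```python
import numpy as np


def symmetric_cusum_events(log_returns, threshold):
    """Bar indices at which a symmetric CUSUM filter fires.

    threshold may be a scalar or a per-bar array (for example a multiple
    of a rolling volatility estimate computed by the caller).
    """
    r = np.asarray(log_returns, dtype=float)
    h = np.broadcast_to(np.asarray(threshold, dtype=float), r.shape)
    events = []
    s_pos = 0.0  # cumulative upward drift, floored at zero
    s_neg = 0.0  # cumulative downward drift, capped at zero
    for t in range(len(r)):
        s_pos = max(0.0, s_pos + r[t])
        s_neg = min(0.0, s_neg + r[t])
        if s_neg < -h[t]:
            s_neg = 0.0        # reset only the side that fired
            events.append(t)
        elif s_pos > h[t]:
            s_pos = 0.0
            events.append(t)
    return np.array(events, dtype=np.int64)


# Toy series (hypothetical): three small up-moves, then one sharp drop.
returns = [0.01, 0.01, 0.01, -0.05, 0.0]
events = symmetric_cusum_events(returns, threshold=0.025)  # fires at bars 2 and 3
```

The filter fires on bar 2, when the accumulated up-move first exceeds the threshold, and again on bar 3, when the drop crosses it on the downside. Quiet bars generate no events, which is why the resulting labels cluster in volatile stretches and overlap there.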
Notes & honest assessment
This is a methodological contribution, not a “label weighting makes money” result, and it is framed that way throughout. The information-accounting effect is real and reproduces everywhere: overlapping labels carry under half the independent information their count suggests, and the sequential bootstrap reliably raises sample uniqueness. The downstream model benefit is more nuanced and worth stating plainly. The non-IID corrections reduce overfitting and stabilise feature importances, but they do not improve out-of-sample discrimination on this single-series direction-labelling problem; they slightly reduce it. The corrections are best understood as a robustness and honesty treatment, not a performance booster. The study quantifies an information-accounting effect and its consequences, and it does not claim a tradable edge.
Reproducibility
The concurrency and uniqueness scans, the CUSUM filter, the sequential bootstrap, the bagged-ensemble comparison, the figures and the verification harness are collected in project 15 of lopez-de-prado-work-review. The playground on this page is self-contained: the closed forms of the uniqueness, effective-N and sequential-bootstrap relations are encoded in the component and pinned to the run's anchors, so the mechanic it illustrates reproduces exactly on every load.
References
The primary sources for the method reviewed here:
- López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. (Chapter 4, “Sample Weights”, and Chapter 7, “Cross-Validation in Finance”.)
- López de Prado, M. (2020). Machine Learning for Asset Managers. Cambridge Elements in Quantitative Finance.
- Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2), 123–140.
- Politis, D. N., & Romano, J. P. (1994). The Stationary Bootstrap. Journal of the American Statistical Association, 89(428), 1303–1313.
See also
The companion study on multiple-testing deflation is Backtest Overfitting & the Deflated Sharpe Ratio; the causal-graph critique is Causal Factor Investing; the selection-discipline theme runs through The edge is in the process; and the broader body of work is at Research.

