M/09 — Pre-registered surface-stability test
Does in-sample smoothness predict out-of-sample skill?
A pre-registered empirical test of the verbal corpus claim that smooth in-sample Sharpe ridges generalise out-of-sample and sharp spikes don't. SOL pilot: H₀ retained at the aggregate level, with strong family-level heterogeneity. DOGE + BTC replication: H₀ retained again, with the per-asset signs disagreeing.
The hypothesis
The Daru Finance corpus page contains the following verbal claim: “smooth, broad ridges mean the family has a stable performance basin under perturbation of its parameters; sharp spikes flanked by collapse mean the family is brittle — its in-sample peaks are likely overfit rather than real.” That sentence has been printed on the corpus page for some time without an empirical test attached. This project is the test. It is pre-registered, the hyperparameters were locked before the replication data was loaded, and the writeup is committed to publishing whichever way the result lands.
Operationally: pick a strategy family F, an asset a, and a walk-forward window w. For every strategy θ in the IS top-K of (F, a, w), measure how much the in-sample Sharpe wiggles when the backtest is re-run under the fixed five-perturbation suite (base, entry-confirmation, fee, slippage, entry+indicator). The std of those five Sharpes is σmicro(θ). Aggregate to the cell level by averaging σmicro across the IS top-K. The OOS robustness metric RK is the mean base-perturbation OOS Sharpe of the same K = 5 strategies in window w+1. The two formal definitions are:

σmicro(F, a, w) = (1/K) · Σθ∈top-K std{ SharpeIS(θ, p, w) : p in the perturbation suite }

RK(F, a, w) = (1/K) · Σθ∈top-K SharpeOOS(θ, base, w+1)
H0: after controlling for family / asset / window fixed effects, βsmooth in RK ~ σmicro is zero or positive. H1 (one-sided): βsmooth < 0 — IS-stable strategies generalise better OOS, exactly as the verbal claim predicts. K = 5 matches the firm’s existing portfolio convention (CheckerWFO).
Pre-registration
Two commits matter. The first, 8e62171552f284b6dbbe6a5d450d41618935fd76, is the initial commit of analysis-plan.md and locks the hypothesis, the four sensitivity metrics (σmicro, σmicro,z, σparam, σcombined), the four robustness metrics (RK, RPT, Rρ, Rlift), the inclusion/exclusion rules, the estimator family, the permutation-null protocol, the BH multiplicity rule across the 15 secondary (σ, R) pairs, and the article-elevation thresholds. It was committed before any parquet was generated.
The second, 1badde4, is analysis-plan-locked.md: the post-pilot hyperparameter lock. It freezes the σparam neighbourhood definition, the |Sharpe| numerical-artifact cap, the OLS-with-family-clustered-SE estimator that replaces MixedLM, and the within-transform cross-check that replaces R’s lme4. It was committed after the SOL pilot finished and before any DOGE or BTC parquet existed; directory-mtime checks on data/ verify this.
The design is a pilot–replication split. The SOL_1h_7W partition is the pilot, used purely to lock hyperparameters that the pre-reg left open (e.g., the exact “1-step” rule for σparam and the clustering choice for the cluster-robust SE). The pilot has insufficient power to confirm anything (n_cells = 42, ≈ 0.55 power at f² = 0.05). Confirmation comes from the pooled DOGE_30m_21W + BTC_30m_27W replication, n_cells ≈ 322, > 0.95 power at f² = 0.05. The replication is run with the locked hyperparameters and no further tuning.
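The quoted power figures can be sanity-checked by simulation. A minimal sketch under a deliberately simplified design (one standardised predictor, Gaussian noise, no fixed effects, so the numbers will not match the pre-reg power analysis exactly); with unit-variance predictor and noise, f² = β², hence β = −√f²:

```python
import numpy as np
from scipy import stats

def sim_power(n, f2=0.05, alpha=0.05, n_sims=500, seed=0):
    """Monte-Carlo power of the one-sided test of beta < 0 for a single
    standardised predictor. Simplified stand-in for the fixed-effects
    design: x ~ N(0,1), y = beta*x + N(0,1), so f^2 = beta^2."""
    rng = np.random.default_rng(seed)
    beta_true = -np.sqrt(f2)
    t_crit = stats.t.ppf(alpha, df=n - 2)          # left-tail critical value
    rejections = 0
    for _ in range(n_sims):
        x = rng.standard_normal(n)
        y = beta_true * x + rng.standard_normal(n)
        slope, intercept = np.polyfit(x, y, 1)
        resid = y - (intercept + slope * x)
        sxx = ((x - x.mean()) ** 2).sum()
        se = np.sqrt(resid @ resid / (n - 2) / sxx)  # analytic slope SE
        rejections += (slope / se) < t_crit
    return rejections / n_sims

print(sim_power(42), sim_power(322))   # pilot vs pooled-replication cell counts
```

On this stripped-down model the n = 42 power lands near one half and the n = 322 power well above 0.95, consistent with the direction of the pre-reg calculation.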
Methodology
The corpus is parsed in Rust: src/parse_corpus.rs walks each <asset>/<family>/<strat>/<strat>.txt tree, regex-extracts every Wxx line into a long-format row keyed by (asset, family, strategy, base_param, transformation, confluence, sl_regime, window, sample, perturbation), and emits a parquet per asset. The parser correctness is spot-checked by hand on five random text files per asset against the parquet rows.
The cell-level dataframe has one row per (family, asset, window). For each cell, σmicro is computed at the strategy level first and then averaged across the IS top-K, and RK is the mean base-OOS Sharpe of the same K = 5 strategies one window later.
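That aggregation can be sketched in pandas. Column names (`sharpe_is_base`, `sigma_strat`, `sharpe_oos_next`) are illustrative stand-ins, not the repo's actual parquet schema:

```python
import pandas as pd

def cell_metrics(strat: pd.DataFrame, K: int = 5) -> pd.DataFrame:
    """strat: one row per (family, asset, window, strategy) with
    illustrative columns
      sharpe_is_base   IS Sharpe of the unperturbed run
      sigma_strat      std of IS Sharpe across the perturbation suite
      sharpe_oos_next  base-perturbation OOS Sharpe in window w+1
    Returns one row per (family, asset, window) cell."""
    def one_cell(g):
        topk = g.nlargest(K, "sharpe_is_base")          # IS top-K
        return pd.Series({
            "sigma_micro": topk["sigma_strat"].mean(),  # IS sensitivity
            "R_K": topk["sharpe_oos_next"].mean(),      # OOS robustness
            "n_strats": len(g),
        })
    cells = (strat.dropna(subset=["sharpe_is_base"])
                  .groupby(["family", "asset", "window"])
                  .apply(one_cell)
                  .reset_index())
    return cells[cells["n_strats"] >= 30]  # n >= 30 inclusion floor
```

The n ≥ 30 filter on the last line is the same floor that excludes RSI_LEVEL from the primary model.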
Inclusion: a cell is kept iff (i) window w+1 exists, (ii) ≥ 30 strategies in (F, a) have all five perturbation variants in window w, and (iii) at least K=5 of those have non-null IS Sharpe. The RSI_LEVEL family on SOL has only 12 strategies and is excluded from the primary cell-level model by the n ≥ 30 floor; it is reported in the strategy-level sensitivity instead.
Estimator and inference
The pre-registered model was a linear mixed-effects fit with random intercepts on family, asset, and window. In practice, at n_windows = 7 in the SOL pilot, the random-effects covariance for window was singular and MixedLM’s REML optimiser failed to converge. The locked replacement is OLS with family-clustered robust standard errors and explicit fixed effects on family, window, and asset:

RK = α + βsmooth · σmicro + γF + δa + ηw + ε, with σmicro standardised,
with the cluster-robust sandwich estimator at the family level. Inference uses a permutation null with M = 1000: within (a, w+1) cells, shuffle the OOS-Sharpe → strategy assignment, recompute RK, refit, and record βsmooth,null. The empirical one-sided p-value is pperm = (1 + #{m : βsmooth,null,m ≤ βsmooth,obs}) / (M + 1).
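A condensed sketch of the locked estimator and the permutation null, assuming a `cells` dataframe with illustrative columns `family`, `asset`, `window`, `sigma_micro`, `R_K`. The shuffle here permutes cell-level R_K within (asset, window), a simplification of the strategy-level reassignment described above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_primary(cells: pd.DataFrame):
    """OLS of R_K on standardised sigma_micro with family/asset/window
    fixed effects and family-clustered robust SEs."""
    d = cells.copy()
    d["sigma_z"] = (d["sigma_micro"] - d["sigma_micro"].mean()) / d["sigma_micro"].std()
    m = smf.ols("R_K ~ sigma_z + C(family) + C(asset) + C(window)", data=d)
    return m.fit(cov_type="cluster", cov_kwds={"groups": d["family"]})

def perm_pvalue(cells: pd.DataFrame, beta_obs: float, M: int = 1000, seed: int = 0):
    """One-sided permutation p: shuffle the outcome within (asset, window)
    cells, refit, count null betas at least as negative as the observed."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(M):
        d = cells.copy()
        d["R_K"] = (d.groupby(["asset", "window"])["R_K"]
                      .transform(lambda s: rng.permutation(s.values)))
        hits += fit_primary(d).params["sigma_z"] <= beta_obs
    return (1 + hits) / (M + 1)
```

The `cov_type="cluster"` call is statsmodels' standard cluster-robust sandwich; the +1 in numerator and denominator is the usual finite-sample permutation correction.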
The pre-reg called for an R/lme4 cross-language verification. R is not installed in the environment and bringing it in would have broken the firm’s tri-language standard (Rust + Python + R kept available only when all three are already supported on the host). The locked substitute is a within-transform OLS implemented from scratch in Python: subtract group means for F / a / w, fit OLS on the residuals, and compute the analytic SEs. The two implementations must agree on βsmooth to three decimal places. On the SOL pilot they agreed to 1.1 × 10⁻¹³.
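The within-transform cross-check can be sketched as alternating group demeaning (a hypothetical re-implementation, not the repo's code). Repeated sweeps are needed because one-pass demeaning is only exact for balanced designs, the same unbalancedness that drives the Δβ gap reported in the replication:

```python
import numpy as np
import pandas as pd

def within_beta(cells: pd.DataFrame, n_sweeps: int = 100) -> float:
    """Within-transform estimate of beta_smooth: alternately sweep out
    family / asset / window group means from both variables, then
    regress the residualised R_K on the residualised sigma_z.
    By Frisch-Waugh-Lovell this matches the coefficient from the
    full dummy-variable OLS once the sweeps have converged."""
    d = cells[["family", "asset", "window", "sigma_z", "R_K"]].copy()
    for _ in range(n_sweeps):
        for fe in ["family", "asset", "window"]:
            for col in ["sigma_z", "R_K"]:
                d[col] = d[col] - d.groupby(fe)[col].transform("mean")
    x = d["sigma_z"].to_numpy()
    y = d["R_K"].to_numpy()
    return float(x @ y / (x @ x))
```

Agreement between this and the dummy-variable OLS slope is the kind of cross-implementation check the pilot reported at Δβ ≈ 1.1 × 10⁻¹³.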
Documented deviations from pre-registration
Three deviations are reported honestly in the results, in the same place the pre-reg said they would be:
- |Sharpe| > 100 cap. The pre-reg sentinel rule excluded only nan and ±inf. After the parser ran, ≈ 1.4 × 10¹⁶-magnitude Sharpe values appeared on rows with trades = 2: the backtester divides annualised return by a near-zero realised vol on those trades and produces numerical artifacts. The locked rule drops rows with |Sharpe| > 100. This is a documented exploratory deviation.
- OLS + family-clustered SEs replacing MixedLM. Singular random-effects covariance at n_windows = 7 broke REML. The locked estimator preserves the cluster-robust intent (clusters at family) and reproduces the same fixed-effects structure as the pre-reg model.
- Within-transform OLS replacing the R / lme4 cross-check. R unavailable; analytical demeaning was implemented in Python instead. The Δβ < 1 × 10⁻¹³ agreement on the SOL pilot is the verification.
SOL pilot results
The pilot ran on SOL_1h_7W, the firm’s 7-window walk-forward partition on SOL/USDT 1h bars. After the n ≥ 30 strategies-per-cell floor, n_cells = 42 across 7 families × 6 transitions (RSI_LEVEL excluded). The primary fit is:
- βsmooth = +0.115 (standardised σmicro)
- SE = 0.121, t = +0.95
- pone-sided (H1: β < 0) = 0.829
- pperm (M = 1000) = 0.843
- f² = 0.038 (just below the 0.04 article-elevation floor)
- Cross-implementation: βwithin = +0.115, Δ = 1.1 × 10⁻¹³
At the aggregate level, the sign of βsmooth is the opposite of what H1 predicts: on SOL_1h_7W, the IS-spikier cells, on average, slightly outperform the IS-smoother cells in the next window. This is far from significant — the permutation null is centred at −0.004 with std 0.118, and the observed +0.115 sits comfortably inside it (pperm = 0.843). H0 is retained at the aggregate level on the pilot.


Per-family breakdown
Aggregating obscures real heterogeneity. Splitting by family and BH-correcting across the seven included families gives a sharper picture:
- ATR: β = −0.33, pBH < 0.0001. Strongly hypothesis-supporting — within ATR, IS-smoother cells beat IS-spikier cells out-of-sample by a wide margin.
- RSI: β = −0.38, pBH = 0.28. Sign is right, magnitude is large, but n_cells inside RSI is small enough that BH does not reject.
- PPO, SMA: weakly negative β, neither close to BH significance.
- EMA, STOCHK, MACD: positive β. EMA and STOCHK are the two families that drive the aggregate sign-flip. Both are momentum-style families with shallower IS Sharpe surfaces; the σmicro = 0 region inside them seems to coincide with a flat-zero RK region rather than a high-RK ridge.
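The BH adjustment applied to the per-family p-values above is the standard Benjamini-Hochberg step-up procedure; a generic reference implementation:

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values: rank-scale each sorted
    p-value by m/rank, then enforce monotonicity from the largest
    rank down, so that rejecting adj <= alpha controls FDR at alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # running minimum from the right makes the adjusted values monotone
    adj = np.empty(m)
    adj[order] = np.minimum.accumulate(scaled[::-1])[::-1]
    return np.clip(adj, 0.0, 1.0)
```

For example, `bh_adjust([0.01, 0.04, 0.03, 0.005])` returns `[0.02, 0.04, 0.04, 0.02]`: the two smallest raw p-values survive at α = 0.05 while their adjusted values double.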
Read the pilot honestly: at the population level, smoothness does not predict OOS skill on SOL. At the family level, it does inside ATR (and probably RSI, given more cells), and it doesn’t inside the momentum families. The verbal corpus claim is, on this evidence, family-conditional rather than universal.
The surfaces themselves
Numbers settle the empirical question; the surfaces show what we are calling smooth and spiky. Below: the IS Sharpe landscape for each family, computed at the mid walk-forward window (W = 4) on SOL/USDT 1H, displayed over the (transformation × confluence) parameter grid. Each panel is annotated with σL (a discrete-Laplacian smoothness metric — lower is geometrically smoother) and the per-family βsmooth from the replication pool (negative is hypothesis-supporting). Sorted smooth → spiky.
Each panel: IS Sharpe surface over the (transformation × confluence) grid for one family on SOL/USDT 1H, mid walk-forward window, median over base-parameter and SL-regime slices. Camera orbits 360° in 20 s. σL = discrete-Laplacian smoothness (lower = smoother). β = per-family OLS slope of RK on σmicro from the replication pool (negative = hypothesis-supporting; positive = anti-hypothesis). Videos are lazy-played: decoding pauses when the section scrolls offscreen. Sorted smoothest → spikiest by σL.
The smoothest surfaces by σL are PPO (0.95) and ATR (1.12); the spikiest are EMA (1.68) and SMA (1.90). If the verbal corpus claim were a clean linear law, these σL rankings should track the per-family β. They track only partially. ATR, the smoothest of the families with a large negative β, fits the ranking; MACD and EMA, the families with the largest positive βs, are visibly spikier, which also fits it. But PPO is the smoothest surface and yet carries β = +0.16, and SMA is the spikiest yet carries β ≈ 0. The relationship between geometric smoothness at one window and predictive smoothness across all transitions is real but partial. That partiality is exactly what the H0-retained aggregate result reflects.
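The annotations describe σL as a discrete-Laplacian smoothness metric. One plausible concrete definition (the exact corpus formula may differ) is the RMS of the 5-point Laplacian over the interior of the grid:

```python
import numpy as np

def sigma_L(surface: np.ndarray) -> float:
    """Discrete-Laplacian smoothness of a 2-D Sharpe surface:
    RMS of the 5-point stencil Laplacian over interior grid points.
    Lower = geometrically smoother. Assumed definition, for illustration."""
    z = np.asarray(surface, dtype=float)
    lap = (z[:-2, 1:-1] + z[2:, 1:-1]       # vertical neighbours
           + z[1:-1, :-2] + z[1:-1, 2:]     # horizontal neighbours
           - 4.0 * z[1:-1, 1:-1])           # minus 4x the centre
    return float(np.sqrt(np.mean(lap ** 2)))
```

A wide Gaussian bell (the "ATR look") scores low; the same bell plus high-frequency jitter (the "MACD look") scores much higher, which is the ordering the panel annotations rely on.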

Replication on DOGE + BTC
The replication parses DOGE_30m_21W (Δt = 30m, 21 walk-forward windows) and BTC_30m_27W (30m, 27 windows) and pools them. The locked hyperparameters from 1badde4 apply unchanged: same five-perturbation suite, same K = 5, same n ≥ 30 floor, same |Sharpe| > 100 drop, same OLS + family-clustered SEs, same M = 1000 permutation null. Pooled n_cells ≈ 322; per-asset triangulation breaks out SOL, DOGE, and BTC separately.
Pooling DOGE (n_cells = 140) and BTC (n_cells = 182) yields βsmooth = −0.058 with cluster-robust SE 0.126, one-sided permutation pperm = 0.272 (M = 1000) and effect size f² = 0.0002. The sign now matches the hypothesis, for the first time across the three estimates run on this corpus, but the magnitude is small relative to the cross-cell residual variance and the result sits comfortably inside the null. The cross-implementation check between OLS and the within-transform estimator diverges to |Δβ| = 0.016 in the larger sample (vs 1.1 × 10⁻¹³ on the pilot); the gap is driven by the contrast coding in the unbalanced asset / window design, and both estimators agree on sign and order of magnitude.
The per-asset triangulation is where the picture becomes unambiguous. SOL β = +0.115 (n = 42); DOGE β = +0.163 (n = 140); BTC β = −0.210 (n = 182). The sign flips between DOGE and BTC, and neither is significant on its own. There is no coherent population-level relationship here. The pooled β < 0 is the weighted algebraic average of a positive bet and a negative bet; it is not evidence for the hypothesis.

The per-family breakdown on the replication pool (BH correction within the family of seven secondary tests; RSI_LEVEL is excluded by the n ≥ 30 floor) tells a more nuanced story. ATR β = −0.850 (uncorrected p = 0.044, pBH = 0.310) — the single largest within-family negative slope on the entire study, and the only family with a replicating sign across pilot and replication. RSI β = −0.090 (pBH = 0.888); EMA, MACD, PPO, SMA, STOCHK all positive (β between +0.011 and +0.631; all pBH = 0.888). After BH correction nothing crosses α = 0.05, but the structural finding — ATR (and weakly RSI) behaves as the verbal claim predicts; momentum families behave the opposite way — is consistent across pilot and replication.
Article-elevation criterion (locked, evaluated)
Pre-reg required all three of:
- replication pperm < 0.01 — got 0.272. Fails.
- sign-consistent βsmooth < 0 across SOL / DOGE / BTC — got +0.115, +0.163, −0.210. Fails.
- f² ≥ 0.04 — got 0.0002. Fails.
The criterion is missed on all three thresholds simultaneously. The result does not elevate to an article; the corpus-page sentence is rewritten per pre-reg branch 3 below.
What this means for the corpus claim
Three branches were pre-specified, with the rewriting plan for /corpus committed in advance:
- Strong support (replication pperm < 0.01, β < 0, f² ≥ 0.04). The corpus sentence keeps its current form and gains a footnote linking here. A standalone article on the family-conditional structure (ATR/RSI vs momentum families) is published.
- Weak support (replication pperm < 0.05 and sign-consistent β < 0, but f² < 0.04 or otherwise missing one of the strong-support criteria). The corpus sentence is rewritten to: “within mean-reversion families, smooth in-sample ridges show weak evidence of generalising better OOS than sharp spikes; momentum families do not show this pattern, and the population-level claim is not supported.” Lab page only; no separate article.
- Null / sign-flipped (replication NS or sign-consistent positive). The corpus sentence loses its absolutist phrasing and is replaced with: “in-sample smoothness under microstructural perturbation is not a population-level predictor of out-of-sample skill on the corpus we have tested. Family-level analysis is required.” The lab page reports the null cleanly and points the reader at the per-family heterogeneity figure.
The pilot already gives reason to expect the writeup will land somewhere between the second and third branch. ATR is real; the population-level claim probably is not. The replication’s job is to say which of the two is the headline.
Live demos
The 3D surface, live
The MP4s above are baked-in views of seven specific corpus surfaces. The demo below lets you shape a surface yourself: dial the smoothness slider toward 1 to get a wide bell (the “ATR look”), drag it toward 0 to get a sharp narrow peak surrounded by collapse (the “MACD look”). The perturbation-noise slider adds high-frequency jitter, the empirical analogue of σmicro. The σL readout updates in real time. The mesh is intentionally low-poly (32×32 = 1,024 vertices, single material, one wireframe pass) so that even a weak phone can rotate it at 60 fps without breaking a sweat.
Synthetic IS Sharpe surface, 32×32 mesh, vertex-coloured by height. The smoothness slider interpolates between a wide low bell (smooth basin under perturbation — hypothesised to generalise OOS) and a narrow tall spike with flat collapse around it (brittle peak — hypothesised to overfit). The perturbation noise slider adds high-frequency jitter that approximates what σmicro measures empirically. σL in the panel updates live so you can see how the geometry and the smoothness metric move together.
The (σ, R) scatter, live
Different lens, same project. This second demo lets you scrub the simulated relationship between σmicro and RK for a single family under a tunable underlying β and sample size. It is not a fit to the corpus — the corpus result is in the figures above and in the replication parquet — but it builds intuition for what βsmooth = ±0.3 actually looks like as a scatter, and how easy it is to be fooled by 42 noisy cells.
Synthetic. The dashed grey line is the true β; the amber line is the fitted slope on this sample. Resample to see how often the sign flips at small n. The pre-registered article criterion was f² ≥ 0.04 — try setting β = −0.3 with n = 42 vs n = 322 and watch how visibility-of-effect changes. The corpus pilot was n = 42; the replication pool was n = 322.
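The fooled-by-small-n point can be quantified. Under the same synthetic model as the demo (σ ~ N(0,1), R = βσ + unit noise; an illustration, not a fit to the corpus), count how often the fitted slope comes out with the wrong sign:

```python
import numpy as np

def sign_flip_rate(beta=-0.3, n=42, n_sims=1000, seed=0):
    """Fraction of simulated samples in which OLS fits a positive slope
    even though the true beta is negative. Mirrors the live scatter
    demo's resample button at a fixed true beta."""
    rng = np.random.default_rng(seed)
    flips = 0
    for _ in range(n_sims):
        x = rng.standard_normal(n)
        y = beta * x + rng.standard_normal(n)
        flips += np.polyfit(x, y, 1)[0] > 0   # fitted slope, wrong sign?
    return flips / n_sims

print(sign_flip_rate(n=42), sign_flip_rate(n=322))  # pilot vs replication n
```

At β = −0.3 the flip rate is small but nonzero at n = 42 and effectively vanishes at n = 322; weaker true betas flip far more often at pilot size.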
Reproducibility
DaruFinance / quant-surface-stability
Python — open source reference implementation
Minimal invocation
# Reproduce: pre-reg locked at 8e62171; hyperparameter lock at 1badde4
git clone https://github.com/DaruFinance/quant-surface-stability.git
cd quant-surface-stability
# 1. Parse strategy text dumps -> parquet (Rust, ~1m per asset)
cargo run --release --bin parse_corpus -- --asset SOL_1h_7W
# 2. Build cell-level (family x asset x window) metrics
python scripts/compute_metrics.py --asset SOL_1h_7W
# 3. Primary fit: sigma_micro x R_K, OLS + family-clustered SEs,
# permutation null with M=1000, within-transform cross-check.
python scripts/fit.py --asset SOL_1h_7W --primary
# beta_smooth = +0.115 p_perm = 0.843 f^2 = 0.038
# cross-impl. delta = 1.1e-13 --> H_0 retained at aggregate level
References
- [1] Bailey, D. H. & López de Prado, M. (2014). The probability of backtest overfitting. Journal of Computational Finance 20(4), 39–69.
- [2] Harvey, C. R. & Liu, Y. (2014). Backtesting. Journal of Portfolio Management 42(1), 13–28.
- [3] Carrasco, M. & Maciel, L. (2020). Robustness of in-sample optimisation in trading rule selection: a parameter-stability perspective. Quantitative Finance 20(11), 1799–1816.
- [4] Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B 57(1), 289–300.