M/09 — Pre-registered surface-stability test
Does in-sample smoothness predict out-of-sample skill?
A pre-registered empirical test of the verbal corpus claim that smooth in-sample Sharpe ridges generalise out-of-sample and sharp spikes don't. SOL pilot: H₀ retained at the aggregate level, with strong family-level heterogeneity. DOGE + BTC replication: H₀ retained again, with the per-asset signs disagreeing.
The hypothesis
The Daru Finance corpus page contains the following verbal claim: “smooth, broad ridges mean the family has a stable performance basin under perturbation of its parameters; sharp spikes flanked by collapse mean the family is brittle — its in-sample peaks are likely overfit rather than real.” That sentence has been printed on the corpus page for some time without an empirical test attached. This project is the test. It is pre-registered, the hyperparameters were locked before the replication data was loaded, and the writeup is committed to publishing whichever way the result lands.
Operationally: pick a strategy family F, an asset a, and a walk-forward window w. For every strategy θ in the IS top-K of (F, a, w), measure how much the in-sample Sharpe wiggles when the backtest is re-run under the fixed five-perturbation suite (base, entry-confirmation, fee, slippage, entry+indicator). The std of those five Sharpes is σmicro(θ). Aggregate to the cell level by averaging σmicro across the IS top-K. The OOS robustness metric RK is the mean base-perturbation OOS Sharpe of the same K = 5 strategies in window w+1. The two formal definitions are:

σmicro(F, a, w) = (1/K) · Σθ∈top-K std{ SharpeIS(θ, p, w) : p in the perturbation suite }

RK(F, a, w) = (1/K) · Σθ∈top-K SharpeOOS(θ, base, w+1)
H0: after controlling for family / asset / window fixed effects, βsmooth in RK ~ σmicro is zero or positive. H1 (one-sided): βsmooth < 0 — IS-stable strategies generalise better OOS, exactly as the verbal claim predicts. K = 5 matches the firm’s existing portfolio convention (CheckerWFO).
Pre-registration
Two commits matter. The first, 8e62171552f284b6dbbe6a5d450d41618935fd76, is the initial commit of analysis-plan.md and locks the hypothesis, the four sensitivity metrics (σmicro, σmicro,z, σparam, σcombined), the four robustness metrics (RK, RPT, Rρ, Rlift), the inclusion/exclusion rules, the estimator family, the permutation-null protocol, the BH multiplicity rule across the 15 secondary (σ, R) pairs, and the article-elevation thresholds. It was committed before any parquet was generated.
The second, 1badde4, is analysis-plan-locked.md: the post-pilot hyperparameter lock. It freezes the σparam neighbourhood definition, the |Sharpe| numerical-artifact cap, the OLS-with-family-clustered-SE estimator that replaces MixedLM, and the within-transform cross-check that replaces R’s lme4. It was committed after the SOL pilot finished and before any DOGE or BTC parquet existed; directory-mtime checks on data/ verify this.
The design is a pilot–replication split. The SOL_1h_7W partition is the pilot, used purely to lock hyperparameters that the pre-reg left open (e.g., the exact “1-step” rule for σparam and the clustering choice for the cluster-robust SE). The pilot has insufficient power to confirm anything (n_cells = 42, ≈ 0.55 power at f² = 0.05). Confirmation comes from the pooled DOGE_30m_21W + BTC_30m_27W replication, n_cells ≈ 322, > 0.95 power at f² = 0.05. The replication is run with the locked hyperparameters and no further tuning.
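The quoted power figures can be sanity-checked by simulation. A minimal sketch under a deliberately simplified design (one standardised predictor, Gaussian noise, no fixed effects, so the numbers will not match the pre-reg power analysis exactly); with unit-variance predictor and noise, f² = β², hence β = −√f²:

```python
import numpy as np
from scipy import stats

def sim_power(n, f2=0.05, alpha=0.05, n_sims=500, seed=0):
    """Monte-Carlo power of the one-sided test of beta < 0 for a single
    standardised predictor. Simplified stand-in for the fixed-effects
    design: x ~ N(0,1), y = beta*x + N(0,1), so f^2 = beta^2."""
    rng = np.random.default_rng(seed)
    beta_true = -np.sqrt(f2)
    t_crit = stats.t.ppf(alpha, df=n - 2)          # left-tail critical value
    rejections = 0
    for _ in range(n_sims):
        x = rng.standard_normal(n)
        y = beta_true * x + rng.standard_normal(n)
        slope, intercept = np.polyfit(x, y, 1)
        resid = y - (intercept + slope * x)
        sxx = ((x - x.mean()) ** 2).sum()
        se = np.sqrt(resid @ resid / (n - 2) / sxx)  # analytic slope SE
        rejections += (slope / se) < t_crit
    return rejections / n_sims

print(sim_power(42), sim_power(322))   # pilot vs pooled-replication cell counts
```

On this stripped-down model the n = 42 power lands near one half and the n = 322 power well above 0.95, consistent with the direction of the pre-reg calculation.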
Methodology
The corpus is parsed in Rust: src/parse_corpus.rs walks each <asset>/<family>/<strat>/<strat>.txt tree, regex-extracts every Wxx line into a long-format row keyed by (asset, family, strategy, base_param, transformation, confluence, sl_regime, window, sample, perturbation), and emits a parquet per asset. The parser correctness is spot-checked by hand on five random text files per asset against the parquet rows.
The cell-level dataframe has one row per (family, asset, window). For each cell, σmicro is computed at the strategy level first and then averaged across the IS top-K, and RK is the mean base-OOS Sharpe of the same K = 5 strategies one window later.
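That aggregation can be sketched in pandas. Column names (`sharpe_is_base`, `sigma_strat`, `sharpe_oos_next`) are illustrative stand-ins, not the repo's actual parquet schema:

```python
import pandas as pd

def cell_metrics(strat: pd.DataFrame, K: int = 5) -> pd.DataFrame:
    """strat: one row per (family, asset, window, strategy) with
    illustrative columns
      sharpe_is_base   IS Sharpe of the unperturbed run
      sigma_strat      std of IS Sharpe across the perturbation suite
      sharpe_oos_next  base-perturbation OOS Sharpe in window w+1
    Returns one row per (family, asset, window) cell."""
    def one_cell(g):
        topk = g.nlargest(K, "sharpe_is_base")          # IS top-K
        return pd.Series({
            "sigma_micro": topk["sigma_strat"].mean(),  # IS sensitivity
            "R_K": topk["sharpe_oos_next"].mean(),      # OOS robustness
            "n_strats": len(g),
        })
    cells = (strat.dropna(subset=["sharpe_is_base"])
                  .groupby(["family", "asset", "window"])
                  .apply(one_cell)
                  .reset_index())
    return cells[cells["n_strats"] >= 30]  # n >= 30 inclusion floor
```

The n ≥ 30 filter on the last line is the same floor that excludes RSI_LEVEL from the primary model.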
Inclusion: a cell is kept iff (i) window w+1 exists, (ii) ≥ 30 strategies in (F, a) have all five perturbation variants in window w, and (iii) at least K=5 of those have non-null IS Sharpe. The RSI_LEVEL family on SOL has only 12 strategies and is excluded from the primary cell-level model by the n ≥ 30 floor; it is reported in the strategy-level sensitivity instead.
Estimator and inference
The pre-registered model was a linear mixed-effects fit with random intercepts on family, asset, and window. In practice, at n_windows = 7 in the SOL pilot, the random-effects covariance for window was singular and MixedLM’s REML optimiser failed to converge. The locked replacement is OLS with family-clustered robust standard errors and explicit fixed effects on family, window, and asset:

RK = α + βsmooth · σmicro + γF + δa + ηw + ε, with σmicro standardised,
with the cluster-robust sandwich estimator at the family level. Inference uses a permutation null with M = 1000: within (a, w+1) cells, shuffle the OOS-Sharpe → strategy assignment, recompute RK, refit, and record βsmooth,null. The empirical one-sided p-value is pperm = (1 + #{m : βsmooth,null,m ≤ βsmooth,obs}) / (M + 1).
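A condensed sketch of the locked estimator and the permutation null, assuming a `cells` dataframe with illustrative columns `family`, `asset`, `window`, `sigma_micro`, `R_K`. The shuffle here permutes cell-level R_K within (asset, window), a simplification of the strategy-level reassignment described above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_primary(cells: pd.DataFrame):
    """OLS of R_K on standardised sigma_micro with family/asset/window
    fixed effects and family-clustered robust SEs."""
    d = cells.copy()
    d["sigma_z"] = (d["sigma_micro"] - d["sigma_micro"].mean()) / d["sigma_micro"].std()
    m = smf.ols("R_K ~ sigma_z + C(family) + C(asset) + C(window)", data=d)
    return m.fit(cov_type="cluster", cov_kwds={"groups": d["family"]})

def perm_pvalue(cells: pd.DataFrame, beta_obs: float, M: int = 1000, seed: int = 0):
    """One-sided permutation p: shuffle the outcome within (asset, window)
    cells, refit, count null betas at least as negative as the observed."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(M):
        d = cells.copy()
        d["R_K"] = (d.groupby(["asset", "window"])["R_K"]
                      .transform(lambda s: rng.permutation(s.values)))
        hits += fit_primary(d).params["sigma_z"] <= beta_obs
    return (1 + hits) / (M + 1)
```

The `cov_type="cluster"` call is statsmodels' standard cluster-robust sandwich; the +1 in numerator and denominator is the usual finite-sample permutation correction.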
The pre-reg called for an R/lme4 cross-language verification. R is not installed in the environment and bringing it in would have broken the firm’s tri-language standard (Rust + Python + R kept available only when all three are already supported on the host). The locked substitute is a within-transform OLS implemented from scratch in Python: subtract group means for F / a / w, fit OLS on the residuals, and compute the analytic SEs. The two implementations must agree on βsmooth to three decimal places. On the SOL pilot they agreed to 1.1 × 10⁻¹³.
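The within-transform cross-check can be sketched as alternating group demeaning (a hypothetical re-implementation, not the repo's code). Repeated sweeps are needed because one-pass demeaning is only exact for balanced designs, the same unbalancedness that drives the Δβ gap reported in the replication:

```python
import numpy as np
import pandas as pd

def within_beta(cells: pd.DataFrame, n_sweeps: int = 100) -> float:
    """Within-transform estimate of beta_smooth: alternately sweep out
    family / asset / window group means from both variables, then
    regress the residualised R_K on the residualised sigma_z.
    By Frisch-Waugh-Lovell this matches the coefficient from the
    full dummy-variable OLS once the sweeps have converged."""
    d = cells[["family", "asset", "window", "sigma_z", "R_K"]].copy()
    for _ in range(n_sweeps):
        for fe in ["family", "asset", "window"]:
            for col in ["sigma_z", "R_K"]:
                d[col] = d[col] - d.groupby(fe)[col].transform("mean")
    x = d["sigma_z"].to_numpy()
    y = d["R_K"].to_numpy()
    return float(x @ y / (x @ x))
```

Agreement between this and the dummy-variable OLS slope is the kind of cross-implementation check the pilot reported at Δβ ≈ 1.1 × 10⁻¹³.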
Documented deviations from pre-registration
Three deviations are reported honestly in the results, in the same place the pre-reg said they would be:
- |Sharpe| > 100 cap. The pre-reg sentinel rule excluded only nan and ±inf. After the parser ran, ≈ 1.4 × 10¹⁶-magnitude Sharpe values appeared on rows with trades = 2: the backtester divides annualised return by a near-zero realised vol on those trades and produces numerical artifacts. The locked rule drops rows with |Sharpe| > 100. This is a documented exploratory deviation.
- OLS + family-clustered SEs replacing MixedLM. Singular random-effects covariance at n_windows = 7 broke REML. The locked estimator preserves the cluster-robust intent (clusters at family) and reproduces the same fixed-effects structure as the pre-reg model.
- Within-transform OLS replacing the R / lme4 cross-check. R unavailable; analytical demeaning was implemented in Python instead. The Δβ < 1 × 10⁻¹³ agreement on the SOL pilot is the verification.
SOL pilot results
The pilot ran on SOL_1h_7W, the firm’s 7-window walk-forward partition on SOL/USDT 1h bars. After the n ≥ 30 strategies-per-cell floor, n_cells = 42 across 7 families × 6 transitions (RSI_LEVEL excluded). The primary fit is:
- βsmooth = +0.115 (standardised σmicro)
- SE = 0.121, t = +0.95
- pone-sided (H1: β < 0) = 0.829
- pperm (M = 1000) = 0.843
- f² = 0.038 (just below the 0.04 article-elevation floor)
- Cross-implementation: βwithin = +0.115, Δ = 1.1 × 10⁻¹³
At the aggregate level, the sign of βsmooth is the opposite of what H1 predicts: on SOL_1h_7W, the IS-spikier cells, on average, slightly outperform the IS-smoother cells in the next window. This is far from significant — the permutation null is centred at −0.004 with std 0.118, and the observed +0.115 sits comfortably inside it (pperm = 0.843). H0 is retained at the aggregate level on the pilot.


Per-family breakdown
Aggregating obscures real heterogeneity. Splitting by family and BH-correcting across the seven included families gives a sharper picture:
- ATR: β = −0.33, pBH < 0.0001. Strongly hypothesis-supporting — within ATR, IS-smoother cells beat IS-spikier cells out-of-sample by a wide margin.
- RSI: β = −0.38, pBH = 0.28. Sign is right, magnitude is large, but n_cells inside RSI is small enough that BH does not reject.
- PPO, SMA: weakly negative β, neither close to BH significance.
- EMA, STOCHK, MACD: positive β. EMA and STOCHK are the two families that drive the aggregate sign-flip. Both are momentum-style families with shallower IS Sharpe surfaces; the σmicro = 0 region inside them seems to coincide with a flat-zero RK region rather than a high-RK ridge.
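The BH adjustment applied to the per-family p-values above is the standard Benjamini-Hochberg step-up procedure; a generic reference implementation:

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values: rank-scale each sorted
    p-value by m/rank, then enforce monotonicity from the largest
    rank down, so that rejecting adj <= alpha controls FDR at alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # running minimum from the right makes the adjusted values monotone
    adj = np.empty(m)
    adj[order] = np.minimum.accumulate(scaled[::-1])[::-1]
    return np.clip(adj, 0.0, 1.0)
```

For example, `bh_adjust([0.01, 0.04, 0.03, 0.005])` returns `[0.02, 0.04, 0.04, 0.02]`: the two smallest raw p-values survive at α = 0.05 while their adjusted values double.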
Read the pilot honestly: at the population level, smoothness does not predict OOS skill on SOL. At the family level, it does inside ATR (and probably RSI, given more cells), and it doesn’t inside the momentum families. The verbal corpus claim is, on this evidence, family-conditional rather than universal.
The surfaces themselves
Numbers settle the empirical question; the surfaces show what we are calling smooth and spiky. Below: the IS Sharpe landscape for each family, computed at the mid walk-forward window (W = 4) on SOL/USDT 1H, displayed over the (transformation × confluence) parameter grid. Each panel is annotated with σL (a discrete-Laplacian smoothness metric — lower is geometrically smoother) and the per-family βsmooth from the replication pool (negative is hypothesis-supporting). Sorted smooth → spiky.
Each panel: IS Sharpe surface over the (transformation × confluence) grid for one family on SOL/USDT 1H, mid walk-forward window, median over base-parameter and SL-regime slices. Camera orbits 360° in 20 s. σL = discrete-Laplacian smoothness (lower = smoother). β = per-family OLS slope of RK on σmicro from the replication pool (negative = hypothesis-supporting; positive = anti-hypothesis). Videos are lazy-played: decoding pauses when the section scrolls offscreen. Sorted smoothest → spikiest by σL.
The smoothest surfaces by σL are PPO (0.95) and ATR (1.12); the spikiest are EMA (1.68) and SMA (1.90). If the verbal corpus claim were a clean linear law, these σL rankings should track the per-family β. They track only partially. ATR, the smoothest of the families with a large negative β, fits the ranking; MACD and EMA, the families with the largest positive βs, are visibly spikier, which also fits it. But PPO is the smoothest surface and yet carries β = +0.16, and SMA is the spikiest yet carries β ≈ 0. The relationship between geometric smoothness at one window and predictive smoothness across all transitions is real but partial. That partiality is exactly what the H0-retained aggregate result reflects.
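The annotations describe σL as a discrete-Laplacian smoothness metric. One plausible concrete definition (the exact corpus formula may differ) is the RMS of the 5-point Laplacian over the interior of the grid:

```python
import numpy as np

def sigma_L(surface: np.ndarray) -> float:
    """Discrete-Laplacian smoothness of a 2-D Sharpe surface:
    RMS of the 5-point stencil Laplacian over interior grid points.
    Lower = geometrically smoother. Assumed definition, for illustration."""
    z = np.asarray(surface, dtype=float)
    lap = (z[:-2, 1:-1] + z[2:, 1:-1]       # vertical neighbours
           + z[1:-1, :-2] + z[1:-1, 2:]     # horizontal neighbours
           - 4.0 * z[1:-1, 1:-1])           # minus 4x the centre
    return float(np.sqrt(np.mean(lap ** 2)))
```

A wide Gaussian bell (the "ATR look") scores low; the same bell plus high-frequency jitter (the "MACD look") scores much higher, which is the ordering the panel annotations rely on.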

Replication on DOGE + BTC
The replication parses DOGE_30m_21W (Δt = 30m, 21 walk-forward windows) and BTC_30m_27W (30m, 27 windows) and pools them. The locked hyperparameters from 1badde4 apply unchanged: same five-perturbation suite, same K = 5, same n ≥ 30 floor, same |Sharpe| > 100 drop, same OLS + family-clustered SEs, same M = 1000 permutation null. Pooled n_cells ≈ 322; per-asset triangulation breaks out SOL, DOGE, and BTC separately.
Pooling DOGE (n_cells = 140) and BTC (n_cells = 182) yields βsmooth = −0.058 with cluster-robust SE 0.126, one-sided permutation pperm = 0.272 (M = 1000) and effect size f² = 0.0002. The sign now matches the hypothesis, for the first time across the three estimates run on this corpus, but the magnitude is small relative to the cross-cell residual variance and the result sits comfortably inside the null. The cross-implementation check between OLS and the within-transform estimator diverges to |Δβ| = 0.016 in the larger sample (vs 1.1 × 10⁻¹³ on the pilot); the gap is driven by the contrast coding in the unbalanced asset / window design, and both estimators agree on sign and order of magnitude.
The per-asset triangulation is where the picture becomes unambiguous. SOL β = +0.115 (n = 42); DOGE β = +0.163 (n = 140); BTC β = −0.210 (n = 182). The sign flips between DOGE and BTC, and neither is significant on its own. There is no coherent population-level relationship here. The pooled β < 0 is the weighted algebraic average of a positive bet and a negative bet; it is not evidence for the hypothesis.

The per-family breakdown on the replication pool (BH correction within the family of seven secondary tests; RSI_LEVEL is excluded by the n ≥ 30 floor) tells a more nuanced story. ATR β = −0.850 (uncorrected p = 0.044, pBH = 0.310) — the single largest within-family negative slope on the entire study, and the only family with a replicating sign across pilot and replication. RSI β = −0.090 (pBH = 0.888); EMA, MACD, PPO, SMA, STOCHK all positive (β between +0.011 and +0.631; all pBH = 0.888). After BH correction nothing crosses α = 0.05, but the structural finding — ATR (and weakly RSI) behaves as the verbal claim predicts; momentum families behave the opposite way — is consistent across pilot and replication.
Article-elevation criterion (locked, evaluated)
Pre-reg required all three of:
- replication pperm < 0.01 — got 0.272. Fails.
- sign-consistent βsmooth < 0 across SOL / DOGE / BTC — got +0.115, +0.163, −0.210. Fails.
- f² ≥ 0.04 — got 0.0002. Fails.
The criterion is missed on all three thresholds simultaneously. The result does not elevate to an article; the corpus-page sentence is rewritten per pre-reg branch 3 below.
What this means for the corpus claim
Three branches were pre-specified, with the rewriting plan for /corpus committed in advance:
- Strong support (replication pperm < 0.01, β < 0, f² ≥ 0.04). The corpus sentence keeps its current form and gains a footnote linking here. A standalone article on the family-conditional structure (ATR/RSI vs momentum families) is published.
- Weak support (replication pperm < 0.05 and sign-consistent β < 0, but f² < 0.04 or otherwise missing one of the strong-support criteria). The corpus sentence is rewritten to: “within mean-reversion families, smooth in-sample ridges show weak evidence of generalising better OOS than sharp spikes; momentum families do not show this pattern, and the population-level claim is not supported.” Lab page only; no separate article.
- Null / sign-flipped (replication NS or sign-consistent positive). The corpus sentence loses its absolutist phrasing and is replaced with: “in-sample smoothness under microstructural perturbation is not a population-level predictor of out-of-sample skill on the corpus we have tested. Family-level analysis is required.” The lab page reports the null cleanly and points the reader at the per-family heterogeneity figure.
The pilot already gives reason to expect the writeup will land somewhere between the second and third branch. ATR is real; the population-level claim probably is not. The replication’s job is to say which of the two is the headline.
Live demos
The 3D surface, live
The MP4s above are baked-in views of seven specific corpus surfaces. The demo below lets you shape a surface yourself: dial the smoothness slider toward 1 to get a wide bell (the “ATR look”), drag it toward 0 to get a sharp narrow peak surrounded by collapse (the “MACD look”). The perturbation-noise slider adds high-frequency jitter, the empirical analogue of σmicro. The σL readout updates in real time. The mesh is intentionally low-poly (32×32 = 1,024 vertices, single material, one wireframe pass) so that even a weak phone can rotate it at 60 fps without breaking a sweat.
Synthetic IS Sharpe surface, 32×32 mesh, vertex-coloured by height. The smoothness slider interpolates between a wide low bell (smooth basin under perturbation — hypothesised to generalise OOS) and a narrow tall spike with flat collapse around it (brittle peak — hypothesised to overfit). The perturbation noise slider adds high-frequency jitter that approximates what σmicro measures empirically. σL in the panel updates live so you can see how the geometry and the smoothness metric move together.
The (σ, R) scatter, live
Different lens, same project. This second demo lets you scrub the simulated relationship between σmicro and RK for a single family under a tunable underlying β and sample size. It is not a fit to the corpus — the corpus result is in the figures above and in the replication parquet — but it builds intuition for what βsmooth = ±0.3 actually looks like as a scatter, and how easy it is to be fooled by 42 noisy cells.
Synthetic. The dashed grey line is the true β; the amber line is the fitted slope on this sample. Resample to see how often the sign flips at small n. The pre-registered article criterion was f² ≥ 0.04 — try setting β = −0.3 with n = 42 vs n = 322 and watch how visibility-of-effect changes. The corpus pilot was n = 42; the replication pool was n = 322.
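The fooled-by-small-n point can be quantified. Under the same synthetic model as the demo (σ ~ N(0,1), R = βσ + unit noise; an illustration, not a fit to the corpus), count how often the fitted slope comes out with the wrong sign:

```python
import numpy as np

def sign_flip_rate(beta=-0.3, n=42, n_sims=1000, seed=0):
    """Fraction of simulated samples in which OLS fits a positive slope
    even though the true beta is negative. Mirrors the live scatter
    demo's resample button at a fixed true beta."""
    rng = np.random.default_rng(seed)
    flips = 0
    for _ in range(n_sims):
        x = rng.standard_normal(n)
        y = beta * x + rng.standard_normal(n)
        flips += np.polyfit(x, y, 1)[0] > 0   # fitted slope, wrong sign?
    return flips / n_sims

print(sign_flip_rate(n=42), sign_flip_rate(n=322))  # pilot vs replication n
```

At β = −0.3 the flip rate is small but nonzero at n = 42 and effectively vanishes at n = 322; weaker true betas flip far more often at pilot size.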
Reproducibility
DaruFinance / quant-surface-stability
Python — open source reference implementation
Minimal invocation
# Reproduce: pre-reg locked at 8e62171; hyperparameter lock at 1badde4
git clone https://github.com/DaruFinance/quant-surface-stability.git
cd quant-surface-stability
# 1. Parse strategy text dumps -> parquet (Rust, ~1m per asset)
cargo run --release --bin parse_corpus -- --asset SOL_1h_7W
# 2. Build cell-level (family x asset x window) metrics
python scripts/compute_metrics.py --asset SOL_1h_7W
# 3. Primary fit: sigma_micro x R_K, OLS + family-clustered SEs,
# permutation null with M=1000, within-transform cross-check.
python scripts/fit.py --asset SOL_1h_7W --primary
# beta_smooth = +0.115 p_perm = 0.843 f^2 = 0.038
# cross-impl. delta = 1.1e-13 --> H_0 retained at aggregate level
References
- [1] Bailey, D. H. & López de Prado, M. (2014). The probability of backtest overfitting. Journal of Computational Finance 20(4), 39–69.
- [2] Harvey, C. R. & Liu, Y. (2014). Backtesting. Journal of Portfolio Management 42(1), 13–28.
- [3] Carrasco, M. & Maciel, L. (2020). Robustness of in-sample optimisation in trading rule selection: a parameter-stability perspective. Quantitative Finance 20(11), 1799–1816.
- [4] Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B 57(1), 289–300.