The corpus

What “1.6 million strategies” actually means.

A response to the most common, and most reasonable, first reaction: “that’s not research, that’s data mining.”

The objection has teeth. With enough degrees of freedom, anything can be fit to anything. So before any claim is made about portfolios, edge, or robustness filters, this page does the boring work: explains how the corpus is constructed (with the confidential parts left confidential), shows what its internal geometry looks like, and walks through the two visual artefacts that capture both faces of the population — family-level structure and variant-level structure.

Corpus dimensions

Asset / timeframe partitions: 30 (crypto majors, large, mid-cap, FX)
Indicator families: 8 (EMA · SMA · MACD · PPO · RSI · RSI_LEVEL · STOCHK · ATR)
Strategies in full corpus: 1.63M (1,632,588; counted, not estimated)
Strategy-windows backtested: 23M+ (with IS / OOS samples + perturbations)

The objection

The data-mining critique runs roughly as follows: if you backtest a million strategies, the right tail of the distribution will, by sheer luck, contain strategies that print huge profit factors in-sample. Cherry-pick those, and any backtest looks great — until live trading, where the fits evaporate.

That critique is correct in spirit, and in fact this work agrees with it at the strategy level. The article Edge is in the Process opens with that finding: across the full 30-asset corpus, fewer than one in seven strategies survive an honest out-of-sample profit-factor threshold of 1.2, and the IS→OOS rank correlation hovers near zero. No individual strategy in this corpus carries credible OOS edge.
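
The mechanics of the objection can be reproduced on purely synthetic data. The sketch below is not the firm's pipeline: it draws zero-edge strategies from random noise (an assumption made only for illustration), computes in-sample and out-of-sample profit factors, and measures the IS→OOS rank correlation. The names `profit_factor` and `spearman`, the trade counts, and the random seed are all hypothetical.

```python
import numpy as np

def profit_factor(returns):
    """Gross profit divided by gross loss, one value per strategy (row)."""
    gains = np.where(returns > 0, returns, 0.0).sum(axis=1)
    losses = -np.where(returns < 0, returns, 0.0).sum(axis=1)
    return gains / losses

def spearman(a, b):
    """Spearman rank correlation (no tie handling; fine for continuous data)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

rng = np.random.default_rng(0)
n_strategies, n_trades = 10_000, 200

# Synthetic per-trade returns with zero true edge, independent IS and OOS.
is_returns = rng.standard_normal((n_strategies, n_trades))
oos_returns = rng.standard_normal((n_strategies, n_trades))

pf_is = profit_factor(is_returns)
pf_oos = profit_factor(oos_returns)

best = int(np.argmax(pf_is))             # the "cherry-picked" strategy
rank_corr = spearman(pf_is, pf_oos)      # IS -> OOS rank correlation
survival = float(np.mean(pf_oos > 1.2))  # share above the OOS threshold

print(f"best IS profit factor : {pf_is[best]:.2f}")
print(f"same strategy, OOS    : {pf_oos[best]:.2f}")
print(f"IS->OOS rank corr     : {rank_corr:+.3f}")
print(f"share with OOS PF>1.2 : {survival:.1%}")
```

Because the synthetic strategies have no edge by construction, any impressive in-sample profit factor here is selection on noise, which is exactly the failure mode the critique describes.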

So what is the population for? The argument develops over the next two sections, and the two video sections after them are the visual evidence.

How the corpus is built

Every strategy in the corpus is a fully specified, deterministic rule: an indicator family, a parameterisation, an entry / exit logic, and a risk-reward regime. Nothing is mined. Nothing is hand-picked. The substrate is the structured product of:

30 asset / timeframe partitions — crypto majors (BTC, ETH, BNB, SOL), large altcoins (DOGE, XRP, LTC, AVAX, LINK, BCH, DOT, TRX), ~15 mid-caps (APE, ATOM, ICP, NEAR, ALGO, HBAR, ARB, SUI, APT, ZEC, …) and three forex pairs (AUDUSD, USDCAD, USDCHF). Forex is held out as a cross-asset-class robustness check.
Eight indicator families covering trend, momentum, mean-reversion, and volatility — EMA, SMA, MACD, PPO, RSI, RSI level cross, Stochastic %K, ATR.
Per-family variant populations. Each base family is expanded into a population of variants by a battery of statistical and mathematical functions: transformations applied to the raw indicator series, and confluence subsets that condition entries on a second signal in a structured way. The exact construction is reserved. The empirical artefacts on this page demonstrate that, whatever the construction is, the resulting population has the geometric properties claimed.
Walk-forward optimisation across multiple lookback horizons, so each strategy is evaluated under sliding IS / OOS windows rather than on a single arbitrary slice of price history.
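
The sliding IS / OOS evaluation in the last item can be sketched as a window generator. This is a minimal illustration, not the corpus pipeline: the function name `walk_forward_windows`, the window lengths, and the choice to advance by one OOS length per step are assumptions.

```python
from typing import Iterator, Optional, Tuple

def walk_forward_windows(
    n_bars: int, is_len: int, oos_len: int, step: Optional[int] = None
) -> Iterator[Tuple[range, range]]:
    """Yield (in-sample, out-of-sample) bar ranges that slide through history.

    Each OOS range starts where its IS range ends, so no OOS bar is ever
    used for selection in the same window.
    """
    step = oos_len if step is None else step
    start = 0
    while start + is_len + oos_len <= n_bars:
        is_range = range(start, start + is_len)
        oos_range = range(start + is_len, start + is_len + oos_len)
        yield is_range, oos_range
        start += step

# Several lookback horizons over the same (hypothetical) 1,000-bar history.
for lookback in (250, 500):
    windows = list(walk_forward_windows(n_bars=1000, is_len=lookback, oos_len=100))
    first_is, first_oos = windows[0]
    print(lookback, len(windows), (first_is.start, first_is.stop), (first_oos.start, first_oos.stop))
```

Every strategy would then be scored once per window, which is how a single rule turns into many strategy-windows.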

No subset is removed. No partition is “held out as the good one.” All thirty asset/timeframes are evaluated under the same pipeline, with the same code, from the same commits.

On reproducibility. The SSRN working paper ships a reproducibility package — but it reproduces the aggregates derived from the strategies that appear in the paper: the IS / OOS distributions, filter-curve diagnostics, portfolio statistics, and figure data. The strategy corpus itself is proprietary and is not released. Python, Rust, and R cross-verify the same published aggregates from the same raw price history; they do not regenerate the corpus.

For population-level reading of the same substrate using public methodology, see the lab notes on population Sharpe density and the IS–OOS Sharpe-gap decomposition.

The geometry, in one chart

The single most informative thing you can do with a strategy corpus is compute its pairwise correlation matrix. If “a million strategies” were really one strategy in disguise, that matrix would be saturated near ρ = 1. It isn’t. The histogram below is the distribution of pairwise correlations across nearly 30,000 strategies, drawn by stratified sampling across the 30 asset partitions: about 450 million pairs.

[Histogram: distribution of pairwise strategy correlations]

For a Marchenko–Pastur reading of the same matrix — how many of the eigenvalues are signal vs the random-matrix null — see strategy-rmt. For the cross-asset analogue (rolling correlation cubes across the 9-asset universe), see strategy-corrcube.

Effective dimensionality

Mean absolute correlation across the full pool is 0.123. Median 0.072. Only 2.84% of pairs cross |ρ| = 0.5. The within-asset average (0.163) is — as you would expect — slightly higher than the between-asset average (0.122), but neither is anywhere near the “all the same trade” ceiling.
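
The summary statistics above are all functions of the off-diagonal entries of one correlation matrix. The sketch below shows how such numbers could be computed, run on synthetic returns rather than on the corpus. The function name `pairwise_correlation_summary`, the synthetic factor loading, and the sample sizes are assumptions, and the output will not match the figures quoted on this page.

```python
import numpy as np

def pairwise_correlation_summary(returns, threshold=0.5):
    """Summarise |rho| over all distinct strategy pairs.

    returns: array of shape (n_strategies, n_periods).
    """
    corr = np.corrcoef(returns)
    upper = np.triu_indices_from(corr, k=1)   # each pair counted once
    abs_rho = np.abs(corr[upper])
    return {
        "n_pairs": int(abs_rho.size),
        "mean_abs": float(abs_rho.mean()),
        "median_abs": float(np.median(abs_rho)),
        "share_above": float((abs_rho >= threshold).mean()),
    }

# Pair count for a 30,000-strategy sample: n * (n - 1) / 2.
n_pairs_30k = 30_000 * 29_999 // 2   # 449,985,000, i.e. about 450 million

# Synthetic stand-in: a weak common factor plus idiosyncratic noise.
rng = np.random.default_rng(0)
common = rng.standard_normal((1, 1000))
returns = 0.3 * common + rng.standard_normal((200, 1000))

summary = pairwise_correlation_summary(returns)
print(summary)
```

Splitting the same pair list by whether both strategies share an asset partition would give the within-asset and between-asset averages.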

In linear-algebra terms, the corpus has very high effective dimensionality: the population spans many more nearly-orthogonal return directions than its surface size suggests. A portfolio drawn from the pool therefore has access to a far richer spanning set than a corpus where every strategy is a re-skin of the same idea. Two empirical consequences follow directly:

Individual strategies look like noise (large idiosyncratic variance, weak IS→OOS predictability) precisely because each one captures a small, distinctive slice of return space.
Portfolios drawn from the same pool behave very differently — their idiosyncratic terms cancel and the structural component survives. This is the empirical hinge of Edge is in the Process, and the universe-saturation behaviour of those portfolios is studied separately in strategy-robust-portfolio.
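
This page does not say which measure of effective dimensionality it uses. One common choice, assumed here, is the participation ratio of the correlation-matrix eigenvalues, (Σλ)² / Σλ². The sketch compares a redundant pool with a diverse one using idealised equicorrelated matrices, and also evaluates the equal-weight portfolio variance 1/N + (1 − 1/N)·ρ̄ for unit-variance strategies. The value ρ = 0.12 is illustrative only; note that ρ̄ in that formula is the signed mean correlation, not the mean absolute correlation reported above.

```python
import numpy as np

def participation_ratio(corr):
    """Effective number of independent directions: (sum eig)^2 / sum eig^2."""
    eig = np.clip(np.linalg.eigvalsh(corr), 0.0, None)
    return float(eig.sum() ** 2 / (eig ** 2).sum())

def equicorrelated(n, rho):
    """Idealised n x n correlation matrix with every off-diagonal equal to rho."""
    return np.full((n, n), rho) + (1.0 - rho) * np.eye(n)

def equal_weight_variance(corr):
    """Variance of an equal-weight portfolio of unit-variance strategies.

    With weights 1/n, w' C w equals the mean of all entries of C.
    """
    return float(corr.mean())

n = 100
redundant = equicorrelated(n, 0.95)   # "one strategy in disguise"
diverse = equicorrelated(n, 0.12)     # low average correlation (illustrative)

print("participation ratio, redundant:", round(participation_ratio(redundant), 2))
print("participation ratio, diverse  :", round(participation_ratio(diverse), 2))
print("portfolio variance, redundant :", round(equal_weight_variance(redundant), 4))
print("portfolio variance, diverse   :", round(equal_weight_variance(diverse), 4))
```

In the redundant pool, averaging barely reduces variance because there is almost nothing idiosyncratic to cancel; in the diverse pool most of the single-strategy variance averages away, which is the cancellation described in the second consequence above.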

For the formal embedding view of the same population — PCA + UMAP on a 90-feature metric vector — see strategy-manifold. The two video sections below are the visual demonstrations on a single asset, SOL/USDT 1H, using the publicly disclosed indicator families.

Family level — performance surfaces

The eight base families, as 3D performance landscapes.

One asset — SOL/USDT, 1H bars. For each indicator family, every parameter combination of the family is evaluated under the same walk-forward pipeline, and the resulting performance surface is rendered as a 3D landscape. These are the raw, unoptimised surfaces — no filter applied, no post-selection, no cherry-picking. What you see below is the population each family generates before any of the firm’s methodology is brought to bear.

[Eight synchronised videos, one 3D performance surface per family: EMA, SMA, MACD, PPO, RSI, RSI_LEVEL, STOCHK, ATR]

How to read a surface

In each surface, the two horizontal axes (X and Y) are strategy parameters — the configurable knobs that define a strategy within its family. The vertical axis (Z) is strategy performance, summarised here as Sharpe ratio over the in-sample window that walk-forward selects on. The camera orbits so you can see the surface from every angle.

What matters is the shape, not any single peak. The intuitive reading (smooth ridges = stable basin, sharp spikes = brittle and likely overfit) is tested empirically in strategy-surface-stability, a pre-registered study on this corpus. The headline result is negative: in-sample smoothness under microstructural perturbation is not a population-level predictor of next-window OOS Sharpe across the three asset partitions tested (SOL 1h, DOGE 30m, BTC 30m). The pooled effect is indistinguishable from zero (f² = 0.0002), and the per-asset sign flips between DOGE and BTC. ATR is the one family where the pattern holds consistently across pilot and replication; momentum families behave the opposite way. Strategy-level Sharpe-gap decomposition is the subject of strategy-overfitting. The only point made here is the visual fact that the eight families produce eight visibly different geometries on the same asset. Whether that geometry predicts OOS skill is a separate empirical question, with a separate (and largely null) answer.

The natural follow-up question is whether two families that look similar in shape are in fact the same trade in disguise. The figure below answers that.

Pairwise correlation · 8 × 8

And the correlation between them

The 8 × 8 panel above the surfaces is the rolling pairwise correlation of each family’s entry events on the same SOL/USDT 1H slice. Each family is reduced to a single time-series — long = +1, short = −1, otherwise 0 — and the heatmap is the pairwise Pearson correlation of those event series over a rolling window. The diagonal is 1 by construction; the off-diagonal cells are what we care about.
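
The reduction described above (one signed event series per family, then a rolling Pearson correlation) can be sketched as follows. The event data here is random and stands in for real entry signals; the function name `rolling_event_correlation`, the 200-bar window, and the event probabilities are assumptions, since the page does not state the window used for this panel.

```python
import numpy as np

def rolling_event_correlation(events, window):
    """Yield (t, corr) for every full window ending at bar t.

    events: array of shape (n_bars, n_families) with values in {-1, 0, +1}
            (long entry = +1, short entry = -1, otherwise 0).
    """
    for t in range(window - 1, events.shape[0]):
        block = events[t - window + 1 : t + 1]
        # A family with no events in the window has zero variance, which
        # makes its correlations undefined; report those cells as 0.
        with np.errstate(invalid="ignore", divide="ignore"):
            corr = np.corrcoef(block, rowvar=False)
        corr = np.nan_to_num(corr, nan=0.0)
        np.fill_diagonal(corr, 1.0)   # diagonal is 1 by construction
        yield t, corr

# Synthetic stand-in for eight families' entry events on 500 bars.
rng = np.random.default_rng(0)
events = rng.choice([-1, 0, 1], size=(500, 8), p=[0.05, 0.90, 0.05])

heatmap_frames = list(rolling_event_correlation(events, window=200))
last_t, last_corr = heatmap_frames[-1]
off_diag = last_corr[~np.eye(8, dtype=bool)]
print(last_t, round(float(np.abs(off_diag).mean()), 3))
```

Each yielded matrix corresponds to one frame of the heatmap; the off-diagonal mean is the quantity to watch.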

The point of frame-locking the correlation video to the eight surface videos is that you can watch both at once: as the price walks through the slice, the surface clouds deform and the correlation cells shift. Two families that share a regime briefly will glow on the heatmap; two families that have structurally different timing logic will stay dim, no matter what the surfaces look like. The cells stay overwhelmingly cool — mean correlation across the off-diagonal stays near zero through the slice. The eight families are not redundant. The cross-asset version of this idea — the same 9-asset rolling-correlation cube — is the whole subject of strategy-corrcube.

Variant level — population structure inside each family

Inside each family — the variant population.

The previous section showed a family as a single performance surface. This section opens each family up. Per family, the proprietary battery of statistical and mathematical functions expands the base indicator into a structured population of variants — broken into two complementary subsets, raw indicator + transformation and raw indicator + confluence. Each video below is a live PCA layout of one family’s variant population on SOL/USDT 1H, with edges drawn between strategies that move together.

[Eight synchronised videos, one variant graph per family: EMA, SMA, MACD, PPO, RSI, RSI_LEVEL, STOCHK, ATR]

How to read a variant graph

The chart mechanics, made explicit:

Nodes = strategies. Each node is a single variant produced by the family’s expansion (transformation- or confluence-side). Nothing in the layout encodes the variant’s construction; only its behaviour.
Position = the top-2 principal-component loadings of the rolling 1,000-candle correlation matrix between every pair of variants in the family. Concretely: at each frame t we form C_t = corr(R_{t−999:t}), where R holds the per-variant entry-event series, take its top two eigenvectors, and place each variant at its loading. The rolling window is then advanced one bar and the layout recomputed.
Procrustes alignment + EMA smoothing on the loadings across frames. Without this, every PCA refit can flip the basis and rotate the cloud arbitrarily; with it, the cloud drifts rather than jumps, and the visual motion reflects actual structural change in the correlation manifold rather than bookkeeping artefacts.
Edges connect pairs with |ρ| ≥ 0.20. Teal = positive, red = negative; thickness and opacity scale with |ρ|. The threshold is a knob — lower it and the graph becomes a hairball, raise it and only the strongest co-movers remain.
Side panel = three time-series tracking the population’s cohesion: AVG |ρ|, MEDIAN |ρ|, and the percentage of pairs above threshold. Watch them rise during regime-coherent stretches and fall again when the regime breaks.
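
The chart mechanics listed above can be put together in a short sketch. It is not the rendering code behind the clips: the event data is synthetic, loadings are taken as eigenvectors scaled by the square root of their eigenvalues (one common convention, assumed here), and the smoothing factor `alpha` and all function names are hypothetical.

```python
import numpy as np

def pca_layout(corr):
    """2D positions from the two leading eigenvectors of the correlation matrix."""
    vals, vecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
    top = [-1, -2]                      # the two largest
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

def procrustes_align(current, reference):
    """Rotate or reflect `current` to best match `reference` (orthogonal Procrustes)."""
    u, _, vt = np.linalg.svd(current.T @ reference)
    return current @ (u @ vt)

def edges_and_cohesion(corr, threshold=0.20):
    """Edges for pairs with |rho| >= threshold, plus the side-panel statistics."""
    i, j = np.triu_indices_from(corr, k=1)
    rho = corr[i, j]
    keep = np.abs(rho) >= threshold
    edges = [(int(a), int(b), float(r)) for a, b, r in zip(i[keep], j[keep], rho[keep])]
    cohesion = {
        "avg_abs_rho": float(np.abs(rho).mean()),
        "median_abs_rho": float(np.median(np.abs(rho))),
        "pct_above": float(100.0 * keep.mean()),
    }
    return edges, cohesion

def layout_frames(events, window=1000, alpha=0.2):
    """Yield (t, layout, edges, cohesion), advancing the window one bar per frame."""
    layout = None
    for t in range(window - 1, events.shape[0]):
        with np.errstate(invalid="ignore", divide="ignore"):
            corr = np.corrcoef(events[t - window + 1 : t + 1], rowvar=False)
        corr = np.nan_to_num(corr, nan=0.0)
        np.fill_diagonal(corr, 1.0)
        raw = pca_layout(corr)
        if layout is None:
            layout = raw
        else:
            # Align to the previous frame first, then smooth exponentially,
            # so basis flips between refits do not show up as motion.
            layout = alpha * procrustes_align(raw, layout) + (1.0 - alpha) * layout
        edges, cohesion = edges_and_cohesion(corr)
        yield t, layout.copy(), edges, cohesion

# Synthetic entry events in {-1, 0, +1}: a shared latent driver plus noise.
rng = np.random.default_rng(1)
n_bars, n_variants = 1200, 12
latent = rng.standard_normal((n_bars, 1))
score = 0.8 * latent + rng.standard_normal((n_bars, n_variants))
events = np.sign(score) * (np.abs(score) > 1.0)

frames = list(layout_frames(events))
t, layout, edges, cohesion = frames[-1]
print(t, layout.shape, len(edges), cohesion)
```

Edge sign would select the colour (positive or negative) and |ρ| the thickness and opacity; both are available from the third element of each edge tuple.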

The topology lens: each frame’s correlation matrix can be turned into a metric (d_ij = √(2(1−ρ_ij))), and that metric induces a Vietoris–Rips complex over the variant population. Nodes that cluster on the chart correspond to 0-cycles that persist across filtration scales; the appearance and dissolution of 1-cycles tracks regime transitions in the family. The persistence-barcode treatment of the same machinery — on the full corpus rather than one family — lives in strategy-tda; the formal PCA + UMAP embedding using a 90-feature metric vector is in strategy-manifold.
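
The distance d_ij = √(2(1−ρ_ij)) and the 0-dimensional part of the Vietoris–Rips filtration can be computed without a topology library, because components only merge when an edge joins them. The sketch below uses a small hand-built correlation matrix and a union-find pass over sorted edges; it illustrates the idea only and is not the strategy-tda implementation. The function names are hypothetical, and 1-cycles are out of scope here.

```python
import numpy as np

def correlation_distance(corr):
    """d_ij = sqrt(2 * (1 - rho_ij)); ranges from 0 (rho = 1) to 2 (rho = -1)."""
    return np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))

def h0_death_scales(dist):
    """Filtration scales at which connected components merge.

    Every node is born as its own component at scale 0. Processing edges in
    order of increasing length, a component dies when an edge first joins it
    to another one. Returns n - 1 death scales; one component never dies.
    """
    n = dist.shape[0]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    i, j = np.triu_indices(n, k=1)
    deaths = []
    for k in np.argsort(dist[i, j]):
        a, b = find(int(i[k])), find(int(j[k]))
        if a != b:
            parent[a] = b
            deaths.append(float(dist[i[k], j[k]]))
    return deaths

# Hand-built example: two tight pairs of variants, unrelated to each other.
corr = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.9],
    [0.0, 0.0, 0.9, 1.0],
])
dist = correlation_distance(corr)
deaths = h0_death_scales(dist)
print([round(d, 3) for d in deaths])
```

A large gap between consecutive death scales indicates clusters that persist across filtration scales. For reference, the |ρ| = 0.20 edge threshold used in the variant graphs corresponds to a distance of about 1.26 for positively correlated pairs.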

It is important to read these graphs correctly: the layout is a 2D projection of a much higher-dimensional correlation manifold. Two nodes that look close in the projection are not necessarily close in the full space; they are only known to be close along the two directions of largest variance. Edges are the truth-keeper: an edge says the two variants actually co-move in the underlying space, regardless of where they sit on the screen.

Rendering and analysis scripts behind these clips live in the firm’s internal SOL strategies project; the proprietary parts (the variant-construction battery in particular) are not in the public portion of the toolchain.

All families, overlaid

The combined panel above is every variant from every visualised family, placed in one common 2D layout. Each family carries a colour. The point is not that one family wins — the point is that the families occupy distinct regions of the projected correlation manifold. Variants from different families do not pool into a single blob; they keep to their own neighbourhoods, with brief overlaps during regime crossovers.

That separation is what high effective dimensionality looks like at the level of composition, not just of average correlation. The pool isn’t merely uncorrelated on average; it is structurally heterogeneous. A robustness filter sampling across families is therefore drawing from genuinely different sub-manifolds, not slicing a single homogeneous cloud at different angles. That is what makes population-level evaluation a different problem from individual-strategy search.

In short

“1.6 million strategies” is not, by itself, a research claim. It is the size of an evaluation substrate. The substrate is constructed exhaustively (deterministic, no cherry-picking), and its internal geometry is dominated by low pairwise correlation, not redundancy. The 3D surface section showed that the eight base families are not the same family eight times. The variant-graph section showed that within a family the variant population is itself structured, and that across families the structures stay separated. That combination is what makes population-level evaluation a different problem from individual-strategy search, and what makes the portfolio-level findings of Edge is in the Process non-trivial.

Lab notes that contribute to or extend the analysis on this page:

strategy-rmt — Marchenko–Pastur and parallel-analysis eigenspectrum of the correlation matrix, the formal version of the histogram above.
strategy-manifold — PCA + UMAP embedding of the population from a 90-feature metric vector.
strategy-tda — persistence barcodes on the same correlation-distance complex, but on the full corpus.
strategy-corrcube — cross-asset rolling correlation cube; the asset-level analogue of the variant-level structure shown here.
strategy-stats — population-level Sharpe density and (μ, σ, t) scatter; the supporting library behind the dimension counters above.
strategy-overfitting — variance decomposition of the IS–OOS Sharpe gap into selection bias, parameter-choice noise, and residual.
strategy-robust-portfolio — universe-saturation analysis for portfolios drawn from the same pool.

If you remain unconvinced, the SSRN reproducibility package ships every aggregate and every figure from the working paper. Run it — and remember that what reproduces is the analysis that operates on the corpus, not the corpus itself.