M/05 — Strategy Manifold

PCA + UMAP geometry of the strategy population

Embed 100,000 walk-forward strategies from 90-D metric space into 2-D, then ask whether robustness is a contiguous region or a constellation of isolated islands.

The mathematics

Each strategy is a point x ∈ ℝᵈ in a high-dimensional feature space whose components are the per-window Sharpe ratio, profit factor (PF) and maximum drawdown (MaxDD) across the last 6 walk-forward windows, evaluated under Daru Finance’s proprietary perturbation suite. With N = 100,000 strategies we have an N × d matrix X. We ask: in this space, do robust strategies cluster, and if so, does the cluster form a single connected region or many isolated islands?

Principal-component projection

Centre the columns of X and take the singular value decomposition

X = U \Sigma V^{\top}.

Writing V_2 for the first two columns of V, the score matrix

T = X V_{2} \in \mathbb{R}^{N \times 2}

is the linear projection onto the leading two principal directions. On the production corpus PC1 explains 16.6% of the variance and PC2 explains 10.0%. Two coordinates therefore capture 26.6% of the variance in 90 dimensions, roughly twelve times the 2/90 ≈ 2.2% an isotropic cloud would give, which suggests the population has appreciable low-dimensional structure.
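The projection above can be sketched directly with NumPy. This is a minimal illustration on random stand-in data, not the production corpus; the function name `pca_project` and the synthetic matrix `X_demo` are assumptions introduced here.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the leading principal directions via SVD."""
    Xc = X - X.mean(axis=0)                      # centre each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_lead = Vt[:n_components].T                 # d x n_components right singular vectors
    T = Xc @ V_lead                              # N x n_components scores
    explained = S**2 / np.sum(S**2)              # share of variance per component
    return T, explained[:n_components]

# Stand-in data (assumption): 500 random points in 90-D with unequal column scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 90)) * np.linspace(3.0, 0.5, 90)
T_demo, ratio_demo = pca_project(X_demo)
print(T_demo.shape, ratio_demo)
```

The explained-variance shares returned here play the role of the 16.6% and 10.0% figures quoted for the real corpus.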

UMAP as a non-linear lens

PCA preserves global geometry but ignores neighbourhoods. UMAP (McInnes, Healy & Melville 2018) fits a fuzzy simplicial set μ in input space (each k-NN edge weighted by a local Riemannian metric) and a corresponding set ν in 2-D, then minimises their cross-entropy

CE(\mu, \nu) = \sum_{i,j} \left[ \mu_{ij} \log \frac{\mu_{ij}}{\nu_{ij}} + (1 - \mu_{ij}) \log \frac{1 - \mu_{ij}}{1 - \nu_{ij}} \right].

Under UMAP's assumption that the data are uniformly distributed on a Riemannian manifold, minimising this objective preserves local topology at the cost of tearing global geometry. The result is a layout that surfaces tight neighbourhoods that PCA flattens.
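To make the objective concrete, the sketch below evaluates the cross-entropy for two small matrices of edge membership strengths. It only computes the loss. It is not how umap-learn optimises it, which uses stochastic gradient descent with sampling. The function name `fuzzy_cross_entropy` and the clipping constant `eps` are assumptions added for numerical safety.

```python
import numpy as np

def fuzzy_cross_entropy(mu, nu, eps=1e-12):
    """Two-term cross-entropy between membership strengths mu (input) and nu (embedding)."""
    mu = np.clip(np.asarray(mu, dtype=float), eps, 1.0 - eps)
    nu = np.clip(np.asarray(nu, dtype=float), eps, 1.0 - eps)
    attract = mu * np.log(mu / nu)                          # penalises missing edges
    repel = (1.0 - mu) * np.log((1.0 - mu) / (1.0 - nu))    # penalises spurious edges
    return float(np.sum(attract + repel))

mu_demo = np.array([[0.9, 0.1], [0.2, 0.8]])
ce_same = fuzzy_cross_entropy(mu_demo, mu_demo)                  # identical sets
ce_diff = fuzzy_cross_entropy(mu_demo, np.full((2, 2), 0.5))     # uninformative layout
print(ce_same, ce_diff)
```

The loss vanishes when the two sets agree and is positive otherwise, which is why minimising it pulls the 2-D neighbourhood structure toward the input-space one.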

Connectivity and modularity

Build the k-NN graph (k = 15) on the 2-D embedding. Let A be its adjacency matrix, k_i the degree of node i, m the edge count, and c_i a label in {robust, fragile}. Newman’s modularity

Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)

measures how much edge density inside each label exceeds the configuration-model null. We use a cheap proxy: Q̃, the fraction of edges whose endpoints share a label. If endpoint labels were assigned independently with robust rate r̄, two endpoints would agree with probability r̄² + (1−r̄)², which is the chance baseline; with r̄ = 0.0687 that gives 0.872. The empirical value on the production embedding is Q̃ ≈ 0.903, a real but small lift of about 0.031 over chance, consistent with weak clustering.
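The three quantities in this subsection can be sketched on a toy graph. The helper names below are assumptions introduced for illustration, and the dense-matrix approach is only practical for small graphs. A 100,000-node graph would need sparse matrices.

```python
import numpy as np

def newman_modularity(A, labels):
    """Modularity Q for a symmetric 0/1 adjacency matrix A and one label per node."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    k = A.sum(axis=1)                              # node degrees k_i
    two_m = k.sum()                                # 2m: every undirected edge counted twice
    same = labels[:, None] == labels[None, :]      # delta(c_i, c_j)
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)

def same_label_edge_fraction(A, labels):
    """Proxy: share of edges whose two endpoints carry the same label."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    return float((A * same).sum() / A.sum())

def chance_baseline(r_bar):
    """Agreement probability for two independently labelled endpoints."""
    return r_bar**2 + (1.0 - r_bar) ** 2

# Toy graph: two disjoint triangles, one all-robust and one all-fragile.
tri = np.ones((3, 3)) - np.eye(3)
A_demo = np.block([[tri, np.zeros((3, 3))], [np.zeros((3, 3)), tri]])
labels_demo = np.array([1, 1, 1, 0, 0, 0])
print(round(newman_modularity(A_demo, labels_demo), 3))   # 0.5
print(same_label_edge_fraction(A_demo, labels_demo))      # 1.0
print(round(chance_baseline(0.0687), 3))                  # 0.872
```

Note that the proxy is compared against its own chance baseline rather than against Q directly, since the two live on different scales.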

Worked example

10 deepest-WFO assets, 100,000-strategy stratified subsample, 6-window metrics feature.
Robustness rate r̄ = 6.87% → 6,869 robust vs 93,131 fragile.
UMAP modularity proxy 0.903 (vs 0.872 baseline); PCA proxy 0.911.
Number of connected components in the robust-only k-NN subgraph: 1,229 for UMAP and 938 for PCA. That is an average of ≈ 5.6 robust strategies per island under UMAP (6,869 / 1,229) and ≈ 7.3 under PCA (6,869 / 938).

The interactive demo below recomputes the connectivity statistic every time you move the τ slider — drag it down and the robust set merges; drag it up and it shatters.

Demo — synthetic strategy manifold

N points sampled from one diffuse fragile cloud + K tight robust islands. Sweep the robustness threshold τ; watch how the robust subset partitions.

Controls: N (points) = 800; K (planted islands) = 40; τ (robustness threshold) = 0.70; seed = 11.

Readouts: # above τ; rate above τ = 10.00%; components; strats / island = 3.0; modularity Q̃ = 0.847; chance baseline = 0.820.

Amber: r ≥ τ (robust). Grey: r < τ (fragile). Connectivity is computed on an 8-NN graph over the amber subset using union-find. Lift over baseline = +0.027. With the production corpus (N=100,000, real metrics) the analogous numbers are 6,869 robust points, ~1,229 components, Q̃ ≈ 0.903.
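A rough offline analogue of the demo's connectivity readout can be sketched as follows. Two things here are assumptions rather than the demo's actual code: the k-NN graph is built over all points and only amber-to-amber edges are kept (mirroring the reference invocation), and the generator `make_synthetic` with its score distributions is purely hypothetical.

```python
import numpy as np

def knn_indices(P, k):
    """Brute-force indices of the k nearest neighbours of each row (self excluded)."""
    d2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def robust_components(P, robust, k=8):
    """Count connected components among robust points with union-find.

    Assumption: edges come from the k-NN graph over all points, and only
    edges joining two robust points are kept.
    """
    n = len(P)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    nbrs = knn_indices(P, k)
    for i in range(n):
        if not robust[i]:
            continue
        for j in nbrs[i]:
            j = int(j)
            if robust[j]:
                parent[find(i)] = find(j)   # union the two sets
    return len({find(i) for i in range(n) if robust[i]})

def make_synthetic(n=800, n_islands=40, seed=11):
    """Hypothetical generator: a diffuse cloud plus tight islands with higher scores."""
    rng = np.random.default_rng(seed)
    n_island_pts = n // 4
    centres = rng.uniform(-4.0, 4.0, size=(n_islands, 2))
    which = rng.integers(0, n_islands, size=n_island_pts)
    island_pts = centres[which] + rng.normal(scale=0.05, size=(n_island_pts, 2))
    cloud_pts = rng.normal(scale=2.0, size=(n - n_island_pts, 2))
    P = np.vstack([island_pts, cloud_pts])
    score = np.concatenate([
        rng.uniform(0.4, 1.0, size=n_island_pts),        # islands skew robust
        rng.uniform(0.0, 0.75, size=n - n_island_pts),   # cloud skews fragile
    ])
    return P, score

P_demo, score_demo = make_synthetic()
for tau in (0.5, 0.7, 0.9):
    mask = score_demo >= tau
    print(tau, int(mask.sum()), robust_components(P_demo, mask))
```

Each printed row gives the threshold, the number of points at or above it, and the resulting component count, which is the same triple the slider exposes.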

Figures

Fig. 1. UMAP embedding of the 10-asset, 100,000-strategy corpus. Robust strategies (amber) do not coalesce into one region; they appear as small dense pockets distributed across the fragile mass.

Fig. 2. PCA embedding of the same population in the linear PC1–PC2 plane. PC1 explains 16.6%, PC2 10.0%. The robust signal is more diffuse than under UMAP: local structure is what carries the cluster information.

Fig. 3. Histogram of robust connected-component sizes. One large component (~2,400 strategies, ~35% of the robust set) coexists with ~1,200 smaller islands, most of size 1–10. The long tail is the headline.

Fig. 4. UMAP embedding for the ALGO single-asset corpus. Per-asset run on ALGO 30m 6W (15,000 strategies, 1,765 robust): the same fragmentation pattern reproduces, with 213 components and ~8 robust strategies per island.

Fig. 5. Synthetic three-blob smoke test under UMAP. When robustness IS a contiguous mode in feature space, by construction, the connectivity measure correctly returns a small number of large components. The fragmentation observed on real strategies is not an artefact of the embedding.
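The logic of the three-blob smoke test can be sketched as below. This is an assumption-laden simplification: the blob positions and sizes are invented for illustration, and the UMAP step is omitted so that only the connectivity measure is exercised (one could pass `pts` through `umap.UMAP` first).

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph

# Three well-separated blobs; blob 0 is declared robust by construction.
pts, blob_id = make_blobs(
    n_samples=600,
    centers=[[-10.0, 0.0], [0.0, 10.0], [10.0, 0.0]],
    cluster_std=1.0,
    random_state=0,
)
robust = blob_id == 0

# 15-NN graph over all points, restricted to robust-robust edges.
A = kneighbors_graph(pts, n_neighbors=15, mode="connectivity")
A_robust = A[robust][:, robust]
n_comp, _ = connected_components(A_robust, directed=False)
print("robust points:", int(robust.sum()), "components:", n_comp)
```

When the robust label coincides with one contiguous blob, the component count should stay small relative to the number of robust points, in contrast to the hundreds of components reported for the real corpus.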

Why this matters for systematic strategies

Many search procedures over strategy space (gradient ascent on a smoothed score, evolutionary crossover, Bayesian optimisation) implicitly assume the robust region is locally convex: that small perturbations of a robust strategy stay robust. The connectivity analysis directly contradicts that assumption. The robust population is not a single connected manifold at any scale we’ve checked. It is one larger component (about a third of the robust set, Fig. 3) plus a constellation of small islands, with a mean of ~5–8 robust strategies per component, separated by fragile gaps.

Operationally, two consequences. First, edge cannot be reliably reached by perturbing a known good strategy — the neighbours of a robust strategy are fragile with high probability. Second, the robust subset must be enumerated combinatorially over the indicator/transform/confluence grid, not recovered by local search. The pipeline that downstream models consume already respects this: candidates are generated combinatorially and only then filtered, never optimised toward.

Reproducibility

DaruFinance / strategy-manifold

Python — open source reference implementation

Minimal invocation

import numpy as np
import umap
from scipy.sparse.csgraph import connected_components
from sklearn.decomposition import PCA
from sklearn.neighbors import kneighbors_graph

# X: N x d feature matrix (rows = strategies, cols = per-window metrics).
# r: length-N {0,1} vector; 1 = passed the proprietary robustness funnel.
# Both must be supplied by the caller before running this script.
X = np.asarray(X, dtype=float)
robust = np.asarray(r).astype(bool)
N, d = X.shape

# 1) Linear baseline (PCA centres the columns internally).
pca = PCA(n_components=2).fit(X)
T_pca = pca.transform(X)
print("PC1 var:", pca.explained_variance_ratio_[0])
print("PC2 var:", pca.explained_variance_ratio_[1])

# 2) Non-linear embedding.
emb = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X)

# 3) Connectivity of the robust subset under the 15-NN graph:
#    keep only edges whose endpoints are both robust, then count components.
A = kneighbors_graph(emb, n_neighbors=15, mode="connectivity")
A_robust = A[robust][:, robust]
n_components, _ = connected_components(A_robust, directed=False)
print("robust components:", n_components)

References

[1] Pearson, K. (1901). On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine 2(11), 559–572.
[2] McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426.
[3] Newman, M. E. J. (2006). Modularity and community structure in networks. PNAS 103(23), 8577–8582.
