M/05 · Strategy Manifold
PCA + UMAP geometry of the strategy population
Embed 100,000 walk-forward strategies from 90-D metric space into 2-D, then ask whether robustness is a contiguous region or a constellation of isolated islands.
The mathematics
Each strategy is a point x ∈ ℝᵈ (d = 90) in a high-dimensional feature space whose components are per-window metrics (Sharpe ratio, profit factor (PF), maximum drawdown (MaxDD)) across the last 6 walk-forward windows, evaluated under Daru Finance’s proprietary perturbation suite. With N = 100,000 strategies we have an N × d matrix X. We ask: in this space, do robust strategies cluster, and if so, does the cluster form a single connected region or many isolated islands?
Principal-component projection
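Centre the columns of X to obtain X̃ and take the singular value decomposition:
$$\tilde{X} = U \Sigma V^{\top}, \qquad T = \tilde{X} V_{2} = U_{2} \Sigma_{2}$$
where V₂ holds the first two columns of V, and U₂ and Σ₂ are the matching blocks of U and Σ. T is the linear projection onto the leading two principal directions. On the production corpus PC1 explains 16.6% of the variance and PC2 explains 10.0%, i.e. the corpus has strong low-dimensional structure: two coordinates out of 90 already capture about a quarter (26.6%) of the variance.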
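The projection described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic Gaussian data, not the production corpus; the function name `pca_project` and the array `X_demo` are assumptions introduced here.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the leading principal directions via SVD."""
    Xc = X - X.mean(axis=0)                       # centre each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = Xc @ Vt[:n_components].T                  # scores; equals U[:, :k] * s[:k]
    var_ratio = s**2 / np.sum(s**2)               # variance share per component
    return T, var_ratio[:n_components]

# Synthetic stand-in for the N x d metric matrix (NOT the production data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 90))
T_demo, ratio_demo = pca_project(X_demo)
print(T_demo.shape, ratio_demo)
```

On isotropic noise like this the two leading ratios stay close to 1/90 each, which is the reference point against which the reported 16.6% and 10.0% should be read.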
UMAP as a non-linear lens
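PCA preserves global geometry but ignores neighbourhoods. UMAP (McInnes, Healy & Melville 2018) fits a fuzzy simplicial set μ in input space (each k-NN edge weighted by a local Riemannian metric) and a corresponding set ν in 2-D, then minimises their cross-entropy over the edge set E:
$$C(\mu, \nu) = \sum_{e \in E} \left[ \mu(e) \log\frac{\mu(e)}{\nu(e)} + \bigl(1 - \mu(e)\bigr) \log\frac{1 - \mu(e)}{1 - \nu(e)} \right]$$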
Under continuity assumptions on a Riemannian uniform manifold this preserves local topology while tearing global geometry. The result is a layout that surfaces tight neighbourhoods PCA flattens.
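To make the cross-entropy objective concrete, the sketch below evaluates it for two vectors of edge memberships defined over the same edges. It only evaluates the objective; it does not optimise it, and it is not how the `umap` package computes anything internally. The function name and the toy membership values are assumptions for illustration.

```python
import math

def fuzzy_cross_entropy(mu, nu, eps=1e-12):
    """Cross-entropy between two fuzzy edge-membership vectors over the same edges."""
    total = 0.0
    for m, n in zip(mu, nu):
        n = min(max(n, eps), 1.0 - eps)    # clamp nu so both logs stay finite
        if m > 0.0:
            total += m * math.log(m / n)                        # attractive term
        if m < 1.0:
            total += (1.0 - m) * math.log((1.0 - m) / (1.0 - n))  # repulsive term
    return total

mu_demo = [0.9, 0.5, 0.1]                  # toy input-space memberships
print(fuzzy_cross_entropy(mu_demo, mu_demo))           # layout matches input
print(fuzzy_cross_entropy(mu_demo, [0.1, 0.5, 0.9]))   # layout inverts input
```

The first term penalises strong input-space edges that the layout pulls apart, and the second penalises weak edges that the layout pushes together, which is why local neighbourhoods are what the layout preserves.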
Connectivity and modularity
Build the k-NN graph (k = 15) on the 2-D embedding. Let A be its adjacency matrix, k_i the degree of node i, m the edge count, and c_i a label in {robust, fragile}. Newman’s modularity
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
measures how much edge density inside each label exceeds the configuration-model null. We use a cheap proxy Q̃ = fraction of edges whose endpoints share a label. The chance baseline is r̄² + (1−r̄)², where r̄ is the robust fraction; with r̄ = 0.0687 that gives 0.872. The empirical value on the production embedding is Q̃ ≈ 0.903, a lift of about 0.031 over chance: real but small, consistent with weak clustering.
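The relation between Newman’s modularity and the cheaper same-label proxy can be checked on a toy graph. The sketch below is pure Python and assumes an undirected simple graph given as an edge list; the function names and the two-triangle example are illustrative, not part of the reference implementation.

```python
def same_label_fraction(edges, labels):
    """Proxy: fraction of edges whose two endpoints carry the same label."""
    return sum(1 for i, j in edges if labels[i] == labels[j]) / len(edges)

def modularity(edges, labels):
    """Newman's Q, grouped by label: sum_c [e_cc / m - (d_c / 2m)^2]."""
    m = len(edges)
    degree_by_label = {}
    for i, j in edges:
        for node in (i, j):
            degree_by_label[labels[node]] = degree_by_label.get(labels[node], 0) + 1
    expected = sum((d / (2 * m)) ** 2 for d in degree_by_label.values())
    return same_label_fraction(edges, labels) - expected

def chance_baseline(r):
    """Same-label probability for two nodes labelled independently at rate r."""
    return r ** 2 + (1 - r) ** 2

# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = ["robust"] * 3 + ["fragile"] * 3

q = modularity(edges, labels)
proxy_lift = same_label_fraction(edges, labels) - chance_baseline(0.5)
print(q, proxy_lift)
```

On this balanced toy graph the two quantities coincide. In general they differ, because the configuration-model null weights each label by its total degree while the chance baseline weights it by its node share.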
Worked example
- 10 deepest-WFO assets, 100,000-strategy stratified subsample, 6-window metrics feature.
- Robustness rate r̄ = 6.87% → 6,869 robust vs 93,131 fragile.
- UMAP modularity proxy 0.903 (vs 0.872 baseline); PCA proxy 0.911.
- Number of connected components in the robust-only k-NN subgraph: 1,229 for UMAP and 938 for PCA. Average ≈ 5.6 robust strategies per island under UMAP (6,869 / 1,229) and ≈ 7.3 under PCA (6,869 / 938).
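The arithmetic behind these figures can be re-derived directly from the reported counts. The snippet below only recombines numbers stated in this section; the variable names are illustrative.

```python
N = 100_000
n_robust = 6_869
n_fragile = N - n_robust

r_bar = n_robust / N                          # robustness rate
baseline = r_bar ** 2 + (1 - r_bar) ** 2      # chance same-label edge fraction

proxy = {"UMAP": 0.903, "PCA": 0.911}         # reported modularity proxies
components = {"UMAP": 1_229, "PCA": 938}      # reported robust components

for name in proxy:
    lift = proxy[name] - baseline
    island_size = n_robust / components[name]
    print(f"{name}: lift over chance {lift:.3f}, mean island size {island_size:.1f}")
```

Both embeddings give a lift of a few hundredths over chance and a mean island size in the single digits.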
The interactive demo below reads the measured connectivity statistics straight from the committed analysis. Switch between corpora and embeddings to see how the robust subset fragments at each scale.
Demo: synthetic strategy manifold
N points sampled from one diffuse fragile cloud + K tight robust islands. Sweep the robustness threshold τ; watch how the robust subset partitions.
Amber: r ≥ τ (robust). Grey: r < τ (fragile). Connectivity is computed on an 8-NN graph over the amber subset using union-find. Lift over baseline = +0.027. With the production corpus (N=100,000, real metrics) the analogous numbers are 6,869 robust points, ~1,229 components, Q̃ ≈ 0.903.
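A rough, dependency-free sketch of what the demo computes is given below: a diffuse fragile cloud plus tight robust islands, a k-NN graph over the points with r ≥ τ, and a union-find pass to count components. The island layout (centres on a circle), the score distributions, and all function names are assumptions made for this sketch; the demo's actual sampler is not described in the text.

```python
import math
import random

def make_population(n_fragile=300, n_islands=5, per_island=12, seed=0):
    """Return [(x, y, r)]: one diffuse low-r cloud plus tight high-r islands."""
    rng = random.Random(seed)
    pts = [(rng.gauss(0, 3), rng.gauss(0, 3), rng.uniform(0.0, 0.5))
           for _ in range(n_fragile)]
    for c in range(n_islands):
        angle = 2 * math.pi * c / n_islands
        cx, cy = 4 * math.cos(angle), 4 * math.sin(angle)   # centres on a circle
        # Each island holds more than k points, so with only islands in play
        # every point's k nearest neighbours lie inside its own island.
        pts += [(cx + rng.gauss(0, 0.05), cy + rng.gauss(0, 0.05),
                 rng.uniform(0.6, 1.0)) for _ in range(per_island)]
    return pts

def count_components(points, k=8):
    """Connected components of the k-NN graph (edges treated as undirected)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]     # path halving
            i = parent[i]
        return i

    for i, (xi, yi) in enumerate(points):
        nearest = sorted(((xi - xj) ** 2 + (yi - yj) ** 2, j)
                         for j, (xj, yj) in enumerate(points) if j != i)
        for _, j in nearest[:k]:
            parent[find(i)] = find(j)         # union the two components
    return len({find(i) for i in range(len(points))})

def robust_components(pop, tau, k=8):
    """Threshold at tau, then count components over the robust subset only."""
    robust = [(x, y) for x, y, r in pop if r >= tau]
    n_comp = count_components(robust, k) if robust else 0
    return len(robust), n_comp

pop = make_population()
for tau in (0.2, 0.4, 0.55, 0.8):
    print(tau, robust_components(pop, tau))
```

Note that building the graph over the robust subset alone forces every component to contain at least k + 1 points, whereas building it over all points and then restricting to robust nodes allows much smaller islands.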
Why this matters for systematic strategies
Many search procedures over strategy space (gradient ascent on a smoothed score, evolutionary crossover, Bayesian optimisation) implicitly assume the robust region is locally convex, i.e. that small perturbations of a robust strategy stay robust. The connectivity analysis directly contradicts that assumption. The robust population is not a single connected manifold at any scale we’ve checked; it is a constellation of islands of roughly 5–8 strategies each, separated by fragile gaps.
Operationally, there are two consequences. First, edge cannot be reliably reached by perturbing a known good strategy: the neighbours of a robust strategy are fragile with high probability. Second, the robust subset must be enumerated combinatorially over the indicator/transform/confluence grid, not recovered by local search. The pipeline that downstream models consume already respects this: candidates are generated combinatorially and only then filtered, never optimised toward.
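The enumerate-then-filter pattern can be sketched as follows. Every name here is hypothetical: the grid values, the function names, and in particular `toy_is_robust`, which merely stands in for the proprietary robustness funnel and says nothing about how that funnel works.

```python
from itertools import product

# Hypothetical grid axes; the real indicator/transform/confluence values
# are not given in the text.
INDICATORS = ["ind_a", "ind_b", "ind_c"]
TRANSFORMS = ["raw", "zscore"]
CONFLUENCES = [1, 2]

def enumerate_candidates():
    """Every (indicator, transform, confluence) combination, with no search step."""
    return list(product(INDICATORS, TRANSFORMS, CONFLUENCES))

def filter_robust(candidates, is_robust):
    """Keep the candidates that pass a caller-supplied robustness check."""
    return [c for c in candidates if is_robust(c)]

def toy_is_robust(candidate):
    """Placeholder predicate standing in for the real robustness funnel."""
    _indicator, transform, confluence = candidate
    return transform == "zscore" and confluence == 2

candidates = enumerate_candidates()
survivors = filter_robust(candidates, toy_is_robust)
print(len(candidates), len(survivors))
```

The point of the structure is that no candidate is ever derived from another: the filter sees the full grid, so an isolated robust island is found whether or not it has robust neighbours.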
Reproducibility
DaruFinance / strategy-manifold
Python · open source reference implementation
Minimal invocation
import numpy as np
import umap
from scipy.sparse.csgraph import connected_components
from sklearn.decomposition import PCA
from sklearn.neighbors import kneighbors_graph

# X: N x d feature matrix (rows = strategies, cols = per-window metrics).
# r: length-N {0,1} vector, 1 = passed the proprietary robustness funnel.
# Both are assumed to be loaded by the caller before this point.
N, d = X.shape

# 1) Linear baseline (scikit-learn's PCA centres the columns internally).
pca = PCA(n_components=2).fit(X)
T_pca = pca.transform(X)
print("PC1 var:", pca.explained_variance_ratio_[0])
print("PC2 var:", pca.explained_variance_ratio_[1])

# 2) Non-linear embedding.
emb = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X)

# 3) Connectivity of the robust subset under the 15-NN graph.
A = kneighbors_graph(emb, n_neighbors=15, mode="connectivity")
robust = np.asarray(r) == 1        # boolean mask over strategies
A_robust = A[robust][:, robust]    # keep only robust-to-robust edges
n_components, _ = connected_components(A_robust, directed=False)
print("robust components:", n_components)
References
- [1] Pearson, K. (1901). On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine 2(11), 559–572.
- [2] McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426.
- [3] Newman, M. E. J. (2006). Modularity and community structure in networks. PNAS 103(23), 8577–8582.

