M/05 · Strategy Manifold

PCA + UMAP geometry of the strategy population

Embed 100,000 walk-forward strategies from 90-D metric space into 2-D, then ask whether robustness is a contiguous region or a constellation of isolated islands.

The mathematics

Each strategy is a point x ∈ ℝᵈ in a high-dimensional feature space whose components are per-window metrics (Sharpe ratio, profit factor, maximum drawdown) across the last 6 walk-forward windows, evaluated under Daru Finance’s proprietary perturbation suite. With N = 100,000 strategies and d = 90 we have an N × d matrix X. We ask: in this space, do robust strategies cluster, and if so, does the cluster form a single connected region or many isolated islands?

Principal-component projection

Centre the columns of X and take the singular value decomposition:

X = U Σ Vᵀ

With V₂ denoting the first two columns of V, the score matrix T = X V₂ is the linear projection onto the leading two principal directions. On the production corpus PC1 explains 16.6% of the variance and PC2 explains 10.0%, i.e. the population has substantial low-dimensional structure: two coordinates out of 90 already capture about a quarter (26.6%) of the variance.
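The projection can be sketched in a few lines of NumPy: centre the columns, take the SVD, and keep the first two right singular vectors. The function name `pca_project` and the random toy matrix are illustrative assumptions, not part of the reference implementation.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the leading principal directions via SVD."""
    Xc = X - X.mean(axis=0)                       # centre each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = Xc @ Vt[:n_components].T                  # scores; equals U[:, :k] * s[:k]
    explained = s**2 / np.sum(s**2)               # variance share per component
    return T, explained[:n_components]

# Toy stand-in for the real N x d matrix (hypothetical data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 90)) @ rng.normal(size=(90, 90))
T, ratio = pca_project(X_toy)
print(T.shape, ratio)
```

The `ratio` values play the role of the 16.6% and 10.0% figures quoted for the production corpus.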

UMAP as a non-linear lens

PCA preserves global geometry but ignores neighbourhoods. UMAP (McInnes, Healy & Melville 2018) fits a fuzzy simplicial set μ in input space (each k-NN edge weighted by a local Riemannian metric) and a corresponding set ν in 2-D, then minimises their cross-entropy

C(μ, ν) = Σₑ [ μ(e) log(μ(e) / ν(e)) + (1 − μ(e)) log((1 − μ(e)) / (1 − ν(e))) ]

where the sum runs over candidate edges e. Under UMAP’s assumptions (data uniformly distributed on a locally connected Riemannian manifold) this preserves local topology while tearing global geometry. The result is a layout that surfaces tight neighbourhoods that PCA flattens.
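To make the objective concrete, the sketch below evaluates the fuzzy-set cross-entropy for given edge memberships. It only scores a layout; UMAP itself optimises this quantity stochastically. The function name `fuzzy_cross_entropy`, the clipping constant `eps`, and the toy membership vectors are assumptions for illustration.

```python
import numpy as np

def fuzzy_cross_entropy(mu, nu, eps=1e-12):
    """Cross-entropy between two fuzzy edge-membership vectors in [0, 1]."""
    mu = np.clip(np.asarray(mu, dtype=float), eps, 1 - eps)
    nu = np.clip(np.asarray(nu, dtype=float), eps, 1 - eps)
    attract = mu * np.log(mu / nu)                        # penalises lost edges
    repel = (1 - mu) * np.log((1 - mu) / (1 - nu))        # penalises spurious edges
    return float(np.sum(attract + repel))

# Toy memberships: input-space graph mu, and two candidate 2-D layouts.
mu = np.array([0.9, 0.8, 0.1, 0.05])
nu_good = np.array([0.85, 0.75, 0.15, 0.05])   # keeps the same neighbours
nu_bad = np.array([0.1, 0.2, 0.9, 0.8])        # swaps near and far
print(fuzzy_cross_entropy(mu, nu_good), fuzzy_cross_entropy(mu, nu_bad))
```

A layout that keeps strong edges strong and weak edges weak scores lower, which is the sense in which local neighbourhoods are preserved.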

Connectivity and modularity

Build the k-NN graph (k = 15) on the 2-D embedding. Let A be its adjacency matrix, k_i the degree of node i, m the edge count, and c_i a label in {robust, fragile}. Newman’s modularity

Q = (1 / 2m) Σ_ij [ A_ij − k_i k_j / (2m) ] δ(c_i, c_j)

measures how much edge density inside each label exceeds the configuration-model null. We use a cheap proxy, Q̃ = fraction of edges whose endpoints share a label. The chance baseline is r̄² + (1−r̄)², where r̄ is the robustness rate; with r̄ = 0.0687 that gives 0.872. The empirical value on the production embedding is Q̃ ≈ 0.903, a real but small lift (+0.031) over chance, consistent with weak clustering.
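As a rough illustration of how the full modularity and the cheap proxy relate, the sketch below computes both on a toy graph of two triangles joined by one bridge edge. The function names and the toy graph are assumptions; the dense-matrix form is only practical for small graphs.

```python
import numpy as np

def newman_modularity(A, labels):
    """Q = (1/2m) * sum_ij [A_ij - k_i k_j / 2m] * delta(c_i, c_j), symmetric A."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                    # node degrees
    two_m = A.sum()                      # equals 2m for a symmetric adjacency
    same = labels[:, None] == labels[None, :]
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)

def same_label_edge_fraction(A, labels):
    """Proxy: fraction of undirected edges whose endpoints share a label."""
    i, j = np.nonzero(np.triu(A, k=1))   # each undirected edge once
    return float(np.mean(labels[i] == labels[j]))

def chance_baseline(rate):
    """Expected proxy value if labels were assigned at random with this rate."""
    return rate**2 + (1 - rate)**2

# Toy graph: two triangles (one robust, one fragile) joined by a bridge.
A_toy = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A_toy[i, j] = A_toy[j, i] = 1.0
labels_toy = np.array([1, 1, 1, 0, 0, 0])

print(newman_modularity(A_toy, labels_toy))
print(same_label_edge_fraction(A_toy, labels_toy))
print(chance_baseline(0.0687))
```

The proxy is always read against `chance_baseline`, because with a rare label most edges join two fragile points regardless of any clustering.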

Worked example

  • 10 deepest-WFO assets, 100,000-strategy stratified subsample, 6-window metrics feature.
  • Robustness rate = 6.87% → 6,869 robust vs 93,131 fragile.
  • UMAP modularity proxy 0.903 (vs 0.872 baseline); PCA proxy 0.911.
  • Number of connected components in the robust-only k-NN subgraph: 1,229 for UMAP and 938 for PCA. That is a mean of ≈ 5.6 robust strategies per island under UMAP (≈ 7.3 under PCA).
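The figures in this worked example can be cross-checked with plain arithmetic. The snippet below uses only the numbers quoted in the list; the variable names are illustrative.

```python
n_total = 100_000
n_robust = 6_869
n_fragile = n_total - n_robust

rate = n_robust / n_total                 # robustness rate
baseline = rate**2 + (1 - rate)**2        # chance value of the proxy

components = {"UMAP": 1_229, "PCA": 938}  # robust-only component counts
proxy = {"UMAP": 0.903, "PCA": 0.911}     # measured modularity proxies

summary = {}
for name, n_comp in components.items():
    mean_island = n_robust / n_comp       # mean robust strategies per island
    lift = proxy[name] - baseline         # proxy lift over chance
    summary[name] = (mean_island, lift)
    print(f"{name}: {mean_island:.1f} robust per island, lift {lift:+.3f}")
```

Note that the per-island figure is a mean, so a single large component can pull it well above the typical island size.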

The interactive demo below reads the measured connectivity statistics straight from the committed analysis. Switch between corpora and embeddings to see how the robust subset fragments at each scale.

Demo: synthetic strategy manifold

N points sampled from one diffuse fragile cloud + K tight robust islands. Sweep the robustness threshold τ; watch how the robust subset partitions.

  • Settings: N (points) = 800; K (planted islands) = 40; τ (robustness threshold) = 0.70; seed = 11.
  • Points above τ: 80 (rate 10.00%).
  • Components: 27 (3.0 strategies per island).
  • Modularity proxy Q̃: 0.847; chance baseline: 0.820.
  • Plot axes: embedding axis 1, embedding axis 2.

Amber: r ≥ τ (robust). Grey: r < τ (fragile). Connectivity is computed on an 8-NN graph over the amber subset using union-find. Lift over baseline = +0.027. With the production corpus (N=100,000, real metrics) the analogous numbers are 6,869 robust points, ~1,229 components, Q̃ ≈ 0.903.
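A minimal sketch of this kind of experiment follows. The demo's actual generator is not shown on this page, so everything here is an assumption: the Gaussian cloud, the island spread, the score distributions, and the reading of "8-NN graph over the amber subset" as an 8-NN graph built on all points and then restricted to amber-to-amber edges. It is not expected to reproduce the demo's exact readout.

```python
import numpy as np

def make_population(n=800, k_islands=40, per_island=2, seed=11):
    """Hypothetical generator: one diffuse cloud plus k tight islands."""
    rng = np.random.default_rng(seed)
    n_island = k_islands * per_island
    n_cloud = n - n_island
    cloud = rng.normal(0.0, 1.0, size=(n_cloud, 2))
    centres = rng.normal(0.0, 1.0, size=(k_islands, 2))
    islands = np.repeat(centres, per_island, axis=0)
    islands = islands + rng.normal(0.0, 0.02, size=(n_island, 2))
    points = np.vstack([cloud, islands])
    # Assumed scores: cloud points below 0.69, island points at or above 0.70.
    score = np.concatenate([
        rng.uniform(0.0, 0.69, size=n_cloud),
        rng.uniform(0.7, 1.0, size=n_island),
    ])
    return points, score

def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def robust_components(points, robust, k=8):
    """Count components among robust points using robust-to-robust k-NN edges."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)                  # a point is not its own neighbour
    nbrs = np.argsort(d2, axis=1)[:, :k]          # k nearest neighbours per point
    parent = list(range(len(points)))
    robust_idx = [int(i) for i in np.flatnonzero(robust)]
    for i in robust_idx:
        for j in nbrs[i]:
            j = int(j)
            if robust[j]:
                ri, rj = find(parent, i), find(parent, j)
                if ri != rj:
                    parent[ri] = rj               # merge the two components
    return len({find(parent, i) for i in robust_idx})

points, score = make_population()
for tau in (0.70, 0.80, 0.90):
    robust = score >= tau
    print(tau, int(robust.sum()), robust_components(points, robust))
```

The brute-force distance matrix is fine at N = 800 but would need a spatial index at production scale.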

Figures

Fig. 1: UMAP embedding of the 10-asset, 100,000-strategy corpus. Robust strategies (blue) do not coalesce into one region; they appear as small dense pockets distributed across the fragile mass.
Fig. 2: Same population in the linear PC1–PC2 plane. PC1 explains 16.6%, PC2 10.0%. The robust signal is more diffuse than under UMAP: local structure is what carries the cluster information.
Fig. 3: Histogram of robust connected-component sizes. One large component (~2,400 strategies, ~35% of the robust set) coexists with ~1,200 smaller islands, most of size 1–10. The long tail is the headline.
Fig. 4: Per-asset run on ALGO 30m 6W (15,000 strategies, 1,765 robust): the same fragmentation pattern reproduces, with 213 components and ~8 robust strategies per island.
Fig. 5: Synthetic three-blob smoke test. When robustness is, by construction, a contiguous mode in feature space, the connectivity measure correctly returns a small number of large components. This indicates that the fragmentation observed on real strategies is not an artefact of the embedding.
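A component-size distribution like the one in Fig. 3 can be summarised from a vector of per-point component labels, such as the second return value of `scipy.sparse.csgraph.connected_components`. The sketch below assumes such a label vector; the function name, the `small_max` cut-off, and the toy labels are illustrative.

```python
import numpy as np

def component_size_summary(labels, small_max=10):
    """Summarise component sizes from integer component labels (one per point)."""
    sizes = np.bincount(labels)
    sizes = sizes[sizes > 0]                      # drop unused label ids
    return {
        "n_components": int(sizes.size),
        "largest": int(sizes.max()),
        "largest_share": float(sizes.max() / sizes.sum()),
        "n_small": int(np.sum(sizes <= small_max)),
        "mean_size": float(sizes.mean()),
        "median_size": float(np.median(sizes)),
    }

# Toy labels: one component of 12 points plus five small ones.
labels_toy = np.array([0] * 12 + [1] * 3 + [2] * 2 + [3, 4, 5])
print(component_size_summary(labels_toy))
```

Reporting the median alongside the mean helps when one large component coexists with many small islands, since the mean alone blends the two.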

Why this matters for systematic strategies

Many search procedures over strategy space (gradient ascent on a smoothed score, evolutionary crossover, Bayesian optimisation) implicitly assume the robust region is locally convex: that small perturbations of a robust strategy stay robust. The connectivity analysis directly contradicts that assumption. The robust population is not a single connected manifold at any scale we’ve checked; it is a constellation of islands averaging ~5–8 strategies each, separated by fragile gaps.

Operationally, there are two consequences. First, edge cannot be reliably reached by perturbing a known good strategy: the neighbours of a robust strategy are fragile with high probability. Second, the robust subset must be enumerated combinatorially over the indicator/transform/confluence grid, not recovered by local search. The pipeline that downstream models consume already respects this: candidates are generated combinatorially and only then filtered, never optimised toward.
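One way to quantify the first consequence is to measure, for each robust point, what share of its k nearest neighbours is also robust. The sketch below is an assumed diagnostic, not a function from the reference implementation; it uses a brute-force distance matrix, so it suits small samples only.

```python
import numpy as np

def robust_neighbour_fraction(points, robust, k=15):
    """Mean share of robust points among the k nearest neighbours of robust points."""
    points = np.asarray(points, dtype=float)
    robust = np.asarray(robust, dtype=bool)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)                  # exclude self-matches
    nbrs = np.argsort(d2, axis=1)[:, :k]          # indices of k nearest neighbours
    share = robust[nbrs].mean(axis=1)             # robust share per point
    return float(share[robust].mean())

# Toy check with labels assigned at random (hypothetical data).
rng = np.random.default_rng(0)
pts = rng.uniform(size=(1000, 2))
lab = rng.uniform(size=1000) < 0.0687
print(robust_neighbour_fraction(pts, lab))
```

With randomly assigned labels the value sits near the robustness rate; a value close to 1 would instead indicate that robust strategies have mostly robust neighbours.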

Reproducibility

DaruFinance / strategy-manifold

Python · open source reference implementation

Minimal invocation

import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.decomposition import PCA
from sklearn.neighbors import kneighbors_graph
import umap  # provided by the umap-learn package

# X: N x d feature matrix (rows = strategies, cols = per-window metrics).
# r: length-N {0,1} vector, 1 = passed the proprietary robustness funnel.
# Both must be loaded by the caller before this point.
N, d = X.shape
robust = np.asarray(r) == 1  # boolean mask of robust strategies

# 1) Linear baseline (sklearn's PCA centres the columns internally).
pca = PCA(n_components=2).fit(X)
T_pca = pca.transform(X)
print("PC1 var:", pca.explained_variance_ratio_[0])

# 2) Non-linear embedding.
emb = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X)

# 3) Connectivity of the robust subset under the 15-NN graph.
# The graph is built on all points, then restricted to robust-to-robust edges.
A = kneighbors_graph(emb, n_neighbors=15, mode="connectivity")
A_robust = A[robust][:, robust]
# directed=False treats every k-NN edge as undirected.
n_components, _ = connected_components(A_robust, directed=False)
print("robust components:", n_components)

References

  1. Pearson, K. (1901). On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine 2(11), 559–572.
  2. McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426.
  3. Newman, M. E. J. (2006). Modularity and community structure in networks. PNAS 103(23), 8577–8582.