Project A · Research program · 14–18 months

Topological Data Analysis in Quantitative Trading

The deepest applied topological treatment of strategy selection and portfolio construction undertaken in public: 11 pre-registered hypotheses and four advanced TDA paradigms, validated against a full modern-ML baseline stack.

Phased roadmap

7 macro phases

Phase 01 · In progress
Foundations & pilot
Bootstrap the public repo, lock operational definitions, build the strategy-space + behavior-space pipeline, run the 3-asset pilot.
Phase 02 · Planned
Geometric analysis
Distances, density, manifold structure, plateau detection. Tests H1 (neighborhood generalization) and H2 (plateau hypothesis).
Phase 03 · Planned
Persistent topology
Single-filtration PH, multipersistence, zigzag, sheaves. Tests H3, H6, H7, and H8 (the four advanced TDA paradigms).
Phase 04 · Planned
Local homology + Fisher–Rao
Per-strategy local PH signatures + information-geometric distances. Tests H10 and H11.
Phase 05 · Planned
Topo-aware portfolios
Mapper-cluster portfolios with topological-spread constraint vs HRP/NCO/MV; cross-crypto universality. Tests H4 and H5.
Phase 06 · Planned
Practitioner workflow
Build out the topo-select / topo-portfolio / topo-audit / topo-refit CLI; live-deployment integration; per-cohort case studies.
Phase 07 · Planned
Synthesis & release
Five papers, applied monograph, deployable software package, long-form daru.finance survey article, Zenodo DOI snapshot.

Status as of 2026-05-09: Phase 1 first-light findings are live in the program log.

Phase 01 · In flight, Phase 1 first-light underway

Foundations & pilot

The corpus is ~1.4M backtested strategies across 27 crypto assets, 8 indicator families, and 6–30 walk-forward windows per asset. Phase 1 builds the unified parquet, the ~50-feature behavior fingerprint per strategy, the per-family parameter parser, and the distance-matrix infrastructure that scales (Faiss IVF-PQ for the 1M+ corpora).

First-light findings on 3 assets (BTC, DOGE, AVAX; 127,879 strategies; 24.8M long-form rows; 13 minutes of wall-time): the kNN-aggregate predicts OOS Sharpe with mean Pearson 0.94 and Spearman 0.37, top-H₀ persistence is uniform at 6.0–7.1 z-units across cohorts, top-H₁ persistence (0.47–1.46) discriminates indicator families, and HDBSCAN finds zero clusters in 23/24 cohorts. The behavior-space manifold is smooth, not partitioned, which already changes the H4 method choice from raw HDBSCAN to Mapper / ToMATo / sublevel-set persistence.
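The kNN-aggregate mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: the aggregate is taken to be the plain mean over the k nearest neighbours, the distance is Euclidean on the behavior fingerprint, and the search is brute force. The function name `knn_aggregate` is hypothetical, and at corpus scale the pipeline relies on Faiss IVF-PQ rather than an O(n²) loop.

```python
import math

def knn_aggregate(fingerprints, values, k):
    """For each strategy, average `values` over its k nearest neighbours.

    Assumptions (not taken from the program): Euclidean distance on the
    fingerprint vectors, an unweighted mean, and the strategy itself excluded.
    """
    n = len(fingerprints)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of strategies")
    aggregates = []
    for i in range(n):
        # Brute-force neighbour search; fine for a toy cohort only.
        dists = sorted(
            (math.dist(fingerprints[i], fingerprints[j]), j)
            for j in range(n) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        aggregates.append(sum(values[j] for j in neighbours) / k)
    return aggregates

# Toy cohort: two tight pairs of strategies far apart in behavior-space.
toy_fingerprints = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
toy_sharpe = [1.0, 3.0, -1.0, -3.0]
toy_aggregate = knn_aggregate(toy_fingerprints, toy_sharpe, k=1)
```

The resulting per-strategy aggregate is the quantity that would then be correlated (Pearson, Spearman) against realised OOS Sharpe.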

Falsifiability is non-negotiable. Every empirical chapter has a pre-registered falsification criterion locked at the end of Phase 1.
Phase gate: pre-registration must be locked and the pilot reproducible bit-for-bit before any predictive claim is filed.

Phase 02 · Planned, kicks off after Phase 1 lock

Geometric analysis (Part II of the monograph)

Six distance metrics on strategy populations (parameter-space, behavior-space, equity-curve W₂, trade-level Wasserstein, Gromov–Wasserstein, and Fisher–Rao) are characterised in Ch 4. Density and manifold structure (Ch 5) report intrinsic dimension via TwoNN and MLE, and run the manifold-learning suite for visualisation only. Plateau detection (Ch 6) combines persistent superlevel sets on IS Sharpe, discrete Morse theory on the parameter grid, HDBSCAN on the joint (parameter, IS Sharpe) space, and SCMS ridge fits.
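For readers unfamiliar with TwoNN, a minimal sketch of the estimator is shown below. It uses the maximum-likelihood form, d = N / Σ log(r₂/r₁), where r₁ and r₂ are each point's first and second nearest-neighbour distances. The brute-force search, the function name `twonn_dimension`, and the synthetic test sheet are illustrative assumptions, not the program's implementation.

```python
import math
import random

def twonn_dimension(points):
    """Maximum-likelihood TwoNN estimate: d = N / sum(log(r2 / r1))."""
    log_ratios = []
    for i, p in enumerate(points):
        r1 = r2 = math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < r1:
                r1, r2 = d, r1      # new nearest; old nearest becomes second
            elif d < r2:
                r2 = d
        if r1 > 0 and r2 > r1:      # skip duplicates and exact ties
            log_ratios.append(math.log(r2 / r1))
    if not log_ratios:
        raise ValueError("need at least three distinct points")
    return len(log_ratios) / sum(log_ratios)

# Synthetic check: a flat 2-D sheet embedded in 3-D should come out near 2.
rng = random.Random(0)
sheet = [(rng.random(), rng.random(), 0.0) for _ in range(400)]
estimate = twonn_dimension(sheet)
```

Because only the two nearest neighbours enter the estimate, it is relatively insensitive to curvature and density variation at larger scales, which is one reason it pairs well with a second estimator as a cross-check.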

Two hypothesis tests: H1 (kNN aggregates strictly improve OOS-Sharpe prediction over self-historical baselines, ΔR² ≥ 0.05) and H2 (plateau strategies strictly outperform isolated peaks at matched IS-Sharpe quintile, median ΔSharpe ≥ 0.10). Results land in Paper A.

Phase 03 · Planned, the core TDA contribution

Persistent topology, four paradigms

Single-filtration PH (Ch 7) computes Vietoris–Rips, alpha, cubical, and witness complexes across all 27 assets, all families, all cohorts; PDs are vectorised into persistence images, persistence landscapes, sliced-Wasserstein kernel features, and PersLay embeddings.
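As a concrete anchor for the simplest case, the H₀ part of a Vietoris–Rips persistence diagram can be computed without any TDA library: every point is born at scale 0, and the finite death times are exactly the edge lengths of a minimum spanning tree. The sketch below assumes Euclidean distance and the convention that an edge enters the filtration at scale equal to its length; the function name `h0_persistence` is hypothetical, and higher-dimensional homology needs a dedicated library.

```python
import math

def h0_persistence(points):
    """Finite H0 death times of the Vietoris-Rips filtration (births are all 0).

    A component dies when an edge first merges it into another component,
    so the deaths are the minimum-spanning-tree edge lengths (Kruskal).
    """
    n = len(points)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:               # edge joins two components
            parent[root_i] = root_j
            deaths.append(length)
    return deaths  # n - 1 values, ascending; one component never dies

# Two clusters on a line, separated by a gap of 8.
toy_points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
toy_deaths = h0_persistence(toy_points)  # [1.0, 1.0, 1.0, 1.0, 8.0]
```

The single long bar (death at 8.0) is the signature of two well-separated groups; a cohort with no such outlying bar is consistent with a connected, unpartitioned point cloud.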

Multipersistence (Ch 8, RIVET) treats the bifiltration on (IS-Sharpe percentile, density percentile) as a robustness map. Zigzag persistence (Ch 9, dionysus2) tracks insertion/deletion of strategies across WFO windows, yielding a per-strategy temporal-stability fingerprint usable as a refitting-cadence signal. Sheaf cohomology (Ch 10) over the WFO open cover gives a single scalar H¹ score per cohort that predicts portfolio-level aggregation risk.

The make-or-break test is H3: adding topological features to a regressor that already contains XGBoost, TabPFN, TS2Vec, a contrastive encoder, and a stacked ensemble of those baselines must strictly raise OOS predictive accuracy by ΔR² ≥ 0.03. If it does not, the program reframes around topology-as-enrichment rather than topology-as-replacement and reports the result honestly.

Phase 04 · Planned

Persistent local homology + information geometry

H10 asks whether persistent local homology of an ε-ball around each strategy carries incremental predictive power beyond cohort-level features, and, if so, whether the per-strategy local-homology vector can serve as an interpretable audit-trail diagnostic.

H11 cross-checks the engineered behavior fingerprint against Fisher–Rao distances on OOS-return distributions. The rank correlation is the sanity check; the nested-model ΔR² is the keep-or-cut decision. If H11 fails, the engineered fingerprint is missing distribution-level information and the program documents that gap rather than papering over it.
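To make the Fisher–Rao side of the comparison concrete, here is a minimal sketch for the discrete case. It assumes each strategy's OOS-return distribution has been binned into a histogram over shared bins; the program does not state which distribution family it uses, so this choice and the function name `fisher_rao_distance` are illustrative. On the probability simplex the geodesic distance has the closed form 2·arccos(Σ √(pᵢqᵢ)).

```python
import math

def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two histograms on shared bins.

    Inputs may be raw counts; they are normalised to probabilities first.
    """
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    mass_p, mass_q = sum(p), sum(q)
    if mass_p <= 0 or mass_q <= 0:
        raise ValueError("histograms must have positive mass")
    # Bhattacharyya coefficient of the normalised histograms.
    bc = sum(math.sqrt((a / mass_p) * (b / mass_q)) for a, b in zip(p, q))
    # Clamp against floating-point drift before taking arccos.
    return 2.0 * math.acos(min(1.0, max(0.0, bc)))

same = fisher_rao_distance([1, 2, 3], [2, 4, 6])   # identical after normalising
disjoint = fisher_rao_distance([1, 0], [0, 1])     # no shared support
```

The distance is bounded above by π, reached when the two distributions share no support, which makes it straightforward to rank-correlate against fingerprint distances.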

Phase 05 · Planned, applied half of Part IV

Topo-aware portfolios + cross-crypto universality

H4 builds portfolios that sample one strategy per Mapper cluster in behavior-space, weighted by the Beta-Binomial profitability lower-bound, with an explicit topological-spread constraint. Baselines: equal-weighted, correlation-clustered, mean-variance, HRP, and NCO from the same pool. The bar is ΔSortino ≥ 0.15 against the strongest baseline, paired-bootstrapped across 27 assets × N windows × portfolio sizes (k ∈ {5, 10, 20, 50}).
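One way the paired bootstrap could look is sketched below. It assumes the resampling unit is one (asset, window, k) cell carrying a Sortino ratio for the topo-aware portfolio and for the strongest baseline, and it reports a percentile interval on the mean paired difference. The resampling unit, the percentile method, and the function name `paired_bootstrap_ci` are assumptions; the pre-registration defines the actual procedure.

```python
import random
import statistics

def paired_bootstrap_ci(topo_scores, baseline_scores,
                        n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference topo - baseline.

    Returns (mean_difference, ci_low, ci_high). Cells are resampled as pairs,
    so the dependence between the two portfolios in a cell is preserved.
    """
    if len(topo_scores) != len(baseline_scores) or not topo_scores:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [t - b for t, b in zip(topo_scores, baseline_scores)]
    rng = random.Random(seed)
    n = len(diffs)
    boot_means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_boot)
    )
    low = boot_means[int((alpha / 2) * n_boot)]
    high = boot_means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return statistics.fmean(diffs), low, high

# Hypothetical Sortino ratios for four cells (illustrative numbers only).
topo = [0.5, 0.7, 0.9, 1.1]
base = [0.3, 0.5, 0.7, 0.9]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(topo, base)
```

Whether the 0.15 bar is applied to the point estimate or to the lower end of the interval is a pre-registration decision; the sketch returns both so either reading can be checked.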

H5 tests cross-crypto universality. If persistence diagrams of behavior-space across L1-majors / alt-L1s / memes / DeFi / L2-and-newer cohorts are closer to each other (in bottleneck and Gromov–Wasserstein) than to permutation-shuffled nulls, training a topological selector on a pooled cohort and applying it to a fresh asset is statistically justified, and it shortens deployment time on every new asset that follows.

Phase 06 · Planned, the practitioner-facing surface

Practitioner workflow & live deployment

Every claim in the monograph corresponds to one tested function in apps/: topo-select (corpus → recommended portfolio), topo-portfolio (the topo-spread constructor), topo-audit (per-strategy local-homology signature), topo-refit (zigzag-based refitting cadence), aggregation-check (sheaf H¹ scorer), and universality-check (cohort-pooling justification report).

Ch 16 reports per-stage latency budgets and identifies which methods are skippable in production: multipersistence and sheaves are compute-heavy, so the production path can skip them and fall back to the cheap subset that retains most of the predictive lift, staying well within a real-time refit window.

Phase 07 · Planned, public release

Synthesis, papers, monograph, public package

Output portfolio: five papers (Plateau & Neighborhood Generalization · Persistent Topology of Strategy-Space · Topology-Aware Portfolios · Cross-Crypto Universality · Applied Topological Workflow), one applied monograph at ~150–250 pages, one public MIT-licensed software package, five interactive HTML visualisations embedded in the daru.finance survey article, and a Zenodo DOI for the reproducibility container.

Phase gate: an external user-test with 1–2 practitioners outside the project before public release. The deliverable in one sentence: a CLI that takes a fresh corpus of backtested strategies and returns a recommended portfolio with per-strategy audit trails, cohort-level aggregation-risk scores, and refit-cadence recommendations, all validated to thesis-grade rigor and reproducible from a public container.

Make-or-break, falsification, scope

Make-or-break test

H3 against the modern-ML baseline stack (XGBoost, TabPFN, TS2Vec, contrastive equity-curve encoder, and their stacked ensemble). If incremental adj-R² < 0.03, the topology stack does not earn its place, and the program reframes Part V around topology as an enrichment of an ML pipeline rather than a replacement.
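The decision rule can be written down in a few lines. The sketch below takes held-out targets plus predictions from the baseline stack and from the topology-augmented stack, and compares adjusted R² using the standard formula 1 − (1 − R²)(n − 1)/(n − p − 1). The function names, the feature counts passed in, and the toy numbers are illustrative assumptions.

```python
def r_squared(y, y_hat):
    """Coefficient of determination of predictions y_hat against targets y."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(y)
    if n - n_features - 1 <= 0:
        raise ValueError("need more observations than features + 1")
    return 1.0 - (1.0 - r_squared(y, y_hat)) * (n - 1) / (n - n_features - 1)

def h3_increment(y, base_pred, base_features, topo_pred, topo_features,
                 threshold=0.03):
    """Return (delta adjusted R^2, whether it clears the threshold)."""
    delta = (adjusted_r_squared(y, topo_pred, topo_features)
             - adjusted_r_squared(y, base_pred, base_features))
    return delta, delta >= threshold

# Toy data: the augmented model is perfect, the baseline is off by 0.5.
y_true = [float(v) for v in range(1, 11)]
baseline = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5, 7.5, 7.5, 9.5, 9.5]
delta, passed = h3_increment(y_true, baseline, 2, y_true, 3)
```

The adjustment penalises the augmented model for its extra features, so a topological block that adds many columns has to earn a correspondingly larger raw R² gain.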

Multiple-testing

Holm–Bonferroni FWER ≤ 0.05 across the 11 primary hypotheses. The 3-asset pilot is QA-only and does not enter the final inference; pre-registration is locked at the end of Phase 1.
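For reference, the Holm–Bonferroni step-down procedure is short enough to state in full. The sketch below is a generic implementation, not the program's code; the function name and the example p-values are illustrative.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down procedure controlling FWER at `alpha`.

    Returns a list of booleans, aligned with `p_values`, marking which
    hypotheses are rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The (rank + 1)-th smallest p-value is tested at alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject

# Four illustrative p-values: the two smallest survive, the rest do not.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
# -> [True, False, False, True]
```

With 11 primary hypotheses, the smallest p-value is tested at 0.05/11 ≈ 0.0045, the next at 0.05/10, and so on until the first failure.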

Scope decisions

Crypto-only for v1. Hybrid strategy-space (per-family parameter + unified behavior). No theoretical novelty pursued; sheaves and information geometry are used as applied analysis tools that produce scalars feeding downstream decisions.
