
A Reproducible Walk-Forward Backtester

Cross-language parity, regime segmentation, and robustness stress testing for systematic trading research.

A research-grade backtester built twice: a Python reference and a specification-driven Rust port that produce numerically indistinguishable results on three deterministic configuration surfaces. When the same metric comes out of two independent implementations to within a thousandth, a measured edge is a property of the method, not an artifact of one engine. It is open source, Apache 2.0 licensed, and runs from a clean clone with no data downloads.

Coming soon

SSRN Working Paper

Under review, not yet posted.

Source on GitHub

github.com/DaruFinance/quant-research-framework-rs

In short

Speed

~28x faster, ~37x less memory

The Rust port runs the full default pipeline on 48,000 bars in about a quarter of a second, against roughly seven seconds for the Python reference, in single-digit megabytes of RAM.

Parity

156 / 156 within 1e-3

Re-running the three parity harnesses end to end, every deterministic-core metric point matched the Python reference at print precision (max deviation about 5e-5). The published record covers 210 points across the same surfaces.

The contribution

The bundle, not any one piece

Walk-forward optimisation, per-regime lookback tuning, strict no-look-ahead enforced at the trade ledger, and a cross-language parity record. Each exists somewhere; no other open framework ships all of it.

What this is

Most open-source backtesters give you an execution loop and leave the research discipline to you: how you split in-sample from out-of-sample, whether your indicators secretly peek at the future, and how an edge holds up once fees, slippage and funding are real. This framework moves that discipline into the engine. Walk-forward optimisation, regime segmentation, realism controls and a five-scenario robustness suite are built in, and a strict no-look-ahead invariant is enforced at the trade-ledger level so an author cannot accidentally trade on a bar that has not closed.
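
One way to picture a ledger-level no-look-ahead check is sketched below. It is a minimal illustration, not the framework's code: the `Trade` record, the `LookAheadError` exception and the `record_trade` helper are hypothetical names, and the rule assumed here is that a fill must land on a bar strictly after the closed bar that produced its signal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trade:
    signal_bar: int  # index of the closed bar whose data produced the signal
    entry_bar: int   # index of the bar on which the order is filled


class LookAheadError(ValueError):
    """Raised when a trade would fill on or before its own signal bar."""


def record_trade(ledger: list[Trade], trade: Trade) -> None:
    """Append a trade only if it fills strictly after its signal bar (assumed rule)."""
    if trade.entry_bar <= trade.signal_bar:
        raise LookAheadError(
            f"fill on bar {trade.entry_bar} uses a signal from bar {trade.signal_bar}"
        )
    ledger.append(trade)


ledger: list[Trade] = []
record_trade(ledger, Trade(signal_bar=10, entry_bar=11))  # accepted: next-bar fill

try:
    record_trade(ledger, Trade(signal_bar=12, entry_bar=12))  # same-bar fill
    rejected = False
except LookAheadError:
    rejected = True
```

Putting the check where trades are written, rather than in each strategy, means a strategy author cannot bypass it by accident, which is the property the paragraph above describes.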

The same algorithmic specification is implemented twice. The Python reference is readable and easy to extend; the Rust port is built for speed. A parity harness compares their metric ledgers point by point, which is what lets the project claim its numbers are real rather than implementation lore. The figures below are generated from the framework's own runs.

How it fits together

Data flows through a shared indicator core into a strategy contract that returns one signal per bar, then through an execution core that applies every cost, and finally through the walk-forward orchestrator that re-optimises and forward-tests each window. The parity harness sits at the end, where both implementations' ledgers are expected to agree.

Fig. 1: System architecture. One algorithmic specification, implemented as a Python reference and a Rust port, converging on a parity harness that compares their metric ledgers.
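
A "one signal per bar" strategy contract might look like the sketch below, using an EMA crossover because that is the bundled example strategy. The function names, the `fast` and `slow` parameters and the +1 / 0 / -1 encoding are assumptions made for illustration; the framework's real interface may differ.

```python
def ema(values: list[float], span: int) -> list[float]:
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out: list[float] = []
    for v in values:
        # Seed with the first value, then blend each new value into the running average.
        out.append(v if not out else alpha * v + (1 - alpha) * out[-1])
    return out


def crossover_signals(closes: list[float], fast: int, slow: int) -> list[int]:
    """One signal per bar: +1 if the fast EMA is above the slow EMA, -1 if below, 0 if equal."""
    f, s = ema(closes, fast), ema(closes, slow)
    return [(a > b) - (a < b) for a, b in zip(f, s)]


closes = [100.0, 101.0, 102.0, 101.0, 99.0, 97.0, 98.0, 100.0]  # made-up closes
signals = crossover_signals(closes, fast=2, slow=4)
```

The output has exactly as many entries as there are bars, so the execution core can walk bars and signals in lockstep.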

Walk-forward, by construction

Each iteration optimises on a rolling in-sample window, forward-tests on the next out-of-sample sub-window, then advances. Nothing is scored on the data it was fit to.

Fig. 2: Rolling walk-forward scheme. The in-sample window (IS, length L_IS) rolls forward; each step forward-tests the next out-of-sample sub-window (OOS, length V).
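
The rolling scheme can be sketched as a generator of index ranges. This is an illustration under one stated assumption: each step advances by V bars, so consecutive out-of-sample sub-windows tile the data without overlap. The source does not state the step size, and `Window` and `rolling_windows` are hypothetical names.

```python
from typing import Iterator, NamedTuple


class Window(NamedTuple):
    is_start: int   # first in-sample bar (inclusive)
    is_end: int     # one past the last in-sample bar
    oos_start: int  # first out-of-sample bar (inclusive)
    oos_end: int    # one past the last out-of-sample bar


def rolling_windows(n_bars: int, l_is: int, v: int) -> Iterator[Window]:
    """Yield rolling IS/OOS index ranges; each step advances by v bars (assumed)."""
    if l_is <= 0 or v <= 0:
        raise ValueError("l_is and v must be positive")
    start = 0
    while start + l_is + v <= n_bars:
        split = start + l_is
        yield Window(start, split, split, split + v)
        start += v


windows = list(rolling_windows(n_bars=100, l_is=40, v=20))
for w in windows:
    print(w)
```

Because every out-of-sample range starts exactly where its in-sample range ends, no bar is ever scored by parameters that were fit on it.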

What a run produces

A single backtest here is not a number, it is a dossier. One run of the bundled EMA-crossover strategy on SOL/USDT 1h emits an optimised in-sample and out-of-sample report, a Monte Carlo distribution over resampled trade sequences, a five-scenario robustness sweep, and a rolling walk-forward across eighteen windows. Everything in this section is the actual output of that run, drawn by the framework itself and shown unedited.

It is worth saying plainly: this strategy loses. That is the point. A bundled example that printed money would be the suspicious thing. What the framework is built to do is make a weak strategy's weakness impossible to hide, and the figures below show it doing exactly that.

Equity curve of the optimised strategy out of sample, with four robustness scenarios overlaid, all drifting below the starting balance.
Fig. 3: Actual output. Optimised out-of-sample equity, with the robustness overlays. The curve drifts down, and the cost stresses make it worse: doubling fees (FEE) and a slippage shock (SLI, red) push it lower, while a one-bar entry delay (ENT) barely moves it. A real edge would shrug these off.

The picture flips under walk-forward. When the look-back is re-optimised on each rolling in-sample window and then forward-tested, the stitched equity actually rises. That is the trap the rest of the pipeline exists to catch: per-window re-fitting flatters almost anything, so a single rising rolling curve is not evidence of anything.

Rolling walk-forward equity curve trending upward, with robustness overlays and an end-of-first-in-sample-window marker.
Fig. 4: Actual output. The rolling walk-forward equity with the same robustness overlays. It trends up precisely because each window is re-fit, which is why the deflation and Monte Carlo checks exist: to cut it back to what it is worth.

The metrics it prints

Here is what the run reports. The optimised out-of-sample Sharpe is negative, and a Monte Carlo over resampled trade orderings ends in a loss on 99.8% of paths.

optimised report + monte carlo ranks
  IS-opt  (LB 47) | Trades: 118   ROI: $-735.77     PF: 0.61   Sharpe: -2.69   MaxDD: $804.82
 OOS-opt  (LB 47) | Trades: 755   ROI: $-2,077.85   PF: 0.81   Sharpe: -2.91   MaxDD: $2,258.93

 Monte-Carlo Percentile Ranks vs ACTUAL
            ROI:  24.5th       PF:  24.8th       Sharpe:  74.7th
        WinRate:  77.0th   MaxDrawdown:  55.4th  Consistency:  69.7th

 Simulations ending with LOSS:  99.8%

The robustness sweep re-runs the optimised baseline under four perturbations. Slippage shock is the worst case, roughly doubling the out-of-sample loss (from $2,077.85 to $4,303.79); a one-bar entry delay does no damage here, and the loss actually shrinks to $1,623.33. The spread between them is a cheap read on how fragile the result is to assumptions you cannot control.

robustness sweep (out-of-sample)
 Baseline OOS | ROI: $-2,077.85   PF: 0.81   Sharpe: -2.91   MaxDD: $2,258.93
     ENT OOS | ROI: $-1,623.33   PF: 0.84   Sharpe: -2.30   MaxDD: $1,809.32   entry drift +1 bar
     FEE OOS | ROI: $-2,832.73   PF: 0.74   Sharpe: -3.99   MaxDD: $2,876.37   fees x2
     SLI OOS | ROI: $-4,303.79   PF: 0.64   Sharpe: -6.11   MaxDD: $4,346.15   slippage shock
 ENT+IND OOS | ROI: $-1,629.68   PF: 0.85   Sharpe: -2.29   MaxDD: $1,875.23   drift + indicator jitter
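
The cost-side scenarios in a sweep like this can be thought of as overrides on a baseline configuration. The sketch below is a toy under loud assumptions: the `Costs` fields, the rates, the size of the slippage shock and the trade returns are all made up, and the entry-delay and indicator-jitter scenarios are left out because they need bar data rather than a cost override.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Costs:
    fee_rate: float       # fraction of notional charged per side (assumed model)
    slippage_rate: float  # fraction of notional lost to slippage per side


def net_pnl(gross_returns: list[float], notional: float, costs: Costs) -> float:
    """Sum of per-trade PnL after round-trip fees and slippage."""
    per_side = costs.fee_rate + costs.slippage_rate
    return sum(notional * (r - 2 * per_side) for r in gross_returns)


baseline = Costs(fee_rate=0.0005, slippage_rate=0.0002)  # made-up rates
scenarios = {
    "Baseline": baseline,
    "FEE": replace(baseline, fee_rate=baseline.fee_rate * 2),            # fees x2
    "SLI": replace(baseline, slippage_rate=baseline.slippage_rate * 5),  # assumed shock size
}

gross = [0.004, -0.003, 0.001, -0.002, 0.0025]  # made-up gross trade returns
results = {name: net_pnl(gross, 1_000.0, c) for name, c in scenarios.items()}
for name, pnl in results.items():
    print(f"{name:>8} | net PnL: {pnl:,.2f}")
```

Keeping each scenario as a one-field change to the baseline makes it easy to see which assumption is responsible for a given move in the result.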

And the rolling walk-forward, window by window. Some windows are positive (W01 forward-tests to plus $899), but a strategy that only works in some windows, on re-fit look-backs, is noise, not an edge, and the aggregate and the deflation gate are built to expose it.

walk-forward windows (W01 of 18)
 Running Walk-Forward Windows
  W01 IS  (LB 47) | Trades: 118   ROI: $308.25   PF: 1.14   Sharpe:  0.63   MaxDD: $448.35
 W01 OOS  (LB 47) | Trades:  92   ROI: $899.57   PF: 1.57   Sharpe:  1.95   MaxDD: $243.77
  ...   W02 through W18 each re-optimise on the rolling in-sample window
        and forward-test the next window on data they never saw   ...

Was it luck?

The Monte Carlo view resamples the trade sequence thousands of times and re-scores each metric, so you can see where the real result sits in the distribution of what randomness alone produces. When the actual value lands in the middle of a distribution centred on nothing, the result is indistinguishable from chance.

Histograms of ROI, max drawdown, profit factor, Sharpe and consistency over resampled trade sequences, with the actual value marked in red on each.
Fig. 5: Actual output. Bootstrap distributions of each metric over resampled trade sequences, with the realised value marked in red. The realised Sharpe and ROI sit squarely inside distributions centred near zero.
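
A percentile rank of this kind can be sketched in a few lines. The example assumes trades are resampled with replacement (a bootstrap, as the caption suggests) and scores only ROI and max drawdown; the function names, the trade PnLs, the simulation count and the seed are illustrative and not taken from the framework.

```python
import random


def max_drawdown(pnls: list[float]) -> float:
    """Largest peak-to-trough drop of the cumulative PnL curve."""
    equity = peak = worst = 0.0
    for p in pnls:
        equity += p
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst


def percentile_rank(samples: list[float], actual: float) -> float:
    """Percentage of simulated values that fall at or below the actual one."""
    return 100.0 * sum(s <= actual for s in samples) / len(samples)


def bootstrap(pnls: list[float], n_sims: int, seed: int) -> dict[str, list[float]]:
    """Resample trades with replacement and re-score ROI and max drawdown."""
    rng = random.Random(seed)
    out: dict[str, list[float]] = {"roi": [], "max_dd": []}
    for _ in range(n_sims):
        path = rng.choices(pnls, k=len(pnls))
        out["roi"].append(sum(path))
        out["max_dd"].append(max_drawdown(path))
    return out


trades = [12.0, -8.0, 5.0, -15.0, 7.0, -3.0, 9.0, -11.0, 4.0, -6.0]  # made-up PnLs
sims = bootstrap(trades, n_sims=2_000, seed=7)
roi_rank = percentile_rank(sims["roi"], sum(trades))
loss_share = 100.0 * sum(r < 0 for r in sims["roi"]) / len(sims["roi"])
print(f"ROI rank: {roi_rank:.1f}th, simulations ending with loss: {loss_share:.1f}%")
```

Fixing the seed keeps the simulated distribution reproducible from run to run, which matters if the ranks are to be compared across implementations.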

Performance

Both implementations run the identical default pipeline on slices of the same bundled dataset. The Rust port is roughly twenty-five to sixty times faster and holds peak memory about thirty-seven times lower, single-threaded.

Fig. 6: Wall-clock and peak memory, Python reference vs Rust port, by dataset size. Measured single-threaded on the same machine; the small-slice speed-up is inflated by the Python JIT compile cost, so the full-history figure is the honest one.

Cross-language parity

Speed is worthless if the fast engine is wrong. Re-running the three parity surfaces end to end, every deterministic-core metric point agreed with the Python reference within a thousandth. Two sections are non-deterministic by design and are excluded openly, not hidden.

Fig. 7: Cross-language parity. 156 deterministic-core metric points across three surfaces (default, regime + walk-forward, forex), all within a 1e-3 relative tolerance; the maximum observed deviation is about 5e-5.
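
A point-by-point comparison under a relative tolerance might be written as below. This is a sketch, not the project's harness: `parity_report` is a hypothetical name, the ledgers are modelled as flat dictionaries of metric points, the deviation is taken relative to the larger magnitude of the two values (the same rule `math.isclose` applies), and the example numbers are made up.

```python
import math


def parity_report(reference: dict[str, float], port: dict[str, float],
                  rel_tol: float = 1e-3) -> tuple[int, int, float]:
    """Return (points within tolerance, total points, max relative deviation)."""
    if reference.keys() != port.keys():
        raise KeyError("ledgers must contain the same metric points")
    passed, worst = 0, 0.0
    for key, ref in reference.items():
        got = port[key]
        scale = max(abs(ref), abs(got))
        deviation = abs(ref - got) / scale if scale > 0 else 0.0
        worst = max(worst, deviation)
        if math.isclose(ref, got, rel_tol=rel_tol):
            passed += 1
    return passed, len(reference), worst


python_ledger = {"roi": 100.0, "pf": 1.25, "sharpe": 0.5}        # made-up values
rust_ledger = {"roi": 100.004, "pf": 1.25, "sharpe": 0.50001}    # made-up values
passed, total, worst = parity_report(python_ledger, rust_ledger)
print(f"{passed} / {total} within 1e-3, max deviation {worst:.1e}")
```

Reporting the worst deviation alongside the pass count shows how much headroom there is under the tolerance, rather than only whether the threshold was met.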

How it compares

Plenty of mature backtesters exist, and each capability here lives somewhere already. The point is the combination, and that it is enforced rather than left to convention.

What is built in

Clone it and see for yourself

Every step below runs from a clean clone with the bundled sample data, no downloads and no manual setup. The numbers are from a cold run on a laptop-class machine.

One thing to expect: the bundled strategies are simple demonstrations of the pipeline, not edges to trade. Run honestly out of sample with real costs, they are meant to lose. That is the framework exposing a weak strategy rather than flattering it, which is the whole point of evaluating this way.

Citation

The accompanying paper is under review at SSRN. A formal link will appear here once it is posted. In the meantime, the framework is citable from its repository.

See also

Within-strategy permutation testing for the selection method this backtester is built to support, and the Research Review for from-scratch reproductions that lean on exactly this kind of walk-forward, deflation-aware evaluation.
