Research review · causal inference · study 14

Causal Factor Investing

Association is not causation, and the same factor regression that finds a star can be made to invent one, kill a real one, or print a phantom, depending only on the causal graph behind it

The factor-investing literature reports associations (regressions of returns on candidate “factors”) and tacitly reads them as causes, without stating the causal graph that would license that reading. But whether an associational coefficient identifies a causal effect depends entirely on the structure behind it, and the three elementary structures pull a naive regression in opposite directions. This study turns López de Prado's critique into a controlled simulation plus a real, no-look-ahead cross-asset demonstration, and the playground below lets you watch a factor regression be made to lie in the browser.

source

Causal Factor Investing (2023)

López de Prado · Cambridge Elements in Quant. Finance

Code & data

lopez-de-prado-work-review / projects / 14

the claim

What López de Prado says

The factor-investing literature reports associations (regressions of returns on candidate factors) and tacitly reads them as causes, without ever stating or defending the causal graph that would license that reading.
Whether an associational coefficient identifies a causal effect depends entirely on the structure: a fork (confounder) manufactures a spurious effect, conditioning on a mediator destroys a real one, and conditioning on a collider creates significance from nothing. Pick an adjustment set that fails the backdoor criterion and a spurious "factor" looks significant.

our result

What we found

A 20,000-sim Monte Carlo reproduces the bias exactly: the naive estimator's error scales monotonically with the strength of the mis-handled structure (96% false-positive rate by strength 0.30), while the structurally-correct estimator holds its nominal 5%.
On a real 26-instrument panel, correct backdoor adjustment flips 2 of 9 factor verdicts (2 of the 4 that are significant naively) and attenuates ~⅔ of the raw signal; only 1 of 9 survives a second independent confounder.
The honest verdict: zero factors survive deflation; the best associational sleeve posts DSR 0.455, far below the 0.95 bar. The chain of evidence breaks exactly where he says it does, but no candidate ever reaches a defended causal graph; this is a pedagogical demonstration, not a money result.

The result in three lines

Monte Carlo · 20,000 sims

The three structures reproduce LdP exactly

A confounder (fork) makes a null factor look significant 100% of the time under naive regression; backdoor adjustment restores the nominal 5%. Over-controlling a mediator (chain) collapses a real effect's power to ~5%. Conditioning on a collider manufactures a −0.50 factor at 100% significance from a true effect of zero. Every Monte-Carlo number is verified bit-identical (max |Δ| ≈ 1e-13) against an independent reference.

Real factors · 26 instruments · no look-ahead

Backdoor adjustment flips verdicts; almost nothing survives

Three lagged, causal factors (momentum, realized vol, short reversal) across crypto, US equities and forex. The market-mean backdoor flips 2 of 9 significance verdicts, both spurious-killed, and roughly two-thirds of a typical raw factor signal (67% median |t| attenuation) is confounder leaking through. Only 1 of 9 cells survives a second, independent confounder.

DSR 0.46 · bar 0.95

The best associational sleeve fails deflation

The best naive long-short sleeve (equities low-vol, annualised SR ≈ 0.47) deflates to DSR 0.46 against the False-Strategy-Theorem benchmark for 9 trials, far below the 0.95 bar. A falsification scorecard walks each candidate up the evidence hierarchy; the chain of evidence breaks exactly where LdP says it does. This is a methodological result, not a money result.

The three structures, a controlled Monte Carlo

The same regression invents a factor, kills a real one, or prints a phantom

Figure 1 panels: Fork / confounder (true X→Y effect = 0) · Chain / mediator (true total effect = b·d = s²) · Collider (true X→Y effect = 0)

Fig. 1: Monte Carlo (20,000 sims × 2,000 obs). Left, fork: the naive Y~X regression piles up at 0.50 (true effect is 0); the backdoor-adjusted estimate sits on zero. Centre, chain: the naive regression correctly recovers the true total effect 0.64, but over-controlling for the mediator M crushes it to zero. Right, collider: the naive regression is correctly at zero, but conditioning on the collider C manufactures a −0.50 coefficient out of nothing.

For each of 20,000 independent datasets we draw fresh data from a linear-Gaussian model of each structure and fit two regressions: the naive simple regression of return on factor, and the regression a careful analyst would run given the graph. The sampling distributions tell three different stories from the same arithmetic.

In the fork, a common cause makes a null factor significant 100% of the time; the back-door adjustment restores the nominal 5%. In the chain, the naive regression is right, and over-controlling for the mediator collapses a real effect's power back to 5%. In the collider, the naive regression is right at zero, but conditioning on the collider manufactures a −0.50 factor at 100% significance.

structure	true effect	naive coef	naive reject	adjusted coef	adj. reject
Fork / confounder (Z→X, Z→Y)	0.00	0.500	100%	0.000	5.0%
Chain / mediator (X→M→Y)	0.64	0.640	100%	−0.000	5.1%
Collider (X→C←Y)	0.00	−0.000	4.9%	−0.500	100%

Coefficients and rejection rates are MC means over 20,000 sims; “reject” is the share with |t|>1.96. The naive regression is significant and wrong in the fork, correct in the chain (where the hazard is over-control), and correct in the collider (where the hazard is conditioning). Every figure is verified bit-identical against an independent reference (max |Δ| ≈ 1e-13).
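The experiment can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the study's code: it assumes unit-variance Gaussian noise, fork and collider loadings of 1 and chain path strengths of 0.8 (values chosen because they imply the 0.50, 0.64 and −0.50 coefficients in the table), and it draws one dataset per structure instead of 20,000. All helper names are hypothetical.

```python
import random
import statistics


def slope(x, y):
    """OLS slope of y on x, with an intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def residual(y, z):
    """What is left of y after regressing it on z (with an intercept)."""
    b = slope(z, y)
    mz, my = statistics.fmean(z), statistics.fmean(y)
    return [yi - my - b * (zi - mz) for yi, zi in zip(y, z)]


def partial_slope(x, y, z):
    """Coefficient on x in y ~ x + z, via Frisch-Waugh residualisation."""
    return slope(residual(x, z), residual(y, z))


def simulate(structure, n=2000, seed=0):
    """Draw (x, y, z) from one linear-Gaussian structure; z is the third variable."""
    rng = random.Random(seed)

    def noise():
        return rng.gauss(0.0, 1.0)

    if structure == "fork":        # Z -> X and Z -> Y, no X -> Y arrow (assumed loadings 1)
        z = [noise() for _ in range(n)]
        x = [zi + noise() for zi in z]
        y = [zi + noise() for zi in z]
    elif structure == "chain":     # X -> Z -> Y, assumed b = d = 0.8, total effect 0.64
        x = [noise() for _ in range(n)]
        z = [0.8 * xi + noise() for xi in x]
        y = [0.8 * zi + noise() for zi in z]
    elif structure == "collider":  # X -> Z <- Y, X and Y independent (assumed loadings 1)
        x = [noise() for _ in range(n)]
        y = [noise() for _ in range(n)]
        z = [xi + yi + noise() for xi, yi in zip(x, y)]
    else:
        raise ValueError(f"unknown structure: {structure}")
    return x, y, z


results = {}
for name in ("fork", "chain", "collider"):
    x, y, z = simulate(name)
    results[name] = (slope(x, y), partial_slope(x, y, z))
    naive, conditioned = results[name]
    print(f"{name:9s} naive={naive:+.3f}  conditioned on Z={conditioned:+.3f}")
```

Which of the two columns is the "right" one differs by structure: conditioning is the correct estimator in the fork and the wrong one in the chain and the collider.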

Make a factor regression lie

Pick a structure, then toggle 'condition on Z'. In a confounder, conditioning closes the spurious back-door and the association collapses to its true zero; in a mediator it blocks a real effect; in a collider it opens a path that was closed and manufactures one. The scatter and read-outs are computed from a fixed-seed sample.

The point of the playground is that bias is not a knife-edge pathology, it is the generic behavior of the wrong adjustment set, and it scales smoothly with the strength of the structure you mishandle. Slide the fork's confounder strength up and a null factor's naive coefficient climbs toward 0.50; slide the chain's path strength and the over-controlled estimate stays pinned at zero while the true effect grows; the collider invents a steadily larger phantom the harder its two parents push on it.
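For reference, the population coefficients behind those three sliders can be written down under one simple parameterization. This is a sketch that assumes unit-variance, mutually independent Gaussian noise terms and a single strength s on every arrow of the mishandled structure; the component's exact parameterization is not stated on this page, so treat the symbols as assumptions.

```latex
\begin{aligned}
\text{Fork:}\quad & X = sZ + \varepsilon_X,\; Y = sZ + \varepsilon_Y
  &&\Rightarrow\quad \beta_{\text{naive}} = \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)} = \frac{s^2}{1+s^2},
  \qquad \beta_{\text{adj}\mid Z} = 0 \\
\text{Chain:}\quad & M = sX + \varepsilon_M,\; Y = sM + \varepsilon_Y
  &&\Rightarrow\quad \beta_{\text{naive}} = s^2,
  \qquad \beta_{\text{adj}\mid M} = 0 \\
\text{Collider:}\quad & C = sX + sY + \varepsilon_C,\; X \perp Y
  &&\Rightarrow\quad \beta_{\text{naive}} = 0,
  \qquad \beta_{\text{adj}\mid C} = -\frac{s^2}{1+s^2}
\end{aligned}
```

Under these assumptions s = 1 gives the fork's 0.50 and the collider's −0.50, and s = 0.8 gives the chain's 0.64, matching the Monte Carlo table.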

Interactive demo: confounder / mediator / collider playground. Controls: a structure selector, a 'condition on Z' toggle (partial Z out of X and Y, the regression's adjustment set), and a bias-strength slider s running from 0 to 1.2 (for the fork, s is the confounder loading, a = c). Read-outs: the observed corr(X, Y), the true X→Y effect, the spurious gap |obs − true|, and whether the X ⇆ Y path is open or closed.

Bias scales with the structure, error hits 96% by strength 0.30

Sweeping the structural-bias strength makes the single quantitative claim that summarizes the whole critique: the naive or mis-conditioned estimator's decision error scales monotonically with the strength of the (mis)handled structure, already 96% by strength 0.30, while the structurally-correct estimator holds its nominal 5% error (fork, collider) or recovers full power (chain) at every dose.

Figure 2 panels: Fork, false-positive rate (true X→Y = 0) · Collider, false-positive rate (true X→Y = 0) · Chain, power to detect a real effect (true total = b·d = s²)

Fig. 2: Dose-response sweep. Left, fork false-positive rate climbs from 5% to 100% as confounder strength rises (the back-door adjustment stays at 5%). Centre, conditioning on a collider drives its false-positive rate to 100% (the naive regression stays at 5%). Right, the naive regression keeps full power to detect the real chain effect, while over-controlling for the mediator holds power at the 5% no-power floor.
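The 96% figure can be sanity-checked analytically. The sketch below is an approximation under stated assumptions, not the study's simulation: it takes a fork with unit-variance Gaussian noise and equal loadings a = c = s, so that corr(X, Y) = s²/(1+s²), and uses a normal approximation to the t-statistic with n = 2,000 observations. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def fork_false_positive_rate(s, n=2000, crit=1.96):
    """Approximate P(|t| > crit) for the naive Y~X regression in a fork."""
    r = s * s / (1.0 + s * s)                     # spurious population correlation
    delta = r * sqrt(n - 2) / sqrt(1.0 - r * r)   # approximate mean of the t-stat
    z = NormalDist()
    # Two-sided rejection probability when t is roughly Normal(delta, 1).
    return (1.0 - z.cdf(crit - delta)) + z.cdf(-crit - delta)


for s in (0.0, 0.1, 0.2, 0.3, 0.6, 1.2):
    rate = fork_false_positive_rate(s)
    print(f"s={s:.1f}  approx. naive rejection rate={rate:.3f}")
```

Under these assumptions the approximation gives about 5% at s = 0 and about 96% at s = 0.30, in line with the sweep.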

Fig. 3: Real factors: naive t-stat (x) vs. the market-mean backdoor-adjusted t-stat (y) for the nine market×factor cells. Points off the diagonal lost signal to the confounder. The highlighted points (equities momentum, equities vol) cross the |t|=1.96 line: their apparent significance was largely the market confounder leaking through.

Real cross-asset factors: backdoor adjustment flips verdicts

The same machine, on real data. Three lagged, causal factors (21-day momentum, 21-day realized vol as a low-vol proxy, and 5-day short reversal) are run through a pooled within-market panel regression of next-day return across 26 instruments in crypto, US equities and forex. We then add the market-mean (the obvious common driver of both factor and return) as a back-door control, and a second independent confounder, lagged common realized vol. A factor is a candidate survivor only if it clears both.

Four of nine cells are significant naively. The market-mean backdoor flips 2 verdicts, equities momentum and equities vol, and both are spurious_killed: the apparent factor was largely the market confounder. The median |t| attenuation across naively-significant cells is 67%: roughly two-thirds of the raw factor signal is confounder. Only 1 of 9 cells survives both backdoors (equities short-reversal). Even there the t-stat shrinks from −12.0 to −4.9 under the market-mean control, and reversal is a microstructure / bid-ask-bounce effect, not a defended risk premium. Crypto vol even changes sign between the two confounders, a tell that it is a conditioning artifact, not structure.

market	factor	naive t	+market t	+lagged-vol t	|t| atten.	flip
crypto	momentum	2.26	2.11	1.73	7%	-
crypto	vol	1.36	0.36	−1.26	74%	-
crypto	reversal	−0.23	0.66	−0.39	-	-
equities	momentum	−6.39	−1.60	−6.26	75%	spurious_killed
equities	vol	2.14	0.53	1.65	75%	spurious_killed
equities	reversal	−11.98	−4.88	−11.96	59%	-
forex	momentum	0.02	0.50	0.01	-	-
forex	vol	0.76	0.18	0.51	76%	-
forex	reversal	−1.84	−1.71	−1.84	7%	-

All factors are lagged (value at day t uses only information up to t−1), so nothing predicts return t with future data. The t-stats are from pooled within-market panel regressions on real daily returns.
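The flip and attenuation columns follow mechanically from the t-stat columns. The sketch below recomputes the headline counts from the table's rounded values. It assumes attenuation is defined as 1 − |t_market| / |t_naive| and that a flip means "significant naively but not after the market-mean control"; both definitions are inferred from the table rather than stated by the source.

```python
from statistics import median

CRIT = 1.96

# (market, factor, naive t, +market t, +lagged-vol t), copied from the table above
CELLS = [
    ("crypto", "momentum", 2.26, 2.11, 1.73),
    ("crypto", "vol", 1.36, 0.36, -1.26),
    ("crypto", "reversal", -0.23, 0.66, -0.39),
    ("equities", "momentum", -6.39, -1.60, -6.26),
    ("equities", "vol", 2.14, 0.53, 1.65),
    ("equities", "reversal", -11.98, -4.88, -11.96),
    ("forex", "momentum", 0.02, 0.50, 0.01),
    ("forex", "vol", 0.76, 0.18, 0.51),
    ("forex", "reversal", -1.84, -1.71, -1.84),
]


def significant(t):
    return abs(t) > CRIT


def attenuation(t_naive, t_adjusted):
    """Share of |t| lost when the control is added (assumed definition)."""
    return 1.0 - abs(t_adjusted) / abs(t_naive)


naive_sig = [c for c in CELLS if significant(c[2])]
flips = [(m, f) for m, f, tn, tm, tv in naive_sig if not significant(tm)]
survivors = [(m, f) for m, f, tn, tm, tv in naive_sig
             if significant(tm) and significant(tv)]
median_atten = median(attenuation(tn, tm) for _, _, tn, tm, _ in naive_sig)

print(f"naively significant: {len(naive_sig)} of {len(CELLS)}")
print(f"flipped by market-mean control: {flips}")
print(f"survive both controls: {survivors}")
print(f"median |t| attenuation (naively significant cells): {median_atten:.0%}")
```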

The best associational sleeve fails deflation

The headline metric is the Deflated Sharpe Ratio on the best naive long-short factor sleeve, treating the nine market×factor cells as the trial count. The best sleeve is equities low-vol, annualised Sharpe ≈ 0.47. Against the expected-max-Sharpe benchmark for nine trials:

DSR = 0.46, far below the 0.95 bar. The best-looking associational sleeve does not survive multiple-testing deflation; the “edge” is consistent with luck across nine trials.
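For readers who want the mechanics, the deflation step can be sketched as follows, using the expected-maximum-Sharpe approximation and the Deflated Sharpe Ratio formula published by Bailey and López de Prado (2014). The inputs marked hypothetical (sample length, cross-trial Sharpe variance, skewness, kurtosis, and the 252-day de-annualisation) are placeholders, since this page does not report them; the printed number is illustrative and is not meant to reproduce the 0.46 above.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
Z = NormalDist()


def expected_max_sharpe(n_trials, var_sr):
    """Approximate E[max SR] across n_trials strategies with zero true Sharpe."""
    if n_trials < 2:
        raise ValueError("need at least two trials")
    return sqrt(var_sr) * (
        (1.0 - EULER_GAMMA) * Z.inv_cdf(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * Z.inv_cdf(1.0 - 1.0 / (n_trials * e))
    )


def deflated_sharpe(sr, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """P(true SR > sr_benchmark), with the skewness/kurtosis correction.

    sr and sr_benchmark are per-observation (not annualised) Sharpe ratios.
    """
    denom = sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr * sr)
    return Z.cdf((sr - sr_benchmark) * sqrt(n_obs - 1.0) / denom)


# Hypothetical inputs, for illustration only.
sr_daily = 0.47 / sqrt(252)   # annualised 0.47 from the text; 252 days/year is assumed
n_obs = 1000                  # hypothetical number of daily observations
var_sr = 0.0005               # hypothetical variance of daily Sharpe across the 9 trials

sr0 = expected_max_sharpe(9, var_sr)
dsr = deflated_sharpe(sr_daily, sr0, n_obs)
print(f"benchmark SR0 (daily) = {sr0:.4f}, illustrative DSR = {dsr:.3f}")
```

The design point is that the benchmark is not zero: with nine trials, the best sleeve has to beat the Sharpe ratio that luck alone would be expected to produce.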

The chain of evidence breaks where LdP says it does

A hierarchy-of-evidence scorecard walks each candidate signal up the ladder. The highest level any candidate reaches is 3: the lone survivor of the conditioning tests (equities reversal) is killed at L5 (deflation) and never reaches a defended causal graph or interventional evidence. Association is cheap, conditioning is fragile, and deflation plus a defended graph, which the literature almost never supplies, is where the “factors” disappear.

level	question	verdict
L1	Association exists? (raw |t|>1.96)	PASS: 4/9 cells significant naively
L2	Survives the market-mean backdoor?	PASS: 2/9; 2 verdicts flip; 67% median |t| attenuation
L3	Survives a second, independent confounder (lagged common vol)?	PASS: only 1/9 clears both (equities reversal)
L4	Is the causal graph stated & defended?	FAIL: no defended SEM supplied for any survivor
L5	Survives multiple-testing deflation (DSR>0.95)?	FAIL: best DSR 0.46
L6	Interventional / do-evidence?	FAIL: all evidence observational
L7	True OOS / cross-market replication of the causal claim?	FAIL: no survivor replicates across all three markets

Climbing toward L6

A perp-listing natural experiment gives the study its first non-observational evidence

Figure 4 panels: Continuation-sleeve abnormal return after perp listing (continuation sleeve CAR with 95% bootstrap CI, against trading days since perpetual listing) · Reversal-factor t-stat: discontinuity at the event

Fig. 4: Perp-listing natural experiment. Left: the continuation sleeve (which loses when reversal pays) earns −14.9% cumulative abnormal return over 30 post-listing days, net of each asset's own pre-listing baseline (bootstrap two-sided p = 0.021); a negative payoff means reversal strengthens. Right: the reversal effect is absent before listing in the treated assets and strong after, while the control majors barely move pre-to-post.

Every result above is observational conditioning and tops out at L3. To reach for interventional (L6) evidence we need a shock whose timing is set by something other than the effect we are testing. A Binance USD-M perpetual-futures listing is exactly that: an exogenous structural change to an asset’s microstructure (leverage, short access, funding mechanics, a broad speculative base) whose date is chosen by the exchange, not by short-term reversal. That makes it a natural experiment, and we apply it to the program’s lone L3 survivor, short-term reversal.

The pre-registered, directional prediction: if reversal is a microstructure / liquidity-provision effect, its strength should jump discontinuously at the listing. For each of 170 qualifying listings the pre-window is the token’s own spot returns and the post-window its perpetual returns (same asset, the instrument switch is the treatment); the lagged 5-day reversal factor is computed within each side separately, so no value straddles the boundary. We fit a regression discontinuity in the factor effect (pre vs post slope, plus an interaction term) and an event-time abnormal return of a continuation sleeve, with a control set of ten established majors, cut at the same calendar dates but with no listing event, as a placebo / parallel-trends check.

group	window	reversal slope	t-stat	n
treated	pre (spot)	−0.0008	−0.50	8,671
treated	post (perp)	−0.0050	−6.49	9,350
treated	post − pre (RD interaction)	−0.0042	−2.55	18,021
control (placebo)	pre (spot)	−0.0025	−3.02	8,738
control (placebo)	post (perp)	−0.0019	−3.28	9,350
control (placebo)	post − pre (RD interaction)	+0.0006	+0.58	18,088

Pooled-panel reversal slope and t-stat, pre vs post the exogenous listing, with the regression-discontinuity interaction term. The treated discontinuity is significant (t = −2.55); the placebo control discontinuity is flat (t = +0.58).
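In a fully interacted regression (separate intercept and slope for each window), the interaction coefficient equals the post-window slope minus the pre-window slope. The sketch below computes that slope difference on synthetic data only; the planted slopes, the noise scale and the helper names are hypothetical, and the study's actual specification (pooling, standard errors, controls) is not reproduced.

```python
import random
from statistics import fmean


def slope(x, y):
    """OLS slope of y on x, with an intercept."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def make_window(true_slope, n, rng):
    """Synthetic (lagged factor, next-day return) pairs with a planted slope."""
    factor = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ret = [true_slope * f + rng.gauss(0.0, 1.0) for f in factor]
    return factor, ret


rng = random.Random(7)
PRE_SLOPE, POST_SLOPE = 0.0, -0.5   # hypothetical: no reversal before, reversal after

f_pre, r_pre = make_window(PRE_SLOPE, 5000, rng)
f_post, r_post = make_window(POST_SLOPE, 5000, rng)

b_pre = slope(f_pre, r_pre)
b_post = slope(f_post, r_post)
interaction = b_post - b_pre        # the discontinuity in the factor effect

print(f"pre slope={b_pre:+.3f}  post slope={b_post:+.3f}  post - pre={interaction:+.3f}")

# The same arithmetic applied to the treated rows of the table above.
table_check = -0.0050 - (-0.0008)
print(f"table: post - pre = {table_check:+.4f}")
```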

In the treated assets reversal is absent before listing (t = −0.50) and strongly present after (t = −6.49); the discontinuity is significant (RD interaction t = −2.55). The control majors stay flat across the same dates (interaction t = +0.58), so the jump tracks the event, not the calendar. This is the first piece of evidence in the program that is not pure observational conditioning: it reaches partial L6 (interventional) support for a microstructure reading of reversal, and correspondingly weakens any reading of it as a defended risk premium.

Honest caveat. The discontinuity is a pooled-panel average: asset by asset, the post-listing reversal |t| exceeds the pre-listing |t| in only 48% of events. So this is evidence about the typical newly-listed token’s microstructure, not a per-asset law. It does not rescue the factor at L5 (the deflated-Sharpe verdict is unchanged), L4 (no formal defended graph), or L7 (no cross-market causal replication). The chain of evidence is exactly as fragile as before: the experiment sharpens what the surviving signal is (microstructure, not premium) without turning it into a risk factor.

Method

Monte Carlo (part a): linear-Gaussian structural equation models, fork (Z→X, Z→Y), chain (X→M→Y) and collider (X→C←Y), with 20,000 independent datasets of 2,000 observations each. For every dataset we fit the naive simple regression Y~X and the structurally-motivated adjusted regression, record the sampling distribution of the X-coefficient, and the rejection rate at |t|>1.96.
Decision-theoretic framing: for the fork and collider the true effect is 0, so a rejection is a Type-I false positive; for the chain the effect is real, so a non-rejection is a false negative (lost power). The dose-response sweep traces how naive vs. corrected decision error responds as the structural-bias strength rises over [0, 1.2].
Real cross-asset panel (part b): daily simple returns for 26 instruments across three markets (11 crypto perps, 7 US-equity ETFs, 8 forex majors), built from 1-minute source data. Every factor is causal: each value at day t uses only information up to t−1 (explicit lag), so nothing predicts return t with future data. Real data only.
Naive vs. backdoor: a pooled within-market panel regression of next-day return on the lagged, standardized factor, then the same with the market-mean (the obvious common driver) added, then with a second independent confounder, lagged common realized vol. A factor is a candidate survivor only if it clears both backdoors. We report the t-stat under each spec, whether the verdict flips, the kind of flip, and the |t| attenuation.
Headline metric (part c): the Deflated Sharpe Ratio on the best naive long-short factor sleeve, treating the market × factor set (9 cells) as the trial count, with the expected-max-Sharpe benchmark from the False Strategy Theorem. A hierarchy-of-evidence scorecard pinpoints the rung at which each surviving signal breaks.
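The no-look-ahead rule in the panel construction can be made concrete with a small sketch. The factor definitions below (sum of the trailing 21 returns for momentum, their sample standard deviation for realized vol, sum of the trailing 5 returns as the reversal input) are assumptions for illustration; the source states only the window lengths and the explicit one-day lag. Function and variable names are hypothetical.

```python
import random
from statistics import stdev


def lagged_factors(returns, t, mom_win=21, vol_win=21, rev_win=5):
    """Factor values for day t built only from returns[0:t], i.e. up to day t-1."""
    if t < max(mom_win, vol_win, rev_win):
        return None                      # not enough history yet
    past = returns[:t]                   # excludes returns[t] by construction
    return {
        "momentum": sum(past[-mom_win:]),
        "realized_vol": stdev(past[-vol_win:]),
        "reversal": sum(past[-rev_win:]),
    }


# Look-ahead check on synthetic returns: shocking day t must not move day t's factors.
rng = random.Random(1)
rets = [rng.gauss(0.0, 0.01) for _ in range(60)]
t = 40

before = lagged_factors(rets, t)
shocked = list(rets)
shocked[t] += 0.10                       # change today's return and nothing earlier
after = lagged_factors(shocked, t)

print("factors unchanged by same-day shock:", before == after)
```

A check of this kind is cheap to run on every factor and catches the most common source of look-ahead: a window that includes the return it is meant to predict.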

Notes & honest assessment

This is a methodological / pedagogical contribution, not a “factor makes money” result, and it is framed that way throughout. The structural-equation and back-door results are standard Pearl / López de Prado; the real-data part is a demonstration of the critique, not the discovery of an edge. To upgrade toward a positive result one would need, for at least one factor, a defended causal graph and interventional / natural-experiment evidence, which this study deliberately leaves open as the honest frontier. What it does cleanly is assemble, in one place, a Monte Carlo that reproduces the three biases with a dose-response and a bit-identical cross-check, a real multi-market no-look-ahead two-confounder demonstration, and a quantitative falsification scorecard.

Reproducibility

The simulation, the real-panel pipeline, the figures and the verification harness are collected in project 14 of lopez-de-prado-work-review. The playground on this page is self-contained: the closed forms of the simulated structures and the dose-response anchors are encoded in the component, so the mechanic it illustrates reproduces exactly on every load.

Cite

Cite as

Gatto, D. V. (2026). Causal Factor Investing: Association vs. Causation, Made Testable. Working paper. Review of López de Prado's causal-factor critique.

@techreport{gatto2026causal,
  author      = {Gatto, Daniel V.},
  title       = {Causal Factor Investing: Association vs. Causation,
                 Made Testable},
  year        = {2026},
  type        = {Working paper},
  note        = {Review of Lopez de Prado's causal-factor critique}
}

References

The primary sources for the critique reviewed here:

López de Prado, M. (2023). Causal Factor Investing: Can Factor Investing Become Scientific? Cambridge Elements in Quantitative Finance.
López de Prado, M., & Zoonekynd, V. (2024). Why Has Factor Investing Failed?: The Role of Specification Errors. SSRN working paper.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.
Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. Journal of Portfolio Management, 40(5).
