Research review · causal inference · study 14
Causal Factor Investing
Association is not causation, and the same factor regression that finds a star can be made to invent one, kill a real one, or print a phantom, depending only on the causal graph behind it
The factor-investing literature reports associations (regressions of returns on candidate “factors”) and tacitly reads them as causes, without stating the causal graph that would license that reading. But whether an associational coefficient identifies a causal effect depends entirely on the structure behind it, and the three elementary structures pull a naive regression in opposite directions. This study turns López de Prado's critique into a controlled simulation plus a real, no-look-ahead cross-asset demonstration, and the playground below lets you watch a factor regression be made to lie in the browser.
Causal Factor Investing (2023)
López de Prado · Cambridge Elements in Quant. Finance
Code & data
lopez-de-prado-work-review / projects / 14
the claim
What López de Prado says
- The factor-investing literature reports associations, regressions of returns on candidate factors, and tacitly reads them as causes, without ever stating or defending the causal graph that would license that reading.
- Whether an associational coefficient identifies a causal effect depends entirely on the structure: a fork (confounder) manufactures a spurious effect, conditioning on a mediator destroys a real one, and conditioning on a collider creates significance from nothing. Pick an adjustment set that fails the backdoor criterion and a spurious "factor" looks significant.
our result
What we found
- A 20,000-sim Monte Carlo reproduces the bias exactly: the naive estimator's error scales monotonically with the strength of the mis-handled structure (96% false-positive rate by strength 0.30), while the structurally-correct estimator holds its nominal 5%.
- On a real 26-instrument panel, correct backdoor adjustment flips 2 of 9 significance verdicts and attenuates ~⅔ of the raw signal; only 1 of 9 survives a second independent confounder.
- The honest verdict: zero factors survive deflation; the best associational sleeve posts DSR 0.455, far below the 0.95 bar. The chain of evidence breaks exactly where he says it does, but no candidate ever reaches a defended causal graph; this is a pedagogical demonstration, not a money result.
The result in three lines
Monte Carlo · 20,000 sims
The three structures reproduce LdP exactly
A confounder (fork) makes a null factor look significant 100% of the time under naive regression, fixed to the nominal 5% by backdoor adjustment. Over-controlling a mediator (chain) collapses a real effect's power to ~5%. Conditioning on a collider manufactures a −0.50 factor at 100% significance from a true effect of zero. Every Monte-Carlo number is verified bit-identical (max |Δ| ≈ 1e-13) against an independent reference.
Real factors · 26 instruments · no look-ahead
Backdoor adjustment flips verdicts; almost nothing survives
Three lagged, causal factors (momentum, realized vol, short reversal) across crypto, US equities and forex. The market-mean backdoor flips 2 of 9 significance verdicts, both spurious-killed, and roughly two-thirds of a typical raw factor signal (67% median |t| attenuation) is confounder leaking through. Only 1 of 9 cells survives a second, independent confounder.
DSR 0.46 · bar 0.95
The best associational sleeve fails deflation
The best naive long-short sleeve (equities low-vol, annualised SR ≈ 0.47) deflates to DSR 0.46 against the False-Strategy-Theorem benchmark for 9 trials, far below the 0.95 bar. A falsification scorecard walks each candidate up the evidence hierarchy; the chain of evidence breaks exactly where LdP says it does. This is a methodological result, not a money result.
The three structures, a controlled Monte Carlo
The same regression invents a factor, kills a real one, or prints a phantom
Fork / confounder
true X→Y effect = 0
Chain / mediator
true total effect = (b·d) = s²
Collider
true X→Y effect = 0
For each of 20,000 independent datasets we draw fresh data from a linear-Gaussian model of each structure and fit two regressions: the naive simple regression of return on factor, and the regression a careful analyst would run given the graph. The sampling distributions tell three different stories from the same arithmetic.
In the fork, a common cause makes a null factor significant 100% of the time; the back-door adjustment restores the nominal 5%. In the chain, the naive regression is right, and over-controlling for the mediator collapses a real effect's power back to 5%. In the collider, the naive regression is right at zero, but conditioning on the collider manufactures a −0.50 factor at 100% significance.
| structure | true effect | naive coef | naive reject | adjusted coef | adj. reject |
|---|---|---|---|---|---|
| Fork / confounder (Z→X, Z→Y) | 0.00 | 0.500 | 100% | 0.000 | 5.0% |
| Chain / mediator (X→M→Y) | 0.64 | 0.640 | 100% | −0.000 | 5.1% |
| Collider (X→C←Y) | 0.00 | −0.000 | 4.9% | −0.500 | 100% |
Coefficients and rejection rates are MC means over 20,000 sims; “reject” is the share with |t|>1.96. The naive regression is significant and wrong in the fork, correct in the chain (where the hazard is over-control), and correct in the collider (where the hazard is conditioning). Every figure is verified bit-identical against an independent reference (max |Δ| ≈ 1e-13).
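A minimal sketch of the three-structure experiment follows. It is not the project's code: the function names, the unit-variance noise, the loadings (1.0 for the fork and the collider, 0.8 for each chain link) and the reduced simulation count are all assumptions, chosen so that the population coefficients line up with the table above (0.50, 0.64, −0.50). Note that "adjusted" here simply means "conditioned on the third variable", which is the right move only in the fork.

```python
import numpy as np

def ols(y, X):
    """OLS with an intercept; returns slope coefficients and their t-stats."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (len(y) - A.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return beta[1:], beta[1:] / se[1:]

def simulate(structure, n_sims=300, n_obs=2000, seed=0):
    """Mean X-coefficient and |t|>1.96 rejection rate, naive vs conditioned."""
    rng = np.random.default_rng(seed)
    rows = {"naive": [], "adjusted": []}
    for _ in range(n_sims):
        e1, e2, e3 = rng.standard_normal((3, n_obs))
        if structure == "fork":        # Z -> X, Z -> Y; true X -> Y effect = 0
            x, y, ctrl = e1 + e2, e1 + e3, e1
        elif structure == "chain":     # X -> M -> Y; true total = 0.8 * 0.8
            x = e1
            ctrl = 0.8 * x + e2        # mediator M
            y = 0.8 * ctrl + e3
        elif structure == "collider":  # X -> C <- Y; true X -> Y effect = 0
            x, y = e1, e2
            ctrl = x + y + e3          # collider C
        else:
            raise ValueError(structure)
        designs = (("naive", x), ("adjusted", np.column_stack([x, ctrl])))
        for name, design in designs:
            b, t = ols(y, design)
            rows[name].append((b[0], float(abs(t[0]) > 1.96)))
    # each value is (mean coefficient on X, rejection rate)
    return {k: tuple(np.mean(v, axis=0)) for k, v in rows.items()}

for s in ("fork", "chain", "collider"):
    print(s, simulate(s))
```

With only 300 simulations the rejection rates are noisy around their nominal values, so this sketch illustrates the direction of each bias rather than reproducing the 20,000-sim figures digit for digit.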
Make a factor regression lie
Pick a structure, then toggle 'condition on Z'. In a confounder, conditioning closes the spurious back-door and the association collapses to its true zero; in a mediator it blocks a real effect; in a collider it opens a path that was closed and manufactures one. The scatter and read-outs are computed from a fixed-seed sample.
The point of the playground is that bias is not a knife-edge pathology; it is the generic behavior of the wrong adjustment set, and it scales smoothly with the strength of the structure you mishandle. Slide the fork's confounder strength up and a null factor's naive coefficient climbs toward 0.50; slide the chain's path strength and the over-controlled estimate stays pinned at zero while the true effect grows; the collider invents a steadily larger phantom the harder its two parents push on it.
Demo: confounder / mediator / collider playground (interactive). The slider s sets the confounder loading (a = c). In the default fork view with conditioning off, the X-Y path is open: X and Y move together with no causal link between them, and the naive correlation is a purely spurious factor. Conditioning on Z moves the estimate toward the truth.
Bias scales with the structure, error hits 96% by strength 0.30
Sweeping the structural-bias strength makes the single quantitative claim that summarizes the whole critique: the naive or mis-conditioned estimator's decision error scales monotonically with the strength of the (mis)handled structure, already 96% by strength 0.30, while the structurally-correct estimator holds its nominal 5% error (fork, collider) or recovers full power (chain) at every dose.
Fork: false-positive rate (true X→Y = 0)
Collider: false-positive rate (true X→Y = 0)
Chain: power to detect a real effect (true total = b·d = s²)
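The fork curve can be approximated in closed form, which makes the "96% by strength 0.30" figure easy to sanity-check. The sketch below assumes (this parametrization is not stated in the source) that X = s·Z + e_x and Y = s·Z + e_y with unit-variance Z and noise, 2,000 observations per dataset, and a normal approximation to the t-statistic; `fork_naive_fpr` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def fork_naive_fpr(s, n_obs=2000, z_crit=1.96):
    """Approximate P(|t| > z_crit) for the naive Y ~ X fit in a fork.

    Assumes X = s*Z + e_x and Y = s*Z + e_y with unit-variance Z, e_x, e_y,
    so the true X -> Y effect is zero and every rejection is a false positive.
    """
    rho = s**2 / (1.0 + s**2)                            # spurious corr(X, Y)
    shift = rho * sqrt(n_obs - 2) / sqrt(1.0 - rho**2)   # approx. mean of t
    phi = NormalDist().cdf
    return phi(shift - z_crit) + phi(-shift - z_crit)

for s in (0.0, 0.1, 0.2, 0.3, 0.6, 1.2):
    print(f"s = {s:.1f}  approx. false-positive rate = {fork_naive_fpr(s):.3f}")
```

Under these assumptions the rate starts at the nominal 5% for s = 0 and is roughly 0.96 at s = 0.30, consistent with the sweep described above.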
Real cross-asset factors: backdoor adjustment flips verdicts
The same machine, on real data. Three lagged, causal factors (21-day momentum, 21-day realized vol as a low-vol proxy, and 5-day short reversal) are run through a pooled within-market panel regression of next-day return across 26 instruments in crypto, US equities and forex. We then add the market-mean (the obvious common driver of both factor and return) as a back-door control, and a second independent confounder, lagged common realized vol. A factor is a candidate survivor only if it clears both.
Four of nine cells are significant naively. The market-mean backdoor flips 2 verdicts, equities momentum and equities vol, and both are spurious_killed: the apparent factor was largely the market confounder. The median |t| attenuation across naively-significant cells is 67%: roughly two-thirds of the raw factor signal is confounder. Only 1 of 9 cells survives both backdoors (equities short-reversal). Even there the t drops from −12.0 to −4.9 under the market-mean control, and reversal is a microstructure / bid-ask-bounce effect, not a defended risk premium. Crypto vol even changes sign between the two confounders, a tell that it is a conditioning artifact, not structure.
| market | factor | naive t | +market t | +lagged-vol t | \|t\| atten. | flip |
|---|---|---|---|---|---|---|
| crypto | momentum | 2.26 | 2.11 | 1.73 | 7% | - |
| crypto | vol | 1.36 | 0.36 | −1.26 | 74% | - |
| crypto | reversal | −0.23 | 0.66 | −0.39 | - | - |
| equities | momentum | −6.39 | −1.60 | −6.26 | 75% | spurious_killed |
| equities | vol | 2.14 | 0.53 | 1.65 | 75% | spurious_killed |
| equities | reversal | −11.98 | −4.88 | −11.96 | 59% | - |
| forex | momentum | 0.02 | 0.50 | 0.01 | - | - |
| forex | vol | 0.76 | 0.18 | 0.51 | 76% | - |
| forex | reversal | −1.84 | −1.71 | −1.84 | 7% | - |
All factors are lagged (value at day t uses only information up to t−1), so nothing predicts return t with future data. The t-stats are from pooled within-market panel regressions on real daily returns.
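The headline counts can be re-derived from the table itself. The sketch below copies the nine rows of t-stats and recomputes the naive-significant cells, the verdict flips, the two-confounder survivors and the median attenuation. The variable names are illustrative, and two definitions are assumptions rather than taken from the project code: a "flip" is read as a change in the |t| > 1.96 verdict between the naive and +market specs, and attenuation as 1 − |t_market| / |t_naive|.

```python
from statistics import median

# (market, factor): (naive t, +market t, +lagged-vol t), copied from the table
T_STATS = {
    ("crypto", "momentum"): (2.26, 2.11, 1.73),
    ("crypto", "vol"): (1.36, 0.36, -1.26),
    ("crypto", "reversal"): (-0.23, 0.66, -0.39),
    ("equities", "momentum"): (-6.39, -1.60, -6.26),
    ("equities", "vol"): (2.14, 0.53, 1.65),
    ("equities", "reversal"): (-11.98, -4.88, -11.96),
    ("forex", "momentum"): (0.02, 0.50, 0.01),
    ("forex", "vol"): (0.76, 0.18, 0.51),
    ("forex", "reversal"): (-1.84, -1.71, -1.84),
}
CRIT = 1.96

def significant(t):
    return abs(t) > CRIT

naive_sig = [k for k, (naive, _, _) in T_STATS.items() if significant(naive)]
# Assumed flip definition: significant naively, not significant with market-mean
flips = [k for k in naive_sig if not significant(T_STATS[k][1])]
# Survivors must stay significant under each of the two confounder specs
survivors = [k for k in naive_sig
             if significant(T_STATS[k][1]) and significant(T_STATS[k][2])]
# Assumed attenuation definition: relative shrinkage of |t| under the market control
attenuation = {k: 1 - abs(T_STATS[k][1]) / abs(T_STATS[k][0]) for k in naive_sig}
median_atten = median(attenuation.values())

print(len(naive_sig), "naively significant;", len(flips), "flips;",
      len(survivors), "survivor(s); median attenuation", round(median_atten, 2))
```

Under those two definitions the arithmetic agrees with the text: 4 naively significant cells, 2 flips (equities momentum and equities vol), 1 survivor (equities reversal) and a median attenuation of about 67%.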
The best associational sleeve fails deflation
The headline metric is the Deflated Sharpe Ratio on the best naive long-short factor sleeve, treating the nine market×factor cells as the trial count. The best sleeve is equities low-vol, annualised Sharpe ≈ 0.47. Against the expected-max-Sharpe benchmark for nine trials:
DSR = 0.46, far below the 0.95 bar. The best-looking associational sleeve does not survive multiple-testing deflation; the “edge” is consistent with luck across nine trials.
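For readers who want the mechanics, here is a minimal sketch of the Deflated Sharpe Ratio following the standard Bailey and López de Prado (2014) formulas: an expected-maximum-Sharpe benchmark for N trials, then a probabilistic Sharpe ratio evaluated against that benchmark. The function names are illustrative, and the inputs in the usage lines (sample length, cross-trial Sharpe dispersion, skewness, kurtosis) are hypothetical placeholders because the source does not report them, so the printed value is not expected to reproduce 0.46.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()

def expected_max_sharpe(n_trials, sr_std):
    """Approximate E[max Sharpe] across n_trials strategies with zero true Sharpe.

    sr_std is the standard deviation of the Sharpe estimates across trials.
    """
    if n_trials < 2:
        raise ValueError("need at least 2 trials")
    return sr_std * (
        (1 - EULER_GAMMA) * _NORMAL.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * _NORMAL.inv_cdf(1 - 1 / (n_trials * e))
    )

def deflated_sharpe(sr, sr0, n_obs, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds sr0; sr and sr0 are per-period."""
    denom = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr**2)
    return _NORMAL.cdf((sr - sr0) * sqrt(n_obs - 1) / denom)

# Hypothetical inputs, for illustration only
daily_sr = 0.47 / sqrt(252)     # annualised 0.47 converted to a daily Sharpe
sr0 = expected_max_sharpe(n_trials=9, sr_std=0.02)
print(deflated_sharpe(daily_sr, sr0, n_obs=1500, skew=-0.3, kurt=6.0))
```

The design point is that the benchmark `sr0` rises with the number of trials, so the same observed Sharpe earns a lower DSR the more candidates were tried.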
The chain of evidence breaks where LdP says it does
A hierarchy-of-evidence scorecard walks each candidate signal up the ladder. The highest level any candidate reaches is 3: the lone survivor of the conditioning tests (equities reversal) is killed at L5 (deflation) and never reaches a defended causal graph or interventional evidence. Association is cheap, conditioning is fragile, and deflation plus a defended graph, which the literature almost never supplies, is where the “factors” disappear.
| level | question | verdict |
|---|---|---|
| L1 | Association exists? (raw \|t\|>1.96) | PASS: 4/9 cells significant naively |
| L2 | Survives the market-mean backdoor? | PASS: 2/9; 2 verdicts flip; 67% median \|t\| attenuation |
| L3 | Survives a second, independent confounder (lagged common vol)? | PASS: only 1/9 clears both (equities reversal) |
| L4 | Is the causal graph stated & defended? | FAIL: no defended SEM supplied for any survivor |
| L5 | Survives multiple-testing deflation (DSR>0.95)? | FAIL: best DSR 0.46 |
| L6 | Interventional / do-evidence? | FAIL: all evidence observational |
| L7 | True OOS / cross-market replication of the causal claim? | FAIL: no survivor replicates across all three markets |
Climbing toward L6
A perp-listing natural experiment gives the study its first non-observational evidence
Figure panels: continuation-sleeve abnormal return after perp listing (x-axis: trading days since perpetual listing); reversal-factor t-stat: discontinuity at the event.
Every result above is observational conditioning, so it tops out at L3. To reach for interventional (L6) evidence we need a shock whose timing is set by something other than the effect we are testing. A Binance USD-M perpetual-futures listing is exactly that: an exogenous structural change to an asset’s microstructure (leverage, short access, funding mechanics, a broad speculative base) whose date is chosen by the exchange, not by short-term reversal. That makes it a natural experiment, and we apply it to the program’s lone L3 survivor, short-term reversal.
The pre-registered, directional prediction: if reversal is a microstructure / liquidity-provision effect, its strength should jump discontinuously at the listing. For each of 170 qualifying listings the pre-window is the token’s own spot returns and the post-window its perpetual returns (same asset, the instrument switch is the treatment); the lagged 5-day reversal factor is computed within each side separately, so no value straddles the boundary. We fit a regression discontinuity in the factor effect (pre vs post slope, plus an interaction term) and an event-time abnormal return of a continuation sleeve, with a control set of ten established majors, cut at the same calendar dates but with no listing event, as a placebo / parallel-trends check.
| group | window | reversal slope | t-stat | n |
|---|---|---|---|---|
| treated | pre (spot) | −0.0008 | −0.50 | 8,671 |
| treated | post (perp) | −0.0050 | −6.49 | 9,350 |
| treated | post − pre (RD interaction) | −0.0042 | −2.55 | 18,021 |
| control (placebo) | pre (spot) | −0.0025 | −3.02 | 8,738 |
| control (placebo) | post (perp) | −0.0019 | −3.28 | 9,350 |
| control (placebo) | post − pre (RD interaction) | +0.0006 | +0.58 | 18,088 |
Pooled-panel reversal slope and t-stat, pre vs post the exogenous listing, with the regression-discontinuity interaction term. The treated discontinuity is significant (t = −2.55); the placebo control discontinuity is flat (t = +0.58).
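The "post − pre" row is the coefficient on an interaction term. A small sketch on synthetic data (not the listing panel) shows the mechanics: when both the intercept and the slope are allowed to differ by side, the interaction coefficient equals the post-window slope minus the pre-window slope. The slopes, noise level, sample sizes and the `fit` helper are hypothetical stand-ins.

```python
import numpy as np

def fit(y, X):
    """OLS with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 9000                                        # observations per side (hypothetical)
x = rng.standard_normal(2 * n)                  # stand-in for the standardized factor
post = np.repeat([0.0, 1.0], n)                 # 0 = pre-listing, 1 = post-listing
slope = np.where(post == 1.0, -0.005, -0.001)   # hypothetical true slopes
y = slope * x + 0.01 * rng.standard_normal(2 * n)

# Fully interacted model: y = a + b*x + c*post + d*(x*post); d is the jump
_, b_x, b_post, b_inter = fit(y, np.column_stack([x, post, x * post]))

# Separate fits on each side recover the same difference in slopes
b_pre_side = fit(y[post == 0.0], x[post == 0.0])[1]
b_post_side = fit(y[post == 1.0], x[post == 1.0])[1]
print(b_inter, b_post_side - b_pre_side)
```

The placebo check in the table amounts to running the same specification on the control assets, where the interaction should be indistinguishable from zero.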
In the treated assets reversal is absent before listing (t = −0.50) and strongly present after (t = −6.49); the discontinuity is significant (RD interaction t = −2.55). The control majors stay flat across the same dates (interaction t = +0.58), so the jump tracks the event, not the calendar. This is the first piece of evidence in the program that is not pure observational conditioning: it reaches partial L6 (interventional) support for a microstructure reading of reversal, and correspondingly weakens any reading of it as a defended risk premium.
Honest caveat. The discontinuity is a pooled-panel average: asset by asset, the post-listing reversal |t| exceeds the pre-listing |t| in only 48% of events. So this is evidence about the typical newly-listed token’s microstructure, not a per-asset law. It does not rescue the factor at L5 (the deflated-Sharpe verdict is unchanged), L4 (no formal defended graph), or L7 (no cross-market causal replication). The chain of evidence is exactly as fragile as before; the experiment sharpens what the surviving signal is (microstructure, not premium) without turning it into a risk factor.
Method
- Monte Carlo (part a): linear-Gaussian structural equation models, fork (Z→X, Z→Y), chain (X→M→Y) and collider (X→C←Y), with 20,000 independent datasets of 2,000 observations each. For every dataset we fit the naive simple regression Y~X and the structurally-motivated adjusted regression, record the sampling distribution of the X-coefficient, and the rejection rate at |t|>1.96.
- Decision-theoretic framing: for the fork and collider the true effect is 0, so a rejection is a Type-I false positive; for the chain the effect is real, so a non-rejection is a false negative (lost power). The dose-response sweep traces how naive vs. corrected decision error responds as the structural-bias strength rises over [0, 1.2].
- Real cross-asset panel (part b): daily simple returns for 26 instruments across three markets (11 crypto perps, 7 US-equity ETFs, 8 forex majors), built from 1-minute source data. Every factor is causal: each value at day t uses only information up to t−1 (explicit lag), so nothing predicts return t with future data. Real data only.
- Naive vs. backdoor: a pooled within-market panel regression of next-day return on the lagged, standardized factor, then the same with the market-mean (the obvious common driver) added, then with a second independent confounder, lagged common realized vol. A factor is a candidate survivor only if it clears both backdoors. We report the t-stat under each spec, whether the verdict flips, the kind of flip, and the |t| attenuation.
- Headline metric (part c): the Deflated Sharpe Ratio on the best naive long-short factor sleeve, treating the market × factor set (9 cells) as the trial count, with the expected-max-Sharpe benchmark from the False Strategy Theorem. A hierarchy-of-evidence scorecard pinpoints the rung at which each surviving signal breaks.
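The no-look-ahead convention in the method list can be made concrete with a small sketch. It assumes momentum is the trailing sum of daily simple returns (the source does not say whether returns are summed or compounded), uses a hypothetical helper name `lagged_momentum`, and runs on random stand-in returns rather than the project's panel. Tampering with returns from day 100 onward must leave every factor value up to and including day 100 unchanged.

```python
import numpy as np

def lagged_momentum(returns, window=21):
    """Factor for day t built only from returns on days t-window .. t-1."""
    returns = np.asarray(returns, dtype=float)
    factor = np.full(returns.shape, np.nan)     # first `window` rows stay NaN
    for t in range(window, returns.shape[0]):
        factor[t] = returns[t - window:t].sum(axis=0)   # slice excludes day t
    return factor

rng = np.random.default_rng(1)
r = 0.01 * rng.standard_normal((300, 5))        # stand-in panel: 300 days, 5 assets
f = lagged_momentum(r)

tampered = r.copy()
tampered[100:] += 1.0                           # change day 100 and everything after
f_tampered = lagged_momentum(tampered)

# Values through day 100 are untouched; day 101 already sees the change
print(np.allclose(f[:101], f_tampered[:101], equal_nan=True),
      np.allclose(f[101], f_tampered[101]))
```

A check of this kind is cheap to keep in a test suite, since a single off-by-one in the slice silently turns a lagged factor into a look-ahead one.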
Notes & honest assessment
This is a methodological / pedagogical contribution, not a “factor makes money” result, and it is framed that way throughout. The structural-equation and back-door results are standard Pearl / López de Prado; the real-data part is a demonstration of the critique, not the discovery of an edge. To upgrade toward a positive result one would need, for at least one factor, a defended causal graph and interventional / natural-experiment evidence, which this study deliberately leaves open as the honest frontier. What it does cleanly is assemble, in one place, a Monte Carlo that reproduces the three biases with a dose-response and a bit-identical cross-check, a real multi-market no-look-ahead two-confounder demonstration, and a quantitative falsification scorecard.
Reproducibility
The simulation, the real-panel pipeline, the figures and the verification harness are collected in project 14 of lopez-de-prado-work-review. The playground on this page is self-contained: the closed forms of the simulated structures and the dose-response anchors are encoded in the component, so the mechanic it illustrates reproduces exactly on every load.
References
The primary sources for the critique reviewed here:
- López de Prado, M. (2023). Causal Factor Investing: Can Factor Investing Become Scientific? Cambridge Elements in Quantitative Finance.
- López de Prado, M., & Zoonekynd, V. (2024). Why Has Factor Investing Failed?: The Role of Specification Errors. SSRN working paper.
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.
- Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. Journal of Portfolio Management, 40(5).
See also
The companion study on multiple-testing deflation is Backtest Overfitting & the Deflated Sharpe Ratio; the selection-discipline theme runs through The edge is in the process; and the broader body of work is at Research.

