Why Most Retail Backtests Overfit — And the Institutional Fix

Most retail backtests overstate live performance by 2–5×. The reason is selection bias, and it is structural, not a discipline issue. Here's what institutional research does differently.

Sentivue Capital · 9 min read

A retail trader builds a strategy on TradingView. Sharpe of 2.3 over five years. Daily wins on the equity curve. Smooth gains, modest drawdowns. They deploy it. Within six months it's losing money. They blame the market.

The market is not the problem. Selection bias is the problem.

What selection bias is

Selection bias in backtests is the systematic overstatement of strategy performance caused by the process of choosing which strategy to deploy. Three vectors compound:

  1. Parameter search. Try 1,000 parameter combinations. Pick the best. The reported performance is the maximum of 1,000 noisy estimates of the same underlying edge — not an unbiased estimate.

  2. Rule iteration. Start with a base rule. Add a filter to reject some bad trades. Add another filter. Add an exit condition. Each addition makes the equity curve look better in-sample. Each addition is an implicit parameter. By the time you've added eight filters, you've optimized 50 parameters' worth of degrees of freedom — even if you only think of yourself as having "one strategy."

  3. Data dredging. Run the same strategy template across 50 instruments. Report the best three. The other 47 are silent. The reported "diversified portfolio" is a survivorship-biased subset.

These vectors compound multiplicatively. A retail backtester running medium-sized parameter sweeps across many instruments with iterative rule refinement is searching tens of thousands of strategies, even if it doesn't feel like it.
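
The first vector is easy to reproduce. The sketch below is a toy simulation, not a model of any real market: it assumes 1,000 strategies whose returns are pure Gaussian noise with zero true edge, and the function name, the 1% volatility, and the seed are illustrative choices. It computes each strategy's per-observation Sharpe and compares the average with the best one.

```python
import numpy as np


def best_of_n_sharpe(n_strategies=1000, n_obs=1000, seed=0):
    """Return (best, average) per-observation Sharpe across zero-edge strategies."""
    rng = np.random.default_rng(seed)
    # Each row is one strategy's return series: pure noise, true Sharpe = 0.
    returns = rng.normal(loc=0.0, scale=0.01, size=(n_strategies, n_obs))
    # Per-observation Sharpe (mean / sample standard deviation), not annualized.
    sharpes = returns.mean(axis=1) / returns.std(axis=1, ddof=1)
    return sharpes.max(), sharpes.mean()


best, average = best_of_n_sharpe()
print(f"average Sharpe across trials: {average:+.4f}")
print(f"best Sharpe across trials:    {best:+.4f}")
```

The average sits near zero, as it should for strategies with no edge. The best one does not, and the best one is the number that ends up in the report.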

The deflated Sharpe ratio

Bailey & López de Prado (2014) formalize the problem with the deflated Sharpe ratio (DSR). The DSR adjusts the reported Sharpe for the number of independent trials behind it:

DSR ≈ Sharpe − √(2 × ln(N) / T)

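where N is the number of trials and T is the number of observations in the sample. With N = 1,000 trials and T = 1,000 observations, the deflation is approximately 0.12: a backtest Sharpe of 1.5 is, after deflation, equivalent to a Sharpe of 1.38 from a single un-searched strategy. The subtraction only makes sense when the Sharpe is measured per observation, in the same units as T. If T counts daily observations and the Sharpe is annualized, the deflation term has to be annualized as well (multiplied by √252 for daily data), which makes the haircut considerably larger.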

This is a massive simplification — in reality the deflation depends on the dependence structure between trials — but the qualitative insight is robust: the more strategies you searched, the more your reported Sharpe overstates true edge.
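
As a rough illustration, the approximation can be written as two small functions. This is a sketch of the simplified formula in this article only, not the full deflated Sharpe ratio from Bailey & López de Prado; the function names are made up for the example, and the trials are assumed to be independent.

```python
import math


def sharpe_deflation(n_trials, n_obs):
    """Simplified haircut sqrt(2 * ln(N) / T), in per-observation Sharpe units."""
    if n_trials < 1 or n_obs < 1:
        raise ValueError("n_trials and n_obs must both be at least 1")
    return math.sqrt(2.0 * math.log(n_trials) / n_obs)


def deflated_sharpe(sharpe, n_trials, n_obs):
    """Reported Sharpe minus the simplified haircut (same units as `sharpe`)."""
    return sharpe - sharpe_deflation(n_trials, n_obs)


# How the haircut grows with the number of trials for a fixed sample length.
for n in (1, 10, 100, 1_000, 10_000):
    print(f"N = {n:>6}: deflation = {sharpe_deflation(n, 1_000):.3f}")
```

A single trial gets no haircut because ln(1) = 0. The haircut grows with the number of trials and shrinks as the sample gets longer.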

What institutions do differently

The institutional fix is not "more discipline." Disciplined retail researchers fall into the same traps. The fix is structural:

  1. Pre-registration. State the strategy hypothesis, the parameter range, and the evaluation metric before looking at data. Then run the analysis once. If it fails, the strategy fails. Do not iterate on the same data.

  2. Held-out final test set. Even with walk-forward optimization (WFO) and pre-registration, hold out a final 20% of history that no one in the research team has touched. Run the proposed live strategy on it once at the end. If the held-out metric is materially worse than the WFO result, the strategy is rejected.

  3. Strategy-search audit trail. Track every iteration. The total number of trials goes into the deflation calculation. If the number is large, the Sharpe threshold for deployment is correspondingly higher.

  4. Diverse research teams. When a single researcher iterates on a single strategy, they unconsciously optimize. Independent researchers reviewing the same hypothesis catch each other's biases.

  5. Live paper-trade as a separate evaluation phase. Even after WFO and held-out testing, paper-trade the strategy at zero risk for 1–3 months. Then deploy at conservative size. Then scale. See "From backtest to live."
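
Items 2 and 3 are the most mechanical of the five, so they are the easiest to sketch in code. Everything below is hypothetical scaffolding rather than a description of any particular firm's tooling: the names `ResearchLog`, `split_holdout`, and `passes_holdout` are invented for the example, the 0.5 ratio is an assumed stand-in for "materially worse", and the hurdle reuses the simplified deflation term from earlier in the article.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ResearchLog:
    """Hypothetical audit trail: one entry per backtest that was run."""
    trials: list = field(default_factory=list)

    def record(self, description, sharpe):
        self.trials.append((description, sharpe))

    @property
    def n_trials(self):
        return len(self.trials)

    def sharpe_hurdle(self, base_hurdle, n_obs):
        """Raise the deployment hurdle by sqrt(2 * ln(N) / T) for N logged trials."""
        n = max(self.n_trials, 1)
        return base_hurdle + math.sqrt(2.0 * math.log(n) / n_obs)


def split_holdout(observations, holdout_frac=0.2):
    """Chronological split: the final `holdout_frac` of history stays untouched."""
    if not 0.0 < holdout_frac < 1.0:
        raise ValueError("holdout_frac must be between 0 and 1")
    n_holdout = round(len(observations) * holdout_frac)
    cut = len(observations) - n_holdout
    return observations[:cut], observations[cut:]


def passes_holdout(wfo_sharpe, holdout_sharpe, min_ratio=0.5):
    """Reject when the held-out Sharpe falls below an assumed fraction of WFO."""
    if wfo_sharpe <= 0:
        return False
    return holdout_sharpe >= min_ratio * wfo_sharpe


history = list(range(1000))  # stand-in for 1,000 dated observations
research, holdout = split_holdout(history)
log = ResearchLog()
log.record("base rule", 0.04)
log.record("base rule + volatility filter", 0.06)
print(len(research), len(holdout), log.n_trials)
```

The split is chronological on purpose: a random split would leak later market conditions into the research set. The log only helps if every run goes into it, including the ones that were abandoned.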

Diagnostic tells of overfitting

  • Walk-forward Sharpe < 50% of in-sample Sharpe. Strong tell.
  • Sharp degradation when parameters are perturbed ±20%. Robust strategies degrade gradually; overfit strategies fall off cliffs.
  • Strategy uses 5+ filters with vague economic rationale. Each filter is a parameter; the whole stack is fitted noise.
  • Reported equity curve has fewer than 100 trades. With a sample that small, the Sharpe estimate is too noisy to separate edge from luck.
  • Strategy works on the test instrument but not on closely related instruments. Real edge generalizes; overfitting doesn't.
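
Several of these tells can be checked mechanically. The sketch below is one possible way to do it, with invented function names and two toy Sharpe surfaces standing in for real backtests. It cannot judge whether a filter has economic rationale, so it only counts filters.

```python
def overfit_flags(in_sample_sharpe, walk_forward_sharpe, n_trades, n_filters):
    """Return the diagnostic tells from the list above that a backtest trips."""
    flags = []
    if walk_forward_sharpe < 0.5 * in_sample_sharpe:
        flags.append("walk-forward Sharpe below 50% of in-sample")
    if n_filters >= 5:
        flags.append("5+ filters: check each one's economic rationale")
    if n_trades < 100:
        flags.append("fewer than 100 trades")
    return flags


def perturbation_drop(sharpe_fn, param, shift=0.2):
    """Worst relative Sharpe drop when `param` is moved by plus or minus `shift`."""
    base = sharpe_fn(param)
    if base <= 0:
        raise ValueError("baseline Sharpe must be positive to measure a drop")
    worst = min(sharpe_fn(param * (1 - shift)), sharpe_fn(param * (1 + shift)))
    return (base - worst) / base


# Toy Sharpe surfaces (hypothetical): one degrades gently, one is a narrow spike.
def robust(p):
    return 1.0 - 0.001 * abs(p - 20)


def cliff(p):
    return 1.5 if 19 <= p <= 21 else 0.1


print(f"robust drop: {perturbation_drop(robust, 20):.1%}")
print(f"cliff drop:  {perturbation_drop(cliff, 20):.1%}")
print(overfit_flags(in_sample_sharpe=2.3, walk_forward_sharpe=0.8,
                    n_trades=64, n_filters=8))
```

In practice `sharpe_fn` would rerun the backtest at the perturbed parameter value, and each of those reruns counts as a trial.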

Practical takeaways

  • Assume your backtest is overfit. Then ask how to falsify it.
  • The number of trials you ran is part of the strategy's metadata. Track it.
  • Robust strategies have economic rationale you can articulate before showing the backtest. If you can only explain why it works after seeing the equity curve, it's overfit.
