Backtest vs Walk-Forward vs Paper Trading: When to Use Which

Three evaluation regimes (single-period backtest, walk-forward optimization, and forward paper-trade) are often treated as alternatives in the retail literature. They are not alternatives. Each answers a different question, and together with a held-out test and a conservative live phase they should be applied as a sequence.

What each regime answers

Single-period backtest

Question: does this strategy concept have any historical signature?

Strength: fast, cheap, useful for hypothesis screening. Weakness: susceptible to overfitting; says nothing about generalization.

Use: initial concept validation. Hundreds of candidate strategies enter; the ones that don't work even on a single backtest don't deserve further effort.
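A minimal sketch of such a screen, assuming each candidate is represented by a list of per-period returns and that an annualized Sharpe ratio (risk-free rate taken as zero) is the screening metric. The function names, the 252-period year, and the threshold are illustrative assumptions, not part of any particular library.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, List, Sequence


def annualized_sharpe(returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of per-period returns (risk-free rate assumed zero)."""
    if len(returns) < 2:
        raise ValueError("need at least two returns")
    sd = stdev(returns)
    if sd == 0:
        raise ValueError("zero volatility; Sharpe ratio is undefined")
    return mean(returns) / sd * sqrt(periods_per_year)


def screen(candidates: Dict[str, Sequence[float]], min_sharpe: float) -> List[str]:
    """Return the names of candidates whose backtest Sharpe meets the threshold."""
    return [
        name
        for name, returns in candidates.items()
        if annualized_sharpe(returns) >= min_sharpe
    ]


# Hypothetical toy data: one candidate with positive drift, one with negative.
candidates = {
    "trend_a": [0.01, 0.02, 0.01, 0.02],
    "trend_b": [-0.01, -0.02, -0.01, -0.02],
}
survivors = screen(candidates, min_sharpe=0.0)
```

Passing this screen only earns a candidate the right to be tested further; it says nothing about generalization.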

Walk-forward optimization

Question: does this strategy generalize across time when re-fit on rolling data?

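Strength: structurally out-of-sample (OOS) within each step; tests parameter stability. Weakness: can still be meta-overfit by repeatedly tweaking the walk-forward (WFO) setup, such as window lengths or re-fit frequency, until the results look good; doesn't reflect live execution costs.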

Use: primary validation gate before any live consideration. Strategies that pass the single-period backtest but fail WFO are overfit, full stop.
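The rolling structure can be sketched as a generator of train/test index windows. This is a minimal illustration assuming fixed-length windows that advance by one test window per step; the function name and parameters are hypothetical, and real setups may use expanding windows or gaps between train and test.

```python
from typing import Iterator, Tuple


def walk_forward_windows(
    n_obs: int, train_size: int, test_size: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges that roll forward through time.

    Each test window starts right after its train window, so every test
    observation is out-of-sample relative to the fit that precedes it.
    """
    if train_size <= 0 or test_size <= 0:
        raise ValueError("train_size and test_size must be positive")
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # advance by one test window, so tests never overlap


# Toy example: 10 observations, fit on 4, test on the next 2.
windows = list(walk_forward_windows(n_obs=10, train_size=4, test_size=2))
```

The strategy is re-fit on each train range and scored only on the matching test range; the concatenated test-range results are what the WFO gate evaluates.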

Held-out test

Question: does WFO performance hold up on data the researcher has never seen?

Strength: truly OOS if the discipline is maintained (test once, no iteration). Weakness: single-shot; high variance in the result; tempting to take "just one more look."

Use: final research-stack gate before operations.
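One way to make the test-once discipline harder to break by accident is a wrapper that refuses a second evaluation. The class below is a hypothetical sketch; it only guards within a single process, so it is a reminder rather than an enforcement mechanism (re-running the script resets it).

```python
from typing import Callable, Sequence


class HeldOutSet:
    """Wraps held-out data so it can be evaluated exactly once per process."""

    def __init__(self, data: Sequence[float]) -> None:
        self._data = list(data)
        self._used = False

    def evaluate(self, metric: Callable[[Sequence[float]], float]) -> float:
        """Apply the metric to the held-out data; raise on any later call."""
        if self._used:
            raise RuntimeError("held-out set already consumed; no second look")
        self._used = True  # mark as consumed before computing the metric
        return metric(self._data)


# Hypothetical held-out returns, scored once with a simple total-return metric.
held_out = HeldOutSet([0.01, -0.02, 0.03])
total_return = held_out.evaluate(sum)

try:
    held_out.evaluate(sum)
    second_look_blocked = False
except RuntimeError:
    second_look_blocked = True
```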

Forward paper-trade

Question: does the strategy actually behave the way the simulator predicts when run on real-time data with realistic execution?

Strength: real-time execution; real spreads; real timing; data the strategy genuinely hasn't seen. Weakness: slow; small sample; doesn't catch edge degradation that takes longer than the paper-trade window.

Use: operational readiness gate before live capital.
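A simple check at this gate is to compare the fills the simulator assumed against the fills the paper account actually recorded. The sketch below assumes matched lists of fill prices and trade sides; the function name and the example prices are hypothetical.

```python
from statistics import mean
from typing import Sequence


def mean_slippage_bps(
    simulated_fills: Sequence[float],
    paper_fills: Sequence[float],
    sides: Sequence[int],
) -> float:
    """Average adverse slippage of paper fills vs. simulated fills, in basis points.

    sides holds +1 for a buy and -1 for a sell. A positive result means the
    paper fills were worse than the simulator assumed.
    """
    if not (len(simulated_fills) == len(paper_fills) == len(sides)):
        raise ValueError("inputs must have equal length")
    if not simulated_fills:
        raise ValueError("need at least one fill")
    per_trade = [
        side * (paper - sim) / sim * 10_000
        for sim, paper, side in zip(simulated_fills, paper_fills, sides)
    ]
    return mean(per_trade)


# Hypothetical fills: a buy filled 0.10 higher and a sell filled 0.20 lower
# than the simulator assumed, each about 10 bps adverse.
slippage = mean_slippage_bps(
    simulated_fills=[100.0, 200.0],
    paper_fills=[100.1, 199.8],
    sides=[+1, -1],
)
```

If the measured slippage is materially larger than the cost assumption used in the backtest and WFO stages, the earlier results overstate the edge.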

Live conservative

Question: does the strategy work with real money on the line?

Strength: the only true measure; nothing approximates real fills, real slippage, real psychological pressure. Weakness: costs money to run; the data you generate is the data you're learning from.

Use: scale-up decision input. Do not skip.

The right sequence

Concept → Single backtest → WFO → Held-out → Paper-trade → Conservative live → Scale

Each gate has a defined pass/fail threshold pre-committed to before evaluation. Strategies that fail are retired or returned to research; they are not "tweaked to pass."
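Pre-commitment can be made concrete by writing the thresholds down as immutable data before any evaluation runs. The sketch below assumes one score per gate (for example a Sharpe ratio); the gate names mirror the sequence above, while the threshold values and function name are purely hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)  # frozen: thresholds cannot be edited after creation
class Gate:
    name: str
    minimum: float  # pre-committed pass threshold for this gate's score


# Hypothetical thresholds, fixed before any evaluation is run.
PIPELINE = (
    Gate("single_backtest", 0.5),
    Gate("walk_forward", 0.8),
    Gate("held_out", 0.8),
    Gate("paper_trade", 0.6),
    Gate("conservative_live", 0.6),
)


def first_unpassed_gate(
    observed: Dict[str, float], pipeline: Sequence[Gate] = PIPELINE
) -> Optional[str]:
    """Return the first gate that failed or has not been evaluated yet.

    Returns None only if every gate, in order, met its threshold.
    """
    for gate in pipeline:
        value = observed.get(gate.name)
        if value is None or value < gate.minimum:
            return gate.name
    return None


# A strategy that cleared the single backtest but fell short in walk-forward.
# The later held-out score is ignored because the sequence stops at the failure.
stuck_at = first_unpassed_gate(
    {"single_backtest": 1.4, "walk_forward": 0.6, "held_out": 2.0}
)
```

Because the gates are checked in order, a strong score at a later stage cannot rescue a failure at an earlier one.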

What goes wrong when researchers shortcut

Skipping WFO after a clean single backtest deploys overfit strategies.
Skipping held-out after a clean WFO deploys meta-overfit strategies.
Skipping paper-trade after a clean held-out deploys strategies with execution issues that surface only in real-time.
Skipping conservative live ramps capital before regime fit is confirmed.

Each shortcut feels efficient. Each compounds the failure rate.

The asymmetry of effort

The research effort to develop a strategy is typically 1–4 weeks. The operational effort to deploy it through the full sequence is typically 3–6 months. Researchers who skip operational steps are optimizing for time-to-deploy at the cost of deployment success rate. The math works against them.

Practical takeaways

The five regimes are sequential, not alternative.
Each gate has a pre-committed threshold. No in-the-moment relaxation.
The researcher who skips operational steps is optimizing the wrong objective.