Statistical Significance in Trading: How Many Trades You Actually Need

Most retail backtests claim edge they cannot prove. Here's the math on minimum sample sizes for credible strategy validation — and why "looks profitable" usually doesn't pass the bar.

Sentivue Capital · 8 min read

"This strategy has a Sharpe of 1.5 over 18 months."

That sentence is, statistically, almost meaningless. With 18 months of daily data, the annualized Sharpe estimate has a standard error of roughly 0.8, so at the 95% confidence level the strategy's true Sharpe lies somewhere between about -0.1 and 3.1. That range includes "edgeless" and "exceptional." The point estimate of 1.5 is the wrong number to focus on.

This is the single most common failure mode in retail systematic trading: claiming edge from samples too small to prove it.

The math

For a strategy with no autocorrelation in returns, the standard error of the Sharpe ratio estimate is approximately:

SE(Sharpe) ≈ √((1 + Sharpe² / 2) / T)

where T is the number of return observations and Sharpe is measured at the same frequency as those observations. For a Sharpe of 1.0:

Observations (T)    SE      95% CI on Sharpe estimate
50                  0.17    (0.66, 1.34)
100                 0.12    (0.76, 1.24)
250                 0.08    (0.85, 1.15)
500                 0.05    (0.89, 1.11)
1,000               0.04    (0.92, 1.08)
2,500               0.02    (0.95, 1.05)

As a working threshold, plan on roughly 250 independent return observations before treating a Sharpe of 1.0 as distinguishable from a Sharpe of 0.5 at 95% confidence. That's about a year of daily data, for a strategy that trades daily.
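The standard-error formula above can be evaluated directly. The following is a minimal sketch, not a definitive implementation: the function names are hypothetical, the Sharpe ratio is assumed to be expressed per observation period rather than annualized, and returns are assumed to be uncorrelated, as the formula requires.

```python
import math


def sharpe_standard_error(sharpe: float, n_obs: int) -> float:
    """Approximate SE of a Sharpe estimate from n_obs uncorrelated returns.

    Assumption: `sharpe` is in the same units as the observations
    (per-period, not annualized).
    """
    if n_obs <= 0:
        raise ValueError("n_obs must be positive")
    return math.sqrt((1.0 + 0.5 * sharpe ** 2) / n_obs)


def sharpe_confidence_interval(sharpe: float, n_obs: int, z: float = 1.96):
    """Symmetric normal-approximation interval; z=1.96 gives roughly 95%."""
    se = sharpe_standard_error(sharpe, n_obs)
    return sharpe - z * se, sharpe + z * se


if __name__ == "__main__":
    low, high = sharpe_confidence_interval(1.0, 250)
    print(f"SE={sharpe_standard_error(1.0, 250):.3f}  CI=({low:.2f}, {high:.2f})")
```

Reporting the interval rather than the point estimate makes it harder to mistake a noisy number for a known one.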

The trade-count gotcha

The relevant sample size is the number of independent return observations, not the number of trades or the elapsed calendar time. A strategy that holds positions for about a week at a time produces roughly one independent return observation per week, even if you sample daily. For that strategy, a sample of 1,000 daily observations is statistically equivalent to about 200 independent samples.

This means strategies with longer holding periods need many calendar years of history to be statistically credible. A weekly-rebalanced strategy needs about 5 years (roughly 250 weekly observations) for a Sharpe of 1.0 to be statistically distinguishable from 0.5.
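The conversion from raw samples to independent observations can be sketched as below. This is a simplification under stated assumptions: each non-overlapping holding period is treated as one independent observation, a year is taken as 252 trading days, and both function names are hypothetical.

```python
def effective_observations(n_samples: int, samples_per_holding_period: float) -> float:
    """Rough count of independent observations in a sample.

    Assumption: each non-overlapping holding period contributes
    one independent return observation.
    """
    if samples_per_holding_period < 1:
        raise ValueError("holding period must span at least one sample")
    return n_samples / samples_per_holding_period


def calendar_years_needed(target_obs: int, holding_period_days: float,
                          trading_days_per_year: int = 252) -> float:
    """Calendar years of history needed to collect target_obs independent observations."""
    return target_obs * holding_period_days / trading_days_per_year


if __name__ == "__main__":
    # 1,000 daily samples of a strategy with a 5-trading-day holding period
    print(effective_observations(1000, 5))
    # History needed for 250 independent observations at a weekly holding period
    print(round(calendar_years_needed(250, 5), 1))
```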

What this means in practice

  • Strategies tested on less than ~250 independent return observations should be considered hypothesis-grade, not deployable.
  • Live track records under one year are noise-dominated for typical systematic frequencies.
  • The headline Sharpe is the point estimate; the standard error is the strategy's actual story. Always report both.

Why retail backtests routinely violate this

Retail backtest tools encourage running strategies on whatever data is convenient — often 2–5 years. The sample is too short for the claimed Sharpe levels. The trader doesn't see the standard error because the tool doesn't compute it. The headline number gets cited as if it were known precisely.

The fix:

  1. Compute SE alongside Sharpe. Always.
  2. Default to longer history. 10+ years where instrument data permits.
  3. Account for overfitting deflation. The best Sharpe found in a search overstates true edge, and the overstatement grows with the number of strategy variants tried.
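Steps 1 and 3 can be sketched in a few lines. This is an illustration under assumptions, not the method of any particular backtest tool: the function names are hypothetical, returns are assumed to be excess returns with no autocorrelation, and the overfitting benchmark uses a rough approximation (the expected maximum of N independent standard normal draws is at most about √(2 ln N)) that assumes the trials are independent.

```python
import math
import statistics


def sharpe_with_se(returns):
    """Per-period Sharpe and its approximate standard error.

    Assumption: `returns` are excess returns (risk-free rate already removed).
    """
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two return observations")
    stdev = statistics.stdev(returns)  # sample standard deviation
    if stdev == 0:
        raise ValueError("returns have zero variance")
    sharpe = statistics.fmean(returns) / stdev
    se = math.sqrt((1.0 + 0.5 * sharpe ** 2) / n)
    return sharpe, se


def best_of_n_noise_sharpe(n_trials: int, n_obs: int) -> float:
    """Rough per-period Sharpe that the best of n_trials edgeless strategies
    could show by luck alone (independent trials assumed)."""
    if n_trials < 2:
        return 0.0
    return math.sqrt(2.0 * math.log(n_trials)) / math.sqrt(n_obs)


if __name__ == "__main__":
    sharpe, se = sharpe_with_se([0.01, -0.005, 0.02, 0.0, 0.015])
    print(f"Sharpe={sharpe:.2f} +/- {se:.2f}")
    print(f"noise benchmark, 100 trials: {best_of_n_noise_sharpe(100, 250):.2f}")
```

A backtested Sharpe that does not clearly exceed the noise benchmark for the number of variants tried is hard to separate from a lucky draw.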

Hypothesis-grade vs deployment-grade

A reasonable Sentivue rubric:

  • Hypothesis-grade (fewer than 250 independent observations): strategy is interesting; warrants further research; not deployable.
  • Deployment-candidate (250–1,000 observations): plausibly real edge; deploy at conservative size with monitoring; scale only on live confirmation.
  • Deployment-grade (1,000+ observations): edge is statistically established; sizing decisions are about Kelly and drawdown, not about whether the edge exists.
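The rubric above is simple enough to encode as a check in a research pipeline. A minimal sketch follows; the function name is hypothetical, and where the rubric's ranges meet, the boundary handling (exactly 250 counts as deployment-candidate, exactly 1,000 as deployment-grade) is an assumption.

```python
def grade_sample(n_independent_obs: int) -> str:
    """Map a count of independent return observations to a rubric tier."""
    if n_independent_obs < 0:
        raise ValueError("observation count cannot be negative")
    if n_independent_obs < 250:
        return "hypothesis-grade"
    if n_independent_obs < 1000:
        return "deployment-candidate"
    return "deployment-grade"


if __name__ == "__main__":
    for n in (120, 400, 2500):
        print(n, grade_sample(n))
```

The input should be the independent observation count, not the raw number of daily samples.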

Practical takeaways

  • 18-month backtests do not prove anything except that the strategy worked over those 18 months.
  • Standard errors on Sharpe estimates are large at retail-typical sample sizes. The point estimate is the start of the conversation, not the end.
  • The number that matters most is the standard error, not the Sharpe.
