Backtesting our Delusional Love Letters to Ourselves

Backtesting is perhaps the most essential component of trading: it allows us to pretend at scientific rigor as we define the rules, feed in the data, calculate Sharpe ratios, and indulge in a variety of other metrics that promise objectivity. More than anything else, I feel like a backtest measures not only the quantitative trading system but the quant themselves. Even though we may want backtests to balance our intuition, more often than not they are used to soothe our anxiety.

The most insidious bias in backtesting is what statisticians Gelman and Loken (2013) called the garden of forking paths. They argue that even when researchers do not intend to p-hack or cherry-pick results, there remain degrees of researcher freedom that can invalidate statistical inferences. To illustrate the phenomenon, they walk through several studies showing that an analysis performed on a given dataset, even one honestly aimed at testing a particular hypothesis, depends entirely on the chain of research choices that led to that hypothesis: a study on political ideology and context must decide whether to split results by gender, and a study on men's upper-body strength must decide whether the relationship between arm circumference and a narrow college-aged cohort generalizes at all. In finance, this problem is amplified tenfold: a simple momentum strategy can look vastly different with a 20-day, 40-day, or 252-day lookback. What if we include mid-caps and small-caps? How often should we rebalance? Every fork in the decision tree can pop a significantly different equity curve out of your model, and there is no way of telling which one is the whole, objective truth. The more trials we run, the higher the probability that some fork is just noise dressed up as reality.
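To make the lookback fork concrete, here is a minimal sketch on synthetic random-walk prices (not real market data, and not any particular strategy): the same naive momentum rule, changing nothing but the lookback window, produces materially different equity curves.

```python
# Hypothetical illustration of the "forking paths" problem: one momentum
# rule, three lookbacks, three different equity curves. Data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
# ~10 years of daily prices from pure noise with a tiny drift.
prices = 100 * np.cumprod(1 + rng.normal(0.0003, 0.01, 2520))

def momentum_equity(prices, lookback):
    """Long when yesterday's price exceeds its level `lookback` days
    earlier, flat otherwise (signal lags returns by one day: no lookahead)."""
    returns = np.diff(prices) / prices[:-1]
    signal = prices[lookback:-1] > prices[:-lookback - 1]
    return np.cumprod(1 + signal * returns[lookback:])

for lb in (20, 40, 252):
    curve = momentum_equity(prices, lb)
    print(f"lookback {lb:>3}: final equity multiple {curve[-1]:.2f}")
```

Each run is a legitimate-looking backtest of "momentum"; the divergence between them is the researcher freedom Gelman and Loken describe.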

The next obvious step from questioning the tests themselves is to question the parameters we use to test the assumptions. What makes 20-day lookbacks, or 50-day and 200-day moving-average crossovers, the basis of so many trading models? Even though changing these parameters can radically change the decision-making process, we stick with them out of cultural convention. This subjectivity is often invisible, but it still exists. We are drawn to equity curves that look smooth, steady, and reassuring. We distrust strategies that produce erratic results even if their long-run risk-adjusted return is superior. Taleb observed this tendency in Fooled by Randomness: "People prefer strategies that provide steady returns and ignore the possibility of rare, catastrophic losses" (2001, Ch. 6, p. 87). Famous quant Benn Eifert often talks about this on X (formerly Twitter) as an antagonist to a vocal minority of option traders advocating for "the wheel" strategy: essentially selling cash-secured puts and covered calls, averaging down as necessary. While the income looks stable most of the time, the left-tail risk of a sharp sell-off should scare the average retail trader into at least understanding derivatives first. This aesthetic bias is not trivial: it influences which strategies survive, which get published, and which attract investor attention. Our pattern-seeking brains are especially vulnerable. "We are natural pattern seekers; we see order in randomness and mistake it for meaning" (Fooled by Randomness, Ch. 11, p. 148). A clean equity curve tells a compelling story, even if that story is noise.
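The smooth-versus-erratic preference can be simulated directly. Below is a hedged sketch on purely synthetic returns (the numbers are invented for illustration): a "steady" stream that quietly carries rare crash days next to an honestly noisy one. Summary statistics alone do not reveal which one hides the tail.

```python
# Synthetic illustration (not market data): a smooth-looking return stream
# with hidden tail risk vs. an erratic one. Parameters are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
n = 2520  # ~10 years of trading days

# Smooth: earn 5bp a day, but lose 25% on rare (p = 0.2%) crash days.
crash = rng.random(n) < 0.002
smooth = np.where(crash, -0.25, 0.0005)

# Erratic: plain noisy returns with the same unconditional mean.
erratic = rng.normal(smooth.mean(), 0.01, n)

for name, r in [("smooth (hidden tail)", smooth), ("erratic", erratic)]:
    sharpe = r.mean() / r.std() * np.sqrt(252)
    print(f"{name:>20}: ann. Sharpe {sharpe:+.2f}, worst day {r.min():+.2%}")
```

Until a crash day lands inside the sample, the smooth stream produces exactly the reassuring equity curve the aesthetic bias rewards; the worst-day column is what the Sharpe ratio quietly ignores.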

When it comes to backtesting, a trend follower will always "find" trend. A value quant will always "find" value. A machine learning enthusiast will always "discover" nonlinear predictive structure. Backtests are less about what the market reveals and more about what the analyst expects to find. This is compounded by survivorship bias. As Taleb wrote: "We see the winners and try to learn from them, while forgetting the huge number of losers" (Fooled by Randomness, 2001, Ch. 8, p. 115). The danger here is not only statistical, but emotional. Statistically, multiple forks and biased parameters make overfitting nearly inevitable. "Backtest overfitting is not only likely but nearly certain when multiple specifications are tested" (Bailey et al., 2014, p. 1). Emotionally, we fall in love with our own equity curves. We choose strategies that look comfortable, not ones that are robust.
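The Bailey et al. point is easy to demonstrate: generate many strategies that are nothing but noise, then report only the best one. A short sketch (all data synthetic, trial count arbitrary):

```python
# Why "best of many backtests" is suspect: 200 pure-noise strategies,
# zero true edge by construction, yet the best in-sample Sharpe looks real.
import numpy as np

rng = np.random.default_rng(2)
days, trials = 252, 200  # one year of daily returns, 200 specification "forks"

noise = rng.normal(0, 0.01, (trials, days))  # zero-mean noise: no real edge
sharpes = noise.mean(axis=1) / noise.std(axis=1) * np.sqrt(252)

print(f"best of {trials} noise strategies: ann. Sharpe {sharpes.max():.2f}")
print(f"median noise strategy:          ann. Sharpe {np.median(sharpes):.2f}")
```

The median strategy hovers near zero, as it should; the maximum does not, purely because we searched. Reporting only the winning fork is exactly the overfitting Bailey et al. warn about.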

The discipline, then, is to read our own love letters critically.