All that glitters is not gold: Comparing backtest and out-of-sample performance on a large cohort of trading algorithms Wiecki et al., 2016
The quality and reliability of common algorithm performance metrics make for an interesting area of research. The widespread application of data science to finance has enabled quants and traders to easily test and evaluate their strategies against previous market conditions. Backtesting on prior market data often forms the backbone of algorithm development, but how indicative are past results of future performance? The Quantopian team set out to answer this question, utilizing their vast database of backtest metrics from hundreds of different algorithms. The authors focus on studying how effectively a few commonly used metrics capture the relationship between in-sample (IS) and out-of-sample (OOS) performance.
Backtest results are often used as a proxy for the expected future performance of a strategy. Thus, in an effort to optimize expected out-of-sample (OOS) performance, quants often spend considerable time tuning algorithm parameters to produce optimal backtest performance on in-sample (IS) data.
The Analysis Process
Since this study was completed mainly by the team at Quantopian.com, the authors had unique access to the data required to undertake a study of backtesting performance metrics. Although they could not directly inspect the code of each algorithm due to license agreements with its creators, they were able to use the resulting metrics from each algorithm. In total, the authors used a sample of 888 algorithms that met their criteria for the study (no duplicates, trades multiple stocks, etc.).
The backtest results are used to compute 57 individual features, applied separately to both the IS and OOS data (all features are identified in the appendix of the study). However, the authors focus the analysis on a handful of the most commonly used performance metrics:
We constructed features based on point estimates of several well-known performance and risk metrics such as the Sharpe ratio, information ratio, Calmar ratio, alpha, beta, maximum drawdown, as well as annual returns and volatility.
The authors made an important addition by including the 'total number of backtest days' as a feature, in order to capture the influence of strategy development techniques and biases. This proves to be an important metric that sheds light on potential overfitting from running repeated backtests.
In-Sample vs Out-of-Sample Results
Our first set of analyses aims to evaluate the degree to which various IS performance metrics correlate with the same metric computed over only the OOS period.
The authors use the Sharpe ratio as a baseline for the maximum effect size to expect, establishing it by comparing the Sharpe ratio over the last year of the IS period with that of the preceding IS period and finding a relatively strong and highly significant linear relationship (R² = 0.21, p < 0.0001). As a refresher before we proceed further, the Sharpe ratio is a metric that measures the return of an investment relative to its overall risk.
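As a quick, hypothetical illustration (not code from the paper), an annualized Sharpe ratio can be computed from daily returns roughly as follows, assuming a zero risk-free rate and 252 trading days per year:

```python
import numpy as np

def annualized_sharpe(daily_returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from a series of daily returns.

    Assumes a zero risk-free rate by default; the exact convention
    used in the paper may differ.
    """
    excess = np.asarray(daily_returns) - risk_free_rate / periods_per_year
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

# Toy example: two years of simulated daily returns
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0005, scale=0.01, size=504)
print(annualized_sharpe(returns))
```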
The next step is a comparison of IS and OOS periods of each of the most common metrics mentioned earlier.
The main takeaway from the resulting graph is the negative correlation between the IS and OOS annual returns and the slight positive relationship between the IS and OOS Sharpe ratio. The authors note this is likely due to the Sharpe ratio calculation itself:
With the annual return (Rp) in the numerator and volatility (σp) in the denominator, the IS Sharpe ratio can be increased either by increasing annual returns or by decreasing volatility. This leads some algorithm developers to focus on maximizing the returns of their algorithms without taking volatility into account. The paper then shows that algorithms which increase their annual returns while taking on additional volatility have a worse OOS Sharpe ratio than those that keep their volatility in check, producing the opposite correlations outlined above.
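For reference, the conventional definition is (with R_f the risk-free rate, often taken as zero in this setting):

$$\text{Sharpe ratio} = \frac{R_p - R_f}{\sigma_p}$$

so a strategy that, say, doubles its annual return while tripling its volatility ends up with a lower ratio.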
I mentioned earlier that the authors' inclusion of the total number of days a strategy was backtested as a feature is particularly interesting. Quants have the freedom to run backtests continuously, making it easy to unknowingly overfit an algorithm to the IS data. To quantify the effects of overfitting, the study calculates a 'Sharpe ratio shortfall': the difference between the IS and OOS Sharpe ratios. This yields a weak but highly significant positive correlation between the logarithm of backtest days and the Sharpe ratio shortfall, suggesting that overfitting becomes more prominent as the number of backtests increases, inflating the IS Sharpe ratio while the OOS Sharpe ratio deteriorates.
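A rough sketch of that calculation (using simulated data and scipy's linear regression rather than the authors' actual pipeline) might look like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_algos = 888

# Hypothetical per-algorithm data: IS/OOS Sharpe ratios and number of backtest days
backtest_days = rng.integers(10, 2000, size=n_algos)
is_sharpe = rng.normal(1.0, 0.8, size=n_algos) + 0.3 * np.log(backtest_days)
oos_sharpe = rng.normal(0.2, 0.8, size=n_algos)

# "Sharpe ratio shortfall": how far the OOS Sharpe falls short of the IS Sharpe
shortfall = is_sharpe - oos_sharpe

# Regress the shortfall on the log of backtest days, mirroring the paper's analysis
slope, intercept, r_value, p_value, _ = stats.linregress(np.log(backtest_days), shortfall)
print(f"R^2 = {r_value**2:.3f}, p = {p_value:.2g}, slope = {slope:.3f}")
```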
Additionally, the study found that annual IS volatility shows a positive and significant correlation with the Sharpe ratio shortfall. By the same logic as above, this highlights that as volatility increases, so does the likelihood of overfitting.
Utilizing Machine Learning for Non-Linear Regression
Machine learning (ML) methods have proven useful in many applications for analyzing a full feature set, so they are evaluated here in terms of predicting OOS Sharpe ratio performance.
We next asked if non-linear regression methods trained on the full feature set could do better at predicting OOS Sharpe ratio performance. Towards this goal, we explored a number of machine learning techniques, including Random Forest and Gradient Boosting, to predict OOS Sharpe. To avoid overfitting, all experiments used a 5-fold cross-validation during hyperparameter optimization and a 20% hold-out set to evaluate performance.
Utilizing a Random Forest Regressor as a portfolio selection tool, the authors measure its performance by computing the OOS cumulative returns and Sharpe ratio of an equal-weighted portfolio of 10 strategies, chosen as those with the highest Sharpe ratios predicted by the regressor on the hold-out set. This portfolio is then compared to 1,000 random portfolios and to a portfolio formed by simply selecting the strategies with the 10 highest IS Sharpe ratios.
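A minimal sketch of this selection procedure, using simulated data and simplified hyperparameters (the paper's cross-validated tuning and return-based portfolio metrics are not reproduced here), could look like:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical data: 888 algorithms x 57 IS features, target = OOS Sharpe ratio
X = rng.normal(size=(888, 57))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.5, size=888)

# 20% hold-out set, mirroring the paper's evaluation setup
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)

# Equal-weighted portfolio of the 10 strategies with the highest predicted OOS Sharpe
predicted = rf.predict(X_hold)
top10 = np.argsort(predicted)[-10:]
ml_portfolio_score = y_hold[top10].mean()  # stand-in for the portfolio-level OOS metric

# Benchmark: 1,000 random 10-strategy portfolios drawn from the same hold-out set
random_scores = np.array([
    y_hold[rng.choice(len(y_hold), size=10, replace=False)].mean()
    for _ in range(1000)
])
print("ML selection beats", (ml_portfolio_score > random_scores).mean() * 100, "% of random portfolios")

# Feature importances hint at which IS metrics carry predictive information
print(rf.feature_importances_[:5])
```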
The results are surprising: the portfolio ranked by predicted Sharpe ratios performs better than 99% of the randomly selected portfolios, with an overall Sharpe ratio of 1.8, whereas the portfolio selected on IS Sharpe ratios performs better than only 92.16% of the random portfolios, with a lower Sharpe ratio of 0.7. The authors note that while selecting on the IS Sharpe ratio performs reasonably well, it does not beat the random portfolios at a statistically significant threshold; the ML-based selection does, which speaks well for the value of non-linear ML techniques when constructing portfolios of trading algorithms.
Key Takeaways
Together, these sobering results suggest that a reported Sharpe ratio (or related measure) based on backtest results alone can not be expected to prevail in future market environments with any reasonable confidence.
Overall, the authors find very weak correlations between IS and OOS performance for most of the common metrics, including the ones focused on in this study: the Sharpe ratio, information ratio, and alpha. However, in terms of predicting OOS performance, they found the Sharpe ratio calculated over the last IS year to have the highest predictive power, a result corroborated by the Random Forest Regressor, which also flags it as one of the most predictive features.
The Random Forest Regressor also ranks the most predictive features via its feature importances. It is important to note that several of the most important features are higher-order moments (skew, kurtosis, tail ratio), which suggests that predictive information may be extracted in a non-linear and multivariate way.
In addition to the strong correlation between volatility and backtest overfitting discussed above, the authors also note an increasing difference between IS and OOS performance as the number of completed backtests increases. This empirical evidence makes a strong case for the effects of backtest overfitting.
As I was studying the paper, it was easy to notice that the overall predictability of OOS performance is usually quite low as measured by R², with all results having an R² < 0.25. Essentially, this means that less than 25% of the variance in OOS performance is explained by the metric in question. The authors also note this in the discussion section as additional evidence against the possibility of predicting OOS profitability from backtest data alone.
The Final Word
Thanks to the massive Quantopian algorithm database, this study sets a useful baseline for further research into how well IS metrics predict OOS performance. The study highlights empirical evidence for two key ideas:
- Avoid over-reliance on any single algorithm performance metric, just as one should not pick a stock based solely on its P/E ratio!
- Avoid overfitting an algorithm to your backtests. We learned the importance of running fewer backtests and focusing more of your development effort on the initial research phase. This ensures the underlying ideas of the algorithm are stronger and helps mitigate the possibility of overfitting. A common additional safeguard is to keep a hold-out set of at least the most recent 6 months of data and run one final backtest on it at the end of the development process (a minimal sketch of such a split follows below).
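As a minimal illustration of that last safeguard (the data here is simulated and the six-month cutoff is the only fixed choice), one could carve off the hold-out set before any development begins:

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns indexed by business day
dates = pd.bdate_range("2013-01-01", "2016-01-01")
returns = pd.Series(np.random.default_rng(7).normal(0.0005, 0.01, len(dates)), index=dates)

# Reserve the most recent 6 months as a hold-out; develop only on the rest
cutoff = returns.index.max() - pd.DateOffset(months=6)
development_set = returns[returns.index <= cutoff]   # used for research and backtests
holdout_set = returns[returns.index > cutoff]        # touched once, for a final backtest

print(len(development_set), "development days;", len(holdout_set), "hold-out days")
```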