In machine learning, standard or vanilla cross validation forms the foundation for testing the robustness of a model. It assesses a model's ability to generalize by splitting the dataset into a training set and a testing set, fitting the model on the training set, and then evaluating it on the testing set. The primary use of cross validation is to prevent overfitting. While overfitting has many different connotations depending on the context, the concept remains the same: an overfit model has learnt all the features of its training data, including the noise, and is therefore unable to generalize to new data and predict different outcomes.

Walk forward optimization, on the other hand, reduces overfitting by testing each part of the data in a forward-looking manner, ensuring that no single lucky validation period flatters the results. Unlike traditional backtesting (which assumes that parameters remain effective indefinitely), this reflects how traders actually operate, continually re-estimating parameters as new data becomes available. Each time period also serves a dual purpose, first being used as an out-of-sample validation period and later being folded back into the in-sample optimization window. The main disadvantages of walk forward optimization are its computational cost (it requires multiple rounds of optimization and validation), the window selection bias introduced by how large you make your training windows, and the need to make sure that regime shifts are accounted for in the model.

The differentiating factor in finance is that time series data behaves differently in this field than in others. Any labelled data point in a financial time series has a trade time and an event time. The event time indicates when, in the future, the mark-to-market value of the asset reaches a level at which it becomes appropriate to exit the trade. In this context, particular attention must be paid to ensure that the data is not contaminated when a trade's label spans the boundary between the training and testing sets, for example when a trade is entered in one fold but not exited until the other. What this does is give the model a sneak peek into the future and thus makes it look far better than it is. Since these are all path dependent labels, we must safeguard them.

The solution to this, as de Prado writes in "Advances in Financial Machine Learning", is the combined process of purging and embargoing. We purge the dataset (removing from the training set any observations whose event times overlap with the trade times in the test fold) and then embargo it (dropping an additional slice of time immediately after the test fold so that serially correlated information cannot leak from the test fold into the training data that follows it). A minimal sketch of both of these splitters follows below. In settings like Kaggle competitions this doesn't really matter, as the data is already tabular and close to iid. To show how cross validation lies, I wrote a piece of code that analyzes the daily log returns of the S&P 500 over the ten-year period from 2014 to 2024 and runs different regression models, comparing vanilla K-fold, K-fold with purging and embargoes, and finally walk forward optimization, to see which was truly the most inferior option of them all at predicting future returns from the engineered features. It also examines whether a sample long/very-long strategy could be fit onto the pricing signal generated by the model that we built.
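To make the purging, embargoing, and walk-forward ideas concrete, here is a minimal sketch of both kinds of splitter. This is not the code from my repo; the function names, the `event_times` input, and the 1% embargo are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd


def purged_kfold_splits(event_times: pd.Series, n_splits: int = 5, embargo_pct: float = 0.01):
    """Yield (train_idx, test_idx) pairs for a purged, embargoed K-fold split.

    `event_times` is indexed by trade time (sorted), and each value is the
    label's event (exit) time for that observation.
    """
    n = len(event_times)
    indices = np.arange(n)
    embargo = int(n * embargo_pct)

    # contiguous, non-shuffled test folds preserve the time ordering
    for test_idx in np.array_split(indices, n_splits):
        test_start_time = event_times.index[test_idx[0]]
        test_end_time = event_times.index[test_idx[-1]]

        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False

        # purge: drop any observation whose [trade time, event time] interval
        # overlaps the test fold's time span
        overlaps = (event_times.index <= test_end_time) & (event_times.values >= test_start_time)
        train_mask &= ~overlaps

        # embargo: also drop a small buffer of observations right after the test fold
        train_mask[test_idx[-1] + 1: test_idx[-1] + 1 + embargo] = False

        yield indices[train_mask], test_idx


def walk_forward_splits(n_obs: int, train_size: int, test_size: int):
    """Yield expanding-window walk-forward (train_idx, test_idx) pairs."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train_end = start + train_size
        yield np.arange(0, train_end), np.arange(train_end, train_end + test_size)
        start += test_size  # the test window slides forward; the train window expands
```

The walk-forward sketch uses an expanding training window; a sliding window is the usual alternative if you are worried about regime shifts diluting the most recent signal.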
Coming to the code, it is all uploaded to my Git repo, but I did want to discuss the feature engineering section of it. While common to the point of being superfluous in many respects, I used a combination of rolling volatility windows, lagged returns, bounded price, and RSI to capture the short-term autocorrelation that points to any kind of momentum or mean reversion. I also used the distance to the bounds as a sort of range measure and attempted to capture seasonality with calendar features. Even though I found a lot of the data to be noisy under many scenarios, that may just be a skill issue and I might need to get good.

Before we get to the results, I would like to discuss the different metric helpers I used to evaluate the model in the first place (a rough sketch of them appears at the end of this section). MSE and MAE are obviously ubiquitous and, in the case of daily returns, are only there as a sanity check that the loss functions behave sensibly. The Spearman correlation, on the other hand, compares the rank ordering of the predictions against the realized targets on the test set. It matters because it only cares about monotonic ordering, which is exactly what you need to size positions for the next day's returns. On a cross-sectional dataset it can be extremely informative, but on a single series it may just be noise. Lastly, I used hit rate to track directional accuracy and how often the model leans long.

The vanilla KFold (the lie) looks pretty decent on the linear models; the purged and walk-forward schemes rectify the lie and show that there was no real edge actually present in the model. The shuffled KFold leaks the time structure, as we discussed earlier, and the IC collapsing to around zero (or below) under the other schemes confirms that the apparent signal was just noise. The RF hit rate simply reflects an always-long bias around a coin-flip benchmark (around 51-53%), which is just the standard behaviour of a market that goes up more often than it goes down; it makes the ranking utterly useless in this regard.

Thus, these Kaggle-style cross validation tricks work pretty swell when the data is roughly iid, but financial time series data is more often than not a different language altogether. What one needs is a lot of patience and experience, not only to know what the traps are, but to avoid them altogether.
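For reference, here is roughly what those per-fold diagnostics look like. This is a simplified sketch rather than the exact helpers in my repo, and `evaluate_fold` is a name invented for the example.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error


def evaluate_fold(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Per-fold diagnostics for next-day return predictions on a test fold."""
    ic, _ = spearmanr(y_pred, y_true)                              # rank (Spearman) correlation, i.e. the IC
    hit_rate = float(np.mean(np.sign(y_pred) == np.sign(y_true)))  # directional accuracy
    return {
        "mse": mean_squared_error(y_true, y_pred),                 # loss-function sanity checks
        "mae": mean_absolute_error(y_true, y_pred),
        "spearman_ic": ic,
        "hit_rate": hit_rate,
    }
```

Averaging these across folds for each splitting scheme (vanilla, purged and embargoed, walk forward) is what produces the comparison described above.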