In machine learning, standard or vanilla cross validation forms the foundation for testing the robustness of a model. It assesses a model's ability to generalize by splitting the dataset into a training set and a testing set, fitting the model on the training set, and then evaluating it on the testing set. The primary use of cross validation is to prevent overfitting. While overfitting has many different connotations depending on the context, the concept remains the same: an overfit model has learnt all the features of its training data, including the noise, and is therefore unable to generalize to new data and predict different outcomes.

Walk forward optimization, on the other hand, reduces overfitting by testing each part of the data in a forward-looking manner, ensuring that no single lucky validation period flatters the results. Unlike traditional backtesting (which assumes that parameters remain effective indefinitely), this reflects how traders actually operate, continually re-estimating parameters as new data becomes available. Each time period also serves a dual purpose, first being used as an out-of-sample validation period and later being folded back into the in-sample optimization window. The main disadvantages of walk forward optimization are its computational cost (it requires multiple rounds of optimization and validation), the window selection bias introduced by how large you make your training windows, and the need to make sure that regime shifts are accounted for in the model.

The differentiating factor in finance is that time series data behaves differently in this field than in others. Any labelled data point in a financial time series has a trade time and an event time. The event time indicates when, in the future, the mark-to-market value of the asset reaches a level at which it becomes appropriate to exit the trade. In this context, particular attention must be paid to ensure that the data is not contaminated when a trade's label spans the boundary between the training and testing sets, for example when a trade is entered in one fold but not exited until the other. What this does is give the model a sneak peek into the future and thus makes it look far better than it is. Since these are all path dependent labels, we must safeguard them.

The solution to this, as de Prado writes in "Advances in Financial Machine Learning", is the combined process of purging and embargoing. We purge the dataset (removing from the training set any observations whose event times overlap with the trade times in the test fold) and then embargo it (dropping an additional slice of time immediately after the test fold so that serially correlated information cannot leak from the test fold into the training data that follows it). A minimal sketch of both of these splitters follows below. In settings like Kaggle competitions this doesn't really matter, as the data is already tabular and close to iid. To show how cross validation lies, I wrote a piece of code that analyzes the daily log returns of the S&P 500 over the ten-year period from 2014 to 2024 and runs different regression models, comparing vanilla K-fold, K-fold with purging and embargoes, and finally walk forward optimization, to see which was truly the most inferior option of them all at predicting future returns from the engineered features. It also examines whether a sample long/very-long strategy could be fit onto the pricing signal generated by the model that we built.
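To make the purging, embargoing, and walk-forward ideas concrete, here is a minimal sketch of both kinds of splitter. This is not the code from my repo; the function names, the `event_times` input, and the 1% embargo are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd


def purged_kfold_splits(event_times: pd.Series, n_splits: int = 5, embargo_pct: float = 0.01):
    """Yield (train_idx, test_idx) pairs for a purged, embargoed K-fold split.

    `event_times` is indexed by trade time (sorted), and each value is the
    label's event (exit) time for that observation.
    """
    n = len(event_times)
    indices = np.arange(n)
    embargo = int(n * embargo_pct)

    # contiguous, non-shuffled test folds preserve the time ordering
    for test_idx in np.array_split(indices, n_splits):
        test_start_time = event_times.index[test_idx[0]]
        test_end_time = event_times.index[test_idx[-1]]

        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False

        # purge: drop any observation whose [trade time, event time] interval
        # overlaps the test fold's time span
        overlaps = (event_times.index <= test_end_time) & (event_times.values >= test_start_time)
        train_mask &= ~overlaps

        # embargo: also drop a small buffer of observations right after the test fold
        train_mask[test_idx[-1] + 1: test_idx[-1] + 1 + embargo] = False

        yield indices[train_mask], test_idx


def walk_forward_splits(n_obs: int, train_size: int, test_size: int):
    """Yield expanding-window walk-forward (train_idx, test_idx) pairs."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train_end = start + train_size
        yield np.arange(0, train_end), np.arange(train_end, train_end + test_size)
        start += test_size  # the test window slides forward; the train window expands
```

The walk-forward sketch uses an expanding training window; a sliding window is the usual alternative if you are worried about regime shifts diluting the most recent signal.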
Coming to the code, it is all uploaded to my Git repo, but I did want to discuss the feature engineering section of it. While common to the point of being superfluous in many respects, I used a combination of rolling volatility windows, lagged returns, bounded price, and RSI to capture the short-term autocorrelation that points to any kind of momentum or mean reversion. I also used the distance to the bounds as a sort of range measure and attempted to capture seasonality with calendar features. Even though I found a lot of the data to be noisy under many scenarios, that may just be a skill issue and I might need to get good.

Before we get to the results, I would like to discuss the different metric helpers I used to evaluate the model in the first place (a rough sketch of them appears at the end of this section). MSE and MAE are obviously ubiquitous and, in the case of daily returns, are only there as a sanity check that the loss functions behave sensibly. The Spearman correlation, on the other hand, compares the rank ordering of the predictions against the realized targets on the test set. It matters because it only cares about monotonic ordering, which is exactly what you need to size positions for the next day's returns. On a cross-sectional dataset it can be extremely informative, but on a single series it may just be noise. Lastly, I used hit rate to track directional accuracy and how often the model leans long.

The vanilla KFold (the lie) looks pretty decent on the linear models; the purged and walk-forward schemes rectify the lie and show that there was no real edge actually present in the model. The shuffled KFold leaks the time structure, as we discussed earlier, and the IC collapsing to around zero (or below) under the other schemes confirms that the apparent signal was just noise. The RF hit rate simply reflects an always-long bias around a coin-flip benchmark (around 51-53%), which is just the standard behaviour of a market that goes up more often than it goes down; it makes the ranking utterly useless in this regard.

Thus, these Kaggle-style cross validation tricks work pretty swell when the data is roughly iid, but financial time series data is more often than not a different language altogether. What one needs is a lot of patience and experience, not only to know what the traps are, but to avoid them altogether.
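For reference, here is roughly what those per-fold diagnostics look like. This is a simplified sketch rather than the exact helpers in my repo, and `evaluate_fold` is a name invented for the example.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error


def evaluate_fold(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Per-fold diagnostics for next-day return predictions on a test fold."""
    ic, _ = spearmanr(y_pred, y_true)                              # rank (Spearman) correlation, i.e. the IC
    hit_rate = float(np.mean(np.sign(y_pred) == np.sign(y_true)))  # directional accuracy
    return {
        "mse": mean_squared_error(y_true, y_pred),                 # loss-function sanity checks
        "mae": mean_absolute_error(y_true, y_pred),
        "spearman_ic": ic,
        "hit_rate": hit_rate,
    }
```

Averaging these across folds for each splitting scheme (vanilla, purged and embargoed, walk forward) is what produces the comparison described above.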