In this article we’ll take a look at two related practices that are widely used by traders called Backtesting and Data Mining. These are techniques that are powerful and valuable if we use them correctly, however traders often misuse them. Therefore, we’ll also explore two common pitfalls of these techniques, known as the multiple hypothesis problem and overfitting and how to overcome these pitfalls.
Backtesting is just the process of using historical data to test the performance of some trading strategy. Backtesting generally starts with a strategy that we would like to test, for instance buying GBP/USD when it crosses above the 20-day moving average and selling when it crosses below that average. Now we could test that strategy by watching what the market does going forward, but that would take a long time. This is why we use historical data that is already available.
"But wait, wait!" I hear you say. "Couldn’t you cheat or at least be biased because you already know what happened in the past?" That’s definitely a concern, so a valid backtest will be one in which we aren’t familiar with the historical data. We can accomplish this by choosing random time periods or by choosing many different time periods in which to conduct the test.
Now I can hear another group of you saying, "But all that historical data just sitting there waiting to be analyzed is tempting isn’t it? Maybe there are profound secrets in that data just waiting for geeks like us to discover it. Would it be so wrong for us to examine that historical data first, to analyze it and see if we can find patterns hidden within it?" This argument is also valid, but it leads us into an area fraught with danger…the world of Data Mining
Data Mining involves searching through data in order to locate patterns and find possible correlations between variables. In the example above involving the 20-day moving average strategy, we just came up with that particular indicator out of the blue, but suppose we had no idea what type of strategy we wanted to test? That’s when data mining comes in handy. We could search through our historical data on GBP/USD to see how the price behaved after it crossed many different moving averages. We could check price movements against many other types of indicators as well and see which ones correspond to large price movements.
The subject of data mining can be controversial because as I discussed above it seems a bit like cheating or "looking ahead" in the data. Is data mining a valid scientific technique? On the one hand the scientific method says that we’re supposed to make a hypothesis first and then test it against our data, but on the other hand it seems appropriate to do some "exploration" of the data first in order to suggest a hypothesis. So which is right? We can look at the steps in the Scientific Method for a clue to the source of the confusion. The process in general looks like this:
Observation (data) >>> Hypothesis >>> Prediction >>> Experiment (data)
Notice that we can deal with data during both the Observation and Experiment stages. So both views are right. We must use data in order to create a sensible hypothesis, but we also test that hypothesis using data. The trick is simply to make sure that the two sets of data are not the same! We must never test our hypothesis using the same set of data that we used to suggest our hypothesis. In other words, if you use data mining in order to come up with strategy ideas, make sure you use a different set of data to backtest those ideas.
Now we’ll turn our attention to the main pitfalls of using data mining and backtesting incorrectly. The general problem is known as "over-optimization" and I prefer to break that problem down into two distinct types. These are the multiple hypothesis problem and overfitting. In a sense they are opposite ways of making the same error. The multiple hypothesis problem involves choosing many simple hypotheses while overfitting involves the creation of one very complex hypothesis.
The Multiple Hypothesis Problem
To see how this problem arises, let’s go back to our example where we backtested the 20-day moving average strategy. Let’s suppose that we backtest the strategy against ten years of historical market data and lo and behold guess what? The results are not very encouraging. However, being rough and tumble traders as we are, we decide not to give up so easily. What about a ten day moving average? That might work out a little better, so let’s backtest it! We run another backtest and we find that the results still aren’t stellar, but they’re a bit better than the 20-day results. We decide to explore a little and run similar tests with 5-day and 30-day moving averages. Finally it occurs to us that we could actually just test every single moving average up to some point and see how they all perform. So we test the 2-day, 3-day, 4-day, and so on, all the way up to the 50-day moving average.
Now certainly some of these averages will perform poorly and others will perform fairly well, but there will have to be one of them which is the absolute best. For instance we may find that the 32-day moving average turned out to be the best performer during this particular ten year period. Does this mean that there is something special about the 32-day average and that we should be confident that it will perform well in the future? Unfortunately many traders assume this to be the case, and they just stop their analysis at this point, thinking that they’ve discovered something profound. They have fallen into the "Multiple Hypothesis Problem" pitfall.
The problem is that there is nothing at all unusual or significant about the fact that some average turned out to be the best. After all, we tested almost fifty of them against the same data, so we’d expect to find a few good performers, just by chance. It doesn’t mean there’s anything special about the particular moving average that "won" in this case. The problem arises because we tested multiple hypotheses until we found one that worked, instead of choosing a single hypothesis and testing it.
Here’s a good classic analogy. We could come up with a single hypothesis such as "Scott is great at flipping heads on a coin." From that, we could create a prediction that says, "If the hypothesis is true, Scott will be able to flip 10 heads in a row." Then we can perform a simple experiment to test that hypothesis. If I can flip 10 heads in a row it actually doesn’t prove the hypothesis. However if I can’t accomplish this feat it definitely disproves the hypothesis. As we do repeated experiments which fail to disprove the hypothesis, then our confidence in its truth grows.
That’s the right way to do it. However, what if we had come up with 1,000 hypotheses instead of just the one about me being a good coin flipper? We could make the same hypothesis about 1,000 different people…me, Ed, Cindy, Bill, Sam, etc. Ok, now let’s test our multiple hypotheses. We ask all 1000 people to flip a coin. There will probably be about 500 who flip heads. Everyone else can go home. Now we ask those 500 people to flip again, and this time about 250 will flip heads. On the third flip about 125 people flip heads, on the fourth about 63 people are left, and on the fifth flip there are about 32. These 32 people are all pretty amazing aren’t they? They’ve all flipped five heads in a row! If we flip five more times and eliminate half the people each time on average, we will end up with 16, then 8, then 4, then 2 and finally one person left who has flipped ten heads in a row. It’s Bill! Bill is a "fantabulous" flipper of coins! Or is he?
Well we really don’t know, and that’s the point. Bill may have won our contest out of pure chance, or he may very well be the best flipper of heads this side of the Andromeda galaxy. By the same token, we don’t know if the 32-day moving average from our example above just performed well in our test by pure chance, or if there is really something special about it. But all we’ve done so far is to find a hypothesis, namely that the 32-day moving average strategy is profitable (or that Bill is a great coin flipper). We haven’t actually tested that hypothesis yet.
So now that we understand that we haven’t really discovered anything significant yet about the 32-day moving average or about Bill’s ability to flip coins, the natural question to ask is what should we do next? As I mentioned above, many traders never realize that there is a next step required at all. Well, in the case of Bill you’d probably ask, "Aha, but can he flip ten heads in a row again?" In the case of the 32-day moving average, we’d want to test it again, but certainly not against the same data sample that we used to choose that hypothesis. We would choose another ten-year period and see if the strategy worked just as well. We could continue to do this experiment as many times as we wanted until our supply of new ten-year periods ran out. We refer to this as "out of sample testing", and it’s the way to avoid this pitfall. There are various methods of such testing, one of which is "cross validation", but we won’t get into that much detail here.
Overfitting is really a kind of reversal of the above problem. In the multiple hypothesis example above, we looked at many simple hypotheses and picked the one that performed best in the past. In overfitting we first look at the past and then construct a single complex hypothesis that fits well with what happened. For example if I look at the USD/JPY rate over the past 10 days, I might see that the daily closes did this:
up, up, down, up, up, up, down, down, down, up.
Got it? See the pattern? Yeah, neither do I actually. But if I wanted to use this data to suggest a hypothesis, I might come up with…
My amazing hypothesis:
If the closing price goes up twice in a row then down for one day, or if it goes down for three days in a row we should buy,
but if the closing price goes up three days in a row we should sell,
but if it goes up three days in a row and then down three days in a row we should buy.
Huh? Sounds like a whacky hypothesis right? But if we had used this strategy over the past 10 days, we would have been right on every single trade we made! The "overfitter" uses backtesting and data mining differently than the "multiple hypothesis makers" do. The "overfitter" doesn’t come up with 400 different strategies to backtest. No way! The "overfitter" uses data mining tools to figure out just one strategy, no matter how complex, that would have had the best performance over the backtesting period. Will it work in the future?
Not likely, but we could always keep tweaking the model and testing the strategy in different samples (out of sample testing again) to see if our performance improves. When we stop getting performance improvements and the only thing that’s rising is the complexity of our model, then we know we’ve crossed the line into overfitting.
So in summary, we’ve seen that data mining is a way to use our historical price data to suggest a workable trading strategy, but that we have to be aware of the pitfalls of the multiple hypothesis problem and overfitting. The way to make sure that we don’t fall prey to these pitfalls is to backtest our strategy using a different dataset than the one we used during our data mining exploration. We commonly refer to this as "out of sample testing".