Real-Time Market Data Forecasting with Transformer Models
A new approach to predicting high-frequency trading patterns
I recently got into this Jane Street competition on Kaggle and realized that it might be a good chance to explore how a cognitive co-pilot could support me on my journey of understanding the stock market more deeply. The stock market is a volatile, dynamic, and nonlinear primordial void of data and liquidity, devouring careless traders in an endless cycle of consumption that makes accurate price predictions nearly impossible. Yet the vast amount of data offers hope for analysts, researchers, and data scientists like me to uncover hidden patterns and pursue the elusive alpha.
Traditional quantitative finance methodologies and machine learning algorithms, such as Moving Average techniques, ARIMA, and Long Short-Term Memory (LSTM) networks, have been shown to work effectively over the last decade.
However, LSTMs and other more traditional forecasting models also have several disadvantages:
LSTMs are difficult to parallelize, which makes them much slower to train since their inputs must be processed sequentially.
LSTMs have no explicit modeling of long- and short-range dependencies.
For LSTMs, the “distance” between positions is linear, which means that distant memories do not weigh as heavily as recent ones.
LSTMs are typically fitted individually on each time series and are therefore referred to as "single" or "local" methods.
Since Transformers are global sequence learners, I was wondering whether there is a way to use them to forecast short-term market prices. Just to be clear, by short-term I mean what might happen within the next 24 hours, not the next 90 days. That also means this post might be read as one that glorifies day trading. It is not. While there may be highly successful day traders, I am in general not a big fan of that approach. But I do have an unusual-volume signal generator that I want to use, and maybe there is a way to predict the next swing, which always starts with what might happen in the next 24 hours. Or in other words: what will happen next?
In the above example, the stock crashed due to an earnings call event. This post is not about forecasting earnings or news events. It’s about having a model that learns what happened in similar situations in the past and then predicts the next move.
LSTMs, with their recurrent structure, were originally designed to capture long-range dependencies in sequential data. However, training LSTM models takes a long time because they must process and carry forward all past observations sequentially.
My hypothesis is that the Transformer’s attention mechanism fixes most of these issues elegantly.
Transformer
To briefly recap: the Transformer is a neural network architecture that has upended the field of machine learning, particularly natural language processing (NLP), since the seminal 2017 “Attention Is All You Need” paper. The Transformer uses a sequence-to-sequence architecture that transforms a given tokenized sequence of elements, such as the words in a sentence, into another sequence. In Graph Encoding, I already introduced techniques and explained the importance of encoding the state of the world into a representation the model can work with. In this case, we will treat each row in the dataset as one sentence. Therefore, we can understand the history of a stock as a story that is told only in quantitative terms.
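To make the idea of a row as a sentence token a bit more concrete, here is a minimal sketch of how one could project each row of numeric features into an embedding a Transformer can consume. The feature count, dimensions, and class name are mine, purely for illustration, and not the actual Jane Street schema.

```python
import torch
import torch.nn as nn

class RowTokenizer(nn.Module):
    """Projects one row of numeric market features into a d_model-sized
    embedding, so that a window of consecutive rows can be fed to a
    Transformer like a sentence of tokens. Sizes are illustrative."""
    def __init__(self, n_features: int = 79, d_model: int = 128):
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)

    def forward(self, rows: torch.Tensor) -> torch.Tensor:
        # rows: (batch, seq_len, n_features) -> (batch, seq_len, d_model)
        return self.proj(rows)

# A "story" of 64 consecutive rows for 8 symbols
tokens = RowTokenizer()(torch.randn(8, 64, 79))
print(tokens.shape)  # torch.Size([8, 64, 128])
```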
The Transformer architecture then leverages its self-attention mechanisms to process the input data as a whole, rather than sequentially as in traditional models.
The self-attention model visually explained
The self-attention mechanism in transformers allows each element in the input sequence to attend to all other elements, providing a global context that enhances the model's ability to predict the next token and to capture long-range dependencies.
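For intuition, here is a bare-bones scaled dot-product self-attention in plain NumPy, with no learned projections, masking, or multiple heads. It only illustrates the core idea that every position gets to weigh every other position directly.

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention over a sequence x of shape
    (seq_len, d). Every time step attends to every other time step,
    so distant positions are as reachable as recent ones."""
    d = x.shape[-1]
    q, k, v = x, x, x                          # no learned projections in this sketch
    scores = q @ k.T / np.sqrt(d)              # (seq_len, seq_len) pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ v                         # context-mixed representation

out = self_attention(np.random.randn(64, 16))
print(out.shape)  # (64, 16)
```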
Long-range dependencies?
Consider how major product announcements affect Apple's stock price. When Apple announces a new iPhone in September, it often creates a pattern that impacts the stock not just immediately, but over several months:
Initial spike during the announcement
Build-up during pre-orders
Holiday season sales performance
Quarterly earnings reflecting phone sales
Supply chain reports months later
So if you're trying to forecast Apple's stock price in March, you might need to consider the previous September's iPhone announcement and all these subsequent events: a long-range dependency spanning 6+ months. The impact of such events on financial markets is well documented, with studies showing that markets often react significantly to government partisanship, elections, and geopolitical risks. These events are considered exogenous shocks that can cause abrupt changes in market trends. For instance, research has shown that stock returns are influenced by political decisions and elections, and that market responses to such events can be captured effectively by a transformer-based model.
Forecasting
Forecasting stock prices is extremely challenging; many people much smarter than me have tried and failed, especially when incorporating external factors such as political events, natural disasters, disruptive and transformative innovation, and economic conditions. Transformers have the potential to be more effective by incorporating these complexities into their predictions and, maybe more importantly, by distinguishing which signals matter and which ones don’t. Again, this is not about forecasting the event, but about understanding the reaction of a specific stock under a certain set of conditions.
Transformers integrate such conditions into their predictions through attention and memory mechanisms. The TimeXer model exemplifies this approach by empowering transformers for time series forecasting with exogenous variables, enabling the model to adapt to rapid and unpredictable changes in the financial landscape.
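I have not settled on an architecture, but conceptually the simplest way to feed exogenous conditions alongside the market series is to concatenate them per time step before the embedding and let the encoder attend over the joint sequence. The sketch below is my own simplified illustration, not the TimeXer architecture; all names and dimensions are made up.

```python
import torch
import torch.nn as nn

class ExogenousEncoder(nn.Module):
    """Illustrative only: concatenates endogenous features (price, volume)
    with exogenous features (event or sentiment flags) per time step, then
    runs a standard TransformerEncoder over the joint sequence."""
    def __init__(self, n_endo=8, n_exo=4, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(n_endo + n_exo, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)   # next-step return prediction

    def forward(self, endo: torch.Tensor, exo: torch.Tensor) -> torch.Tensor:
        # endo: (batch, seq, n_endo), exo: (batch, seq, n_exo)
        h = self.encoder(self.embed(torch.cat([endo, exo], dim=-1)))
        return self.head(h[:, -1])          # predict from the last position

pred = ExogenousEncoder()(torch.randn(2, 32, 8), torch.randn(2, 32, 4))
print(pred.shape)  # torch.Size([2, 1])
```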
But wait, there is more. Transformers are also capable of handling large datasets more efficiently thanks to their parallel processing capabilities, which can significantly speed up training and inference compared to RNNs and LSTMs. And, as mentioned already, the self-attention mechanism is adept at focusing on the most relevant parts of the data, thereby improving the model’s ability to make accurate predictions despite the noise and rapid fluctuations typical of financial market patterns. This post is also not about learning technical analysis patterns like head and shoulders. I don’t believe in that.
Patterns
All marked patterns below are forms of mean reversion. It is apparent, however, that the sequences leading up to the corresponding mean-reversion patterns are vastly different. Naturally, this “before” pattern would help the machine in its attempt to classify what comes next. To settle on a predictive pattern, the Transformer model I am designing attempts to infer a sequence of ebbs and flows that has historically proven predictive. This applies to any time series of any value that fluctuates over time (for example, alternative data in the form of social media sentiment towards a stock over time).
I believe that at most of these points in the chart, our Transformer model should have a clear opinion of what will happen next. Transformers like the Autoformer, Informer, and Fredformer have shown substantial advancements in processing and forecasting time-series data by addressing specific challenges such as frequency bias and the need for efficient computation, which are especially common in high-frequency financial data, the type of data I will use for the Kaggle challenge.
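Before reaching for any of those architectures, the raw series first has to be turned into supervised examples of “what the model reads” and “what happens next”. Below is a minimal sketch, with made-up window sizes and a synthetic price series.

```python
import numpy as np

def make_windows(prices: np.ndarray, history: int = 60, horizon: int = 5):
    """Slice a 1-D price series into (history window, forward return) pairs.
    The window is what the model reads; the forward return over `horizon`
    steps is the "what comes next" label. Sizes are illustrative."""
    X, y = [], []
    for t in range(history, len(prices) - horizon):
        X.append(prices[t - history:t])
        y.append(prices[t + horizon] / prices[t] - 1.0)   # forward return
    return np.array(X), np.array(y)

prices = 100.0 + np.cumsum(np.random.randn(1000) * 0.5)   # synthetic series
X, y = make_windows(prices)
print(X.shape, y.shape)  # (935, 60) (935,)
```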
Risks
However, we have to realize that this will be a story in which, over the course of a day, not much changes. We can therefore expect a lot of repetition in the data. If we want speed, encoding these features efficiently should be part of the design story. Also, data quality will be a significant driver of model quality.
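One cheap way to deal with that repetition, and to keep memory in check, is to downcast the wide float columns and measure how much of the stream actually changes from row to row. A small pandas sketch, assuming a generic all-numeric frame rather than the actual competition schema:

```python
import pandas as pd

def shrink(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast float64 columns to float32, roughly halving memory
    before any modeling work. Purely illustrative preprocessing."""
    out = df.copy()
    floats = out.select_dtypes(include="float64").columns
    out[floats] = out[floats].astype("float32")
    return out

def repetition_ratio(df: pd.DataFrame) -> float:
    """Share of rows that are identical to the previous row, as a quick
    check of how repetitive the intraday stream really is."""
    return float((~df.ne(df.shift()).any(axis=1)).mean())
```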
Looking at the screenshot above, we can see a lot of missing data and, because of this missing data, strange or meaningless ratios. This is even worse for low-liquidity, high-volatility penny stocks. The data preprocessing phase is therefore crucial for maintaining data integrity when working with this data. Missing values, which are common in stock data due to non-trading days or incomplete records, can be handled with strategies such as filling gaps with the average values of the surrounding days. This ensures continuity and reliability in the dataset, which is essential for accurate model training and prediction.
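In code, the gap-filling strategy described above might look something like the sketch below. It assumes an all-numeric frame indexed by date; the window size is a placeholder.

```python
import pandas as pd

def fill_gaps(df: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    """Fill missing values with the average of surrounding days, then
    forward/backward fill whatever remains at the edges. Assumes an
    all-numeric DataFrame indexed by date."""
    out = df.sort_index().copy()
    # A centered rolling mean skips NaNs, so it averages the neighbouring days
    surrounding = out.rolling(window=window, center=True, min_periods=1).mean()
    out = out.fillna(surrounding)
    return out.ffill().bfill()
```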
I wrote several times throughout that trading data is noisy. What does that mean?
Noisy market data can consist of market manipulation attempts, emotional trading reactions, program trading algorithms interacting with each other, rumors, different time horizons, bid-ask bounce, corporate actions, secondary or tertiary events, technical glitches, market maker activities, order book dynamics, or data errors.
Ultimately, what I have in mind is orchestrating something like this one from Lucena Research.
In closing
Transformers are already revolutionizing the way I work when writing and researching articles, and it looks like there is potential to get an interesting pilot off the ground soon that might also improve my trading. The Kaggle contest seems like a cool way to play around with this approach using Jane Street data, which I assume to be of high quality. But we will find out soon.
I hope you found this post interesting, please like, share, and subscribe.