Gah, this paper is hard to read, but here's my understanding:
Let's say you have 100 intersections, and you want to predict the traffic on each in cars/sec. You sample every hour, and you keep 24 hours of context, and try to predict the next 4.
First, you'd make 100 "tokens" (really stretching the meaning of token here), one for each stoplight, load the 24 samples (the history of that stoplight) into each token, and normalize.
Next, you run each token through a Multi-Layer Perceptron (vanilla, old-school neural network) to make a vector of dim D.
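Roughly, in PyTorch-ish terms (my own sketch of my reading, not the paper's code; D and the exact MLP shape are guesses):

    import torch
    import torch.nn as nn

    N, T, D = 100, 24, 256   # 100 stoplights, 24-hour history, embedding width (D is a guess)

    x = torch.randn(N, T)    # one row per stoplight: its last 24 hourly readings (dummy data here)
    x = (x - x.mean(-1, keepdim=True)) / (x.std(-1, keepdim=True) + 1e-5)   # normalize each series

    embed = nn.Sequential(   # the per-token MLP: 24 samples -> one D-dim vector
        nn.Linear(T, D), nn.GELU(), nn.Linear(D, D),
    )
    tokens = embed(x)        # (100, D): one "token" per stoplight (batch dimension omitted)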
Next, for each layer of the transformer, you:
1. Perform "cross-attention," i.e. the query/key/value dance. This is how the different time series (erm, tokens) get to share information.
2. Normalize across all.
3. Run another bog-standard MLP independently on each token. This is the opportunity to examine the history of each time series.
4. Normalize again across all.
Then, you map each "token" (ugh) from being D-dimensional to 4-dimensional, so for each stoplight it predicts the traffic ahead for the next 4 hours. This is also a regular MLP.
So notably, if you're only predicting a single time series (one stoplight), this method collapses to running a regular feed-forward network on that series' history, since there's only one token and nothing for the attention to mix.
It also, interestingly enough, skips the cool sinusoidal position embedding that transformers use to embed token position. Fair enough, since here the time dimension is fixed and the index of the feed-forward neurons in each MLP layer corresponds (roughly) to the time index of the sample.
The architecture looks weird to me, but apparently it works so that's cool! But I'm not sure how well it works, and my unscientific gut feel is that there's a better and simpler architecture crying out to be found, because this looks a bit tortured. Like, nothing in it explicitly models the time dimension - that task is left to the MLPs - and that seems weird.
I had a startup a few years ago that was in the “eh we’ve got some money left from our BigTech days, let’s buy a lottery ticket that’s also a masters degree” category.
And in late 2018, attention/transformers was quite the risqué idea. We were trying to forecast price action in financial markets, and while it didn’t work (I mean really Ben), it smoked all the published stuff like DeepLOB.
It used learned embeddings of raw order books passed through a little conv widget to smooth a bit, and then learned embeddings of order book states before passing them through bog-standard positional encoding and multi-head masked self-attention.
This actually worked great!
The thing that kills you is trying to reward-shape on the policy side to avoid getting eaten by taker fees, but it’s a broken ATM with artificially lowered fees.
Interesting, I'm trying to understand (much less knowledgeable about finance than ML, heh.) But it sounds like you fed it the raw order books (no time dimension), a sequence of order states corresponding to each (a time series), mapped them into the embedding dimension of a decoder-only transformer (the masking), and trained it to predict logits for the next order state?
See, that makes way more sense to me, since it sounds like you used causal self-attention, and actual position embeddings.
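If I had to guess at the shape of it, something like this - with every name and dimension invented by me, obviously not your actual code:

    import torch
    import torch.nn as nn

    DEPTH, T, D, V = 40, 128, 256, 1024    # book levels per snapshot, sequence length, model dim, # of discretized states

    smooth = nn.Conv1d(DEPTH, D, kernel_size=5, padding=2)          # the "little conv widget" over raw snapshots
    pos = nn.Parameter(torch.randn(1, T, D))                        # position embeddings (learned here)
    layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
    head = nn.Linear(D, V)                                          # logits over the next order-book state

    books = torch.randn(1, DEPTH, T)                                # a window of raw book snapshots
    h = smooth(books).transpose(1, 2) + pos                         # (1, T, D)
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)   # causal mask: can't peek ahead
    h = layer(h, src_mask=mask)                                     # masked (causal) multi-head self-attention
    next_state_logits = head(h)                                     # predict the next state at every position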
I've been interested in some time series stuff, like position embeddings to model actual wall-clock time offsets rather than sequence index, but for textless NLP rather than trading.
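e.g. feeding the model sinusoidal features of the real-valued time offset instead of the integer position, something like (my own sketch):

    import math
    import torch

    def wallclock_embedding(t_seconds, dim=64, max_period=86400.0):
        # sinusoidal features of a real-valued time offset (seconds), instead of the integer position index
        freqs = torch.exp(torch.linspace(0.0, -8.0, dim // 2)) * (2 * math.pi / max_period)
        ang = t_seconds[..., None] * freqs
        return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)

    offsets = torch.tensor([0.0, 37.2, 121.9, 600.0])   # irregularly spaced samples
    emb = wallclock_embedding(offsets)                   # (4, 64), drop-in for index-based position embeddings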
They did say it didn't work. The overwhelming majority of published finance stuff doesn't work, because it's either too simplistic, poorly backtested, or exploited too quickly, so beating it doesn't imply you can run a hedge fund.
The main point here is that it's one thing to predict price action and another to trade profitably - in particular, they were not able to beat fees, which is a common hurdle if you're new to HFT.
Basically this. We were heavy infra pros and my cofounder was an HFT veteran, so it wasn't classic implementation shortfall so much as that we didn't solve the "do we enter" threshold on what would otherwise be a friction-free windfall.
What they describe looks like a single predictor. You can't create a strategy with a single predictor, unless it's incredibly predictive. 99% of the time, a predictor cannot beat its transaction costs alone.
You need to combine hundreds of such predictors to be able to beat costs and have a net profitable strategy.
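A toy of the idea, with made-up signals and a made-up fee, just to show the mechanism (in-sample, so it flatters the edge):

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 10_000, 300                                   # observations, weak predictors
    signals = rng.normal(size=(n, k))                    # stand-ins for hundreds of tiny alphas
    true_w = rng.normal(size=k) / np.sqrt(k)
    ret = signals @ true_w + rng.normal(scale=5.0, size=n)   # future return: mostly noise

    lam = 10.0                                           # ridge regression to combine the signals
    w = np.linalg.solve(signals.T @ signals + lam * np.eye(k), signals.T @ ret)
    pred = signals @ w

    fee = 1.0                                            # round-trip cost, same units as ret
    trade = np.abs(pred) > fee                           # only act when the combined edge clears the fee
    pnl = np.sign(pred[trade]) * ret[trade] - fee
    print(f"trade {trade.mean():.1%} of the time, avg pnl per trade {pnl.mean():.2f}")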
We have a saying in French that you need a lot of rivers to create a sea.
So the group involved veterans from like Knight and DRW and stuff: we understood the model of combining lots of small signals with a low-latency regression.
We were trying to learn those signals as opposed to sweat-shop them.
Wasn't the US housing crisis of the late 2000s caused by that 99% threshold?
Not in finance at all but I do use reverse Kalman filters, to which this seems similar in core concepts.
While reverse Kalman filters are incredibly helpful in reducing cloud spend by predicting when to auto-scale, you still need metrics to quickly recover from mistakes.
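For concreteness, here's what a plain Kalman filter tracking the level and trend of a load metric looks like (toy numbers, not our actual setup):

    import numpy as np

    # state: [load, load_trend]; constant-velocity model
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition
    H = np.array([[1.0, 0.0]])               # we only observe the load itself
    Q = np.eye(2) * 0.01                      # process noise
    R = np.array([[1.0]])                     # measurement noise

    x = np.zeros(2)                           # state estimate
    P = np.eye(2)                             # estimate covariance

    for z in [10.0, 12.0, 15.0, 19.0, 24.0]:  # observed requests/sec
        x = F @ x                             # predict
        P = F @ P @ F.T + Q
        y = z - H @ x                         # update: innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)        # Kalman gain
        x = x + K @ y
        P = (np.eye(2) - K @ H) @ P

    forecast = (F @ x)[0]                     # predicted load next step -> scale up if above a threshold
    print(forecast)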
Based only on tech interviews with HFT companies, I would assume someone could use these methods on historical data to predict your moves.
But perhaps I am just too risk averse or am missing the core concept.
Doesn't this presuppose that all the information you need to predict the future of your time series is embedded in the past of those time series?
Don't most time series we would be interested in predicting (weather, prices, traffic volumes) tend to respond to things outside the history of the time series in question?
Or is the thesis here that we throw every random time series we can think of - wave height series from buoys in the San Francisco Bay, ticket sales from Taylor Swift concerts, Teslas per hour in the Holland tunnel, sales volume of MSFT... and get this thing to find the cross-correlated leading indicators needed so it can predict them all?
> Doesn't this presuppose that all the information you need to predict the future of your time series is embedded in the past of those time series?
Yes. But usually this is a workable assumption: the causes might not be in your data, but the model should at least learn not to be overconfident.
> Don't most time series we would be interested in predicting (weather, prices, traffic volumes) tend to respond to things outside the history of the time series in question?
Yes and no.
You really want the forecast to be a probability distribution: 95% of the time it will take you X minutes to get home from work if you leave at 17:30 but 5% of the time there will be disruptions.
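One common way to get that distribution is to have the model emit a few quantiles trained with a pinball loss rather than a single point estimate - a sketch, not something from the paper:

    import torch

    def pinball_loss(pred, target, quantiles=(0.05, 0.5, 0.95)):
        # pred: (..., 3) quantile forecasts, target: (...,) realized value
        q = torch.tensor(quantiles, dtype=pred.dtype)
        err = target.unsqueeze(-1) - pred
        return torch.maximum(q * err, (q - 1.0) * err).mean()

    pred = torch.randn(8, 3)      # 8 forecasts x 3 quantiles (5th/50th/95th percentile commute times)
    target = torch.randn(8)       # what actually happened
    loss = pinball_loss(pred, target)

so "95% of the time it takes at most X minutes" falls straight out of the forecast.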
It seems that Crossformer has a very large number of tokens (patches as tokens). The authors of this paper believe that one token per variable is sufficient, and that it is natural to use attention to describe the overall relationships among these individual series.
I think this is a very similar concept to TiDE (https://arxiv.org/abs/2304.08424), which came before and is cited in the paper this post links to. I haven't read through the paper, so I can't point out the differences in approach yet.
However, just looking at the results in this post's paper, it seems that at least for TiDE they report numbers completely different from the original paper. It looks like cherry-picking of a particular configuration, as the delta is a bit too much to blame on irreproducibility alone.
I also agree that modeling the time dimension with an MLP can be more rational than self-attention: it learns weightings over time points that have a consistent physical meaning.
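In the simplest case that's just a small MLP shared across series, mapping the 24-step history straight to the 4-step horizon (sketch):

    import torch
    import torch.nn as nn

    T, H, N = 24, 4, 100
    history = torch.randn(N, T)          # 100 series, 24 past samples each

    time_mlp = nn.Sequential(            # weights are tied to time positions, shared across series
        nn.Linear(T, 64), nn.GELU(), nn.Linear(64, H),
    )
    forecast = time_mlp(history)         # (100, 4)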