Gah, this paper is hard to read, but here's my understanding:
Let's say you have 100 intersections, and you want to predict the traffic on each in cars/sec. You sample every hour, and you keep 24 hours of context, and try to predict the next 4.
First, you'd make 100 "tokens" (really stretching the meaning of token here), one for each stoplight, load the 24 samples (the history of that stoplight) into each token, and normalize.
Next, you run each token through a Multi-Layer Perceptron (vanilla, old-school neural network) to make a vector of dim D.
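Roughly, in PyTorch-ish terms (my own sketch of my reading, not the paper's code; D and the exact MLP shape are guesses):

    import torch
    import torch.nn as nn

    N, T, D = 100, 24, 256   # 100 stoplights, 24-hour history, embedding width (D is a guess)

    x = torch.randn(N, T)    # one row per stoplight: its last 24 hourly readings (dummy data here)
    x = (x - x.mean(-1, keepdim=True)) / (x.std(-1, keepdim=True) + 1e-5)   # normalize each series

    embed = nn.Sequential(   # the per-token MLP: 24 samples -> one D-dim vector
        nn.Linear(T, D), nn.GELU(), nn.Linear(D, D),
    )
    tokens = embed(x)        # (100, D): one "token" per stoplight (batch dimension omitted)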
Next, for each layer of the transformer, you:
1. Perform "cross-attention," i.e. the query/key/value dance. This is how the different time series (erm, tokens) get to share information.
2. Normalize across all.
3. Run another bog-standard MLP independently on each token. This is the opportunity to examine the history of each time series.
4. Normalize again across all.
Then, you map each "token" (ugh) from being D-dimensional to 4-dimensional, so for each stoplight it predicts the traffic ahead for the next 4 hours. This is also a regular MLP.
So notably, if you're only predicting a single time series (one stoplight), this method collapses to running a regular feed-forward network on that series' history, since there's only one token and nothing for the attention to mix.
It also, interestingly enough, skips the cool sinusoidal position embedding that transformers use to embed token position. Fair enough, since here the time dimension is fixed and the index of the feed-forward neurons in each MLP layer corresponds (roughly) to the time index of the sample.
The architecture looks weird to me, but apparently it works so that's cool! But I'm not sure how well it works, and my unscientific gut feel is that there's a better and simpler architecture crying out to be found, because this looks a bit tortured. Like, nothing in it explicitly models the time dimension - that task is left to the MLPs - and that seems weird.
I had a startup a few years ago that was in the “eh we’ve got some money left from our BigTech days, let’s buy a lottery ticket that’s also a masters degree” category.
And in late 2018, attention/transformers was quite the risqué idea. We were trying to forecast price action in financial markets, and while it didn’t work (I mean really Ben), it smoked all the published stuff like DeepLOB.
It used learned embeddings of raw order books passed through a little conv widget to smooth a bit, and then learned embeddings of order book states before passing them through bog-standard positional encoding and multi-head masked self-attention.
This actually worked great!
The thing that kills you is trying to reward-shape on the policy side to avoid getting eaten by taker fees, but it’s a broken ATM with artificially lowered fees.
Interesting, I'm trying to understand (much less knowledgeable about finance than ML, heh.) But it sounds like you fed it the raw order books (no time dimension), a sequence of order states corresponding to each (a time series), mapped them into the embedding dimension of a decoder-only transformer (the masking), and trained it to predict logits for the next order state?
See, that makes way more sense to me, since it sounds like you used causal self-attention, and actual position embeddings.
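If I had to guess at the shape of it, something like this - with every name and dimension invented by me, obviously not your actual code:

    import torch
    import torch.nn as nn

    DEPTH, T, D, V = 40, 128, 256, 1024    # book levels per snapshot, sequence length, model dim, # of discretized states

    smooth = nn.Conv1d(DEPTH, D, kernel_size=5, padding=2)          # the "little conv widget" over raw snapshots
    pos = nn.Parameter(torch.randn(1, T, D))                        # position embeddings (learned here)
    layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
    head = nn.Linear(D, V)                                          # logits over the next order-book state

    books = torch.randn(1, DEPTH, T)                                # a window of raw book snapshots
    h = smooth(books).transpose(1, 2) + pos                         # (1, T, D)
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)   # causal mask: can't peek ahead
    h = layer(h, src_mask=mask)                                     # masked (causal) multi-head self-attention
    next_state_logits = head(h)                                     # predict the next state at every position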
I've been interested in some time series stuff, like position embeddings to model actual wall-clock time offsets rather than sequence index, but for textless NLP rather than trading.
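e.g. feeding the model sinusoidal features of the real-valued time offset instead of the integer position, something like (my own sketch):

    import math
    import torch

    def wallclock_embedding(t_seconds, dim=64, max_period=86400.0):
        # sinusoidal features of a real-valued time offset (seconds), instead of the integer position index
        freqs = torch.exp(torch.linspace(0.0, -8.0, dim // 2)) * (2 * math.pi / max_period)
        ang = t_seconds[..., None] * freqs
        return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)

    offsets = torch.tensor([0.0, 37.2, 121.9, 600.0])   # irregularly spaced samples
    emb = wallclock_embedding(offsets)                   # (4, 64), drop-in for index-based position embeddings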
They did say it didn't work. The overwhelming majority of published finance stuff doesn't work, because it's either too simplistic, poorly backtested, or exploited too quickly, so beating it doesn't imply you can run a hedge fund.
The main point here is that it's one thing to predict price action and another to trade profitably - in particular, they were not able to beat fees, which is a common hurdle if you're new to HFT.
Basically this. We were heavy infra pros and my cofounder was an HFT veteran, so it wasn't classic implementation shortfall so much as that we didn't solve the "do we enter" threshold on what would otherwise be a friction-free windfall.
What they describe looks like a single predictor. You can't create a strategy with a single predictor, unless it's incredibly predictive. 99% of the time, a predictor cannot beat its transaction costs alone.
You need to combine hundreds of such predictors to be able to beat costs and have a net profitable strategy.
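A toy of the idea, with made-up signals and a made-up fee, just to show the mechanism (in-sample, so it flatters the edge):

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 10_000, 300                                   # observations, weak predictors
    signals = rng.normal(size=(n, k))                    # stand-ins for hundreds of tiny alphas
    true_w = rng.normal(size=k) / np.sqrt(k)
    ret = signals @ true_w + rng.normal(scale=5.0, size=n)   # future return: mostly noise

    lam = 10.0                                           # ridge regression to combine the signals
    w = np.linalg.solve(signals.T @ signals + lam * np.eye(k), signals.T @ ret)
    pred = signals @ w

    fee = 1.0                                            # round-trip cost, same units as ret
    trade = np.abs(pred) > fee                           # only act when the combined edge clears the fee
    pnl = np.sign(pred[trade]) * ret[trade] - fee
    print(f"trade {trade.mean():.1%} of the time, avg pnl per trade {pnl.mean():.2f}")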
We have a saying in French that you need a lot of rivers to create a sea.
So the group involved veterans from like Knight and DRW and stuff: we understood the model of combining lots of small signals with a low-latency regression.
We were trying to learn those signals as opposed to sweat-shop them.
Wasn't the US housing crisis of the late 2000s caused by that 99% threshold?
Not in finance at all but I do use reverse Kalman filters, to which this seems similar in core concepts.
While reverse Kalman filters are incredibly helpful in reducing cloud spend by predicting when to auto-scale, you still need metrics to quickly recover from mistakes.
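For concreteness, here's what a plain Kalman filter tracking the level and trend of a load metric looks like (toy numbers, not our actual setup):

    import numpy as np

    # state: [load, load_trend]; constant-velocity model
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition
    H = np.array([[1.0, 0.0]])               # we only observe the load itself
    Q = np.eye(2) * 0.01                      # process noise
    R = np.array([[1.0]])                     # measurement noise

    x = np.zeros(2)                           # state estimate
    P = np.eye(2)                             # estimate covariance

    for z in [10.0, 12.0, 15.0, 19.0, 24.0]:  # observed requests/sec
        x = F @ x                             # predict
        P = F @ P @ F.T + Q
        y = z - H @ x                         # update: innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)        # Kalman gain
        x = x + K @ y
        P = (np.eye(2) - K @ H) @ P

    forecast = (F @ x)[0]                     # predicted load next step -> scale up if above a threshold
    print(forecast)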
Based only on tech interviews with HFT companies, I would assume someone could use these methods on historical data to predict your moves.
But perhaps I am just too risk averse or am missing the core concept.
Doesn't this presuppose that all the information you need to predict the future of your time series is embedded in the past of those time series?
Don't most time series we would be interested in predicting (weather, prices, traffic volumes) tend to respond to things outside the history of the time series in question?
Or is the thesis here that we throw every random time series we can think of - wave height series from buoys in the San Francisco Bay, ticket sales from Taylor Swift concerts, Teslas per hour in the Holland tunnel, sales volume of MSFT... and get this thing to find the cross-correlated leading indicators needed so it can predict them all?
> Doesn't this presuppose that all the information you need to predict the future of your time series is embedded in the past of those time series?
Yes. But usually this is a workable assumption: the causes might not be in your data, but the model should at least learn not to be overconfident.
> Don't most time series we would be interested in predicting (weather, prices, traffic volumes) tend to respond to things outside the history of the time series in question?
Yes and no.
You really want the forecast to be a probability distribution: 95% of the time it will take you X minutes to get home from work if you leave at 17:30 but 5% of the time there will be disruptions.
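One common way to get that distribution is to have the model emit a few quantiles trained with a pinball loss rather than a single point estimate - a sketch, not something from the paper:

    import torch

    def pinball_loss(pred, target, quantiles=(0.05, 0.5, 0.95)):
        # pred: (..., 3) quantile forecasts, target: (...,) realized value
        q = torch.tensor(quantiles, dtype=pred.dtype)
        err = target.unsqueeze(-1) - pred
        return torch.maximum(q * err, (q - 1.0) * err).mean()

    pred = torch.randn(8, 3)      # 8 forecasts x 3 quantiles (5th/50th/95th percentile commute times)
    target = torch.randn(8)       # what actually happened
    loss = pinball_loss(pred, target)

so "95% of the time it takes at most X minutes" falls straight out of the forecast.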
It seems that Crossformer has a very large number of tokens (patches as tokens). The authors of this paper believe that one token per variable is sufficient, and that it is natural to use attention to describe the overall relationships among these individual series.
I think this is a very similar concept to TiDE (https://arxiv.org/abs/2304.08424), which came before and is cited in the paper this post links to. I haven't read through the paper, so I can't point out the differences in approach yet.
However, just looking at the results in this post's paper, it seems that at least for TiDE they report numbers completely different from the original paper. It looks like cherry-picking of a particular configuration, as the delta is a bit too much to blame on irreproducibility alone.
I also agree that modeling the time dimension with an MLP can be more rational than self-attention: it learns weightings over time points that have a consistent physical meaning.
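In the simplest case that's just a small MLP shared across series, mapping the 24-step history straight to the 4-step horizon (sketch):

    import torch
    import torch.nn as nn

    T, H, N = 24, 4, 100
    history = torch.randn(N, T)          # 100 series, 24 past samples each

    time_mlp = nn.Sequential(            # weights are tied to time positions, shared across series
        nn.Linear(T, 64), nn.GELU(), nn.Linear(64, H),
    )
    forecast = time_mlp(history)         # (100, 4)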