
I'm very happy with pyright. Most bug reports are fixed within a week, and new PEPs/features are added very rapidly, usually before the PEP is accepted (under an experimental flag). Enough that I ended up dropping pylint and now consider pyright enough for lint purposes as well. The most valuable lints for my work require good multi-file/semantic analysis, and pylint had various false positives there.

Main tradeoff is this only works if your codebase/core dependencies are typed. For a while that was not true and we used pylint + pyright. Eventually most of our code became typed, and we added type stubs for our main untyped dependencies.
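To illustrate the stub approach: a stub is just a .pyi file with signatures and no implementations, and pyright picks them up from its stubPath directory ("typings" by default). A minimal sketch for a hypothetical untyped dependency (all names made up) could look like:

    # typings/legacylib/__init__.pyi -- hand-written stub for a hypothetical
    # untyped dependency; only the surface we actually call is annotated.
    class Model:
        def predict(self, batch: list[float]) -> list[float]: ...

    def load_model(path: str) -> Model: ...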

edit: Also on pylint, it did mostly work well. tensorflow was the main library that created the most false positives. The other thing I found awkward was that pylint occasionally produced non-deterministic lints on my codebase.


Uv is an installer, not a build backend. It's similar to pip. If you install a library with uv, it will call a backend like setuptools as needed. It is not a replacement for setuptools.
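Roughly, the split looks like this in terms of the standard PEP 517 hooks; this is just a sketch of the protocol, not uv's actual code:

    # Sketch of the PEP 517 frontend/backend split. The installer (pip, uv)
    # reads pyproject.toml, imports whatever build-backend the project
    # declares (here assumed to be setuptools), and calls its hooks.
    import setuptools.build_meta as backend

    # Run from the project root: asks setuptools to build a wheel into dist/.
    wheel_filename = backend.build_wheel(wheel_directory="dist")
    print(wheel_filename)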


My experience working on ML at a couple of FAANG-like companies is that GPUs actually tend to be too fast compute-wise, and models are often unable to come close to NVIDIA's theoretical FLOPs numbers. Very frequently the bottlenecks that show up in profiling are elsewhere. It is very easy for your data-reading code to be the bottleneck. I have seen models where networking was the bottleneck and could not keep up with the compute, and we had to adjust the model architecture in ways that reduce the amount of data transferred across the cluster in training steps. Or maybe GPU memory bandwidth is the bottleneck: the key idea in the flash attention work is optimizing attention kernels to lower VRAM usage and stick to the smaller/faster SRAM. That is valuable work, but it is also the kind of work where few engineers I have worked with would have the CUDA experience to write custom efficient kernels.

Some of the models I train use a lot of sparse tensors as features, and tensorflow's sparse GPU kernels are rather bad, with many operations either falling back to CPU or, in some cases, a GPU sparse kernel that was slower than the equivalent CPU kernel. Several times densifying and padding tensors with a large fraction of 0's was faster than using the sparse kernel.
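As a concrete example of that last point, this is roughly the comparison I would benchmark; the shapes and the sparse input here are toy values, not a real model:

    import tensorflow as tf

    # A mostly-zero feature batch represented as a sparse tensor (toy shape).
    sp = tf.sparse.SparseTensor(
        indices=[[0, 1], [2, 3]], values=[1.0, 2.0], dense_shape=[4, 8])
    w = tf.random.normal([8, 16])

    # Path 1: keep it sparse and use the sparse matmul kernel.
    out_sparse = tf.sparse.sparse_dense_matmul(sp, w)

    # Path 2: densify first and use the heavily optimized dense matmul.
    out_dense = tf.matmul(tf.sparse.to_dense(sp), w)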

I’m sure a few companies/models are optimized enough to fit the ideal case, but it’s rare.

Edit: Another aspect of this is that which model architectures are good today is very hardware-driven. A major advantage of transformers over recurrent LSTM models is training efficiency on GPU; the gap in training efficiency between the two architectures is much more dramatic on GPU than on CPU. Similarly, other architectures with sequential components, like tree-structured/recursive dynamic models, tend to be a bad fit for GPUs performance-wise.


Interesting, thanks.

Let me reframe the question - assume it's not only 100x GPUs, but that all the performance bottlenecks you've mentioned are also solved or accelerated 100x.

What kind of improvement would we observe, given the current state of the models and knowledge?


If I assume you mean LLM-like models similar to ChatGPT, that is pretty debated in the community. Several years ago many people in the ML community believed we were at a plateau and that throwing more compute/money at it would not give significant improvements. Well, then LLMs did much better than expected as they scaled up, and they continue to improve on various benchmarks now.

So are we now at a performance plateau? I know people at OpenAI-like places who think AGI is likely in the next 3-5 years and is mostly scaling up context/performance/a few other key bets away. I know others who think that is unlikely in the next few decades.

My personal view is that I would expect a 100x speedup to make ML used even more broadly and to allow more companies outside the big players to have their own foundation models tuned for their use cases, or other specialized domain models outside language modeling. Even now I still see tabular datasets (recommender systems, pricing models, etc.) as the most common thing to work on in industry jobs. As for the impact 100x compute will have on leading models like OpenAI's/Anthropic's, I honestly have little confidence in what will happen.

The rest of this is very speculative and I'm not sure of it, but my personal gut feeling is that we still need other algorithmic improvements, like better ways to represent stored memory that models can later query/search. Honestly, part of that is just the math/CS background in me not wanting everything to end up being a hardware problem. The other part is that I'm doubtful human-like intelligence is so compute-expensive that we can't find more cost-efficient ways for models to learn, but maybe our nervous system is just much faster at parallel computation?


The human brain manages to work with 0.3 kWh per day. Even if we say all of that is used for training "models", over twenty years that's only about 2,200 kWh (0.3 x 365 x 20), much less than what ChatGPT needed to train (500 MWh?). So there are obviously lots of things we can do to improve efficiency. On the other hand, our brains had hundreds of millions of years to be optimized for energy consumption.


A friend showed me some python code or something that demonstrates facial recognition by calculating the distance between facial features - eyes, nose...
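(I assume something in this spirit; the feature vectors here are invented for illustration:)

    import numpy as np

    # Hypothetical per-face feature vectors, e.g. flattened eye/nose/mouth
    # landmark coordinates or an embedding produced by some model.
    face_a = np.array([0.32, 0.41, 0.55, 0.61, 0.48])
    face_b = np.array([0.30, 0.44, 0.53, 0.60, 0.47])

    # Declare "same person" if the Euclidean distance is under some threshold.
    distance = np.linalg.norm(face_a - face_b)
    print(distance, distance < 0.6)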

I had never thought about this before but how do I recognize faces? I mostly recognize faces by context. And I don't have to match against a billion faces, probably a hundred or so? And I still suck at this.

The fact that the human brain works with 0.3 kWh per day likely doesn't mean much. How do we even start asking the question - is a human brain thermally (or more generally, resource) constrained?


The brain is responsible for about 1/5 of the total energy expenditure (and therefore food requirement) of a human body. So yes, on a biological level, there are significant resource constraints on a human brain. What is less clear is whether this actually holds for the “computing” part, as contrasted with the “sustainment” part (think cell replacement).


My brain is noticeably thermally constrained every summer.


You're vastly underestimating the amount of data stored in the LLM weights compared to the amount of memory a human has.


There's some speculation that there are higher horizons to the training, as explained in this video: https://www.youtube.com/watch?v=Nvb_4Jj5kBo

The term for it is "grokking", amusingly. There's some indication that we are actually undertraining by 10x.


I've seen improvement numbers up to 12x, but after that the returns are so diminishing that there's not really a point. 12x on training costs, I mean; it probably still won't get AGI.


Realtime retraining and generation.


Well put. This was my experience when working for an AI startup too.

Frustratingly, it’s also the hardest part to solve. Throwing more compute at the problem is easy, but diagnosing and then solving those other bottlenecks takes a great deal of time, not to mention experience across a number of specialty domains that don’t typically overlap.


I’ve worked at companies with async training. Async training does help with fault tolerance and can also assist with training throughput by being less reliant on the slowest machine. But it does add meaningful training noise: when we ran experiments against sync training we got much more stable results with sync training, and some of our less stable models would even sometimes have loss explosions/divergence issues with async training but be fine with sync training.

Although even with async training, I generally see the dataset just sharded statically, and if a worker goes down its shard of data may be lost/skipped; there is no smarter dynamic file assignment that accounts for workers going down. Even basic things like having a failed job continue from the last checkpoint with the same dataset state within a large epoch are messy when major libraries like tensorflow lack a good dataset checkpointing mechanism.
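For reference, the static sharding I mean is basically the pattern below; the file pattern and worker count are illustrative:

    import tensorflow as tf

    NUM_WORKERS = 8
    worker_index = 3  # this worker's id, assigned by the training framework

    files = tf.data.Dataset.list_files("/data/train-*.tfrecord", shuffle=False)
    # Each worker statically owns every NUM_WORKERS-th file. If this worker
    # dies and is not replaced, its shard is simply skipped for the epoch.
    dataset = files.shard(NUM_WORKERS, worker_index).interleave(
        tf.data.TFRecordDataset)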


This is a fair guess on intuition, but working in the recommender space on both content and ad recommendation, content understanding signals have pretty consistently (across two companies and many projects) tended to underwhelm, and the key signals are generally engagement signals (including event sequences) and many embeddings (user embedding, creator embedding, ad embedding, etc.).

The main place I’ve seen content understanding help is cold start, especially for new items by new creators.


And you sometimes wonder whether the products being advertised themselves couldn't matter more than the targets reading the ads. We've learned to recognize a crappy offering, and AI can try to make me read more and more ads relevant to what I'm saying, but if it's a crap product, I won't pay anyway :(


I’ve occasionally worked with more dynamic models (tree-structured decoding). They are generally not a good fit for trying to max out GPU throughput. A lot of the magic of transformers and large language models is about pushing the GPU as hard as we can, and a simpler, static model architecture that trains faster can train on much more data.

So until the hardware allows comparable (say within 2-4x) throughput in samples per second, I expect the architectures of the most effective models to mostly stay static, and dynamic architectures to remain an interesting side area.


Having used it heavily, it is nowhere near painless. Where can you get a TPU? To train models you basically need to use GCP services. There are multiple services that offer TPU support: Cloud AI Platform, GKE, and Vertex AI. For GPU you can have a machine and run any tf version you like; for TPU you need different nodes depending on the tf version. Which tf versions are supported per GCP service is inconsistent: some versions are supported on Cloud AI Platform but not Vertex AI and vice versa. I have had a lot of difficulty trying to upgrade to recent tf versions and discovering the inconsistent service support.

Additionally, many operations that run on GPU are just unsupported on TPU. Sparse tensors have pretty limited support, and there's a bunch of models that will crash on TPU and require refactoring, sometimes pretty heavy refactoring of thousands of lines.

edit: Pytorch is even worse. Pytorch does not implement efficient TPU device data loading and generally has poor performance, nowhere comparable to the tensorflow/jax numbers. I'm unaware of any pytorch benchmarks where the TPU actually wins. For tensorflow/jax, if you can get it running and your model suits TPU assumptions (so a basic CNN), then yes, it can be cost-effective. For pytorch, even simple cases tend to lose.


The 23 comes from 2023. Pip uses calendar-based versioning. 23.3.1 means 2023, 3rd quarterly release, second release in that series; the last number is bumped for bug-fix releases. The first two numbers are purely date-based and do not follow semver.


D'oh. All those years I've never noticed that. Thanks for the info.


PEP 646, Variadic Generics (https://peps.python.org/pep-0646/), was made for this specific use case, but mypy is still working on implementing it. And even with it, it's expected that several more PEPs will be needed to make operations on variadic types powerful enough to handle common array operations. numpy/tensorflow/etc. do broadcasting a lot, and that would probably need a type-level Broadcast operator just to encode. I also expect the type definitions for numpy will get fairly complex, similar to template-heavy C++ code, after they add shape types.
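For a feel of what PEP 646 enables, here is a minimal sketch; the Array/Height/Width names follow the PEP's illustrative style and are not numpy's actual API:

    # Requires Python 3.11+ (or typing_extensions) and a type checker that
    # implements PEP 646, e.g. pyright.
    from typing import Generic, TypeVarTuple, Unpack

    Shape = TypeVarTuple("Shape")

    class Array(Generic[Unpack[Shape]]):
        """Hypothetical array type parameterized by its shape."""

    class Height: ...
    class Width: ...

    def transpose(x: Array[Height, Width]) -> Array[Width, Height]:
        ...  # the checker tracks the axis swap at the type level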


Having worked at similar companies on similar systems: usually the constants come from A/B experiments, and the smaller the probability of an action, the bigger its weight must be to matter much overall. The constants are generally set through some A/B tests to get them producing reasonable overall behavior, but they are a pain to tune and very unlikely to be optimal in any real sense, as it's often too difficult to do an extensive search over them. Often I'll see a new target have a couple of different weights tried in an A/B test, and then maybe a second set of experiments after the rough magnitude is determined.

