Unlike quantization, dimensionality reduction/low-rank approximation, distillation, etc., lossless compression is an always-correct addition to any ML system, since you are computing the same thing you did before; the only questions are whether it is fast enough to avoid substantial bottlenecks and whether the achievable compression ratio is high enough to be useful.
Floating point is just an inefficient use of bits (due to excessive dynamic range), especially during training, so lossless compression will always be welcome there. Extreme quantization techniques (some of the <= 4-bit methods, say) also tend to increase entropy in the weights, limiting the applicability of lossless compression, so lossless and lossy compression (e.g., quantization) sometimes work against each other.
If you have billions of dollars in inference devices, even reducing the number of devices you need for a given workload by 5% is very useful.
Except it's being used in a situation where correctness isn't important. A close approximation is more than fine. In fact, an approximation might be better because it's more generalizable.
Hence, it's a bs thing to say. And it sounds clever - the worst type of bs.
Not really; it's just adding some data transposition (coalescing individual bytes from the data words together) and an option to use an LZ/dictionary-type compressor on redundant things. But an LZ-type compressor doesn't make much sense on NN weights, I think, since they are not as redundant as most text data with many repeats, and the space of possible dictionary matches is pretty small: unless the data is highly sparse, there may not be enough repetition to make the dictionary overhead worthwhile.
If you add an LZ-type compressor and have this be in the critical path for inference, then decompression will be a lot slower. It would be best to fuse decompression with the compute kernels (e.g., a GEMM that performs decompression on each tile before the arithmetic), and the simpler the decompression routine, the easier this will be.
This is just a consequence of the fact that bfloat16 has a very high dynamic range which is not all used. People like hyperparameters that look like 0.01, not 10^10, even though the same fractional precision is available at each exponent, and if you multiplied everything (hyperparameters, initialized weights, training data, etc.) in a network by 10^6, things would still work more or less the same, since the upper range is hardly used (with the possible exception of a small number of special functions).
The typical entropy of bfloat16 values seen in weights (and activations) is about 10-12 bits (only 65-75% or so of the value range is used in practice). The sign and mantissa bits tend to be incompressible noise.
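A quick way to see this for yourself, as a toy sketch: use Gaussian random values as a stand-in for real weights (an assumption, not data from any particular model), view their bfloat16 bit patterns (the top 16 bits of float32, truncating rather than rounding for simplicity), and compare the empirical entropy of the exponent field against the sign and mantissa bits.

    import numpy as np

    def entropy_bits(symbols):
        _, counts = np.unique(symbols, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    w = np.random.randn(1_000_000).astype(np.float32) * 0.02   # toy "weights"
    bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)         # truncate to bfloat16

    sign     = (bf16 >> 15) & 0x1
    exponent = (bf16 >> 7) & 0xFF
    mantissa = bf16 & 0x7F

    print("exponent entropy:", entropy_bits(exponent), "of 8 bits")  # well under 8
    print("mantissa entropy:", entropy_bits(mantissa), "of 7 bits")  # close to 7
    print("sign entropy:    ", entropy_bits(sign), "of 1 bit")       # close to 1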
This has been exploited several times before in the context of both classical HPC and AI, with lossless compression work from Martin Burtscher's lab (https://userweb.cs.txstate.edu/~burtscher/), fpzip from LLNL (https://computing.llnl.gov/projects/fpzip), and my library dietgpu from 2021 (https://github.com/facebookresearch/dietgpu), which we used to speed up training on a large GPU cluster by about 10% wall-clock time overall by losslessly compressing all data prior to send and decompressing upon receive (e.g., gradients, weights from backup, etc.) -- still computing the same thing as before, since it is lossless.
Also, rANS is more efficient and easier to implement with SIMD-like instruction sets than Huffman coding. It would also reduce DFloat11's latency/throughput penalties (since we have to decompress before we do the arithmetic).
I really love HN for this reason. Full of some of the brightest minds on the internet. Often the comments have very interesting information, instead of stupid knee jerk reactions to post titles.
Thanks Jeff -- can you point me to something written up about rANS? All I find online is turbulence modeling solutions; I presume this is not what you're referring to.
As we know, quantization is a critical tool for local LLM runners; RAM is typically the gating factor. Are you aware of other, better lossless compression of BF16 weights out there?
The reason I ask is this Dfloat11 seems relatively easy to plug in to existing quantization workflows, but you seem dismissive of the paper -- I presume it's my gap in understanding, and I'd like to understand.
> if you multiplied everything (hyperparameters, initialized weights, training data, etc.) in a network by 10^6, things would still work more or less the same, since the upper range is hardly used (with the possible exception of a small number of special functions)
I doubt that very much. The thing is that inputs are multiplied by weights and added together in a neural network layer, and then the output becomes the input of the next layer, in a cycle that can repeat a hundred times or more. By the time you get to the final output layer, that 10^6 factor has been applied so many times that it has snowballed into a 10^600 factor.
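A toy numeric version of that concern (linear layers only, no nonlinearities or normalization, which real networks have and which change the picture; the sizes and scale factor are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    layers = [rng.standard_normal((16, 16)) * 0.1 for _ in range(4)]
    x = rng.standard_normal(16)

    def forward(x, layers, scale=1.0):
        h = x * scale                  # scale the input
        for W in layers:
            h = (W * scale) @ h        # and every weight matrix
        return h

    base = forward(x, layers)
    scaled = forward(x, layers, scale=10.0)
    print(scaled / base)               # ~1e5: one factor of 10 from the input,
                                       # one from each of the 4 weight matrices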
The DeepSeek-V3 paper details a quantization method that applies scaling after the matmul but before accumulation to improve precision. This differs from a normal GEMM, where any rescaling is left until the end; you can read more in section 3.3 of the paper below.
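A rough sketch of that idea (my own simplification, not DeepSeek's actual scheme: they use much finer-grained tile/block scales and FP8 with FP32 accumulation on the CUDA cores; here float16 stands in for the low-precision format and the block size of 128 is just an illustrative choice): quantize A and B in blocks along K, do the low-precision matmul per block, rescale that partial result, and accumulate the rescaled partials in float32.

    import numpy as np

    def blockwise_scaled_matmul(A, B, block=128):
        M, K = A.shape
        _, N = B.shape
        acc = np.zeros((M, N), dtype=np.float32)
        for k0 in range(0, K, block):
            a, b = A[:, k0:k0+block], B[k0:k0+block, :]
            sa = np.abs(a).max() or 1.0          # per-block scale for A
            sb = np.abs(b).max() or 1.0          # per-block scale for B
            aq = (a / sa).astype(np.float16)     # stand-in for fp8 quantization
            bq = (b / sb).astype(np.float16)
            partial = aq.astype(np.float32) @ bq.astype(np.float32)
            acc += partial * (sa * sb)           # rescale before accumulating
        return acc

    A = np.random.randn(64, 512).astype(np.float32)
    B = np.random.randn(512, 64).astype(np.float32)
    print(np.abs(blockwise_scaled_matmul(A, B) - A @ B).max())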
Note to others reading along: on the last appendix page, the OP paper reports that DFloat11 reduces tokens/sec by ~2-3x for the Llama-3.1-8b, Qwen-2.5-14b/32b, and Mistral-small-24b models (the throughput penalty is not reported for the others).
Using DFloat11, tokens/sec was higher only relative to running inference with some layers offloaded to the CPU.
Classic comp sci tradeoff between space and speed, no free lunch, etc.
That makes you think: if we could rewind time, maybe we should have just allocated one more bit to half precision (6 exponent, 9 mantissa) and not done this bfloat16 thing.
re #3, if your RSU windfall is substantially large, you might be eligible for the 100%/110% safe harbor that won't penalize you for tax underpayments (assuming you are a US taxpayer)
e.g., you make $200K in 2024 and $5 million in 2025 (which includes the RSU windfall). Assuming you pay at least 110% of your 2024 tax during 2025, you need not pay estimated tax or anything beyond statutory withholding amounts on the RSU windfall, and can just make up the 6 or 7 figures of tax owed at tax settlement time (i.e., by April 15/16 after the tax year in question). This is the optimal strategy: you can park the money for the tax owed in as close to a risk-free investment as possible in the meantime.
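In rough numbers (hypothetical figures, federal only, and ignoring the 100% threshold that applies at lower AGI):

    prior_year_tax = 45_000           # total 2024 federal tax (assumed)
    safe_harbor    = 1.10 * prior_year_tax

    eventual_2025_tax = 1_800_000     # what you'll actually owe for 2025 (assumed)
    withheld_2025     = 60_000        # statutory withholding during 2025 (assumed)

    if withheld_2025 >= safe_harbor:
        # No underpayment penalty; the balance is simply due at filing time.
        print("due by the April deadline:", eventual_2025_tax - withheld_2025)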
Statutory withholding rates might be higher; e.g., at my employer, if your RSU earnings are below $1 million, you can set your federal withholding as low as 22%. If your earnings are above $1 million, you are stuck with the 37% mandatory federal withholding rate (both done by sell to cover). This does not include per-state withholding minima, which can vary widely.
The issue is not that they want their withholding to be correct for the taxes they owe. The issue is that the company needs to follow the withholding rules and, probably for cash-flow reasons or maybe for tricky equity-law reasons, would like the former employee to provide the withholding rather than doing a net share settlement or sell-to-cover.
This should count as a supplemental wage payment. The 22% rate for supplemental wages only applies if income is under $1M and the person was paid wages by the employer this year or last; details in publication 15 https://www.irs.gov/publications/p15#en_US_2025_publink10002...
Thanks for mentioning the safe harbor rule. We are actually aware of that.
The issue here is that the company is asking for the payment to be made directly to the company's bank account, or else the RSUs will be forfeited forever. This makes the situation much worse IMHO.
Right, the safe harbor rule isn't relevant here. The company is required to do withholding at the time the shares are delivered to you. They've chosen the most burdensome method for you as the only option. I'm not sure there's a way to legally force them to allow a sell-to-cover option, but I really hope so for y'all's sake. This feels really shady.
I have some of (possibly the?) cheapest residential electric power in the US, at 5.58 cents per kWh all-in cost here in Wyoming, 90%+ hydropower.
Absolute lowest cold here each year will be around -30 F / -34 C (there will be several nights in the winter where it gets below -20 F / -29 C), and absolute hottest it will ever be around 85 F / 29 C, but average annual temperature is about 35 F / 2 C. It can snow any month of the year here, with snow on the ground usually between November and mid May.
My house was built in 1968 and I have primarily resistive baseboard heating, with a large Mitsubishi mini-split installed by my home's previous owner mainly for air conditioning purposes in major rooms for a couple of weeks in the summer. I live at 6500 ft / 2000 m altitude, so even on the hottest summer days once the sun goes down it gets quite chilly and can get close to freezing, so it's really just for a few hours in the afternoon for a/c purposes. I otherwise use the heat pumps as baseline heat in the winter.
I'd like to put more trust in heat pumps because they are obviously more efficient (as also seen in my already low power bill), but a lack of heat on certain days in the winter has serious implications here for home integrity. While this might just be this one Mitsubishi model (though they are less than 5 years old), I haven't been left with a good opinion of heat pump design and repairability in general and am not tempted to explore heat pumps further.
The heat pumps are rated to work down to -5 F / -21 C in the manual, but in practice it's more like 15 F / -9 C; below that they just spend a large part of their time defrosting. The models I have don't seem well engineered for reliability or maintenance either: there are important fuses hard-soldered to the main board that are not individually replaceable, and sure enough, in the middle of winter my HVAC technician and I had to bypass the blown fuses with an automotive fuse we had (same ratings) attached with alligator clips, as it would take weeks or months to obtain a new $1500 (!) main circuit board from who knows where. On the other hand, resistive heating usually just works assuming you have power, and I also have two fireplaces as emergency backup if there's no power (though power lines are almost all buried here due to snow/ice anyway).
I really would like to see more emphasis on reliability and repairability rather than, like, SEER, HSPF, or COP ratings or whatever.
One of the search criteria is the rated capacity (BTU/h) at 5°F. A 'proper' installer will figure out the energy you need for the design day in your area:
Of course, before you go spending money on equipment, you'll probably get a better ROI from air sealing / draft elimination and insulation. Once you're not leaking (as much) heat, you may need less powerful equipment to keep indoor conditions comfortable.
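As a very rough sketch of the sizing arithmetic (made-up numbers, not a substitute for a proper load calculation by an installer): required capacity is roughly the whole-house UA times the design-day temperature difference, which you then compare against the unit's rated low-temperature capacity.

    ua_btu_per_hr_f = 450             # assumed whole-house UA (BTU/h per degree F)
    indoor_f, design_f = 70, -10      # assumed indoor setpoint and design-day low

    design_load = ua_btu_per_hr_f * (indoor_f - design_f)
    print(design_load, "BTU/h")       # 36000 here; the unit's rated capacity at
                                      # low ambient temperature should cover this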
Our heat pump (~2 years old) doesn't seem to have much trouble with -21C, though we rarely get down that low. With frequent temperatures that cold, a combo unit with natural gas backup for the coldest temperatures likely makes a lot of sense.
Brute-force indices are usually arithmetic bound (e.g., GEMM).
Cell-probe based indices are usually memory bandwidth bound (IVF, LSH bucketing, etc).
Graph-based indices are usually memory latency bound (traversing linked lists / graph data structures).
(I wrote the GPU half of Faiss and work with the people who wrote this paper).
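To make the first point concrete, here is a minimal NumPy sketch (not Faiss's code) of why brute-force search is dominated by a GEMM: the pairwise squared distances expand into norm terms plus a -2 * Q @ D^T term, and that matrix multiply is essentially all of the work.

    import numpy as np

    def knn_bruteforce(queries, database, k):
        # -2 * Q @ D^T is the GEMM; the norm terms are cheap corrections.
        d2 = (
            (queries ** 2).sum(axis=1, keepdims=True)
            - 2.0 * queries @ database.T
            + (database ** 2).sum(axis=1)
        )
        return np.argpartition(d2, k, axis=1)[:, :k]   # k smallest, unordered

    Q = np.random.randn(1_000, 128).astype(np.float32)
    D = np.random.randn(100_000, 128).astype(np.float32)
    print(knn_bruteforce(Q, D, k=10).shape)            # (1000, 10)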
If you have a limited number of long-range ICBMs, then you will likely prefer more directly military targets over a manufacturing facility, which would likely only start to matter months into a conflict. A drawn-out conventional war is itself a scenario that is likely precluded by an exchange of nuclear weapons in the first place.
> If you have a limited number of long range ICBMs
China has hundreds going on thousands of ICBMs. Nobody is creating redundancy from Boise to Albany and Sunnyvale to increase survivability in case of a nuclear exchange between America and China.
Sorry, I should have said hundreds going on a thousand. Glad we put that fab in Sunnyvale!
(442 is hundreds. Your own source says the "Pentagon also estimates that China’s arsenal will increase to about 1,000 warheads by 2030, many of which will probably be 'deployed at higher readiness levels' and most 'fielded on systems capable of ranging the [continental United States]'." By 2035 that could grow up to 1,500. These are MAD figures.)
As someone who has worked in this space (approximate compute) on both GPUs and in silicon in my research, the power consumption claims are completely bogus, as are the accuracy claims:
> In this section, we show that L-Mul is more precise than fp8 e4m3 multiplications
> To be concise, we do not consider the rounding to nearest even mode in both error analysis and complexity estimation for both Mul and L-Mul
These two statements together are nonsensical. Sure, if you analyze accuracy while ignoring the part of the baseline algorithm that gives it its accuracy, you can derive whatever cherry-picked result you want.
If you round to nearest even, the product of two floating point values will be the correctly rounded result of multiplying the original values at infinite precision; this is how floating point rounding usually works, and it is what IEEE 754 mandates for fundamental operations (such as the multiplication here) if you choose to follow those guidelines. Not rounding to nearest even will result in a lot more quantization noise, and biased noise at that.
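A small sketch of that effect (simulating a reduced-precision mantissa in NumPy; the 10-bit mantissa is just an illustrative choice, and np.rint rounds halves to even):

    import numpy as np

    def quantize_mantissa(x, bits=10, truncate=False):
        m, e = np.frexp(x.astype(np.float64))        # x = m * 2**e, 0.5 <= |m| < 1
        scale = 2.0 ** bits
        mq = np.trunc(m * scale) if truncate else np.rint(m * scale)
        return np.ldexp(mq / scale, e)

    rng = np.random.default_rng(0)
    a, b = rng.random(1_000_000), rng.random(1_000_000)
    exact = a * b
    err_rne   = quantize_mantissa(exact) - exact
    err_trunc = quantize_mantissa(exact, truncate=True) - exact

    print("mean error, rounded:  ", err_rne.mean())    # ~0: unbiased
    print("mean error, truncated:", err_trunc.mean())  # negative: biased toward zero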
> applying the L-Mul operation in tensor processing hardware can potentially reduce 95% energy cost by elementwise floating point tensor multiplications and 80% energy cost of dot products
A good chunk of the energy cost is simply moving data between memories (especially external DRAM/HBM/whatever) and along wires, buffering values in SRAMs and flip-flops and the like. Combinational logic cost is usually not a big deal. While having a ton of fixed-function matrix multipliers does raise the cost of combinational logic quite a bit, at most what they have will probably cut the power of an overall accelerator by 10-20% or so.
> In this section, we demonstrate that L-Mul can replace tensor multiplications in the attention mechanism without any loss of performance, whereas using fp8 multiplications for the same purpose degrades inference accuracy
I may have missed it in the paper, but they have provided no details on (re)scaling and/or using higher precision accumulation for intermediate results as one would experience on an H100 for instance. Without this information, I don't trust these evaluation results either.
Aiming for higher occupancy is not always the desired solution; what frequently matters more is avoiding global memory latencies by retaining more data in registers and/or shared memory. This was first noted in 2010 and is still true today:
I would also think in terms of latency hiding rather than just work parallelism (though latency hiding on GPUs is largely achieved through parallelism). This is the reason GPUs have massive register files: unlike modern multi-core CPUs, they omit latency-reducing hardware (e.g., speculative execution, large caches, out-of-order execution/register renaming, etc.), and in order to fill pipelines we need many instructions outstanding, which means the operands for those pending instructions need to stay around a lot longer.
I agree that optimizing for lower occupancy can yield significant performance gains in specific cases, especially when memory latencies are the primary bottleneck. Leveraging ILP and storing more data in registers can indeed help reduce the need for higher occupancy and lead to more efficient kernels. The examples in the GTC2010 talks highlighted that quite well. However, I would argue that occupancy still plays an important role, especially for scalability and general-purpose optimization. Over-relying on low occupancy and fewer threads, while beneficial in certain contexts, has its limits.
The first thing to consider is register pressure. Increasing the number of registers per thread to optimize for ILP can lead to register spilling when the register file is exhausted, which drastically reduces performance. This becomes more pronounced as problem sizes scale up (the talk's examples avoid that problem). Many real-world applications, especially compute-bound kernels, need high occupancy to fully utilize the GPU's resources. Focusing too much on minimizing thread counts can lead to underutilization of the SM's parallel execution units. A standard example would be inference engines.
Also, while low-occupancy optimizations can be effective for specific workloads (e.g., memory-bound kernels), designing code that depends on such strategies as a general practice can result in less adaptable and robust solutions across a wide variety of applications.
I believe there is a balance to strike here. Low occupancy can work for specific cases, while higher occupancy often provides better scalability and overall performance for more general use cases. But you have to test for that while you are optimizing your code; there is no general rule of thumb to follow here.
> The first thing to consider is the register pressure. Increasing the number of registers per thread to optimize for ILP can lead to register spilling when the register file is exhausted
Kernels should almost never use local memory (except in arcane cases where you are using recursion and thus a call stack that will spill where an alternative non-recursive formulation would not really work).
> Many real-world applications, especially compute-bound kernels, need high occupancy to fully utilize the GPU’s resources
> while low-occupancy optimizations can be effective for specific workloads (e.g, memory-bound kernels)
I think this is almost exactly backwards: performant, high compute intensity kernels (on a (fl)op/byte of memory traffic basis) tend to uniformly have low occupancy; look at an ncu trace of many kernels in cuBLAS or cuDNN, for instance. You need a large working set of arguments in registers or in smem to feed scalar arithmetic or especially MMA units quickly enough, as gmem/L2 bandwidth alone is not sufficient to achieve peak performance in many cases. The only thing you need to do is ensure that you are using all SMs (and thus all available scalar arithmetic or MMA units), which does not by itself imply high occupancy (e.g., a kernel that has 1 CTA per SM).
The simplest way to write a memory-bound kernel is to simply spawn a bunch of threads and perform loads/stores from them, and it isn't too hard to achieve close to peak this way. But even then, depending upon the warp scheduler to rotate other warps in to issue more loads/stores is inferior to unrolling loops, and with such unrolling you can also get close to peak memory bandwidth without using that many SMs, so even these kernels need not have high occupancy.
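For the flavor of it, a sketch of the unrolled approach (written with numba.cuda for brevity rather than CUDA C++; the grid/block sizes and unroll factor are arbitrary illustrative choices, and a real comparison would time this against a naive one-element-per-thread version):

    import numpy as np
    from numba import cuda

    UNROLL = 4  # elements handled per thread per loop iteration

    @cuda.jit
    def unrolled_copy(dst, src):
        stride = cuda.gridsize(1) * UNROLL
        base = cuda.grid(1) * UNROLL
        for start in range(base, src.size, stride):
            # The UNROLL independent loads below can all be in flight at once,
            # hiding memory latency with ILP rather than with extra warps.
            for j in range(UNROLL):
                i = start + j
                if i < src.size:
                    dst[i] = src[i]

    x = cuda.to_device(np.arange(1 << 20, dtype=np.float32))
    y = cuda.device_array_like(x)
    unrolled_copy[128, 256](y, x)    # deliberately few blocks relative to the data
    assert np.array_equal(y.copy_to_host(), x.copy_to_host())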
(I've been Nvidia GPU programming for around 11 years and wrote the original pytorch GPU backend/tensor library, the Faiss GPU library, and contributed some stuff to cuDNN in its early days such as FFT convolution.)
Remote start is an accidental carbon monoxide poisoning waiting to happen if your garage is directly connected to your residence. I live in an area of Wyoming with brutal winters and just bought a new Ford Bronco; I wish I could fully disable it (there's a button on the key fob as well).
Ford's remote start has a 15 minute cutoff without any intervention. Using numbers from a cold-started 2011 F-150 Raptor during a driving test, trapping it in a single-car garage for 15 minutes with zero circulation results in ~40-120 ppm carbon monoxide*, which would still take multiple hours of exposure before symptoms occur. Your Bronco probably has lower emissions, and with good circulation to the much larger air volume of the rest of the house, CO poisoning really shouldn't be a concern. Obviously don't use the feature with the door closed, but accidentally triggering it wouldn't be that big of a deal. It is also pretty hard to accidentally trigger from the fob, requiring pressing the lock button and the remote start button twice in quick succession.
* 0.263-0.725 g CO/min in a 12x22x10 ft garage. The higher number is from testing under load before the cat is warmed up.
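For anyone who wants to check the arithmetic (same inputs as the footnote, assuming zero air exchange and a molar volume of ~24.45 L/mol at 25 C):

    CO_G_PER_MIN = (0.263, 0.725)      # cold start vs. under load, per the footnote
    MINUTES      = 15
    GARAGE_FT3   = 12 * 22 * 10

    garage_liters = GARAGE_FT3 * 28.3168
    air_mol       = garage_liters / 24.45        # mol of air in the garage

    for rate in CO_G_PER_MIN:
        co_mol = rate * MINUTES / 28.01          # molar mass of CO ~28 g/mol
        print(round(co_mol / air_mol * 1e6), "ppm")   # ~46 and ~127 ppm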