Depends on the server. This test got 79W idle for a _two socket_ E5-2690 v4 server.

https://www.servethehome.com/lenovo-system-x3650-m5-workhors...


The problem is with the form factor, not the server hardware per se. If one buys a regular ATX motherboard that accepts server CPUs and fits it in a regular ATX case, then there's lots of space for a relatively quiet CPU air cooler. The 2690 v4 idles at less than 40W, which is not much more than a regular gaming desktop with a powerful GPU.

The only problem in practice is that server CPUs don't support S3 suspend, so putting the whole thing to sleep after finishing with it doesn't work.


Better to build a single workstation - less noise, less power usage, and the form factor is way more convenient. A budget of $3000 can buy 128 cores with 512GB of RAM on a single regular EATX motherboard, plus a case, a power supply and other accessories. Power usage is ~550W at maximum utilization, which is not much more than a gaming rig with a powerful GPU.


> Today it's a bit more complicated when you have servers with 100+ cores as an option for under $30k (guestimate based on $10k CPU price).

If one can buy used, then a previous-generation 128C/256T Epyc server is less than $5k. For homelabs that can accept non-rackmount gear, it's less than $3k.


That's just an artifact of Intel disabling ECC on consumer processors.

There's no reason for ECC to have significantly higher power consumption. It's just an additional memory chip per stick and a tiny bit of additional logic on the CPU side to calculate the ECC.

If power consumption is the target, ECC is not a problem. I know firsthand that even old Xeon D servers can hit 25W full-system idle. On the AMD side, the 4850G has 8 cores and can hit sub-25W full-system idle as well.


My HP 800 mini idles at 3W


The state of the art for local models is even further along.

For example, look into https://github.com/kvcache-ai/ktransformers, which achieves >11 tokens/s on a relatively old two-socket Xeon server plus a retail RTX 4090 GPU. Even more interesting is the prefill speed of more than 250 tokens/s. This is very useful in use cases like coding, where large prompts are common.

The above is achievable today. In the meantime, the Intel folks are working on something even more impressive. In https://github.com/sgl-project/sglang/pull/5150 they claim >15 tokens/s generation and >350 tokens/s prefill. They don't share what exact hardware they run this on, but from various bits and pieces across various PRs I reverse-engineered that they use 2x Xeon 6980P with MRDIMM 8800 RAM, without a GPU. The total cost of such a setup will be around $10k once cheap engineering samples hit eBay.
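For a rough sense of why a GPU-less setup like that can work at all: decode is mostly memory-bandwidth bound, and assuming 12 memory channels per socket for that part (my assumption from public specs, not something stated in the PR), the theoretical bandwidth per socket is quite large:

  # 8800 MT/s x 8 bytes/transfer x 12 channels, in GB/s
  $ echo "scale=1; 8800 * 8 * 12 / 1000" | bc
  844.8

Sustained bandwidth will be well below that theoretical peak, but it gives a sense of the headroom behind those token rates.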


It's not impressive nor efficient when you consider batch sizes > 1.


All of this is for batch size 1.


I know. That was my point.

Throughput doesn't scale on CPU as well as it does on GPU.


We both agree. Batch size 1 is only relevant to people who want to run models on their own private machines, which is the case for the OP.


Incorrect. https://en.wikipedia.org/wiki/USB_hardware#USB_Power_Deliver... is a good starting point on the subject: "PD-aware devices implement a flexible power management scheme by interfacing with the power source through a bidirectional data channel and requesting a certain level of electrical power <...>".


You can tell I have been in the 5V world too much; thanks for the correction.


For all intents and purposes the cache may as well not exist when the working set is 17B or 109B parameters. So it's still better that fewer parameters are activated for each token: 17B active parameters run ~6x faster than 109B just because less data needs to be loaded from RAM.
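As a sanity check of that ratio (the 4-bit quantization and the ~100 GB/s of effective memory bandwidth are assumptions I'm picking just to make the units concrete):

  # bytes read per token: 17B params x 0.5 B ≈ 8.5 GB, 109B x 0.5 B ≈ 54.5 GB
  # upper bound on tokens/s at ~100 GB/s of effective bandwidth
  $ echo "scale=1; 100 / 8.5; 100 / 54.5" | bc
  11.7
  1.8

11.7 vs 1.8 tokens/s is the same ~6x ratio, which is what you'd expect from a purely bandwidth-bound decode.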


Yes, "loaded from RAM" vs "loaded to RAM" is the big distinction here.

It will still be slow if portions of the model need to be read from disk to memory each pass, but only having to execute portions of the model for each token is a huge speed improvement.


It doesn't take too expensive of a MacBook to fit 109B 4-bit parameters in RAM.
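Back of the envelope (treating 4-bit as exactly 0.5 bytes per parameter and ignoring KV cache and runtime overhead):

  # 109B params x 0.5 bytes, in GB
  $ echo "109 * 0.5" | bc
  54.5

That fits in 64 GiB with some room left over for the KV cache and the OS, so it's tight but workable.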


Is a 64GiB RAM MacBook really that expensive, especially compared with Nvidia GPUs?


That's why I said it's not too expensive.


Apologies, I misread your comment.


Most laptops are severely limited by heat dissipation, so it's normal that performance is much worse. The CPU cannot stay in turbo as long and must drop to lower frequencies sooner. On longer benchmarks the CPU starts throttling due to heat and becomes even slower.


The container security boundary can be much stronger if one wants.

One can use something like https://github.com/google/gvisor as a container runtime for podman or docker. It's a good hybrid between VMs and containers. The container is put into a sort of VM via KVM, but it does not supply a kernel and instead talks to a fake one. This means the security boundary is almost as strong as a VM's, but mostly everything will work like in a normal container.

E.g. here I can read the host filesystem even though uname says weird things about the kernel the container is running in:

  $ sudo podman run -it --runtime=/usr/bin/runsc_wrap -v /:/app debian:bookworm  /bin/bash
  root@7862d7c432b4:/# ls /app
  bin   home            lib32       mnt   run   tmp      vmlinuz.old
  boot  initrd.img      lib64       opt   sbin  usr
  dev   initrd.img.old  lost+found  proc  srv   var
  etc   lib             media       root  sys   vmlinuz
  root@7862d7c432b4:/# uname -a
  Linux 7862d7c432b4 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 GNU/Linux
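The runsc_wrap above is just a thin wrapper script; a minimal sketch, assuming runsc is installed at /usr/local/bin/runsc and that you want the KVM platform, could be as simple as:

  #!/bin/sh
  # hypothetical wrapper: force the KVM platform, pass through the OCI args podman supplies
  exec /usr/local/bin/runsc --platform=kvm "$@"

For docker the same thing is usually done via runtimeArgs in /etc/docker/daemon.json instead of a wrapper.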


gVisor is solid, but it comes with a perf hit. Plus, it does not work with every image.


FWIW the performance hit got a lot smaller in ~2023 when the open-source gVisor switched away from ptrace. (Google had an internal, unpublished faster variant from the start.)

