Hacker News | ack_complete's comments

The antivirus / EDR / monitoring / inventory software that most corporate IT departments install ages computers by ten years. We constantly had problems with such services slamming the disk, holding files open, breaking software, running CPUs at 100%, etc.


Crowdstrike Falcon is likely the only reason my work M1 Pro machine runs like a dog. Any time it's being a laggy piece of junk you can open Activity Monitor and see Falcon just slamming it.


Not my problem. You wouldn't need an antivirus with a properly locked-down browser running uBlock Origin and, of course, no damn HTML email, plus GPOs blocking anything not on an executable whitelist.

If anything, your email client should open attachments in a sandbox such as Sandboxie, which is under a libre license:

https://github.com/sandboxie-plus/Sandboxie

Of course no Office macros would be allowed, ever.


Sets the underlying Registry keys for the Group Policy "Select the target Feature Update version". It tells the Windows Update service to select updates for a specific feature update instead of offering the latest one.

https://gpsearch.azurewebsites.net/Default.aspx?PolicyID=151...
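For reference, a minimal sketch of what writing those policy values directly might look like (assuming the documented TargetReleaseVersion / TargetReleaseVersionInfo values under the WindowsUpdate policy key; normally the Group Policy editor or MDM writes these for you):

    #include <windows.h>
    #include <cwchar>

    // Sketch only: pins feature updates to a specific version, e.g. L"22H2",
    // by setting the policy values that the GPO above controls.
    bool PinFeatureUpdate(const wchar_t* version) {
        HKEY key;
        if (RegCreateKeyExW(HKEY_LOCAL_MACHINE,
                L"SOFTWARE\\Policies\\Microsoft\\Windows\\WindowsUpdate",
                0, nullptr, 0, KEY_SET_VALUE, nullptr, &key, nullptr) != ERROR_SUCCESS)
            return false;

        const DWORD enable = 1;
        RegSetValueExW(key, L"TargetReleaseVersion", 0, REG_DWORD,
                       reinterpret_cast<const BYTE*>(&enable), sizeof(enable));
        RegSetValueExW(key, L"TargetReleaseVersionInfo", 0, REG_SZ,
                       reinterpret_cast<const BYTE*>(version),
                       static_cast<DWORD>((wcslen(version) + 1) * sizeof(wchar_t)));

        RegCloseKey(key);
        return true;
    }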


Thank you!


Worse than that, there's no consistency in Fn+key shortcuts. Recently acquired an HP Ergonomic Keyboard as a replacement for a broken Sculpt, only to find out that it literally cannot send Ctrl+Break -- there's no key for it, no Fn+key shortcut for it and the remapping software doesn't simulate it properly.


Buy the keyboard you want. There are plenty of good ones.


They're talking about laptops, where there's rarely any choice besides ANSI/ISO -- maybe a few country-specific accent layouts of the latter.


The keyboard I was mentioning isn't a laptop keyboard, actually, but laptop keyboards tend to be in a slightly better spot as the major vendors typically have Fn shortcuts for the missing keys, like Fn+B for Break, and they also document them in the user guides.

Detached keyboards seem to be more of a wild west, especially when they target multiplatform -- and it's always the stuff they don't document that screws you.


This is unfortunately the same for GPUs. The graphics APIs expose capability bits or extensions indicating what features the hardware and driver support, but the graphics vendors don't always publish documentation on which generations of their hardware support which features, so your program is expected to dynamically adapt to arbitrary combinations of features. This is no longer as bad as it used to be due to consolidation in the graphics market, but people still have to build ad hoc, crowd-sourced databases of GPU caps bits.

It's also not monotonic: on both the CPU and GPU sides, features can go away later, either due to a hardware bug or because the vendor lost interest in supporting them.
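As a minimal sketch of the kind of runtime capability check this forces on you (Vulkan here; device selection and error handling omitted, and the wideLines feature is just an arbitrary example):

    #include <vulkan/vulkan.h>

    // Query the caps bits at runtime and adapt, rather than assuming that a
    // given hardware generation implies support for a feature.
    bool SupportsWideLines(VkPhysicalDevice physicalDevice) {
        VkPhysicalDeviceFeatures features{};
        vkGetPhysicalDeviceFeatures(physicalDevice, &features);
        return features.wideLines == VK_TRUE;
    }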


AVX(2)'s main advantage is its 256-bit width, since many of its operations are simply concatenated 128-bit ops (which gets weird for ops like VPALIGNR), and cross-lane operations are expensive. NEON, on the other hand, only supports 128-bit ops, so AVX operations need to be split by the emulator.

I'd expect more of a gain from enabling FMA, but that's assuming the program actually got built to use FMA -- it needs to either use it explicitly or have relaxations to allow the contraction. Oryon has 4 x 128-bit NEON pipes with 3c latency fadd and 4c latency fmul/fma, so it easily ends up latency bottlenecked unless there are plenty of independent calculations.
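As a rough illustration of the FMA point (whether the scalar form actually contracts depends on compiler flags such as -ffp-contract or /fp:contract, so treat this as a sketch):

    #include <cmath>
    #include <immintrin.h>

    float muladd_scalar(float a, float b, float c) {
        return a * b + c;             // contracted into an FMA only if the build allows it
    }

    float muladd_explicit(float a, float b, float c) {
        return std::fmaf(a, b, c);    // always a fused multiply-add
    }

    __m256 muladd_avx2(__m256 a, __m256 b, __m256 c) {
        return _mm256_fmadd_ps(a, b, c);  // explicit 256-bit FMA3 intrinsic
    }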


The compiler is also a factor, as MSVC's ARM64 backend is less mature than the x64 backend, while the xtajit(64) emulators in Windows were written by emulation veterans. But even then, I've typically seen a ~25% penalty between a native optimized ARM64 build and an emulated optimized x64 build. Major optimized code paths being disabled or suboptimal in the ARM64 port would definitely be more plausible, especially in licensed third-party libraries.


The stdcall calling convention used by APIs and API callbacks on Windows x86 doesn't use registers at all; all parameters are passed on the stack. MSVC does support thiscall/fastcall/vectorcall conventions that pass some values in registers, but the system APIs and COM interfaces all use stdcall.

Windows x64 and ARM64 do use register passing, with 4 registers for x64 (rcx/rdx/r8/r9) and 8 registers for ARM64 (x0-x7). Passing an additional parameter on the stack would be cheap compared to the workarounds that everyone has to do now.
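For illustration, the same callback signature on the three targets (CALLBACK expands to __stdcall on x86 and is effectively ignored on x64/ARM64, where the native convention already uses registers):

    #include <windows.h>

    // x86: all four parameters arrive on the stack, callee-cleaned (stdcall).
    // x64: rcx/rdx/r8/r9.  ARM64: x0-x3.
    LRESULT CALLBACK MyWndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam) {
        return DefWindowProc(hwnd, msg, wParam, lParam);
    }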


There's an annoying corner case when using SetWindowLongPtr/GetWindowLongPtr() -- Windows sends WM_GETMINMAXINFO before WM_NCCREATE. This can be worked around with a thread local, but a trampoline inherently handles it. Trampolines are also useful for other Win32 user functions that don't have an easy way to store context data, such as SetWindowsHookEx(). They're also slightly faster, though GetWindowLongPtr() at least seems able to avoid a syscall.

The code as written, though, is missing a call to FlushInstructionCache() and might not work in processes that prohibit dynamic code generation. An alternative is to just pregenerate an array of trampolines in a code segment, each referencing a mutable pointer in a parallel array in the data segment. These can be generated straightforwardly with a little template magic. This adds size to the executable unlike an empty RWX segment, but doesn't run afoul of any dynamic codegen restrictions or require I-cache flushing. The number of trampolines must be predetermined, but the RWX segment has the same limitation.
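For reference, a minimal sketch of the thread-local workaround mentioned above (MyWindow, g_creatingWindow, and the member WndProc are illustrative names, not from the article):

    #include <windows.h>

    struct MyWindow {
        LRESULT WndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam);
    };

    // Stash the object pointer here before calling CreateWindowEx().
    thread_local MyWindow* g_creatingWindow = nullptr;

    LRESULT CALLBACK StaticWndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam) {
        MyWindow* self = reinterpret_cast<MyWindow*>(GetWindowLongPtr(hwnd, GWLP_USERDATA));

        if (!self) {
            if (msg == WM_NCCREATE) {
                // Normal path: pull the pointer out of CREATESTRUCT and store it.
                self = static_cast<MyWindow*>(
                    reinterpret_cast<CREATESTRUCT*>(lParam)->lpCreateParams);
                SetWindowLongPtr(hwnd, GWLP_USERDATA, reinterpret_cast<LONG_PTR>(self));
            } else {
                // WM_GETMINMAXINFO can arrive before WM_NCCREATE; fall back to
                // the pointer stashed by the creating thread.
                self = g_creatingWindow;
            }
        }

        return self ? self->WndProc(hwnd, msg, wParam, lParam)
                    : DefWindowProc(hwnd, msg, wParam, lParam);
    }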


I wasn't aware of the thread-local trick; I solve this problem by not setting WS_VISIBLE and calling SetWindowPos & ShowWindow after CreateWindow returns (this solves some other problems as well).


FlushInstructionCache isn't needed on x86_64. I-cache and D-cache are coherent.


I'm not convinced this is always guaranteed for a Windows x64 program. When running on bare x64 hardware, FlushInstructionCache() does seem to be an (inefficient) no-op on Windows 11 x64, but when running in emulation on Windows 11 ARM64, it executes a significantly larger amount of ARM64 native code -- it looks like it might be ensuring that stale JIT code is flushed.


Nah. As others have said, translating infix to RPN is pretty easy to do. The nasty part was keeping values in registers on the x87 stack, especially within loops. The 8087 couldn't do binary ops between two arbitrary locations on the stack; one of them had to be the top of stack. If you needed to add two non-top locations, for example, you had to exchange (FXCH) one of them to the top of the stack first, which meant that optimized x87 code tended to be a mess of FXCH instructions.
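A tiny illustration of the FXCH shuffle (32-bit MSVC inline assembly, values assumed to already be loaded on the x87 stack; illustrative only):

    void add_st2_st3_example() {
        __asm {
            fxch st(2)          // bring the old st(2) to the top of the stack
            fadd st(0), st(3)   // st(0) += st(3), i.e. old st(2) + old st(3)
            fxch st(2)          // restore the stack order; the sum replaces old st(2)
        }
    }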

Complicating this further, doing this in a loop requires that the stack state match between the start and end of the loop. This can be challenging to do with minimal FXCH instructions. I've seen compilers emit 3+ FXCH instructions in a row at the end of a loop to match the stack state, where with some hairy rearrangement it was possible to get it down to 2 or 1.

Finally, the performance characteristics of different x87 implementations varied in annoying ways. The Intel Pentium, for instance, required very heavy use of FXCH to keep the add and multiply pipelines busy. Other x87 FPUs at the time, however, were non-pipelined, some taking 4 cycles for an FADD and another 4 cycles for FXCH. This meant that rearranging x87 code for Pentium could _halve_ the speed on other CPUs.


To the last point, I would see it the other way around. Rearranging code for the Pentium's pipelined, 0-cycle-FXCH FPU sped up floating point by probably way more than 2x compared to heavily optimized code running on a K5/K6. I'm not even sure the K6/K6-2 ever got 0-cycle FXCH; the K6-3 did, but there was still no FPU pipelining until the Athlon.

Quake wouldn't have happened until the Pentium II if Intel hadn't pipelined the FPU.


You're not wrong, the performance gain from proper FPU instruction scheduling on a Pentium was immense. But applications written before Quake and the Pentium gained prominence, or applications that weren't game-oriented, would have needed more blended code generation. Optimizing for the highest-end CPU at the time at the cost of the lowest-end CPU wouldn't necessarily have been a good idea, unless your lowest CPU was a Pentium. (Which it was for Quake, which was a slideshow on a 486.)

K6 did have the advantage of being OOO, which reduced the importance of instruction scheduling a lot, and having good integer performance. It also had some advantage with 3DNow! starting with K6-2, for the limited software that could use it.


It also happens with digital cameras for similar reasons, due to CCD scanning. But yeah, that doesn't happen looking directly at a CRT.

The bloom is also too blobby because it's a Gaussian blur. I ran into the same issue trying to implement a similar effect. The bloom shape needs to be sharper to look realistic -- which unfortunately also means a non-separable blur.
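A rough sketch of what a sharper-falloff bloom kernel might look like (the exponential falloff and constants are just illustrative; unlike a Gaussian, it doesn't factor into separate horizontal and vertical passes):

    #include <cmath>
    #include <vector>

    // Builds a (2*radius+1)^2 kernel with exp(-falloff * r) weights -- sharper
    // peak and longer tail than a Gaussian -- normalized to preserve brightness.
    std::vector<float> MakeBloomKernel(int radius, float falloff) {
        const int size = 2 * radius + 1;
        std::vector<float> kernel(size * size);
        float sum = 0.0f;
        for (int y = -radius; y <= radius; ++y) {
            for (int x = -radius; x <= radius; ++x) {
                const float r = std::sqrt(float(x * x + y * y));
                const float w = std::exp(-falloff * r);
                kernel[(y + radius) * size + (x + radius)] = w;
                sum += w;
            }
        }
        for (float& w : kernel)
            w /= sum;
        return kernel;
    }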

