
Sure, but I have to support a range of target CPUs in the consumer desktop market, and the older CPUs are the ones that need optimizations the most. That means NEON on ARM64 and AVX2 or SSE2-4 on x64. Time spent on higher vector instruction sets benefits a smaller fraction of the user base that already has better performance, and that's especially problematic if the algorithm has to be reworked to take best advantage of the higher extensions.

AVX-512 is also in bad shape market-wise, despite its amazing feature set and how long it's been since initial release. The Steam Hardware Survey, which skews toward the higher end of the market, only shows 18% of the user base having AVX-512 support. And even that is despite Intel's best efforts to reverse progress by shipping all new consumer CPUs with AVX-512 support disabled.
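That tradeoff usually ends in a small runtime dispatcher that checks from the highest extension down. A minimal sketch -- pick_kernel and the bool flags are hypothetical; in real code the flags would come from __builtin_cpu_supports() on GCC/Clang or a raw CPUID query:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical dispatch helper: pick the best kernel the CPU supports,
// checking from the highest extension down. The flags are plain bools
// here so the selection logic itself is testable off-target.
const char *pick_kernel(bool has_sse2, bool has_avx2, bool has_avx512) {
    if (has_avx512) return "avx512";  // smallest share of the user base
    if (has_avx2)   return "avx2";
    if (has_sse2)   return "sse2";    // baseline for x64
    return "scalar";
}
```

The catch the comment describes is exactly that the top branch is the one least often taken, and the one most likely to need a restructured algorithm to pay off.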


The problem is that the built-in mechanism is often microcode, which is still slower than plain machine code in some cases.

There are some interesting writings from a former architect of the Pentium Pro on the reasons for this. One is apparently that the microcode engine often lacked branch prediction, so handling special cases in the microcode was slower than compare/branch in direct code. REP MOVS has a bunch of such cases due to the need to handle overlapping copies, interrupts, and determining when it should switch to cache line sized non-temporal accesses.

More recent Intel CPUs have enhanced REP MOVS support, with faster microcode and a CPUID flag (ERMS) indicating that memcpy() should rely on it more often. But people have still found cases where, with just the right relative alignment between source and destination, a manual copy loop is noticeably faster than REP MOVS.


Apparently they just shut it down in 2024, but a couple of years ago I tested an Atari 1030 modem by dialing out to Earthlink, and it still worked -- successfully connected at 300 baud.


It used to block focus stealing aggressively unless a program had foreground permission or was given it (AllowSetForegroundWindow), but the mechanism seems broken in current versions of Windows.


Do you have a link to more details or something? I haven't seen or heard of what you describe.


Afraid not. The documentation for AllowSetForegroundWindow() and the associated mechanism still exists, of course:

https://learn.microsoft.com/en-us/windows/win32/api/winuser/...

But the last time I tried to test code using this to properly hand off foreground permission from one process to another, I had a hard time because I couldn't get it to fail. When this mechanism was first introduced in Windows 98 and 2000, it was pretty aggressive -- if you were past the input timeout and foreground permission hadn't been forwarded or already shared, the target application would fail to come to the front and its taskbar button would light up instead. I haven't seen this happen in a long time on current Windows; programs steal focus all the time.
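The intended handoff pattern, as I read the documentation, is roughly this -- a Windows-only sketch, with HandOffForeground and childPid as illustrative names:

```cpp
#include <windows.h>

// Sketch of the documented handoff: a process that currently holds
// foreground permission grants it to another process before that
// process tries to bring a window forward. The call fails unless the
// caller itself is allowed to set the foreground window.
void HandOffForeground(DWORD childPid, HWND childWindow) {
    // Grant the child process the right to take the foreground.
    AllowSetForegroundWindow(childPid);
    // Normally the child would call SetForegroundWindow() itself once
    // it has a window; shown here in one place for brevity.
    SetForegroundWindow(childWindow);
}
```

In the original design, skipping the AllowSetForegroundWindow() step was supposed to make the SetForegroundWindow() call silently fail and flash the taskbar button instead.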


Haven't used windows in a decade, but there is (was?) a registry setting that would disable focus stealing prevention. Some egregious tools "helpfully" changed that setting for you when you installed them, because they couldn't get focus management to work properly. Maybe it's that?


Interference from some program is a possibility, but the relevant foreground lock registry keys seem default on all my systems.


I'm pretty sure something weird is going on in your case because I recently had to fight a case of this that had seemingly gotten more aggressive in Windows 11, not less. The focus stealing prevention has always been there and is still there as far as I've observed.


You have to go pretty far back; it was in the Visual C++ 6.0 EULA, for instance (for lack of a better link):

https://proact.eu/wp-content/uploads/2020/07/Visual-Basic-En...

It wasn't a blanket prohibition, but a restriction on some parts of the documentation and redistributable components. It was definitely weird to see that in the EULA for a toolchain. This was removed later on, though I forget whether that's because they changed their mind or removed the components.


There are some other annoyances, like not being able to inline initialize a bitfield prior to C++20, and sometimes having to use unnatural typing to get optimal packing depending on the ABI. But I've seen them used; compilers have gotten pretty good at optimizing them and can coalesce writes or tests against adjacent bitfields.
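For example, the inline initialization finally works in C++20 -- PixelFlags is a made-up struct, and the snippet needs -std=c++20:

```cpp
#include <cstdint>

// C++20 allows default member initializers on bit-fields; before that,
// these had to be set in a constructor.
struct PixelFlags {
    std::uint8_t mode  : 3 {5};  // ill-formed prior to C++20
    std::uint8_t dirty : 1 {1};
    std::uint8_t       : 4;      // unnamed padding bits
};
```

On common ABIs the eight bits above pack into a single byte.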


> If you're optimizing memory use for that, it's really about runtime speed and less about total memory usage. You're trying to make things small so you have fewer cache misses and the CPU doesn't get stuck waiting as much.

The complication is that cache is a global resource. Code that has larger data structures, even if it runs faster, can contribute to a larger working set and a slightly higher cache miss rate in other code. This can lead to a death-by-a-thousand-cuts scenario, and it's hard to get solid data on this when profiling.

You're right, though, that there are a number of ways that smaller structs can be slower, either directly by execution time or indirectly by larger code causing more misses in the instruction cache. Some other factors include whether the compiler can coalesce accesses to multiple fields grouped together, or whether the struct hits a magic power of two size for faster array indexing or vectorization. Which means you can end up speeding up code occasionally by making the struct bigger.
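A small illustration of the packing point -- sizes assume a typical 64-bit ABI with 4-byte alignment for uint32_t:

```cpp
#include <cstdint>

// Same three fields, different order: padding changes the size.
struct Loose { std::uint8_t a; std::uint32_t b; std::uint8_t c; };
// typically 1 + 3 pad + 4 + 1 + 3 tail pad = 12 bytes

struct Tight { std::uint32_t b; std::uint8_t a; std::uint8_t c; };
// typically 4 + 1 + 1 + 2 tail pad = 8 bytes
```

Whether the 8-byte layout is actually faster still depends on the access patterns and indexing math the surrounding comment describes.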


Even then, I've seen LLMs generate code with subtle bugs that even experienced programmers would trip on. For the Atari specifically, I've seen:

- Attempting to use BBC BASIC features in Atari BASIC, in ways that parsed but didn't work

- Corrupting OS memory due to using addresses only valid on an Apple II

- Using the ORG address for the C64, such that it corrupts memory if loaded from Atari DOS

- Assembly that subtly doesn't work because it uses 65C02 instructions that execute as a NOP on a 6502

- Interrupt handlers that occasionally corrupt registers

- Hardcoding internal OS addresses only valid for the OS ROM on one particular computer model

The POKE 77,0 in the article is another good example. ChatGPT labeled it as hiding the cursor, but that's wrong -- location 77 is the attract timer counter in the Atari OS. Clearing it to 0 periodically resets the timer that controls the OS's primitive screensaver. But for this to work, it has to be done periodically -- doing it once at the start just resets the timer once, after which attract mode will still kick in 9 minutes later. So effectively, this is an easter egg that snuck into the program, and one that doesn't work even if the unrequested behavior were desirable.


Note that this issue also affects NEON. Two examples are vmull_p64(), which requires the Crypto extension -- notably absent on RPi3/4 -- and vqrdmlah_s32(), which requires FEAT_RDM, not guaranteed until ARMv8.1. Unlike Intel, ARM doesn't do a very good job of surfacing this in their intrinsics guide.
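A sketch of gating those intrinsics at runtime: the constants below mirror Linux's <asm/hwcap.h> for AArch64, and on a real system the mask would come from getauxval(AT_HWCAP). They're kept as pure functions here so the gating logic is testable off-target:

```cpp
#include <cstdint>

// Bit positions match the AArch64 <asm/hwcap.h> definitions.
constexpr std::uint64_t kHwcapPmull    = 1ull << 4;   // Crypto ext: vmull_p64()
constexpr std::uint64_t kHwcapAsimdRdm = 1ull << 12;  // FEAT_RDM: vqrdmlah_s32()

constexpr bool can_use_vmull_p64(std::uint64_t hwcap) {
    return hwcap & kHwcapPmull;
}
constexpr bool can_use_vqrdmlah_s32(std::uint64_t hwcap) {
    return hwcap & kHwcapAsimdRdm;
}
```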


Perforce does not lock files on checkout unless you have the file specifically configured to enforce exclusive locking in the file's metadata or depot typemap.
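For reference, exclusive locking is opted into per file type with the +l modifier; a hypothetical typemap fragment (the depot paths are examples):

```
# p4 typemap -- force exclusive-open (+l) on binary assets
TypeMap:
        binary+l //depot/....psd
        binary+l //depot/....max
```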

