
To add to what others have said, additional characteristics of a Win32 critical section:

- It is purely an in-process lock, and was originally the main way of arbitrating access to shared data within a process.

- It is counted and can be recursively locked.

- It defaults to simple blocking, but can also be configured to do some amount of spin-waiting before blocking.

- CritSecs are registered in a global debug list that can be accessed by debuggers. WinDbg's !locks command can display all locked critical sections in a process and which threads have locked them.

Originally in Win32, there were only two types of lock: the critical section for in-process locking and the mutex for cross-process or named locks. Vista added the slim reader/writer (SRW) lock, which is a much lighter-weight, pointer-sized lock that uses the more modern wait-on-address approach to locking.
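
For anyone who hasn't used them, a minimal sketch of both lock types (the spin count of 4000 is just an arbitrary example value):

    #include <windows.h>

    CRITICAL_SECTION g_cs;
    SRWLOCK g_srw = SRWLOCK_INIT;

    void InitLocks() {
        // Spin briefly under contention before falling back to a kernel wait.
        InitializeCriticalSectionAndSpinCount(&g_cs, 4000);
    }

    void UseCriticalSection() {
        EnterCriticalSection(&g_cs);
        EnterCriticalSection(&g_cs);    // OK: critical sections are recursive
        LeaveCriticalSection(&g_cs);
        LeaveCriticalSection(&g_cs);
    }

    void UseSrwLock() {
        AcquireSRWLockShared(&g_srw);      // multiple readers may hold this
        ReleaseSRWLockShared(&g_srw);
        AcquireSRWLockExclusive(&g_srw);   // exclusive, and not recursive
        ReleaseSRWLockExclusive(&g_srw);
    }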


No need for suspicion; the documentation confirms this was a factor:

https://learn.microsoft.com/en-us/windows/win32/winprog64/ru...


Ironically, the app I've had the most trouble with is Visual Studio 2022. Since it has a native ARM64 build and installation of the x64 version is blocked, there are a bunch of IDE extensions that are unavailable.


People who have worked on the Windows x64 emulator argue that TSO isn't as big a deal as commonly claimed; other factors like enhanced hardware flag conversion support and function call optimizations play a significant role too:

http://www.emulators.com/docs/abc_exit_xta.htm


> People who have worked on the Windows x64 emulator argue that TSO isn't as big a deal as commonly claimed

This is a misinterpretation of what the author wrote! There is a real and significant performance impact in emulating x86 TSO semantics on non-TSO hardware. What the author argues is that enabling TSO process-wide (like macOS does with Rosetta) resolves this impact, but that it carries counteracting overhead in non-emulated code (such as the emulator itself or ARM64EC code).

The claimed conclusion is that it's better to optimize TSO emulation itself rather than brute-force it at the hardware level. The way Microsoft achieved this is by having their compiler generate metadata about code that requires TSO, and by using ARM64EC, which forwards API calls that would go to x86 system libraries to native ARM64 builds of the same libraries. Note how the latter in particular shifts the balance in favor of software-based TSO emulation, since a hardware-based feature would slow down the native system libraries.

Without ecosystem control, this isn't feasible to implement in other x86 emulators. We have a library forwarding feature in FEX, but adding libraries is much more involved (and hence currently limited to OpenGL and Vulkan). We're also working on detecting code that needs TSO using heuristics, but even that will only ever get us so far. FEX is mainly used for gaming though, where we have a ton of x86 code that may require TSO (e.g. mono/Unity) but wouldn't be handled by ARM64EC, so the balance may be in favor of hardware TSO either way here.
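
To make concrete what "code that needs TSO" looks like, here's a small C++ sketch (illustrative only, not taken from FEX or Prism):

    #include <atomic>

    int g_payload;
    std::atomic<int> g_ready{0};

    // Producer thread: in the x86 machine code this typically compiles to, the
    // payload store becomes visible before the flag store because x86 is TSO,
    // even though no barrier instruction was emitted.
    void produce() {
        g_payload = 42;
        g_ready.store(1, std::memory_order_relaxed);
    }

    // Consumer thread: existing x86 binaries rely on reading 42 here. A
    // translator running on weakly ordered ARM has to insert barriers at these
    // accesses (or rely on hardware TSO) to preserve that guarantee.
    int consume() {
        while (!g_ready.load(std::memory_order_relaxed)) { /* spin */ }
        return g_payload;
    }

Portable source should of course use release/acquire ordering instead; the point is that the emulator has to honor whatever the existing x86 binary assumes, whether or not the original source was careful.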

For reference, this is the paragraph (I think) you were referring to:

> Another common misconception about Rosetta is that it is fast because the hardware enforces Intel memory ordering, something called Total Store Ordering. I will make the argument that TSO is the last thing you want, since I know from experience the emulator has to access its own private memory and none of those memory accesses needs to be ordered. In my opinion, TSO is a red herring that isn't really improving performance, but it sounds nice on paper.


How is it a misinterpretation? To re-quote that last sentence:

> In my opinion, TSO is a red herring that isn't really improving performance, but it sounds nice on paper.

That's the author directly saying that TSO isn't the major emulation performance gain that people think it is. You're correct that there are counteracting effects between TSO's benefits to the emulated code and its negative effects on the emulator and other non-emulated code in the same process that is fine running non-TSO, but to users this distinction doesn't matter. All that matters is the performance of the emulated program as a whole.

As for the volatile metadata, you're correct that MSVC inserts additional data to aid the emulation. What's not so great is that:

- It was an almost undocumented, silent addition to MSVC.

- In some cases, it will slow down the generated x64 code slightly by adding NOPs where necessary to disambiguate the volatile access metadata.

- It only affects code statically compiled with a recent version of MSVC (late VS2019 or later). It doesn't help executables compiled with non-MSVC compilers like Clang, nor any JIT code, nor is there any documentation indicating how to support either of these cases.


> How is it a misinterpretation? To re-quote that last sentence:

I think we agree in our understanding, but condensing it down to "TSO isn't as big a deal as commonly claimed" is misleading:

* Efficient TSO emulation is crucial (both on Windows and elsewhere)

* The blog claims hardware TSO is non-ideal on Windows only (because Microsoft adapted the ecosystem to facilitate software-based TSO emulation). (Even then, it's unclear if the author quantified the concrete impact)

* Hardware TSO is still of tremendous value on systems that don't have ecosystem support

> [volatile metadata] doesn't help executables compiled with non-MSVC compilers like Clang, nor any JIT code, nor is there any documentation indicating how to support either of these cases.

That's funny; I hadn't considered third-party compilers. Those applications would still benefit from ARM64EC (i.e. native system libraries), but the actual application code would be affected quite badly by the TSO impact then, depending on how good their fallback heuristics are. (Same for older titles that were compiled before volatile metadata was added)


Following up on that last part: I recompiled my x64 codebase with /volatileMetadata-, which reduced the volatile metadata by ~20K (the remainder most likely from the statically linked CRT). The profiling differences were negligible, within noise level between the builds, with both about 15-30% below the native ARM64 build.

The interesting part is when the compatibility settings for the executables are modified to change the default multi-core setting from Fast to Strict Multi-Core Operation. In that mode, the build without volatile metadata runs about 20% slower than the default build. That indicates that the x64 emulator may be taking some liberties with memory ordering by default. Note that while this application is multithreaded, the worker threads do little and it is very heavily single-thread bottlenecked.


20% is about the general order of magnitude we observed in FEX a while ago, though as you enable all TSO compatibility settings (including those rarely needed) it'll be even higher. As people elsewhere in the thread mentioned, it'd be interesting to see how FEX fares on Asahi with hardware TSO enabled vs. disabled (but with conservative TSO emulation as set up by default), since it's less of a black box.


> Efficient TSO emulation is crucial (both on Windows and elsewhere)

Yes, but this is not in contention...? No one is disputing that TSO semantics in the emulated x86 code need to be preserved and that it needs to be done fast; we're talking about the tradeoffs of also having TSO support on the host platform.

> The blog claims hardware TSO is non-ideal on Windows only (because Microsoft adapted the ecosystem to facilitate software-based TSO emulation). (Even then, it's unclear if the author quantified the concrete impact)

> Hardware TSO is still of tremendous value on systems that don't have ecosystem support

That isn't what the author said. From the article:

> Another common misconception about Rosetta is that it is fast because the hardware enforces Intel memory ordering, something called Total Store Ordering. I will make the argument that TSO is the last thing you want, since I know from experience the emulator has to access its own private memory and none of those memory accesses needs to be ordered. In my opinion, TSO is a red herring that isn't really improving performance, but it sounds nice on paper.

That is a direct statement on Rosetta/macOS and does not mention Prism/Windows. How correct that assessment may be is another matter, but it is not talking about Windows only.

> Those applications would still benefit from ARM64EC (i.e. native system libraries), but the actual application code would be affected quite badly by the TSO impact then, depending on how good their fallback heuristics are.

I will have to check this; I don't think it's that bad. JITted programs run much, much better on my Snapdragon X device than on the older Snapdragon 835, but there are a lot of variables there (CPU much faster/wider, Windows 11 Prism vs. Windows 10 emulator, x86 vs. x64 emulation). I have a program with native x64/ARM64 builds that runs about 25% slower in emulated x64 than as native ARM64; I'm curious myself to see how it runs with volatile metadata disabled.


This is more like what I’d expect! This is a great article too, thank you, this is the kind of thing I come to HN for :)


This is due to an overall odd strategy by the DirectX graphics team, which is to implement many of the optimization and enhancement features by Detour-ing API calls in the OS. Essentially, the Windows OS is patching itself to implement features.

Unfortunately, this is being done without core OS assistance like the AppCompat system, so it comes with similar problems to unassisted regular user-space patching. In this case, the Detours code used by DXGI is unable to support the PAC-enabled function prologues in the current version of Windows 11 ARM64. It isn't limited to just the OP's scenario; attempting to enable Auto Super Resolution (AutoSR) on any native ARM64 program using DirectX will also currently crash in a similar manner in EnumDisplaySettings().

The full screen optimization that is mentioned also has some history. It's well intentioned, removing an entire full-screen copy per frame to increase performance and efficiency, but it had some problems. When it was originally implemented in Windows 10, it broke full screen mode for some DirectX 9 apps because it made incorrect assumptions about the window handle supplied by the application for focus tracking. It was frustrating to deal with, because the mechanism was nearly undocumented and had no opt-out besides a manual user compatibility checkbox. It took me a couple of days of tearing apart the core windowing guts of my program to figure out what Microsoft had done and how to work around it, and it took several months for the Windows team to fix it on their end.


That strategy is described here, and comes from game developers wanting DirectX to be untied from OS versions, as it used to be.

https://devblogs.microsoft.com/directx/gettingstarted-dx12ag...


The Agility SDK is unrelated to what is being discussed here. It is a delivery mechanism for the Direct3D 12 runtime instead of DXGI, it is opt-in from the application side, and is done through a more stable loader stage in the core OS instead of hot-patching.

It does, however, give insight into the situation that the DirectX team is in. The in-box version of D3D12Core in the latest version of Windows 11 is SDK version 612. This can be compared against the released Agility SDK versions:

https://devblogs.microsoft.com/directx/gettingstarted-dx12ag...

SDK version 612 is just before the 613 Agility SDK release on 3/11/2024. This means that the version of DirectX 12 they are able to ship in the main OS is a year and a half behind the latest released version.
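
For context, the application-side opt-in is just a pair of exports that the D3D12 loader looks for; roughly like this (613 is used here only to match the release mentioned above, and the subdirectory name is whatever you ship the redistributable in):

    // Exported from the EXE. If the requested version is newer than the
    // in-box D3D12Core, the loader picks up D3D12Core.dll from the given
    // path relative to the executable instead.
    extern "C" {
        __declspec(dllexport) extern const unsigned int D3D12SDKVersion = 613;
        __declspec(dllexport) extern const char* D3D12SDKPath = ".\\D3D12\\";
    }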


I thought that it also included updated DXGI components.


No, the Agility SDK only includes updated D3D12 core and debug layer components.

https://microsoft.github.io/DirectX-Specs/d3d/D3D12Redistrib...


I see, thanks.

For my DirectX hobby coding, what is in the box, even if a bit old, is good enough, so I had my info about the Agility SDK wrong.


> In this case, the Detours code used by DXGI is unable to support the PAC-enabled function prologues in the current version of Windows 11 ARM64

Surely Microsoft could avoid patching this on ARM in the first place though, right? As in, whatever gating they use should make sure it's not applied on ARM.


The features in question are platform agnostic. They could temporarily disable the detouring for native ARM64 apps since the mechanism is broken on that architecture, but that's not ideal since the Snapdragon X platform is one of the main targets for the Auto Super Resolution feature. Upgrading sequential/blit-mode swap chains to flip model should theoretically be a legacy concern, but as the OP shows, it isn't, since new graphics code is still being shipped without flip model support.

Note that the detouring problem is only an issue for native ARM64 programs. x64 programs running in emulation on Windows ARM64 work fine.
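
On the flip model point, for anyone wondering what "flip model support" looks like in code, here's a rough sketch of the DXGI side (error handling omitted; the factory, device, and window are assumed to already exist):

    #include <d3d11.h>
    #include <dxgi1_4.h>

    IDXGISwapChain1* CreateFlipSwapChain(IDXGIFactory2* pFactory,
                                         ID3D11Device* pDevice, HWND hwnd) {
        DXGI_SWAP_CHAIN_DESC1 desc = {};
        desc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
        desc.SampleDesc.Count = 1;                         // no MSAA on flip model
        desc.BufferUsage = DXGI_USAGE_RENDER_TARGET_OUTPUT;
        desc.BufferCount = 2;                              // flip model needs >= 2
        desc.SwapEffect = DXGI_SWAP_EFFECT_FLIP_DISCARD;   // flip model, no blit copy
        IDXGISwapChain1* pSwapChain = nullptr;
        pFactory->CreateSwapChainForHwnd(pDevice, hwnd, &desc,
                                         nullptr, nullptr, &pSwapChain);
        return pSwapChain;
    }

Legacy code that creates a DXGI_SWAP_EFFECT_DISCARD or SEQUENTIAL swap chain instead is what the OS tries to transparently upgrade.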


Many Deflate encoders, including zlib/gzip, are greedy encoders that work forwards, either for speed or to support streaming compression. They encode runs as they are found while scanning forward, with limited lookahead to allow longer matches to preempt immediately preceding shorter matches. There is an "optimal parse" strategy that maximizes the runs found by processing the entire file backwards.

If you repack the plaintext using zopfli, it does encode the text as you suggest.
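
As a rough illustration of the greedy/lazy strategy (a simplification for clarity, not zlib's actual code, which uses hash chains and per-level lazy/nice-length tuning):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Match { size_t length; size_t distance; };

    // Stand-in for zlib's longest_match(): brute-force search of the 32KB
    // window behind 'pos' for the longest run matching data[pos..].
    static Match FindLongestMatch(const std::vector<uint8_t>& d, size_t pos) {
        Match best{0, 0};
        for (size_t start = (pos > 32768) ? pos - 32768 : 0; start < pos; ++start) {
            size_t len = 0;
            while (pos + len < d.size() && d[start + len] == d[pos + len] && len < 258)
                ++len;
            if (len > best.length) best = {len, pos - start};
        }
        return best;
    }

    static void EmitLiteral(uint8_t b)             { std::printf("literal 0x%02x\n", b); }
    static void EmitMatch(size_t len, size_t dist) { std::printf("copy len=%zu dist=%zu\n", len, dist); }

    // Greedy parse with one-byte lazy lookahead: emit a literal and defer if
    // the match starting at the next byte would be longer.
    static void GreedyParse(const std::vector<uint8_t>& data) {
        size_t pos = 0;
        while (pos < data.size()) {
            Match cur = FindLongestMatch(data, pos);
            if (cur.length < 3) {                   // Deflate's minimum match length
                EmitLiteral(data[pos++]);
                continue;
            }
            Match next = (pos + 1 < data.size()) ? FindLongestMatch(data, pos + 1)
                                                 : Match{0, 0};
            if (next.length > cur.length) {
                EmitLiteral(data[pos++]);           // lazy match: defer to the longer one
            } else {
                EmitMatch(cur.length, cur.distance);
                pos += cur.length;
            }
        }
    }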


Being greedy is fine here, though. It's the exact same match but somehow cut one byte short.

Also I just tested something. If you stick a random extra letter onto the start of the string, the mistake goes away and the output shrinks by a byte. Is it possibly an issue with finding matches that start at byte 0...?

More testing: Removing the Ts that fail to match to get TOBEORNOTOBEOROBEORNOT shrinks the output by 2 bytes. Changing the first character to ZOBEORNOTOBEOROBEORNOT stays at the same shrunken size. Then removing that Z to get OBEORNOTOBEOROBEORNOT makes it balloon back up as now it fails to match the O twice.

If I take the original string and start prepending unique letters, the first one shrinks and fixes the matching, and then each subsequent letter adds one byte. So it's not that matching needs some particular alignment, it's only failing to match the very first byte.

A new test: XPPPPPPPP appears to encode as X, P, then a copy command. And PPPPPPPPP encodes as P, P, then a copy. Super wrong.


Interesting, I can also reproduce this. I wonder if it's an artifact of zlib's sliding window setup. The odd part is that if I try various libs with advzip, both the libdeflate and 7z modes show the same artifact; only the zopfli mode is able to avoid it. Doesn't seem to be a format violation, as gzip -t doesn't complain about a copy at position 0.


Microsoft has been inconsistent, both in their recommendations and in what they've done in their own programs. Probably the most inexplicable example is that in the Windows Vista timeframe they recommended that games place their save games into a folder under My Documents, despite the Saved Games folder also having been introduced in Vista.
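
For reference, the Vista-era way to resolve that folder is the known-folder API; a minimal sketch (error handling trimmed, link against shell32.lib):

    #include <windows.h>
    #include <shlobj.h>        // SHGetKnownFolderPath
    #include <knownfolders.h>  // FOLDERID_SavedGames
    #include <string>

    // Returns the per-user Saved Games folder introduced in Vista,
    // e.g. C:\Users\<name>\Saved Games.
    std::wstring GetSavedGamesFolder() {
        PWSTR path = nullptr;
        std::wstring result;
        if (SUCCEEDED(SHGetKnownFolderPath(FOLDERID_SavedGames, 0, nullptr, &path)))
            result = path;
        CoTaskMemFree(path);   // the docs say to free this even on failure
        return result;
    }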


Yes, but so does ARM. ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], #64 loads 4 x 128-bit vector registers and post-increments a pointer register.


> Detection of available extensions: we usually have to rely on OS to query available extensions since the `misa` register is accessible only in machine mode.

Not a RISC-V programmer, but this drives me crazy on ARM. Dozens of optional features, but the FEAT_ bits are all readable only from EL1, and it's unspecified what API the OS exposes to query them and which feature bits are exposed. I don't care if it'd be slow; just give us the equivalent of a dedicated CPUID instruction, even if it's just a reserved opcode that traps to kernel mode and is handled in software.


> but the FEAT_ bits are all readable only from EL1, [...] I don't care if it'd be slow; just give us the equivalent of a dedicated CPUID instruction, even if it's just a reserved opcode that traps to kernel mode and is handled in software.

I like the way the Linux kernel solves this: these FEAT_ bits are also readable from EL0, since trying to read them traps to kernel mode and the read is emulated by the kernel. See https://docs.kernel.org/arch/arm64/cpu-feature-registers.htm... for details. Unfortunately, it's a Linux-only feature, and didn't exist originally (so old enough Linux kernel versions won't have the emulation).
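
Concretely, with that in place userspace can just issue the MRS itself and let the kernel emulate it; a minimal sketch for AArch64 Linux with GCC/Clang inline asm (this will fault on kernels or OSes without the emulation):

    #include <cstdint>
    #include <cstdio>

    int main() {
        // Reading ID_AA64ISAR0_EL1 from EL0 traps, and the kernel returns a
        // sanitized view of the register (see the linked kernel doc).
        uint64_t isar0;
        asm volatile("mrs %0, ID_AA64ISAR0_EL1" : "=r"(isar0));
        // e.g. bits [7:4] describe AES support, bits [15:12] SHA2.
        std::printf("ID_AA64ISAR0_EL1 = %016llx\n",
                    static_cast<unsigned long long>(isar0));
        return 0;
    }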


Indexed color formats stopped being supported on GPUs past roughly the GeForce 3, partly due to the CLUTs being a bottleneck. This discourages their use because indexed textures have to be expanded on load to 16bpp or 32bpp vs. much more compact ~4 or 8bpp block compressed formats that can be directly consumed by the GPU. Shader-based palette expansion is unfavorable because it is incompatible with hardware bilinear/anisotropic filtering and sRGB to linear conversion.
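
To illustrate the expansion cost mentioned above, roughly what a loader ends up doing with an 8-bit indexed texture (RGBA8 output assumed):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Expand an 8bpp indexed image to 32bpp RGBA using a 256-entry CLUT.
    // The result is 4x the size of the source indices, versus BC1/BC7 data
    // at ~4-8bpp that the GPU could have consumed directly.
    std::vector<uint32_t> ExpandIndexedToRGBA8(const std::vector<uint8_t>& indices,
                                               const uint32_t palette[256]) {
        std::vector<uint32_t> out(indices.size());
        for (size_t i = 0; i < indices.size(); ++i)
            out[i] = palette[indices[i]];
        return out;
    }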


Tbf, you wouldn't want any linear filtering for pixel art textures anyway, and you can always implement some sort of custom filter in the pixel shader (at a cost of course, but still much cheaper than a photorealistic rendering pipeline).

Definitely might make sense IMHO since block compression artefacts usually prohibit using BCx for pixel art textures.

