The physical register file is much larger than the logical register count. A budget option could simply reduce the amount of renaming done to save space.
In Intel's case, Cannon Lake did have AVX-512, but was blocked from being a mainstream part due to 10nm yields. And then their rushed efficiency core strategy effectively disabled AVX-512 just as they were getting back on track.
I don't think there's an intrinsic reason you couldn't have efficiency cores run AVX-512, albeit slowly, and I expect we'll see just that.
> And then their rushed efficiency core strategy effectively disabled AVX-512 just as they were getting back on track.
I partly blame Linux for that. I remember asking at Kernel Recipes about supporting truly heterogeneous multi-processor systems and got shrugged off with "don't buy broken hardware".
Back then, it was for a Broadcom home gateway product with an asymmetrical dual core: one core with an FPU, the other without. Since then we have seen many examples of such asymmetry: most HMP smartphones have asymmetrical instruction sets. Mono (and probably all JIT VMs) hit issues with cache line length varying between cores, so the perfect abstraction is already gone. And now we have Intel E vs P.
This is a rather hard problem, I won't pretend otherwise, but the dead silicon and lost power efficiency add up significantly.
At least on x86 the CPUID instruction is part of the problem. Userspace can do feature-detection and then rely on that. But if it's inconsistent between cores then thread migration would cause illegal instruction faults.
If the kernel tried to fix that by moving such faulting threads to P-cores, then a memcpy routine containing AVX512 instructions would cause all threads to be moved off the E-cores.
So first Intel would have to introduce new CPUID semantics: one flag indicating that e.g. AVX512 is not supported by default, and a separate flag indicating that it is supported on this specific core. Userspace would then have to pin the thread if it wants to use those instructions, or stick to the default set if it wants to stay migratable.
I don't really know how CPUID works, but I'm guessing it can be trapped by Linux. So I think that a first "stupid" implementation would be for Linux to report in CPUID the intersection of the features of the CPUs on which the process is allowed to run. So if you want to run AVX512, you first need to pin that process to an AVX512 CPU. You would be able to find an AVX512 CPU by checking in /proc/cpuinfo. (Even this "simple" variant is far from simple, because the cpuset can be changed dynamically in various ways; Android, for example, moves a process from foreground CPUs to background CPUs using cgroups.)
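To make the "intersection" idea concrete, here is a rough userspace sketch in Python. It assumes the x86 layout of /proc/cpuinfo (a `processor` line followed by a `flags` line per CPU) and that those flags are a fair stand-in for what CPUID would report per core, which is not guaranteed on real kernels. The function names `parse_cpu_flags`, `common_flags` and `cpus_with` are made up for this example.

```python
def parse_cpu_flags(cpuinfo_text):
    """Map each CPU number to its set of feature flags (x86 /proc/cpuinfo layout assumed)."""
    flags_by_cpu = {}
    cpu = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between CPU entries
        key = key.strip()
        if key == "processor":
            cpu = int(value)
        elif key == "flags" and cpu is not None:
            flags_by_cpu[cpu] = set(value.split())
    return flags_by_cpu


def common_flags(flags_by_cpu, allowed_cpus):
    """Features present on every allowed CPU, i.e. what a migratable process could rely on."""
    sets = [flags_by_cpu[c] for c in allowed_cpus if c in flags_by_cpu]
    return set.intersection(*sets) if sets else set()


def cpus_with(flags_by_cpu, feature):
    """CPUs that advertise `feature`: the candidates to pin to."""
    return {c for c, flags in flags_by_cpu.items() if feature in flags}
```

On Linux you would feed it the contents of /proc/cpuinfo and the result of `os.sched_getaffinity(0)`, then check whether `"avx512f"` is in `common_flags(...)`. It is only as good as the moment you ran it: if the affinity mask changes afterwards, the answer is stale.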
Not sure if you can trap on CPUID, but the kernel does have control over which CPUID bits are exposed to your application.
So requiring pinning to see all the bits could work, but then the issue is what happens if the affinity is changed. A static list of required capabilities in some ELF header would probably be better.
> A static list of required capabilities in some ELF header would probably be better.
I think I agree; the thing is that it's kind of a security issue. I suggested pinning because it requires CAP_SYS_NICE, which is a feature: if you allow apps to freely declare their usage, they will end up being scheduled unfairly, because the system will stick them to P cores.
That being said, you could indeed have an ELF header mentioning this, and then ignore it if the caller doesn't have CAP_SYS_NICE. I do feel using an ELF header for that is weird, but my knowledge of ELF is way too little to judge.
Another thing that could work is using file-system attributes or mode (like setuid), but I think FS support of attributes is at best spotty, and I doubt modes can be extended.
Maybe I'm dumb, and for sure I'm not an expert on this subject, but wouldn't we need an executable containing both an AVX512 code path and an alternative plain one, plus a way to switch code paths according to the core the code is running on? The same memory page would run on a P core or on an E core. Inefficient because of the extra checks?
Or maybe a new system call to allow a thread to temporarily enter a “performance mode” where it can only be scheduled on the powerful cores. Pinning sounds a bit too strict.
You can already pin to a set of cores instead of a single one. But anyway, my point is that currently userspace interacts directly with CPU features without intermediation from the kernel. So Intel would have to think about how to coordinate with userspace too, not just rely on the kernel to patch things up (or not).
Since Android is Linux, won't manufacturers of such smartphones contribute solutions?
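For what it's worth, pinning to a set of cores and then undoing it is only a few lines from userspace today. A minimal sketch in Python, Linux only, using the standard `os.sched_getaffinity` / `os.sched_setaffinity` calls; the `pinned` helper is a made-up name, and which CPU numbers count as "P cores" is something the caller has to find out some other way.

```python
import os
from contextlib import contextmanager


@contextmanager
def pinned(cpus):
    """Temporarily restrict the caller to `cpus`, restoring the old mask on exit."""
    previous = os.sched_getaffinity(0)  # 0 means "the caller"
    os.sched_setaffinity(0, cpus)
    try:
        yield
    finally:
        # Restore even if the wide-vector code raised.
        os.sched_setaffinity(0, previous)
```

Used as `with pinned({0, 1, 2, 3}): ...` around the code that needs the wider instruction set. Nothing here stops another actor (a cgroup change, say) from altering the mask in the middle, which is exactly the coordination gap described above.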
Big companies like Samsung should have more than enough resources and interest in doing so. Unlike the guy who answered you at Kernel Recipes, I guess.
I don’t believe Arm has this problem. They are careful to ensure the same instruction set is available on all cores in the chip. This is a botched launch from Intel. Software is not the solution.