I agree. This article is clearly written with a generative language model. A few other telltale signs:
1. Repeating the same thing multiple times with slight variation:
* "allowing developers to fine-tune their applications and unlock the full potential of their underlying hardware, ultimately maximising vLLM performance." (fine-tune, unlock potential, maximize performance are all roughly the same thing)
* "AI and machine learning models" (AI and machine learning models are the same thing in the context of this article)
* "utilise multiple threads or cores" (Why differentiate between threads and cores?)
* "tailored to enhance computational efficiency and overall throughput" (efficiency and throughput are highly related)
* "a series of graphs and data visualisations" (all the data visualizations in this article are graphs)
* "more computational effort and time" (same thing)
* "significantly enhanced the performance and efficiency" (same thing)
* "ensuring efficient processing and superior performance for complex and demanding AI workloads" (same things)
2. Explaining what "rocBLAS" stands for multiple times.
3. Other ChatGPTisms:
* "offering a comprehensive view of [...]"
* "Let’s delve into the notable advancements achieved through [...]"
* "ensures quicker processing times, which is crucial for [...]"
* "effectively mitigated these impacts, maintaining [...]"
* "elucidate the impact of"
* "significantly enhanced"
* "These results underscore the critical role of [...]"
So why is this bad? Because it undermines the trust in the article. We do not know whether the claims are actually true or whether they were just made up by ChatGPT.
What if that person is not a native English speaker and wrote something up and then threw it into ChatGPT (or a local chatbot running on 1 MI300x :p) just because he felt his relatively limited vocabulary would not be enough to express everything?
That person (yeah :p) might just be trying to create as much awareness as possible.
You might get annoyed by the usage of LLMs; some might not.
I get annoyed by people still trying to undermine the testing done while everything is clearly extremely transparent, even the docker image is shared.
That said, the article is about the results; if you'd like to "delve" a bit deeper into those results, let me know, I'd be happy to go over some of the data visualisations ;-)
> Loras are just as powerful as a finetuned model and you can train one in minutes even on consumer hardware.
Do you have some more details on training a LoRA in minutes? Last I tried, it took several hours on an RTX 3090, but I am sure there have been improvements since then.
In my experience, the canvas API is very slow and not well thought out. For example, to create a native image object from raw pixels, you have to copy the pixels into an ImageData object, draw it to a canvas, create a data URL from the canvas and then load an image from that data URL.
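Roughly, that roundtrip looks like this (a sketch; the pixels are assumed to be a Uint8ClampedArray of RGBA bytes, and imageFromPixels is just an illustrative name):

function imageFromPixels(pixels, width, height, onLoad) {
    // wrap the raw RGBA bytes in an ImageData object
    var imageData = new ImageData(pixels, width, height);
    // draw it onto a scratch canvas
    var canvas = document.createElement("canvas");
    canvas.width = width;
    canvas.height = height;
    canvas.getContext("2d").putImageData(imageData, 0, 0);
    // encode the canvas to a data URL, then decode that back into an Image
    var img = new Image();
    img.onload = function() { onLoad(img); };
    img.src = canvas.toDataURL();
}

Note the pointless encode/decode roundtrip through a PNG data URL just to get from raw pixels to an image object.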
You are correct. The code is using an inefficient cache access pattern, so most of the time is spent waiting.
You probably won't get 100x faster without SIMD, but 10x is certainly doable. Unfortunately, SIMD.js support was removed from Chrome and Firefox a while ago, and SIMD is not available in wasm to this day.
How would SIMD do anything to address the problem's fundamental anti-cache-friendly access patterns? You'd need to restructure the problem to be cache-friendly, but SIMD won't really be relevant to that.
My guess is that the JS implementation of the worst-performing browser is having trouble with the non-1 for-loop steps. Doing 90-degree image rotation with fixed steps and some index calculations should work better (0.18 sec vs 1.5 sec for their implementation in node.js):
for (var y = 0; y < height; y++)
    for (var x = 0; x < width; x++)
        b[x + y*width] = a[y + (width - 1 - x)*height];
Although that's still far from the theoretical maximum throughput because the cache utilization is really bad. If you apply loop tiling, it should be even faster. This problem is closely related to matrix transpose, so there is a great deal of research you can build upon.
EDIT: 0.07 seconds with loop tiling:
for (var y0 = 0; y0 < height; y0 += 64) {
    for (var x0 = 0; x0 < width; x0 += 64) {
        for (var y = y0; y < y0 + 64; y++) {
            for (var x = x0; x < x0 + 64; x++) {
                b[x + y*width] = a[y + (width - 1 - x)*height];
            }
        }
    }
}
Your 0.18 sec result is (to use the units they used in the article) 180ms, and if I understand correctly their best WebAssembly compiled-and-executed result (?) is 300ms. Beautiful.
EDIT: But it could also be that your computer is somewhat faster than theirs? Do you happen to have a very fast CPU? Can you say which? When I run C-like C++ versions of your code I get the speeds you get with node.js. Still, your results are overall much better than theirs; great work!
#include <stdio.h>
int main(int argc, char* argv[]) {
enum { height = 4096, width = 4096 };
unsigned* a = new unsigned[ height*width ];
unsigned* b = new unsigned[ height*width ];
if ( argc < 2 ) { // call with no params
// to measure overhead when just allocations
// and no calculations are done
printf( "%d %d\n", (int)a, (int)b );
return 1;
}
if ( argv[1][0] == '1' ) // call with 1 the fastest
for (unsigned y0 = 0; y0 < height; y0 += 64)
for (unsigned x0 = 0; x0 < width; x0 += 64)
for (unsigned y = y0; y < y0 + 64; y++)
for (unsigned x = x0; x < x0 + 64; x++)
b[x + y*width] = a[y + (width - 1 - x)*height];
else
for (unsigned y = 0; y < height; y++)
for (unsigned x = 0; x < width; x++)
b[x + y*width] = a[y + (width - 1 - x)*height];
return 0;
}
Or maybe not: my short experiments with the simplified version based on their algorithm and the JavaScript versions above gave some conflicting results. I haven't thoroughly verified them; this note is just to motivate others to try.
I get 60ms in C. But in your code, the compiler might decide to remove most of the code since b is not used after being calculated. I checked the assembly code and it does not seem to be the case here, but it's still something to be aware of.
OK, I get ca. 80ms for my run with parameter 1 on my main computer, and 200ms on an N3150 Celeron.
> b is not used after being calculated
So far, I've never seen a C compiler optimize away the call to the allocator and the accesses to the arrays allocated that way. Maybe it's different now? Hm, dead code elimination... I guess randomly initializing a few values before the loop, then reading and printing a few values after it, should always be safe... Now that I think about it, so would filling the arrays with zeroes beforehand.
These code motion/strength reduction optimizations are standard even in mildly optimizing compilers. I would be very surprised if an optimizing JavaScript compiler did not perform them automatically.
I tried a few micro-optimizations, but they did not make a measurable difference, so I kept the code short instead. But maybe some JIT is particularly bad at loop hoisting, so it might make a difference there.
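For concreteness, the kind of hoisting/strength reduction meant here, applied to the rotation loop above, might look like this (just a sketch, not necessarily what was tried):

for (var y = 0; y < height; y++) {
    var row = y * width;                // hoisted loop invariant
    var src = y + (width - 1) * height; // source index for x = 0
    for (var x = 0; x < width; x++, src -= height)
        b[row + x] = a[src];            // same as a[y + (width - 1 - x)*height]
}

A good JIT should do this transformation itself, which would explain why it made no measurable difference.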
Huh interesting! I always disliked butchering code to do processor cache optimizations and I kinda worked under the impression that a browser’s JS and wasm compilers would do these optimizations for me.
I’ll definitely give tiling a spin (although at this point we are definitely fast enough™️)
Can someone please explain why loop tiling increases performance in JS so dramatically? Is it mainly due to the fact that inner loops have constant size (64) and get called more frequently, and thus get promoted faster into deeper stages of JS runtime optimization?
My guess is that if you invoke the initial whole code (before tiling) in an external loop (rotating images of exactly the same size), you will get a similar perf boost (not that it has practical implications, but just to understand how the optimization works).
No, it's faster because the working set of 64 * 64 * 4 * 2 bytes (32 KiB) can (almost) fit in the CPU core's L1 cache. Further cache levels are slower, and finally main memory is glacially slow.
A WASM example would speed up as well using the same approach. Or C, Rust or whatever.
Doesn't this rely on the CPU prefetching the memory into cache? Do current CPUs from Intel and AMD detect access patterns like this successfully? I.e. where you're accessing 64-element slices from a bigger array with a specific stride.
The idea is that the Y dimension is going to have a limited number (here 64) of hot cache lines while a tile is processed.
After going through one set of 64 vertical lines, the Y accesses are going to be near the Y accesses from the previous outer-tile-loop iteration.
(Stride detecting prefetch can help, especially on the first iteration of a tile, but is not required for a speedup).
BTW this is the motivation for GPUs (and sometimes other graphics applications) using "swizzled" texture/image formats, where pixels are organised into various kinds of screen-locality preserving clumps. https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-...
> As I understand, the main goal was to achieve easily readable and maintainable code, even to the detriment of performance.
Seems like a tricky goal for image algorithms in general where you're performing the same action over and over on millions of pixels. Obscure inner loop optimisations are pretty much required.
In these situations, I would sometimes keep the code for the naive but slow version around next to the highly optimised but difficult to understand version. You can compare their outputs to find bugs as well.
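A minimal sketch of that cross-check, assuming hypothetical rotateNaive and rotateTiled functions wrapping the two loop variants from upthread:

function checkRotation(width, height) {
    var n = width * height;
    var a = new Uint32Array(n);
    for (var i = 0; i < n; i++) a[i] = (Math.random() * 0x100000000) >>> 0;
    var expected = new Uint32Array(n);
    var actual = new Uint32Array(n);
    rotateNaive(a, expected, width, height); // slow but obviously correct
    rotateTiled(a, actual, width, height);   // fast but easy to get wrong
    for (var j = 0; j < n; j++)
        if (expected[j] !== actual[j])
            throw new Error("mismatch at index " + j);
}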
> My guess is that the JS implementation of the worst-performing browser is having trouble with the non-1 for-loop steps.
Why would non-1 for-loop steps be slower in some browsers? Does the compiler add some sort of prefetch instruction in the faster browsers based on the loop increment?
All of these top comments remind me of the general adage that encryption and security are really hard, and if you think you know what you're doing you probably don't.
I don't want to be too harsh for a ShowHN, but even if the authors fixed the several bugs that have already been reported here, it's clear they don't have a foundational enough understanding of cryptography and security to be writing a password manager. I would suggest they spend more time understanding the basics first.
Ah, the good old OpenGL fixed function pipeline, deprecated for over 11 years now.
For something a bit more modern, I'd recommend [0], but one might argue that old OpenGL is easier to learn since you don't have to set up your own shaders.
To clarify, the point of Scotty3D is not to teach OpenGL -- the students do not write any OpenGL, DirectX, Vulkan, etc in the class. The OpenGL that is there is simply used to render the 3D models and the UI, so updating that code is pretty low priority.
One of my longer-term goals as a TA for the class is to update Scotty3D to Vulkan or modern OpenGL.
EDIT: I wanted to expand on this point, as this is actually an important part of the philosophy of the design of the course. As the OP argues as well, it is more important to learn the fundamentals of CG theory (e.g. rasterization, the rendering equation, solving ODEs/PDEs) than the specifics of any particular implementation (OGL, DX, etc). After taking 462, many students (including myself!) take the class 15-466 Computer Game Programming [0], which goes deep into more modern OpenGL implementations (admittedly, it's OGL 3.3, but it still covers shaders/VBOs/other important concepts that translate to modern APIs).
I wanted to support what you're saying. I've worked in visual effects for over a decade now. While I've dabbled with OpenGL, I think the only practical application I've had was taking a stab at writing a PyOpenGL widget for viewing Alembic models in a custom PySide asset browser for a studio. This was when Alembic was much less mature, and it never ended up getting used. However, I have done a lot of dealing with color spaces, debugging/optimizing scanline and ray casting renderers, computing/storing surface normals, projection and other space transforms, and simulations. I've written toy versions of a lot of those things, but mostly I was debugging black box systems written by others or troubleshooting assets generated or consumed by one of these tools.
Even if you're talking about game engines, there's still a whole lot more to learn. This game engine book [1] has one chapter about the rendering engine and 16 more about other topics. Each chapter in there is at least one hefty book to get a good working knowledge of the topic.
It's great you have a resource where people can learn and experiment with these other things without having to learn to write all the code around it.
> Even if you're talking about game engines, there's still a whole lot more to learn. This game engine book [1] has one chapter about the rendering engine and 16 more about other topics. Each chapter in there is at least one hefty book to get a good working knowledge of the topic.
Which is why, when people without game industry experience start discussing 3D API adoption, they lose sight of how little the APIs actually influence the whole engine codebase.
There's something charming and engaging about the "legacy" fixed function pipeline that we've lost with our increasing focus on lower and lower level APIs. The ability to have a 10 line hello-world program that draws a colored triangle on the screen is magical, and encouraging to beginners, and that experience can't be replaced by the massive boilerplate and "copy-paste-this-stuff-dont-worry-about-what-it-does-yet" you need to do in order to do graphics the more modern way.
With Apple working to eliminate all traces of OpenGL with Metal, and Microsoft already having abandoned it close to two decades ago, I feel it's close to the end of the road for fixed function OpenGL. It was a wonderful part of graphics development history that, sadly, future beginners will likely not be able to experience.
One more instance of how the interests of vendors and the interests of developers are not aligned. Microsoft and Apple don't want you learning portable skills -- they want to limit your future prospects to developing only for their specific platform.
Noob here: how come no one is making a cross-platform API to abstract away this stuff? Whenever I read about OpenGL or Vulkan or Metal or whatever, with the tutorial going "learn this engine to bypass the complexities of bare API usage", it's my first thought.
Vulkan is supposed to be the cross-platform API, but Apple isn't supporting it (and doing their own thing with Metal, as per usual). From what I've heard, Vulkan was originally "OpenGL 5" so while OpenGL continues to exist Vulkan is effectively its successor. There is MoltenVK, which allows Vulkan applications to run on top of Metal.
SFML (Simple and Fast Multimedia Library) and SDL are cross-platform and hide most of the OpenGL boilerplate. I've used SFML, but just to make shaders and output some text.
I think you're correct that the old GL API made it quick to get stuff working, but the modern GL is much nicer once you get past the overhead of setting up all the VBOs etc. That only needs doing once, and then you're set.
I've recently been following along a Vulkan tutorial to get started with that, in the odd evening over the last few weeks. I'm six chapters in, and I've yet to even draw a single triangle. That's still about five chapters away. While I can appreciate that the flexibility of the setup to remove much of the implicit state contained within the GL state machine is good, I can't help but wish for a wrapper to just make it work for a typical scenario and let me render stuff with a minimum of fuss.
I'm unsure about where Metal will fit in the future. No matter how great it is, it's a vendor-specific proprietary API and I suspect that Vulkan will be the next cross-platform API which will wrap Metal or DX12 when a native Vulkan driver isn't available.
I hope it's not too off-topic, but I feel this sort of thing has happened before, with developing GUI programs in Visual Basic, or drawing to the screen with turtle graphics (or whatever routines BASIC tended to have). Curious if others share the sentiment, with other examples too.
Oh yeah, totally. Maybe I'm looking at it with rose-colored glasses because I was younger, but the way I remember it personal computing used to be about enabling users. At some point there was this huge attitude shift towards being condescending toward users and treating them like cattle. So now, instead of trying to bridge the gap between computer "user" and computer "programmer", we forcibly drive a giant wedge between them.
The underlying issue is that many novice users don't see the problem with being condescended to, even when this severely inconveniences the more "developer-like" power users we used to enable. That "wedge" is just what the Eternal September of personal computing looks like.
Interestingly, many browsers are still susceptible to this attack, for example when used in SVG files (WARNING: might crash your browser and/or operating system): https://jsfiddle.net/e3guLn08/
Browsers are susceptible to a server that generates an infinite HTML page (e.g. a CGI shell script calling "yes <arg>"), and also to a thing called JavaScript that can eat all your memory programmatically (and does exactly that on a regular basis).