I agree. This article is clearly written with a generative language model. A few other telltale signs:
1. Repeating the same thing multiple times with slight variation:
* "allowing developers to fine-tune their applications and unlock the full potential of their underlying hardware, ultimately maximising vLLM performance." (fine-tune, unlock potential, maximize performance are all roughly the same thing)
* "AI and machine learning models" (AI and machine learning models are the same thing in the context of this article)
* "utilise multiple threads or cores" (Why differentiate between threads and cores?)
* "tailored to enhance computational efficiency and overall throughput" (efficiency and throughput are highly related)
* "a series of graphs and data visualisations" (all the data visualizations in this article are graphs)
* "more computational effort and time" (same thing)
* "significantly enhanced the performance and efficiency" (same thing)
* "ensuring efficient processing and superior performance for complex and demanding AI workloads" (same things)
2. Explaining what "rocBLAS" stands for multiple times.
3. Other ChatGPTisms:
* "offering a comprehensive view of [...]"
* "Let’s delve into the notable advancements achieved through [...]"
* "ensures quicker processing times, which is crucial for [...]"
* "effectively mitigated these impacts, maintaining [...]"
* "elucidate the impact of"
* "significantly enhanced"
* "These results underscore the critical role of [...]"
So why is this bad? Because it undermines the trust in the article. We do not know whether the claims are actually true or whether they were just made up by ChatGPT.
What if that person is not a native English speaker and wrote something up and then threw it into ChatGPT (or a local chatbot running on 1 MI300x :p) just because he felt his relatively limited vocabulary would not be enough to express everything?
That person (yeah :p) might just be trying to create as much awareness as possible.
You might get annoyed by the usage of LLMs; some might not.
I get annoyed by people still trying to undermine the testing done while everything is clearly extremely transparent, even the docker image is shared.
That said, the article is about the results; if you'd like to "delve" a bit deeper into those results, let me know, I'd be happy to go over some of the data visualisations ;-)
> Loras are just as powerful as a finetuned model and you can train one in minutes even on consumer hardware.
Do you have some more details on training a LoRA in minutes? Last I tried, it took several hours on an RTX 3090, but I am sure there have been improvements since then.
In my experience, the canvas API is very slow and not well thought out. For example, to create a native image object from raw pixels, you have to copy the pixels into an ImageData object, draw it to a canvas, create a data URL from the canvas and then load an image from that data URL.
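Roughly, that roundtrip looks like this (a sketch; the pixels are assumed to be a Uint8ClampedArray of RGBA bytes, and imageFromPixels is just an illustrative name):

function imageFromPixels(pixels, width, height, onLoad) {
    // wrap the raw RGBA bytes in an ImageData object
    var imageData = new ImageData(pixels, width, height);
    // draw it onto a scratch canvas
    var canvas = document.createElement("canvas");
    canvas.width = width;
    canvas.height = height;
    canvas.getContext("2d").putImageData(imageData, 0, 0);
    // encode the canvas to a data URL, then decode that back into an Image
    var img = new Image();
    img.onload = function() { onLoad(img); };
    img.src = canvas.toDataURL();
}

Note the pointless encode/decode roundtrip through a PNG data URL just to get from raw pixels to an image object.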
You are correct. The code is using an inefficient cache access pattern, so most of the time is spent waiting.
You probably won't get 100x faster without SIMD, but 10x is certainly doable. Unfortunately, SIMD.js support was removed from Chrome and Firefox a while ago, and SIMD is not available in wasm to this day.
How would SIMD do anything to address the problem's fundamental anti-cache-friendly access patterns? You'd need to restructure the problem to be cache-friendly, but SIMD won't really be relevant to that.
My guess is that the JS implementation of the worst-performing browser is having trouble with the non-1 for-loop steps. Doing 90-degree image rotation with fixed steps and some index calculations should work better (0.18 sec vs 1.5 sec for their implementation in node.js):
for (var y = 0; y < height; y++)
    for (var x = 0; x < width; x++)
        b[x + y*width] = a[y + (width - 1 - x)*height];
Although that's still far from the theoretical maximum throughput because the cache utilization is really bad. If you apply loop tiling, it should be even faster. This problem is closely related to matrix transpose, so there is a great deal of research you can build upon.
EDIT: 0.07 seconds with loop tiling:
for (var y0 = 0; y0 < height; y0 += 64) {
    for (var x0 = 0; x0 < width; x0 += 64) {
        for (var y = y0; y < y0 + 64; y++) {
            for (var x = x0; x < x0 + 64; x++) {
                b[x + y*width] = a[y + (width - 1 - x)*height];
            }
        }
    }
}
Your 0.18 sec result is (to use the units they used in the article) 180ms, and if I understand correctly their best WebAssembly compiled-and-executed result (?) is 300ms. Beautiful.
EDIT: But it could also be that your computer is somewhat faster than theirs? Do you happen to have a very fast CPU? Can you say which? When I run C-like C++ versions of your code I get the speeds you get with node.js. Still, your results are overall much better than theirs; great work!
#include <stdio.h>
int main(int argc, char* argv[]) {
enum { height = 4096, width = 4096 };
unsigned* a = new unsigned[ height*width ];
unsigned* b = new unsigned[ height*width ];
if ( argc < 2 ) { // call with no params
// to measure overhead when just allocations
// and no calculations are done
printf( "%d %d\n", (int)a, (int)b );
return 1;
}
if ( argv[1][0] == '1' ) // call with 1 the fastest
for (unsigned y0 = 0; y0 < height; y0 += 64)
for (unsigned x0 = 0; x0 < width; x0 += 64)
for (unsigned y = y0; y < y0 + 64; y++)
for (unsigned x = x0; x < x0 + 64; x++)
b[x + y*width] = a[y + (width - 1 - x)*height];
else
for (unsigned y = 0; y < height; y++)
for (unsigned x = 0; x < width; x++)
b[x + y*width] = a[y + (width - 1 - x)*height];
return 0;
}
Or maybe not: my short experiments with the simplified version based on their algorithm and the JavaScript versions above gave some conflicting results. I haven't thoroughly verified them; this note is just to motivate others to try.
I get 60ms in C. But in your code, the compiler might decide to remove most of the code since b is not used after being calculated. I checked the assembly code and it does not seem to be the case here, but it's still something to be aware of.
OK, I get ca. 80ms for my run with parameter 1 on my main computer, and 200ms on an N3150 Celeron.
> b is not used after being calculated
So far, I've never seen a C compiler optimize away the call to the allocator and the accesses to the arrays allocated that way. Maybe it's different now? Hm, dead code elimination... I guess randomly initializing a few values before the loop, then reading and printing a few values after it, should always be safe... Now that I think about it, so would filling the arrays with zeroes beforehand.
These code motion/strength reduction optimizations are standard even in mildly optimizing compilers. I would be very surprised if an optimizing JavaScript compiler did not perform them automatically.
I tried a few micro-optimizations, but they did not make a measurable difference, so I kept the code short instead. But maybe some JIT is particularly bad at loop hoisting, so it might make a difference there.
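For concreteness, the kind of hoisting/strength reduction meant here, applied to the rotation loop above, might look like this (just a sketch, not necessarily what was tried):

for (var y = 0; y < height; y++) {
    var row = y * width;                // hoisted loop invariant
    var src = y + (width - 1) * height; // source index for x = 0
    for (var x = 0; x < width; x++, src -= height)
        b[row + x] = a[src];            // same as a[y + (width - 1 - x)*height]
}

A good JIT should do this transformation itself, which would explain why it made no measurable difference.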
Huh interesting! I always disliked butchering code to do processor cache optimizations and I kinda worked under the impression that a browser’s JS and wasm compilers would do these optimizations for me.
I’ll definitely give tiling a spin (although at this point we are definitely fast enough™️)
Can someone please explain why loop tiling increases performance in JS so dramatically? Is it mainly due to the fact that inner loops have constant size (64) and get called more frequently, and thus get promoted faster into deeper stages of JS runtime optimization?
My guess is that if you invoke the initial whole code (before tiling) in an external loop (rotating images of exactly the same size), you will get a similar perf boost (not that it has practical implications, but just to understand how the optimization works).
No, it's faster because the working set of 64 * 64 * 4 * 2 bytes (32 KiB) can (almost) fit in the CPU core's L1 cache. Further cache levels are slower, and finally main memory is glacially slow.
A WASM example would speed up as well using the same approach. Or C, Rust or whatever.
Doesn't this rely on the CPU prefetching the memory into cache? Do current CPUs from Intel and AMD detect access patterns like this successfully? I.e. where you're accessing 64-element slices from a bigger array with a specific stride.
The idea is that the Y dimension is going to have a limited number (here 64) of hot cache lines while a tile is processed.
After going through one set of 64 vertical lines, the Y accesses are going to be near the Y accesses from the previous outer-tile-loop iteration.
(Stride detecting prefetch can help, especially on the first iteration of a tile, but is not required for a speedup).
BTW this is the motivation for GPUs (and sometimes other graphics applications) using "swizzled" texture/image formats, where pixels are organised into various kinds of screen-locality preserving clumps. https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-...
> As I understand, the main goal was to achieve easily readable and maintainable code, even to the detriment of performance.
Seems like a tricky goal for image algorithms in general where you're performing the same action over and over on millions of pixels. Obscure inner loop optimisations are pretty much required.
In these situations, I would sometimes keep the code for the naive but slow version around next to the highly optimised but difficult to understand version. You can compare their outputs to find bugs as well.
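A minimal sketch of that cross-check, assuming hypothetical rotateNaive and rotateTiled functions wrapping the two loop variants from upthread:

function checkRotation(width, height) {
    var n = width * height;
    var a = new Uint32Array(n);
    for (var i = 0; i < n; i++) a[i] = (Math.random() * 0x100000000) >>> 0;
    var expected = new Uint32Array(n);
    var actual = new Uint32Array(n);
    rotateNaive(a, expected, width, height); // slow but obviously correct
    rotateTiled(a, actual, width, height);   // fast but easy to get wrong
    for (var j = 0; j < n; j++)
        if (expected[j] !== actual[j])
            throw new Error("mismatch at index " + j);
}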
> My guess is that the JS implementation of the worst-performing browser is having trouble with the non-1 for-loop steps.
Why would non-1 for-loop steps be slower in some browsers? Does the compiler add some sort of prefetch instruction in the faster browsers based on the loop increment?
All of these top comments remind me of the general adage that encryption and security are really hard, and if you think you know what you're doing you probably don't.
I don't want to be too harsh for a ShowHN, but even if the authors fixed the several bugs that have already been reported here, it's clear they don't have a foundational enough understanding of cryptography and security to be writing a password manager. I would suggest they spend more time understanding the basics first.
Ah, the good old OpenGL fixed function pipeline, deprecated for over 11 years now.
For something a bit more modern, I'd recommend [0], but one might argue that old OpenGL is easier to learn since you don't have to set up your own shaders.
To clarify, the point of Scotty3D is not to teach OpenGL -- the students do not write any OpenGL, DirectX, Vulkan, etc in the class. The OpenGL that is there is simply used to render the 3D models and the UI, so updating that code is pretty low priority.
One of my longer-term goals as a TA for the class is to update Scotty3D to Vulkan or modern OpenGL.
EDIT: I wanted to expand on this point, as this is actually an important part of the philosophy of the design of the course. As the OP argues as well, it is more important to learn the fundamentals of CG theory (e.g. rasterization, the rendering equation, solving ODEs/PDEs) than the specifics of any particular implementation (OGL, DX, etc). After taking 462, many students (including myself!) take the class 15-466 Computer Game Programming [0], which goes deep into more modern OpenGL implementations (admittedly, it's OGL 3.3, but it still covers shaders/VBOs/other important concepts that translate to modern APIs).
I wanted to support what you're saying. I've worked in visual effects for over a decade now. While I've dabbled with OpenGL, I think the only practical application I've had was taking a stab at writing a PyOpenGL widget for viewing Alembic models in a custom PySide asset browser for a studio. This was when Alembic was much less mature, and it never ended up getting used. However, I have done a lot of dealing with color spaces, debugging/optimizing scanline and ray casting renderers, computing/storing surface normals, projection and other space transforms, and simulations. I've written toy versions of a lot of those things, but mostly I was debugging black box systems written by others or troubleshooting assets generated or consumed by one of these tools.
Even if you're talking about game engines, there's still a whole lot more to learn. This game engine book [1] has one chapter about the rendering engine and 16 more about other topics. Each chapter in there is at least one hefty book to get a good working knowledge of the topic.
It's great you have a resource where people can learn and experiment with these other things without having to learn to write all the code around it.
> Even if you're talking about game engines, there's still a whole lot more to learn. This game engine book [1] has one chapter about the rendering engine and 16 more about other topics. Each chapter in there is at least one hefty book to get a good working knowledge of the topic.
Which is why, when people without game industry experience start discussing 3D API adoption, they lose sight of how little the APIs actually influence the whole engine codebase.
There's something charming and engaging about the "legacy" fixed function pipeline that we've lost with our increasing focus on lower and lower level APIs. The ability to have a 10 line hello-world program that draws a colored triangle on the screen is magical, and encouraging to beginners, and that experience can't be replaced by the massive boilerplate and "copy-paste-this-stuff-dont-worry-about-what-it-does-yet" you need to do in order to do graphics the more modern way.
With Apple working to eliminate all traces of OpenGL with Metal, and Microsoft already having abandoned it close to two decades ago, I feel it's close to the end of the road for fixed function OpenGL. It was a wonderful part of graphics development history that, sadly, future beginners will likely not be able to experience.
One more instance of how the interests of vendors and the interests of developers are not aligned. Microsoft and Apple don't want you learning portable skills -- they want to limit your future prospects to developing only for their specific platform.
Noob here: how come no one is making a cross-platform API to abstract away this stuff? Whenever I read about OpenGL or Vulkan or Metal or whatever, with the tutorial going "learn this engine to bypass the complexities of bare API usage", it's my first thought.
Vulkan is supposed to be the cross-platform API, but Apple isn't supporting it (and doing their own thing with Metal, as per usual). From what I've heard, Vulkan was originally "OpenGL 5" so while OpenGL continues to exist Vulkan is effectively its successor. There is MoltenVK, which allows Vulkan applications to run on top of Metal.
SFML (Simple and Fast Multimedia Library) and SDL are cross-platform and hide most of the OpenGL boilerplate. I've used SFML, but just to make shaders and output some text.
I think you're correct that the old GL API made it quick to get stuff working, but the modern GL is much nicer once you get past the overhead of setting up all the VBOs etc. That only needs doing once, and then you're set.
I've recently been following along a Vulkan tutorial to get started with that, in the odd evening over the last few weeks. I'm six chapters in, and I've yet to even draw a single triangle. That's still about five chapters away. While I can appreciate that the flexibility of the setup to remove much of the implicit state contained within the GL state machine is good, I can't help but wish for a wrapper to just make it work for a typical scenario and let me render stuff with a minimum of fuss.
I'm unsure about where Metal will fit in the future. No matter how great it is, it's a vendor-specific proprietary API and I suspect that Vulkan will be the next cross-platform API which will wrap Metal or DX12 when a native Vulkan driver isn't available.
I hope it's not too off-topic, but I feel this sort of thing has happened before, with developing GUI programs in Visual Basic, or drawing to the screen with turtle graphics (or whatever routines BASIC tended to have). Curious if others share the sentiment, with other examples too.
Oh yeah, totally. Maybe I'm looking at it with rose-colored glasses because I was younger, but the way I remember it personal computing used to be about enabling users. At some point there was this huge attitude shift towards being condescending toward users and treating them like cattle. So now, instead of trying to bridge the gap between computer "user" and computer "programmer", we forcibly drive a giant wedge between them.
The underlying issue is that many novice users don't see the problem with being condescended to, even when this severely inconveniences the more "developer-like" power users we used to enable. That "wedge" is just what the Eternal September of personal computing looks like.
Interestingly, many browsers are still susceptible to this attack, for example when used in SVG files (WARNING: might crash your browser and/or operating system): https://jsfiddle.net/e3guLn08/
Browsers are susceptible to a server that generates an infinite HTML page (e.g. a CGI shell script calling "yes <arg>"), and also to a thing called JavaScript that can eat all your memory programmatically (and does exactly that on a regular basis).