hyperfine ./magika.bash ./file.bash
Benchmark 1: ./magika.bash
Time (mean ± σ): 706.2 ms ± 21.1 ms [User: 10520.3 ms, System: 1604.6 ms]
Range (min … max): 684.0 ms … 738.9 ms 10 runs
Benchmark 2: ./file.bash
Time (mean ± σ): 23.6 ms ± 1.1 ms [User: 15.7 ms, System: 7.9 ms]
Range (min … max): 22.4 ms … 29.0 ms 111 runs
Summary
'./file.bash' ran
29.88 ± 1.65 times faster than './magika.bash'
Realistically, either you're identifying one file interactively and you don't care about latency differences in the 10s of ms, or you're identifying in bulk (batch command line or online in response to requests), in which case you should measure the marginal cost and exclude Python startup and model loading times.
Going by those numbers, it's taking almost a second to run, not 10s of ms. And going by those numbers, it's doing something massively parallel in that time. So basically all your cores will spike to 100% for almost a second during those one-shot identifications. It looks like GP has a 12-16 thread CPU, and it is using all of those threads while still being 30 times slower than single-threaded libmagic.
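For anyone who wants to check the arithmetic behind that, here is a rough sketch that divides CPU time (user + system) by wall time using the figures from the hyperfine output above. The helper names are made up for illustration, and "cores busy" is only an average over the run, not a measurement of the actual thread count.

```python
# Figures copied from the hyperfine output above, in milliseconds.
magika = {"wall": 706.2, "user": 10520.3, "system": 1604.6}
file_ = {"wall": 23.6, "user": 15.7, "system": 7.9}


def cpu_time(run):
    """Total CPU time consumed across all cores (user + system)."""
    return run["user"] + run["system"]


def busy_cores(run):
    """Average number of cores kept busy: CPU time divided by wall time."""
    return cpu_time(run) / run["wall"]


print(f"magika: {busy_cores(magika):.1f} cores busy on average")
print(f"file:   {busy_cores(file_):.1f} cores busy on average")
print(f"CPU time ratio: {cpu_time(magika) / cpu_time(file_):.0f}x")
```

With these inputs the ratio of wall times is the ~30x that hyperfine reports, but the ratio of CPU times comes out in the hundreds, which is the point being made about the parallelism.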
That tool needs 100x more CPU time just to figure out some filetypes than vim needs to open a file from a cold start (which presumably includes using libmagic to check the type).
If I had to wait a second just to open something, during which that thing uses every resource available on my computer to the fullest, I'd probably break my keyboard. Try using that thing as a drop-in replacement for `file`: open some folder in your favorite file manager and watch your computer slow to a crawl as your file manager tries to figure out what thumbnails to render.
It's utterly unsuitable for "interactive" identifications.
My little script is trying to identify in bulk, at least by passing 165 file paths to `magika` and `file`.
Though, I absolutely agree with you. I think realistically it's better to do this kind of thing in a library rather than shell out to it at all. I was just trying to get an idea of how it generally compares.
Another note: I was trying to be generous to `magika` here, because for single-file identification it's about 160-180ms on my machine vs <1ms for `file`. I realize a good part of that number is going to be Python startup, which is why I didn't go with it when pushing that benchmark up earlier. I'll probably push an update to that gist to include the single-file benchmark as well.
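As a back-of-the-envelope check on the marginal cost, the two `magika` timings in this thread can be fitted to a simple "fixed startup + per-file cost" line. This is only a sketch: it assumes cost grows linearly with the number of files, it takes 170 ms (the midpoint of the 160-180ms range) as the single-file time, and the function names are hypothetical.

```python
def marginal_cost_ms(t_small_ms, n_small, t_large_ms, n_large):
    """Per-file cost implied by two batch timings, assuming linear scaling."""
    return (t_large_ms - t_small_ms) / (n_large - n_small)


def fixed_cost_ms(t_small_ms, n_small, per_file_ms):
    """Startup overhead left over after removing the per-file cost."""
    return t_small_ms - n_small * per_file_ms


# magika: ~170 ms for 1 file (assumed midpoint), 706.2 ms for 165 files.
per_file = marginal_cost_ms(170.0, 1, 706.2, 165)
startup = fixed_cost_ms(170.0, 1, per_file)

print(f"marginal cost: {per_file:.2f} ms per file")
print(f"fixed startup: {startup:.1f} ms")
```

Under those assumptions most of the single-file time is fixed overhead, and the per-file cost is a few milliseconds of wall time. The same fit isn't attempted for `file` because "<1ms" doesn't give a usable single-file number.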