Go binaries are statically linked by default if they do not call out to any C functions.
Go will automatically switch to dynamic linking if you need to call out to C functions. Some parts of the standard library do this internally, which is why many folks assume that Go is dynamically linked by default.
Sorry, my wording wasn't very clear. I wasn't trying to imply that ht is more geared towards scripting than `expect` (in fact I'd say `expect` is more scripting-oriented being an extension of a scripting language) but rather that ht is more geared towards scripting the terminal as a UI than `expect`.
Am I wrong about that? (I may very well be since I haven't used `expect` before)
Based on my understanding/recollection of `expect`, the concept is that you're scripting a command/process (or command sequence) via a "terminal connection" (or basic stdin/stdout), based on the (either complete or partial) "expected" dialogue response.
I guess it might in theory be possible to script a TUI that way, but I suspect it'd get pretty convoluted over an extended period of time.
(BTW I mentioned this in a comment elsewhere in this thread but check out https://crates.io/crates/termwiz to avoid re-inventing the wheel for a bunch of terminal-related functionality.)
I see what you're saying. When I was writing scripts in `expect`, I never really tried to automate TUI programs. So this could absolutely be a better way to script the terminal as a UI, as you said.
At multiple points in my career I've been responsible for APIs that accept PDFs. Many non-tech-savvy people, seeing this, will just change the extension of the file they're uploading to `.pdf`.
To make matters worse, there is some business software out there that will actually bastardize the PDF format and put garbage before the PDF file header. So for some things you end up writing custom validation and cleanup logic anyway.
This is false in every sense for https://www.darwinsys.com/file/ (probably the most widely used version of file). It depends on the magic entry for a specific file type, but it can check any part of your file. Many Linux distros ship versions that are years out of date, so you might be using a very old one.
FILE_45:

  $ ./src/file -m magic/magic.mgc ../../OpenCalc.v2.3.1.apk
  ../../OpenCalc.v2.3.1.apk: Android package (APK), with zipflinger virtual entry, with APK Signing Block
Interesting! I checked with file 5.44 from Ubuntu 23.10 and 5.45 on macOS using homebrew, and in both cases, I got “Zip archive data, at least v2.0 to extract” for the file here[1]. I don’t have an Android phone to check and I’m also not familiar with Android tooling, so is this a corrupt APK?
I also tried this with the sources of file from the homepage you linked above, and I still get the same results.
You could try this for yourself using the same APKPure file, which I uploaded at the following alternative link[1]. Further, while this could be a corrupt APK, I can't see any signs of that from a cursory inspection: both `classes.dex` and the `META-INF` directory are present, and this is APKPure's own APK, rather than an APK for an app contributed by a third party.
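For what it's worth, an APK really is just a ZIP, so a generic magic database will happily stop at "Zip archive data" unless it has APK-specific entries. A hedged stdlib-only sketch of the kind of structural check that distinguishes them (the member names are the usual APK markers, not an exhaustive rule):

```python
import io
import zipfile

def looks_like_apk(data: bytes) -> bool:
    """Heuristic: an APK is a ZIP that contains AndroidManifest.xml
    (and usually classes.dex). Not authoritative -- just the kind of
    check a magic rule or a validator might perform."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        return False
    return "AndroidManifest.xml" in names or "classes.dex" in names
```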
As someone who has worked in a space that deals with uploaded files for the last few years, and who maintains a WASM libmagic Node package (https://github.com/moshen/wasmagic), I have to say I really love seeing new entries in the file type detection space.
Though I have to say, after looking at the Node module, I don't understand why they released it.
  $ hyperfine ./magika.bash ./file.bash
  Benchmark 1: ./magika.bash
    Time (mean ± σ):     706.2 ms ±  21.1 ms    [User: 10520.3 ms, System: 1604.6 ms]
    Range (min … max):   684.0 ms … 738.9 ms    10 runs

  Benchmark 2: ./file.bash
    Time (mean ± σ):      23.6 ms ±   1.1 ms    [User: 15.7 ms, System: 7.9 ms]
    Range (min … max):    22.4 ms …  29.0 ms    111 runs

  Summary
    './file.bash' ran
      29.88 ± 1.65 times faster than './magika.bash'
Realistically, either you're identifying one file interactively and you don't care about latency differences in the 10s of ms, or you're identifying in bulk (batch command line or online in response to requests), in which case you should measure the marginal cost and exclude Python startup and model loading times.
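Separating those two costs is straightforward: pay for startup and model load once, then time only the per-file loop. A minimal sketch of that methodology (the `load` and `classify` callables are stand-ins, not Magika's actual API):

```python
import time

def measure(load, classify, paths):
    """Time one-time setup separately from the marginal per-file cost."""
    t0 = time.perf_counter()
    model = load()                      # interpreter startup / model load, paid once
    setup = time.perf_counter() - t0

    t0 = time.perf_counter()
    results = [classify(model, p) for p in paths]
    per_file = (time.perf_counter() - t0) / len(paths)
    return setup, per_file, results

# Dummy stand-ins so the sketch runs; swap in a real loader and classifier.
setup, per_file, results = measure(
    load=lambda: object(),
    classify=lambda model, p: "unknown",
    paths=[f"f{i}" for i in range(1000)],
)
```

In bulk use, `per_file` is the number that matters; `setup` is amortized away.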
Going by those numbers it's taking almost a second to run, not tens of ms. And going by those numbers, it's doing something massively parallel in that time. So basically all your cores will spike to 100% for almost a second during those one-shot identifications. It looks like GP has a CPU with 12-16 threads, and Magika is using all of them while still being 30 times slower than single-threaded libmagic.
That tool needs 100x more CPU time just to figure out some filetypes than vim needs to open a file from a cold start (which presumably includes using libmagic to check the type).
If I had to wait a second just to open something during which that thing uses every resource available on my computer to the fullest, I'd probably break my keyboard. Try using that thing as a drop-in file replacement, open some folder in your favorite file manager, and watch your computer slow to a crawl as your file manager tries to figure out what thumbnails to render.
It's utterly unsuitable for "interactive" identifications.
My little script is trying to identify in bulk, at least by passing 165 file paths to `magika` and `file`.
Though, I absolutely agree with you. I think realistically it's better to do this kind of thing in a library rather than shell out to it at all. I was just trying to get an idea on how it generally compares.
Another note: I was trying to be generous to `magika` here, because for single-file identification it's about 160-180ms on my machine vs <1ms for `file`. I realize quite a bit of that number is Python startup, which is why I didn't go with it when pushing that benchmark up earlier. I'll probably push an update to that gist to include the single-file benchmark as well.
We did release the npm package because we indeed created a web demo and thought people might want to use it as well. We know it is not as fast as the Python version or a C++ version would be -- which is why we marked it as experimental.
The release includes the Python package and the CLI, which are quite fast and are the main way we expected people to use Magika -- sorry if that wasn't clear in the post.
The goal of the release is to offer a tool that is far more accurate than other tools and works on the major file types, as we hope it will be useful to the community.
Thank you for the release! I understand you're just getting it out the door. I just hope to see it delivered as a native library or something more reusable.
I did try the Python CLI, but it seems to be about 30x slower than `file` for the random bag of files I checked.
I'll probably take some time this weekend to make a couple of issues around misidentified files.
Hello!
We wrote the Node library as a first functional version.
Its API is already stable, but it's a bit slower than the Python library for two reasons: it loads the model at runtime, and it doesn't do batch lookups, meaning it calls the model once per file.
Other than that, it's just as fast for single-file lookups, which is the most common use case.
Absolutely, and honestly in a non-interactive ingestion workflow you're probably doing multiple checks anyway. I've worked with systems that call multiple libraries and hand-coded validation for each incoming file.
Maybe it's my general malaise, or disillusionment with the software industry, but when I wrote that I was really just expecting more.
> The model appears to only detect 116 file types [...] Where libmagic detects... a lot. Over 1600 last time I checked
As I'm sure you know, in a lot of applications, you're preparing things for a downstream process which supports far fewer than 1600 file types.
For example, a printer driver might call on file to check if an input is postscript or PDF, to choose the appropriate converter - and for any other format, just reject the input.
Or someone training an ML model to generate Python code might have a load of files they've scraped from the web, but might want to discard anything that isn't Python.
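For that second case in particular, you don't even need a general-purpose classifier: if the only question is "is this Python?", the stdlib can answer it directly. A rough sketch (stricter than a type detector, and imperfect — e.g. some valid JSON also parses as a Python expression, so a real pipeline would layer more checks):

```python
import ast

def is_probably_python(source: str) -> bool:
    """Accept files that parse as Python source; reject everything else."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):  # ValueError covers e.g. null bytes
        return False
    return True
```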
Okay, but your one file type is more likely to be included in the 1600 that libmagic supports than in Magika's 116, no?
For that matter, the file types I care about are unfortunately misdetected by Magika (which is also an important point - the `file` command at least gives up and says "data" when it doesn't know, whereas the Magika demo gives a confidently wrong answer).
I don't want to criticize the release because it's not meant to be a production-ready piece of software, and I'm sure the current 116 types isn't a hard limit, but I do understand the parent comment's contention.
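That "give up and say data" behaviour is cheap to bolt on wherever the model exposes per-class scores. A minimal sketch (the `scores` dict and threshold are assumptions for illustration, not Magika's actual API):

```python
def label_with_fallback(scores: dict[str, float], threshold: float = 0.5) -> str:
    """Return the top label only when the model is confident enough;
    otherwise fall back to 'data', like `file` does."""
    label, score = max(scores.items(), key=lambda kv: kv[1])
    return label if score >= threshold else "data"
```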
Surely identifying just one file type (or two, as in your example) is a much simpler task that shouldn’t rely on horribly inefficient and imprecise “AI” tools?
> By open-sourcing Magika, we aim to help other software improve their file identification accuracy and offer researchers a reliable method for identifying file types at scale.
Which implies a production-ready release for general usage, as well as usage by security researchers.