Hacker News | m0shen's comments

I thought it depended on libc by default?


You might find https://mt165.co.uk/blog/static-link-go/ useful, but a quick tl;dr:

Go binaries are statically linked by default if they do not call out to any C functions.

Go will automatically switch to dynamic linking if you need to call out to C functions. Some parts of the standard library (e.g. `net` and `os/user`) do this internally, which is why many folks assume Go is dynamically linked by default.


Nyancat terminal animation in perl: https://gist.github.com/moshen/1417991

On the slightly more useful side...

gotermimg to play gifs in the terminal: https://github.com/moshen/gotermimg


`expect` is absolutely geared towards scripting, as it's an extension of Tcl. Though as far as getting a "current terminal view" goes, `expect` has `term_expect`: https://core.tcl-lang.org/expect/file?name=example/term_expe...


Sorry, my wording wasn't very clear. I wasn't trying to imply that ht is more geared towards scripting than `expect` (in fact, I'd say `expect` is more scripting-oriented, being an extension of a scripting language), but rather that ht is more geared towards scripting the terminal as a UI than `expect` is.

Am I wrong about that? (I may very well be since I haven't used `expect` before)


Based on my understanding/recollection of `expect`, the concept is that you're scripting a command/process (or command sequence) via a "terminal connection" (or basic stdin/stdout), based on the (either complete or partial) "expected" dialogue response.

e.g.

1. make the initial connection over ssh (e.g. spawn the ssh CLI process)
2. expect the "login: " response
3. send "admin"
4. expect the "password: " response
5. send "password"
6. expect "$ "
7. send "whoami\n"
8. etc. etc.

I guess it might in theory be possible to script a TUI with that, but I suspect it'd get pretty convoluted over an extended period of time.

(BTW I mentioned this in a comment elsewhere in this thread but check out https://crates.io/crates/termwiz to avoid re-inventing the wheel for a bunch of terminal-related functionality.)
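The dialogue above can be sketched with nothing but the Python standard library; here a tiny inline "login" simulator stands in for a real ssh session (the child script and credentials are entirely hypothetical):

```python
import subprocess
import sys

# A fake child process that plays the server side of the dialogue.
CHILD = r'''
import sys
sys.stdout.write("login: "); sys.stdout.flush()
user = sys.stdin.readline().strip()
sys.stdout.write("password: "); sys.stdout.flush()
sys.stdin.readline()
sys.stdout.write("$ whoami\n" + user + "\n"); sys.stdout.flush()
'''

def expect(proc, pattern):
    """Read from the child one byte at a time until `pattern` appears (crude expect loop)."""
    buf = b""
    while pattern not in buf:
        ch = proc.stdout.read(1)
        if not ch:
            raise EOFError(buf)
        buf += ch
    return buf

proc = subprocess.Popen([sys.executable, "-c", CHILD],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)

expect(proc, b"login: ")
proc.stdin.write(b"admin\n"); proc.stdin.flush()
expect(proc, b"password: ")
proc.stdin.write(b"hunter2\n"); proc.stdin.flush()
out = expect(proc, b"admin\n")   # capture the whoami answer
proc.stdin.close()
proc.wait()
print(out.decode(), end="")
```

Real `expect` (and libraries like pexpect) adds pattern matching, timeouts, and a proper pty on top of this send/expect loop.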


I see what you're saying. When I was writing scripts in `expect`, I didn't really ever try to automate TUI programs. So, this could absolutely be a better way to script the terminal as a UI, as you said.

I'll certainly tuck it into my toolbox. Thanks :)


Paperless supports OCR + full text indexing: https://docs.paperless-ngx.com/

As far as AI goes, not sure.


You can use GPT4All with LocalDocs to analyze the folder where you store the output of paperless-ngx.


I love seeing how many people wrote one of these. I've written 2:

One to learn golang that plays gifs - https://github.com/moshen/gotermimg

And one in perl - https://github.com/moshen/Image-Term256Color


This article was written when Cosmopolitan (https://justine.lol/cosmopolitan/index.html) was at v2. v3 has since been released: https://justine.lol/cosmo3/ .

https://cosmo.zip/ also contains a Python build (https://cosmo.zip/pub/cosmos/bin/python), though I'm not sure if the problems mentioned in the article are solved.


Most likely before v2 as well. The original post is from July 2021, which is a bit after Cosmopolitan Libc's v1, IIRC.


Yes, you're correct. I had in mind the 2022 date of the update. It also looks like you've updated the article again today! Thanks


Anything that converts to PDF/A will strip JavaScript.
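For a quick sanity check before or after conversion, you can look for the name objects that mark script actions (a rough sketch; a real check should parse the PDF object tree rather than grep the raw bytes):

```python
import re

def has_javascript(pdf_bytes: bytes) -> bool:
    # /JS and /JavaScript name objects mark script actions in a PDF.
    return re.search(rb"/(JS|JavaScript)\b", pdf_bytes) is not None
```

This will happily flag false positives inside streams or string literals, but it's a useful cheap filter in front of a full parser.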


At multiple points in my career I've been responsible for APIs that accept PDFs. Many non-tech-savvy people, on seeing a PDF requirement, will just change the extension of the file they're uploading to `.pdf`.

To make matters worse, there is some business software out there that actually bastardizes the PDF format and puts garbage before the PDF file header. So for some things you end up writing custom validation and cleanup logic anyway.
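The cleanup logic can be surprisingly simple, since readers conventionally accept the `%PDF-` marker anywhere in the first 1024 bytes. A sketch of the kind of validation described above:

```python
from typing import Optional

def clean_pdf(data: bytes) -> Optional[bytes]:
    """Return data starting at the PDF header, or None if it isn't a PDF."""
    idx = data[:1024].find(b"%PDF-")
    if idx == -1:
        return None       # a renamed .docx, an image, etc.
    return data[idx:]     # drop any garbage prepended before the header
```

A production version would also validate the trailer/xref, but this alone catches both the renamed-extension and garbage-prefix cases.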


This is false in every sense for https://www.darwinsys.com/file/ (probably the most widely used `file` implementation). It depends on the magic rules for a specific file type, but it can check any part of your file. Many Linux distros ship versions that are years out of date, so you might be using a very old one.

FILE_45:

    ./src/file -m magic/magic.mgc ../../OpenCalc.v2.3.1.apk
    ../../OpenCalc.v2.3.1.apk: Android package (APK), with zipflinger virtual entry, with APK Signing Block


Interesting! I checked with file 5.44 from Ubuntu 23.10 and 5.45 on macOS via Homebrew, and in both cases I got “Zip archive data, at least v2.0 to extract” for the file here[1]. I don’t have an Android phone to check and I’m also not familiar with Android tooling, so is this a corrupt APK?

[1] https://download.apkpure.net/custom/com.apkpure.aegon-319781...


That doesn't appear to be a valid link. Try building `file` from source and using the provided default magic database.


I also tried this with the sources of file from the homepage you linked above, and I still get the same results.

You could try this for yourself using the same APKPure file, which I uploaded at the following alternative link[1]. Further, while this could be a corrupt APK, I can’t see any signs of that from a cursory inspection, as both `classes.dex` and the `META-INF` directory are present, and this is APKPure’s own APK rather than one contributed by a third party.

[1] https://wormhole.app/Mebmy#CDv86juV9H4aRCL2DSJeDw
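Since an APK is just a Zip archive with Android-specific members, one way to disambiguate the two by hand (a sketch of the idea, not what libmagic's magic rules actually do) is to open the archive and look for `AndroidManifest.xml` or `classes.dex`:

```python
import io
import zipfile

def looks_like_apk(data: bytes) -> bool:
    """Guess whether a Zip payload is actually an Android package."""
    if not data.startswith(b"PK"):
        return False
    try:
        names = zipfile.ZipFile(io.BytesIO(data)).namelist()
    except zipfile.BadZipFile:
        return False
    return "AndroidManifest.xml" in names or any(
        n.startswith("classes") and n.endswith(".dex") for n in names)
```

If `file` only reports "Zip archive data" for a well-formed APK, its magic database likely predates the APK-specific rules.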


As someone who has worked in a space dealing with uploaded files for the last few years, and who maintains a WASM libmagic Node package ( https://github.com/moshen/wasmagic ), I have to say I really love seeing new entries in the file type detection space.

Though I have to say, looking at the Node module, I don't understand why they released it.

Their docs say it's slow:

https://github.com/google/magika/blob/120205323e260dad4e5877...

It loads the model at runtime:

https://github.com/google/magika/blob/120205323e260dad4e5877...

They mark it as Experimental in the documentation, but it seems like it was just made for the web demo.

Also, as others have mentioned, the model appears to only detect 116 file types:

https://github.com/google/magika/blob/120205323e260dad4e5877...

Whereas libmagic detects... a lot. Over 1600 last time I checked:

https://github.com/file/file/tree/4cbd5c8f0851201d203755b76c...

I guess I'm confused by this release. Sure, it detected most of my list of sample files, but in a sample set of 4 zip files, it misidentified one.


Made a small test to try it out: https://gist.github.com/moshen/784ee4a38439f00b17855233617e9...

    hyperfine ./magika.bash ./file.bash
    Benchmark 1: ./magika.bash
      Time (mean ± σ):     706.2 ms ±  21.1 ms    [User: 10520.3 ms, System: 1604.6 ms]
      Range (min … max):   684.0 ms … 738.9 ms    10 runs
    
    Benchmark 2: ./file.bash
      Time (mean ± σ):      23.6 ms ±   1.1 ms    [User: 15.7 ms, System: 7.9 ms]
      Range (min … max):    22.4 ms …  29.0 ms    111 runs
    
    Summary
      './file.bash' ran
       29.88 ± 1.65 times faster than './magika.bash'


Realistically, either you're identifying one file interactively and you don't care about latency differences in the 10s of ms, or you're identifying in bulk (batch command line or online in response to requests), in which case you should measure the marginal cost and exclude Python startup and model loading times.


Going by those numbers, it's taking almost a second to run, not 10s of ms. And going by those numbers, it's doing something massively parallel in that time, so basically all your cores will spike to 100% for almost a second during those one-shot identifications. It looks like GP has a CPU with 12-16 threads, and it is using all of them while still being 30 times slower than single-threaded libmagic.

That tool needs 100x more CPU time just to figure out some filetypes than vim needs to open a file from a cold start (which presumably includes using libmagic to check the type).

If I had to wait a second just to open something during which that thing uses every resource available on my computer to the fullest, I'd probably break my keyboard. Try using that thing as a drop-in file replacement, open some folder in your favorite file manager, and watch your computer slow to a crawl as your file manager tries to figure out what thumbnails to render.

It's utterly unsuitable for "interactive" identifications.


My little script is trying to identify in bulk, at least by passing 165 file paths to `magika` and `file`.

Though, I absolutely agree with you. I think realistically it's better to do this kind of thing in a library rather than shell out to it at all. I was just trying to get an idea on how it generally compares.

Another note: I was trying to be generous to `magika` here, because for single-file identification it's about 160-180 ms on my machine vs <1 ms for `file`. I realize that's going to include quite a bit of Python startup, which is why I didn't go with it when pushing that benchmark up earlier. I'll probably push an update to that gist to include the single-file benchmark as well.


I've updated this script with some single-file CLI numbers, which are (as expected) not good. Mostly we're just comparing Python startup time there.

    make
    sqlite3 < analyze.sql
    file_avg              python_avg         python_x_times_slower_single_cli
    --------------------  -----------------  --------------------------------
    0.000874874856301821  0.179884610224334  205.611818568799
    file_avg            python_avg     python_x_times_slower_bulk_cli
    ------------------  -------------  ------------------------------
    0.0231715865881818  0.69613745142  30.0427184289163


We did release the npm package because we indeed created a web demo and thought people might want to use it too. We know it is not as fast as the Python version or a C++ version, which is why we marked it as experimental.

The release includes the Python package and the CLI, which are quite fast and are the main way we expect people to use it -- sorry if that wasn't clear in the post.

The goal of the release is to offer a tool that is far more accurate than other tools and works on the major file types, as we hope it will be useful to the community.

Glad to hear it worked on your files.


Thank you for the release! I understand you're just getting it out the door. I just hope to see it delivered as a native library or something more reusable.

I did try the Python CLI, but it seems to be about 30x slower than `file` for the random bag of files I checked.

I'll probably take some time this weekend to make a couple of issues around misidentified files.

I'll definitely be adding this to my toolset!


Hello! We wrote the Node library as a first functional version. Its API is already stable, but it's a bit slower than the Python library for two reasons: it loads the model at runtime, and it doesn't do batch lookups, meaning it calls the model once per file. Other than that, it's just as fast for single-file lookups, which is the most common use case.


Good to know! Thank you. I'll definitely be trying it out. Though, I might download and hardcode the model ;)

I also appreciate the use of ONNX here, as I'm already thinking about using another version of the runtime.

Do you think you'll open source your F1 benchmark?


Can we do the 1600 if known, and if not, let the AI take a guess?


Absolutely, and honestly in a non-interactive ingestion workflow you're probably doing multiple checks anyway. I've worked with systems that call multiple libraries and hand-coded validation for each incoming file.

Maybe it's my general malaise, or disillusionment with the software industry, but when I wrote that I was really just expecting more.


> The model appears to only detect 116 file types [...] Where libmagic detects... a lot. Over 1600 last time I checked

As I'm sure you know, in a lot of applications, you're preparing things for a downstream process which supports far fewer than 1600 file types.

For example, a printer driver might call on file to check if an input is postscript or PDF, to choose the appropriate converter - and for any other format, just reject the input.

Or someone training an ML model to generate Python code might have a load of files they've scraped from the web, but might want to discard anything that isn't Python.
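That filtering step can be a thin wrapper over whatever detector you use; here's a sketch with a cheap stdlib-only heuristic (extension or shebang) standing in for a real identifier like `file` or Magika:

```python
def probably_python(name: str, head: bytes) -> bool:
    """Cheap guess from the filename and the first line of content."""
    if name.endswith(".py"):
        return True
    first_line = head.splitlines()[0] if head else b""
    return first_line.startswith(b"#!") and b"python" in first_line

# Keep only the files we'd feed to the training pipeline:
files = [("a.py", b"print(1)"),
         ("run", b"#!/usr/bin/env python3\nprint(2)"),
         ("notes.txt", b"just prose")]
kept = [name for name, head in files if probably_python(name, head)]
```

The point of a model-based detector in this slot is exactly the files such heuristics miss: extensionless scripts with no shebang.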


Okay, but your one file type is more likely to be included in the 1600 that libmagic supports than in Magika's 116?

For that matter, the file types I care about are unfortunately misdetected by Magika (which is also an important point - the `file` command at least gives up and says "data" when it doesn't know, whereas the Magika demo gives a confidently wrong answer).

I don't want to criticize the release because it's not meant to be a production-ready piece of software, and I'm sure the current 116 types isn't a hard limit, but I do understand the parent comment's contention.


Surely identifying just one file type (or two, as in your example) is a much simpler task that shouldn’t rely on horribly inefficient and imprecise “AI” tools?


It's for researchers, probably.


Yeah, there is this line:

    By open-sourcing Magika, we aim to help other software improve their file identification accuracy and offer researchers a reliable method for identifying file types at scale.
Which implies a production-ready release for general usage, as well as usage by security researchers.

