I had a ton of fun using Kaitai to write an unpacking script for a video game's proprietary pack file format. Super cool project.
I did NOT have fun trying to use Kaitai to pack the files back together. Not sure if this has improved at all, but a year or so ago you had to build the dependencies yourself, and the process was so cumbersome that it ended up being easier to just write imperative code to do it myself.
Yep. I've run into this using Bugsnag for reporting on unhandled exceptions in Python-based Lambda functions. The exception handler would get called, but because the library is async by default the HTTP request wouldn't make it out before the runtime was torn down.
I sympathize with OP because debugging this was painful, but I'm sorry to say this is sort of just a "you're holding it wrong" situation.
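If it helps anyone hitting the same thing, the usual workaround is to force synchronous delivery so the HTTP request completes before the handler returns. A minimal sketch, assuming bugsnag-python's `asynchronous` configuration flag (verify the exact option name against your library version; `do_work` is a hypothetical stand-in for your handler logic):

```python
import bugsnag

# Assumption: `asynchronous` is bugsnag-python's delivery-mode flag.
# Setting it False delivers on the calling thread instead of a background one.
bugsnag.configure(
    api_key="YOUR_API_KEY",
    asynchronous=False,
)

def handler(event, context):
    try:
        return do_work(event)  # hypothetical business logic
    except Exception as exc:
        # Synchronous delivery blocks until the report is sent, so the
        # request gets out before the Lambda runtime is frozen or torn down.
        bugsnag.notify(exc)
        raise
```

The tradeoff is added latency on the error path, but in a Lambda that's usually preferable to silently dropping the report.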
It's not AI, but Ghidra has a cool feature called BSim which does something similar. Each function gets a "feature vector" which, now that I think about it, has some clear parallels to embeddings.
Wow, that is cool. I bet with that feature and a huge database of known "feature vectors" from open-source libraries, you could focus on the actual business logic of the binary instead of trying to reverse external library functions.
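To make the parallel to embeddings concrete, here's a toy nearest-neighbor lookup over function feature vectors. This is not BSim's actual representation or similarity metric, just an illustration of the matching idea, with a made-up corpus:

```python
import math

# Toy illustration only: BSim uses its own vectors and similarity measure.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical corpus: vectors extracted from known open-source library builds.
known_functions = {
    "zlib:inflate": [0.9, 0.1, 0.4],
    "openssl:SHA256_Update": [0.2, 0.8, 0.5],
}

def identify(unknown_vector, threshold=0.95):
    """Return the best-matching known function, or None if nothing is close."""
    name, score = max(
        ((n, cosine(unknown_vector, v)) for n, v in known_functions.items()),
        key=lambda item: item[1],
    )
    return name if score >= threshold else None

print(identify([0.88, 0.12, 0.41]))  # likely "zlib:inflate"
```

With a big enough corpus, anything that matches gets auto-labeled and you spend your time on the functions that don't.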
It's not all that small, although probably small enough to make a rainbow table or something.
You would have to maintain the code to generate character-perfect strings (or maybe just keep a very large library of the current most popular ones) and also make sure you have the up-to-date API key salt values (which they're probably going to start rotating regularly). As I said before, that wouldn't be impossible, just prohibitively irritating to maintain for comparatively little benefit.
And besides, it won't be too long before people just start spoofing the hash too, probably in less time than it would take to get the generator up and running.
Is the benefit of using a language server as opposed to just giving access to the codebase simply a reduction in the amount of tokens used? Or are there other benefits?
Beyond saving tokens, this greatly improved the quality and speed of answers: the language server (most notably used to find the declaration/definition of an identifier) gives the LLM:
1. a shorter path to relevant information: it can query for specific variables or functions rather than conduct a longer investigation of the source code. LLMs are typically trained/instructed to keep their answers within a range of tokens, so keeping conversations shorter when possible extends the search space the LLM will be "willing" to explore before outputting a final answer.
2. a good starting point in some cases, by immediately inspecting suspicious variables or function calls. In my experience this happens a lot in our Python implementation, where the first function calls are typically `info` calls to gather background on the variables and functions in the frame (a rough sketch of such a helper is below).
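For what it's worth, here's roughly what an `info`-style helper could look like. The name, signature, and output format are assumptions for illustration, not ChatDBG's actual implementation:

```python
import inspect

def info(obj, name="<object>"):
    """Illustrative helper: summarize a variable or function for the LLM.
    (Hypothetical sketch; not ChatDBG's actual `info` implementation.)"""
    lines = [f"{name}: type={type(obj).__name__}, repr={obj!r:.80}"]
    doc = inspect.getdoc(obj)
    if doc:
        lines.append(f"doc: {doc.splitlines()[0]}")
    if callable(obj):
        try:
            lines.append(f"signature: {inspect.signature(obj)}")
        except (TypeError, ValueError):
            pass
    return "\n".join(lines)
```

The point is that a single short, structured summary like this replaces several rounds of the model reading raw source to work out the same facts.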
Yes. It lets the LLM immediately obtain precise information rather than having to reason over the entire source of the codebase (which ChatDBG also enables). For example (from the paper, Section 4.6):
The second command, `definition`, prints the location and source code for the definition corresponding to the first occurrence of a symbol on a given line of code. For example, `definition polymorph.c:118 target` prints the location and source for the declaration of `target` corresponding to its use on that line. The `definition` implementation leverages the `clangd` language server, which supports source code queries via JSON-RPC and Microsoft's Language Server Protocol.
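For anyone curious what that query looks like on the wire, here's a minimal sketch of an LSP `textDocument/definition` request to `clangd` over JSON-RPC. The file path and cursor position are placeholders, and a real client would also read and parse the framed responses from stdout:

```python
import json
import subprocess

# LSP messages are JSON-RPC bodies framed with a Content-Length header.
proc = subprocess.Popen(["clangd"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

def send(msg):
    body = json.dumps(msg).encode()
    proc.stdin.write(b"Content-Length: %d\r\n\r\n" % len(body) + body)
    proc.stdin.flush()

uri = "file:///path/to/polymorph.c"  # placeholder path

send({"jsonrpc": "2.0", "id": 1, "method": "initialize",
      "params": {"processId": None, "rootUri": None, "capabilities": {}}})
send({"jsonrpc": "2.0", "method": "initialized", "params": {}})

# The server needs the file contents before it can answer queries about it.
with open("/path/to/polymorph.c") as f:
    send({"jsonrpc": "2.0", "method": "textDocument/didOpen",
          "params": {"textDocument": {"uri": uri, "languageId": "c",
                                      "version": 1, "text": f.read()}}})

# "Where is the symbol at line 118, column 11 defined?" (LSP positions are 0-indexed.)
send({"jsonrpc": "2.0", "id": 2, "method": "textDocument/definition",
      "params": {"textDocument": {"uri": uri},
                 "position": {"line": 117, "character": 10}}})
```

The response is a file URI plus a range, which is exactly the kind of compact, precise answer you want to hand back to the model.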