> I could probably teach a solid generalist who cared to get to the level of bei...

myself248 · on March 29, 2023

Seriously. There's lots of "you're already a rocket scientist so let's talk details" content out there, but very little with this incredibly-useful-sounding aim.

The trick is calibrating what a "solid generalist" means. I think I'd describe myself that way, but perhaps not among the HN crowd. Would be very interested in being a soundboard for such content, if that's helpful.

motohagiography · on March 29, 2023

thanks! it would take longer to write, but a basic entry point starts with what you want to know about something: who, what, when, where, why, and how. Trouble is, we tend to start with a complete picture of How without a sense of the rest to guide us.

What you want to know about a strange binary (barring obfuscation, sandbox escapes, and other nasties) is: who does it talk to (IP addresses, hostnames, sockets, etc.), what does it open (files, registry entries, APIs, services), when does it do these things (e.g. runtime conditions, magic packets, port knocks, triggers, checking for other software), where does it write or read data (directories, filehandles, remote sites, etc.), why does it do this given its stated purpose (why does it have an encrypted section, and where is its key, is it using weird encoding to bypass filters, etc.) and then finally, How it does all these things is the effect of answering those other questions.
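To make the who/what/where part concrete, here is a minimal sketch that sorts strings already pulled out of a binary into those buckets. The pattern list, the bucket names and the `bucket_strings` helper are all assumptions made up for illustration; real triage needs far more patterns, and none of this sees encoded or runtime-constructed values.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately small pattern set. Expect false positives
# (e.g. a version number like 1.2.3.4 looks like an IPv4 address).
PATTERNS = {
    "who": [
        re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),            # IPv4-looking
        re.compile(r"\bhttps?://\S+", re.I),                   # URLs
        re.compile(r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:com|net|org|io)\b", re.I),
    ],
    "what": [
        re.compile(r"\bHKEY_[A-Z_]+\\\S*"),                    # registry keys
        re.compile(r"\b\w+\.(?:dll|so|dylib)\b", re.I),        # libraries
    ],
    "where": [
        re.compile(r"[A-Za-z]:\\\S+"),                         # Windows paths
        re.compile(r"(?<![\w:/])/(?:etc|tmp|var|home|usr)/\S*"),  # Unix paths
    ],
}


def bucket_strings(strings):
    """Group regex hits from an iterable of strings by question bucket."""
    found = defaultdict(set)
    for s in strings:
        for bucket, patterns in PATTERNS.items():
            for pattern in patterns:
                found[bucket].update(pattern.findall(s))
    return {bucket: sorted(hits) for bucket, hits in found.items()}
```

Feeding it the output of a strings-style pass gives a first list of things to ask "why is this here?" about. The "when" and "why" questions still need a human.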

I think the hardest part of analysis is having an organized way of knowing what you are looking for because we don't know the right questions to ask and we tend to work at the edge of limited knowledge. Should this rando binary be talking to some app hosting site, and why? Why would a developer encode endpoint names in a lookup table that only constructs and returns them at runtime? Why would someone use any of these libraries or data formats on purpose? The harder it is to answer these questions, the more suspicious I get.

If you start with the 5-W's, the How falls out of that a lot faster. If you can answer these questions about a binary, you're easily 50% there in determining whether it behaves as expected. Having an organized goal can take you from zero to basically useful if you answer those questions about it. The rest is just screenshots of menu items in ghidra and maybe cyberchef for purely static extraction.

I feel like I should pile on caveats here about how most malware isn't obfuscated or using novel techniques, a lot of it is just spyware capabilities you clicked through to accept, or a repackaged legit binary with some downloaded RAT attached and some nested compressed libraries. I'm sure someone who is more serious about this will say, "that's misleadingly simple!" but once you have a why and a what, the how is a work problem.

Dynamic debugging and stepping through is the next stage. It's also basic, but when you are goal-oriented instead of trying to reproduce all usable code paths, it's more achievable. If you get the IP addresses out of a random binary and what protocols it's talking, and maybe what files it accesses, it means you've set up your analysis environment and done the initial checks, and that's valuable grunt work you can pass on to someone with deeper skills.

If you can go from zero to this, that's an afternoon well spent, imo. It's not trivial, as it assumes a lot of knowledge about system architecture and network protocols, but the questions above necessarily have answers, so I can guarantee you can find them with some directed effort. I don't mean to trivialize more advanced analysis, this isn't the same thing, but as an entry point, this is how I would recommend approaching it.

myself248 · on March 29, 2023

That's an incredibly useful model for how to approach the problem! And it sounds like exactly the questions I find myself asking about random suspected-malware, which is often precisely your original example -- a burned CD included with some aliexpress hardware.

I'm familiar with 'strings' and I've been playing with 'binwalk' to take apart files, but I'm out of my depth when it comes to loading something up in a debugger or whatever (is ghidra a debugger or what's the difference?) and looking at code. I don't speak C, and everything seems to look like C when it's shown in the examples of these things. How do I know if I'm looking at a sensible decompilation with actual runnable code or just gibberish because I'm trying to interpret a jpeg as an executable?
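One cheap sanity check for the "is this even an executable?" question is to look at the first few bytes of the file before loading it anywhere. Below is a minimal sketch; the table is a tiny hand-picked subset of well-known magic numbers, and the `identify` / `identify_file` helpers are hypothetical names, not part of any tool mentioned here.

```python
# A few well-known file signatures ("magic bytes"). Deliberately incomplete.
MAGIC = [
    (b"\x7fELF", "ELF executable or shared object"),
    (b"MZ", "DOS/Windows PE executable"),
    (b"\xcf\xfa\xed\xfe", "Mach-O 64-bit executable"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"PK\x03\x04", "ZIP archive (also jar, apk, docx)"),
]


def identify(header: bytes) -> str:
    """Return a description for the first matching signature, else 'unknown'."""
    for magic, description in MAGIC:
        if header.startswith(magic):
            return description
    return "unknown"


def identify_file(path: str) -> str:
    """Read just enough of a file to compare against the table above."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))
```

If this says JPEG, any "code" a disassembler shows for that file is almost certainly noise. Raw firmware dumps often have no header at all, so "unknown" does not by itself mean the file isn't code.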

I don't know if that makes me teachable or beyond help, but I'd be an eager student.

doktrin · on March 29, 2023

> I'm out of my depth when it comes to loading something up in a debugger or whatever (is ghidra a debugger or what's the difference?)

When you hear "debugger", think "breakpoints". It's any tool that lets you do things like set breakpoints and step through code execution.

Most debuggers will let you view the machine code or bytecode being executed, but they won't decompile it back into a higher-level language.

Ghidra does include a basic debugger, but it can also do lots of other stuff (including decompilation).

> I don't know if that makes me teachable or beyond help, but I'd be an eager student.

It would probably help to get some baseline familiarity with systems programming. Check out the "15-213" CS course. The lectures are on YT, the reference book is probably online, and the labs are here:

https://www.cs.cmu.edu/~213/labs.html

pmoriarty · on March 29, 2023

"I don't speak C, and everything seems to look like C when it's shown in the examples of these things."

If you know how to program you could probably already make sense of a lot of C, and for the rest you could try asking an AI to explain it to you.

andai · on March 29, 2023

And if you learn a bit of assembly first, C will seem like a high level language again!

extrememacaroni · on March 29, 2023

When it comes to stuff that results in calls to dynamically linked libs e.g. OpenFile or whatever, you can also use Frida to intercept the calls and print out info about them/manipulate the inputs and/or return values. The advantage of Frida is that it uses JS to do this.

You need to run the executable to do this tho so maybe use a VM.

I used frida a few times to do random stuff like making foobar2000 always play the same mp3 regardless of what is in the playlist, and made a game's speed adjustable by intercepting calls to system gettime and changing the value.

Use Ghidra to check what the executable imports and intercept found functions in Frida.
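To show the idea behind the game-speed trick without needing Frida installed, here is the same pattern as a plain Python monkey-patch: wrap the clock function and hand back altered values. This is an analogy, not Frida's API; `make_scaled_clock` and `install` are made-up names, and Frida does the equivalent on native functions inside another process.

```python
import time


def make_scaled_clock(original, factor):
    """Wrap a clock so time elapsed since wrapping appears multiplied by factor."""
    start = original()

    def scaled():
        # Real elapsed time since the hook was created, stretched or squeezed.
        return start + (original() - start) * factor

    return scaled


def install(factor):
    """Swap time.time for a scaled version; returns the original for restoring."""
    original = time.time
    time.time = make_scaled_clock(original, factor)
    return original
```

Anything in the same process that calls `time.time()` after `install(2.0)` sees time passing twice as fast, which is roughly what happens when a game's calls to the system time function are intercepted. Restore with `time.time = original`.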

DyslexicAtheist · on March 29, 2023

> I'm sure someone who is more serious about this will say, "that's misleadingly simple!" but once you have a why and a what, the how is a work problem.

loved your post. I'm by far a lot less experienced for sure. There is one thing in this sentence that stood out because my order of what to address first is always what and how.

Only at the end the why (e.g. motive) might or might not become visible. It has saved me from jumping to premature conclusions (or attribution) in the past ...

Most "who-dunnit" genre of films are based on making you believe the why is the ultimate goal. For me though the why is a by-product of addressing the what/how and I find things remain smoother and with fewer rabbit holes to get lost in.

andai · on March 29, 2023

>most malware isn't obfuscated or using novel techniques, a lot of it is just spyware capabilities you clicked through to accept

How does an antivirus tell the difference between e.g. TeamViewer and a repackaged app with a RAT?

hegzploit · on March 29, 2023

here's one way I love to think about it: a RAT will go all the way to try and persist, hide from AV, and load other components from some remote endpoint. It will trigger so many events that can be detected by an AV. On the other hand, TeamViewer will not try to hide what it's doing. There's also a lot more at play here, since this is just heuristic analysis; AVs tend to be more complex and incorporate more methods of analysis like signature-based detection, integrity checking, etc.
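For contrast with the heuristic side, the simplest possible form of signature-based detection is an exact hash lookup. The sketch below is a toy under that assumption: real AV signatures are much richer than whole-file hashes, and the `known_bad` mapping and its label are invented for the example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a blob of bytes."""
    return hashlib.sha256(data).hexdigest()


def check_signature(data: bytes, known_bad: dict):
    """Return the label for an exact hash match, or None if there is none."""
    return known_bad.get(sha256_hex(data))


# Hypothetical sample and label, for illustration only.
sample = b"pretend this is a repackaged installer"
known_bad = {sha256_hex(sample): "Hypothetical.RAT.Sample"}

hit = check_signature(sample, known_bad)             # exact copy: matches
miss = check_signature(sample + b"\x00", known_bad)  # one extra byte: no match
```

The miss on a one-byte change is why exact hashes alone are easy to dodge, and why the behavioural checks described above matter. The same lookup against a list of known-good hashes is a basic integrity check.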

j-bos · on March 29, 2023

I humbly second this ask.

cancerhacker · on March 29, 2023

You can get a long way just by running /usr/bin/strings against an executable, and maybe a platform specific version of otool -L. You should have a basic idea of how your OS does linking, shared libraries, etc.
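If it helps to see what strings is doing, the core of it is just "find runs of printable bytes". A rough Python sketch of that idea, assuming ASCII only and a minimum run length of 4; `extract_strings` is a made-up helper and it does not handle the wide/UTF-16 strings common in Windows binaries.

```python
import re


def extract_strings(data: bytes, min_len: int = 4):
    """Return runs of at least min_len printable ASCII bytes, in file order."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


def extract_strings_from_file(path: str, min_len: int = 4):
    """Same as above, reading the whole file into memory first."""
    with open(path, "rb") as handle:
        return extract_strings(handle.read(), min_len)
```

Hostnames, library names and file paths tend to show up in this output unless someone went out of their way to hide them.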

youngtaff · on March 29, 2023

Yes please do

Tried Ghidra for the first time on the weekend to look at some 8051 firmware

Got stuck with disassembly as it seems to be misinterpreting some data sections as code - can see English strings in a hex editor but Ghidra is trying to convert them to asm

HelloNurse · on March 29, 2023

You are supposed to annotate what every part of the file is and how you want to display it. It's usually easy to distinguish reasonable assembler code from nonsense instructions interspersed with undecodable islands.

Disassembling all sections just in case they contain code is a common conservative policy for disassemblers: even without any malicious payload-hiding tricks, a section that is definitely never executed as-is could still contain embedded executable code.

youngtaff · on March 31, 2023

Thanks, I'll try that approach

It's been a while since I've looked at asm in anger so it's taking me a while to get back into it (plus this is a side project ATM)