Snowman: native code to C/C++ decompiler (derevenets.com)
250 points by ingve on July 25, 2017 | 48 comments



The radare2 [1] project is also working on a decompiler, which uses the ESIL [2] intermediate language as a source and lifts it to RadecoIL, which is then simplified and transformed into C. The missing parts are now mostly Memory SSA, C AST generation (partially done), and type inference. The decompiler itself is written in Rust and uses radare2 as the source of ESIL and other metainformation. Using ESIL as a source will allow us to implement support for different architectures, not only the common ones. Currently we're running RSoC - Radare Summer of Code [3] - and hope that our 2 students will make significant progress on both the Rune (symbolic execution on top of ESIL) and Radeco projects. And we are always happy to welcome new potential contributors to all the underlying projects, including radare2 itself. If you want to help us, please join the #radare IRC channel or the #radare Telegram channel [4]. The sources of Radeco are located at https://github.com/radare/radeco-lib

[1] http://rada.re

[2] https://radare.gitbooks.io/radare2book/content/disassembling...

[3] http://radare.today/posts/RSOC-2017/

[4] https://telegram.me/joinchat/ACR-FkEK2owJSzMUYjt_NQ


I'll preface this by saying that I love radare2. It's my go-to tool when I don't need to share work with IDA/Binja users and don't need to decompile something.

The radeco project is a train wreck. The current state of radeco-lib (unless it's been remediated in the last month) is disappointing, and the only reason it compiles is that the last SoC student appears to have commented out the bindings that radeco is meant to use to get radeco-lib to do anything. I actually spent an evening attempting to undo that absurd series of commits, but after getting a lot of the commented-out code back in place, not being a Rust programmer, I hit roadblocks I did not understand regarding types and traits.

Unsolicited advice incoming. Please keep a close eye on your RSoC students this year. Their goal of producing something they can present does not necessarily align with the ongoing health of your project. I'd also love it if you would drop Rust and work in a more accessible language, at least while you work toward an initial version that spits out something resembling C code. Ultimately it's your project, so do whatever you want, but IMHO asking everyone to understand an inherently complex project in a language which is not straightforward is not the best option. Or at least add some documentation and make your lib and program build together...


Thanks for the great work. radare2 and binwalk[1] helped me debug quite a few things at work lately.

[1] https://github.com/devttys0/binwalk


I'm glad to see a new decompiler, but it looks like it isn't an optimizing decompiler like Hex-Rays yet.

I tested the IDA plugin and it happily gave me very long lines like this:

    esp74 = reinterpret_cast<void*>(reinterpret_cast<int32_t>(__zero_stack_offset()) - 0x104 - 4 - 4 - 4 - 4 - 4 + 4 - 4 + 4 - 4 + 4 - 4 + 4 - 4 + 4 - 4 + 4 - 4 + 4 - 4 + 4 - 4 + 4 - 4 + 4 - 4 + 4 - 4 + 4 - 4 + 4 - 4 + 4 - 4 + 4 - 4 - 4 + 4 - 4 - 4 + 4 - 4 - 4 + 4 - 4 - 4 + 4 + 20);
Edit:

Furthermore there seem to be some correctness issues, or at least misleading output.

If a string is modified at runtime (for example, for obfuscation purposes) and then passed as an argument, Snowman will show the original string literal directly, like foo("incorrect", 23), instead of using an opaque variable like foo(some_var, 23).
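
A minimal sketch of that situation (the fix-up scheme and all names are invented for illustration):

    #include <stdio.h>
    #include <stddef.h>

    void foo(const char *s, int n) { printf("%s %d\n", s, n); }  /* stub */

    int main(void) {
        char buf[] = "jodpssfdu";     /* the bytes as stored in the binary */
        for (size_t i = 0; buf[i]; i++)
            buf[i] -= 1;              /* runtime deobfuscation -> "incorrect" */
        foo(buf, 23);                 /* showing foo("jodpssfdu", 23) here would
                                         be misleading; foo(buf, 23) is honest */
        return 0;
    }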


Looks like it needs a constant folding pass, yep.
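
For the curious, "constant folding" here would just mean summing those literal stack adjustments into a single net offset at analysis time. A toy sketch (the term list is invented; it is not Snowman's actual IR):

    #include <stdio.h>
    #include <stddef.h>

    int main(void) {
        /* additive terms like the ones in the esp74 expression above */
        long terms[] = { -0x104, -4, -4, -4, -4, -4, +4, -4, +4, +20 };
        long folded = 0;
        for (size_t i = 0; i < sizeof terms / sizeof terms[0]; i++)
            folded += terms[i];
        /* a folding pass would emit this single offset instead */
        printf("net stack offset: %ld\n", folded);
        return 0;
    }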


I saw that too with the standalone binary --- it seems to be a failure to understand local variables on the stack.


Another open source decompiler is fcd:

https://zneak.github.io/fcd/

I quite like the author's blog about the development of the decompiler, as it gives a lot of insight into how it works and what academic literature it draws on.

You can also find a video of a talk the author (Felix Cloutier) gave at the Security Open Source workshop:

https://www.youtube.com/watch?v=h1NP-DV4GVQ


Sorry for the maybe silly question, but why does one need a decompiler? Isn't it easier to look at disassembly from tools like objdump? The example from the Hello World decompilation does not look significantly more readable to me than a disassembly (given some basic knowledge of assembler).


A good decompiler can have a massive impact on the readability of the code. For example, here's a study where the authors found that their decompiler allowed students without reverse engineering expertise to approach the performance of RE experts on some tasks.

https://net.cs.uni-bonn.de/fileadmin/ag/martini/Staff/yakdan...

Sadly, DREAM++ has never been released open source :(


If there's a binary of it, you could use it to decompile itself. A kind of reverse bootstrapping.


Sadly there is no binary, only papers. Now, if someone could come up with a technique for automatically creating source code from a PDF description... :D


It's easier to turn back into editable code, change it, recompile it, or port it to another processor without having to implement a binary translator; just a few ideas.


It's easier to read pseudo-C than assembly, especially on an architecture you're not terribly familiar with. I can fire up an ARM decompiler (Hex-Rays) and be able to do some basic RE work in minutes on an architecture I've never touched assembly for. It's also just plain faster to read. The nice graph views for assembly do help you understand how loops and conditionals are really structured, but they're still not as good as C.


Coincidentally, a colleague of mine tried it yesterday and ended up with a better decompilation than Hopper's. Hopper wouldn't catch a loop and would display it weirdly, while Snowman just worked. I've been wondering whether Binary Ninja would have gotten good results, but there is no demo for 64-bit binaries.

Unfortunately for me, I'm stuck with Hopper, as Snowman is Windows-only.


What do you use a decompiler for? Just for fun, or is it part of your work?


For fun :) There is this challenge that is ending today: https://github.com/kudelskisecurity/cryptochallenge17/blob/m...


Has anyone tried training an RNN on high-level language <-> assembly pairs?

Would be cool if it could even guess variable names from patterns it has seen before, like x,y,z for vector structs.


I dunno... If you took a few random textbook physics problems, and replaced all of the nouns and units with arbitrary consistent strings, do you think that it would be possible to tell an electromagnetic problem apart from a plumbing problem? What if the subject is an electric pump?


It seems like we're at the point where something might be doable, but a setup like

    assembly -> RNN -> C

would be asking a bit much IMO. A more tractable approach might be to develop something like Snowman that outputs many possible equivalent C ASTs, and then train something that tries to choose the one that matches the original code.


IMHO the "holy grail" of decompilation is to decompile the compiler, compile the decompiled decompiler, and get back a functioning decompiler that can also decompile itself ad infinitum. After several iterations, it may reach a fixed-point... this is essentially the exact opposite of what's customarily done with compilers: compile the compiler with itself, and repeat with the self-compiled compiler until a fixed-point is reached.

Thus, I naturally tried this one on itself, but that didn't work so well --- it spent several minutes analysing, then crashed.

Then I picked something slightly easier, which it did manage to decompile successfully, but the output is... not exactly what I expected. Copious void pointers of various levels of indirection (plenty of "three-star-programmer" code...) and reinterpret_cast sprinkled everywhere (I have the original code and it was written in C, so amusingly enough it decided to automatically convert it to C++), along with the inability to recognise accesses to local variables, leading to long sequences of -4-4-4-4+4-4-4+..., mean that for me it's not really all that better than reading the Asm directly.

The latter test was with a binary compiled by a very old compiler, so I suspect something built with the newest optimising compilers will produce even more confounding output.

That said, it's great to see plenty of decompilers being written and released publicly; I remember around 2 decades ago when any mention of decompilation would be met with disdain and chants of "that's impossible!" Hex-Rays and IDA may have spurred a lot of this development; but speaking from experience, cracking groups have always written their own private decompiler-ish tools, mainly featuring dataflow analysis.


And it has been integrated into the awesome x64dbg for quite a while. :)


There's also a (simple) plugin for radare2!

https://github.com/radare/radare2-extras/tree/master/r2snowm...


It can be installed using radare2's internal package manager:

    r2pm -i r2snow


Huh, that looks a lot like OllyDbg. Do you know how it compares to Olly/IDA/Binja?


Newer versions of OllyDbg never caught on, and a stable, working x64 version never materialized. The main community for Olly sort of died with 32-bit XP, unfortunately. x64dbg has largely replaced it for 64-bit code, and for many of its common features and plugins for 32-bit code on 64-bit OSes too. Unlike Olly, x64dbg is also open source, which has led to some interesting features, though it's still largely the work of Mr.Exodia as far as I know.

Of course, debuggers like x64dbg fit a niche for when you need to do a lot of dynamic work, like dumping data or unpacking an executable. They're not static analysis tools like IDA or Binja, which you'll probably want if you're analyzing algorithms and trying to understand the inner workings of something.

EDIT: I have a [dead] reply saying everyone moved to Immunity that I'd like to respond to. There's some truth to that in some communities, but the catch is that Immunity never got good x64 support either. To my knowledge, it's basically just an Olly fork with some additional scripting features; it doesn't solve the 64-bit problem or the 32-bit-on-64-bit-OS problem.


I guess it wouldn't even exist if OllyDbg x64 were a (non-alpha) thing. x64dbg provides a number of plugins to fill in features missing relative to IDA/Olly: https://github.com/x64dbg/x64dbg/wiki/Plugins


Looks like a very interesting project, but maybe there's a misunderstanding about the license; from the readme:

> x64dbg is licensed under GPLv3, which means you can freely distribute and/or modify the source of x64dbg, as long as you share your changes with us.

Should probably read: "... as long as you make a genuine offer of providing the source code and changes to those you distribute your version of x64dbg to."

In practice it of course makes sense to upstream changes, but there's nothing in the GPL about that.


This is in fact on purpose. Basically I stated my intent of using GPL.


That's fine, and it is of course how many projects use the GPL in practice. But as it reads in the readme, it sounds like the GPL doesn't allow someone to fork the project, port it to, say, OS X or ARM, and sell the changed fork to a customer without giving the changes back upstream. In fact, the porter would only have to offer sources to the customer; the customer would then be free to upstream them, but under the GPL there's no legal compulsion to do so.

Anyway, I guess I would have reworded it somewhat to make it more obvious that the source is under the plain GPL, but that the project welcomes and encourages upstreaming changes, as opposed to the code being under a modified GPL.


Ahh, I see. That looks very interesting, thanks, though Windows-only.


Could you be more specific about which features are missing?


I used x64dbg recently because I had to debug Win64 code, and neither OllyDbg nor IDA Free supports it.

I like the x64dbg UI a lot more; I loved the persistent breakpoints, and it was just very easy to pick up and work with. I found one shortcoming: it doesn't have an inline memory viewer, but it can export a memory section to a file.


Generally I use the memory map as a detached window over the CPU view; that way you can check out the memory in one of the dump tabs...


I would love to see support for this on godbolt.org. It would be really fun to see an optimizer's output expressed as C. For example, you could easily see the results of strength reduction, where something like "x / 2" is compiled into "x >> 1".
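
A toy pair showing that transformation (note it's only this simple for unsigned values; signed division needs a correction for negative operands, which is exactly the kind of detail such a view would surface):

    #include <stdio.h>

    unsigned half_div(unsigned x)   { return x / 2;  }   /* what you wrote */
    unsigned half_shift(unsigned x) { return x >> 1; }   /* what the optimizer emits */

    int main(void) {
        printf("%u %u\n", half_div(10), half_shift(10));  /* prints: 5 5 */
        return 0;
    }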


https://derevenets.com/examples.html

So, a decompiler is cool and all, but...a five-line "Hello World" program turned into a 144-line decompiled program. Is that an accomplishment? I'm pretty sure the "reconstructed" C from that is longer than the assembly.

EDIT: Just to confirm, this is what I got when I put the Hello World code into "hello.c" and ran GCC against it:

  gcc -O2 -S -c hello.c
hello.s:

          .file   "hello.c"
          .section        .rodata.str1.1,"aMS",@progbits,1
  .LC0:
          .string "Hello, World!"
          .text
          .p2align 4,,15
  .globl main
          .type   main, @function
  main:
  .LFB11:
          .cfi_startproc
          subq    $8, %rsp
          .cfi_def_cfa_offset 16
          movl    $.LC0, %edi
          call    puts
          xorl    %eax, %eax
          addq    $8, %rsp
          .cfi_def_cfa_offset 8
          ret
          .cfi_endproc
  .LFE11:
          .size   main, .-main
          .ident  "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-18)"
          .section        .note.GNU-stack,"",@progbits


You haven't linked the program in your assembly example. All the extra code you see there is a result of the libc startup code. Decompilers start from the entry point (not your program's main), which is why there's so much extra code. If you look at just the code starting from main, you get something much simpler:

    int64_t puts = 0x4003e6;

    void func_4003e0(int64_t rdi) {
        goto puts;
    }

    int64_t main() {
        func_4003e0("Hello, World!");
        return 0;
    }


Alright, that makes much more sense. Thanks!


> So, a decompiler is cool and all, but...a five-line "Hello World" program turned into a 144-line decompiled program. Is that an accomplishment? I'm pretty sure the "reconstructed" C from that is longer than the assembly.

Yeah, as you've discovered, there's a lot of "magic" going on behind the scenes to establish the standard C runtime environment that executes your main() function.

Here's a great writeup that illustrates the, uh, extensive amount of work that happens after spawning a new process but before main() is called:

http://dbp-consulting.com/tutorials/debugging/linuxProgramSt...
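
As a rough conceptual sketch of that startup path (this is not glibc's real code; the actual _start is assembly and does far more, e.g. TLS setup and running constructors):

    #include <stdio.h>
    #include <stdlib.h>

    static int my_main(void) {       /* stands in for your main() */
        puts("Hello, World!");
        return 0;
    }

    static void start(void) {        /* stands in for _start/__libc_start_main */
        /* initialize the C runtime, then call main and hand its return
           value to exit() so that atexit handlers still run */
        exit(my_main());
    }

    int main(void) { start(); }      /* real programs enter at _start instead */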


For those who are curious and in Berlin:

There is an event this Thursday (July 27, 2017) where the author of this tool will be talking about decompilation.

https://www.meetup.com/LLVM-Social-Berlin/events/241197713/


Do you know if this talk will be recorded? I would love to watch but it is a bit far from NYC :)


Yes, we are going to record it, but publishing is up to the speaker. I will post the link here if it happens.


From the examples:

    int64_t puts = 0x4003e6;

    void func_4003e0(int64_t rdi) { goto puts; }

What is this? Is there some compiler that will actually accept this use of goto? Is it just a convention meant for human consumption to translate jump instructions with no translated target? Is it a bug in the decompiler?


Strange. Could it be related to labels as values? https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
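
For reference, the extension looks like this (GCC/Clang only; && takes a label's address and goto * jumps through it):

    #include <stdio.h>

    int main(void) {
        void *target = &&done;   /* address of a label (GCC extension) */
        goto *target;            /* computed goto */
        puts("skipped");
    done:
        puts("done");
        return 0;
    }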


Could you compile a Go program and then decompile it to C? I can see actual uses for that, like porting to old OSes.


Decompiled programs are full of approximations and guesses, and generally can't be recompiled without manual tweaking of the source code.

For example:

    mov eax, 0x0040137C

Is this an address or a numeric constant?

Should it be translated to:

1) var = 4199292;

2) var = &SomeGlobalVariable;

If you're lucky, you might get an unambiguous answer from the relocation table embedded in your executable.

So you're probably better off writing a new backend for the Go compiler (one generating machine code for the new target, or C code).

The resulting program will have far fewer dangerous approximations.
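
To make the ambiguity concrete, here's a hypothetical pair; both assignments can compile down to the same "mov eax, 0x0040137C" if the linker happens to place the global at that address:

    int SomeGlobalVariable;   /* suppose the linker puts this at 0x0040137C */

    void f(void) {
        unsigned var1 = 0x0040137C;        /* reading 1: numeric constant (4199292) */
        int *var2 = &SomeGlobalVariable;   /* reading 2: address of a global */
        (void)var1; (void)var2;
    }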


I can see the benefits, but I doubt it would be that simple. The Go code you decompiled would depend on a Go runtime, and that specific Go runtime has dependencies on OS libraries. For example, when you open a file in Go, I'd imagine that functionality is built on top of the file-handling functionality of Windows/OSX/Linux. You could work around these dependencies, but it's probably less hassle to port the Go runtime to the new OS.


Doesn't the Go runtime get linked in, so it would just get decompiled too? It would generate a huge blob of C, but the idea is just to port.


Not everything that the program needs to run is included in the binary.

Let's use a more concrete example: say we write a Go program that copies a file. On Windows, this might use an API call like CopyFile:

https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...

If you decompile the compiled Go program into C, it'd still have references to API calls like this. These APIs would have to be implemented on the new OS for the decompiled program to work without modification.
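
A minimal sketch of what such a surviving reference looks like (CopyFileA is the real Win32 call; the filenames are made up):

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        /* this stays a call into the OS even after decompilation; a new
           target OS would have to provide an equivalent implementation */
        if (!CopyFileA("src.txt", "dst.txt", FALSE))
            fprintf(stderr, "CopyFile failed: %lu\n", GetLastError());
        return 0;
    }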



