
While this doesn't so much apply to libcurl (but see below), there is a third alternative to "write everything in C" or "write everything in <some other safer language>". That is: use a safer language to generate C code.

End users, even those compiling from source, will still only need a C compiler. Only developers need to install the safer language (even Curl developers must install valgrind to run the full tests).

Where can you use generated code?

- For non-C language bindings (this could apply to the Curl project, but libcurl is a bit unusual in that it doesn't include other language bindings; they are supplied by third parties).

- To describe the API and generate header files, function prototypes, and wrappers.

- To enforce type checking on API parameters (e.g. all the CURL_EASY_... options could be described in the generator, and that description could then be turned into some kind of type-checking code).

- Any other time you want a single source of truth in your codebase.

We use a generator (written in OCaml, generating mostly C) successfully in two projects: https://github.com/libguestfs/libguestfs/tree/master/generat... https://github.com/libguestfs/hivex/tree/master/generator




> generate C code.

Programmatically generating C code is not without problems. How can you prove that the C you're generating is free of the problems solved by the safer language? Cloudbleed came from computer-generated C code: https://blog.cloudflare.com/incident-report-on-memory-leak-c....


No, it didn't.

See quote from the author of Ragel in the comments:

There is no mistake in ragel generated code. What happened was that you turned on EOF actions without appropriate testing. The original author most certainly never intended for that. He/She would have known it would require extensive testing. Legacy code needs to be tested heavily after changes. It should have been left alone.

PLEASE PLEASE PLEASE take some time to ensure the media doesn't print things like this. It's going to destroy me. You guys have most certainly benefitted from my hard work over the years. Please don't kill my reputation!


+1

And I'd like to add that what made this a catastrophic error was that different requests were served in the same address space, rather than using the address-space isolation of the old process-per-request/fork() architectures. For years now many network daemons have been written in an event-based, single-address-space style, but I have never seen the alleged process-creation overhead quantified (except maybe for multi-threaded programs). Even OpenBSD's httpd disses e.g. CGIs as "slowcgi", when you'd expect the OpenBSD developers to take pride in the fact that their httpd uses ASLR and the other features of the O/S rather than inventing their own ad-hoc mechanisms to defeat deterministic memory allocation in user space, and to take the opportunity to tune O/S process creation. I don't have facts to share either; I'm just puzzled that we're re-inventing O/S mechanisms in user space with performance arguments, without backing this up with numbers (or are there any?).


Well, the general point still applies. The bug occurred using code that was written in a safe language and compiled to C. It's just that there are multiple ways for that to go wrong. The generator could have had a bug -- it's software, so it almost certainly does. Or, as in this case, the user didn't use it correctly. Either way, the idea that you can write code in a safe language and compile to C to eliminate the type of bugs that C allows isn't true.

Are such errors less likely? Possibly so, but they're not categorically eliminated. It becomes a risk assessment exercise rather than a simple thing that everyone should do. Note that it also opens the door to Java-style problems, where once the generator becomes ubiquitous, it becomes the most valuable target for exploit-hunting because a vulnerability in the generator gets the keys to all the houses.


You are arguing that no language X is safer than writing the program manually in Y when the program in X is compiled to Y, because the compiler from X to Y may have bugs.

Therefore no code written in Rust (X) and executed on an x86 CPU (Y) is safer than manually written x86 assembly, because the Rust compiler (and LLVM) may have errors.

And well, we can actually go deeper. There is the CPU frontend that generates microcode, which may have bugs. There is also the CPU backend that executes the microcode, which may also have bugs. All in all, there is no hope in programming: there might be bugs everywhere, so you can never be sure what your program does.


That's not what I'm saying. I'm saying "rewrite it in Rust (or whatever)" isn't some silver bullet that fixes security problems. It's always about assessing risk -- both risk of security issues as well as risk of upsetting your users, etc. Basically exactly what the article says.


> Either way, the idea that you can write code in a safe language and compile to C to eliminate the type of bugs that C allows isn't true.

is a somewhat different statement from:

> I'm saying "rewrite it in Rust (or whatever)" isn't some silver bullet that fixes security problems.

The first one is wrong, the second one is true.

Using a higher level language rules out some classes of programming errors which are possible in lower level languages. The fact that compilers have bugs does little to diminish those gains.

The semantics of Haskell do not allow you to express a program that performs a double free [0]. Perhaps one of the compilers will compile some Haskell code to a binary that frees memory twice. However, such a compiler bug is far less likely than a programmer making this mistake in C. What's more, once that compiler bug is detected and fixed, the problem is fixed in all affected code bases without any need to change the original source code. Thus the chances of bugs are lower.

Nobody really argues that Rust (or OCaml, or Haskell, or whatever) is a silver bullet, i.e. a solution to all problems that will miraculously make programmers produce no bugs at all. Obviously we will have software bugs even with the most restrictive languages. No amount of formal proofs will save us from misunderstanding specifications or making typos. And then again, we will also have bugs in the implementations of those high-level abstractions.

And for the record I am really annoyed with movement to rewrite everything in Rust.

[0] Yes, you can call free through the FFI with whatever arguments you like, as many times as you like. But for the sake of brevity, let's assume this is not how you write your everyday Haskell.


The hope is to write a formal description of the required functionality and then validate a proof that the implementation matches it. That's not 100% safe against non-deterministic issues or very complex ones, but it is good against most others.


So no code is safe? All code has to be lowered before execution to some evil, unsafe language, most commonly the assembly language of the targeted CPU.

The mystical process of "programmatically generating code" is also known as compilation. The case you are describing is a compiler bug: the compiler wasn't able to generate target code (in this case C code) with the semantics and/or guarantees of the source language.


More generally, I don't understand this argument. Assuming you can trust the C compiler (a big if, but at least some validated compilers for (large subsets of) C exist; see CompCert), I don't get why this would be worse than generating machine code in a safe language.


Generating C code from which (waving hands here) machine code is then generated is more complex than just generating machine code.


This is simply not true. C in this case is just an intermediate representation of the source program. Going through multiple intermediate representations is fairly standard practice when compiling anything. If anything, it is easier to target C than to generate target CPU assembly directly, because of the high-level nature of C (you finish the compilation earlier, without the last couple of lowering steps).


You're forgetting the elephant in the room: undefined behaviour.

Sure, you can target one compiler and be sure you'll be generating the desired machine instructions, but it can be much more difficult to ensure that your code will produce safe machine code when compiled with all possible C compilers, and the techniques used may result in a slower end result.

If you go straight from a high-level language to a compiler IR, you have a much lower risk of having to choose between either underspecifying your invariants or overspecifying them at the expense of performance.

TL;DR: C wasn't designed as a compiler IR and that complicates things.


I agree with that. It is tempting to top it by saying C wasn't designed to do anything well and that this has complicated things over the last 45 years. On the other hand, it's not like there has been a traditional wealth of wonderful ready-made IRs with cross-architecture backends for your high-level language to choose from either, so I'm still not convinced that compiling to C is harder than compiling to machine code yourself, especially in the common case where you don't have to get the last ounce of possible performance out.


Well, we can agree to disagree about this, but in my experience third-party tooling (helpful debugging symbols in particular) suffers when there are extra intermediate languages. Extra metadata needs to be passed through more layers of abstraction.

And as a human I have had the same issues acting as a meat-implemented debugger. I had to drill through more layers to figure out why low level things happened.


Of course metadata is lost if it is not encoded anywhere along the way. The argument was about code generation being more complex when the code is saved at the intermediate (C) level.


> if not encoded anywhere

You're understating the problem a bit. There's no standard way to mark up C code as mapping back to the original source code so that metadata (source lines, memory mapping back to data structures) can be passed on to the compiled binaries. If the original language generated DWARF-encoded objects, then debuggers would just work, etc.


Compiling X to C and then C to assembly is not more complex than compiling X straight to assembly. In your original comment you wrote that the complexity of such a setup is bigger, to which I responded: no, not really.

Yes, C was not designed to be an intermediate compilation step, and this loses some information (e.g. debugging metadata, and some semantics of the source language may get lost too). I never argued with that. I never said that this is a perfect setup that doesn't introduce any new problems. I just said that compiling to C is very close to what actually happens inside a compiler targeting assembly from a higher-level language.


You just have a narrower scope of what counts as complexity. Mine includes things that complicate humans' and debuggers' understanding and analysis of the final binary.

The techniques and difficulty in implementing the compiler itself are related but not really the same subject.


No, we cannot prove that. However, it is still better than the "write it in C" option, because once you fix a bug in the generator, it's fixed in all current and future generated code. In other words, we no longer make the same mistakes over and over again.


> How can you prove that the C you're generating is free from problems solved by the safer language?

By formal verification. There are ways to do so and several verified compilers already exist.


FFTW[0] is also written like that (generator written in OCaml emitting C).

[0] http://www.fftw.org/


> generate C code.

How is that different from just writing it in another language? End users who need to compile will be able to regardless of the generated C code, but the end users who need to make a _little_ modification will be given ugly generated C code! Seems strictly worse to me...


In the libguestfs generator (first link above) the generated C code is required to be completely readable. It must look like it was written by hand (albeit by a programmer who is impossibly consistent and perfect). So reading the generated C code is fine. Modifying the generated code is of course not fine except for tiny test hacks, but we also include in the generated code comments reflecting where in the generator the code comes from.


I've created a number of code generators in my projects. Invariably, developers ask exactly what you just wrote: "how do I modify the generated code?"

The answer is not to modify the generated code. Modify the input to the code generator to make changes.

Even when I output a warning to this effect, saying that all modifications to the target code will get overwritten and that the target code should not be checked into version control (the source code is already checked in), invariably developers modify the target code right under the comment that says not to, then check it into version control. They then wonder why there are bugs, and why their modified target code no longer works after it gets regenerated on the next build.


It's almost as though you can't solve the problem of programmers making errors by having a different set of programmers whom you tell to not make errors.


The issue had nothing to do with programmers.

The impetus wasn't that programmers make errors, but to solve the problem of repeatability. Many instances of an issue can be solved once; there is no need to recreate the solution a number of times if it is already solved.

A code generator allows one to focus on the actual meta-problem, which is often smaller and easier to solve.


The difference is that you don't need a compiler for this language. There are many hardware platforms that only come with a C compiler.


For something like curl, where the library is as popular as the command line tool, preserving the C ABI compatibility is probably the strongest reason.


Rust could expose a C ABI while keeping safe internals. The interface itself would be unsafe, of course. There are a few things that Rust doesn't handle natively (like varargs functions IIRC), but other than that you could probably write a Rurl that would be completely backward compatible with Curl.


To be clear, Rust itself does not have varargs, but can handle them with the C ABI.


Well, we can call into vararg functions, but not define them.

Since vararg functions have the same ABI as the corresponding function taking just one of the varargs, one idea I've always had is to write a macro that lets you write a one-arg function and have it desugar via asm hax.


Yes... the post above me was talking about OCaml. Similar arguments apply for not redoing curl in Go.


You do still need a compiler for the language. It's just that the target language is C, instead of assembly.


I wonder how Cloudflare feels about that? Ragel



