
While this doesn't so much apply to libcurl (but see below), there is a third alternative to "write everything in C" or "write everything in <some other safer language>". That is: use a safer language to generate C code.

End users, even those compiling from source, will still only need a C compiler. Only developers need to install the safer language (even Curl developers must install valgrind to run the full tests).

Where can you use generated code?

- For non-C language bindings (this could apply to the Curl project, but libcurl is a bit unusual in that it doesn't include other language bindings; they are supplied by third parties).

- To describe the API and generate header files, function prototypes, and wrappers.

- To enforce type checking on API parameters (e.g. all the CURL_EASY_... options could be described in the generator, and that description could then be turned into some kind of type-checking code).

- Any other time you want a single source of truth in your codebase.

We use a generator (written in OCaml, generating mostly C) successfully in two projects: https://github.com/libguestfs/libguestfs/tree/master/generat... https://github.com/libguestfs/hivex/tree/master/generator




> generate C code.

Programmatically generating C code is not without problems. How can you prove that the C you're generating is free of the problems solved by the safer language? Cloudbleed came from computer-generated C code: https://blog.cloudflare.com/incident-report-on-memory-leak-c....


No, it didn't.

See quote from the author of Ragel in the comments:

There is no mistake in ragel generated code. What happened was that you turned on EOF actions without appropriate testing. The original author most certainly never intended for that. He/She would have known it would require extensive testing. Legacy code needs to be tested heavily after changes. It should have been left alone.

PLEASE PLEASE PLEASE take some time to ensure the media doesn't print things like this. It's going to destroy me. You guys have most certainly benefitted from my hard work over the years. Please don't kill my reputation!


+1

And I'd like to add that what made this a catastrophic error was that different requests were served in the same address space, rather than using the address-space isolation of the old process-per-request/fork() architectures. For years now many network daemons have been written in an event-based, single-address-space style, but I have never seen the alleged process-creation overhead quantified (except maybe for multi-threaded programs). Even OpenBSD's httpd disses e.g. CGIs as "slowcgi", when you'd expect the OpenBSD developers to take pride in the fact that their httpd uses ASLR and the other features of the O/S rather than inventing their own ad-hoc mechanisms to defeat deterministic memory allocation in user space, and to take the opportunity to tune O/S process creation. I don't have facts to share either; I'm just puzzled that we're re-inventing O/S mechanisms in user space with performance arguments, without backing this up with numbers (or are there any?).


Well, the general point still applies. The bug occurred using code that was written in a safe language and compiled to C. It's just that there are multiple ways for that to go wrong. The generator could have had a bug -- it's software, so it almost certainly does. Or, as in this case, the user didn't use it correctly. Either way, the idea that you can write code in a safe language and compile to C to eliminate the type of bugs that C allows isn't true.

Are such errors less likely? Possibly so, but they're not categorically eliminated. It becomes a risk assessment exercise rather than a simple thing that everyone should do. Note that it also opens the door to Java-style problems, where once the generator becomes ubiquitous, it becomes the most valuable target for exploit-hunting because a vulnerability in the generator gets the keys to all the houses.


You are arguing that no language X is safer than writing the program manually in Y when the program in X is compiled to Y, because the compiler from X to Y may have bugs.

Therefore no code written in Rust (X) and executed on an x86 CPU (Y) is safer than manually written x86 assembly, because the Rust compiler (and LLVM) may have errors.

And well, we can actually go deeper. There is the CPU frontend that generates microcode, which may have bugs. There is also the CPU backend that executes the microcode, which may also have bugs. All in all, there is no hope in programming: there might be bugs everywhere, so you can never be sure what your program does.


That's not what I'm saying. I'm saying "rewrite it in Rust (or whatever)" isn't some silver bullet that fixes security problems. It's always about assessing risk -- both risk of security issues as well as risk of upsetting your users, etc. Basically exactly what the article says.


> Either way, the idea that you can write code in a safe language and compile to C to eliminate the type of bugs that C allows isn't true.

is a somewhat different statement from:

> I'm saying "rewrite it in Rust (or whatever)" isn't some silver bullet that fixes security problems.

The first one is wrong, the second one is true.

Using a higher level language rules out some classes of programming errors which are possible in lower level languages. The fact that compilers have bugs does little to diminish those gains.

The semantics of Haskell do not allow you to express a program that performs a double free [0]. Perhaps one of the compilers will compile some Haskell code to a binary that frees memory twice. However, such a compiler bug is far less likely than a programmer making this mistake in C. What's more, once that compiler bug is detected and fixed, the problem is fixed in all affected code bases without any need to change the original source code. Thus the chances of bugs are lower.

Nobody really argues that Rust (or OCaml, or Haskell, or whatever) is a silver bullet, i.e. a solution to all problems that will miraculously make programmers produce no bugs at all. Obviously we will have software bugs even with the most restrictive languages. No amount of formal proofs will save us from misunderstanding specifications or making typos. And then again, we will also have bugs in the implementations of those high-level abstractions.

And for the record I am really annoyed with movement to rewrite everything in Rust.

[0] Yes, you can call free through the FFI with whatever arguments you like, as many times as you like. But for the sake of brevity, let's assume this is not how you write your everyday Haskell.


The hope is to write a formal description of the required functionality and then validate a proof that the implementation matches it. That's not 100% safe against non-deterministic issues or very complex ones, but it is good against most others.


So no code is safe? All code has to be lowered before execution to some evil, unsafe language, most commonly the assembly language of the targeted CPU.

The mystical process of "programmatically generating code" is also known as compilation. The case you are describing is a compiler bug: the compiler wasn't able to generate target code (in this case C code) with the semantics and/or guarantees of the source language.


More generally, I don't understand this argument. Assuming you can trust the C compiler (a big if, but at least some validated compilers for (large subsets of) C exist; see CompCert), I don't get why this would be worse than generating machine code in a safe language.


Generating C code from which (waving hands here) machine code is then generated is more complex than just generating machine code.


This is simply not true. C in this case is just an intermediate representation of the source program. Going through multiple intermediate representations is fairly standard practice when compiling anything. If anything, it is easier to target C than to generate target CPU assembly directly, because of the high-level nature of C (you finish the compilation earlier, without the last couple of lowering steps).


You're forgetting the elephant in the room: undefined behaviour.

Sure, you can target one compiler and be sure you'll be generating the desired machine instructions, but it can be much more difficult to ensure that your code will produce safe machine code when compiled with all possible C compilers, and the techniques used may result in a slower end result.

If you go straight from a high-level language to a compiler IR, you have a much lower risk of having to choose between either underspecifying your invariants or overspecifying them at the expense of performance.

TL;DR: C wasn't designed as a compiler IR and that complicates things.


I agree with that. It is tempting to top it by saying C wasn't designed to do anything well and that this has complicated things over the last 45 years. On the other hand, it's not like there has been a traditional wealth of wonderful ready-made IRs with cross-architecture backends for your high-level language to choose from either, so I'm still not convinced that compiling to C is harder than compiling to machine code yourself, especially in the common case where you don't have to get the last ounce of possible performance out.


Well, we can agree to disagree about this, but in my experience third-party tooling (helpful debugging symbols in particular) suffers when there are extra intermediate languages. Extra metadata needs to be passed through more layers of abstraction.

And as a human I have had the same issues acting as a meat-implemented debugger. I had to drill through more layers to figure out why low level things happened.


Of course metadata is lost if it is not encoded anywhere along the way. The argument was about code generation being more complex when the code is saved at the intermediate (C) level.


> if not encoded anywhere

You're understating the problem a bit. There's no standard way to mark up C code as mapping back to the original source code so that metadata (source lines, memory mapping back to data structures) can be passed on to the compiled binaries. If the original language generated DWARF-encoded objects, then debuggers would just work, etc.


Compiling X to C and then C to assembly is not more complex than compiling X straight to assembly. In your original comment you wrote that the complexity of such a setup is bigger, to which I responded: no, not really.

Yes, C was not designed to be an intermediate compilation step, and this loses some information (e.g. debugging metadata, and some semantics of the source language may get lost too). I never argued with that. I never said that this is a perfect setup that doesn't introduce any new problems. I just said that compiling to C is very close to what actually happens inside a compiler targeting assembly from a higher-level language.


You just have a narrower scope of what counts as complexity. Mine includes things that complicate humans' and debuggers' understanding and analysis of the final binary.

The techniques and difficulty in implementing the compiler itself are related but not really the same subject.


No, we cannot prove that. However, it is still better than the "write it in C" option, because once you fix a bug in the generator, it's fixed in all current and future generated code. In other words, we no longer make the same mistakes over and over again.


> How can you prove that the C you're generating is free from problems solved by the safer language?

By formal verification. There are ways to do so and several verified compilers already exist.


FFTW[0] is also written like that (generator written in OCaml emitting C).

[0] http://www.fftw.org/


> generate C code.

How is that different from just writing it in another language? End users who need to compile will be able to regardless of the generated C code, but the end users who need to make a _little_ modification will be given ugly generated C code! Seems strictly worse to me...


In the libguestfs generator (first link above) the generated C code is required to be completely readable. It must look like it was written by hand (albeit by a programmer who is impossibly consistent and perfect). So reading the generated C code is fine. Modifying the generated code is of course not fine except for tiny test hacks, but we also include in the generated code comments reflecting where in the generator the code comes from.


I've created a number of code generators in my projects. Invariably, developers ask exactly what you just wrote: "how do I modify the generated code?"

The answer is not to modify the generated code. Modify the input to the code generator to make changes.

Even when I output a warning to this effect, saying that all modifications to the target code will get overwritten and that the target code should not be checked into version control (the source code is already checked in), invariably developers modify the target code right under the comment that says not to, then check it into version control. They then wonder why there are bugs, and why their modified target code no longer works after it gets regenerated on the next build.


It's almost as though you can't solve the problem of programmers making errors by having a different set of programmers whom you tell to not make errors.


The issue had nothing to do with programmers.

The impetus wasn't that programmers make errors, but to solve the problem of repeatability. Many instances of an issue can be solved once; there is no need to recreate the solution a number of times if it is already solved.

A code generator allows one to focus on the actual meta-problem, which is often smaller and easier to solve.


The difference is that you don't need a compiler for this language. There are many hardware platforms that only come with a C compiler.


For something like curl, where the library is as popular as the command line tool, preserving the C ABI compatibility is probably the strongest reason.


Rust could expose a C ABI while keeping safe internals. The interface itself would be unsafe, of course. There are a few things that Rust doesn't handle natively (like varargs functions IIRC), but other than that you could probably write a Rurl that would be completely backward compatible with Curl.


To be clear, Rust itself does not have varargs, but can handle them with the C ABI.


Well, we can call into vararg functions, but not define them.

Since vararg functions have the same ABI as the corresponding function taking just one of the varargs, one idea I've always had is to write a macro that lets you write a one-arg function and have it desugar via asm hax.


Yes... the post above me was talking about OCaml. Similar arguments apply for not redoing curl in Go.


You do still need a compiler for the language. It's just that the target language is C, instead of assembly.


I wonder how Cloudflare feels about that? Ragel



