Some Were Meant for C: The Endurance of an Unmanageable Language [pdf]

jasode · on Sept 5, 2017

This is a long paper and the author has 2 main claims:

1) C Language popularity is more to do with cognitive ease of memory addresses as a conceptual model for inspection and change. Author claims memory address mental model overshadows runtime performance.

2) switching to "safe" languages like Java/C#/Rust is not necessary. With no changes/violations to existing C Language specification, a new/different implementation (compiler) can add more runtime safety checks similar to managed languages. An example from the paper:

>Consider unchecked array accesses. Nowhere does C define that array accesses are unchecked. It just happens that implementations don’t check them. This is an implementation norm, not a fact of the language.

Those 2 ideas look orthogonal but he ties them together at the end.

I'll take some poetic license (e.g. a little exaggeration) to reword the author's idea to help spur discussion...

Consider the idea of the Sufficiently Smart Compiler[1] that claims that a "slow" and "high-level" language like Python/Ruby could be theoretically analyzed and compiled to be as fast as C or handcrafted assembly.

In a way, the author is coming from the opposite direction. If you had a "Sufficiently Smart Runtime" for a new C Language compiler implementation, it could (theoretically) do all sorts of extra checks and bookkeeping that wouldn't require any changes to C source code and wouldn't violate the existing C Language standard. (E.g. Imagine a new C runtime that did many checks similar to Valgrind + UBSAN + ASAN + debugger memory fences, etc.)

Would the program execution be slower? Well yes, but that's not really an issue because according to author's claim #1, what programmers really like about C is the mental ease of accessing memory addresses. The performance is important, but it's a secondary benefit -- according to the author.

[1] http://wiki.c2.com/?SufficientlySmartCompiler

nickpsecurity · on Sept 6, 2017

The problem is that it isn't a new idea. People keep trying it as shown below. Unfortunately, C wasn't designed so much as a modified version of something (BCPL) that was the only collection of features Richards could get to compile on his crappy hardware. It's not designed for easy analysis or safety. So, all the attempts are going to hit problems in what legacy code they can support, their performance, or even effectiveness if about reliability/security in a pointer-heavy language. Compare that to Ada, Wirth's stuff, or Modula-3 to find they don't have that problem or have much less of it because they were carefully designed balancing the various tradeoffs. Ada even meets author's criteria for safe language with explicit memory representation despite him saying safe languages don't have that.

To back that up with references, first is a bunch of attempts at safer C's or C-like languages with performance issues. The next two are among most recent and practical at memory safety for C apps far as CompSci goes. The last one is an Ada book that lists by chapter each technique its designer used to systematically mitigate bugs or vulnerabilities in systems code.

https://pdfs.semanticscholar.org/a890/a850dc78e65e26f8f4def4...

https://llvm.org/pubs/2006-06-12-PLDI-SAFECode.html

https://www.cs.rutgers.edu/~santosh.nagarakatte/softbound/

http://www.adacore.com/uploads/technical-papers/SafeSecureAd...

jstewartmobile · on Sept 6, 2017

1% inspiration, 99% perspiration. Needs more sweat.

ChuckMcM · on Sept 6, 2017

Excellent, I think the author would do well to re-frame the question as you have. If nothing else to put it more clearly into the space of provable compilation.

When I transferred into the "Oak" group that later became the "Java" organization, the team I was on was looking at whether or not you could write an OS in Java sort of in spite of its safety rules. This sort of concept has been revisited by Rust with its safe/unsafe modal operation.

What both of those efforts have in common is that determining safety may be impossible at the construct level but provable if you were to exhaustively search all possible outcomes.

What the paper and your comment add to the discussion is the intriguing idea that you could create a 'safe' backend (say the equivalent of the JVM) as a target for a C compiler. And code that could not be compiled would be flagged for later analysis. Much like VHDL can express hardware that cannot by synthesized you might end up with a C compiler that could compile code that could not be executed. I could be fun to spend a bit of time poking around that rabbit hole.

mike_hearn · on Sept 6, 2017

It's a very intriguing idea and you may be pleased to know that such research lines are still being explored:

http://ssw.jku.at/General/Staff/ManuelRigger/ManLang17.pdf

marktangotango · on Sept 6, 2017

The idea you can have a safe c compiler or runtime seems totally absurd to me. Why is any of this even considered seriously?

Koshkin · on Sept 6, 2017

Indeed. Even the "obvious" example of the compiler inserting bound checks in the generated code does not work with the well-known method of marking the beginning of a variable-length memory block at the end of a struct using an array of some fixed size, say, 1 (or even 0).

ktRolster · on Sept 6, 2017

For that situation, you would need to mark the end of the variable-length memory block at run-time.

WalterBright · on Sept 6, 2017

> mental ease

I'm a long time C programmer, and I was struck by how clumsy and error-prone any manipulation of C strings turns into. It's really hard to look at a mass of strlen/strcpy/memcpy/etc. and see just what is happening. Contrast that with, say, BASIC or Javascript, where string manipulation is easy, natural, and bug-free.

I'm going to disagree about the mental ease of programming in C, and a large part of that is difficulty in building useful abstractions around the pointer model.

ktRolster · on Sept 6, 2017

That particular problem (strlen/strcpy/memcpy) comes from the problems of the standard library string functions. It can be solved by creating your own string library. Then string manipulation is easy.

rurban · on Sept 6, 2017

This problem was actually solved, but almost nobody uses it. Safe variants of most of those string, memory, io, wchar, stdlib and misc functions are defined in the C11 standard Annex K (finally after 9 years), but nobody is using it, and rather propose to keep using known unsafe variants like the truncating versions with an n. Like snprintf and not the safe variant sprintf_s. glibc, bsd, darwin, musl, newlib: nobody cares to implement the safe bounds checking variants. They solely rely on the compile time size checks, which fail to check any dynamic boundaries. Only Microsoft, Android, Cisco and Embarcadero implement the safe libc functions. I recently took over Cisco's safelibc (MIT licensed) and extended it to more platforms, all C11 api's, and an improved testsuite. And boy was I surprised to find so many missing API's, upstream libc bugs and wrong API's everywhere. Flawless were only musl and the BSD's. But musl is lacking with it's errno and of course zero C11. Only ReactOS has a proper testsuite for their libc. Glibc is somewhat ok, but I still find crashes daily.

https://github.com/rurban/safeclib

So why is nobody else implementing C11? I'll write a blog post when I finished my C11 efforts. Maybe at least FreeBSD will take it then.

pjmlp · on Sept 8, 2017

Annex K is not safe, just pretends to be.

By tracking the pointer and sizes as separate function arguments, the possibility of mixing parameters, leading to memory corruption is still there.

This is the major motivation why almost nobody uses it and it was made into an optional annex.

rurban · on Sept 10, 2017

No. The major motivation not to use it was _FORTIFY_SOURCE with it's compile checks for compile-time known buffer sizes and it's accompanying _chk functions. This leaves out all dynamic buffers.

You cannot mix PTR + LONG args without serious compile-time errors

pjmlp · on Sept 10, 2017

I don't have any idea how _FORTIFY_SOURCE works, other than it is GCC specific and as such no place in ANSI C.

What I know is that having something like strcpy_s() does not provide any actual safety, because with the prototype "strcpy_s(char * restrict s1, rsize_t s1max, const char * restrict s2)" there is no guarantee that s1max is a valid size for s1.

rurban · on Sept 12, 2017

This is what the _chk functions do. In most cases it know the compile-time size of s1. But in dynamic cases the _s functions are far better than the truncating 'n' versions. Read the rationale.

rurban · on Sept 7, 2017

Done here: https://rurban.github.io/safeclib/doc/safec-3.0/d1/dae/md_do...

WalterBright · on Sept 6, 2017

That falls over as soon as you integrate with anybody else's C code, including the operating system APIs, and with C string literals :-(

If it was as easy as you say, it would have happened.

And heaven knows I wrote my own string packages, one after the other, and so did everyone else. I eventually abandoned all of them. C's abstraction abilities are simply not good enough to do a decent string encapsulation.

wahern · on Sept 6, 2017

No other language solves this perfectly either, certainly not in a way that interoperates _across_ languages and environments.[1] Which is pretty much the whole point of the article. But what C excels at is the ability to write code which can examine and work with the representation of most string-like objects exported from any environment. The difficulty of doing so is a function of how opaque and complex the alien implementation.

I gave up on trying to solve strings in C applications a long time ago, too, much as you have. I did so not because I found C too inexpressive, but because I realized that I was trying to shoe-horn too many concepts into a "string". A string is almost by definition the wrong data structure--either too abstract or not abstract enough--for almost everything. Not coincidentally, that was about the same time I stopped abusing regular expressions for parsing data.

[1] Even C++ didn't solve this. We're still in the midst of a std::string ABI compatibility break in the C++ ecosystem. Granted, it's been about 12 years since the last one, but these last fairly long because systems software (i.e. infrastructure software) has a really long tail.

cozzyd · on Sept 6, 2017

Not to mention that in C++ there are plenty of string implementations predating std::string (e.g QT's QString, ROOT's TString)

ktRolster · on Sept 6, 2017

shrug It doesn't fall over. I've done it, the openBSD team has done it. DJB has done it. Maybe something is wrong with your implementation that I can help you with?

WalterBright · on Sept 6, 2017

I'm curious. Got links?

ktRolster · on Sept 6, 2017

OpenBSD takes a fairly minimalist approach, which is vaguely described here: http://www.freebsdforums.org/forums/showthread.php?threadid=... They basically replace the unsafe functions with things that are easier to use. Their idea is that it isn't the format of the C-string that causes security issues (null-terminated string), it's the poorly defined functions (with weird corner cases that are hard to get right). It's worked well for their use cases.

DJB did something similar in qmail, I don't recall the details but you can look at the source code as easily as I can, and it eliminated security problems.

When I'm working in Java, I find that most of my string parsing uses the split() function. This is a pain in C, because even if you had a split() function you'd need to deal with memory allocations. Most of these are solved with a memory pool. In my own library, I also added runtime, grammar-based parsing functionality. So to parse a CSV line you might do something like this:

    char *g = " S   -> WORD | WORD , S;"
              "WORD -> [^,]";
    results = parsegram(g, inputString);

Grammar parsing + memory pools makes string parsing in C easier than in Java. The biggest difficulty with this kind of library is to do it right, you need to be something of a unicode expert, and that's tough.

WalterBright · on Sept 6, 2017

I used snprintf(), too, but it is only a minor improvement. Problematic in C is something as simple as concatenating strings:

    Mystring s,t;
    t = "hello";
    t = cat(s,s);
    t = cat(s,s,s);
    t = cat("hello",s);
    t = cat(s,"world");
    t = cat("hello","world");

Even such a simple use case is fraught with major problems:

1. who allocates needed memory?

2. who free's it?

3. can the compiler constant fold cat("hello","world") ? Does the result wind up allocating memory anyway?

4. what about the lack of function overloading to handle the permutations?

JdeBP · on Sept 6, 2017

Here's roughly what that would look like using Bernstein's C string library (which was not only used in qmail).

    #include "stralloc.h"
    ...
    static stralloc s, t;
    ...
    if (!stralloc_ready(&s, 0)) die_nomem();

    if (!stralloc_copys(&t, "hello")) die_nomem();

    if (!stralloc_copy(&t, &s)) die_nomem();
    if (!stralloc_cat(&t, &s)) die_nomem();

    if (!stralloc_copy(&t, &s)) die_nomem();
    if (!stralloc_cat(&t, &s)) die_nomem();
    if (!stralloc_cat(&t, &s)) die_nomem();

    if (!stralloc_copys(&t, "hello")) die_nomem();
    if (!stralloc_cat(&t, &s)) die_nomem();

    if (!stralloc_copy(&t, &s)) die_nomem();
    if (!stralloc_cats(&t, "hello")) die_nomem();

    if (!stralloc_copys(&t, "hello")) die_nomem();
    if (!stralloc_cats(&t, "world")) die_nomem();

WalterBright · on Sept 6, 2017

Yes, that does work. But it's not without problems, not the least of which it's just not attractive to look at. For example, concatenating "hello" and "world" allocates memory, when it should instead give you a "helloworld" string literal. In fact, simply initializing `s` with a string literal needlessly allocates memory, and that's anti-ethical to performance. Calling die_nomem() leaks memory if it does anything but terminate the program. All those tests for memory exhaustion are tedious.

clarry · on Sept 6, 2017

> Even such a simple use case is fraught with major problems: > > 1. who allocates needed memory? > > 2. who free's it?

That's also a major feature. It allows people to write systems that are resilient in the face of tight memory limitations. It's not cool when a language forces string operations to allocate & duplicate memory willy-nilly.

> 3. can the compiler constant fold cat("hello","world") ? Does the result wind up allocating memory anyway?

I fail to see how that's a major problem. Why are you concatenating string literals? How common is that?

> 4. what about the lack of function overloading to handle the permutations?

I consider lack of overloading to be a feature. Overloading is one of the things that are way too easily abused, and it makes code auditing harder than it needs to be. Please just type out the different function names so I can see exactly what is going to be called when I read the code. Or use the sprintf family of variadic functions.

mike_hearn · on Sept 6, 2017

It's the opposite. I've seen lots of code written in C that pretends to be out of memory safe. I've never once seen such a program that actually is out of memory safe. Invariably the codepaths triggered by malloc returning null are never exercised.

With a GC and exceptions you can theoretically be quite resistant to OOM conditions, not that anyone really cares.

pg314 · on Sept 6, 2017

> I've never once seen such a program that actually is out of memory safe. Invariably the codepaths triggered by malloc returning null are never exercised.

sqlite takes care to correctly deal with out of memory conditions. It has explicit tests for that code too. See section 3.1, Out-Of-Memory Testing, of [1].

[1] https://sqlite.org/testing.html

mike_hearn · on Sept 6, 2017

Now I found my first program that actually tests it properly :)

I knew you had to systematically drive the code through every OOM codepath to even have a shot at doing that in an unmanaged language. Sadly a lot of C code is written by people who think:

    if ((ptr = malloc(sizeof(struct foo))) == null)
        return -1;

is the same thing as being OOM safe.

clarry · on Sept 6, 2017

One of the things with tight memory systems is that you don't use malloc to begin with, if you can avoid it. C gives you the option.

When you're concatenating strings, you already have storage for those strings. Maybe you can re-use that storage. Maybe you have a static buffer. Maybe you have a fixed size buffer on the stack and the stack use is bounded.

A language that forces you into making redundant duplicates onto the heap is terrible in these situations.

And yes there are programs that try to deal with failing mallocs. Again, C gives you the option.

WalterBright · on Sept 6, 2017

Very, very few C programs can handle running out of disk space. This includes the operating system(s). Get close to filling up the disk, and try various things.

Just recently, I was having a lot of trouble with Windows Update hanging. I finally noticed that free disk space was low. Freed up more space, and WU started working again.

For fun, try:

    #include <stdio.h>
    int main() { printf("hello world\n"); return 0; }

and redirect stdout to a file on a device that is full. Amazingly, it succeeds!

ktRolster · on Sept 6, 2017

I assume you're referring to OpenBSD here, they didn't use snprintf(). They used asnprintf(), which solves the problem of who should allocate (but not who should free).

WalterBright · on Sept 6, 2017

From the link:

"That means that we have been going through the tree cleaning out all calls to sprintf(), strcpy(), and strcat(). Instead, these things are being rewritten to use asprintf(), snprintf(), strlcpy(), and strlcat()."

Maybe the author made a typo.

ktRolster · on Sept 6, 2017

Oh yeah, you're right.

Another thing I've done that will work if you have a lot of strcat(), is make a string struct:

    ktString {
       int len;
       int memlen;
       char *str;
    }

It keeps track of the string's actual length, and the size of the underlying buffer. Then you can 'override' the various string functions:

    bool ktStrcat(ktString s1, ktString s2);
    bool ktSprintf(ktString s1, ...);

These functions will take care of buffer-size checking, and reallocation if necessary. For cases where you need to interface with pre-existing libraries, you can return a cstring(). Make it a function/macro to enable you to change the struct definition in the future:

     #define ktCstr(x) (x)->str

then you can pass it into write() or whatever you need:

    write(sock, ktCstr(s), s->len);

lomnakkus · on Sept 6, 2017

... and end up with silent truncation unless you happen to always remember to use only C library functions with explicit length arguments (and which do not assume NUL-terminated strings).

Look, I get that there is a place for C, but string manipulation is absurdly bad and error-prone.

ktRolster · on Sept 6, 2017

Hi! I can't imagine how you understood what I wrote. I specifically said to not use those C library string functions.

I fully admitted that string manipulation is absurdly bad and error-prone, then built on that by showing a way to make it better. Use ktStrcat() instead of strCat(), then you don't have to worry about truncation. Use ktSprintf() instead of snprintf(), then you don't have to worry about truncation. I wish you had understood.

lomnakkus · on Sept 7, 2017

Yes, I agree. If everyone would just avoid those C stdlib functions everything would be peachy. :)

I was agreeing with you, but just adding caveats. :)

Well, except... some problems surface when interfacing with "things" (libraries, OS'es) written by other people... and there's no escaping those problems, fundamentally. It's C. Of course UTF-8 was invented with the express purpose of being "C-compatible", but... what happens if you have a string with a NUL in it and you pass that to the POSIX (I think?) printf function as an argument for a "%s" format string? Well, it gets truncated. Did you mean for that to happen, or didn't you? Who knows? That's the problem.

Honestly, I'm not trying to win "internet points" or something. It's just that C, as I'm trying to point out, is a bad language for almost everything that's required for a "user-facing" languages these days. Write the thing in C#, Java, O'Caml, Qt[1], or Haskell, or whatever... but please don't think you need to write in a sort of weird approximiation of the old PDP.

[1] Yeah, yeah, not a language, but it's at least an ecosystem that seems to be moderately successful.

camgunz · on Sept 6, 2017

I think the mental model isn't the issue, it's that the C Standard Library is very anemic. When writing a C application either you're using a big library like APR or GLib or you're rolling your own, and since rolling your own is a pretty big, complicated, and fraught proposition it's no surprise bugs creep in. Furthermore you can't really interop with other libraries if they also rolled their own data structures because theirs probably aren't like yours. Consequently libraries tend not to do that at all, setting for things like NULL-terminated lists and special, opaque data structures.

I feel like if someone wants to throw C a life vest, they should start with a meaningful standard library that engineers can build on to provide functionality we pretty much consider standard now (HTTP libraries, JSON libraries, database libraries) with a consistent interface.

carlmr · on Sept 9, 2017

Completely agree. It's like when people say programming Python is fast. No it just has almost everything pre built and you glue it together.

C's standard library is just sad.

ktRolster · on Sept 6, 2017

It's not just the mental ease, it's also the physical typing ease (and in some cases, the possibility).

For example, he points out that to connect C to existing parts of the system (which is the OS and OS level tools), all you have to do is call the functions. If you want to call a C library from a Java program, it's a lot more work. Furthermore, C has the capability of understanding Java structures (although it's awkward), but Java has no way of understanding C structures from within the language. There is no way to model a driver I/O port in Java, but in C there is.

The paper is worth thinking about. If you are creating a language, take interoperability between already existing languages into consideration. JNI is ok, but think how much better it could be if it did auto-marshalling of objects!

mike_hearn · on Sept 6, 2017

That's not inherent to the Java language. You can implement garbage collectors and kernels in Java if you extend the JIT compiler:

http://jnode.org/

svachalek · on Sept 5, 2017

I've been out of the C world for a long, long time but it seems to me that anywhere that C's pointer arithmetic and ability to cast pointers to/from other types is objectively appealing, that's going to be one of the cases that a compiler can't understand.

Of course there's always the subjective "everything looks like a nail" usage as well, which makes every problem seem like a pointer problem because you've never tried to think of them as anything other than a pointer problem. I'm sure you could cater to that usage with a proper runtime but really, it doesn't hurt to try new things sometimes...

dbcurtis · on Sept 6, 2017

In my case, nearly 100% of the C code I write is for embedded systems. Casting a hex literal to a pointer type that is a volatile hardware register is better than dropping into asm....

So yes, compilers will always have a hard time understanding device drivers and such unless you turn hardware device concepts into language primitives.

swiley · on Sept 6, 2017

A more correct thing would probably be to create linker scripts that expose symbols for the registers. It's probably not worth the trouble now but the hypothetical compiler would understand it better.

speleo_engr · on Sept 6, 2017

There still needs to be a description of the underlying hardware behavior somehow. The hardware engineers often give you a somewhat correct Excel sheet or force you to look at the HDL to figure it out.

bsder · on Sept 6, 2017

C's popularity is due to the fact that it is predictable within certain bounds (single thread or limited concurrency).

No GC pauses, no weird runtime crashes due to a strange constructor, no gigantic exception chains, etc.

The only languages in the TIOBE index that can even try to make that claim are: C at #2, C++(if you subset it) at #3, Objective-C/Swift(#18/#11), Assembly at #14, Ada at #29, and maybe FORTRAN(#35).

That's not a lot of options if you need runtime predictability. Basically C, C(with additions), C(with additions), assembly(hack, spit), Ada (okay), and FORTRAN (God help you).

Even now, that means C or Ada--and the first free Ada compiler was 1992.

Yes, Rust is coming. But it's got a way to go yet.

mike_hearn · on Sept 6, 2017

The idea that C is predictable is in my view a sign of someone who hasn't got to know C really well.

The trends around undefined behaviour will hopefully put a bullet in the head of this idea for good. It's extremely hard to look at C and reason about what an optimising compiler will turn it into.

Malloc is not more predictable than a GC pause. Both malloc and free can take unpredictable amounts of time. If anything it's less predictable because modern GCs at least have pause time targets, but mallocs never do. You just don't notice it because people don't tend to measure malloc latency. In turn that's because malloc pauses only affect memory allocation operations, they don't stop every thread, which is a benefit it's true, but it's not about predictability and more about UI latency.

C not having exceptions doesn't make it more predictable. It just means that if something goes wrong you get a useless and probably corrupted core dump. The number of times I've been able to fix a bug in a piece of managed code given only a stack trace from the end user is huge. The number of times I've been able to fix a bug given "Segmentation fault" with no other info is zero.

bsder · on Sept 6, 2017

> The trends around undefined behaviour will hopefully put a bullet in the head of this idea for good. It's extremely hard to look at C and reason about what an optimising compiler will turn it into.

Sure when you turn on -Oinfinity. Nobody does that in embedded unless they are hard pressed on some metric (RAM size, generally, or CPU flops occasionally).

Overall, though, C is really fairly predictable. Unsigned arithmetic does what you expect--the fact that signed arithmetic doesn't under higher optimizations is a fairly recent phenomenon (and not an uncontroversial one). Variables go where you expect. Pointers act like you expect. Casting and precedence sometimes sneak up on you, but parentheses generally manage that.

Const has issues at the boundary cases. Trying to stuff something into ROM and then telling the rest of the system that "really-no-you-cant-cast-that" can make things tricky with "incompatible pointer" issues.

Floating point arithmetic, though, is just a disaster.

> Malloc is not more predictable than a GC pause.

Ayup. And what's the first thing real-time embedded folks do? Throw out malloc (which is library, not language, but that's pedantic). Real-time-embedded systems tend to allocate all memory statically, up-front. Or they use a custom malloc that they control the behavior of.

> C not having exceptions doesn't make it more predictable. It just means that if something goes wrong you get a useless and probably corrupted core dump.

Predictable and useful are orthogonal.

And, the fact that I can't attach to running state of a crashed program is a failure of TOOLS not the language. The fact that I can't attach to a system that crashed, examine the state, fix what I need to, and continue is a fault of the people who make C IDE's. There is no reason other than lack of monetary incentive that this cannot be done.

eadmund · on Sept 6, 2017

Once upon a time it was common to write non-consing Lisp code precisely in order to get predictable behaviour; I think that it worked pretty well. Non-consing code won't have GC pauses; it won't have weird runtime crashes; and it probably wouldn't have gigantic exception chains unless it needs them.

benlorenzetti · on Sept 6, 2017

C's popularity is due to the fact that it is predictable within certain bounds (single thread or limited concurrency).

Your post, and reading a discussion further down about Rust's reference counting, has made me realize something primitive that Rust is getting right--a real move forward--which even those who don't enjoy the default "safety switch" being flipped from C (like me) may agree.

The C model for memory in time and space is so clean for heap data and the function call stack for one thread (plus global registers), but C has no community-understood/concurring model when it comes to concurrency.

Rust, older C++ libraries, C malloc implementations, and other are all alluding toward the simple memory model for multiple threads, which is reference counted pointers, IMO. Basically use a separate type of pointer for heap data, where the max size of the heap is divided by whatever binary power of 2^p processors exist.

Rust folks or other languages are welcome to add more ownership semantics or whatever, but the whole family of languages could benefit by this extension to the lingua Franca of C. We may not even need to add a new nominal pointer type to C, just by fiat understand and expect shared, free store objects to always live inside the lowest 1/p th portion of the word address space.

Volt · on Sept 6, 2017

The fact that there are these other languages with the same properties means that predictability isn't the real reason, right? It's that it also is sparse in its specification and easy to implement a compiler for.

bsder · on Sept 6, 2017

Please name those languages. I'm serious.

pjmlp · on Sept 8, 2017

Ada, Modula-2, Pascal dialects, Algol 68, PL/M, PL/8, PL/S, NEWP...

fulafel · on Sept 6, 2017

Isn't this a circular argument: C is popular because no other language on the top popularity chart does <x>.

There are many languages with the these properties and better safety, but they aren't popular like C.

bsder · on Sept 6, 2017

Really? I'd love their names. I'm not being snarky here.

I'd love to have a nice language alternative to C.

fulafel · on Sept 7, 2017

I'm going to link to pjmlp's comment who knows more about these: https://news.ycombinator.com/item?id=14700251#14701140

I was more referring to the historical perspective of how C became popular, many have fallen by the wayside. Though there are certainly current alternatives to C besides those on TIOBE.

Also there are real-time extensions to current GC'd languages like Java.

(Though standard C isn't very predictable timing-wise either, or suited to real-time work)

bsder · on Sept 7, 2017

And, with the exception of Ada and Pascal, most of those language have been dead for at least 20 years--for various good reasons.

And, please do remember that Apple switched away from Pascal when writing its operating systems in spite of an enormous code base. That's pretty damning--apparently C's "undefined behavior" didn't seem to matter.

So, we're back to: the only alternative to C is Ada.

> Though there are certainly current alternatives to C besides those on TIOBE.

Let me make it easy. Give me a list of languages that have been used to build an operating system in a product in the last 20 years. It doesn't have to be Linux, even a small RTOS counts.

I'll start the list:

C family--C, C++, ObjC/Swift

Forth(?)--probably counts as it runs on pretty bare metal

Ada--not sure anybody has used it to build an OS, but I don't debate that they could

Rust--has a feature set of articles about this

Pascal--the original Lisa and Macintosh OS (probably stretching that 20 year limit a bit).

And?

pjmlp · on Sept 8, 2017

Apple switched away from Pascal due to UNIX market pressure.

http://basalgangster.macgui.com/RetroMacComputing/The_Long_V...

And it was mostly to C++, not C.

Pascal is used daily for embedded system work by MikroElektronika customers using mikroPascal.

https://www.mikroe.com/mikropascal/

Any embedded application using Ada's Ravenscar profile, is an OS.

https://en.wikipedia.org/wiki/Ravenscar_profile

Additionally regarding Pascal, it was used to build Corvus Systems OS, MicroEngine and Solo OS.

Modula-2 was used to build Lilith and Delco's engine control units.

Mesa was used to create Xerox Star workstation.

ESPOL followed by NEWP was used for Burroughs B5500 in the 60's, nowadays still sold by Unisys as ClearPATH.

IBM created their RISC architecture OS using PL/8, with a compiler architecture that now re-resurfaced in LLVM.

OS/400, nowadays known as IBM i, was originally written in PL/I.

If it wasn't for UNIX's adoption, C would have joined many of those languages many moons ago.

bsder · on Sept 9, 2017

So, C, Pascal, and Ada with a smidge of effectively dead languages.

Okay, sad to know I'm not missing anything.

fulafel · on Sept 7, 2017

I am happy with your list of languages to write an OS in, maybe add D and Oberon. I'd point out that you can also use managed languages, see MS Singularity, or the various Lisp and Smalltalk operating systems, or the UCSD P-system, etc - there is a list at https://en.m.wikipedia.org/wiki/Language-based_system .

Counting new commercial operating systems is not a useful benchmark as they are very rare, and we already agreed that the alternatives are not popular.

tyingq · on Sept 5, 2017

"Consider the idea of the Sufficiently Smart Compiler[1] that claims that a "slow" and "high-level" language like Python/Ruby could be theoretically analyzed and compiled to be as fast as C or handcrafted assembly."

Nim seems to be trying to fit that space.

pcwalton · on Sept 5, 2017

This is another article overanalyzing the success of C, when in fact the reason for the success of C is very simple and obvious: Unix was free and in a lucky position in 1973; Unix got popular; C is the language of Unix; therefore C got popular. There is no inherent benefit in C that, for example, a somewhat modified version of Pascal or Algol wouldn't have inherited. And these kinds of articles always ignore the fact that in the past decade or so, C and C++ have been declining in popularity. By and large, new programmers are not learning C the way they were in the '90s. For better or worse (personally, I think, for the better), they're starting with JavaScript, Python, Ruby, or even PHP.

I'm highly skeptical of the conclusion that what we need is a new safer implementation of C, too. Switching to a new compiler is a very high burden for a lot of projects, and at the end of the day they're still left with all the problems of C, like header files, no namespaces, terrible standard library, etc. etc. (Even adding compiler switches is a high burden, which is why Linux distros took so long to widely deploy basic things like -fstack-protector.) By contrast, switching to a new language (or incrementally writing new components in a new language, which is how this always goes in practice) is also a very high burden, but the benefits are larger: you don't have to deal with all the problems of C.

In my view, this is why safer versions of C have repeatedly failed over the years, while new languages have flourished. Migration to a new language or a new compiler is expensive no matter what, so teams will only do it if they see enough benefit to justify the expense of doing so. Merely adding some amount of safety to C isn't worth it, but the large safety and productivity gains you can get from a different language can be.

dig1 · on Sept 5, 2017

> Switching to a new compiler is a very high burden for a lot of projects ... By contrast, switching to a new language ... is also a very high burden, but the benefits are larger

I guess switching to a new compiler (or newer version of the same vendor) is much less burden than switching to the new language.

Don't forget that all "new safe languages" are simply new. Why people like C is familiarity: known practices and known issues to avoid ironed out through 30 years of usage.

Although Rust/JavaScript/your-favorite-new-language brings on table fixes for known C issues, they introduces many unknown things. Remember Java? It promised compile-once/run-everywhere, automatic memory management approach, but introduced bloat, extremely hard to catch GC leaks no one talks about, JVM implementation differences (Oracle JVM vs IBM JVM vs OpenJDK speed) and JVM security issues only few can fix.

> why safer versions of C have repeatedly failed over the years

I guess this will be like giving to skilled hunter a toy water gun - simply a different mindset. Imagine unsafe python with pointers and mallocs; how python devs would deal with it?

pcwalton · on Sept 5, 2017

> I guess switching to a new compiler (or newer version of the same vendor) is much less burden than switching to the new language.

If that were true, then Linux distros wouldn't still be using GCC. Switching to a new compiler (like clang) is a huge burden.

Both switching compilers and switching languages are enormously expensive, to be sure. But I think people (especially people in academia) consistently underestimate the cost of switching compilers and overestimate the cost of writing new components in a different language.

> Don't forget that all "new safe languages" are simply new. Why people like C is familiarity: known practices and known issues to avoid ironed out through 30 years of usage.

Most programmers haven't been programming for 30 years. New programmers, by and large, don't even know C anymore.

The problems with C haven't been so much "ironed out" as ignored since C99.

> Remember Java? It promised compile-once/run-everywhere, automatic memory management approach, but introduced bloat, extremely hard to catch GC leaks no one talks about, JVM implementation differences (Oracle JVM vs IBM JVM vs OpenJDK speed) and JVM security issues only few can fix.

You bring up Java as though it were a failure! Java has in fact been beating C++ in total usage for years. If I could point to one thing that was responsible for kickstarting the slow decline of C++ that has continued to this day, Java would be it.

panic · on Sept 6, 2017

If that were true, then Linux distros wouldn't still be using GCC. Switching to a new compiler (like clang) is a huge burden.

FreeBSD switched to clang, and could (with some work) be made to use whatever safe C compiler people come up with. That's much easier than rewriting the entire FreeBSD system in a new language.

dig1 · on Sept 6, 2017

> If that were true, then Linux distros wouldn't still be using GCC.

Simply because there is no better alternative (clang doesn't bring anything new). However, distros switched to egcs fork when gcc wasn't up to date.

> New programmers, by and large, don't even know C anymore.

New programmers are interested in web, just like they aren't interested in desktop GUI development. Should I say that desktop is dead? I don't see we are booting usable machines in browsers yet.

> The problems with C haven't been so much "ironed out" as ignored since C99.

I think you need to hang more with embedded/kernel/C devs more and get insight into their mindset. They aren't interested in new stuff as much as in language stability.

> You bring up Java as though it were a failure! Java has in fact been beating C++ in total usage for years.

You read it wrong - I haven't said Java failed, but introduced new stuff to cope with. Java should thank it's popularity to huge ecosystem and library, Sun's aggressive marketing, jvm and extreme language stability.

> If I could point to one thing that was responsible for kickstarting the slow decline of C++ that has continued to this day, Java would be it.

Did I mention C++ here?

flukus · on Sept 6, 2017

> If that were true, then Linux distros wouldn't still be using GCC.

You're presupposing they want to switch. Most of the switchers to clang seem to have done so for ideological reasons more than anything.

Never the less, most of debian can be built with clang: http://clang.debian.net/

nickpsecurity · on Sept 6, 2017

Great overview. To support it on the design side, the video below shows the evolution of the language from CPL to BCPL to B to C. In it, you see they don't start with what's great for analysis, safety, or efficiency so much as what can compile on terrible hardware. Thompson's modifications are a mix of arbitrary and what will make it work on a PDP. Ritchie enhances a bit for operating systems. This is start contrast from the careful design of languages like Ada or Modula-3 balancing expressiveness vs safety vs performance. No surprise a bunch of problems followed. And same ones today, 30+ years later, in average app even with better tooling available since the language itself defaults on making simple stuff require extra work to do safely. Not necessary as Wirth, Morrisett, and others showed.

https://vimeo.com/132192250

AnimalMuppet · on Sept 5, 2017

I've used both C and Pascal in embedded systems. Pascal is painful compared to C. A "somewhat modified" version might help, but I doubt it would be enough. To steal a phrase from my friend Michael Pavlinch: Pascal was like picking your nose with boxing gloves on. A modified boxing glove isn't really going to solve the problem.

For that matter, once we weren't on Unix but rather on the PC, and we had a nicely-modified Pascal (Turbo Pascal), why did C/C++ win there, too?

pcwalton · on Sept 6, 2017

> For that matter, once we weren't on Unix but rather on the PC, and we had a nicely-modified Pascal (Turbo Pascal), why did C/C++ win there, too?

Turbo Pascal was quite successful in its day. But Microsoft chose C, and the rest is history. Absent Microsoft's decision, Pascal might still be around.

If you look at early Mac development, for instance, Pascal was actually preferred. C only ended up winning due to being better known, which was a result of the critical mass of programmers trained on Unix and Microsoft's offerings.

jstelly · on Sept 6, 2017

That doesn't agree with my experience. I switched from Turbo Pascal to Turbo C in the late 80s while doing DOS development because it was a better tool for the job. It had nothing to do with microsoft or windows (v3.0 was not yet out and few people developed windows apps before v3.0). Pascal (the language) was definitely not preferred for DOS development at that time - it's just that until 1987 there wasn't really C development environment that could compete with Turbo Pascal.

I did some Amiga development back then also and that was exclusively in C with some 68k assembly. I don't really recall anyone hoping for a pascal environment to replace their C tools, but the Amiga OS was more C-oriented than DOS at the time.

WalterBright · on Sept 6, 2017

> But Microsoft chose C

Ironically, the early Windows API used Pascal calling conventions.

clouddrover · on Sept 6, 2017

> Pascal might still be around.

Well that's a bit rude. Pascal is still around! And I love it. Behold:

https://www.freepascal.org/

https://www.embarcadero.com/products/delphi

nly · on Sept 6, 2017

> why did C/C++ win there, too?

Maybe C because of its legacy an ubiquity, and C++ because it was one of the few languages with 1) serious compatibility with C, 2) good featureset, and 3) became an ISO standard

iainmerrick · on Sept 6, 2017

Maybe C because of its legacy an ubiquity

Not in the 80s.

mjw1007 · on Sept 5, 2017

Basically because Microsoft picked C++ to be the main language for Windows development.

Turbo Pascal was all very well but from Microsoft's point of view it was Not Invented Here.

WalterBright · on Sept 6, 2017

C++ wasn't invented at Microsoft, either. The DOS C++ train had already left the station (Zortech C++) and Microsoft wasn't about to be left behind.

(Zortech didn't invent C++, either, I don't want to give that impression.)

pjmlp · on Sept 8, 2017

Actually if I remember correctly they were the very last PC C compiler vendor to add support for C++.

igouy · on Sept 6, 2017

> Pascal in embedded systems

aka Modula-2

http://cms.edn.com/ContentEETimes/Documents/ESC%20Proceeding...

iainmerrick · on Sept 6, 2017

I never liked Pascal and I love C, but Bill Atkinson did amazing things with Pascal (see for example https://www.folklore.org/StoryView.py?story=Hungarian.txt) so it must have something going for it.

quickben · on Sept 5, 2017

You are ignoring the fact that many problems can't be solved in the higher level languages.

Also, for some, having the c/c++ level of control is prefered.

bluejekyll · on Sept 5, 2017

This doesn't have to be zero sum. We don't need to choose between safe and unsafe. Safety should be a default, with unsafety being something you opt into.

C/C++ are both unsafe. Rust is safe by default, and for the cases where you want/need the C/C++ level of access to the system, you can opt into unsafe. I believe Rust is more safe than Go and Java as well, because of the type safety in the threading model.

There are existing applications in C out there, there will be for a long time. Personally I frown upon anyone starting a new project in C or C++, or any unsafe language, especially when you have such a strong language in Rust that exists with a very healthy and growing community around it.

jstewartmobile · on Sept 5, 2017

Notice that "scoff at" turned to "frown upon" here, but same difference.

I like Rust, but the kind of attitude you've shown in this comment is a) distressingly common among Rustaceans, and b) makes everyone in the embedded world who would like to get away from C/C++ keep the whole Rust ecosystem at arms-length.

bluejekyll · on Sept 5, 2017

Scoff: speak to someone or about something in a scornfully derisive or mocking way.

Frown: furrow one's brow in an expression of disapproval, displeasure, or concentration.

Those are definitely not the same thing. Language is deliberate. I'd prefer it if you don't change mine to mean something I do not.

I frown at a lot of things in software I review, then someone explains to me why they decided to do something the way they did, in which case they may convince me that they are correct. In the case of C/C++, you can convince me easily in the embedded space that C is still the best choice, and I'd agree. I wouldn't even debate it, I might personally go try and see if there is an option there, but it's clearly a space that Rust is still getting bootstrapped into.

jstewartmobile · on Sept 5, 2017

You wrote "scoff" first before you edited it, that's why gens wrote "You can scoff all you want". From the tone of your comments, I think that's what you're really doing.

gens · on Sept 6, 2017

No, he did not. Not as far as i remember.

I am not British, and i did want to dramatize it a bit. In my defense, words are to convey meaning as much as to describe action. (you don't literally "beat a dead horse" or "fart in your general direction")

EDIT: In bluejekyll's defense, we here are a bit chafed (hehe) by extreme fans of certain programming languages (IMO even some functional programming fans get a tad too.. unrealistic in their talks)

bluejekyll · on Sept 6, 2017

hmm... I did not. But ok.

speleo_engr · on Sept 5, 2017

You sure would frown a lot in the embedded world. Rust isn't even a contender for most hard-real time or safety critical projects. Is it even supported by any RTOS today?

bluejekyll · on Sept 5, 2017

Recently I had the pleasure of learning how to develop for this https://www.tockos.org/, which was a lot of fun.

But yes, I can still frown at the C in the embedded world, and bite my tongue when forced to use it. It's only a matter of time IMO before Rust gets onto more of these devices.

gens · on Sept 5, 2017

I don't like to be told what programming language i should use. You can scoff all you want, but i will be using C for most of my projects.

bluejekyll · on Sept 5, 2017

I don't "tell" people, what language to use. But I do encourage them to look at new languages, especially ones that fit in the same place as one that they like.

I love C. It was my first programming language. I love the syntax, I love the semantics. Years ago, I left it though, because I could deliver higher quality applications with fewer unknown bugs with Java, but I always wanted to find a reason to go back to C. And for some projects I did, and it was important (and in each case I'd run into some error or bug that took weeks to track down).

When I grabbed Rust 2 years ago, it was like getting everything I loved about C paired with everything I love about Java, and none of the stuff I hate in either. Feel free to use whatever language you wish, and I hope schools continue to teach C (though that seems to be dwindling), but I highly encourage people to checkout Rust and see what it's like to not worry about pointer management all the time.

Const-me · on Sept 5, 2017

When I checked out Rust a year ago, I immediately discovered it has no SIMD (SSE, AVX, Neon). That it has no sane ways to implement a graph structure. Also that it’s hard to compose data structures into higher-level specialized ones.

Also, I highly encourage people who worry about pointer management all the time to checkout modern C++.

pcwalton · on Sept 5, 2017

> That it has no sane ways to implement a graph structure.

Sure it does. I work with graphs in Rust all the time.

They may not be "sane" in your view because they're different from the way you implement them in C++, but I could equally well say that there's no "sane" way to implement a safe owning pointer in C++ (since there's no memory-safe way to do so).

> Also, I highly encourage people who worry about pointer management all the time to checkout modern C++.

Modern C++ has no protection against the most pernicious memory management errors, particularly use-after-free.

nly · on Sept 6, 2017

> I could equally well say that there's no "sane" way to implement a safe owning pointer in C++

unique_ptr + std::move gives you that semantically. Sure it won't stop you from dereferencing a null pointer but, in all the years i've been writing C++, finding and fixing null pointer dereferences wouldn't rank very high on my list of things to worry about. They always kill your program and are easy to spot in an IDE or debugger.

Rusts choice to make pointers either mutable and owning, or shared and immutable and garbage-collected, is no doubt the right one, but there are code-styles in C++ where this can be achieved with a very low fuck-up rate.

The modern C++ way is not to use pointers, except as an implementation detail. A pointer (raw or smart) of any type, other than perhaps char*, as a function parameter is a sure sign of code smell, and raw pointers as data members have very limited use.

pcwalton · on Sept 6, 2017

Unique pointers provide no protection against use after free, because you can take a reference to their contents and that reference can become dangling. Because the destructor of a unique pointer is invoked automatically per the language rules, as opposed to in C where an explicit call to free is required, this makes C++ more prone to UAF than C.

blub · on Sept 6, 2017

It's unusual to take references to the contents of a unique pointer. There is one idiom which says that if one has a smart ptr and a function taking a ref, the raw ptr should be passed, but that's it. It's frowned upon... nay scoffed at to store references one receives as parameters, so that temporary ref will go away after the function call, leaving the smart ptr as unique owner.

This should not be a problem and it certainly doesn't make C++ more prone to use after free. Null pointers are the problem.

Both can be solved though by creating e.g. a safe smart ptr which does null checks and only exposes operator->.

pcwalton · on Sept 6, 2017

> It's unusual to take references to the contents of a unique pointer.

No, it's not. It happens every time you call a method on the referent (well, OK, this is technically not a reference, but it doesn't matter to the argument).

blub · on Sept 7, 2017

One is not taking the reference. When writing ptr->foo() the ref is temporary, not accessible and will be cleaned up when the method call on the raw ptr finishes.

Taking the reference would be "auto foo = ptr.operator->()". This could be forbidden by not providing op->, and instead having an apply function which takes a method name and the parameters. That would be safer, but probably too much effort for little gain.

nly · on Sept 6, 2017

If you're calling a method on the object referred to by a unique_ptr, then you won't have a use-after-free because the thing will exist. The only way it wouldn't exist would be if you typed "delete myuniquePtr.get()", which would be dumb.

It could be a null unique_ptr of course, but I don't see how this is anything worse than a denial-of-service.

pcwalton · on Sept 6, 2017

No, you could get a reference to the container of the object and indirectly delete it. For example, if the unique pointer were part of a global std::vector, clearing the vector would invalidate the this pointer.

Keep in mind that you are at this point arguing against the existence of actual zero-days that have occurred in Firefox (and lots of other software). This is not a theoretical concern.

nly · on Sept 6, 2017

This goes back to what I said about using pointers as implementation details. If you put unique_ptrs in to a _global_ then you're doing something stupid already. Just think about it. Why does something that can only be pointed to from one place need to visible globally in the first place?

My point is all Rust does is force you stop and think, and while C++ lets you do dumb shit, it's hardly fair to blame the language when almost-safe C++ is actually cleaner and easier to read and write than dumb C++.

Build your own handle types with well-defined ownership semantics, use explicit move() sparingly, pass objects by value, use references, utilize the stack and temporaries. Only put pointers inside the guts of classes and your data structure implementations. These techniques go quite far.

And Firefox as an appeal to authority is hardly compelling. As another, I've seen bug fixes in Chromium where the original code quality is so poor it was hard to believe it came out of Google. Of course, since then i've learned most C++ out of Google is total crap.

nly · on Sept 6, 2017

> Unique pointers provide no protection against use after free, because you can take a reference to their contents and that reference can become dangling.

Yes, it's possible, however references should only ever have local scope so, unless you're dealing with threads or asynchrony it's hard to write sane code where this happens, and if you have those things, and you're passing refs or ptrs, then I don't have to tell you that things are bad.

> as opposed to in C where an explicit call to free is required, this makes C++ more prone to UAF than C.

I don't see this. C++ destructors run after the last line of your code block, if something uses a destructed object then it can only be because you passed a pointer or reference to it to something else that wasn't yet destroyed.

Mentioning free() just implies that you're willing to accept resource leaks to avoid UAF bugs, which is nuts because UAFs can be a lot easier to debug.

cesarb · on Sept 6, 2017

> Mentioning free() just implies that you're willing to accept resource leaks to avoid UAF bugs, which is nuts because UAFs can be a lot easier to debug.

If you're focused on security, it goes in the opposite direction: a resource leak can lead to a denial of service, but an use-after-free can lead to remote code execution, which is much worse. From that point of view, it's worth it risking a resource leak if by doing that you prevented a potential instance of remote code execution.

By the way,

> C++ destructors run after the last line of your code block

Aren't there many situations where the C++ destructor runs at the end of the current statement? IIRC, if you call a function which returns a temporary, then call a method on that temporary which returns a reference to within the temporary, and assign the result to a variable, all in a single statement, the temporary will be destructed while the reference to its contents is still live.

nly · on Sept 6, 2017

> then call a method on that temporary which returns a reference to within the temporary

The obvious answer to this would be never to return references to members (or anything tied to the objects lifefime), but if you really must then you can always use a qualifier to prevent this pattern from compiling.

https://ideone.com/UuZYJe

Const-me · on Sept 6, 2017

> They may not be "sane" in your view because they're different from the way you implement them in C++

I saw two kinds of Rust graphs.

One is safe, easy to understand, but slow (e.g. reference counting).

Another one (unsafe Rust) is very hard to implement, thousands lines of code. Also, modern C++ is much safer than unsafe rust.

> C++ has no protection against the most pernicious memory management errors, particularly use-after-free

CRT debug heap / MALLOC_CHECK_ / libefence, depending on the platform/compiler

bluejekyll · on Sept 6, 2017

I can't reply directly to your other comment, so I'll do it here. In the case of pet-graph, I wouldn't say it has large amounts of unsafe (I reviewed it recently b/c I might start using it), but it does use unsafe. Most data structures in Rust require unsafe for performance or memory access patterns. In these cases the developer is the one responsible for guaranteeing that it is safe (no loss over an unsafe language).

> modern C++ is much safer than unsafe rust

This is an interesting comment. I can't say if you are right or wrong, but it's thought provoking. So I'll quote the rustinomicon here:

    Unsafe Rust is exactly like Safe Rust with all the same 
    rules and semantics. However Unsafe Rust lets you do 
    some extra things that are Definitely Not Safe.

    The only things that are different in Unsafe Rust are 
    that you can:

    -Dereference raw pointers
    -Call unsafe functions (including C functions, intrinsics, and the raw allocator)
    -Implement unsafe traits
    -Mutate statics

Point being Rust doesn't just throw out all the rules. But it's a very interesting assertion.

https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html

Const-me · on Sept 6, 2017

> I can't reply directly to your other comment

Next time just wait 5-10 minutes.

> Most data structures in Rust require unsafe for performance or memory access patterns.

And when I want to compose 2 data structures into my own higher-level one, for performance and memory access patterns I need these two lower-level structures to expose unsafe stuff at their API boundaries. The data structures I saw don’t do that, they’re designed to be consumed from safe Rust instead.

> Rust doesn't just throw out all the rules

I think in modern C++, with these iterators and smart pointers, you’re less likely to screw up dereferencing a wrong pointer.

pcwalton · on Sept 6, 2017

> And when I want to compose 2 data structures into my own higher-level one, for performance and memory access patterns I need these two lower-level structures to expose unsafe stuff at their API boundaries. The data structures I saw don’t do that, they’re designed to be consumed from safe Rust instead.

Can you give a specific example of something you want to do that you can't?

> I think in modern C++, with these iterators and smart pointers, you’re less likely to screw up dereferencing a wrong pointer.

I don't think this is empirically true relative to C, but even if it is, use after free is still far too common in C++ code.

Const-me · on Sept 6, 2017

> Can you give a specific example of something you want to do that you can't?

Compose a hash map + linked list into an LRU cache. Rust how has that in standard library, but they had to implement their own linked list for that. In C++ it’s just a few lines of code, because standard maps+lists compose just fine.

Or (more generic example and thus harder to put in the standard library), add an index to existing collection. I have a large collection of some values. I want to build an index allowing to lookup values by some key. Values are not small, can’t afford duplicating them. If you’ll tell “just move values into a hashmap”, my response is “and I also want another, different index of the same set of values by different key”. Again, very easy in C++, encapsulate both the original collection, and a hashmap from key to value pointer.

> use after free is still far too common in C++ code.

In my experience, use after free = instant crash in debug build. Quite easy to detect and fix.

pcwalton · on Sept 6, 2017

> In my experience, use after free = instant crash in debug build. Quite easy to detect and fix.

The security track records of major network-facing C++ apps disagree with you.

burntsushi · on Sept 6, 2017

Use an Rc? If you can't afford the reference count, then use unsafe/raw pointers, which it sounds like you'd do in C++ anyway.

Const-me · on Sept 6, 2017

> then use unsafe/raw pointers

Even if I manage to extract an unsafe pointer from that Rust collection, I don’t know for how long will it work. For C++ collections, iterator invalidation rules tell me that.

cesarb · on Sept 6, 2017

> Even if I manage to extract an unsafe pointer from that Rust collection,

It's easy, just get the & or &mut to the value (as if you were acessing it), and cast it to respectively * const or * mut.

> I don’t know for how long will it work. For C++ collections, iterator invalidation rules tell me that.

It's the same in Rust. Whenever the iterator would be invalidated in C++, the pointer you stashed above might point to the wrong place. This is not usually documented in Rust, because its borrow rules prevent you from stashing a reference while the collection mutates, but once you start playing with raw pointers, the borrow checker gets out of the way (references have a lifetime, pointers don't).

You just have to be careful when casting the pointer back to a mutable ref ("unsafe { &mut *ptr }" is the trick, see the documentation for std::mem::transmute): mutable references are like C99's "restrict", so you should make sure to only ever have one live for each pointer at every moment, otherwise you're in undefined behavior land.

----

Anyway, going back to the parent comment, you said "Values are not small, can’t afford duplicating them". Might I suggest keeping the values in a Box<T> then, and making both collections point to the box? That way, you don't have to worry about a mutation in one of the collections invalidating the pointer, since the contents of a Box won't move in memory.

And in fact, the usual Rust style for keeping a value in more than one collection would be to use a Rc<T>, which is basically a Box with a reference counter. That way, you don't need to play with raw pointers, and have no risk of a misstep. You pay the cost of incrementing/decrementing the reference counter only when adding/removing from the collection, and the reference counter is small.

Const-me · on Sept 6, 2017

> keeping the values in a Box<T> > for keeping a value in more than one collection would be to use a Rc<T>

Indeed, both methods are simple and elegant ways to approach the problems.

The bad thing with both of them is performance.

Box<T> means when I need to iterate through all values in a collection, I’ll get random memory access for each item. Rc<T> is even worse, not only it’s RAM read latency per item, also ref.counting overhead per item (AFAIK even when reading stuff).

cesarb · on Sept 6, 2017

> also ref.counting overhead per item (AFAIK even when reading stuff).

That's the beauty of the borrow checker: no, there's no reference counting overhead when reading stuff. The borrow checker guarantees that the reference you used to access the value won't go away until you're done with it, so it doesn't have to increment the reference counter.

burntsushi · on Sept 6, 2017

You'd have to box your values. And if you didn't want to do that, then I'd just used the indexing method mentioned elsewhere in this thread. I've used such things in performance critical code.

Const-me · on Sept 6, 2017

> the indexing method mentioned elsewhere in this thread

https://news.ycombinator.com/item?id=15180649

pcwalton · on Sept 6, 2017

> For C++ collections, iterator invalidation rules tell me that.

The iterator invalidation rules in Rust are straightforward, more straightforward than those in C++. They have to be, because the compiler actually checks them.

bluejekyll · on Sept 6, 2017

More like 20 it seems ;)

> I need these two lower-level structures to expose unsafe stuff at the API boundary

Two thoughts: 1) I think you can always use unsafe to get access to a raw pointer (I honestly don't use unsafe often) 2) you need someway to express ownership between both data structures, this can be annoying, no doubt.

> C++, with these iterators and smart pointers

Does that make it safer than unsafe Rust? Maybe, but there's a lot less unsafe Rust even in these graphs...

Const-me · on Sept 6, 2017

> I think you can always use unsafe to get access to a raw pointer

I’m not sure about that. Also I don’t know for how long will it work, C++ has iterator invalidation rules.

> you need someway to express ownership between both data structures

Not every relation is ownership. Graph modes don’t own each other, an external index doesn’t own the indexed items, etc.

oconnor663 · on Sept 6, 2017

I'm really not sure, but I think there's a miscommunication about raw pointers in this thread. I think other folks are suggesting that you can solve the container composition problems you're talking about by inserting raw pointers into a HashMap. But I think you're reading that as obtaining raw pointers to the storage that a HashMap owns, which is why you're worried about iterator invalidation rules and stuff like that? (My understanding is that any reference into storage that a container owns is / could be completely invalidated by any &mut operation on that container.)

Const-me · on Sept 6, 2017

To create an external index for the existing collection, I need to get pointer of the item stored in that existing collection, compute a key, and put both into the HashMap<tKey, tValue*>

So, I need to do both. And also, I need to know when these pointer expire so I can rebuild my index when it happens.

oconnor663 · on Sept 6, 2017

I think I understand. The idea would be to have one HashMap<tKey, tValue> that holds the objects themselves, and then a secondary HashMap<tKey2, *tValue> (with a star this time) that indexes on some other key and points to values stored directly in the first map?

What's the benefit of doing that, compared to making both the HashMaps store pointers to independently allocated objects on the heap, such that insertions into one map never invalidate the other map? Is the hope to avoid paying the cost of an extra pointer dereference when we're using the first map? Or does independently allocating each object hurt cache locality or something like that?

Const-me · on Sept 6, 2017

> Is the hope to avoid paying the cost of an extra pointer dereference when we're using the first map? Or does independently allocating each object hurt cache locality or something like that?

Both.

In practice, I probably wouldn’t use a hashmap for the first container that actually owns these items. When I do expect gigabytes of data, in C++ I use something like vector<vector<tValue>>, where the inner vectors are of the same fixed size (except for the last one), e.g. 2-16MB RAM / each. If I need to erase elements, I include a free list such as this one: https://github.com/Const-me/CollectionMicrobench/blob/master...

But the exact container is not that important here. If you don’t have that many values, it can as well be a standard vector.

The point is, C++ allows composing these containers making higher-level ones, such as this indexed array example, using pointers to link individual items across them. This feature allows building sophisticated and efficient data structures that are still possible to reason about.

burntsushi · on Sept 6, 2017

Right, and I don't understand why you think that same exact approach wouldn't work in Rust either. If you have a `Vec<Vec<tValue>>`, then you can spread all the raw pointers you want everywhere without any additional boxing of `tValue`, and you know exactly when those pointers might become invalidated: whenever you call an `&mut` method on your `Vec<Vec<tValue>>` (or rather, on one of the interior `Vec<tValue>`s). Because of that, you can even build safe abstractions on top of such data structures such that your callers can't possibly misuse it (without themselves using `unsafe`).

The technique of giving stable addresses to things by stuffing them into vectors isn't unique to C++. People do it in Rust too: https://github.com/SimonSapin/rust-typed-arena/blob/master/s...

Const-me · on Sept 6, 2017

The exact container is not that important here. The point is, C++ allows composing these containers making higher-level ones, such as this indexed array example.

They can be standard, third-party, my own, I still can compose them.

About my particular example, I’m not sure you can easily implement a free list in rust, to reuse space from de-allocated items. Especially if these items have non-empty constructor and destructor.

burntsushi · on Sept 6, 2017

> The point is, C++ allows composing these containers making higher-level ones, such as this indexed array example.

What I---and others---are trying to tell you is that it's perfectly possible in Rust too. I don't think you've pointed out anything that isn't possible in Rust. My previous comment was exactly about composing containers to make higher-level ones.

Have you tried building such things? Did you get stuck? Maybe someone can help.

> About my particular example, I’m not sure you can easily implement a free list in rust, to reuse space from de-allocated items. Especially if these items have non-empty constructor and destructor.

I don't see any reason why implementing a free list in Rust wouldn't be possible either.

oconnor663 · on Sept 6, 2017

It seems like one of Const-me's objections is that Rust data structures like HashMap don't document a lot of guarantees about when they would and wouldn't invalidate unsafe interior pointers. That said, for Vec in particular, Rust actually makes a ton of guarantees about its layout (more that C++ std::vector I think): https://doc.rust-lang.org/std/vec/struct.Vec.html

Const-me · on Sept 7, 2017

Correct.

While vectors are comparable, C++ also guarantees a lot about the rest of the containers. E.g. unordered associative containers never expire pointers to keys or values. Linked lists never expire pointers nor iterators.

In C++ I can create an efficient LRU cache in a dozen lines of code, combining list<const tKey* > with unordered_map<tKey, struct{tValue, list<const tKey* >::iterator}> (this implies tKey is not an int, otherwise list<tKey> is more efficient). Rust’s built-in LinkedHashMap had to reimplement a linked list instead.

Const-me · on Sept 7, 2017

> I don't see any reason why implementing a free list in Rust wouldn't be possible either.

Is placement new available in rust stable?

burntsushi · on Sept 11, 2017

Why do you think placement new is necessary to implement a free list?

benlorenzetti · on Sept 6, 2017

Object size possibly.

anp · on Sept 6, 2017

If you click directly on the time of a comment, it's a link to a direct reply page.

burntsushi · on Sept 6, 2017

> And when I want to compose 2 data structures into my own higher-level one

Could you give a specific example please? I compose data structures in Rust all the time.

Const-me · on Sept 6, 2017

https://news.ycombinator.com/item?id=15180500

pcwalton · on Sept 6, 2017

There is another kind of Rust graph that uses indices. Indices are just bounds checked addresses, so this is a natural fit. The most popular graph crate, petgraph, is of this type.

> CRT debug heap / MALLOC_CHECK_ / libefence, depending on the platform/compiler

None of these are effective at preventing use after free problems in production.

Const-me · on Sept 6, 2017

There’s another problem with arrays: they can’t be too large.

First reason is address space fragmentation, esp. on 32-bit platforms.

Second reason is insert time can be very high. Sure, the average is usually amortized using exponential growth. The worst case however is horrible, you copy 1GB RAM just to insert another 16-byte item.

cesarb · on Sept 6, 2017

> I saw two kinds of Rust graphs.

There's a third kind, which uses indexes into arrays containing the nodes and edges, instead of direct pointers to the node/edge.

> CRT debug heap / MALLOC_CHECK_ / libefence, depending on the platform/compiler

Can any of these protect against the scenario where a block of memory is freed, allocated again for another purpose, but still accessed through the old dangling pointer?

Also, are they always present at runtime, or are they used only on debug builds and turned off on production? The use-after-free might happen only after a specific sequence of uncommon operations confuses the code enough that it either frees something before its time, or keeps and uses a stale pointer.

Const-me · on Sept 6, 2017

> There's a third kind, which uses indexes into arrays containing the nodes and edges

Pointers are still faster. Also with arrays it’s expensive to reduce RAM usage after a lot of nodes were removed.

> Can any of these protect against the scenario where a block of memory is freed, allocated again for another purpose, but still accessed through the old dangling pointer?

No 100% guarantee, but AFAIR CRT debug heap takes measures to reduce RAM reuse when it can.

> are they always present at runtime, or are they used only on debug builds

Not present. Yes, only on debug builds. Still, these early debug traps are quite helpful while development.

cesarb · on Sept 6, 2017

> Pointers are still faster.

Depending on cache effects, indexes might or might not be faster than pointers. With indexes into an array, the nodes or edges will be sequential in memory, which depending on their size and access patterns might increase the cache hit rate. Furthermore, while pointers will always be 8 or 4 bytes, indexes can be as small as 2 or even 1 byte for smaller graphs (reducing structure sizes and potentially leading again to a higher cache hit rate).

As for the costs of indexing, on x86 a single instruction can add the array base, the index, and a constant offset, and do a load or store from/to the resulting address. Other architectures might need a few more instructions, but that is dwarfed by the cost of a cache miss, which can be hundreds of instructions.

Another cost is the bounds check for every indexing into the array, which the compiler can't elide because it can't easily prove that the index is within the array bounds. That is the main reason you saw "unsafe" code on the petgraph crate; there are places where the programmer knows the indexes are within the bounds, since they came from a trusted place (the graph itself), but the compiler isn't smart enough to prove it, so the programmer manually bypasses the array bound checks in these cases.

All in all, I wouldn't know a priori which would be faster for a particular use case, pointers or array indices. I'd have to benchmark first.

> Also with arrays it’s expensive to reduce RAM usage after a lot of nodes were removed.

True, the "array indexes" approach is not as good for algorithms which need to remove many nodes (or edges, depending on how they're represented) from the graph. As long as you don't need the indexes to be stable across deletions, you can use a simple trick to make deletions cheaper (move the last element of the array into the newly freed place, so all empty places are at the end of the array), but that trick can't be used if you need the indexes to be stable (because they're referenced from outside the graph).

benlorenzetti · on Sept 6, 2017

Your point about reducing pointer size if the size of allocation pool is smaller makes sense. But I think it is not fair or realistic to imply that simply using an array the size of all RAM starting at zero leads to more fragmentation than referencing all of RAM relative to some other point. Assuming same algorithms working on same sized data sets. The relative locations of objects is not dependent on where the coordinate origin is, although choosing a good zero may make the engineer's life easier.

Const-me · on Sept 6, 2017

https://news.ycombinator.com/item?id=15180649

josephg · on Sept 6, 2017

When I checked out rust a year ago, I tried to implement three things:

- Some OT code. This went ok, but I still have no idea which of the 6 string types I should use for a library like that. I think I ended up settling for Rc<Cow<String>> or something, but it still wasn't ideal. Swift, Go and C all each have a canonical string type.

- First I tried to make a skip list with performance matching the performance of my C implementation. I discovered that even with unsafe there was no way to make a struct with a dynamically sized array at the end, like I can easily do in C.

- Then I tried to make a networked server using tokio. Despite all the hype, adding a dynamic item to the event bus didn't work because it wasn't 'static didn't work. After spending a few hours fighting the borrow checker, I went online and was told that this would get better with impl trait or something.

I'd really like to use rust, but as far as I can tell its not mature enough for what I want. I've started a new server project recently and I'm writing it in straight C, as none of the newcomer C-replacement languages I tried seem good enough to replace C.

pcwalton · on Sept 6, 2017

> This went ok, but I still have no idea which of the 6 string types I should use for a library like that. None of Swift, Go or C have this problem.

There are two string types: a string that owns its contents and a string that references its contents. This is the same as in any language that uses smart pointers for resource management.

Can you name a string type that you think should be removed, and explain why?

> First I tried to make a skip list with performance matching the performance of my C implementation. I discovered that even with unsafe there was no way to make a struct with a dynamically sized array at the end, like I can easily do in C.

Yes, you can. You can make a one element array and allocate and deallocate manually, just as you do in C. The offset method on pointers allows for arbitrary pointer arithmetic.

> Despite all the hype, adding a dynamic item to the event bus didn't work because it wasn't 'static didn't work.

I haven't used Tokio, but couldn't you use a boxed trait?

josephg · on Sept 6, 2017

> There are two string types: a string that owns its contents and a string that references its contents. This is the same as in any language that uses smart pointers for resource management.

There's String, &str, Cow<?>, Rc<?> and other variants. None is canonical. I spent about 2 hours reading documentation trying to pick the right type to use and I think I ended up with Rc<Cow<String>>. But in this instance my strings represent character edits in a document. 90% of the time they're < 5 bytes long. So in 90+% of cases I should be able to avoid allocations and memory dereferencing entirely, and store the string inside the pointer. What I actually want is an efficient version of enum Str { ShortStr(char[X]), Ref(Rc<Cow<String>>) }, but encapsulated behind a common string interface. Coincidentally, this is exactly how the canonical string implementation works in obj-c and (I think) swift. Despite having 6 different options maybe the string type I actually want is buried in Cargo. I'm not sure - at this point I was tired and I stopped trying.

To me this is a classic symptom of a language trying to do too much. Having all this choice is great for systems development, but for application-level development I don't want X different string options. You want 1. And I want it to be good. Having lots of options would be fine if the language was more opinionated - "Unless you know what you're doing you should just use String, which is efficient, immutable, copy-on-write and ref counted. Click here (link to advanced section of book) to read about your other options if you want more control over allocations."

> Yes, you can. You can make a one element array and allocate and deallocate manually, just as you do in C. The offset method on pointers allows for arbitrary pointer arithmetic.

Does it? At the time even with unsafe there was no way to directly call malloc. Maybe I just couldn't find it in the docs, or maybe thats changed now. I spent weeks on and off trying to get it working, including reading the rust unsafe nomicon and writing dozens of linked list implementations. I tried out all sorts of weird ways to allocate and initialize the array. I kept thinking of new ideas, only to find out a critical piece of syntax was missing. In the end I could allocate the struct I wanted but discovered it was syntactically impossible to initialize, or something silly like that. And at that point I gave up. Maybe this problem has been fixed since. And maybe if I spent even more time trying I would have figured it out. But I was tired and I had work to do.

> I haven't used Tokio, but couldn't you use a boxed trait?

I don't know what that is. Frankly I'm still confused why Rc<> didn't work. I got about 6 different answers when I asked the rust subreddit how to fix this. Some people suggested things that also didn't compile. Some people said I should make my object 'static (no thanks). And others said the problem would be fixed when trait impl lands (whatever that is - is that what you're talking about?). This use case is literally the 'hello world' of nodejs code - attach an event handler to an object, interact with local variables each time the event fires. At least as of a year ago the tokio devs clearly thought all network servers only did request/response style interaction. All the examples on their website were either an echo server or an http server. I need streams.

I really want to be able to use rust. But so far my only experience with it has been one of frustration. It seems too immature to replace C as a systems language, and tokio seems too immature to replace Nodejs for network services. Maybe I'll revisit it in a few years, but at this point I'm more hopeful either someone will bolt decent syntax on top of Go a la coffeescript, or that Swift will add language level support for concurrency. (I'd be happy with either async or go's actor model.)

burntsushi · on Sept 6, 2017

> There's String, &str, Cow<?>, Rc<?> and other variants.

To be fair, that's like saying std::string and std::shared_ptr<std::string> are two different string types in C++, and that neither is canonical.

In Rust, String/&str are the canonical string types. String is an owned growable buffer, &str is an immutable slice. That's it. Adding Cow<_>, Rc<_> or Arc<_> to the mix is orthogonal to the specific string type you're using. They are smart pointers and can work with various types other than strings.

> What I actually want is an efficient version of enum Str { ShortStr(char[X]), Ref(Rc<Cow<String>>) }, but encapsulated behind a common string interface.

We couldn't get away with adding this as the standard library string type because it would impose non-zero costs on every use of a string. The use of Rc is particularly grating because it's not thread safe, which means you wouldn't even be able to send strings across threads. That would suck. So then you might want to say to use an Arc---atomic ref counting, thread safe---but that's even more costly.

I'm honestly kind of confused at your feedback here. At first it just sounded like you were bewildered by the various string types---which is a fair criticism, getting strings right is hard and everyone has opinions on what they should look like---but it actually sounds like you knew exactly what you wanted, and were frustrated that the standard library didn't have it. Instead, the standard library gives you a fundamental string type that one could use to build other more advanced string types when you need them.

The typical solution to problems like that is to go out and build what you need and put it on crates.io. Or, use one that already exists. :-) https://docs.rs/inlinable_string/0.1.8/inlinable_string/

Const-me · on Sept 6, 2017

Have you considered C#? It only has a single string class. With async-await, concurrency is fine too.

Not long ago, they open sourced the compiler and a subset of runtime, making it cross-platform: https://github.com/dotnet/core It’s a but tricky to install on Linux, but for me it works OK, at least so far (an embedded TCP/UDP server app).

josephg · on Sept 6, 2017

I haven't, and its a good idea. I wrote a bunch of C# code back in 2007 and I consistently enjoyed it. - It seems like a very pragmatic language choice.

But if I'm going to move further away from the hardware in exchange for some language comforts & quality of life improvements, Elixir is the next language I want to try. I think both its concurrency primitives and immutability rules might be the right language-level defaults.

Const-me · on Sept 6, 2017

> move further away from the hardware in exchange for some language comforts & quality of life improvements

C# has descent native interop, i.e. [DllImport]. On Linux it imports from .so dynamic libraries. When you want to be closer to the hardware, because SIMD, or system calls not exposed to .NET, or integration with third-party C code, it usually works OK.

pjmlp · on Sept 8, 2017

C# has some SIMD support since version 6.

Const-me · on Sept 9, 2017

Very limited support. On x86-64, the only languages that have good SIMD support are C (С++ gets that for free, ‘coz compatibility) and Fortran.

pjmlp · on Sept 9, 2017

There is also D and Object Pascal, unless you won't consider inline Assembly as having support.

Oh and Fortran of course.

But yeah, I also find it sad having to go down to Assembly to make use of them.

kibwen · on Sept 6, 2017

Rust has two string types, not six. If you're referring to the `OsString` type, that exists only for platform interop, and there's no confusion as to when one needs to use it.

bluejekyll · on Sept 5, 2017

Graph in Rust: https://github.com/bluss/petgraph

SIMD in Rust: http://huonw.github.io/blog/2015/08/simd-in-rust/ (yes, still in nightly)

Const-me · on Sept 6, 2017

> Graph in Rust: https://github.com/bluss/petgraph

Thousands lines of code. Large amount of that is unsafe rust, even C++ is safer than that :-)

> SIMD in Rust: http://huonw.github.io/blog/2015/08/simd-in-rust/ (yes, still in nightly)

It was already “still in nightly” a year ago. Also it’s harder to do integer math with it, because type safety: very often, even consecutive instructions interpret these __m128i registers as different datatypes, u8x16 / i32x4 / u64x2 / etc.

kibwen · on Sept 6, 2017

> Thousands lines of code.

Have you taken a look at petgraph? It does quite a lot of things. The same functionality in C would be thousands of lines as well.

> It was already “still in nightly” a year ago.

SIMD is in Rust nightly not because it's immature, but because the Rust developers would rather design a portable interface than quickly standardize a nonportable one. Given that AFAIK neither the C nor C++ specifications include provisions for SIMD and all support is compiler-specific, the only difference between C/C++ and Rust here is that Rust follows a release model that features a nightly channel.

Const-me · on Sept 6, 2017

> neither the C nor C++ specifications include provisions for SIMD and all support is compiler-specific

The support is portable across compilers. You #include <[xepsiz]mmintrin.h>, and you’ll get these SIMD intrinsics as documented on intel.com.

BTW, OpenMP isn’t in the C++ language spec either, doesn’t prevents it from working on most compilers and platforms.

pcwalton · on Sept 6, 2017

petgraph is a lot of code because it has a lot of features. You could equally well criticize C++ because boost is so much code. A simple graph with indices is a lot smaller, and it doesn't need unsafe either.

_ofdw · on Sept 6, 2017

I haven't learned rust because of how annoying the rust evangelism is everywhere I read about it.

quickben · on Sept 6, 2017

I am fluent in c, c++, c#. The performance level difference in implementations in some cases is over five orders of magnitude.

I envy people that don't have to worry or use pointer management. I imagine them coding in rust with one hand while drinking martinies with the other :)

bluejekyll · on Sept 6, 2017

This is exactly how I code, well s/martinies/burbon/. ;)

The compiler makes sure those types I'm seeing double of are legit.

Seriously, there's a reason the moto 'fearless programming' has been applied to Rust.

gens · on Sept 6, 2017

I don't know why it is not common knowledge that different people have different priorities, different ways of thinking about things and, well, that they are different. Not only that but people can even have opinions that contradict their other opinions. I write C, its powerful and that it stays out of the way. I also like assembly for the same reasons. But i also like scheme, that is the most opposite language to C that i know of.

Another reason i like these three languages in particular is that they are simple. There are no added features that would make me look them up, if, for example, i was reading somebody elses code or if i didn't code in a few months.

I like my data to be laid out how i want it and process it how i want it. You can say "but rust lets you do it with this", when in C i "just do it".

The whole memory safety argument is.. i want to say "fine" as people do make mistakes, but tools like static analyzers (aka linters) and valgrind exist.

As for the higher social aspect;

> I don't "tell" people, what language to use.

But you do. If you say to a newbie that their C project is "bad" because it is written in C, be it directly or indirectly, they will take it as if you are telling them to use another language (and when you say that that language is Rust, then.. well..). When the truth is that programming languages don't matter in most cases.

> But I do encourage them to look at new languages, especially ones that fit in the same place as one that they like.

I couldn't agree more, with the first part at least. I like imperative programming more then pure functional, but i did learn a lot by programming functionally in scheme (and by playing with some other languages that i will never use), and now the C that i write is better for it. edit: assembly is a good example of a language to learn just for the sake of knowing it.

Let us mention that learning abstract theory things also influences how we write things in a specific language. Things like the million forms of data structures and various.. ways of doing things (graph theory (that is surprisingly relevant to concurrency), touring machines vs lamda calculus, various ways of sharing state between parts of code, and.. i can't think of more now). In addition to that, how a computer works can also shape how we program. We have limitations, for example it is very important for performance to not trash the cpu cache all the time.

Now to go back to "reality"; Software is (usually) written to be used. Things that matter are what it does and its cpu/memory usage. For a program that is used to, idk, process some text once a month it doesn't matter what it is written in. But if the program is to be used daily on thousands of computers it starts to be more important that it performs well. That is my opinion at least, as the general opinion seems to be "modern computers have so much processing power and gigabytes of memory". Personally it pisses me of when a vital part of a system is hacked together so it "works", where examples are Glib usage in NetworkManager (and many others), and python (i rewrote phwmon in C because it uses too much memory and cpu, maybe some day i'l clean it and post it on github). If something is vital for day-to-day usage of a computer, then it better well not use hundreds of megabytes of memory and 100 times more cpu then it should (note that 100 times more cpu usage for many "system" things is still a tiny amount, but regardless). To paraphrase a quote that i can't find anymore "the best daemon/program is one that does its job but you don't know it is running", where an example would be dhcpcd (currently uses 196kB of memory and it used a total of 0.05 seconds of a cpu core (klogd is even better with 80kB of memory and 0.02sec cpu time)).

Then there are domain specific languages..

EDIT: I would also like to add that as much as we are different from each other, we are also much the same. And that we can change the ways we think, for better or worse (not that anything is either black or white).

bluejekyll · on Sept 6, 2017

Thanks for this post, it's got a lot of wonderful things that we can agree upon.

> I don't know why it is not common knowledge that different people have different priorities, different ways of thinking about things and, well, that they are different.

Obviously there are a lot of things that I will never understand. And people definitely have different priorities. In some cases choosing C (or insert your favorite programming language here) is chosen purely because you've built up a huge amount of experience with it and for the project you're working on you don't want learn something new. That makes a lot of sense. For me the reason Rust is so attractive is for a few reasons, and none of them are about it being zero-overhead, that's just icing on the cake. The Rust design philosophy, type system, threading model, mutability and trait system literally hit every sore spot I've encountered in my career working on distributed systems, especially around concurrency. Things that I started practicing in all of my code, were actually enforced by the compiler, this was so profound that I jumped into it whole hog. To your comment about different ways of thinking about things, it's definitely changed the way I approach problems.

> The whole memory safety argument is.. i want to say "fine" as people do make mistakes, but tools like static analyzers (aka linters) and valgrind exist.

I think back to your comment about priorities, I definitely value correctness over working code (i.e. it can sometimes take longer to get something working in Rust) and performance (I still opt for correctness and working, then go after performing). And clearly valgrind and lints aren't enough for security vulnerable software. But it does really depend on what you're targeting and why.

>> I don't "tell" people, what language to use.

> But you do. If you say to a newbie that their C project is "bad" because it is written in C, be it directly or indirectly, they will take it as if you are telling them to use another language

That's a fair criticism, and I certainly hope I've never said it directly. Here's a recent story I can relate, a friend showed me a really cool embedded system he had built; video, lens, accelerometer, etc. He had gotten it all working, but told me he had wasted an entire day on a use after free bug in his code. I had definitely been strongly encouraging him to look at Rust before this. He showed me what he'd done, it was amazing, I looked at the board type and he could have used Rust, but should he have stopped to learn Rust to do it to save him this one day? Probably not. It took me three hard weeks to learn Rust, but I pushed through and am very happy I did. Should he do this? I guess it becomes a question of how many of those days he thinks he'll run into as he builds in more features... Is it possible he's building the next IoT security disaster? Doubtful as it's a wearable... If he had already known Rust though, I am convinced he could have written the code more quickly and run into fewer bugs, like the one he related.

For the newbies getting started in C, I do think it's a great foundational language to learn. In fact, for no other reason everyone should learn it as it's currently the only reliable lingua franca FFI on most platforms out there. So it's ABI is needed for doing anything between various languages. But at the same time, Rust holds your hand in a way that C does not. It does not let you shoot yourself in the foot in ways that C does, unless you tell it to take the safety off the trigger. So, for newbies getting into systems programming, even though the language can be a little daunting initially, it increases the success rate of producing correct systems code.

> When the truth is that programming languages don't matter in most cases.

Well, clearly in some cases they do, or we wouldn't be having this conversation ;)

> If the program is to be used daily on thousands of computers it starts to be more important that it performs well.

I think we're in agreement through this entire paragraph, and this is where C really excels, but everything you list is also exactly what Rust is really good at. Again, I go back to correctness, this is a huge value of mine in my code. And it's not just the compiler in Rust that helps guarantee correctness, it's also the really excellent #[test] support that is integrated directly into the language. It was the first time I had ever seen that done, where an external testing framework wasn't needed to write tests. It saves so much time in setting up proper tooling and build files, etc. It's freeing.

It sounds like we've had experience with much the same languages, but have come out with some different priorities based on experience. And what person has the right to question another's experiences? Certainly not I.

Jtsummers · on Sept 5, 2017

That's one of the main reasons Ada lost to C and C++ in the DoD world. The other (and more significant) being cost of implementations.

Personally, I prefer my work to be in the language or languages which suit it best rather than making my choice to spite others for a perceived and, usually, non-existent slight.

iainmerrick · on Sept 6, 2017

There is no inherent benefit in C that, for example, a somewhat modified version of Pascal or Algol wouldn't have inherited.

I happen to like C and think it has a lot going for it, but you're definitely right about this. The original Mac OS was written in Pascal (and assembly), not C. And Turbo Pascal was deservedly popular for a good while. In an alternate universe, Pascal rather than C could be the incumbent legacy systems language.

WalterBright · on Sept 6, 2017

Pascal suffered because its standard was woefully incomplete, and every implementation added mutually incompatible proprietary extensions. Turbo Pascal and all the TP developed code died when DOS died.

clouddrover · on Sept 6, 2017

You can revive Turbo Pascal code with Free Pascal. Here's a port guide:

https://www.freepascal.org/port.var

pjmlp · on Sept 8, 2017

Turbo Pascal did not die, it was reborn as Delphi and only decreased its market share because many key developers ended up going to Microsoft, while Borland management decided enterprise customers were more important than indies.

speedplane · on Sept 8, 2017

What year did Borland make that call? I remember my HS programming class in the late 1990s used a Borland compiler and development environment. I don't recall it being "enterprisey" at the time. That being said I was in HS, and likely associated the word "enterprise" with starships more than big companies.

pjmlp · on Sept 8, 2017

When they re-braded themselves as Inprise, 1996.

https://en.wikipedia.org/wiki/Borland

speedplane · on Sept 16, 2017

Hmm... I first touched a Borland compiler in the 1998-99 school year. I had programmed before that, but it was my first time programming a GUI with the Win32 APIs. I made a Battleship game with a very rudimentary AI so you could play against the computer. Not sure if Borland was good or evil at the time, but those memories will certainly endear me to them.

jmull · on Sept 6, 2017

You have to wonder, though, why the enduring big dogs of OSs are written in C.

Classic Mac OS was written in Pascal, but it had to be scrapped and replaced.

iainmerrick · on Sept 6, 2017

I remember using System 7, which I later learned was the result of a big rewrite of much of the OS in C++ rather than Pascal. It had some nice features, and the rewrite was probably a wise decision for maintainability and future expansion, but compared to System 6 it was sloooooooooow

pjmlp · on Sept 8, 2017

Pressure from customers coming from UNIX, and Apple trying to get into UNIX with A/UX.

lomnakkus · on Sept 6, 2017

It's pretty simple: C was already popular -- meaning: at least decent-ish compilers for most platforms.

WalterBright · on Sept 6, 2017

C got popular because it turned out to be ideally suited for DOS, which was by far the dominant target for programmers for over a decade.

For example, C could deal with near/far pointers. No other language could. Early C implementations were also usable on DOS, early [other language] implementations were unbelievable unusable, and believe me I tried.

There were many diverse, usable, and cheap C implementations available.

Koshkin · on Sept 6, 2017

The popularity of C had absolutely nothing to do with DOS. The most popular languages on early PCs were BASIC and Pascal. The popularity of C is linked to UNIX, from where it was brought to DOS. The far/near pointers did not play especially well with C view of pointers as integers of some fixed size. The 32-bit flat memory model made C programming somewhat closer to what it used to be in UNIX.

WalterBright · on Sept 6, 2017

> The popularity of C had absolutely nothing to do with DOS.

You might want to read popular programming magazines from the early 80s, and the attention given to C on DOS. (It was enormous.) At one time, I counted 30 C compilers available for DOS. What other platform came remotely close to that?

> The most popular languages on early PCs were BASIC and Pascal.

BASIC was indeed popular, but generally not for professional programming. Pascal had nowhere near the penetration of C in the early days (1982, 1983, etc.). Microsoft Pascal 1.0 was unusable, the top C compiler was Lattice C.

> The popularity of C is linked to UNIX, from where it was brought to DOS.

Unix was nowhere remotely as popular as DOS.

> The far/near pointers did not play especially well with C view of pointers as integers of some fixed size.

As a DOS C compiler writer, I can attest that near/far mapped very well onto C semantics. The C Standard in 1989 was very careful to not upset that.

> The 32-bit flat memory model made C programming somewhat closer to what it used to be in UNIX.

That was much later. But since you brought it up (!), 32 bit DOS extenders were in wide use on DOS, and were programmed with C. I don't recall Pascal ever existing on them, but perhaps I misremember. C was also popular on the 16 bit DOS extenders, I don't remember Pascal on those, either.

pjmlp · on Sept 8, 2017

As mentioned on some other thread, there was hardly any C being used outside UNIX during those days in Portugal.

It was all about Pascal, Basic, Assembly and of course xBase notably Clipper.

camus2 · on Sept 5, 2017

> For better or worse (personally, I think, for the better), they're starting with JavaScript, Python, Ruby, or even PHP.

These higher level languages are still built with C(Python,PHP) or C++(V8 Javascript engine) though, so C is still a language of choice for that task, and writing a library in C allows integration with all these high level languages at very little cost. So there is still incentive for people using higher level language to write C. I mean, PHP is useless if the goal is to share code with Python or Javascript.

pcwalton · on Sept 5, 2017

There's an incentive to learn assembly language, too. C is useless if the goal is to, for example, write constant-time crypto implementations.

I'm not saying there is no reason for anybody to learn C, just that most programmers aren't learning C anymore.

kahlonel · on Sept 6, 2017

Did you even the read the article? How are you comparing JS, Python (or any language "gaining popularity" in your metrics) to C, when the author clearly describes the set of problems only solvable in C?

pcwalton · on Sept 6, 2017

The point is that those problems are (a) not only solvable in C; (b) not problems that most programmers need to solve. That is why C and C++ are declining in usage.

Koshkin · on Sept 6, 2017

Are they, really?

emmelaich · on Sept 6, 2017

A "somewhat modified version of Pascal or Algol" would be just C with a different syntax.

C is remarkable for what it doesn't have. It sits at a nice local optimum point that a portable assembler must occupy.

panic · on Sept 6, 2017

Is such a safe implementation of C really suitable for systems programming, rather than merely application programming? If we understand system-building as communicativity, then certainly such a system retains communicativity—so long as alien objects can be described to it in a manner sufficient for dispatching the same dynamic checks. If I memory-map a file, say, I can safely access that memory only if the structure and meaning—the bounds and the types, roughly—are described much like those of other in-memory objects. Tools and systems for providing these descriptions are currently lacking—but are a logical extension of the runtime type information already developed in recent work. In the case of file formats, some cases like the ELF example we saw earlier (§5.5) show that the format has already been defined for us, thanks to the manifest layout of objects declared in C.

This is a key point. There are scattered systems for describing the layout of arbitrary binary data—C structs/unions, Erlang binary patterns, ASN.1 Encoding Control Notation, Kaitai Struct[1]—but nothing has really caught on across language boundaries. It's hard not to feel this data format barrier when you're using a C API from another language. We'll need to do something about this barrier if we want a true multi-language system (not just a bunch of awkward C FFIs).

[1] http://kaitai.io/

haneefmubarak · on Sept 6, 2017

Certainly, but for instance, take one of your examples: Kaitai Struct. It doesn't have support for C (at least it's not listed among the languages on its homepage). OTOH, for more complex payloads I've often seen Protocol Buffers used (yes, I know they don't have native C support either but there's lots of good libraries for using `protobuf`s with C).

The thing with FFIs is that above all we want them to be fast and simple. C rules for laying out structs generally means no parsing necessary, with direct access to fixed offsets for everything you want. If you're ever having problems figuring out the layout of a struct, it's relatively straightforward to just dump some simple load/store code into a compiler and have a look at what it does (assuming you can understand assembly at a basic level): https://godbolt.org/g/khGPWA

chubot · on Sept 5, 2017

Related: "Safe Systems Software and the Future of Computing by Joe Duffy" at RustConf 2017.

https://www.youtube.com/watch?v=CuD7SCqHB7k

I summarized this excellent talk here [1], but one of the main points is that compatibility with existing systems is important for adoption. (They learned that the hard way -- by having their entire project cancelled and almost everything thrown out.) He advocates unit-by-unit rewrites rather than big-bang rewrites, just like Kell does in this conference article.

And compatibility with C in Windows should be easier than it is in the Unix world, because the whole OS is architected around a binary protocol AFAIK -- COM.

My sense is that Rust may not have thought enough about compatibility early in its life. Only later when they ran into adoption problems did they start talking more about compatibility.

Also, it seems Rust competes more with C++ than C, and there seems to be very little attempt to be compatible with C++ (although perhaps that problem is intractable.)

Personally I don't think Rust will be a successful C replacement. It will have some adoption, but the Linux kernel will still be running on bajillions of devices 10 years from now, written in C. And in 20 years, something else will come along to replace either C or Linux, but that thing won't involve Rust.

[1] https://www.reddit.com/r/ProgrammingLanguages/comments/6y6gx...

pcwalton · on Sept 5, 2017

> My sense is that Rust may not have thought enough about compatibility early in its life. Only later when they ran into adoption problems did they start talking more about compatibility.

Of course Rust thought a lot about compatibility with C in its early days. I remember fast FFI was in Graydon's very first presentation about the language in 2010. Almost everything about the language changed, but that focus did not.

> Also, it seems Rust competes more with C++ than C, and there seems to be very little attempt to be compatible with C++ (although perhaps that problem is intractable.)

Rust has gone pretty far in wanting to be compatible with C++, with the C++ stuff added to bindgen for Stylo. We've gone further than most other languages. It's not fair to say there's been "very little attempt": we literally couldn't have shipped Stylo to Nightly Firefox without doing the work to bridge C++ and Rust.

From your other post, it seems that one of your main complaints is that Cargo exists instead of having Rust use Makefiles. All I can say is that the reaction to Cargo from Rust programmers is overwhelmingly, almost universally positive, and abandoning Cargo in favor of Makefiles would instantly result in a fork of the language that would take Rust's entire userbase. Not solving builds and package management is not a realistic option for a language in 2017.

wahern · on Sept 6, 2017

Following the logic of the article, Rust has made the exact same mistake every other language has made, which is to conceptualize compatibility with the C ecosystem has merely an issue FFI. Rust is hardly the first language to focus on easy FFI from day 1, but according to the article that's not nearly sufficient. And like most other modern so-called systems language, Rust hasn't gotten around to committing to a stable, exportable ABI. In fact, I think much like Go the general sentiment is that this is largely undesirable, as stable ABIs can cripple evolution of the implementation, especially those that rely on sophisticated type systems.

pcwalton · on Sept 6, 2017

> And like most other modern so-called systems language, Rust hasn't gotten around to committing to a stable, exportable ABI.

That's not true. The C ABI is stable and exportable, and you can opt into it on a per-function basis. We do that for integration with existing projects all the time.

Again: All of you are talking as though the idea of integrating Rust into a large C++ project is some far-fetched theoretical idea, and that we made some obvious mistakes that make this goal impossible. In fact, we're shipping an integrated Rust-C++ project today: stable Firefox, used by millions of users.

wahern · on Sept 6, 2017

I'm not arguing that it's too difficult integrate Rust with C or C++ projects. I'm simply trying to get at the distinctions that the article is making, which are rather subtle.

One aspect of Rust that fits well, IMO, with the characteristics the article argues are under appreciated is its emphasis on POD--objects as compact, flat bytes. That puts Rust much closer to achieving what C does best (again, according to the article), which is first-class syntactic constructs over memory--namely, pointers. But it falls short in the sense that to _export_ Rust objects (rather than import alien objects into Rust) you have to do so explicitly. And presumably the author would argue that Rust is significantly undervaluing the benefit of a stable ABI that would allow other applications to import Rust objects without an explicit language-level construct (i.e. explicitly annotating APIs with no_mangle).

Obviously when you're building a large application, cathedral style, the requirement to explicitly annotate is not only less burdensome, but quite useful (for many reasons). But in a larger, heterarchical ecosystem of software, that's actually quite limiting. Our first instinct is to argue that permitting such unintended peeking behind the curtain is dangerous and unnecessary, but the article speaks directly to that.

Imagine a Rust with a stable ABI that was exported via Sun's CTF format. CTF is like DWARF but much simpler (and thus little incentive to strip it), and it's being integrated into both OpenBSD and (I think) FreeBSD to facilitate improved dynamic linking infrastructure. Rust could even, theoretically, continue randomizing member fields. And this data could be consumed by any language's toolchain, not simply Rust's toolchain. That sort of language-agnostic, holistic approach to interoperability is largely what I think the article is getting at.

pcwalton · on Sept 6, 2017

I'd be all for a standard language agnostic ABI. I'm not on the language design team anymore, but I suspect you wouldn't have any trouble convincing them to get on board with such a thing either. The ones you'd need to convince would be the C++ folks, I suspect :)

chubot · on Sept 6, 2017

Yes, that is what I was referring to. Calling sin() is not enough. It's messy but C programs need more than that.

And I was also referring to the similar issue in Go where calling C -> Go and Go -> C isn't symmetrical. Not sure if that's true for Rust or not.

pcwalton · on Sept 6, 2017

> It's messy but C programs need more than that.

Of course they do. That's why Rust has a sophisticated tool, bindgen, which is used in production right now in Nightly Firefox (among other places) to export complex C++ interfaces in both directions across the language boundary.

> And I was also referring to the similar issue in Go where calling C -> Go and Go -> C isn't symmetrical. Not sure if that's true for Rust or not.

It's not. You just write "#[no_mangle] extern" on your function Rust and C can easily call it, with a stable ABI.

In order to meaningfully criticize Rust's FFI, you need to be aware of how it works.

chubot · on Sept 6, 2017

Well, just saying it has fast FFI doesn't tell me much. Being able to wrap something like sin() was in Python 1.0, but most applications need more help than that. There have been 5+ popular systems since then trying to make the experience better... it still is barely solved.

That said, I admit I'm more on the pessimistic side. Having touched Go before it's open source release in 2009, I didn't think they thought enough about integration either. I think it was worse than Rust, because you couldn't call Go from C or C++, unless the main program was in Go.

Also their build system isn't used inside Google. And they do nontrivial stuff with signals and threads.

But Go seems to be being adopted. However there is an important distinction: Everybody is rewriting new versions of Google-style servers in the open source world. But all the stuff at Google is still in C++.

So I think nobody ever rewrites old software. They write new versions of similar things, and then hopefully those new things get adopted. But the old thing will probably be around for a long time too.

And to be fair C didn't replace Fortran or Cobol either -- scientific applications still use Fortran and old banks (apparently?) still use Cobol on mainframes.

Maybe that's the most you can expect. But in that case there still does need to be a "plan" for making existing C code like the Linux kernel and OpenSSL safer. I think my issue is that some people apparently think that plan involves Rust when it doesn't. Maybe the core team has never pushed that idea but some other people seem to be under that illusion.

-----

This is a different argument, but a language only needs to "solve" package management if it always assumes it has main(). I was looking for something more humble that you could stick in a file in an existing C or C++ project, e.g. for writing as safe parser.

Also the 5+ different Python + C/C++ solutions now need a Python + Rust analogue. For a language at the Rust layer, there's this O(m*n) problem or strong network effect to deal with.

Actually that was thing I was thinking while reading this PDF -- a lot of it can be boiled down to "C and C++ have network effects". Particularly C++.

Asking Rust to break the network effect is like asking Apple to break the Windows monopoly with Mac OS X. That didn't happen -- they built the new thing iOS, and beat Windows with that. So then the question is if Rust is more like OS X or iOS.

pcwalton · on Sept 6, 2017

> So I think nobody ever rewrites old software. They write new versions of similar things, and then hopefully those new things get adopted. But the old thing will probably be around for a long time too.

That's very true. The most we can hope for is that Rust and other languages, such as Go and Swift, continue to chip away at the market share of C and C++. It'll be a long process.

I'm not a "rewrite everything in Rust" booster; as much as I would like to, that won't realistically happen. Instead, I see Rust as another player in the "programming language Renaissance" that has been going on since the mid-2000s. C and C++ are losing their dominance and instead are becoming part of a broad ecosystem of languages. And that's great: the fact that we have so many choices in languages now has been a very good thing for productivity and security.

> Actually that was thing I was thinking while reading this PDF -- a lot of it can be boiled down to "C and C++ have network effects". Particularly C++.

I agree. That's why I think this paper overanalyzes the success of C and C++. They became dominant because of network effects: simple as that.

wahern · on Sept 6, 2017

I think the article helps to explain why C was able to leverage network effects so well. Neither C nor Unix came out of the gate in a dominating position. Indeed, it's arguably only in the past 20 years that it clearly dominated. Fortran, Pascal, and a bevy of other languages were at times much more widespread and influential. Even today C isn't the most used language. And yet it's influence continues to be outsized.

C isn't just a language, it's an entire ecosystem of toolchains and software that facilitate network effects. "Chance" is far too convenient an explanation. No doubt chance had a significant role, but if C were as useless, unsafe, and devoid of redeeming qualities as many people argue, then I don't see how C could have benefited so strongly from network effects.

pcwalton · on Sept 6, 2017

C wasn't useless and unsafe at the time it became dominant. It was quite state-of-the-art at the time. We've just learned more about what works well and what doesn't in programming languages since 1978, which is why C is no longer as dominant as it once was.

ktRolster · on Sept 6, 2017

It was quite state-of-the-art at the time.

Most of the criticisms you hear about C today (type safety, memory safety, no garbage collection) were criticisms that C got when it was first invented.

In fact there are fewer criticisms of C than there used to be: a lot of the early criticism centered around syntax, but the C syntax kind of won, so you don't hear that anymore.