Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> (It's also a good example of how pointer-arithmetic and nul-terminated strings should be used, but the less said about that, the better).

Why the less said the better?

I'm always disappointed when people say C is bad at string manipulation. The "character at a time, no allocations needed, stop when you hit the end" style demonstrated here and elsewhere have always resonated deeply with me.



Roghly speaking, I don't want to start a flame war by defending pointer-arith. The haters are right that languages could use other, safer features (like array slicing).

But if it is how your language does things, then use it, and use it well -- as in this example. Also I find counted strings preferable for most tasks, but nul-terminated have the advantage of working so nicely with the ptr++ operation.

There was real insight behind C's minimalistic string handling. And C's minimalism is has always been its best feature. But today's minimal languages should support strings explicitly.


> Also I find counted strings preferable for most tasks, but nul-terminated have the advantage of working so nicely with the ptr++ operation.

Is this:

    while (foo != '\0') {
        *bar++ = *foo++;
    }
really that much nicer than this:

    while (i != len) {
        bar[i++] = foo[i++];
    }
?

Or even this:

    for (; i != len; i++) {
        bar[i] = foo[i];
    }
?

Don't do this, though:

    while (i != len) {
        bar[i] = foo[i++];
    }
(that's the road to undefined behavior... better watch out for them sequence points!)

> There was real insight behind C's minimalistic string handling.

There really wasn't, though. The simplicity of C's string handling is deceptive. It's full of nuances and inadequacies when used the way the Good Book* describes, and that's the way it's usually used. You're really better off rolling your own string abstraction, or using some library such as bstrings, or even compiling as C++ and using std::string instead (if your C happens to otherwise compile as C++, that is).

* K&R


"Is this: while (foo != '\0') { bar++ = foo++; } really that much nicer than this: while (i != len) { bar[i++] = foo[i++]; }"

Not visually, but the latter increments i twice in every loop :-). It also typically uses an extra register and requires an explicit 'compare with zero'. Both mattered more 40+ years ago.


> the latter increments i twice in every loop :-).

D'oh! Good catch. That's what I get for coding without testing at 3am... Just realized `foo != '\0'` should be `* foo != '\0'`, too. Derp.

> It also typically uses an extra register and requires an explicit 'compare with zero'.

Okay, I see the extra register, but I'm lost on any extra comparisons versus the pointer-arithmetic version. The pointer version still has to test for the nul character anyway, right?

> Both mattered more 40+ years ago.

Indeed, but I'm talking about today. We can now afford the safety that comes with using explicit lengths rather than sentinels. (I'm inclined to think Dennis Ritchie could've afforded it, too, considering that Pascal (among others) was using explicit lengths for strings (and indeed all arrays) around the same time C was invented, and did so on similar hardware to boot. Bounds checking is pretty damn cheap, all things considered, even on a PDP-11 (the PDP-10, while not exactly a predecessor to the PDP-11, ran TeX written in WEB / Pascal, and was favored by some Lisp hackers as practically a Lisp machine (in fact, Lisp was the first thing DEC brought up on the PDP-10, or so I'm told...), so I think it could afford bounds checking, at least).


"The pointer version still has to test for the nul character anyway, right?"

Some architectures set the zero flag on memory loads. I think the PDP is one of them. Google gives me https://groups.google.com/forum/m/#!topic/net.lang.c/LKAYbzX..., which claims that

    while (*ra++ = *rb++);
    
compiles to

    L16:
        movb        (r10)+,(r11)+
        jeql        L17
        jbr         L16
    L17:
If you do i != len, you need an extra comparison instruction.


Ah! Interesting. Thank you :)


> I'm always disappointed when people say C is bad at string manipulation. The "character at a time, no allocations needed, stop when you hit the end" style demonstrated here and elsewhere have always resonated deeply with me.

The main reason is that C doesn't really offer many primitives for string manipulation beyond pointer manipulation. All you really get is the stuff in `string.h`, which isn't much, and most of those functions are deeply flawed in one way or another:

* strcpy() can easily lead to buffer overflows if the destination array is not large enough to hold the entire string

* strncpy() will not nul-terminate the destination string if the size of the source string exceeds n

* ditto for strcat()/strncat()

* strtok() is not re-entrant, and therefore not thread-safe. The C11 standard is explicit that strtok() is not required to avoid data races with other calls to the function.

* You're not allowed to modify the result of strerror(), yet the function returns a regular ol' char* and, as such, the type system can't prevent anyone from doing so.

And perhaps the biggest problem of them all:

* strlen() potentially diverges (i.e., it can loop indefinitely or overflow the buffer, which causes undefined behavior[0]), making it literally impossible to reliably validate a C string in standard C without knowing the length of that string a priori. It is also the case that the length of a string is not necessarily equal to the size of the buffer it is stored in, which means you now potentially have a third thing you need to know a priori (the buffer size).

The small size of C's string library means you get to roll a lot of your own string functions, including the kinds of primitives you'd expect. Even QBasic has a larger (and arguably better) string library. Alternatively, you can just forgo the functions and write loops through strings everywhere, but then as soon as you want to support some other character encoding (or multiple character encodings), you're in for a boatload of fixes throughout your codebase.

The nature of C strings being represented as arrays of characters rather than an abstract string type means that it brings you the joy of manual memory management everywhere. You need to constantly be allocating space for strings, deallocating buffers, keeping track of ownership, copying strings between buffers, etc. Of course, since the length of the string isn't included in the string itself, if you want to use the safe string-handling functions, you need to keep track of the lengths separately. This usually means returning a `size_t` for the length of the string, while the output of the string functions goes to a pointer ("sorta pass-by-reference") "out" parameter.

Strings that rely on sentinel values are themselves pretty flawed. Continuing the note above, it's safer and more convenient to use strings that are somehow bundled with their length. With sentinels only, reasonable things like nul-padding the strings for alignment purposes can also be a pain in the ass, since C considers the end of the string to be the first nul character it encounters. An abstract string type would allow you to store the length of the string and the size of the buffer in the same package as the pointer to the buffer (and possibly even the buffer itself). This can have advantages other than easier handling of padding as well, allowing you to grow or shrink the string within the same buffer without (re)allocating memory and can potentially save some copying. If you can sacrifice a byte, you can retain interoperability with functions expecting C-style strings while enabling better string handling by including the nul-terminator in addition to bundled the length (Lua does this, for example). A common complaint is that this places an upper bound on the length of a string, whereas a sole sentinel value does not. This does not have to be the case if a varint[1] is used to hold the length.

C strings are also woefully inadequate for dealing with character encodings larger than 8 bits, but that's another can of worms.

In short, C's standard strings are fine, really, but they could be so much better with a touch of abstraction. Fortunately, it's possible to make some improvements from within C itself, but unfortunately that's often a mild pain in the ass, and not all programmers bother to do it.

---

[0]: Of course, once your program enters undefined-behavior territory, it's off the chain and can do anything it wants, regardless of what you intended it to do, even if your program is 100% correct in cases where there is no undefined behavior.

If you're lucky, a buffer overflow will cause the program to crash and die immediately, due to segfault or something. If you're unlucky, it'll go haywire and run amok. If you're really unlucky, it'll continue to operate with no outward signs of problems, silently doing something horrible and nasty without any indication to the contrary.

[1]: https://en.wikipedia.org/wiki/Varint


strncat() is not like strncpy() - it always nul-terminates the result.

I think your strlen() issue is a bit overblown. Either you get the string from a function that explicitly produces valid strings, in which case you know that it's a valid string already; or you get it from a raw bunch of bytes which tends to come with a size attached, giving you all the information you need to either check if it's a valid string or append a nul-terminator yourself.


> strncat() is not like strncpy() - it always nul-terminates the result.

Yes, you're right. That's what I get for trying to speed-read the C standard at 3am...

> I think your strlen() issue is a bit overblown. Either you get the string from a function that explicitly produces valid strings, in which case you know that it's a valid string already; or you get it from a raw bunch of bytes which tends to come with a size attached, giving you all the information you need to either check if it's a valid string or append a nul-terminator yourself.

Usually, yes, but neither of those things are guaranteed, unfortunately, and thus we wind up with the perennial buffer overflow exploit.


The issue of strtok() being horrible hack is easily overcame by using str(c)spn() (and optionally bunch of pointer arithmetic).


Of course. All of the issues with C's string library can be overcome rather easily, if only because it's so very basic.


The problems you describe are all well understood to C programmers and hence not really a huge issue. You are severely overstating their difficulty.


I don't mean to imply that they're difficult to overcome. I'm just trying to explain why people often say that C has poor string handling. Many modern languages provide more comprehensive string libraries, and the ones that are provided tend not to suffer from such problems. It's not that C is incapable of doing string handling (it's capable, yes), but when you constantly find yourself tossing out C's string library altogether and rolling your own instead, it becomes obvious that C's provided string handling functionality is inadequate for many use cases.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: