Hacker News
Writing a custom iterator in modern C++ (internalpointers.com)
54 points by signa11 on Dec 21, 2020 | 76 comments



The classic iterator categories (input, output, ...) discussed in this article have been deprecated in C++20, and replaced with new concepts. I would expect a post from 2020 claiming to be about iterators "in modern C++" to be specifically about that change, but actually it's just about the legacy categories, so this is a pretty worthless article.


On the contrary, this article's approach works with almost any modern C++ compiler out there.

An article using C++20 will just lead to frustration, as there are still rough edges using concepts, with a standard that was just published last week.


Sure, this is the technique that you want to use for programs that you write today. But given that it's using the iterator concepts that have existed for decades, that will be deprecated soon if not now, why claim it's "modern C++"? The only thing I could see in the whole article that wouldn't work in C++98 was range-based for loops. At least they didn't use std::iterator, which was deprecated in C++17.

The title is misleading marketing: It gives the impression you need this nice new blog post instead of all the old out of date ones, and that it's all you'll need to know for the foreseeable future.


Well, regarding "modern C++", I have seen that moniker applied to even language features that were already available in pre-ISO C++98, like RAII, or type safe collections.

Apparently too many people still keep using C++ compilers to compile C code, not surprising when there are schools still using Turbo C++ for MS-DOS as a teaching tool, as per one of Bjarne's talks.

Back to C++20: C++ is no longer my main tool, but I still follow it, and one of my hobbies is checking new standard features. Hardly anything works as it is supposed to, and I expect the usual three years for stabilization, so when C++23 is around the corner is when I expect all major compilers to fully support C++20 without triggering some kind of compiler error.


> regarding "modern C++", I have seen that moniker applied to even language features that were already available in pre-ISO C++98, like RAII, or type safe collections.

That's not surprising as modern C++ predates C++11. Newer standards "merely" make it more convenient to practice it.

At the end of the day, modern c++ is really about emphasizing value types, RAII and parametric polymorphism as opposed to OOP.

Anyway, these days we have it good: conformance to new standards comes relatively quickly. It took forever for most compilers to be C++98 conformant (and technically most never reached it).


It is mentioned in the article early on:

"The first thing to do is to assign the iterator some properties. Until C++17 this is done by tagging it with the tag dispatch mechanism, while C++20 uses concepts: in this article I will follow the traditional approach."
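For concreteness, here is roughly what that traditional tagging looks like: a minimal forward iterator carrying the five member typedefs that pre-C++20 iterator_traits expects (IntIter and sum3 are illustrative names, not from the article):

```cpp
#include <cstddef>
#include <iterator>

// Minimal forward iterator over an int array, tagged the traditional
// (pre-C++20) way with the five member typedefs the STL looks for.
struct IntIter {
    using iterator_category = std::forward_iterator_tag;
    using difference_type   = std::ptrdiff_t;
    using value_type        = int;
    using pointer           = int*;
    using reference         = int&;

    int* p;
    reference operator*() const { return *p; }
    IntIter& operator++() { ++p; return *this; }
    IntIter operator++(int) { IntIter t = *this; ++p; return t; }
    bool operator==(const IntIter& o) const { return p == o.p; }
    bool operator!=(const IntIter& o) const { return p != o.p; }
};

// Sum the first three elements through the custom iterator.
int sum3(int* a) {
    int s = 0;
    for (IntIter it{a}, end{a + 3}; it != end; ++it) s += *it;
    return s;
}
```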


Honest question: what does it mean for something to be "deprecated" in (specifically) the c++ language spec?


The spec itself defines it as:

Normative for the current edition of the Standard, but having been identified as a candidate for removal from future revisions

This is a very blunt tool. In practice, a feature may be deprecated if it is:

1. Legacy inherited from C that serves no modern purpose. The register keyword.

2. Features hardly anybody implemented. Exported templates are the only example I am aware of. Like falling off the edge of the map.

3. Features that proved to be a bad idea, like incrementing a bool.

4. Features that proved to be wholly inadequate, like codecvt for dealing with Unicode.

5. Features that are reasonably new in C++ but obviated by new stuff. You don't need std::bind if you have lambdas.
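Point 5 in miniature (add, via_bind, and via_lambda are made-up names for illustration): the lambda does everything the bind expression does, with no placeholder machinery.

```cpp
#include <functional>

int add(int a, int b) { return a + b; }

// The deprecated-in-spirit way: bind the first argument, leave a placeholder.
int via_bind() {
    auto f = std::bind(add, 40, std::placeholders::_1);
    return f(2);
}

// The obviating way: a lambda captures the same intent directly.
int via_lambda() {
    auto f = [](int x) { return add(40, x); };
    return f(2);
}
```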

This is my outsider's perspective; please chime in if you have standards committee insight.


strstream was adopted into C++98 with conflicting implementations, so was deprecated at the outset. You are warned not to use the conflicting parts in code meant to be portable.

strstream has not been removed, yet, because spanstream is not in yet. Until that is in, there are certain things only strstream can do: specifically, attach to an existing buffer you need to parse things out of, or to format things into.


Thanks for the explanation. Given my impression that the standards committee places very strong emphasis on backwards compatibility, I didn't expect there to be a policy to allow for breaking changes.


> I would expect a post from 2020 claiming to be about iterators "in modern C++" to be specifically about that change (...)

...and yet here I am, still struggling to get some of my employer's C++ projects to build with C++14.


Somewhat related but C++20 ranges do not replace iterators and instead build a higher level of abstraction upon iterators. Ranges are implemented in terms of iterator pairs.


Yeah. That actually sucks; often iterators don't fit the iteration model of some sequences, but ranges would fit right in if they didn't require separability into two iterators.


Since C++20, the end iterator doesn't have to be the same type. Instead it can be something like a tag with no runtime data of its own, and comparing with it can be implemented as asking the other iterator if it is 'out of data'.
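A minimal sketch of that idea (CountDown and EndTag are illustrative names; range-based for with differing begin/end types needs C++17 or later):

```cpp
// The "end" is an empty tag; comparing against it asks the real
// iterator whether it has run out of data.
struct EndTag {};

struct CountDown {
    int n;
    int operator*() const { return n; }
    CountDown& operator++() { --n; return *this; }
    bool operator==(EndTag) const { return n == 0; }
    bool operator!=(EndTag) const { return n != 0; }
};

struct CountDownRange {
    int start;
    CountDown begin() const { return CountDown{start}; }
    EndTag end() const { return EndTag{}; }  // zero-size sentinel
};

int sum_countdown(int start) {
    int s = 0;
    for (int v : CountDownRange{start}) s += v;  // start + ... + 1
    return s;
}
```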


Yeah that's definitely an improvement in some respects, but (a) generic code still has to deal with 2 iterators, (b) generic code now has to allow them to be different types (so old code will have to be rewritten), (c) it doesn't change the fact that iterators are still treated like cheap objects (generalized pointers) and hence the begin iterator is still potentially expensive to store and copy around regardless of what you do with the end iterator, and (d) it's still a fundamental model mismatch relative to the actual problem, just not quite as unpalatable as before. Ranges fundamentally change the calculus around iteration.

P.S. (e) Is this sentinel-based approach even composable? If your iterator needs to use other iterators underneath, what do you do? You have to store a 'full' iterator anyway... or now you waste even more space and time storing a discriminated union. The fact that a range represents the whole sequence in one object instead of two makes composition intuitive and trivial.


Regarding the comment on discriminated unions, I've actually found the opposite with the new ability to have separate types for sentinels. Before, since both iterators had to be the same type, the end iterator needed some special casing to know it was the end. Allowing it to be a separate type meant all the logic for traversing the underlying sequence could be in the iterator type, and the sentinel could be an empty struct like a tag type used for comparisons only.


The iterator and sentinel do not need to be the same type, they just need to be equality comparable. If you wanted to represent an infinite range you could implement it in terms of a regular iterator and an empty sentinel type to which the regular iterator never compares equal.


I'm not sure why you're repeating the comment I replied to, but this is exactly what I replied to in my comment.


It isn't a sentinel, is it? He said no runtime data, so it sounds like some kind of empty tag for the type system. I'm not sure if it is related to what he is describing, but C++20 also added [[no_unique_address]] for something to do with truly empty members.


Distinguishing two members of a union with... 0 bytes of extra data? Sounds like magic.

Not 100% sure about the terminology. Pick whatever word might fit it better. I just called it a sentinel.


I thought he was saying the `end` in range begin/end was now allowed to be some empty signifier to tell things to look for the null pointer (null is a sentinel, but `end` itself is now empty). That may not be right, I don't know enough about the new concepts stuff and range seems to be part of that.

edit: looking at this it maybe covers it:

https://foonathan.net/2020/03/iterator-sentinel/


Yes but I'm saying this is kind of moot when you're writing a struct that needs to store the iterators, because it can't get away with just storing the 'begin' -- you can't guarantee the second iterator you're storing will actually be the end. Meaning this iteration model isn't efficiently composable.


Why would you need a union?


Write me some code that shows me how to compose iterator-based ranges with just 1 iterator's worth of space in each range.


It is not so clear to me what you mean. Maybe you can point to a better version (maybe what some other language does), and I can explain how I think the same can be done with C++20 ranges. I believe ranges are strictly more powerful than what other languages provide, and often they can be slimmer/more efficient for the common case.


Can you give an example of such a sequence?


Any kind of generator is basically like this. Imagine one for listing the contents of a directory... maybe it looks like this:

  class DirectoryChildren
  {
    void *dir;
    void *buffer;
    size_t buffer_size;
    size_t buffer_offset;
    ...
  public:
    bool next(FileInfo *info);
  };
Bear in mind the buffer only represents a chunk of intermediate results fetched from the OS; when we reach its end, we have to request more data from the OS.

If you had to make an iterator for this, what would you do? You'd basically need to either (a) create iterators that at least duplicate the entire state of DirectoryChildren on the stack, or (b) make iterators allocate on the heap. Both of these suck, because (1) iterators are passed around and copied all over the place on the assumption that these are fundamentally cheap operations, which is an assumption that easily breaks (like here), and (2) the iterators fundamentally do not have any meaning or utility as separate entities from the range itself.
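One way to sketch the trade-off (all names here are made up, and this is only one possible design): keep the iterator cheap by storing just a pointer back to the generator plus the cached current element, at the cost of tying every iterator's validity to the range object itself.

```cpp
// Hypothetical generator in the style described above: next() returns
// false once the sequence is exhausted.
struct Squares {
    int i;
    int limit;
    bool next(int* out) {
        if (i >= limit) return false;
        *out = i * i;
        ++i;
        return true;
    }
};

struct SquaresEnd {};  // empty sentinel

// Input-style iterator: one pointer to the generator plus the cached
// current value, so copies are cheap but only valid while the
// generator lives.
struct SquaresIter {
    Squares* gen;
    int current;
    bool done;

    explicit SquaresIter(Squares* g) : gen(g) { done = !gen->next(&current); }
    int operator*() const { return current; }
    SquaresIter& operator++() { done = !gen->next(&current); return *this; }
    bool operator!=(SquaresEnd) const { return !done; }
};

int sum_squares(int limit) {
    Squares g{0, limit};
    int s = 0;
    for (SquaresIter it(&g); it != SquaresEnd{}; ++it) s += *it;
    return s;  // 0 + 1 + 4 + 9 + ...
}
```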


This functionality already exists in the C++ standard library as std::filesystem::(recursive_)directory_iterator, with the default constructor creating a sentinel value. This is solution (1), due to the restriction that prior to C++20 iterators were required to be the same type. C++20 relaxes this restriction and allows the sentinel to be any type as long as it is equality comparable with the other iterator. Since on *nix systems readdir is used under the hood, the sentinel can simply be an empty type that compares equal only when the actual iterator's last cached dirent pointer is NULL.


Why do you keep repeating the same comment? Are you reading my replies? I already addressed exactly what you're talking about: https://news.ycombinator.com/item?id=25492756


There's six different kinds of iterator?!

Off hand, is it just me or...

- Isn't an iterator that outputs data to its recipient container not an iterator, but rather some type of filling API?

- Isn't a random access iterator not an iterator but rather a slice view?

- Isn't a contiguous iterator a leaky abstraction in that an iterator is just an interface that shouldn't expose how the data backing it is stored?

This iterator kindedness feels very C++ in the worst way possible. IMO the only iterator type that makes any sense is an "Input Iterator" and everything else is some sort of hot mess of inconsistency and, uh, C++.

Even with enough time I'm not sure I could design a worse API.


> Isn't an iterator that outputs data to its recipient container not an iterator, but rather some type of filling API?

Sure, but it makes sense for this api to have the same interface as an iterator so that existing iterators can be used without an adapter.

> Isn't a random access iterator not an iterator but rather a slice view?

It is an iterator that supports additional functions. You can use a random access iterator with functions that take output iterators, forward iterators, multipass iterators, etc. If you had unrelated concepts for each use case the system would be more complex.
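To illustrate, the same std::advance call accepts both categories and does the right thing for each: it walks the list iterator hop by hop, but jumps the vector iterator directly (the two helper functions are illustrative):

```cpp
#include <iterator>
#include <list>
#include <vector>

// Bidirectional iterator: advance steps node by node, O(n).
int third_of_list(const std::list<int>& l) {
    auto it = l.begin();
    std::advance(it, 2);
    return *it;
}

// Random access iterator: the same call is a single O(1) jump.
int third_of_vector(const std::vector<int>& v) {
    auto it = v.begin();
    std::advance(it, 2);
    return *it;
}
```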

> Isn't a contiguous iterator a leaky abstraction in that an iterator is just an interface that shouldn't expose how the data backing it is stored?

Seeing through abstractions and specializing algorithms is what makes STL-style generic code efficient.

And it is not inconsistent at all. Stepanov has written in great detail about the theoretical foundations of his design.


Iterators are inconsistent, but this plays to the strengths of C++'s duck-typed generics. The same code can either copy or insert, depending on whether you use std::begin or std::back_inserter; this isn't principled but it's powerful.

The "six types of iterators" overstates it. You don't need to sweat those details unless you are writing boost or something. Just add what you need.

I miss C++ iterators in other languages. For example, in Rust it's hard to talk about "a position in a String" which may be advanced or retreated. Usually you give up, use an integer index, and this is much worse.


Backing up a utf8 stream is nontrivial though.


This is an interesting point. Walking backwards through a UTF8 byte stream is easy: just scan back to the previous non-continuation byte, and Rust really can do this, though the API is buried under `str.chars().rev()`.

C++ is structured around iterators that point into; this enables e.g. writing into the guts of a std::map but not the guts of a std::string if you care about Unicode, since the write might change the string's byte length.
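A small example of that pointer-like use (bump is an illustrative helper): a std::map iterator lets you mutate the mapped value in place, and stays valid while you do, because nodes don't move.

```cpp
#include <map>
#include <string>

// Increment the value stored under key, writing "into the guts" of the
// map through the iterator; returns the new value, or -1 if absent.
int bump(std::map<std::string, int>& m, const std::string& key) {
    auto it = m.find(key);
    if (it == m.end()) return -1;
    it->second += 1;  // mutate in place; the iterator remains valid
    return it->second;
}
```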


> Isn't a random access iterator not an iterator?

Why would it not be? It still allows you to iterate over an entire collection, however the order is left undefined so that certain performance characteristics can be exposed to the underlying implementation.

This is the kind of tool you'd reach for if you don't care about the order of a collection, but you do care about performance, and probably care about the entire collection.

For example _pairs_ from Lua, used to iterate over key/values in tables, is a random-access iterator.

Traversal of the collection is still happening.


Random access iterators mean you can efficiently advance or retreat an iterator by an arbitrary amount. So a vector (via pointer arithmetic) but not a linked list.
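Concretely, the extra operations look like this (demo is an illustrative helper; none of these expressions would compile for a std::list iterator):

```cpp
#include <vector>

// Random access in action: arbitrary jumps, subscripting through the
// iterator, and iterator distance are all constant-time.
int demo(const std::vector<int>& v) {
    auto a = v.begin();
    auto b = a + 3;                       // jump forward by 3 in O(1)
    int mid = a[1];                       // subscript via the iterator
    auto dist = b - a;                    // distance between iterators
    return *b + mid + static_cast<int>(dist);
}
```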


Right but why should this matter at the type level? All you need to implement is next() and annotate it with documentation of the performance characteristics.


Yeah, this is a good question. Scott Meyers has a talk where he pokes fun at the "guaranteed O(N log N)" sort of a linked list, where the guarantee is in number of comparisons and not the (very large) number of pointer chases.

Why describe at the type level the difference between a backwards pointer chase and a pointer decrement? The hope is: if you know at compile time that your iterator is efficient in this way, you can pick a more specialized algorithm, say, quicksort instead of mergesort. C++ really does enable this at the type level through template specialization. In practice the STL collections haven't lived up to this, and I think the iterator tags are mostly deadweight.
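A sketch of that specialization mechanism, with an illustrative middle() helper that dispatches on the category tag at compile time:

```cpp
#include <iterator>
#include <list>
#include <vector>

// Random access: jump straight to the midpoint, O(1).
template <class It>
int middle(It first, It last, std::random_access_iterator_tag) {
    return *(first + (last - first) / 2);
}

// Anything weaker: walk halfway, O(n).
template <class It>
int middle(It first, It last, std::forward_iterator_tag) {
    auto n = std::distance(first, last) / 2;
    while (n--) ++first;
    return *first;
}

// Public entry point: overload resolution on the tag picks the
// specialized implementation.
template <class It>
int middle(It first, It last) {
    return middle(first, last,
                  typename std::iterator_traits<It>::iterator_category{});
}
```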


Thanks, that was helpful. I suspect the reason it ends up being dead weight is that it doesn't make much sense to prevent an API customer from using an iterator with certain performance characteristics with certain algorithms. Ultimately, the onus is on the developer to know something about the data; for instance, why wouldn't I be allowed to use any iteration algorithm on my slow-to-increment iterator if I know the length is 1?

So long as it's safe I think it's on the designer to make sure it's efficient, and I suspect that's where the API has gravitated to.


You can think of it as implementing `seek`, not just `next`.

`RandomAccessIterator` implements the `+=` and `-=` operations that `BidirectionalIterator` does not. Algorithms that accept `RandomAccessIterator` can then assume that calling += is possible.


What's the advantage to using a random access iterator over a get() indexed? Like to me the point of an iterator is that it abstracts over access kind, and exposes the minimum viable view, which is sequential forward and possibly backward access.

If you want indexed access why wouldn't you just, you know, use anything that implements operator[]?


Just that iterators is the API that C++ settled on for doing this in a generic way (well concepts in C++20, but I haven't learned that yet).

It's totally 100% kosher to write a function that calls `.at()` instead of `operator+=`. As long as it only needs to work on vectors, strings, or other stuff with a `.at()` method.

But if the function works on iterators instead it can work on ANY data structure that also fulfills the requirements of RandomAccessIterator API. String Views, raw pointers, a 3rd party Cord library, etc.

There's no inherent meaning to the iterator API requiring `operator+=` instead of something closer to `get()`, except that it means that the iterator itself doesn't need to know anything about container bounds (like I said above, raw pointers work as RandomAccessIterators).
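For instance (sort_array is an illustrative wrapper), the very same std::sort that sorts a vector works on a plain C array, because raw pointers already satisfy the RandomAccessIterator requirements:

```cpp
#include <algorithm>

// Raw pointers are random access iterators: no container, no adapter.
void sort_array(int* a, int n) {
    std::sort(a, a + n);
}
```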


[] alone doesn't imply that the view is sequential. Binary search for example requires both sequential items and indexed access. You can't binary search an std::map even if it implements [].
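For example (helper names are illustrative): std::lower_bound on a sorted vector exploits random access for O(log n) lookup, while std::map offers its own member lookup that walks the tree instead.

```cpp
#include <algorithm>
#include <map>
#include <vector>

// Binary search over a sorted, random-access sequence.
bool in_sorted_vector(const std::vector<int>& v, int x) {
    auto it = std::lower_bound(v.begin(), v.end(), x);
    return it != v.end() && *it == x;
}

// A map's own find() does the logarithmic lookup; generic binary
// search over its bidirectional iterators would not be efficient.
bool in_map(const std::map<int, int>& m, int key) {
    return m.find(key) != m.end();
}
```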


You need an additional API to create views/sublists if you want to transparently operate on sub ranges and if you want to return a found index you may need to convert its value to apply to a parent view.


The way the random access iterator was described in the article is not that it allows iteration over an un-ordered collection but rather that it allows random access within the collection. This would not necessarily map well to an un-ordered collection. This appears to be borne out by the API which supports arbitrary addition and subtraction to the iterator. I could of course be mistaken.

What you're describing based on my experience with other languages is simply an "input iterator" over an un-ordered collection. An iterator does not guarantee deterministic/repeatable ordering, just defined ordering (as in, guarantees it won't visit the same element twice), in my experience.


Well, I mean, iterating over a deque, map, vector, and list, on its own, already requires the pointer to the next element to decide whether to increment by the size of the type, follow the `next` pointer, or the left/right pointer, or a combination of any of the above. It's not the language's fault, it's computer science.


You seem to have mixed up an objection to the API with an objection to the use of the word "iterator" for something a bit more general than the normal/original meaning. That conflation isn't terribly helpful. Words can have more than one meaning. So long as everyone's on the same page, that's OK.


It's funny how two of the most common bandwagon posts you can see on HN and other programming venues is:

* C++ is a terrible bloated language that no one should use.

* The power of modern computers is wasted on lazy software engineers writing software with too many layers of abstraction.


I mean, C++ has lots of compile-time abstractions, but unlike with other languages, it can collapse a lot of them during compilation via its inlining and other optimizations. You don't get that with a language like Python...


I don't see a contradiction between these two statements. There are more languages than C++ and Python. You could use Rust, or some of the languages targeting the JVM or the CLI.


To many, these statements are saying the same thing. Your average C++ codebase is a bloated mess with too many layers of abstraction.


You don't need to use a "terribly bloated language" to write efficient software.


C/C++ may be better understood as a template processing system around assembly.


Keep in mind that most of the details you mentioned are for the developer of the library, not the end user/consumer of the library.

Ultimately most consumers of the library care if they can call std::sort(YourSpecialContainer.begin(), YourSpecialContainer.end()); in a consistent manner with some semblance that the use of templates and interfaces result in nearly zero overhead of the abstraction.


"nearly zero overhead"

I can't call anything "nearly zero overhead" that can generate large amounts of code. At least not with a clear conscience.

I've seen many cases of better performance with more "overhead" but drastically smaller code size.


Code size is not always a good performance metric


Nothing is a good performance metric apart from profiling. Preferably with runtime conditions (CPU load, memory bandwidth, etc.) closely mirroring production setting(s).

It's very easy to forget things like memory bandwidth, inter-core/CPU links, etc. are all limited resources. And that on real systems resources are shared.


I usually profile everything with valgrind callgrind


Usually people say std::sort is faster than qsort


It's true whenever the function call overhead for the comparison operation dominates. Inlining a trivial operation (often just one instruction) is one extreme.

At the other extreme, the L1 instruction cache (typically just 32 kB) gets constantly thrashed by numerous type-specialized functions. Individually fast when microbenchmarked, but collectively slow.

That's why profiling is a must when high performance is required. Microbenchmarks can be very misleading.


Why would all template specializations be in L1?

Even assuming you actually have multiple specializations in the same binary, which is not a given, only the one in use, or about to be used, will be in L1. Instruction cache prefetching is extremely effective.


It is, since the compiler can inline the comparator.
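A sketch of the comparison being discussed (helper names are illustrative; whether the inlining actually wins still requires profiling, as noted elsewhere in the thread). qsort calls the comparator through a function pointer on every comparison; std::sort receives the comparator as part of its type, so the compiler can inline it.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// C-style comparator, called through a function pointer by qsort.
int cmp_int(const void* a, const void* b) {
    return *static_cast<const int*>(a) - *static_cast<const int*>(b);
}

std::vector<int> sort_qsort(std::vector<int> v) {
    std::qsort(v.data(), v.size(), sizeof(int), cmp_int);
    return v;
}

// The lambda's call operator is part of std::sort's instantiated type,
// so it can be inlined at each comparison site.
std::vector<int> sort_std(std::vector<int> v) {
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });
    return v;
}
```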


Sorry for answering my own reply.

Zero overhead cult is pretty strong. People should profile instead. Sigh.


Iterators are a very powerful concept; however, it is very cumbersome, and often impossible, to implement some interesting iterators based on the standard iterators. This stems from the idea that any iterator can be based on a single pointer, using pointer comparison to indicate the start and the end of the contents.

The most common case are forward iterators, for which I usually define a class with the methods more and next, which can be used as:

  for (MyIterator it(data); it.more(); it.next())


An iterator is a class. It can contain any private data you desire and mutate in any wonderful and/or stupid ways you desire.

How it decides to compare equal to the .end() instance is also your call. Your example is not a compliant iterator (as in, it wouldn't work in a range-based for loop), yet the same behaviour would be achievable through your own operator++() and operator bool().
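A sketch of that suggestion (Evens and Done are illustrative names; range-based for with differing begin/end types needs C++17 or later): the more()/next() style survives intact, wrapped in the operators the loop expects.

```cpp
struct Done {};  // empty end marker

class Evens {
    int cur, limit;
public:
    explicit Evens(int limit) : cur(0), limit(limit) {}

    // The original style:
    bool more() const { return cur < limit; }
    void next() { cur += 2; }

    // Thin adapters so a range-based for loop can drive it:
    int operator*() const { return cur; }
    Evens& operator++() { next(); return *this; }
    bool operator!=(Done) const { return more(); }

    Evens begin() const { return *this; }
    Done end() const { return Done{}; }
};

int sum_evens(int limit) {
    int s = 0;
    for (int v : Evens(limit)) s += v;  // 0 + 2 + 4 + ...
    return s;
}
```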


I agree, but I feel it is a little sad that such a powerful programming concept as an iterator is hard to implement in C++ if you want to make full use of it. Other program languages have made better choices with respect to this and allow you to write cleaner, shorter code.


Why not just use those two operators instead of more and next?


It would be nice to write a DSL to generate some of the C++ boilerplate (say, containers, iterators, etc) required as a baseline before starting to solve the actual problem.

A DSL could codify rules like the ones described in this article and generate at least a rough first pass to be hand-tuned later.


Boost STL Interfaces provides this functionality as a library. Not just for iterators but both views and containers of sequences too.

https://www.boost.org/doc/libs/1_75_0/doc/html/stl_interface...


Wow, no, really? Wouldn't that spoil the "fun" of all the endless arcana?


First sentence: “An iterator is an object that points to an element inside a container”

This is not the primary purpose of an iterator. Does anyone want to hazard a guess as to what an iterator’s primary purpose is? You? You? Bueller?


The primary purpose of an iterator is to allow algorithms to be written in a generic manner without needing to know details about the type it's operating on. Since the algorithm only operates on the iterator type, any number of collections or other code that exposes that iterator type can be plugged into the same algorithm.

Take `std::copy_n` for example. You _could_ use this to copy from one vector to another:

    std::vector<int> a = {1,2,3}, b = {0,0,0};
    std::copy_n(a.begin(), a.size(), b.begin());
But iterators also allow the input and output collection types to be different:

    std::vector<int> a = {1,2,3};
    std::array<int,3> b;
    std::copy_n(a.begin(), a.size(), b.begin());
Or even for the output to not be associated with a container at all.

    std::vector<int> a = {1,2,3};
    std::copy_n(a.begin(), a.size(), std::ostream_iterator<int>(std::cout, " "));


I think you are arguing that iterators are "primarily" for iteration. But in C++ they really are like an abstraction over C pointers. For example, you can hold a std::list::iterator, and use it to efficiently delete from a linked list, without ever iterating.

Sometimes C pointers are used for iteration, but I would not say that is their primary purpose.


> I think you are arguing that iterators are "primarily" for iteration

I really wonder why people put up with something named this way then. "That fool thinks iterators are for iteration!"

Clarity of abstraction and good naming are important, unless the practitioners are masochists.


Cursor would be a better name probably, but I wouldn't say the name is misleading, as iteration is an important use case.


> difference_type

A forward iterator doesn't have a difference_type. A random access iterator does.



