Hacker News
Writing a custom iterator in modern C++ (internalpointers.com)
54 points by signa11 on Dec 21, 2020 | 76 comments



The classic iterator categories (input, output, ...) discussed in this article have been deprecated in C++20, and replaced with new concepts. I would expect a post from 2020 claiming to be about iterators "in modern C++" to be specifically about that change, but actually it's just about the legacy categories, so this is a pretty worthless article.


On the contrary, this article's approach works with almost any modern C++ compiler out there.

An article using C++20 will just lead to frustration, as there are still rough edges using concepts, with a standard that was just published last week.


Sure, this is the technique that you want to use for programs that you write today. But given that it's using the iterator concepts that have existed for decades, that will be deprecated soon if not now, why claim it's "modern C++"? The only thing I could see in the whole article that wouldn't work in C++98 was range-based for loops. At least they didn't use std::iterator, which was deprecated in C++17.

The title is misleading marketing: It gives the impression you need this nice new blog post instead of all the old out of date ones, and that it's all you'll need to know for the foreseeable future.


Well, regarding "modern C++", I have seen that moniker applied to even language features that were already available in pre-ISO C++98, like RAII, or type safe collections.

Apparently too many people still keep using C++ compilers to compile C code, not surprising when there are schools still using Turbo C++ for MS-DOS as a teaching tool, as per one of Bjarne's talks.

Back to C++20: C++ is no longer my main tool, but I still follow it, and one of my hobbies is checking new standard features. Hardly anything works as it is supposed to, and I expect the usual three years for stabilization, so when C++23 is around the corner is when I expect all major compilers to fully support C++20 without triggering some kind of compiler error.


> regarding "modern C++", I have seen that moniker applied to even language features that were already available in pre-ISO C++98, like RAII, or type safe collections.

That's not surprising as modern C++ predates C++11. Newer standards "merely" make it more convenient to practice it.

At the end of the day, modern c++ is really about emphasizing value types, RAII and parametric polymorphism as opposed to OOP.

Anyway, these days we have it good: conformance to new standards comes relatively quickly. It took forever for most compilers to be C++98 conformant (and technically most never reached it).


It is mentioned in the article early on:

"The first thing to do is to assign the iterator some properties. Until C++17 this is done by tagging it with the tag dispatch mechanism, while C++20 uses concepts: in this article I will follow the traditional approach."
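For concreteness, here is roughly what that traditional tagging looks like: a minimal forward iterator carrying the five member typedefs that pre-C++20 iterator_traits expects (IntIter and sum3 are illustrative names, not from the article):

```cpp
#include <cstddef>
#include <iterator>

// Minimal forward iterator over an int array, tagged the traditional
// (pre-C++20) way with the five member typedefs the STL looks for.
struct IntIter {
    using iterator_category = std::forward_iterator_tag;
    using difference_type   = std::ptrdiff_t;
    using value_type        = int;
    using pointer           = int*;
    using reference         = int&;

    int* p;
    reference operator*() const { return *p; }
    IntIter& operator++() { ++p; return *this; }
    IntIter operator++(int) { IntIter t = *this; ++p; return t; }
    bool operator==(const IntIter& o) const { return p == o.p; }
    bool operator!=(const IntIter& o) const { return p != o.p; }
};

// Sum the first three elements through the custom iterator.
int sum3(int* a) {
    int s = 0;
    for (IntIter it{a}, end{a + 3}; it != end; ++it) s += *it;
    return s;
}
```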


Honest question: what does it mean for something to be "deprecated" in (specifically) the c++ language spec?


The spec itself defines it as:

Normative for the current edition of the Standard, but having been identified as a candidate for removal from future revisions

This is a very blunt tool. In practice, a feature may be deprecated if it is:

1. Legacy inherited from C that serves no modern purpose. The register keyword.

2. Features hardly anybody implemented. Exported templates are the only example I am aware of. Like falling off the edge of the map.

3. Features that proved to be a bad idea, like incrementing a bool.

4. Features that proved to be wholly inadequate, like codecvt for dealing with Unicode.

5. Features that are reasonably new in C++ but obviated by new stuff. You don't need std::bind if you have lambdas.
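Point 5 in miniature (add, via_bind, and via_lambda are made-up names for illustration): the lambda does everything the bind expression does, with no placeholder machinery.

```cpp
#include <functional>

int add(int a, int b) { return a + b; }

// The deprecated-in-spirit way: bind the first argument, leave a placeholder.
int via_bind() {
    auto f = std::bind(add, 40, std::placeholders::_1);
    return f(2);
}

// The obviating way: a lambda captures the same intent directly.
int via_lambda() {
    auto f = [](int x) { return add(40, x); };
    return f(2);
}
```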

This is my outsider's perspective; please chime in if you have standards committee insight.


strstream was adopted into C++98 with conflicting implementations, so was deprecated at the outset. You are warned not to use the conflicting parts in code meant to be portable.

strstream has not been removed, yet, because spanstream is not in yet. Until that is in, there are certain things only strstream can do: specifically, attach to an existing buffer you need to parse things out of, or to format things into.


Thanks for the explanation. Given my impression that the standards committee places very strong emphasis on backwards compatibility, I didn't expect there to be a policy to allow for breaking changes.


> I would expect a post from 2020 claiming to be about iterators "in modern C++" to be specifically about that change (...)

...and yet here I am, still struggling to get some of my employer's C++ projects to build with C++14.


Somewhat related but C++20 ranges do not replace iterators and instead build a higher level of abstraction upon iterators. Ranges are implemented in terms of iterator pairs.


Yeah. That actually sucks; often iterators don't fit the iteration model of some sequences, but ranges would fit right in if they didn't require separability into two iterators.


Since C++20, the end iterator doesn't have to be the same type. Instead it can be something like a tag with no runtime data of its own, and comparing with it can be implemented as asking the other iterator if it is 'out of data'.
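A minimal sketch of that idea (CountDown and EndTag are illustrative names; range-based for with differing begin/end types needs C++17 or later):

```cpp
// The "end" is an empty tag; comparing against it asks the real
// iterator whether it has run out of data.
struct EndTag {};

struct CountDown {
    int n;
    int operator*() const { return n; }
    CountDown& operator++() { --n; return *this; }
    bool operator==(EndTag) const { return n == 0; }
    bool operator!=(EndTag) const { return n != 0; }
};

struct CountDownRange {
    int start;
    CountDown begin() const { return CountDown{start}; }
    EndTag end() const { return EndTag{}; }  // zero-size sentinel
};

int sum_countdown(int start) {
    int s = 0;
    for (int v : CountDownRange{start}) s += v;  // start + ... + 1
    return s;
}
```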


Yeah that's definitely an improvement in some respects, but (a) generic code still has to deal with 2 iterators, (b) generic code now has to allow them to be different types (so old code will have to be rewritten), (c) it doesn't change the fact that iterators are still treated like cheap objects (generalized pointers) and hence the begin iterator is still potentially expensive to store and copy around regardless of what you do with the end iterator, and (d) it's still a fundamental model mismatch relative to the actual problem, just not quite as unpalatable as before. Ranges fundamentally change the calculus around iteration.

P.S. (e) Is this sentinel-based approach even composable? If your iterator needs to use other iterators underneath, what do you do? You have to store a 'full' iterator anyway... or now you waste even more space and time storing a discriminated union. The fact that a range represents the whole sequence in one object instead of two makes composition intuitive and trivial.


Regarding the comment on discriminated unions, I've actually found the opposite with the new ability to have separate types for sentinels. Before, since both iterators had to be the same type, the end iterator needed some special casing to know it was the end. Allowing it to be a separate type meant all the logic for traversing the underlying sequence could be in the iterator type, and the sentinel could be an empty struct like a tag type used for comparisons only.


The iterator and sentinel do not need to be the same type, they just need to be equality comparable. If you wanted to represent an infinite range you could implement it in terms of a regular iterator and an empty sentinel type to which the regular iterator never compares equal.


I'm not sure why you're repeating the comment I replied to, but this is exactly what I replied to in my comment.


It isn't a sentinel, is it? He said no runtime data, so it sounds like some kind of empty tag for the type system. I'm not sure if it is related to what he is describing, but C++20 also added [[no_unique_address]] for something to do with truly empty members.


Distinguishing two members of a union with... 0 bytes of extra data? Sounds like magic.

Not 100% sure about the terminology. Pick whatever word might fit it better. I just called it a sentinel.


I thought he was saying the `end` in range begin/end was now allowed to be some empty signifier to tell things to look for the null pointer (null is a sentinel, but `end` itself is now empty). That may not be right, I don't know enough about the new concepts stuff and range seems to be part of that.

edit: looking at this it maybe covers it:

https://foonathan.net/2020/03/iterator-sentinel/


Yes but I'm saying this is kind of moot when you're writing a struct that needs to store the iterators, because it can't get away with just storing the 'begin' -- you can't guarantee the second iterator you're storing will actually be the end. Meaning this iteration model isn't efficiently composable.


Why would you need a union?


Write me some code that shows me how to compose iterator-based ranges with just 1 iterator's worth of space in each range.


It is not so clear to me what you mean. Maybe you can point to a better version (maybe what some other language does), and I can explain how I think the same can be done with C++20 ranges. I believe ranges are strictly more powerful than what other languages provide, and often they can be slimmer/more efficient for the common case.


Can you give an example of such a sequence?


Any kind of generator is basically like this. Imagine one for listing the contents of a directory... maybe it looks like this:

  class DirectoryChildren
  {
    void *dir;
    void *buffer;
    size_t buffer_size;
    size_t buffer_offset;
    ...
  public:
    bool next(FileInfo *info);
  };
Bear in mind the buffer only represents a chunk of intermediate results fetched from the OS; when we reach its end, we have to request more data from the OS.

If you had to make an iterator for this, what would you do? You'd basically need to either (a) create iterators that at least duplicate the entire state of DirectoryChildren on the stack, or (b) make iterators allocate on the heap. Both of these suck, because (1) iterators are passed around and copied all over the place on the assumption that these are fundamentally cheap operations, which is an assumption that easily breaks (like here), and (2) the iterators fundamentally do not have any meaning or utility as separate entities from the range itself.
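One way to sketch the trade-off (all names here are made up, and this is only one possible design): keep the iterator cheap by storing just a pointer back to the generator plus the cached current element, at the cost of tying every iterator's validity to the range object itself.

```cpp
// Hypothetical generator in the style described above: next() returns
// false once the sequence is exhausted.
struct Squares {
    int i;
    int limit;
    bool next(int* out) {
        if (i >= limit) return false;
        *out = i * i;
        ++i;
        return true;
    }
};

struct SquaresEnd {};  // empty sentinel

// Input-style iterator: one pointer to the generator plus the cached
// current value, so copies are cheap but only valid while the
// generator lives.
struct SquaresIter {
    Squares* gen;
    int current;
    bool done;

    explicit SquaresIter(Squares* g) : gen(g) { done = !gen->next(&current); }
    int operator*() const { return current; }
    SquaresIter& operator++() { done = !gen->next(&current); return *this; }
    bool operator!=(SquaresEnd) const { return !done; }
};

int sum_squares(int limit) {
    Squares g{0, limit};
    int s = 0;
    for (SquaresIter it(&g); it != SquaresEnd{}; ++it) s += *it;
    return s;  // 0 + 1 + 4 + 9 + ...
}
```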


This functionality already exists in the C++ standard library as std::filesystem::(recursive_)directory_iterator, with the default constructor creating a sentinel value. This is solution (1), due to the restriction that prior to C++20 iterators were required to be the same type. C++20 relaxes this restriction and allows the sentinel to be any type as long as it is equality comparable with the other iterator. Since on *nix systems readdir is used under the hood, the sentinel can simply be an empty type that compares equal only when the actual iterator's last cached dirent pointer is NULL.


Why do you keep repeating the same comment? Are you reading my replies? I already addressed exactly what you're talking about: https://news.ycombinator.com/item?id=25492756


There's six different kinds of iterator?!

Off hand, is it just me or...

- Isn't an iterator that outputs data to its recipient container not an iterator, but rather some type of filling API?

- Isn't a random access iterator not an iterator but rather a slice view?

- Isn't a contiguous iterator a leaky abstraction in that an iterator is just an interface that shouldn't expose how the data backing it is stored?

This iterator kindedness feels very C++ in the worst way possible. IMO the only iterator type that makes any sense is an "Input Iterator" and everything else is some sort of hot mess of inconsistency and, uh, C++.

Even with enough time I'm not sure I could design a worse API.


> Isn't an iterator that outputs data to its recipient container not an iterator, but rather some type of filling API?

Sure, but it makes sense for this api to have the same interface as an iterator so that existing iterators can be used without an adapter.

> Isn't a random access iterator not an iterator but rather a slice view?

It is an iterator that supports additional functions. You can use a random access iterator with functions that take output iterators, forward iterators, multipass iterators, etc. If you had unrelated concepts for each use case the system would be more complex.
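To illustrate, the same std::advance call accepts both categories and does the right thing for each: it walks the list iterator hop by hop, but jumps the vector iterator directly (the two helper functions are illustrative):

```cpp
#include <iterator>
#include <list>
#include <vector>

// Bidirectional iterator: advance steps node by node, O(n).
int third_of_list(const std::list<int>& l) {
    auto it = l.begin();
    std::advance(it, 2);
    return *it;
}

// Random access iterator: the same call is a single O(1) jump.
int third_of_vector(const std::vector<int>& v) {
    auto it = v.begin();
    std::advance(it, 2);
    return *it;
}
```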

> Isn't a contiguous iterator a leaky abstraction in that an iterator is just an interface that shouldn't expose how the data backing it is stored?

Seeing through abstractions and specializing algorithms is what makes STL-style generic code efficient.

And it is not inconsistent at all. Stepanov has written in great detail about the theoretical foundations of his design.


Iterators are inconsistent, but this plays to the strengths of C++'s duck-typed generics. The same code can either copy or insert, depending on whether you use std::begin or std::back_inserter; this isn't principled but it's powerful.

The "six types of iterators" overstates it. You don't need to sweat those details unless you are writing boost or something. Just add what you need.

I miss C++ iterators in other languages. For example, in Rust it's hard to talk about "a position in a String" which may be advanced or retreated. Usually you give up, use an integer index, and this is much worse.


Backing up a utf8 stream is nontrivial though.


This is an interesting point. Walking backwards through a UTF8 byte stream is easy: just scan back to the previous non-continuation byte, and Rust really can do this, though the API is buried under `str.chars().rev()`.

C++ is structured around iterators that point into; this enables e.g. writing into the guts of a std::map but not the guts of a std::string if you care about Unicode, since the write might change the string's byte length.
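A small example of that pointer-like use (bump is an illustrative helper): a std::map iterator lets you mutate the mapped value in place, and stays valid while you do, because nodes don't move.

```cpp
#include <map>
#include <string>

// Increment the value stored under key, writing "into the guts" of the
// map through the iterator; returns the new value, or -1 if absent.
int bump(std::map<std::string, int>& m, const std::string& key) {
    auto it = m.find(key);
    if (it == m.end()) return -1;
    it->second += 1;  // mutate in place; the iterator remains valid
    return it->second;
}
```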


> Isn't a random access iterator not an iterator?

Why would it not be? It still allows you to iterate over an entire collection, however the order is left undefined so that certain performance characteristics can be exposed to the underlying implementation.

This is the kind of tool you'd reach for if you don't care about the order of a collection, but you do care about performance, and probably care about the entire collection.

For example _pairs_ from Lua, used to iterate over key/values in tables, is a random-access iterator.

Traversal of the collection is still happening.


Random access iterators mean you can efficiently advance or retreat an iterator by an arbitrary amount. So a vector (via pointer arithmetic) but not a linked list.
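Concretely, the extra operations look like this (demo is an illustrative helper; none of these expressions would compile for a std::list iterator):

```cpp
#include <vector>

// Random access in action: arbitrary jumps, subscripting through the
// iterator, and iterator distance are all constant-time.
int demo(const std::vector<int>& v) {
    auto a = v.begin();
    auto b = a + 3;                       // jump forward by 3 in O(1)
    int mid = a[1];                       // subscript via the iterator
    auto dist = b - a;                    // distance between iterators
    return *b + mid + static_cast<int>(dist);
}
```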


Right but why should this matter at the type level? All you need to implement is next() and annotate it with documentation of the performance characteristics.


Yeah, this is a good question. Scott Meyers has a talk where he pokes fun at the "guaranteed O(N log N)" sort of a linked list, where the guarantee is in number of comparisons and not the (very large) number of pointer chases.

Why describe at the type level the difference between a backwards pointer chase and a pointer decrement? The hope is: if you know at compile time that your iterator is efficient in this way, you can pick a more specialized algorithm, say, quicksort instead of mergesort. C++ really does enable this at the type level through template specialization. In practice the STL collections haven't lived up to this, and I think the iterator tags are mostly deadweight.
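A sketch of that specialization mechanism, with an illustrative middle() helper that dispatches on the category tag at compile time:

```cpp
#include <iterator>
#include <list>
#include <vector>

// Random access: jump straight to the midpoint, O(1).
template <class It>
int middle(It first, It last, std::random_access_iterator_tag) {
    return *(first + (last - first) / 2);
}

// Anything weaker: walk halfway, O(n).
template <class It>
int middle(It first, It last, std::forward_iterator_tag) {
    auto n = std::distance(first, last) / 2;
    while (n--) ++first;
    return *first;
}

// Public entry point: overload resolution on the tag picks the
// specialized implementation.
template <class It>
int middle(It first, It last) {
    return middle(first, last,
                  typename std::iterator_traits<It>::iterator_category{});
}
```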


Thanks, that was helpful. I suspect the reason it ends up being dead weight is that it doesn't make much sense to prevent an API customer from using an iterator with certain performance characteristics with certain algorithms. Ultimately, the onus is on the developer to know something about the data; for instance, why wouldn't I be allowed to use any iteration algorithm on my slow-to-increment iterator if I know the length is 1?

So long as it's safe I think it's on the designer to make sure it's efficient, and I suspect that's where the API has gravitated to.


You can think of it as implementing `seek`, not just `next`.

`RandomAccessIterator` implements the `+=` and `-=` operations that `BidirectionalIterator` does not. Algorithms that accept `RandomAccessIterator` can then assume that calling += is possible.


What's the advantage to using a random access iterator over a get() indexed? Like to me the point of an iterator is that it abstracts over access kind, and exposes the minimum viable view, which is sequential forward and possibly backward access.

If you want indexed access why wouldn't you just, you know, use anything that implements operator[]?


Just that iterators is the API that C++ settled on for doing this in a generic way (well concepts in C++20, but I haven't learned that yet).

It's totally 100% kosher to write a function that calls `.at()` instead of `operator+=`. As long as it only needs to work on vectors, strings, or other stuff with a `.at()` method.

But if the function works on iterators instead it can work on ANY data structure that also fulfills the requirements of RandomAccessIterator API. String Views, raw pointers, a 3rd party Cord library, etc.

There's no inherent meaning to the iterator API requiring `operator+=` instead of something closer to `get()`, except that it means that the iterator itself doesn't need to know anything about container bounds (like I said above, raw pointers work as RandomAccessIterators).
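For instance (sort_array is an illustrative wrapper), the very same std::sort that sorts a vector works on a plain C array, because raw pointers already satisfy the RandomAccessIterator requirements:

```cpp
#include <algorithm>

// Raw pointers are random access iterators: no container, no adapter.
void sort_array(int* a, int n) {
    std::sort(a, a + n);
}
```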


[] alone doesn't imply that the view is sequential. Binary search for example requires both sequential items and indexed access. You can't binary search an std::map even if it implements [].
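For example (helper names are illustrative): std::lower_bound on a sorted vector exploits random access for O(log n) lookup, while std::map offers its own member lookup that walks the tree instead.

```cpp
#include <algorithm>
#include <map>
#include <vector>

// Binary search over a sorted, random-access sequence.
bool in_sorted_vector(const std::vector<int>& v, int x) {
    auto it = std::lower_bound(v.begin(), v.end(), x);
    return it != v.end() && *it == x;
}

// A map's own find() does the logarithmic lookup; generic binary
// search over its bidirectional iterators would not be efficient.
bool in_map(const std::map<int, int>& m, int key) {
    return m.find(key) != m.end();
}
```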


You need an additional API to create views/sublists if you want to transparently operate on sub ranges and if you want to return a found index you may need to convert its value to apply to a parent view.


The way the random access iterator was described in the article is not that it allows iteration over an un-ordered collection but rather that it allows random access within the collection. This would not necessarily map well to an un-ordered collection. This appears to be borne out by the API which supports arbitrary addition and subtraction to the iterator. I could of course be mistaken.

What you're describing based on my experience with other languages is simply an "input iterator" over an un-ordered collection. An iterator does not guarantee deterministic/repeatable ordering, just defined ordering (as in, guarantees it won't visit the same element twice), in my experience.


Well, I mean, iterating over a deque, map, vector, and list, on its own, already requires the pointer to the next element to decide whether to increment by the size of the type, follow the `next` pointer, or the left/right pointer, or a combination of any of the above. It's not the language's fault, it's computer science.


You seem to have mixed up an objection to the API with an objection to the use of the word "iterator" for something a bit more general than the normal/original meaning. That conflation isn't terribly helpful. Words can have more than one meaning. So long as everyone's on the same page, that's OK.


It's funny how two of the most common bandwagon posts you can see on HN and other programming venues is:

* C++ is a terrible bloated language that no one should use.

* The power of modern computers is wasted on lazy software engineers writing software with too many layers of abstraction.


I mean, C++ has lots of compile-time abstractions, but unlike with other languages, it can collapse a lot of them during compilation via its inlining and other optimizations. You don't get that with a language like Python...


I don't see a contradiction between these two statements. There are more languages than C++ and Python. You could use Rust, or some of the languages targeting the JVM or the CLI.


To many, these statements are saying the same thing. Your average C++ codebase is a bloated mess with too many layers of abstraction.


You don't need to use a "terribly bloated language" to write efficient software.


C/C++ may be better understood as a template processing system around assembly.


Keep in mind that most of the details you mentioned are for the developer of the library, not the end user/consumer of the library.

Ultimately most consumers of the library care if they can call std::sort(YourSpecialContainer.begin(), YourSpecialContainer.end()); in a consistent manner with some semblance that the use of templates and interfaces result in nearly zero overhead of the abstraction.


"nearly zero overhead"

I can't call anything "nearly zero overhead" that can generate large amounts of code. At least not with a clear conscience.

I've seen many cases of better performance with more "overhead" but drastically smaller code size.


Code size is not always a good performance metric


Nothing is a good performance metric apart from profiling. Preferably with runtime conditions (CPU load, memory bandwidth, etc.) closely mirroring production setting(s).

It's very easy to forget things like memory bandwidth, inter-core/CPU links, etc. are all limited resources. And that on real systems resources are shared.


I usually profile everything with valgrind callgrind


Usually people say std::sort is faster than qsort


It's true whenever the function call overhead for the comparison operation dominates. Inlining a trivial operation (often just one instruction) is one extreme.

At the other extreme, the L1 instruction cache (typically just 32 kB) gets constantly thrashed by numerous type-specialized functions. Individually fast when microbenchmarked, but collectively slow.

That's why profiling is a must when high performance is required. Microbenchmarks can be very misleading.


Why would all template specializations be in L1?

Even assuming you actually have multiple specializations in the same binary, which is not a given, only the one in use, or about to be used, will be in L1. Instruction cache prefetching is extremely effective.


It is, since the compiler can inline the comparator.
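A sketch of the comparison being discussed (helper names are illustrative; whether the inlining actually wins still requires profiling, as noted elsewhere in the thread). qsort calls the comparator through a function pointer on every comparison; std::sort receives the comparator as part of its type, so the compiler can inline it.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// C-style comparator, called through a function pointer by qsort.
int cmp_int(const void* a, const void* b) {
    return *static_cast<const int*>(a) - *static_cast<const int*>(b);
}

std::vector<int> sort_qsort(std::vector<int> v) {
    std::qsort(v.data(), v.size(), sizeof(int), cmp_int);
    return v;
}

// The lambda's call operator is part of std::sort's instantiated type,
// so it can be inlined at each comparison site.
std::vector<int> sort_std(std::vector<int> v) {
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });
    return v;
}
```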


Sorry for answering my own reply.

Zero overhead cult is pretty strong. People should profile instead. Sigh.


Iterators are a very powerful concept; however, it is very cumbersome, and often impossible, to implement some interesting iterators based on the standard iterators. This stems from the idea that any iterator can be based on a single pointer, using pointer comparison to indicate the start and the end of the contents.

The most common case are forward iterators, for which I usually define a class with the methods more and next, which can be used as:

  for (MyIterator it(data); it.more(); it.next())


An iterator is a class. It can contain any private data you desire and mutate in any wonderful and/or stupid ways you desire.

How it decides to compare equal to the .end() instance is also your call. Your example is not a compliant iterator (as in, it wouldn't work in a range-based for loop), yet the same behaviour would be achievable through your own operator++() and operator bool().
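A sketch of that suggestion (Evens and Done are illustrative names; range-based for with differing begin/end types needs C++17 or later): the more()/next() style survives intact, wrapped in the operators the loop expects.

```cpp
struct Done {};  // empty end marker

class Evens {
    int cur, limit;
public:
    explicit Evens(int limit) : cur(0), limit(limit) {}

    // The original style:
    bool more() const { return cur < limit; }
    void next() { cur += 2; }

    // Thin adapters so a range-based for loop can drive it:
    int operator*() const { return cur; }
    Evens& operator++() { next(); return *this; }
    bool operator!=(Done) const { return more(); }

    Evens begin() const { return *this; }
    Done end() const { return Done{}; }
};

int sum_evens(int limit) {
    int s = 0;
    for (int v : Evens(limit)) s += v;  // 0 + 2 + 4 + ...
    return s;
}
```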


I agree, but I feel it is a little sad that such a powerful programming concept as an iterator is hard to implement in C++ if you want to make full use of it. Other program languages have made better choices with respect to this and allow you to write cleaner, shorter code.


Why not just use those two operators instead of more and next?


It would be nice to write a DSL to generate some of the C++ boilerplate (say, containers, iterators, etc) required as a baseline before starting to solve the actual problem.

A DSL could codify rules like the ones described in this article and generate at least a rough first pass to be hand-tuned later.


Boost STL Interfaces provides this functionality as a library. Not just for iterators but both views and containers of sequences too.

https://www.boost.org/doc/libs/1_75_0/doc/html/stl_interface...


Wow, no, really? Wouldn't that spoil the "fun" of all the endless arcana?


First sentence: “An iterator is an object that points to an element inside a container”

This is not the primary purpose of an iterator. Does anyone want to hazard a guess as to what an iterator’s primary purpose is? You? You? Bueller?


The primary purpose of an iterator is to allow algorithms to be written in a generic manner without needing to know details about the type it's operating on. Since the algorithm only operates on the iterator type, any number of collections or other code that exposes that iterator type can be plugged into the same algorithm.

Take `std::copy_n` for example. You _could_ use this to copy from one vector to another:

    std::vector<int> a = {1,2,3}, b = {0,0,0};
    std::copy_n(a.begin(), a.size(), b.begin());
But iterators also allow the input and output collection types to be different:

    std::vector<int> a = {1,2,3};
    std::array<int,3> b;
    std::copy_n(a.begin(), a.size(), b.begin());
Or even for the output to not be associated with a container at all.

    std::vector<int> a = {1,2,3};
    std::copy_n(a.begin(), a.size(), std::ostream_iterator<int>(std::cout, " "));


I think you are arguing that iterators are "primarily" for iteration. But in C++ they really are like an abstraction over C pointers. For example, you can hold a std::list::iterator, and use it to efficiently delete from a linked list, without ever iterating.

Sometimes C pointers are used for iteration, but I would not say that is their primary purpose.


> I think you are arguing that iterators are "primarily" for iteration

I really wonder why people put up with something named this way then. "That fool thinks iterators are for iteration!"

Clarity of abstraction and good naming are important, unless the practitioners are masochists.


Cursor would be a better name probably, but I wouldn't say the name is misleading, as iteration is an important use case.


> difference_type

A forward iterator doesn't have a difference_type. A random access iterator does.



