
What is the point of talking thread-per-core performance if raft sits in front of it, i.e. only one node (the leader) will do the work at any time anyway?


So redpanda partitions raft groups per kafka partition: in the `topic/partition` model every partition is its own raft group (similar to multi-raft in cockroachdb). So thread-per-core is in fact even more important, due to the replication cost and therefore the additional work of checksumming, compression, etc.
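
To make the multi-raft point concrete, here is a minimal sketch with made-up types (hypothetical, not redpanda's actual code): one independent raft instance per topic/partition key.

    // Hedged sketch - hypothetical types, not redpanda's actual code.
    #include <cstdint>
    #include <map>
    #include <memory>
    #include <string>
    #include <tuple>

    struct raft_group { // stand-in for one consensus instance
        std::int64_t group_id;
        // log, current term, replica set, ... elided
    };

    struct tp_key { // topic/partition key
        std::string topic;
        std::int32_t partition;
        bool operator<(const tp_key& o) const {
            return std::tie(topic, partition)
                   < std::tie(o.topic, o.partition);
        }
    };

    // One raft group per partition: each one pays its own replication,
    // checksumming and compression cost, so per-core efficiency still
    // matters even with raft in front.
    std::map<tp_key, std::unique_ptr<raft_group>> groups;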

Last, the coordinator core for one of the TCP connections from a client will likely make requests to remote cores (say you receive a request on core 44, but the destination is core 66), so having a thread per core with explicit message passing is pretty fundamental.

    // Dispatch a batch of raft heartbeats to the core (shard) that owns
    // the destination groups; there are no locks - Seastar's invoke_on
    // is explicit message passing between cores.
    ss::future<std::vector<append_entries_reply>>
    dispatch_hbeats_to_core(ss::shard_id shard, hbeats_ptr requests) {
        // Run under our scheduling group so the cross-core hop is
        // accounted against this subsystem's CPU share.
        return with_scheduling_group(
          get_scheduling_group(),
          [this, shard, r = std::move(requests)]() mutable {
              // invoke_on submits the lambda to the destination core; the
              // smp service group bounds how much of that core these
              // cross-core requests may consume.
              return _group_manager.invoke_on(
                shard,
                get_smp_service_group(),
                [this, r = std::move(r)](ConsensusManager& m) mutable {
                    return dispatch_hbeats_to_groups(m, std::move(r));
                });
          });
    }

The code above shows the importance of accounting for the cross-core (x-core) comms explicitly.


Ok, thanks. Does redpanda do some kind of auto anti-affinity on hosts for partition groups, to spread them across remote cores?

ps. the redpanda link from the article is broken, it goes to https://vectorized.io/blog/tpc-buffers/vectorized.io/redpand... and 404s


Oh shoot! thank you... fixing the link, give me 5 mins.

So currently the partition allocator - https://github.com/vectorizedio/redpanda/blob/dev/src/v/clus... - is primitive.

But we have a working but not-yet-exposed HTTP admin API on the controller that allows for out-of-band placement.

So the mechanics are there, but not yet integrated w/ the partition allocator.

We're thinking we'll integrate w/ k8s more deeply next year.

The thinking, at least, is that at install time we generate some machine labels, say in /etc/redpanda/labels.json or smth like that, and then the partition allocator can take simple constraints.

I worked on a few schedulers for www.concord.io with Fenzo on top of mesos 6 years ago, and that approach worked nicely for 'affinity', 'soft affinity', and 'anti-affinity' constraints.
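
For illustration only, here is a rough sketch of what label-based placement constraints could look like; the labels.json schema and every name below are hypothetical, not a committed redpanda interface.

    // Hedged sketch - hypothetical names, not a committed interface.
    // Machine labels as might be loaded from /etc/redpanda/labels.json,
    // e.g. { "rack": "r12", "zone": "us-east-1a", "disk": "nvme" }.
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    using machine_labels = std::map<std::string, std::string>;

    // Hard constraints filter candidate nodes out entirely; soft
    // constraints would only bias scoring (the Fenzo-style split).
    struct label_constraint {
        enum class kind { hard, soft };
        kind k;
        std::string key;
        std::string value; // require labels[key] == value
    };

    bool satisfies_hard(const machine_labels& node,
                        const std::vector<label_constraint>& cs) {
        for (const auto& c : cs) {
            if (c.k != label_constraint::kind::hard) {
                continue; // soft constraints score, they don't filter
            }
            auto it = node.find(c.key);
            if (it == node.end() || it->second != c.value) {
                return false;
            }
        }
        return true;
    }

    // Anti-affinity for replica spread: reject a candidate whose rack
    // already hosts another replica of the same partition.
    bool rack_anti_affinity_ok(const machine_labels& candidate,
                               const std::set<std::string>& used_racks) {
        auto it = candidate.find("rack");
        return it == candidate.end() || used_racks.count(it->second) == 0;
    }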

Do you have any thoughts on how you'd like this exposed?



