(those were quickly googled, apologies if i got any dates wrong)
timing is important. cloudera had good, early timing and looked promising because of it. you are right that EMR definitely hurt all the other hadoop vendors, though i think people overestimate how comfortable big enterprises are with moving to public cloud. way more comfortable today; back then everything was a lot less certain. cloudera's name is unfortunate given they never got anything cloud-based successful, maybe they'd be in a better place now if they had.
but some of the key technologies you suggest as better options came 4-6 years later. that's 4-6 years of providing value, gaining traction, and building a committed customer base. 4-6 years is a long time, and even with how slow many enterprise projects run, more than enough to get entrenched, build tooling that makes a bunch of stuff easier, build mindshare, etc.
> Have I missed out on some cool stuff?
stuff that makes companies money doesn't always look cool.
Regarding your first issue, this is very much a matter of defaults. I can't be sure of your exact pipeline and connectors, but if, for example, you were using the JDBC connector, it has included support for at least prefixing names since the original version, effectively supporting the namespacing you require https://docs.confluent.io/current/connect/connect-jdbc/docs/.... I agree this might not be as ideal as namespacing directly at the Kafka layer for some users. The addition of single message transforms to arbitrarily modify the topic names (based on the existing topic name or really any data in the record or any info in the transformation config) gives a lot more flexibility as of Kafka 0.10.2. On the Hadoop/Hive side, I think there may still be that limitation; transformations effectively remove it since you can arbitrarily adjust the topic the sink connector sees, but this probably isn't an obvious solution. Also, we really would prefer to avoid any coding required when using Connect. It's a difficult tradeoff between standardization (same configs everywhere), usability (minimize configs the user has to set), and simplicity+immediate usability (transformations came later and introduce configuration complexity). I (and other Kafka contributors) would certainly welcome thoughts on how to make this all simpler; I think most software, especially open source software, errs too heavily towards configurability, but clearly in your case you found things not configurable enough.
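To make the two renaming options a bit more concrete, here is a rough sketch in Python of what the configs might look like. `topic.prefix` is the JDBC source connector setting and `RegexRouter` is one of the transforms that ships with Kafka. The connector name, connection URL, column name, and the `teamA.` namespace are all made up for illustration. The `route` helper only imitates the rename locally so you can see the effect; the real transform runs inside Connect and uses Java regex syntax (`$1`, not Python's `\1`).

```python
import json
import re

# Source side: the JDBC source connector prepends topic.prefix to each table name.
# Connection details and the "teamA." namespace are hypothetical.
jdbc_source = {
    "name": "teamA-jdbc-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://db.example.com:5432/app",
        "mode": "incrementing",
        "incrementing.column.name": "id",
        "topic.prefix": "teamA.",
    },
}

# Sink side: settings you could add to a sink connector's config so that a
# RegexRouter transform strips the namespace before the sink sees the topic.
strip_namespace = {
    "transforms": "dropPrefix",
    "transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.dropPrefix.regex": "teamA\\.(.*)",
    "transforms.dropPrefix.replacement": "$1",
}


def route(topic, regex, replacement):
    """Local imitation of the rename: rewrite only if the whole topic name matches."""
    match = re.fullmatch(regex, topic)
    if match is None:
        return topic
    # Translate Java-style group references ($1) into Python's (\g<1>).
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return match.expand(template)


if __name__ == "__main__":
    print(json.dumps(jdbc_source, indent=2))
    print(route("teamA.orders", "teamA\\.(.*)", "$1"))
```

The JSON printed for `jdbc_source` is the shape of body you would POST to the Connect REST API's `/connectors` endpoint; the `strip_namespace` keys would be merged into whatever sink connector config you already have.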
re: the point about backpressure, there are plenty of cases where you don't want backpressure. If you want the thing that's producing data to keep humming along even if some downstream app (let's say Connect dumping the data into HDFS for some downstream batch analytics) falls behind, you don't want to see backpressure. In Kafka you should just define your retention period to be long enough to cover any slowness/lag in consumer applications -- it's pretty fundamental to its design and use cases that it doesn't have explicit backpressure from consumers back to producers. (You do get backpressure from a single broker back to the producer via the TCP connection, but I assume you meant from consumer back to producer.)
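As a back-of-the-envelope illustration of "retention long enough to cover lag", here is a small Python sketch. `retention.ms` is the real topic-level setting; the 36-hour worst-case outage and the 2x safety factor are just assumptions I picked for the example, and it assumes time-based retention is the only thing deleting data (i.e. no smaller `retention.bytes` limit).

```python
MS_PER_HOUR = 60 * 60 * 1000


def retention_ms(worst_case_lag_hours, safety_factor=2.0):
    """Retention that outlasts the slowest consumer's worst lag, with headroom."""
    if worst_case_lag_hours < 0 or safety_factor < 1:
        raise ValueError("lag must be >= 0 and safety_factor >= 1")
    return int(worst_case_lag_hours * safety_factor * MS_PER_HOUR)


def consumer_is_safe(current_lag_hours, configured_retention_ms):
    """True if data the consumer has not read yet is still within retention."""
    return current_lag_hours * MS_PER_HOUR < configured_retention_ms


# Hypothetical case: the HDFS sink may be down for up to 36 hours.
topic_config = {"retention.ms": str(retention_ms(36))}

if __name__ == "__main__":
    print(topic_config)  # {'retention.ms': '259200000'}, i.e. 72 hours
```

The producer never looks at any of this; it keeps writing at full speed, and the only question is whether the lagging consumer catches up before its unread data ages out.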
It's not Jepsen, but we actually do a fair amount of system and integration testing, some of which does things like kill nodes (randomly, the controller, etc) and validates data is delivered correctly. There is some ongoing work to add other fault injection tests: https://issues.apache.org/jira/browse/KAFKA-5476
One cool thing that happened recently with these tests is that they were modified to make the client implementation pluggable: https://github.com/apache/kafka/pull/2048 Confluent uses this functionality to test all of its clients (librdkafka, confluent-kafka-python, confluent-kafka-go, confluent-kafka-dotnet) in addition to the Java clients. This not only makes us confident of these clients from their first release, but has also found dozens of bugs in both the clients and the broker implementation itself. Getting automated testing across many clients has really stepped up the quality and robustness of both existing and new features.
For what it's worth, the Jira for KIP-101 was created in January 2014. That has been a known potential Kafka data loss scenario for quite a while, just took some time (and evidently the findings of these new stress tests) to be prioritized as a serious problem that needed to be fixed.
Debezium implements Kafka Connectors, so the serialization is pluggable. Debezium can work with thrift, you just need a thrift converter for Kafka Connect.
It's not particularly efficient, but probably the easiest way to draw Voronoi diagrams on the GPU doesn't require a shader at all. Instead, use an orthographic projection and draw a cone for each vertex with the cone's apex at the vertex and its axis oriented into the screen. Then the GPU's z-buffer takes care of choosing which vertex is closest to the given pixel.
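If it helps to see why the cone trick works, here is a CPU emulation of it in plain Python (no GPU involved, so treat it as a sketch of the idea, not of real rendering code). Looking straight down the axis of a right cone, the depth at a pixel is proportional to its distance from the apex, so a per-pixel "keep the nearest depth" test ends up picking the nearest site. The function name and the tiny grid are my own choices for illustration.

```python
import math


def voronoi_by_cones(sites, width, height):
    """Emulate one cone 'draw call' per site with a per-pixel depth test."""
    depth = [[math.inf] * width for _ in range(height)]  # z-buffer cleared to "far"
    owner = [[-1] * width for _ in range(height)]        # color buffer: winning site index
    for index, (sx, sy) in enumerate(sites):
        # "Rasterize" this site's cone over the whole screen. With a 45-degree
        # cone, depth at a pixel equals its distance from the apex.
        for y in range(height):
            for x in range(width):
                z = math.hypot(x - sx, y - sy)
                if z < depth[y][x]:  # the usual "less than" depth test
                    depth[y][x] = z
                    owner[y][x] = index
    return owner


if __name__ == "__main__":
    cells = voronoi_by_cones([(1, 1), (6, 2), (3, 6)], 8, 8)
    for row in cells:
        print("".join(str(i) for i in row))
```

On real hardware the cones are tessellated into triangles and have to be big enough to cover the whole viewport, so the cell boundaries are only as accurate as the tessellation and depth precision allow.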
Yeah! It's a pretty big omission in my post that I don't talk about other methods of generating Voronoi diagrams, but I wanted to keep it relatively short.
For apt repositories, you might be interested in http://www.aptly.info/, especially if you want to host it on S3 as it integrates very well. As others have mentioned, reprepro isn't that tough to use either, but with aptly, moving the repo to S3 instead of somewhere else basically amounted to using an S3 URI. It also has some other features that might be handy, e.g. versioning/snapshots and serving your repos locally for testing.
In any case, I find the process of actually creating the packages far more arduous than setting up the repo...
In particular, http://kafka.apache.org/documentation.html#intro_consumers addresses the concept of consumer groups and what ordering is guaranteed. One thing that might be worth noting for the grandparent is that Kafka consumers have an offset commit API that gives some control over how failures are handled. If a consumer dies before committing an offset but after reading data from the broker, a fresh consumer that joins the consumer group can see the same data once the system determines the original has died; that ensures all data will be processed, even in the event of consumer failures.
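Here is a toy, in-memory model of that failure case in Python. To be clear, this is not the Kafka client API; `ToyPartition` and `consume` are hypothetical names, and the "crash" is just an early return. It only shows the consequence described above: a consumer that dies after processing but before committing causes its replacement to see the same records again, so nothing is lost but some things are processed twice.

```python
class ToyPartition:
    """Stand-in for one partition plus one consumer group's committed offset."""

    def __init__(self, records):
        self.records = list(records)
        self.committed = 0  # next offset the group should read

    def poll(self, max_records):
        """Return (offset, value) pairs starting from the committed offset."""
        start = self.committed
        return list(enumerate(self.records[start:start + max_records], start=start))

    def commit(self, next_offset):
        self.committed = next_offset


def consume(partition, processed, batch_size, crash_before_commit_at=None):
    """Process batches and commit after each one; optionally 'die' before a commit.

    Returns True if the log was drained, False if the simulated crash happened.
    """
    while True:
        batch = partition.poll(batch_size)
        if not batch:
            return True
        for _offset, value in batch:
            processed.append(value)  # the actual "work"
        last_offset = batch[-1][0]
        if crash_before_commit_at is not None and last_offset >= crash_before_commit_at:
            return False  # work was done, but the offset was never committed
        partition.commit(last_offset + 1)


if __name__ == "__main__":
    partition = ToyPartition(["a", "b", "c", "d", "e"])
    processed = []
    consume(partition, processed, batch_size=2, crash_before_commit_at=3)  # first consumer dies
    consume(partition, processed, batch_size=2)                            # replacement takes over
    print(processed)  # ['a', 'b', 'c', 'd', 'c', 'd', 'e']
```

Committing before processing instead would flip the tradeoff: no duplicates, but records in the uncommitted batch could be skipped on a crash.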
Kinesis provides the same ordering guarantees. They use different terminology (Kafka topics == Kinesis streams; Kafka partitions == Kinesis shards) but have the same system interface. The details of the APIs used for consumption differ, but they provide the same basic functionality of Kafka's "consumer groups".
Agreed! It's not Docker, but I'm working on getting a decent Vagrant setup included with Kafka: https://issues.apache.org/jira/browse/KAFKA-1173 That supports pulling up a full cluster locally in VirtualBox or in EC2. Just a first cut, but it already makes testing a lot easier for me.
But in its current state, that patch is a starting point that is really intended more for Kafka developers than for Kafka users. I really like what the Mesosphere folks have done -- great variety of OSes and cloud platforms, plus they do all the heavy lifting of bringing the cluster up for you.
Along similar lines, all of the English Wikipedia is < 10GB compressed, and about 45GB uncompressed: http://en.wikipedia.org/wiki/Wikipedia:Database_download#Eng.... That omits all the history (just the current pages), but it's still surprising to me how small it seems now.
See the mailing list thread for 2.8.0-RC0 for where to find the bits if you want to test: https://lists.apache.org/thread.html/r16894a11aec73abac521ff... The project site also has some "contact" info for the mailing lists where these things are announced and advertised (including releases, Kafka Improvement Proposals, and more): https://kafka.apache.org/contact