AdamProut · on July 3, 2023

Yeah, for workloads with any long-running write transactions, a single-writer design is a pretty big limitation. Say a long-running data load (or a big bulk deletion) runs alongside some faster, high-throughput key-value writes: the big data load would block all the faster key-value writes while it runs.

No "mainstream" database I'm aware of has a global single writer design.
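
To make the blocking effect concrete, here is a minimal sketch in Python. It is a toy queueing model, not any real database's scheduler: the job names, arrival times, and durations are made up for illustration, and the "concurrent" case assumes the writers touch disjoint rows and never wait on each other.

```python
def single_writer_finish_times(jobs):
    """jobs: list of (name, arrival_time, duration) in seconds.

    Models one global write lock served in arrival order (hypothetical model).
    """
    clock = 0.0
    finish = {}
    for name, arrival, duration in sorted(jobs, key=lambda j: j[1]):
        start = max(clock, arrival)  # wait until the previous writer is done
        clock = start + duration
        finish[name] = clock
    return finish


def concurrent_writer_finish_times(jobs):
    """Idealized case: writers on disjoint rows never block each other."""
    return {name: arrival + duration for name, arrival, duration in jobs}


# Hypothetical workload: one 60 s bulk load plus two 10 ms key-value writes.
jobs = [
    ("bulk_load", 0.0, 60.0),
    ("kv_write_1", 1.0, 0.01),
    ("kv_write_2", 2.0, 0.01),
]

single = single_writer_finish_times(jobs)
concurrent = concurrent_writer_finish_times(jobs)
```

In this model the two key-value writes finish only after the bulk load releases the global lock, while in the idealized concurrent case they finish about 10 ms after they arrive.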

AdamProut · on June 8, 2023

This may depend on if your use case is only vector search in isolation (ANN lookups). In this scenario pgvector is potentially not the best option (https://ann-benchmarks.com/)

That said, using pgvector (or using other SQL databases with vector search support.. many have this capability) will let you do both ANN and your usual SQL filtering and joining (or full text, or.. etc) to produce more hand tuned results to a query. This is something the specialized vector databases don't have much support for.
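
A rough illustration of combining a relational filter with a nearest-neighbor ranking, in plain Python. This is not pgvector's API and it does an exact brute-force scan rather than ANN; the rows, column names, and the `filtered_nearest` helper are all hypothetical.

```python
import math


def filtered_nearest(rows, query_vec, predicate, k):
    """Apply an ordinary filter first, then rank survivors by L2 distance."""
    candidates = [r for r in rows if predicate(r)]
    candidates.sort(key=lambda r: math.dist(r["embedding"], query_vec))
    return candidates[:k]


# Hypothetical product table with a 2-dimensional embedding column.
rows = [
    {"id": 1, "category": "shoes", "price": 40, "embedding": (0.9, 0.1)},
    {"id": 2, "category": "shoes", "price": 120, "embedding": (0.8, 0.2)},
    {"id": 3, "category": "hats", "price": 30, "embedding": (0.85, 0.15)},
    {"id": 4, "category": "shoes", "price": 60, "embedding": (0.1, 0.9)},
]

result = filtered_nearest(
    rows,
    query_vec=(0.8, 0.2),
    predicate=lambda r: r["category"] == "shoes" and r["price"] < 100,
    k=2,
)
result_ids = [r["id"] for r in result]
```

Row 2 is the closest vector overall but is excluded by the price filter, which is the kind of hand tuning that is easy when the vectors live next to the rest of the relational data.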

AdamProut · on Jan 3, 2023

Most SQL Analytical databases don't discourage joins (think BigQuery, Redshift, Snowflake, etc.) and all of them are columnstores. I think discouraging joins is something very specific to Clickhouse, Druid, Pinot and others that have very limited support for joins.

hodgesrm · on Jan 3, 2023

ClickHouse at least supports local joins quite well.

Perhaps another way to put it is that BigQuery, Redshift, and Snowflake are not optimized for real-time response on large, wide tables. ClickHouse has features that allow it to pack multiple entities in a single table, then pull data out in a single scan. This includes tricks like simulating joins using aggregation. [0] It's a great design for feeding tenant dashboards with fixed latency (say 2s or less) and predictable cost. This use case is shared by many SaaS offerings.
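
As a rough sketch of what "simulating joins using aggregation" can look like, here is the idea in plain Python. This is one reading of the technique, not code from the linked talk and not ClickHouse syntax: the single wide table, the `entity` column, and the field names are assumptions made for illustration.

```python
from collections import defaultdict


def join_via_aggregation(rows):
    """One scan over a mixed-entity table, grouped by the shared key.

    Conditional aggregates pick attributes from "user" rows and sum amounts
    from "order" rows, so no separate join step is needed.
    """
    groups = defaultdict(
        lambda: {"name": None, "order_total": 0.0, "order_count": 0}
    )
    for row in rows:
        group = groups[row["user_id"]]
        if row["entity"] == "user":
            group["name"] = row["name"]
        elif row["entity"] == "order":
            group["order_total"] += row["amount"]
            group["order_count"] += 1
    return dict(groups)


# Hypothetical wide table holding two entity types side by side.
rows = [
    {"entity": "user", "user_id": 1, "name": "alice"},
    {"entity": "order", "user_id": 1, "amount": 20.0},
    {"entity": "order", "user_id": 1, "amount": 5.0},
    {"entity": "user", "user_id": 2, "name": "bob"},
    {"entity": "order", "user_id": 2, "amount": 7.5},
]

summary = join_via_aggregation(rows)
```

The result has one record per `user_id` carrying both the user attribute and the order aggregates, produced by a single pass over the table.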

Over time I think the differences will become less as current database engines converge on commonly required features. ClickHouse join types have expanded over the last year and features like join reordering are in the 2023 roadmap. [1] Conversely incumbent cloud databases are adding features to support real-time analytics.

I work on ClickHouse at Altinity and can't speak for Druid and Pinot. Perhaps someone else with detailed knowledge can chip in.

[0] https://www.databricks.com/dataaisummit/session/opening-floo...

[1] https://github.com/ClickHouse/ClickHouse/issues/44767

AdamProut · on Jan 2, 2023

There are hybrid designs for separation of storage and compute that are aimed at mixed workloads[1]. They avoid writes to remote storage on transaction commit (i.e., they act like a shared-nothing database for commits, but still push data asynchronously to a shared remote disk that can be used for scale up/point in time restores/branching).

[1] https://www.singlestore.com/blog/separating-storage-and-comp...

(disclosure: SingleStoreDB CTO)

zX41ZdbW · on Jan 3, 2023

SingleStore is a kind of uninteresting example because it is not open-source. It makes potential customers feel non-confident. Why move large data volumes to SingleStore if the company can cease to exist in a year?

AdamProut · on Jan 3, 2023

Yep, SingleStoreDB (formerly MemSQL) is not open source (probably never will be), but it does have many paying customers who have had workloads in production for over a decade at this point.

Also, isn't taking a bet on a very recently launched database as a service based around an open source database also pretty risky? Say for example the very recently launched Clickhouse Inc. service (which you're a co-founder of?).

zX41ZdbW · on Jan 3, 2023

Touché.

password4321 · on Jan 3, 2023

I'm interested to see when SingleStore will support ARM.

https://www.singlestore.com/forum/t/singlestore-for-arm64-m1...

nikita · on Jan 2, 2023

Adam, you still synchronously replicate each log record to 2 places, right? Technically this should be roughly equivalent to writing into a consensus. And use write through cache for reads.

AdamProut · on Jan 2, 2023

Yep, for writes, network bandwidth usage is independent of separation of storage and compute in some sense. Any database that provides high availability is writing over the network somewhere before it acks a transaction as committed. It matters who is on the other end of that network write though. Take the typical Cloud DW (i.e., Snowflake) design of forcing writes to the blob storage (shared remote disks) before commit. That is a much higher latency write than what a high performance transaction log replication protocol will do to replicate a write to another host.

AdamProut · on Nov 16, 2022

Columnstores can do row level access by trading off a bit in terms of compression. If you organize the columnstore files as an LSM tree and use incremental compression schemes (so you don't have to decompress too many more rows than the one you're after) it can get close to the performance of a B-Tree for point reads - it depends on the specific table schema.

[1] https://dl.acm.org/doi/abs/10.1145/3514221.3526055
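
A minimal sketch of the point-read idea, in Python. It is a toy under stated assumptions and not how any particular engine lays out its files: `BlockCompressedSegment`, `lsm_get`, and the block size are hypothetical, and independently zlib-compressed blocks stand in for an "incremental" compression scheme. Each sorted segment keeps a sparse index of the first key per block, so a point read decompresses one small block instead of the whole column.

```python
import bisect
import json
import zlib


class BlockCompressedSegment:
    """Sorted (key, value) rows stored as small, independently compressed blocks."""

    def __init__(self, sorted_rows, rows_per_block):
        self.first_keys = []  # sparse index: first key of each block
        self.blocks = []      # compressed payload of each block
        self.blocks_decompressed = 0
        for i in range(0, len(sorted_rows), rows_per_block):
            chunk = sorted_rows[i:i + rows_per_block]
            self.first_keys.append(chunk[0][0])
            self.blocks.append(zlib.compress(json.dumps(chunk).encode()))

    def get(self, key):
        # Find the last block whose first key is <= the key we want.
        idx = bisect.bisect_right(self.first_keys, key) - 1
        if idx < 0:
            return None
        self.blocks_decompressed += 1
        chunk = json.loads(zlib.decompress(self.blocks[idx]))
        for k, v in chunk:
            if k == key:
                return v
        return None


def lsm_get(segments_newest_first, key):
    """LSM-style lookup: the newest segment containing the key wins."""
    for segment in segments_newest_first:
        value = segment.get(key)
        if value is not None:
            return value
    return None


old = BlockCompressedSegment([(i, f"value-{i}") for i in range(1000)], 16)
new = BlockCompressedSegment([(500, "updated-500")], 16)

hit_new = lsm_get([new, old], 500)
hit_old = lsm_get([new, old], 42)
```

Smaller blocks mean less wasted decompression per point read but usually a worse compression ratio, which is the trade-off being described.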

AdamProut · on Oct 15, 2022

Yandex (Google of Russia) is a large shareholder of Clickhouse Inc[1]. Yes, this may be a European Yandex subsidiary.. I really don't care to dig, but I think it's dishonest to say the company doesn't have Russian backers. Has Clickhouse Inc forced Yandex to divest their large ownership stake?

[1] https://www.crunchbase.com/organization/clickhouse/company_f...

datalopers · on Oct 15, 2022

> I really don't care to dig

Then why are you opining on the subject?

AdamProut · on Oct 15, 2022

Because half truths like your statements bother me? Yandex is a large investor in Clickhouse Inc. (at least as far as public record shows). Have they forced Yandex to divest their holdings?

datalopers · on Oct 15, 2022

It's literally in the article I linked (https://clickhouse.com/blog/we-stand-with-ukraine). As you're incapable of reading let me highlight it for you:

> We have no operations in Russia, no Russian investors, and no Russian members of our Board of Directors.

It's disappointing after 11+ yrs of failing to find traction for MemSQL, this is the hill you die on? You're running your mouth on some random comments thread about a shitty 2-person analytics platform where the author clearly doesn't know the first thing about databases and just went with the first Twitter ad he came across.

Under Raj's "leadership" you know damn well MemSQL/SingleStore will be dead within two years. The hyper-aggressive sales pressure had some success during the lofty bull market due to clueless IT managers with massive budgets, but that time has come to an end. You guys missed the window to IPO to get your own exit and your runway is quickly diminishing.

AdamProut · on Oct 15, 2022

I read the article. And yes, this was probably a mistake to engage with you on it.

Yandex N.V. (a Dutch holding company of Yandex Russia) is listed as an investor of Clickhouse Inc. This is all I'm stating.

begonealong · on Oct 16, 2022

This article makes a lot more sense when you know that Singlestore/MemSQL gifted Jack Ellis shares.

JackWritesCode · on Oct 16, 2022

Lmao. Hey, if I wasn't running Fathom, Dev Rel @ SingleStore is the only role I'd consider in tech right now. And I would definitely try to get some of those sweet shares as part of compensation. Alas, I don't own any SingleStore shares.

datalopers · on Oct 16, 2022

Hah. What a joke. So it's all astroturfing and fake evangelism.

AdamProut · on Oct 15, 2022

Clickhouse is great at single table analytical queries (Group-by + aggregates + filters over a single table). It can match any top tier columnstore database at this, maybe even better at smaller scales, and if that is all you need it will do the job well.

Beyond this, it has a naive query optimizer and limited ability to run distributed joins. This is why you won't find well known analytical benchmark results like TPC-H and TPC-DS for clickhouse vs other SQL data warehouses.

SingleStoreDB has been doing real time analytics much longer than Clickhouse and has a much more mature feature set at this point[1][2]. Our go to market is more big enterprise driven so we are definitely much less well known among developers. Revenue share-wise I suspect we are the leader in real time analytics (see disclaimer below - I'm biased but our revenue is reaching thresholds for IPO readiness).

Also, if you follow Jack's later posts you'll see he replaces DynamoDB and Redis with SingleStoreDB as well. Now we're getting into places Clickhouse doesn't tread at all...

Disclaimer: I'm one of the cofounders of MemSQL/SingleStoreDB (still working away on making it better all these years later...).

[1] https://www.singlestore.com/blog/the-technical-capabilities-...

[2] https://dl.acm.org/doi/abs/10.1145/3514221.3526055

qoega · on Oct 15, 2022

Can you provide SingleStore result for TPC-H and TPC-DS? I can't find it. Why? [1] https://www.tpc.org/tpcds/results/tpcds_results5.asp?orderby... [2] https://www.tpc.org/tpch/results/tpch_results5.asp?orderby=d... Disclaimer: I work for ClickHouse

AdamProut · on Oct 15, 2022

Almost no one does "official" TPC results these days (maybe other than Oracle and some of the Chinese vendors).

Most other cloud DWs have public TPC-H or TPC-DS results that are easily googlable. Clickhouse is missing for a reason...

  - https://research.gigaom.com/report/data-warehouse-cloud-benchmark/
  - https://www.databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html
  - https://celerdata.com/blog/starrocks-queries-outperform-clickhouse-apache-druid-and-trino
  - https://aws.amazon.com/blogs/big-data/amazon-redshift-continues-its-price-performance-leadership/

Our results are here: https://www.singlestore.com/blog/tpc-benchmarking-results/

AdamProut · on Oct 13, 2022

I'm not sure why this is being downvoted? Dividing up memory statically at startup/compile time has some nice benefits described in the article, but it also makes the database less flexible to changing workload requirements. For example, if a workload wants to do a big hash join that needs 90% of the memory at time X and then a big data load that needs most of the memory in a completely different subsystem at time Y - it's nice to be able to dynamically shift the memory around to match the workload. Doing static up front allocations seems to block this from happening, and that is maybe a fine trade-off for TigerBeetleDB based on the types of workloads they're optimizing for, but it's definitely a valid disadvantage of their approach.
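
The trade-off can be sketched with two toy memory accountants in Python. Neither is TigerBeetle's allocator or any real database's; the class names, the 50/50 split, and the unit sizes are assumptions chosen to mirror the hash join / data load example above.

```python
class StaticBudgets:
    """Each subsystem gets a fixed budget decided up front."""

    def __init__(self, budgets):
        self.budgets = dict(budgets)
        self.used = {name: 0 for name in budgets}

    def try_reserve(self, subsystem, amount):
        if self.used[subsystem] + amount > self.budgets[subsystem]:
            return False  # other subsystems' idle memory cannot be borrowed
        self.used[subsystem] += amount
        return True

    def release(self, subsystem, amount):
        self.used[subsystem] -= amount


class SharedPool:
    """All subsystems draw from one pool, so memory can shift over time."""

    def __init__(self, total):
        self.total = total
        self.used_total = 0

    def try_reserve(self, subsystem, amount):
        if self.used_total + amount > self.total:
            return False
        self.used_total += amount
        return True

    def release(self, subsystem, amount):
        self.used_total -= amount


static = StaticBudgets({"hash_join": 50, "loader": 50})  # 100 units total
shared = SharedPool(100)

# Time X: a big hash join wants 90% of memory.
static_join_ok = static.try_reserve("hash_join", 90)
shared_join_ok = shared.try_reserve("hash_join", 90)
if shared_join_ok:
    shared.release("hash_join", 90)

# Time Y: a big data load wants 90% of memory in a different subsystem.
static_load_ok = static.try_reserve("loader", 90)
shared_load_ok = shared.try_reserve("loader", 90)
```

With static budgets both large requests fail even though the other half of memory is idle each time; the shared pool serves both, at the cost of the predictability that static allocation buys.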

AdamProut · on Oct 4, 2022

I was keeping a tally of how many companies were offering "ClickHouse as a service" at one point last year. I think I got up to 7 or 8.

It will be interesting to watch this unfold from a code licensing perspective. Will Clickhouse Inc. move to a more restrictive license to block all these other ClickHouse services?

zX41ZdbW · on Oct 5, 2022

> It will be interesting to watch this unfold from a code licensing perspective. Will Clickhouse Inc. move to a more restrictive license to block all these other ClickHouse services?

I don't see it as a reasonable move.

AdamProut · on Sept 22, 2022

This is a reasonable benchmark from the Clickhouse folks for single table analytical query performance over smallish data sets (10s of GB of data). Most of the DW vendors are on there.

https://benchmark.clickhouse.com/

Timescale apparently lags pretty far behind modern columnstore engines.

akulkarni · on Sept 22, 2022

(Timescale co-founder)

As with anything, it depends on what you want to do.

If you have an OLAP heavy workload with long scans, etc (which is the type of queries prominent on the ClickHouse page - e.g., Q0 is "SELECT COUNT(*) FROM hits;"), then I would highly recommend systems other than Timescale. (Although we are also working on this ;-) )

But if you have time-series workload, or even, if you love Postgres and are building a time-series and/or analytical application, then I would recommend TimescaleDB.

ClickHouse is great. I just believe in using the right tool for the right job. :-) There are many areas where column store engines beat TimescaleDB. But nothing comes for free - everything has a tradeoff.

didip · on Sept 22, 2022

Genuinely curious, aren't a lot of time-series workloads OLAP oriented?