Hacker News | log4shell's comments

Calling a WAL a ledger, why? Ledger sounds fancier but why would it be a ledger in this case?


We called it a ledger since we stored financial data and, initially, basically used the “ledger” format from plain-text accounting.


That answers my question, thanks! The "ordering does not matter" part has me curious too; I will read the other comment and come back to you.


I believe "ledger" implies commutative property (order does not matter).


I am not aware of any such implicit connection between "ledger" and the commutative property, and I couldn't find anything either; my google-fu is letting me down. Anything I can refer to? Genuinely curious about the use of the term "ledger" outside of accounting and blockchains.

I have seen it used to mean WAL before, so I am taking this with a dose of skepticism.


In a double-entry ledger the order of transactions doesn't matter; the balance is the sum of the entries.

Depending on the data model of your log, if calculating the current state is a commutative operation, I think it's fair to call it a "ledger".
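
A minimal sketch of the distinction (my own illustration, not from the original system): summing signed ledger entries gives the same balance in any order, while replaying a typical last-write-wins WAL is order-sensitive.

    # Ledger-style state: an order-insensitive (commutative) reduction.
    from itertools import permutations

    entries = [+100, -30, +45, -15]                 # signed ledger entries
    assert {sum(p) for p in permutations(entries)} == {100}   # same balance in any order

    # WAL-style state: replay order matters for last-write-wins "set" records.
    wal = [("set", 100), ("set", 70)]
    def replay(log):
        state = None
        for _, value in log:
            state = value
        return state
    assert replay(wal) != replay(list(reversed(wal)))          # 70 vs. 100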


Good call-out; I should have named the query engines that are lagging behind too!

I am wondering how portable the Parquet format is and how interchangeable it is now?


It is refreshing to see multiple Arrow/DataFusion projects trying to bank on Spark's existing, user-friendly API instead of reinventing the API all over again.

There are the likes of Comet and Blaze, which replace Spark's execution backend with DataFusion, and then you have single-process alternatives like Sail trying to settle into the "not so big data" category.

I am watching the evolution of DataFusion-powered, Spark-compatible projects with a keen eye. Early days, but quite exciting.
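
To make the appeal concrete, here is a minimal sketch (my own, not from any of these projects' docs): ordinary PySpark DataFrame code that, in principle, can run unchanged when the engine behind the session is swapped out, e.g. via a Spark Connect endpoint; the endpoint and file name below are assumptions for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Point the session at whatever Spark Connect server fronts the
    # alternative engine (assumed endpoint, for illustration only).
    spark = SparkSession.builder.remote("sc://localhost:50051").getOrCreate()

    df = spark.read.parquet("events.parquet")        # assumed input file
    (df.groupBy("user_id")
       .agg(F.count("*").alias("events"))
       .orderBy(F.desc("events"))
       .show())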


What is your use case?


Congratulations to the duckdb team! Can't wait to try some of the newly released features and performance improvements.

I am quite curious about the plans for a Python dataframe-like API for duckdb, and for the Python ecosystem in general.


There is Ibis[0], which is a fairly mature package. They recently adopted duckdb as the default execution engine, and it can give you a nice Python dataframe API on top of duckdb, with hot-swappability towards heavier engines.

With tools like this providing a comprehensive Python API and the ability to always fall back to raw SQL, I am not sure the DuckDB devs should focus on the Python API at all beyond basic (to_table, from_table) features.
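
A minimal sketch of that workflow (my own; the file and column names are assumptions), showing the Ibis dataframe API on the DuckDB backend plus the raw-SQL escape hatch:

    import ibis

    con = ibis.duckdb.connect()                      # in-memory DuckDB backend
    trips = con.read_parquet("trips.parquet")        # assumed input file

    expr = (trips.group_by("vendor_id")
                 .aggregate(avg_fare=trips.fare_amount.mean()))
    print(expr.execute())                            # executed by DuckDB, returned as a pandas DataFrame

    # Escape hatch: run raw SQL against the same connection.
    print(con.sql("SELECT count(*) AS n FROM read_parquet('trips.parquet')").execute())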

Impressive progress and a real chance to shake up the data tool market, but there is still a way to go: there is much to do, especially on large table formats (Iceberg/Delta) and memory management when running on bigger boxes in the cloud. E.g. the elusive "Failed to allocate ..." bug[1] is an inhibitor to the claim that big data is dead[2]. As it is, we tried and abandoned DuckDB as a cheaper replacement for some Databricks batch jobs.

[0] https://github.com/ibis-project/ibis
[1] https://github.com/duckdb/duckdb/issues/12667, https://github.com/duckdb/duckdb/issues/9880, https://github.com/duckdb/duckdb/issues/12528
[2] https://motherduck.com/blog/big-data-is-dead/


The last I read, the Spark API was to become the focal point.

https://duckdb.org/docs/api/python/spark_api

Not sure what the current status is.

ref: https://github.com/duckdb/duckdb/issues/2000#issuecomment-18...
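
For reference, the experimental API on the linked docs page looks roughly like this (a sketch based on those docs at the time; module paths may have changed since):

    import pandas as pd
    from duckdb.experimental.spark.sql import SparkSession
    from duckdb.experimental.spark.sql.functions import col, lit

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(pd.DataFrame({"age": [34, 45], "name": ["Joan", "Peter"]}))
    df = df.withColumn("location", lit("Seattle"))
    print(df.select(col("age"), col("location")).collect())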


It's great to have a single entry point for multiple backends. What I am trying to understand, and couldn't find much information about: how does the use of multiple engines in Ibis impact the consistency of results for the same input and query, particularly with regard to semantic differences among the engines?


Is there a way for the general public to see the status of open sluice gates and water levels in various parts of the Netherlands? A live datastream would be best!


Rijkswaterstaat publishes quite a lot of information online. Water levels can be seen on a map[1] on waterinfo.rws.nl. It also offers access to historic data in CSV format, but that is handled through a wizard and you get the data by email apparently.

Information about sluices, bridges, etc. can be found on vaarweginformatie.nl, including the live status of many of them. The IJmuiden sluice complex doesn't seem to have a live status, though. See [2] for the map.

[1]: https://waterinfo.rws.nl/#/publiek/waterhoogte [2]: https://www.vaarweginformatie.nl/frp/main/#/geo/map


I guess 99% of the population would not be qualified to tell whether an open sluice gate is problematic under certain circumstances. If it were as easy as "sluice gate open && water level > X", I am 100% sure there would already be an automation for it.


Druid has quite a bit of intelligence baked in to handle scaling by default. I am curious how ClickHouse is doing in all those aspects.

When we did a PoC, the operational side and performance of ClickHouse were severely lacking compared to Druid, even though ClickHouse had bigger resources at its disposal than Druid during the PoC.

If they could improve the operational side and introduce sensible defaults so that users don't have to go through 10,000 configuration options to work with data in ClickHouse, I am sure I would give it a go for some other use case. It is simple on the surface, but the devil is in the details. Druid is much simpler and saner at the scale I need to operate at.


Kafka is not vulnerable to this particular exploit, but the HDFS Kafka Connect plugin is.

