Thanks! That's a nice 5x improvement. Pretty good for a query that offers only modest opportunity, given that the few columns it asks for are fairly small (`title` being the largest, which isn't that large).
With a more straightforward approach, the tool can be reproduced with just a few queries in ClickHouse.
1. Create a table with styles by authors:
CREATE TABLE hn_styles (name String, vec Array(UInt32)) ENGINE = MergeTree ORDER BY name
2. Calculate and insert style vectors (the insert takes 27 seconds):
INSERT INTO hn_styles WITH 128 AS vec_size,
cityHash64(arrayJoin(tokens(lower(decodeHTMLComponent(extractTextFromHTML(text)))))) % vec_size AS n,
arrayMap((x, i) -> i = n, range(vec_size), range(vec_size)) AS arr
SELECT by, sumForEach(arr) FROM hackernews_history GROUP BY by
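The INSERT above is doing feature hashing: each token is hashed into one of 128 buckets, and the per-bucket counts form the author's style vector. A minimal Python sketch of the same idea (using a stand-in stable hash instead of ClickHouse's cityHash64, and a trivial whitespace tokenizer instead of tokens()):

```python
import hashlib

VEC_SIZE = 128  # matches `128 AS vec_size` in the query

def style_vector(text: str) -> list[int]:
    # Hash each token into one of VEC_SIZE buckets and count hits,
    # like `cityHash64(token) % vec_size` plus `sumForEach` in the query.
    vec = [0] * VEC_SIZE
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "little")
        vec[h % VEC_SIZE] += 1
    return vec

v = style_vector("a a b")
```

Identical tokens always land in the same bucket, so repeated words accumulate; collisions between different words are acceptable noise at this vector size.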
We are trying to use CMake in a very limited fashion.
For example, any build-time environment checks are forbidden (no "try_compile" scripts), and the configuration for every platform is fixed.
We don't use it for installation and packaging; it is only used for builds. The builds have to be self-contained.
We also forbid using CMake files from third-party libraries. For every library, a new, clean CMake file is written, which contains the list of source files and nothing else.
From this standpoint, there should be no big difference between CMake, Bazel, Buck, GYP, GN, etc.
The Hacker News archive is hosted in ClickHouse as a publicly accessible data lake. It is available without sign-up and updated in real time. Example:
# Download ClickHouse:
curl https://clickhouse.com/ | sh
./clickhouse local
# Attach the table:
CREATE TABLE hackernews_history UUID '66491946-56e3-4790-a112-d2dc3963e68a'
(
update_time DateTime DEFAULT now(),
id UInt32,
deleted UInt8,
type Enum8('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
by LowCardinality(String),
time DateTime,
text String,
dead UInt8,
parent UInt32,
poll UInt32,
kids Array(UInt32),
url String,
score Int32,
title String,
parts Array(UInt32),
descendants Int32
)
ENGINE = ReplacingMergeTree(update_time)
ORDER BY id
SETTINGS refresh_parts_interval = 60,
disk = disk(readonly = true, type = 's3_plain_rewritable', endpoint = 'https://clicklake-test-2.s3.eu-central-1.amazonaws.com/', use_environment_credentials = false);
# Run queries:
SELECT time, decodeHTMLComponent(extractTextFromHTML(text)) AS t
FROM hackernews_history ORDER BY time DESC LIMIT 10 \G
# Download everything as Parquet/JSON/CSV...
SELECT * FROM hackernews_history INTO OUTFILE 'dump.parquet'
ClickHouse is a single binary. It can be invoked as `clickhouse-server`, `clickhouse-client`, and `clickhouse-local`. The help is available as `clickhouse-local --help`. `clickhouse-local` also has a shorthand alias, `ch`.
This binary is packaged inside .deb, .rpm, and .tgz, and it is also available for direct download. The `curl | sh` script detects the platform (x86_64 and aarch64, on Linux, macOS, and FreeBSD) and downloads the appropriate binary.
If you are interested in network monitoring in Kubernetes, it's worth looking at Kubenetmon: https://github.com/ClickHouse/kubenetmon - an open-source eBPF-based implementation from ClickHouse.
The article should mention ClickHouse, which fixes almost all the mentioned problems.
It has idempotency tokens for INSERTs and type-safe prepared statements that are easy to use, and settings can be passed at query time to avoid session state.
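On the prepared-statements point: ClickHouse's HTTP interface accepts typed query parameters, sent as `param_<name>` and referenced in the query as `{name:Type}`, so the server binds values instead of the client doing string interpolation. A small Python sketch that just builds such a request URL (the host and port below are illustrative, not part of the original comment):

```python
from urllib.parse import urlencode

def build_request(host: str, query: str, params: dict) -> str:
    # Each parameter travels as param_<name>=<value>; the query text
    # references it with a typed placeholder like {author:String}.
    qs = urlencode({"query": query,
                    **{f"param_{k}": v for k, v in params.items()}})
    return f"https://{host}:8443/?{qs}"

url = build_request(
    "play.clickhouse.com",  # assumed public endpoint for illustration
    "SELECT count() FROM hackernews_history WHERE by = {author:String}",
    {"author": "pg"},
)
```

Because the value never gets spliced into the SQL text, there is no quoting or injection concern, and the server checks the declared type.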