

Thanks! That's a nice 5x improvement. Pretty good for a query that offers only modest opportunity, given that the few columns it asks for are fairly small (`title` being the largest, which isn't that large).

With a more straightforward approach, the tool can be reproduced with just a few queries in ClickHouse.

1. Create a table with styles by authors:

    CREATE TABLE hn_styles (name String, vec Array(UInt32)) ENGINE = MergeTree ORDER BY name
2. Calculate and insert style vectors (the insert takes 27 seconds; the per-token construction is sketched after the results below):

    INSERT INTO hn_styles WITH 128 AS vec_size,
    cityHash64(arrayJoin(tokens(lower(decodeHTMLComponent(extractTextFromHTML(text)))))) % vec_size AS n,
    arrayMap((x, i) -> i = n, range(vec_size), range(vec_size)) AS arr
    SELECT by, sumForEach(arr) FROM hackernews_history GROUP BY by
3. Find nearest authors (the query takes ~50 ms):

    SELECT name, cosineDistance(vec, (SELECT vec FROM hn_styles WHERE name = 'antirez')) AS dist FROM hn_styles ORDER BY dist LIMIT 25

        ┌─name────────────┬─────────────────dist─┐
     1. │ antirez         │                    0 │
     2. │ geertj          │ 0.009644324175144714 │
     3. │ mrighele        │ 0.009742538810774581 │
     4. │ LukaAl          │ 0.009787061201638525 │
     5. │ adrianratnapala │ 0.010093164015005152 │
     6. │ prmph           │ 0.010097599441156513 │
     7. │ teilo           │ 0.010187607877663263 │
     8. │ lukesandberg    │  0.01035981357655602 │
     9. │ joshuak         │ 0.010421492503861374 │
    10. │ sharikous       │  0.01043547391491162 │
    11. │ lll-o-lll       │  0.01051205287096002 │
    12. │ enriquto        │ 0.010534816136353875 │
    13. │ rileymat2       │ 0.010591026237771195 │
    14. │ afiori          │ 0.010655186410089112 │
    15. │ 314             │ 0.010768594792569197 │
    16. │ superice        │ 0.010842615688153812 │
    17. │ cm2187          │  0.01105111720031593 │
    18. │ jorgeleo        │ 0.011159407590845771 │
    19. │ einhverfr       │ 0.011296755160620009 │
    20. │ goodcanadian    │ 0.011316316959489647 │
    21. │ harperlee       │ 0.011317367800365297 │
    22. │ seren           │ 0.011390119122640763 │
    23. │ abnry           │ 0.011394133096140235 │
    24. │ PraetorianGourd │ 0.011508457949426343 │
    25. │ ufo             │ 0.011538721312575051 │
        └─────────────────┴──────────────────────┘
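To sanity-check what step 2 computes, the per-token one-hot construction can be run in isolation. A minimal sketch; the input string here is made up:

    WITH 128 AS vec_size,
        cityHash64(arrayJoin(tokens(lower('This is a sample comment')))) % vec_size AS n
    SELECT n, arrayMap((x, i) -> i = n, range(vec_size), range(vec_size)) AS one_hot

Each token yields a bucket index `n` and a one-hot array; `sumForEach` in step 2 sums these arrays per author into the 128-dimensional style vector that `cosineDistance` compares.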

https://github.com/ClickHouse/ClickHouse

We are trying to use CMake in a very limited fashion.

For example, any build-time environment checks are forbidden (no "try_compile" scripts), and the configuration for all platforms is fixed.

We don't use it for installation and packaging; it is only used for builds. The builds have to be self-contained.

We also forbid using CMake files from third-party libraries. For every library, a new, clean CMake file is written, which contains the list of source files and nothing else.

From this standpoint, there should be no big difference between CMake, Bazel, Buck, GYP, GN, etc.
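
As an illustration, a per-library CMake file in this style might look like the following (a hypothetical sketch; the library name and paths are made up, not taken from the repository):

    # Only an explicit source list and include paths; no checks, no configuration.
    add_library (_example_lib
        "${LIBRARY_DIR}/src/a.c"
        "${LIBRARY_DIR}/src/b.c"
        "${LIBRARY_DIR}/src/c.c")
    target_include_directories (_example_lib SYSTEM PUBLIC "${LIBRARY_DIR}/include")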


The Hacker News archive is hosted in ClickHouse as a publicly accessible data lake. It is available without sign-up and is updated in real time. Example:

    # Download ClickHouse:
    curl https://clickhouse.com/ | sh
    ./clickhouse local

    # Attach the table:
    CREATE TABLE hackernews_history UUID '66491946-56e3-4790-a112-d2dc3963e68a'
    (
        update_time DateTime DEFAULT now(),
        id UInt32,
        deleted UInt8,
        type Enum8('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
        by LowCardinality(String),
        time DateTime,
        text String,
        dead UInt8,
        parent UInt32,
        poll UInt32,
        kids Array(UInt32),
        url String,
        score Int32,
        title String,
        parts Array(UInt32),
        descendants Int32
    )
    ENGINE = ReplacingMergeTree(update_time)
    ORDER BY id
    SETTINGS refresh_parts_interval = 60, 
        disk = disk(readonly = true, type = 's3_plain_rewritable', endpoint = 'https://clicklake-test-2.s3.eu-central-1.amazonaws.com/', use_environment_credentials = false);

    # Run queries:
    SELECT time, decodeHTMLComponent(extractTextFromHTML(text)) AS t
    FROM hackernews_history ORDER BY time DESC LIMIT 10 \G
    
    # Download everything as Parquet/JSON/CSV...
    SELECT * FROM hackernews_history INTO OUTFILE 'dump.parquet'
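
As a usage sketch once the table is attached (the column names come from the CREATE TABLE above; the yearly bucketing is just an illustration):

    SELECT toYear(time) AS year, count() AS comments
    FROM hackernews_history
    WHERE type = 'comment'
    GROUP BY year
    ORDER BY year
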
Also available on the public Playground: https://play.clickhouse.com/

Nice! And the CREATE TABLE in that example is exactly why I'd love to have it with a catalog ;-)

ClickHouse is a single binary. It can be invoked as clickhouse-server, clickhouse-client, and clickhouse-local. The help is available as `clickhouse-local --help`. clickhouse-local also has a shorthand alias, `ch`.

This binary is packaged inside .deb, .rpm, and .tgz, and it is also available for direct download. The curl|sh script detects the platform (x86_64 or aarch64; Linux, macOS, or FreeBSD) and downloads the appropriate binary.
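
A quick sketch of the equivalent invocations, assuming the binary was downloaded as ./clickhouse (the symlink step is just an illustration):

    # The single binary dispatches on its first argument, or on the name it is invoked as:
    ./clickhouse server                       # acts as clickhouse-server
    ./clickhouse client                       # acts as clickhouse-client
    ./clickhouse local --query "SELECT 1"     # acts as clickhouse-local
    # `ch` is the shorthand for local mode:
    ln -s clickhouse ch && ./ch --query "SELECT version()"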


If you are interested in network monitoring in Kubernetes, it's worth looking at Kubenetmon: https://github.com/ClickHouse/kubenetmon - an open-source eBPF-based implementation from ClickHouse.


I mean... from your blog post linked in the repo, this isn't eBPF-based?

https://clickhouse.com/blog/kubenetmon-open-sourced

The data collection method says: "conntrack with nf_conntrack_acct"


The article should mention ClickHouse, which fixes almost all of the problems mentioned.

It has idempotency tokens for INSERTs, type-safe prepared statements that are easy to use, and settings that can be passed at query time to avoid session state.

I've covered these advantages in my presentation, https://presentations.clickhouse.com/2025-meetup-seattle/ - after the "How Is It Possible?" section.
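
A minimal sketch of those three features (the table `t` and the token value are hypothetical; `insert_deduplication_token`, query parameters, and per-query SETTINGS are real ClickHouse features):

    -- Idempotent INSERT: retrying with the same token does not duplicate rows.
    INSERT INTO t SETTINGS insert_deduplication_token = 'order-12345' VALUES (1);

    -- Typed query parameters instead of string interpolation:
    --   clickhouse-client --param_id=42 --query "SELECT * FROM t WHERE id = {id:UInt32}"

    -- Settings passed per query, so no session state is needed:
    SELECT count() FROM t SETTINGS max_threads = 4;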


Also interesting to see is a map of all ADS-B emergency broadcasts, which can be one of ['general', 'nordo', 'downed', 'lifeguard', 'reserved', 'unlawful', 'minfuel']: https://adsb.exposed/?zoom=5&lat=38.3590&lng=-97.5146&query=...


ClickHouse will be faster for processing large JSON data files. The example from the gist looks as follows:

    ch -q "WITH arrayJoin(features) AS f SELECT f.properties.SitusCity WHERE f.properties.TotalNetValue < 193000 FROM 'data.json'"
Reference: https://jsonbench.com/


I'm willing to believe that will execute in less than 2 seconds, but it doesn't work as given.


Can you post a link to the JSON file?

Try this (the placement of FROM was incorrect):

    ch "WITH arrayJoin(features) AS f SELECT f.properties.SitusCity FROM 'a.json' WHERE f.properties.TotalNetValue < 193000"



ClickHouse can query PostgreSQL directly, and you can create a refreshable materialized view that re-runs the import automatically on a schedule (sketched after the links below):

https://clickhouse.com/docs/sql-reference/table-functions/po...

https://clickhouse.com/docs/materialized-view/refreshable-ma...

Additionally, you can set up incremental import with https://clickhouse.com/blog/postgres-cdc-connector-clickpipe...
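
A minimal sketch of both pieces (the hostname, credentials, and table names are placeholders):

    -- Query PostgreSQL directly via the postgresql() table function:
    SELECT count()
    FROM postgresql('pg-host:5432', 'mydb', 'orders', 'postgres_user', 'password');

    -- A refreshable materialized view re-runs the SELECT on a schedule
    -- and atomically replaces the stored result:
    CREATE MATERIALIZED VIEW orders_copy
    REFRESH EVERY 10 MINUTE
    ENGINE = MergeTree ORDER BY id
    AS SELECT * FROM postgresql('pg-host:5432', 'mydb', 'orders', 'postgres_user', 'password');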

