Thanks! That's a nice 5x improvement. Pretty good for a query that offers only modest opportunity, given that the few columns it asks for are fairly small (`title` being the largest, which isn't that large).
With a more straightforward approach, the tool can be reproduced with just a few queries in ClickHouse.
1. Create a table with styles by authors:
CREATE TABLE hn_styles (name String, vec Array(UInt32)) ENGINE = MergeTree ORDER BY name
2. Calculate and insert style vectors (the insert takes 27 seconds):
INSERT INTO hn_styles WITH 128 AS vec_size,
cityHash64(arrayJoin(tokens(lower(decodeHTMLComponent(extractTextFromHTML(text)))))) % vec_size AS n,
arrayMap((x, i) -> i = n, range(vec_size), range(vec_size)) AS arr
SELECT by, sumForEach(arr) FROM hackernews_history GROUP BY by
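The INSERT above is doing feature hashing: each token is hashed into one of 128 buckets, and the per-bucket counts form the author's style vector. A minimal Python sketch of the same idea (using a stand-in stable hash instead of ClickHouse's cityHash64, and a trivial whitespace tokenizer instead of tokens()):

```python
import hashlib

VEC_SIZE = 128  # matches `128 AS vec_size` in the query

def style_vector(text: str) -> list[int]:
    # Hash each token into one of VEC_SIZE buckets and count hits,
    # like `cityHash64(token) % vec_size` plus `sumForEach` in the query.
    vec = [0] * VEC_SIZE
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "little")
        vec[h % VEC_SIZE] += 1
    return vec

v = style_vector("a a b")
```

Identical tokens always land in the same bucket, so repeated words accumulate; collisions between different words are acceptable noise at this vector size.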
We are trying to use CMake in a very limited fashion.
For example, any build-time environment checks are forbidden (no "try_compile" scripts), and the configuration for every platform is fixed.
We don't use it for installation and packaging; it is only used for builds. The builds have to be self-contained.
We also forbid using CMake files from third-party libraries. For every library, a new, clean CMake file is written, which contains the list of source files and nothing else.
From this standpoint, there should be no big difference between CMake, Bazel, Buck, GYP, GN, etc.
The Hacker News archive is hosted in ClickHouse as a publicly accessible data lake. It is available without sign-up and updated in real time. Example:
# Download ClickHouse:
curl https://clickhouse.com/ | sh
./clickhouse local
# Attach the table:
CREATE TABLE hackernews_history UUID '66491946-56e3-4790-a112-d2dc3963e68a'
(
update_time DateTime DEFAULT now(),
id UInt32,
deleted UInt8,
type Enum8('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
by LowCardinality(String),
time DateTime,
text String,
dead UInt8,
parent UInt32,
poll UInt32,
kids Array(UInt32),
url String,
score Int32,
title String,
parts Array(UInt32),
descendants Int32
)
ENGINE = ReplacingMergeTree(update_time)
ORDER BY id
SETTINGS refresh_parts_interval = 60,
disk = disk(readonly = true, type = 's3_plain_rewritable', endpoint = 'https://clicklake-test-2.s3.eu-central-1.amazonaws.com/', use_environment_credentials = false);
# Run queries:
SELECT time, decodeHTMLComponent(extractTextFromHTML(text)) AS t
FROM hackernews_history ORDER BY time DESC LIMIT 10 \G
# Download everything as Parquet/JSON/CSV...
SELECT * FROM hackernews_history INTO OUTFILE 'dump.parquet'
ClickHouse is a single binary. It can be invoked as `clickhouse-server`, `clickhouse-client`, and `clickhouse-local`. The help is available as `clickhouse-local --help`. `clickhouse-local` also has a shorthand alias, `ch`.
This binary is packaged inside .deb, .rpm, and .tgz, and it is also available for direct download. The `curl | sh` script detects the platform (x86_64 and aarch64, on Linux, macOS, and FreeBSD) and downloads the appropriate binary.
If you are interested in network monitoring in Kubernetes, it's worth looking at Kubenetmon: https://github.com/ClickHouse/kubenetmon - an open-source eBPF-based implementation from ClickHouse.
The article should mention ClickHouse, which fixes almost all the mentioned problems.
It has idempotency tokens for INSERTs and type-safe prepared statements that are easy to use, and settings can be passed at query time to avoid session state.
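On the prepared-statements point: ClickHouse's HTTP interface accepts typed query parameters, sent as `param_<name>` and referenced in the query as `{name:Type}`, so the server binds values instead of the client doing string interpolation. A small Python sketch that just builds such a request URL (the host and port below are illustrative, not part of the original comment):

```python
from urllib.parse import urlencode

def build_request(host: str, query: str, params: dict) -> str:
    # Each parameter travels as param_<name>=<value>; the query text
    # references it with a typed placeholder like {author:String}.
    qs = urlencode({"query": query,
                    **{f"param_{k}": v for k, v in params.items()}})
    return f"https://{host}:8443/?{qs}"

url = build_request(
    "play.clickhouse.com",  # assumed public endpoint for illustration
    "SELECT count() FROM hackernews_history WHERE by = {author:String}",
    {"author": "pg"},
)
```

Because the value never gets spliced into the SQL text, there is no quoting or injection concern, and the server checks the declared type.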