Looks really cool, but in production systems, won't the trace files proliferate ...

alehander42 · 2025-03-06T15:52:56 1741276376

We are also planning to develop a distributed tracing platform, similar to Jaeger and OpenTelemetry, that continuously records the execution of many distributed processes (e.g. micro-services).

Unlike the existing platforms, which capture only message flows and require you to make educated guesses when an anomaly is observed, our system will let you accurately replay the processing code for each message to quickly identify the root cause of the anomaly.

This would rely on our ability to jump to the specific moment in time when a certain incoming message starts being processed. This moment can be identified either by a log line with a specific format or by a call to some special tracking function (e.g. track_incoming_message(request_id)).
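As a rough illustration of that lookup, here is a minimal Python sketch. Only the name track_incoming_message(request_id) comes from the description above; the TraceEvent record, the find_message_start helper, and the log line format are hypothetical and do not reflect the actual trace format.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class TraceEvent:
    """Hypothetical trace record; not the real runtime_tracing format."""
    step: int          # position in the recorded execution
    kind: str          # "call" or "log"
    name: str          # function name for calls, message text for logs
    args: tuple = ()   # call arguments (empty for log lines)


def find_message_start(events: Iterable[TraceEvent], request_id: str) -> Optional[int]:
    """Return the step at which processing of request_id begins, or None."""
    log_marker = f"incoming_message request_id={request_id}"  # assumed log format
    for ev in events:
        # Case 1: an explicit call to the special tracking function.
        if ev.kind == "call" and ev.name == "track_incoming_message" and ev.args == (request_id,):
            return ev.step
        # Case 2: a log line in the agreed-upon format.
        if ev.kind == "log" and ev.name == log_marker:
            return ev.step
    return None
```

A debugger front end could then jump straight to the returned step instead of replaying from the start of the recording.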

For systems languages, RR[1] recordings stay practical by capturing only the non-deterministic events in the program execution. You can pair this with a ring buffer that discards the data after a certain retention period.
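A minimal sketch of such a retention policy in Python, assuming events arrive with monotonically increasing timestamps. RetentionBuffer is a hypothetical name and this is not how RR stores its recordings; it only illustrates the discard rule, not whether the remaining data is still replayable.

```python
from collections import deque


class RetentionBuffer:
    """Keeps only events newer than a fixed retention window (illustrative)."""

    def __init__(self, retention_seconds: float):
        self.retention_seconds = retention_seconds
        self._events = deque()  # (timestamp, payload) pairs, oldest first

    def append(self, timestamp: float, payload: str) -> None:
        """Store an event, then drop everything older than the window."""
        self._events.append((timestamp, payload))
        cutoff = timestamp - self.retention_seconds
        # Timestamps are assumed to be monotonic, so expired events sit at the left.
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def payloads(self) -> list:
        """Return the retained payloads, oldest first."""
        return [payload for _, payload in self._events]
```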

For scripting languages (or any implementation using the db-like traces), we might add some advanced record filtering options.

(But maybe we are misunderstanding the question?)

1: https://rr-project.org/

Veserv · 2025-03-07T01:41:53 1741311713

You cannot just discard the oldest data of a long-running execution trace when doing replay-based time-travel debugging.

You cannot replay execution without a known state followed by all of the non-determinism after that state, which is most easily done by starting from the initial state. To discard data, you need to materialize a state snapshot corresponding to the cut-off time to enable forward reconstruction from that state.
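To make the constraint concrete, here is a toy Python sketch. It assumes the whole program state is a single integer and each non-deterministic event is an integer input; step, Recording, and discard_oldest are hypothetical names, not part of RR or any real tool.

```python
from dataclasses import dataclass, field


def step(state: int, event: int) -> int:
    """Deterministic transition: the same state and event always give the same result."""
    return (state * 31 + event) % 1_000_003


@dataclass
class Recording:
    snapshot: int                               # known state at the start of the retained log
    events: list = field(default_factory=list)  # non-deterministic inputs after the snapshot

    def replay(self) -> int:
        """Reconstruct the final state from the snapshot plus the retained events."""
        state = self.snapshot
        for event in self.events:
            state = step(state, event)
        return state

    def discard_oldest(self, count: int) -> None:
        """Drop the oldest events, first folding them into a new snapshot."""
        state = self.snapshot
        for event in self.events[:count]:
            state = step(state, event)
        self.snapshot = state
        self.events = self.events[count:]
```

Dropping the first events without updating the snapshot would replay the remaining events from the wrong starting state and diverge from the original execution.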

alehander42 · 2025-03-07T19:15:18 1741374918

You're right. In the RR case, an RR contributor is working on persistent checkpoints, which can act as snapshots. This is not merged yet.

kreco · 2025-03-06T15:53:06 1741276386

Especially since the trace files are in .json. [0]

[0] https://github.com/metacraft-labs/runtime_tracing#format

alehander42 · 2025-03-06T18:03:42 1741284222

True! The next major version should use a more optimized format, as mentioned.

However, some of the important optimizations we're preparing are less about the format itself and more about recording more specific things and reconstructing more in postprocessing.