Stuff Your Logs (pvk.ca)
118 points by r4um on March 19, 2021 | 20 comments


This is really neat. In the past I've used similar techniques to decode binary data from a third-party lidar system in parallel, in a way the manufacturers probably didn't intend or expect.

The system generated large data files which we wanted to process in parallel without any pre-indexing. It turned out that these streams contained sync markers which were "unlikely" to occur in the real data, but there wasn't any precise framing like COBS. Regardless, the markers and certain patterns in the binary headers were enough to synchronize with the stream with a very high degree of reliability.

So for parallel processing we'd seek into the middle of the file to process a chunk of data, synchronize with the stream, and process all subsequent lidar scanlines which started in that chunk. Exactly the algorithm they describe here.
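Roughly, the loop looks like the sketch below (not our actual code; the marker bytes, the header checks and the scanline decoding here are stand-ins):

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical 2-byte sync marker; the real format had its own. */
    static const unsigned char MARKER[2] = { 0xFA, 0xF5 };

    /* Offset of the next marker at or after `from`, or -1 if none. */
    static long find_marker(const unsigned char *buf, size_t len, size_t from)
    {
        for (size_t i = from; i + sizeof MARKER <= len; i++)
            if (memcmp(buf + i, MARKER, sizeof MARKER) == 0)
                return (long)i;
        return -1;
    }

    /* Stand-in for the real per-record decoder (parse header, extract points). */
    static void decode_scanline(const unsigned char *rec, size_t len)
    {
        (void)rec; (void)len;
    }

    /* `buf` starts at the chunk's first byte and extends past `chunk_end`
     * into the next chunk, so the last record starting here is complete.
     * Only records that *start* before chunk_end belong to this worker. */
    void process_chunk(const unsigned char *buf, size_t len, size_t chunk_end)
    {
        long off = find_marker(buf, len, 0);          /* resynchronise */
        while (off >= 0 && (size_t)off < chunk_end) {
            long next = find_marker(buf, len, (size_t)off + sizeof MARKER);
            size_t end = (next >= 0) ? (size_t)next : len;
            decode_scanline(buf + off, end - (size_t)off);
            off = next;
        }
    }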

Amusingly this approach gave reasonable results even in the presence of significant corruption where the manufacturer's software would give up.


Having skimmed the article (because $dayjob and all), I wonder how/if their scheme can cope with write(2) producing a short write, with not all the data in the buffer being atomically committed to their POSIX-compliant backing store?

I don't see any mechanism described that makes sure that never happens (e.g. by capping records at a length that can always be written atomically - and I'm not sure such a length even exists...), so I'm wondering how often that kind of thing even happens on contemporary systems, and - if it does - how often it wrecks a good number of stored records.


I'm assuming you're talking about https://www.notthewizard.com/2014/06/17/are-files-appends-re...

It seems they have a max record size of 512 bytes[0] and reject[1] any record whose encoded length exceeds it.

[0] https://github.com/backtrace-labs/stuffed-record-stream/blob...

[1] https://github.com/backtrace-labs/stuffed-record-stream/blob...


> I'm assuming you're talking about https://www.notthewizard.com/2014/06/17/are-files-appends-re...

I just skimmed this article and found this gem:

> So there you have it, empirical proof that Linux allows atomic appends of 4KB.

No, this only means that the specific example presented worked on the specific Linux version, filesystem, hardware, etc. that the author used. As long as you cannot point to the code path responsible for 4k atomic writes, this proves nothing. Even some of the comments under that article point out that it doesn't work on their systems.

Sadly, this is a common question that semi-regularly crops up in IRC channels that I'm on. "Under conditions <such-and-such>, can I get away with assuming write(2) is atomic and not check the return value?"

IMO if you want to write robust and portable programs, you should never ever make such assumptions and stick strictly to what the documentation says. And the documentation about write(2) says that it can return less than you passed in and it can fail under a number of conditions (e.g. interrupted by a signal). The implementation corner cases that you rely on may change between versions or even across filesystems.

Why is it so hard to read the documentation and write code in a mindset that everything the documentation says could potentially go wrong, ...you know..., could go wrong?

If write returns and the result is positive, but less than what you wanted, and you really badly want it written, advance the data pointer, subtract the size and retry. If the result is negative, check if it was something like EINTR and retry, otherwise fail. And even then, when this is done, don't make any assumptions about the data being on disk or sent over the network. It may still be in a kernel buffer and the documentation gives you no such guarantees either.
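Roughly, in C (just a sketch of the idea; my actual wrapper is the gist linked below):

    #include <errno.h>
    #include <stddef.h>
    #include <unistd.h>

    /* Returns 0 once all `len` bytes were handed to the kernel, -1 on error
     * (errno set). Says nothing about the data reaching disk or the network. */
    int write_all(int fd, const void *buf, size_t len)
    {
        const char *p = buf;
        while (len > 0) {
            ssize_t n = write(fd, p, len);
            if (n < 0) {
                if (errno == EINTR)      /* interrupted: just retry */
                    continue;
                return -1;               /* real error: give up */
            }
            p += (size_t)n;              /* short write: advance and retry */
            len -= (size_t)n;
        }
        return 0;
    }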

In fact, for the IRC discussions I eventually hacked together a small wrapper function[1] that took me less time than writing this response. Yes, you may think up a use case where this poses problems for you, but I'm not trying to solve somebody's specific problem; that's left for them to do. I'm just trying to make a point.

[1] https://gist.github.com/AgentD/09c3ecbcbfb07e08fb8c2550db651...


> Why is it so hard to read the documentation and write code in a mindset that everything the documentation says could potentially go wrong, ...you know..., could go wrong?

I'm with you on typical instances of this, such as incomplete writes (which you point out), but didn't you start the discussion asking about atomicity rather than incomplete writes? A guarantee of full writes (and success on each one) still wouldn't guarantee atomicity. Atomicity is something you can't get with a wrapper like that; it needs to be baked into the implementation all the way down to the hardware. So if you can't rely on it in the implementation then that drastically affects your ability to write robust and performant software for it.


Yes, you are right about the atomicity, if that is the desired goal.

My argument is from the viewpoint that the people in question want to achieve a full write of all the data, but then don't bother implementing it, relying on a non-existent atomicity guarantee instead.


Short writes to (local) filesystems are rare in practice, unless you run into kernel limits (e.g., Linux stops a couple KB short of 2GB). When they do happen, it's often due to issues that are hard to recover from, like failing media or lack of disk space.

You're right that short writes can cause data loss; however, self-synchronisation guarantees that only the two records immediately adjacent to the short write can be affected (with a reasonable implementation, at most one... and if you prefix each record with the 2-byte separator, only if the write is cut short after the separator's first byte).
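As a sketch of what the reading side looks like (the separator bytes and the validity check here are placeholders, not the actual stuffed-record format):

    #include <stddef.h>
    #include <string.h>

    static const unsigned char SEP[2] = { 0xFE, 0xFD };   /* assumed separator */

    /* A real reader would verify a per-record checksum here. */
    static int record_ok(const unsigned char *rec, size_t len)
    {
        (void)rec;
        return len > 0;
    }

    /* Calls handle() for every intact record found between separators;
     * anything torn by a short write simply fails validation and is skipped. */
    void scan_records(const unsigned char *buf, size_t len,
                      void (*handle)(const unsigned char *, size_t))
    {
        size_t start = 0;
        for (size_t i = 0; i + sizeof SEP <= len; i++) {
            if (memcmp(buf + i, SEP, sizeof SEP) == 0) {
                if (record_ok(buf + start, i - start))
                    handle(buf + start, i - start);   /* intact record */
                start = i + sizeof SEP;               /* resynchronise here */
                i += sizeof SEP - 1;
            }
        }
        /* Bytes after the last separator may be a torn tail and are ignored. */
    }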


I see short writes pretty regularly under memory pressure, for writes well under 2GB. With io_uring they are extremely common (though that got less so in very recent kernel versions), for different reasons for direct and buffered IO, IIRC.


Oh, not just memory pressure but a busy IO stack too. When there are multiple requests in flight already (concurrent processes, AIO, writeback), part of the write may be issued as a separate (block layer) request, and IIRC there's a chance of getting interrupted before the next parts are sent.


I can see how it might be different with io_uring; I've not observed short writes with O_APPEND and regular buffered I/O.


Yeah, that's wrong: if the writer gets a signal, for example, the write can be short. The only way to get any guarantee like this is with a pipe while respecting PIPE_BUF.


That's an unfortunately common misconception supported by the wording in POSIX, probably because no one wants to codify ill-defined practices.

I remember back in the 00s, people would refer to the implementation of the write syscall in some old BSD to explain why short writes are expected only in actual error conditions; IIRC, local disk I/O was considered "fast", so the syscall only checked for signals during setup and while returning to userspace. Around that time, I came to realise that Linux doesn't cut write(2) short due to signals; antirez confirms that was still the case in 2012: https://news.ycombinator.com/item?id=3790154

This expectation of no short writes to local files outside real errors has spread to too many programs, so it must be upheld by any useful POSIX-compatible OS (https://utcc.utoronto.ca/~cks/space/blog/unix/WritesNotShort...). However, there are no hard rules for when short writes to files can and cannot happen (e.g., I'm pretty sure NFS is a common exception), and programs can be portable enough without knowing them.

In contrast, it's easy to exceed reasonable limits for atomic pipe I/O, so it makes sense that POSIX would codify existing practices. Unfortunately, a vague agreement of "no shenanigans on regular file writes" doesn't fit anywhere in a specification, so people lacking the context end up thinking that only pipes can be expected to provide uninterrupted writes, while the reality is that only pipes broke writes regularly enough that it made sense to specify when and how that can happen.


The person behind Boost.AFIO wrote a test and determined that ext4 writes are only atomic one byte at a time; O_DIRECT, with its perf/portability considerations, was needed as well.

https://stackoverflow.com/a/35258623


That's for concurrent reads, where the "atomicity" guarantees as defined in POSIX don't mean what anyone expects.


Yes, I misinterpreted that, thanks. I wonder if a later spec requires sector-atomic writes, because I always worry something will be optimized later while still meeting the requirements. I thought this had happened for XFS on Linux, but the sector guarantee remained (only the stronger SGI behavior was dropped, to reduce latency) - similar to how more C compiler optimizations now trip up people who inadvertently hit UB. It really does seem that 256 bytes, as used in this post, is safe on all common systems, but I'd still be afraid it could break unless I saw it spelled out in a spec and man page.


This is a wonderful article! This is exactly what I need; I can’t wait to implement this! Thank you to both the author and the person who posted this ^^


See also an even more advanced log pipeline that preserves logs across kernel panics (!), implemented and open-sourced by the Google Fiber team: https://apenwarr.ca/log/20190216


Oh my, that’s some incredibly nice stuff right there. Thank you very much for the link! The thing I am working on now is definitely going to be influenced a lot by having read this :D





