Lord of the io_uring: io_uring tutorial, examples and reference (unixism.net)
267 points by shuss on May 10, 2020 | 74 comments



One thing this writeup made me realize is, if I have a misbehaving I/O system (NFS or remote block device over a flaky network, dying SSD, etc.), in the pre-io_uring world I'd probably see that via /proc/$pid/stack pretty clearly - I'd see a stack with the read syscall, then the particular I/O subsystem, then the physical implementation of that subsystem. Or if I looked at /proc/$pid/syscall I'd see a read call on a certain fd, and I could look in /proc/$pid/fd/ and see which fd it was and where it lived.

However, in the post-io_uring world, I think I won't see that, right? If I understand right, I'll at most see a call to io_uring_enter, and maybe not even that.

How do I tell what a stuck io_uring-using program is stuck on? Is there a way I can see all the pending I/Os and what's going on with them?

How is this implemented internally - does it expand into one kernel thread per I/O, or something? (I guess, if you had a silly filesystem which spent 5 seconds in TASK_UNINTERRUPTIBLE on each read, and you used io_uring to submit 100 reads from it, what actually happens?)


I think that's a very reasonable concern. It however isn't really about io_uring - it applies to all "async" solutions. Even today, if you are running async IO in userspace (e.g. using epoll), it's not very obvious where something went wrong, because no task is seemingly blocked. If you attach a debugger, you will most likely see something blocked on epoll - but a callstack into the problematic application code is nowhere in sight.

Even if you pause execution while inside the application code, there might not be a great stack which contains all the relevant data. It will only contain the information since the last task resumption (e.g. through a callback). Depending on your solution (C callbacks, C++ closures, C# or Kotlin async/await, Rust async/await), the information will range from not very helpful to somewhat understandable, but it will never be on par with a synchronous call.


> Even today if you are running async IO in userspace (e.g. using epoll), it's not very obvious where something went wrong, because no task is seemingly blocked.

It doesn't apply to file IO, which is never non-blocking and can't be made async with epoll. Epoll always considers files ready for any IO. And if the device is slow, the thread is blocked in the dreaded "D" state.
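
To make that concrete, a minimal sketch with plain poll() (epoll_ctl() itself actually refuses regular files outright with EPERM, so poll() stands in here; the path is just an example):

    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>

    int main(void)
    {
        /* Any regular file will do. */
        int fd = open("/etc/hostname", O_RDONLY);
        if (fd < 0)
            return 1;

        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int n = poll(&pfd, 1, 0); /* zero timeout: pure readiness check */

        /* Prints "1 0x1" (POLLIN) immediately. Readiness says nothing
           about whether the data is in the page cache or on a dying disk;
           the read() that follows is what actually blocks. */
        printf("%d 0x%x\n", n, pfd.revents);
        return 0;
    }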


The fundamental problem is that readiness-based async IO and random access do not mix well. You'd need a way to poll readiness for different positions in the same file at the same time.

Completion-based async (including io_uring on Linux or IO completion ports on Windows) doesn't suffer from this problem.
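
To illustrate the contrast, a minimal liburing sketch (the file name is made up, and IORING_OP_READ needs kernel 5.6+) with two reads at different offsets of the same file in flight at once, which is exactly what a readiness API cannot express:

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_cqe *cqe;
        static char a[4096], b[4096];

        io_uring_queue_init(4, &ring, 0);
        int fd = open("data.bin", O_RDONLY); /* hypothetical input file */

        /* Queue two reads against different offsets of the same fd. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, a, sizeof(a), 0);
        sqe->user_data = 1;

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, b, sizeof(b), 1 << 20);
        sqe->user_data = 2;

        io_uring_submit(&ring);

        /* Each completion identifies itself; ordering is not guaranteed. */
        for (int i = 0; i < 2; i++) {
            io_uring_wait_cqe(&ring, &cqe);
            printf("request %llu: %d bytes\n",
                   (unsigned long long) cqe->user_data, cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return 0;
    }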


> It will only contain the information since the last task resumption

That's an implementation detail though. As far as I'm aware, Python keeps hold of the stack, so it outputs complete stack traces as you'd expect from synchronous code.


You would want to start using the more modern debugging tools, namely dynamic tracing tools like bpftrace[1]. Though in fairness, it might be a tad tricky to get a trace for a specific file without some more complicated scripts.

[1]: https://github.com/iovisor/bpftrace


This is such a great point. I never thought about how async I/O could be a problem this way. In the SQ polling example, I used BPF to "prove" that the process does not make system calls:

https://unixism.net/loti/tutorial/sq_poll.html

Could be a good idea to use BPF to expose what io_uring is doing. Just a wild thought.


Good point. Would be great if the submission and completion ring buffers were accessible via procfs.


Could eBPF be used? I'm really not sure myself.


Use timeouts?


How exactly? I/O in TASK_UNINTERRUPTIBLE/TASK_KILLABLE cannot be timed out - so part of my question is how io_uring handles that in general.


If it's just blocked, you could probably look at the io_uring kthreads. But as I mentioned in another comment, bpftrace is probably a more useful tool for things like this (and it's useful for general kernel debugging too!).


There are some benchmarks that show io_uring as a significant boost over aio: https://www.phoronix.com/scan.php?page=news_item&px=Linux-5....

I see that nginx accepted a pull request to use it, mid last year: https://github.com/hakasenyang/openssl-patch/issues/21

Curious if it's also been adopted by other popular IO intensive software.


Oh, yeah. QEMU 5.0 already uses io_uring. In fact, it uses liburing. Check out the changelog: https://wiki.qemu.org/ChangeLog/5.0


To save people time, there's a single reference to it on the changelog:

> The file-posix driver can now use the io_uring interface of Linux with aio=io_uring

Side note: I did notice that a change we built made it into a released version of qemu:

> qemu-img convert -n now understands a --target-is-zero option, which tells it that the target image is completely zero, so it does not need to be zeroed again.

That's saving us so much time and I/O


Echo server benchmarks, io_uring vs epoll: https://github.com/frevib/io_uring-echo-server/blob/io-uring...


Nice. Reading through the epoll implementation, shouldn't it re-register to make sure the send() call won't block? It looks like it only sets non-blocking on the accept socket and then registers a single read.


Could you elaborate on "re-register"? It does not do short writes, if that's what you mean.


Normally I'd expect an epoll implementation to use epoll_ctl to make sure the socket can be written to without blocking. In this benchmark it probably makes no difference, but I would think it would make the results a little more in line with a real application's usage of epoll.


This is a bare-minimum echo server for educational purposes. It is not in line with a real-world event loop.


Yeah, I understand that, but if you are going for identical performance characteristics to how epoll would normally be used, then I would expect it to re-register. That's all I was getting at.


I have not adopted io_uring yet because it isn't clear that it will provide useful performance improvements over Linux aio in cases where the disk I/O subsystem is already highly optimized. Where io_uring seems to show a benefit relative to Linux aio is with more naive software designs, which adds a lot of value but is a somewhat different value proposition than has been expressed.

For software that is already capable of driving storage hardware at its theoretical limit, the benefit is less immediate and offset by the requirement of having a very recent Linux kernel.


For regular files, aio works asynchronously only if they are opened in unbuffered (O_DIRECT) mode. I think this is a huge limitation. io_uring, on the other hand, can provide a uniform interface for all file descriptors, whether they are sockets or regular files. This should be a decent win, IMO.


That was kind of my point. While all of this is true, these are not material limitations for the implementation of high-performance storage engines. For example, using unbuffered file descriptors is a standard design element of databases for performance reasons that remain true.

Being able to drive networking over io_uring would be a big advantage but my understanding from people using it is that part is still a bit broken.


The ScyllaDB developers wrote up their take here: https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-wi...


Those benchmark results are pretty impressive. In particular, io_uring gets the best performance both when the data is in the page cache and when bypassing the cache.


True, have to agree here. Although one advantage over aio for block I/O that io_uring will still have is the ability to use polling mode to almost completely avoid system calls.


And "works" is used advisedly. Certain filesystem edge conditions, particularly metadata changes due to block allocation, can result in blocking behavior.


If I understand the premise right, it should be fewer syscalls per IO. So even if it doesn't improve disk I/O, it might reduce CPU utilization.


It could also be nearly zero system calls per I/O operation. The kernel can poll the submission queue for new entries. This eliminates system call overhead at the cost of higher CPU utilization.
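
For illustration, a hedged sketch of the setup (the flag is real; the idle timeout value is arbitrary, and on kernels of this era SQPOLL needs root/CAP_SYS_ADMIN):

    #include <liburing.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_params params;

        memset(&params, 0, sizeof(params));
        params.flags = IORING_SETUP_SQPOLL;
        params.sq_thread_idle = 2000; /* poller thread sleeps after 2s idle */

        int ret = io_uring_queue_init_params(8, &ring, &params);
        if (ret < 0) {
            fprintf(stderr, "queue_init: %s\n", strerror(-ret));
            return 1;
        }

        /* From here on, io_uring_submit() only needs to call
           io_uring_enter() when the poller thread has gone to sleep. */
        io_uring_queue_exit(&ring);
        return 0;
    }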


This is true, but for most intentionally optimized storage engines that syscall overhead is below the noise floor in practice, even on NVMe storage. A single core can easily drive gigabytes per second using the old Linux AIO interface.

It appears to primarily be an optimization for storage that was not well-optimized to begin with. It is not obvious that it would make high-performance storage engines faster.


aio doesn't have facilities for synchronous filesystem metadata operations like open, rename, unlink, etc. If your workload is largely metadata-static, aio is ok. If you need to do even a little filesystem manipulation, io_uring seems like it can provide some benefits.
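
A sketch of what that looks like with liburing (IORING_OP_OPENAT needs kernel 5.6+; the path is just an example):

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_cqe *cqe;

        io_uring_queue_init(4, &ring, 0);

        /* An open() submitted through the ring: a metadata operation
           that Linux aio has no opcode for. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_openat(sqe, AT_FDCWD, "/etc/hostname", O_RDONLY, 0);
        io_uring_submit(&ring);

        io_uring_wait_cqe(&ring, &cqe);
        printf("openat via io_uring: res=%d\n", cqe->res); /* fd if >= 0 */
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return 0;
    }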


Samba can optionally use it if you explicitly load the vfs_io_uring module, but it exposed a bug for us (see my comment above). We're fixing it right now.


A comment from the cat example:

>/* For each block of the file we need to read, we allocate an iovec struct which is indexed into the iovecs array. This array is passed in as part of the submission. If you don't understand this, then you need to look up how the readv() and writev() system calls work. */

I have to say, I don't really understand why the author chose to individually allocate (up to millions of) single-kilobyte buffers for each file. Perhaps there is a reason for it, but I think they should elaborate on the choice. Anyway, I guess the first example is too simplified, which is why what follows it is not built on top of it in any way, hence they feel disjointed.

The bigger problem here is that I don't know the author, or how talented they are. Choices like that, or writing non-async-signal-safe signal handlers, don't help in estimating it, either. Is the rest of the advice sound?


The author here: All examples in the guide are aimed at shedding light on the io_uring and liburing interfaces. They are not very useful or very real-worldish examples. The idea with this example in particular is to show the difference between how readv/writev work synchronously and how they would be "called" via io_uring. Maybe I should call out more clearly in the text that these programs are tuned towards explaining the io_uring interface. Thanks for the feedback.


So awesome... The ring buffer is like a generic asynchronous system call submission mechanism. The set of supported operations is already a subset of available Linux system calls:

https://github.com/torvalds/linux/blob/master/include/uapi/l...

It almost gained support for ioctl:

https://lwn.net/Articles/810414/

Wouldn't it be cool if it gained support for other types of system calls? Something this awesome shouldn't be restricted to I/O...


The author seems to be planning to expand it to be usable as a generic way of doing asynchronous syscalls.


Can anyone familiar with Infiniband's approach to exposing IO via rx/tx queues [0] comment on whether it seems similar to io_uring's ring buffers [1]? How do these contrast with each other?

[0] https://www.cisco.com/c/en/us/td/docs/server_nw_virtual/2-10...

[1] https://news.ycombinator.com/item?id=19846261


I have very limited experience with Infiniband, but it seems similar, if a bit more flexible (especially recently, with more syscalls supported).

Also similar to but more general than RIO Sockets of Win8+:

https://docs.microsoft.com/en-us/previous-versions/windows/i...


The site pushes really hard that you shouldn't use the low-level system calls in your code and that you should (always?) be using a library (liburing).

What exactly is liburing bringing to the table that I shouldn't be using the uring syscalls directly?


You absolutely can use system calls in your code. The kernel has an awesome header that makes this easy and allows you to eliminate all dependencies:

https://github.com/torvalds/linux/blob/master/tools/include/...

This system call avoidance dogma exists because libraries generally have more convenient interfaces and are therefore easier to use. They're not strictly necessary though.
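
For instance, a minimal sketch of calling io_uring_setup() with no library at all (assuming headers new enough to define __NR_io_uring_setup; there is no glibc wrapper for it):

    #include <linux/io_uring.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        struct io_uring_params p;
        memset(&p, 0, sizeof(p));

        /* Invoke the raw syscall directly via syscall(2). */
        int fd = (int) syscall(__NR_io_uring_setup, 8, &p);
        if (fd < 0) {
            perror("io_uring_setup");
            return 1;
        }
        printf("ring fd %d, %u SQ entries\n", fd, p.sq_entries);
        return 0;
    }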

It should be noted that using certain system calls may cause problems with the libraries you're using. For example, glibc needs to maintain complete control over the threading model in order to implement thread-local storage. If you issue a clone system call directly, the glibc threading model is broken, and even something as simple as errno is likely to break.

In my opinion, libraries shouldn't contain thread-local or global variables in the first place. Unfortunately, the C language is old and these problems will never be fixed. It's possible to create better libraries in freestanding C or even freestanding Rust but replacing what already exists is a lifetime of work.

> What exactly is liburing bringing to the table that I shouldn't be using the uring syscalls directly?

It's easier to use compared to the kernel interface. For example, it handles submission queue polling automatically without any extra code.


The raw io_uring interface, once you ignore the boilerplate initialization code, is actually a super-simple interface to use. liburing is itself only a very thin wrapper on top of io_uring. I feel that if you ever use io_uring directly, after a while you'll end up with a bunch of convenience functions. liburing looks to me today like a collection of those functions.

One place where liburing provides a slightly higher-level interface is the function io_uring_submit(). Among other things, it determines whether the io_uring_enter() system call needs to be made, depending, for example, on whether you are in polling mode. You can read more about it here:

https://unixism.net/loti/tutorial/sq_poll.html

Otherwise, at least at this time, liburing is a simple wrapper.


io_uring requires userspace to access it using a well-defined load/store memory ordering. Care must be taken to make sure the compiler does not reorder instructions but also to use the correct load/store instructions so hardware doesn't reorder loads and stores. This is easier to (accidentally) get correct on x86 as it has stronger ordering guarantees. In other words, if you are not careful your code might be correct on x86 but fail on Arm, etc. Needless to say the library handles all of this correctly.
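
The discipline in question, sketched on an analogous single-producer ring (this is not io_uring's actual mmap'd layout, just the same acquire/release pairing that liburing implements for you):

    #include <stdatomic.h>
    #include <stdio.h>

    #define RING_SIZE 8 /* power of two, like io_uring's rings */

    static unsigned slots[RING_SIZE];
    static _Atomic unsigned head, tail;

    static void publish(unsigned v)
    {
        unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
        slots[t & (RING_SIZE - 1)] = v;       /* fill the slot first... */
        atomic_store_explicit(&tail, t + 1,   /* ...then publish it */
                              memory_order_release);
    }

    static int consume(unsigned *v)
    {
        unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
        if (h == atomic_load_explicit(&tail, memory_order_acquire))
            return 0; /* empty; acquire pairs with the release above */
        *v = slots[h & (RING_SIZE - 1)];
        atomic_store_explicit(&head, h + 1, memory_order_release);
        return 1;
    }

    int main(void)
    {
        publish(42);
        unsigned v;
        if (consume(&v))
            printf("%u\n", v);
        return 0;
    }

On x86 the release store compiles down to a plain store, which is exactly why code that omits the ordering can look correct there and still break on Arm.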


io_uring still has its wrinkles.

We are scrambling right now to fix a problem due to a change in behavior that io_uring exposes to user-space in later kernels.

Turns out that in earlier kernels (Ubuntu 19.04, 5.3.0-51-generic #44-Ubuntu SMP) io_uring will not return short reads/writes: if you ask for e.g. 8k but there's only 4k in the buffer cache, the call doesn't signal as complete until all 8k has been transferred. In later kernels (not sure when the behavior changed, but the one shipped with Fedora 32 has the new behavior) io_uring returns partial (short) reads to user space: you ask for 8k but there's only 4k in the buffer cache, so the call signals complete with a return of only 4k read, not the 8k you asked for.

Userspace code now has to cope with this where it didn't before. You could argue (and kernel developers did :-) that this was always possible, so user code needs to be aware of this. But it didn't used to do that :-). Change for user space is bad, mkay :-).
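
For what it's worth, the coping code is a fairly mechanical loop. A hedged sketch (assumes an initialized ring and an open fd; error handling trimmed):

    #include <liburing.h>

    /* Drive one read to completion even if CQEs come back short. */
    static int read_fully(struct io_uring *ring, int fd, char *buf,
                          unsigned len, unsigned long long off)
    {
        unsigned done = 0;
        while (done < len) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            io_uring_prep_read(sqe, fd, buf + done, len - done, off + done);
            io_uring_submit(ring);

            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(ring, &cqe);
            int res = cqe->res;
            io_uring_cqe_seen(ring, cqe);

            if (res < 0)
                return res;   /* -errno from the kernel */
            if (res == 0)
                break;        /* EOF */
            done += res;      /* short read: resubmit the remainder */
        }
        return (int) done;
    }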


I know nothing about io_uring, but looking at the man page[1] of readv I see it returns the number of bytes read. For me as a developer, that's an unmistakable flag that partial reads are possible.

Was readv changed? The man page also states that partial reads are possible, but I guess that might have been added later?

Even if it always used to return all the requested bytes, it would hardly be the first case where the current behavior is mistaken for the specification. My fondest memory of that is all the OpenGL 1.x programs that broke when OpenGL 2.x was released.

[1]: http://man7.org/linux/man-pages/man2/readv.2.html


Also, note the preadv2 man page which has a flags field with one flag defined as:

-------------------------------

RWF_NOWAIT (since Linux 4.14) Do not wait for data which is not immediately available. If this flag is specified, the preadv2() system call will return instantly if it would have to read data from the backing storage or wait for a lock. If some data was successfully read, it will return the number of bytes read.

-------------------------------

This implies that "standard" pread/preadv/preadv2 without that flag (which is only available for preadv2) will block waiting for all bytes (or short return on EOF) and you need to set a flag to get the non-blocking behavior you're describing here. Otherwise the flag would be the inverse - RWF_WAIT, implying the standard behavior is the non-blocking one, not the blocking one.

The blocking behavior is what we were expecting (and previously got) out of io_uring, so it was an unpleasant surprise to see the behavior change visible to user-space in later kernels.
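
For reference, the flag in action (file name made up; needs Linux 4.14+, a glibc with preadv2(), and _GNU_SOURCE):

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY); /* hypothetical file */
        if (fd < 0)
            return 1;

        char buf[8192];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

        /* Return only what is already in the page cache; never block. */
        ssize_t n = preadv2(fd, &iov, 1, 0, RWF_NOWAIT);
        if (n < 0 && errno == EAGAIN)
            printf("nothing cached; a plain preadv() would have blocked\n");
        else
            printf("%zd bytes served without blocking\n", n);
        return 0;
    }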


> If this flag is specified, the preadv2() system call will return instantly if it would... wait for a lock.

Doesn't this sound a bit different from ordinary short reads?

Receiving EAGAIN usually happens under fairly specific conditions (signal interruption), but I'd imagine, that filesystem code has a great deal of locks.

For example, FUSE filesystems can support signal interruptions via EAGAIN, but they are not guaranteed to. You can end up in a situation where a FUSE filesystem hangs and you cannot interrupt the thread that reads from it. I suspect that RWF_NOWAIT is a "fix" for similar situations and not the opposite of the default behavior.


Well pread/pwrite have the same return values, and historically for disk reads they block or return a device error.

pread only returns a short value on EOF.


Well, the man page does say that "The readv() system call works just like read(2) except that multiple buffers are filled".

If we go to read(2) we find "It is not an error if [the return value] is smaller than the number of bytes requested; this may happen for example because fewer bytes are actually available right now [...], or because read() was interrupted by a signal."

As an outsider, I'd never rely on this returning the requested number of bytes. If I required N bytes, I'd use a read loop.
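
The loop in question, for the record:

    #include <unistd.h>

    /* Keep calling read() until n bytes arrive, EOF, or an error. */
    static ssize_t read_n(int fd, char *buf, size_t n)
    {
        size_t done = 0;
        while (done < n) {
            ssize_t r = read(fd, buf + done, n - done);
            if (r < 0)
                return -1; /* error; caller checks errno (EINTR etc.) */
            if (r == 0)
                break;     /* EOF */
            done += r;
        }
        return (ssize_t) done;
    }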

But I do agree that the RWF_NOWAIT flag mentioned in your other comment doesn't help, as it suggests the default is to block.


Well, or EINTR if your signal handlers are not SA_RESTART.


For EINTR it never returns a short read, as the only way to see EINTR is a return of -1 with errno==EINTR.

We handle signals fine.


Sure, I was imprecise. A signal can cause a read to return a short result.


It was really interesting how this was found.

A user started describing file corruption when copying to/from Windows with the io_uring VFS module loaded.

Tests using the Linux kernel cifsfs client and the Samba libsmbclient libraries/smbclient user-space transfer utility couldn't reproduce the problem, neither could running Windows against Samba on Ubuntu 19.04.

What turned out to be happening was a combination of things. Firstly, the kernel changed so that an SMB2_READ request against Samba with io_uring loaded was sometimes hitting a short read, where some of the file data was already in the buffer cache, so io_uring now returned a short read to smbd.

We returned this to the client, as in the SMB2 protocol it isn't an error to return a short read, the client is supposed to check read returns and then re-issue another read request for any missing bytes. The Linux kernel cifsfs client and Samba libsmbclient/smbclient did this correctly.

But it turned out that Windows 10 and MacOSX Catalina clients (maybe earlier versions too, I don't have access to those) have a horrible bug, where they're not checking read returns when doing pipelined reads.

When trying to read a 10GB file for example, they'll issue a series of 1MB reads at 1MB boundaries, up to their SMB2 credit limit, without waiting for replies. This is an excellent way to improve network file copy performance as you fill the read pipe without waiting for reply latency - indeed both Linux cifsfs and smbclient do exactly the same.

But if one of those reads returns a short value, Windows 10 and MacOSX Catalina DON'T GO BACK AND RE-READ THE MISSING BYTES FROM THE SHORT READ REPLY !!!! This is catastrophic, and will corrupt any file read from the server (the local client buffer cache fills the file contents, I'm assuming with zeros - I haven't checked, but the files are corrupt as checked by SHA256 hashing anyway).

That's how we discovered the behavior and ended up leading back to the io_uring behavior change. And that's why I hate it when kernel interfaces expose changes to user-space :-).


> in the SMB2 protocol it isn't an error to return a short read, the client is supposed to check read returns and then re-issue another read request for any missing bytes

This is interesting and somewhat surprising, since Windows IO is internally asynchronous and completion based, and AFAIK file system drivers are not allowed to return a short read except for EOF.

And actually, even on Linux file systems are not supposed to return short reads, right? Even on signal? Since user apps don't expect it? (And thus it's not surprising that io_uring's change broke user apps.)

So it wouldn't be surprising to learn that the Windows SMB server never returns short reads, and thus it's interesting that the protocol would allow it. Do you know what the purpose of this is?


Obviously the Windows SMB server never returns short reads, otherwise this bug would never have made it out of Redmond or Cupertino.

On Linux, pread also never returns short reads against disk files if the bytes are available, which is why no one noticed this client bug, as our default io backend is a pthread-pool that does pread/pwrite calls. It only happens when someone tries our (flagged as experimental, thank god) vfs_io_uring backend.

Yeah the protocol even has a field in the SMB2_READ request called MinimumBytes, for which the server should fail the read if less than these bytes are available on return. The Windows 10 clients set this to zero :-). The MacOSX Catalina client sets it to 1. So yes, the clients are supposed to be able to handle short reads.


Out of curiosity, I took a look at how the MinimumBytes (actually MinimumCount) field is used by the Windows SMB server. Interestingly, it fails with STATUS_END_OF_FILE if the actual bytes read is less than MinimumCount, which suggests to me that this is supposed to be a minimum on the (remaining) file length, not on the number of bytes that the server is able to return at the moment.

I can't find any history of MinimumCount being used in the RTM version of any Windows SMB client, so without deeper archeology the reason this field was introduced remains a mystery to me.

Regardless, I agree that the client should validate the returned byte count. But (only having thought about this briefly), I do not think a client should retry in this case--it seems to me if the client sees a short read, it can assume that the read was short because the read reached EOF (which may have changed since the file's length was queried).


Sorry to keep laboring the point :-) but the other reason I'm pretty sure this is a client bug is that the client doesn't truncate the returned file at the end of the short read, which you'd expect if it actually was treating short read as EOF.

If you copy a 100mb file and the server returns a short read somewhere in the middle of the read stream the file size on the client is still reported as 100mb, which means file corruption as the data in the client copy isn't the same as what was on the server.

That's how this ended up getting reported to us in the first place.


Yes, that's a good point. I agree that there appears to be a client bug here. From a quick glance, it appears that nothing is checking that the non-final blocks in a pipelined read are returned from the server in full.

I don't necessarily agree that retry is the right behavior though. Wouldn't that result in an extra round trip in the actual EOF case? Again, not having thought about this much, it seems a more efficient interpretation of the spec is that truncated reads indicate EOF. In that case, a truncated read in the middle of a pipelined operation either indicates that the file's EOF is moving concurrently with the operation (in which case stopping at the initial truncation would be valid) or that the lease has been violated.

Regardless, I work on SMB-related things only peripherally, so I do not represent the SMB team's point of view on this. Please do follow up with them.


It's only an extra round trip in the case of an unexpected EOF. File size is returned from SMB2_CREATE, and so given the default of a RHW lease, then (a) the lease can't be violated - if it is, then all bets are off, as the server let someone modify your leased file outside the terms of the lease. Or (b) you know the file size, so a short read is expected if you overlap the actual EOF, and you can plan for it.

A short read in the middle of what you expect to be a continuous stream of bytes should be treated as some sort of server IO exception (which it is), and so an extra round trip to fetch the missing bytes - returning 0 (meaning EOF, something got truncated) or an error such as EIO (meaning you got a hardware error) - isn't so onerous.

After all, this is a very exceptional case. Both Steve's Linux cifsfs client and libsmbclient have been coded up around these semantics (re-fetching missing bytes to detect unexpected EOF or server error), and I'd argue this is correct client behaviour.

As I said, given the number of clients out there that have this bug we're going to have to fix it server-side anyway, but I'm surprised that this expected behavior wasn't specified and tested as part of a regression suite. It certainly is getting added to smbtorture.


Whenever a client gets a short read it needs to issue a request at the missing offset if the caller wanted more bytes. Only if the server returns zero on that read can it assume EOF and concurrent truncation.

We're going to have to fix the Samba server to never return short reads when using io_uring because the clients with this bug are already out there. But if what you're saying is how Microsoft expects the protocol to operate then it needs to be documented in MS-SMB2 because I don't think it's specified this way at the moment.


No, the client can't assume that. Consider pipelining reads. You can asynchronously send 10 1MB reads. The server can return the data in any order. So the read sent at offset 0 could return last, after the server has already returned the 9MB starting at offset 1MB onwards in the file, and this first read then returns a short read of 800k instead of 1MB.

You can't then assume that the read at offset 0 returning short means the file is now truncated at 800k and the other 9MB is no longer of use.

Also remember you might have a complete RWH lease on the file, so you are guaranteed that there was no other writer truncating the file whilst the read is ongoing.


So it's a Windows and MacOS bug then? Ie, no shadow should fall on io_uring really?

That said, nice dig figuring this out. These type of bugs can be really frustrating to round up.


It is a Windows and MacOS bug, but only visible with the io_uring change in behavior.

Problem is we have to fix the server, as even when buggy, one billion clients can never be wrong.


Well, Linus did have an infamous rant about never breaking userspace; it's surprising this happened.


He's not especially consistent about it. Linus was totally prepared to break userspace re: getrandom() in recent history.


Is there any intention to optimize work done, rather than just the calling interface?

E.g., running an rsync of a 10M-file hierarchy usually requires 10M synchronous stat calls. Using io_uring would make them asynchronous, but they could potentially be done more efficiently (e.g. convert file names to inodes in blocks of 20k, and then stat those 20k inodes in a batch).

That would require e.g. the VFS layer to support batch operations. But io_uring would actually allow that without a user-space interface change.
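
Even without VFS-level batching, the submission side already amortizes to one syscall per batch. A hedged sketch using the STATX opcode (kernel 5.6+; paths and batch size are purely illustrative):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <sys/stat.h>

    #define BATCH 3

    int main(void)
    {
        const char *paths[BATCH] =
            { "/etc/hostname", "/etc/hosts", "/etc/fstab" };
        static struct statx stx[BATCH];
        struct io_uring ring;

        io_uring_queue_init(BATCH, &ring, 0);

        for (int i = 0; i < BATCH; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_statx(sqe, AT_FDCWD, paths[i], 0,
                                STATX_BASIC_STATS, &stx[i]);
            sqe->user_data = i;
        }
        io_uring_submit(&ring); /* one syscall for the whole batch */

        for (int i = 0; i < BATCH; i++) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            printf("%s: res=%d size=%llu\n", paths[cqe->user_data],
                   cqe->res,
                   (unsigned long long) stx[cqe->user_data].stx_size);
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return 0;
    }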


Maybe I just missed this, but can anyone tell me which kernel versions support io_uring? I ran the following test program on 4.19.0 and it is not supported:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/utsname.h>
    #include <liburing.h>
    #include <liburing/io_uring.h>


    static const char *op_strs[] = {
      "IORING_OP_NOP",
      "IORING_OP_READV",
      "IORING_OP_WRITEV",
      "IORING_OP_FSYNC",
      "IORING_OP_READ_FIXED",
      "IORING_OP_WRITE_FIXED",
      "IORING_OP_POLL_ADD",
      "IORING_OP_POLL_REMOVE",
      "IORING_OP_SYNC_FILE_RANGE",
      "IORING_OP_SENDMSG",
      "IORING_OP_RECVMSG",
      "IORING_OP_TIMEOUT",
      "IORING_OP_TIMEOUT_REMOVE",
      "IORING_OP_ACCEPT",
      "IORING_OP_ASYNC_CANCEL",
      "IORING_OP_LINK_TIMEOUT",
      "IORING_OP_CONNECT",
      "IORING_OP_FALLOCATE",
      "IORING_OP_OPENAT",
      "IORING_OP_CLOSE",
      "IORING_OP_FILES_UPDATE",
      "IORING_OP_STATX",
      "IORING_OP_READ",
      "IORING_OP_WRITE",
      "IORING_OP_FADVISE",
      "IORING_OP_MADVISE",
      "IORING_OP_SEND",
      "IORING_OP_RECV",
      "IORING_OP_OPENAT2",
      "IORING_OP_EPOLL_CTL",
      "IORING_OP_SPLICE",
      "IORING_OP_PROVIDE_BUFFERS",
      "IORING_OP_REMOVE_BUFFERS",
    };


    int main() {
      struct utsname u;
      uname(&u);

      struct io_uring_probe *probe = io_uring_get_probe();
      if (!probe) {
        printf("Kernel %s does not support io_uring.\n", u.release);
        return 0;
      }

      printf("List of kernel %s's supported io_uring operations:\n", u.release);

      for (int i = 0; i < IORING_OP_LAST; i++ ) {
        const char *answer = io_uring_opcode_supported(probe, i) ? "yes" : "no";
        printf("%s: %s\n", op_strs[i], answer);
      }

      free(probe);
      return 0;
    }


If you have a clone of the Linux kernel source tree, you just have to look at the history of the include/uapi/linux/io_uring.h file. From a quick look here: everything up to IORING_OP_POLL_REMOVE came with Linux 5.1; IORING_OP_SYNC_FILE_RANGE was added in Linux 5.2; IORING_OP_SENDMSG and IORING_OP_RECVMSG came with Linux 5.3; IORING_OP_TIMEOUT with Linux 5.4; everything up to IORING_OP_CONNECT is in Linux 5.5; everything up to IORING_OP_EPOLL_CTL is in Linux 5.6; and the last three are going to be in Linux 5.7.


This article concurs (https://lwn.net/Articles/810414/): io_uring was first added to the mainline Linux kernel in 5.1.


It is documented in the liburing man pages.

Furthermore, recent variants of io_uring have a probe-function that allows checking for capabilities.

Generally speaking though, you will need more recent kernels than 4.x


io_uring_get_probe() needs v5.6 at least.


Question: how does one detect socket push-back using io_uring? For example, with libc write/writev, a non-blocking socket would return fewer bytes than requested and allow code to poll for write readiness before writing more. This is quite useful for handling scenarios where there are impedance mismatches between processing speed and the ability to send data over a network, e.g. processing needs to observe push-back and handle it appropriately. Apologies: I posted this question to Twitter before I read the redirect here.
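
Not an answer, but one pattern I'd expect to work, as a hedged sketch: if a send CQE comes back short (or with -EAGAIN on a non-blocking socket), arm a one-shot POLL_ADD for writability and resume sending when it completes. The helper below is hypothetical and assumes the ring and socket are set up elsewhere:

    #include <liburing.h>
    #include <poll.h>

    /* Arm a one-shot writability poll after a short send; when its
       CQE arrives, resubmit the unsent remainder of the buffer. */
    static void await_writable(struct io_uring *ring, int sock)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_poll_add(sqe, sock, POLLOUT);
        io_uring_submit(ring);
    }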


By coincidence I asked a few questions on the mailing list about io_uring this morning: https://lore.kernel.org/io-uring/20200510080034.GI3888@redha...


Unfortunately I misread the title as “Lord of the Urine” and... was concerned.



