io_uring and seccomp

Submitted by
Style Pass
2024-10-09 15:00:06

Recent Linux kernels have the kqueue-like io_uring interface for asynchronous I/O. Instead of making read and write syscalls, you write batches of I/O requests to a circular buffer mapped into userland, called the submission queue, and make a single io_uring_enter syscall to submit them to the kernel. Each submission queue entry (SQE) carries an opcode for the specific I/O operation it performs, and that opcode is mapped to the same kernel code that normally services the corresponding syscall. You can read the results off a second buffer, the completion queue, without making additional syscalls. This can meaningfully improve I/O performance, especially with Spectre/Meltdown mitigations enabled, since those make every kernel entry more expensive.
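To make the submit-and-complete cycle concrete, here is a minimal sketch that pushes a single no-op request through a ring using raw syscalls. The function name `uring_nop_roundtrip` is hypothetical, and using IORING_OP_NOP (which has no side effects) plus the raw `syscall(2)` wrappers instead of liburing are choices made for this illustration. It assumes kernel headers that ship `linux/io_uring.h`, and error handling is kept to the bare minimum.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <linux/io_uring.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: submit one IORING_OP_NOP tagged with `tag` and wait
 * for its completion. Returns 0 and fills *tag_out / *res_out on success,
 * or a negative errno value (e.g. when io_uring is unavailable or disabled). */
int uring_nop_roundtrip(uint64_t tag, uint64_t *tag_out, int *res_out)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof p);
    int fd = (int)syscall(__NR_io_uring_setup, 4, &p);
    if (fd < 0)
        return -errno;

    /* Map the SQ ring, the CQ ring and the SQE array shared with the kernel. */
    size_t sq_len = p.sq_off.array + p.sq_entries * sizeof(unsigned);
    size_t cq_len = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
    size_t sqes_len = p.sq_entries * sizeof(struct io_uring_sqe);
    int prot = PROT_READ | PROT_WRITE, flags = MAP_SHARED | MAP_POPULATE;
    char *sq = mmap(NULL, sq_len, prot, flags, fd, IORING_OFF_SQ_RING);
    char *cq = mmap(NULL, cq_len, prot, flags, fd, IORING_OFF_CQ_RING);
    struct io_uring_sqe *sqes = mmap(NULL, sqes_len, prot, flags, fd, IORING_OFF_SQES);
    if (sq == MAP_FAILED || cq == MAP_FAILED || (void *)sqes == MAP_FAILED) {
        int err = errno;   /* a fuller version would also unmap what succeeded */
        close(fd);
        return -err;
    }

    unsigned *sq_tail  = (unsigned *)(sq + p.sq_off.tail);
    unsigned *sq_mask  = (unsigned *)(sq + p.sq_off.ring_mask);
    unsigned *sq_array = (unsigned *)(sq + p.sq_off.array);
    unsigned *cq_head  = (unsigned *)(cq + p.cq_off.head);
    unsigned *cq_tail  = (unsigned *)(cq + p.cq_off.tail);
    unsigned *cq_mask  = (unsigned *)(cq + p.cq_off.ring_mask);
    struct io_uring_cqe *cqes = (struct io_uring_cqe *)(cq + p.cq_off.cqes);

    /* Fill the next free SQE; the opcode selects the operation to perform. */
    unsigned tail = *sq_tail;
    unsigned idx = tail & *sq_mask;
    memset(&sqes[idx], 0, sizeof sqes[idx]);
    sqes[idx].opcode = IORING_OP_NOP;
    sqes[idx].user_data = tag;
    sq_array[idx] = idx;
    /* Publish the entry to the kernel by advancing the tail. */
    __atomic_store_n(sq_tail, tail + 1, __ATOMIC_RELEASE);

    /* The only syscall in the I/O path: submit 1 entry, wait for 1 completion. */
    int rc = 0;
    if (syscall(__NR_io_uring_enter, fd, 1, 1, IORING_ENTER_GETEVENTS, NULL, 0) < 0) {
        rc = -errno;
    } else {
        unsigned head = *cq_head;
        if (head == __atomic_load_n(cq_tail, __ATOMIC_ACQUIRE)) {
            rc = -EAGAIN;  /* nothing completed, which should not happen here */
        } else {
            /* Read the result straight out of shared memory, no syscall. */
            struct io_uring_cqe *cqe = &cqes[head & *cq_mask];
            *tag_out = cqe->user_data;
            *res_out = cqe->res;
            __atomic_store_n(cq_head, head + 1, __ATOMIC_RELEASE);
        }
    }

    munmap(sqes, sqes_len);
    munmap(cq, cq_len);
    munmap(sq, sq_len);
    close(fd);
    return rc;
}
```

On a kernel where io_uring is permitted, the completion should carry back the same `user_data` tag that was submitted. Swapping IORING_OP_NOP for another opcode changes which operation the kernel runs, but the syscalls the process makes stay the same.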

A side effect is that io_uring effectively bypasses the protections provided by seccomp filtering: we can't filter out syscalls we never make! A seccomp filter still sees io_uring_setup and io_uring_enter themselves, but not the operations encoded in the SQEs. This isn't a security vulnerability per se, but it is something to keep in mind if you have especially paranoid seccomp rules. In practice it is rare for anything I/O-related to be seccomp filtered, but I thought it was interesting enough to reproduce myself.

Suppose we want to prevent our application from making outbound network requests by blocking the connect(2) syscall. This is a contrived example, as you'd most likely implement this with network namespaces or iptables. But let's imagine the application needs to look up an upstream address and connect to it once, and we want to ensure it can never make any new connections after that.
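As a rough illustration of that setup, the sketch below installs a seccomp-BPF filter that makes every later connect(2) fail with EPERM; an application would call it right after its one allowed connection. The helper names `block_connect` and `try_connect_loopback` are hypothetical, as is the probe to 127.0.0.1 port 9. It assumes an architecture such as x86-64 or aarch64 where connect is its own syscall, and it omits the `seccomp_data.arch` check that a production filter should include.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: after this returns 0, connect(2) fails with EPERM
 * for the rest of the process lifetime. Returns -1 if the filter could not
 * be installed. */
int block_connect(void)
{
    struct sock_filter filter[] = {
        /* Load the syscall number from struct seccomp_data. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* connect(2): fall through to the EPERM return; otherwise skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_connect, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        /* Every other syscall is allowed. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof filter / sizeof filter[0]),
        .filter = filter,
    };

    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
        return -1;
    return 0;
}

/* Hypothetical probe: attempt a TCP connection to 127.0.0.1:9 and return 0
 * on success or the errno value reported by socket(2) or connect(2). */
int try_connect_loopback(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return errno;

    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(9) };
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    int err = 0;
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) != 0)
        err = errno;
    close(fd);
    return err;
}
```

Once `block_connect()` has succeeded, `try_connect_loopback()` should report EPERM whether or not anything is listening on the port, because the filter rejects the syscall before the network stack is reached. The filter only matches on the syscall number, which is all seccomp gets to see for this decision.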
