When “idle” isn't idle: how a Linux kernel optimization became a QUIC bug (cloudflare.com)
124 points by sbulaev 14 hours ago | 18 comments



I can see why they rewrote QUIC in Rust for use in userspace, though going the in-house route would warrant keeping an eye on the relevant kernel commits like a hawk to avoid missing bug fixes like these. These in-house implementations tend to have fewer eyeballs on them than the kernel.

I found it interesting that Cloudflare is not yet using BBR as the default in quiche. CUBIC's recovery in this day and age, and especially in datacenters with large pipes, seems so slooow to me. Almost two seconds with no loss whatsoever until it reaches BDP again, and then it shoots itself in the foot every time it hits the ceiling. Each one of those losses is a retransmission.


> though going the in-house approach would warrant keeping an eye on the relevant kernel commits like a hawk to avoid missing bug fixes like these. These in-house implementations tend to have less eyeballs than the kernel.

This is somewhat funny to read because this specific issue in CUBIC (sudden CWND jump upon exiting quiescence) was originally discovered in Google's QUIC library and then later reported to the team working on the TCP stack. I know this because I was the one who found that bug back in 2015.

That said, congestion control algorithms are really prone to logic bugs, and very subtle changes in the algorithm can often lead to dramatically different outcomes. Because of that, there's a lot of value in running congestion control code that has been tested on a wide variety of real Internet traffic.
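The quiescence bug described above is easy to see in a toy model. This is a hedged sketch, not the kernel's or quiche's actual code: it uses the standard CUBIC window function from RFC 8312 (C = 0.4, beta = 0.7), and the idle duration and w_max values are made up for illustration. The point is that CUBIC grows cwnd as a cubic function of wall-clock time since the last loss, so if application-idle time is not excluded from that elapsed time, the first ACK after the idle period computes a wildly inflated target window:

```python
# Sketch of CUBIC's time-based window growth (RFC 8312):
#   W(t) = C*(t - K)^3 + W_max,  K = cbrt(W_max*(1 - beta)/C)
# where t is the time since the last congestion event ("epoch start").

C = 0.4      # CUBIC scaling constant (RFC 8312)
BETA = 0.7   # multiplicative decrease factor (RFC 8312)

def cubic_cwnd(t, w_max):
    """Target cwnd (in packets) t seconds after a loss event."""
    k = ((w_max * (1 - BETA)) / C) ** (1 / 3)
    return C * (t - k) ** 3 + w_max

w_max = 100.0  # cwnd at the last loss, in packets (illustrative)

# Smooth growth while the sender stays busy:
busy = cubic_cwnd(4.0, w_max)          # near w_max again

# Buggy accounting: 60 s of application-idle time counted into t
# means the next ACK sees a huge elapsed time and the window jumps.
buggy = cubic_cwnd(4.0 + 60.0, w_max)  # orders of magnitude larger

# The fix is conceptually to shift the epoch start forward by the
# idle time, so t excludes quiescence:
idle = 60.0
fixed = cubic_cwnd((4.0 + idle) - idle, w_max)  # == busy
```

Whether the accounting shifts the epoch or clamps elapsed time is an implementation detail; the sketch only shows why forgetting either one produces a sudden jump rather than a gradual probe.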


> I can see why they rewrote QUIC in Rust and for use in userspace

As far as I know, while they might have either way, they did not ("rewrite QUIC [...] for use in userspace"): the Linux kernel implementation only landed late 2025. Quiche was started circa 2018 (that's when Cloudflare started beta-deploying QUIC; the first public alpha of quiche was January 2019).

I don't know that there even was an in-kernel implementation of QUIC before msquic.sys, which I believe first shipped in Server 2022 circa mid 2021 (and is used as the implementation backend by MsQuic on Server 2022 and W11).


The article uses the term "CCAs" without ever defining it. I followed the links, and googled it, with no useful result.

What is a CCA in this context?


a Congestion Control Algorithm -- which uses various signals (mostly dropped packets) to try to estimate the available bandwidth and avoid network congestion.

Thanks! And to @einsteinx2 and @rp8yxmdmr too.

There are so many overlapping TLAs that we should have moved to four letters a long time ago.

Twas ever thus.

There was the proposed eTLA namespace extension...

https://www.catb.org/jargon/html/T/TLA.html


After some searching apparently it means “congestion control algorithm”. Definitely should have been defined in the article, especially since they have a whole section dedicated to explaining what it is.

Congestion Control Algorithms

Looking at the last plot, it seems like the backoff is roughly 1/5 of the total bandwidth and it happens every 50 ms or so. Wouldn't it make sense to reduce the backoff and the growth speed if a backoff occurs repeatedly in rapid succession? We want to maximize the area under the curve (transmitted packets), right?
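The "area under the curve" tradeoff the comment describes can be sketched with CUBIC's own recovery-time formula. This is a hedged back-of-the-envelope model, not what quiche or the kernel actually implements: using the RFC 8312 constant C = 0.4, the time K to climb back to w_max after backing off to beta*w_max follows directly from the cubic, and a gentler backoff (beta closer to 1) shortens each sawtooth:

```python
# K = cbrt(W_max*(1 - beta)/C): seconds for CUBIC to return to W_max
# after a loss reduces the window to beta*W_max (RFC 8312).

C = 0.4  # CUBIC scaling constant

def recovery_time(w_max, beta):
    """Seconds until the cubic curve crosses w_max again after a backoff."""
    return ((w_max * (1 - beta)) / C) ** (1 / 3)

# Illustrative w_max of 100 packets; smaller backoffs (higher beta)
# mean shorter sawtooth periods and more area under the curve,
# at the cost of backing off less when there is real congestion.
for beta in (0.5, 0.7, 0.8):
    print(beta, recovery_time(100.0, beta))
```

Of course the flip side is the reason a fixed beta is chosen at all: back off too little and repeated losses mean the sender never actually drains the queue it is overflowing, which is the kind of subtle dynamic the thread's earlier comment about congestion control logic bugs warns about.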

Is it just me, or do the article structure and subtitles feel very AI?

The first half wasn't too bad, but the AI tells get strong in the second half.

The tell I always spot is its propensity to bold random words, frankly.

Yes, and it becomes unbearable after a while.

I don’t get it. Unlike a lot of the technical article slop that is posted here, this obviously had a lot of human thought and effort put into the prompt.

The LLM pass (unsurprisingly) made it worse.

For example:

The results were conclusive: 100% pass rate, showing Reno recovered cleanly after the loss phase, and revealing that this is a CUBIC-related bug.

Look, I’m reading a description of a Linux kernel network congestion bug. I don’t need the hand-holding.


The more precise title should be: How we copied code from the Linux kernel without fully understanding it and missed its follow-up fixes, and now it bites us

Also, not a single takeaway about how to prevent that very preventable issue in the first place, as you allude to.

I wonder what happened to the very hardcore engineering that used to happen at Cloudflare and get published? Almost every blog post today seems to expose some weirdness at Cloudflare rather than highlighting excellence in engineering. What changed? It's been slowly shifting over the years; did they change their hiring practices or something?



