Linux futex_wait bug

Mattias Geniar, Thursday, May 14, 2015

A deep technical read, but something you had better be aware of.

TL;DR: make sure you update your Linux kernels in the near future, or you'll experience some nasty deadlocks.

The impact of this kernel bug is very simple: user processes can deadlock and hang in seemingly impossible situations. A futex wait call (and anything using a futex wait) can stay blocked forever, even though it had been properly woken up by someone. Thread.park() in Java may stay parked. Etc.

If you are lucky you may also find soft lockup messages in your dmesg logs.
Source: "Linux futex_wait() bug..."
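If you want to check for those soft lockup messages, something along these lines should do. It is a minimal sketch: the helper name count_soft_lockups is made up for this example, and it assumes the kernel's usual "BUG: soft lockup" wording. Zero hits proves nothing, since the ring buffer rotates and the quoted post only says you may find these messages.

```shell
#!/bin/sh
# count_soft_lockups (hypothetical helper name): reads kernel log text on
# stdin and prints the number of lines that mention a soft lockup.
count_soft_lockups() {
    grep -ci 'soft lockup'
}

# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
# dmesg may need root on systems that restrict access to the ring buffer.
hits=$(dmesg 2>/dev/null | count_soft_lockups || true)
echo "soft lockup messages in the kernel ring buffer: ${hits:-0}"
```

The same helper can be fed an older log, for example count_soft_lockups < /var/log/messages, assuming your syslog setup writes kernel messages there.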

Anything running RHEL 6.x or CentOS 6.x is advised to upgrade to the latest kernel (2.6.32-504.16.2 or higher). The linked post mentions that the bug shows up mostly on systems with Intel's Haswell processors (Xeon E3 v3, Xeon E5 v3, etc.).
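To see whether a box is already on a fixed kernel, you can compare the running release against that version. This is a rough sketch: kernel_is_fixed is a name made up for this example, it assumes GNU sort with -V (version sort) is available, and the comparison only means something on RHEL/CentOS 6 kernels.

```shell
#!/bin/sh
# Minimum fixed kernel release mentioned above (RHEL/CentOS 6 only).
FIXED="2.6.32-504.16.2"

# kernel_is_fixed RELEASE (hypothetical helper name): succeeds when RELEASE
# sorts at or above $FIXED. Relies on GNU "sort -V" (assumed available).
kernel_is_fixed() {
    # Strip the ".el6.x86_64"-style suffix so only version numbers remain.
    release=$(printf '%s\n' "$1" | sed 's/\.el6.*$//')
    lowest=$(printf '%s\n%s\n' "$FIXED" "$release" | sort -V | head -n 1)
    [ "$lowest" = "$FIXED" ]
}

if kernel_is_fixed "$(uname -r)"; then
    echo "running kernel $(uname -r) is at or above $FIXED"
else
    echo "running kernel $(uname -r) is older than $FIXED: update and reboot"
fi
```

Keep in mind that uname -r reports the kernel that is currently running, not the newest one installed, so a reboot is needed after the update before this check passes.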

If you haven't been bitten by this bug, it's probably just a matter of time. Or perhaps you've had a service hang or crash, couldn't figure out the actual reason, and settled for a "meh, I'll just restart it and it'll be fine" just to be done with it.

The changelog for the 2.6.32-504.16.2 kernel on CentOS 6.6 mentions this futex fix.

$ yum install yum-changelog python-dateutil
$ yum changelog all kernel-2.6.32-504.16.2.el6 | grep futex
...
- [kernel] futex: Ensure get_futex_key_refs() always implies a barrier (Larry Woodman) [1192107 1167405]
...

It's a long shot, but this kernel bug may be the actual reason behind those unexplained hangs.



Hi! My name is Mattias Geniar. I'm a Support Manager at Nucleus Hosting in Belgium, a general web geek & public speaker. Currently working on DNS Spy & Oh Dear!. Follow me on Twitter as @mattiasgeniar.
