The backup SSH daemon I run before every do-release-upgrade

I spent a chunk of the last few weeks upgrading a fleet of Ubuntu servers in place, one LTS to the next, with do-release-upgrade. Dozens of boxes, mostly stateless, mostly boring once you’ve done the first few.

The thing that kept trying to lock me out wasn’t the new OS, the kernel, or some package that wouldn’t configure. It was SSH itself, dying in the middle of the upgrade, on the exact connection I was using to run it.

If you only ever upgrade one box every couple of years, you might never notice this, or you’ll blame the network and reconnect. Do it thirty times in a row and the pattern is impossible to miss. So here’s what happens, why the built-in safety net doesn’t fire when you’ve done everything else right, and the backup daemon I now start before touching anything.

SSH dies mid-upgrade

Partway through every hop, new SSH connections to the box start failing. From my laptop:

$ ssh user@server -p 22
kex_exchange_identification: read: Connection reset by peer
Connection reset by 167.x.x.x port 22

The session I was already in stays alive. systemctl status ssh on the box says active (running). sshd -t says the config is fine. But every new connection gets reset, for minutes at a time.

This isn’t a bug; it’s openssh-server upgrading itself out from under you. sshd re-executes itself per connection: the listening parent execs the on-disk sshd binary for every new connection and hands its state over to the new process. During the upgrade, the new binary lands on disk while the parent process in memory is still the old one. The two disagree about the format of the state they pass to each other, and the handshake collapses. The machinery is right there in the binary:

$ strings /usr/sbin/sshd | grep rexec
rexec of %s failed: %s
send_rexec_state
incomplete message
rexec version mismatch

On the server side it shows up in the auth log as a recv_rexec_state: buffer error: incomplete message. Nothing is broken in a way you need to fix. The box is fine. You just can’t get a new shell on it until the upgrade finishes replacing openssh and the parent gets restarted (which a reboot does cleanly).
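
To tell this failure apart from a plain network problem, it helps to look at what a brand-new TCP connection actually gets back. The kex_exchange_identification error on the client means the connection died before the server's identification string arrived. The following is a minimal sketch of that check, not part of any SSH tooling: probe_ssh_banner and its return labels are hypothetical names, and it only reads the first bytes the server sends without attempting a login.

```python
import socket


def probe_ssh_banner(host, port=22, timeout=5.0):
    """Classify what a brand-new TCP connection to an SSH port gets back.

    Returns one of: "banner", "reset", "closed", "timeout", "refused",
    "error", "unexpected". The labels are illustrative, not a standard.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # A healthy sshd sends its identification string ("SSH-2.0-...")
            # immediately after the TCP handshake.
            data = sock.recv(256)
    except ConnectionRefusedError:
        return "refused"      # nothing is listening on that port
    except ConnectionResetError:
        return "reset"        # listener accepted, then the connection was reset
    except socket.timeout:
        return "timeout"      # no answer within the timeout
    except OSError:
        return "error"        # any other socket-level failure
    if not data:
        return "closed"       # peer closed cleanly without sending a banner
    if data.startswith(b"SSH-"):
        return "banner"
    return "unexpected"


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 3:
        print(probe_ssh_banner(sys.argv[1], int(sys.argv[2])))
```

During the window described above, a result of "reset" or "closed" from a box that still answers on its other ports points at the re-exec mismatch rather than at the network.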

The catch is that “you can’t get a new shell” is exactly the situation you do not want to be in halfway through an OS upgrade, when something else might need your attention.

tmux disables the safety net

Ubuntu’s own upgrader knows about this. When it detects you’re running over SSH, it starts a second sshd on port 1022 specifically so you have a spare door if the main one jams. Here’s the actual code in ubuntu-release-upgrader-core (1:22.04.20 on jammy):

def _sshMagic(self):
    """ this will check for server mode and if we run over ssh.
        if this is the case, we will ask and spawn a additional
        daemon (to be sure we have a spare one around in case
        of trouble)
    """
    pidfile = os.path.join("/var/run/release-upgrader-sshd.pid")
    if (not os.path.exists(pidfile) and
        os.path.isdir("/proc") and
        is_child_of_process_name("sshd")):
        ...
        port = 1022
        res = subprocess.call(["/usr/sbin/sshd",
                               "-o", "PidFile=%s" % pidfile,
                               "-p", str(port)])

Read the if. It only starts the spare sshd when is_child_of_process_name("sshd") is true, meaning the upgrader’s process has sshd somewhere up its parent chain. That function literally walks /proc/<pid>/stat from itself up to PID 1, looking for a process called sshd:

def is_child_of_process_name(processname, pid=None):
    if not pid:
        pid = os.getpid()
    while pid > 0:
        with open("/proc/%s/stat" % pid) as stat_f:
            stat = stat_f.read()
        command = stat.partition("(")[2].rpartition(")")[0]
        if command == processname:
            return True
        pid = int(stat.rpartition(")")[2].split()[1])
    return False

What disables that fallback is your next step, and it’s the right step to take. A release upgrade can take 20 minutes or more, and you should never run something that long on a raw SSH connection, because if your laptop’s wifi hiccups, the upgrade dies with it. So you run it inside tmux (or screen, which I’ve been preaching since 2008).

But tmux daemonizes. When you start a session, the tmux server forks off and gets reparented to init; it is not a child of your shell. So anything running inside tmux has a parent chain that goes to PID 1, never back through sshd. You can watch it happen:

$ tmux new-session -d -s upg 'sleep 120'
$ # walk the parent chain of that sleep:
  pid 4090 (sleep)       -> ppid 4089
  pid 4089 (tmux: server) -> ppid 1

The sleep inside tmux is a child of the tmux server, which is a child of init. sshd appears nowhere. So is_child_of_process_name("sshd") returns False, _sshMagic does nothing, and the upgrader’s spare sshd never starts.
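
If you want to see the chain for your own shell, the same /proc walk is easy to reproduce. This is a small sketch that reuses the parsing approach from the upgrader code quoted above; parse_stat and parent_chain are names made up here, and it is Linux-only because it reads /proc.

```python
import os


def parse_stat(stat_text):
    """Return (command, ppid) from the contents of /proc/<pid>/stat.

    The command sits between the first "(" and the last ")" and may itself
    contain spaces or parentheses, so split around those delimiters rather
    than on whitespace.
    """
    command = stat_text.partition("(")[2].rpartition(")")[0]
    fields_after = stat_text.rpartition(")")[2].split()
    # fields_after[0] is the process state, fields_after[1] is the parent PID.
    return command, int(fields_after[1])


def parent_chain(pid=None):
    """List (pid, command) pairs from pid upward until the parent PID is 0."""
    pid = pid or os.getpid()
    chain = []
    while pid > 0:
        with open("/proc/%d/stat" % pid) as stat_f:
            command, ppid = parse_stat(stat_f.read())
        chain.append((pid, command))
        pid = ppid
    return chain


if __name__ == "__main__" and os.path.isdir("/proc"):
    chain = parent_chain()
    for pid, command in chain:
        print("pid %d (%s)" % (pid, command))
    # This is the condition the upgrader's check boils down to.
    print("sshd in chain:", any(cmd == "sshd" for _, cmd in chain))
```

Run it once from a plain SSH shell and once from inside a tmux pane. Under the behaviour described above, the first should report sshd in the chain and the second should not.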

The two correct decisions cancel each other out. You run the upgrade in tmux so a dropped connection can’t kill it, and that single act silently switches off the one feature that exists to save you when the connection drops. I don’t think most people running do-release-upgrade -f DistUpgradeViewNonInteractive in a tmux pane realize the port-1022 fallback they half-remember reading about isn’t actually running.

Start your own sshd

So I stopped relying on Ubuntu’s version and start my own, before the upgrade, every time. It’s three commands.

These need root, so they’re prefixed with sudo. Log in as a normal user and sudo up for the privileged bits; don’t SSH in as root directly. Disabling direct root login (PermitRootLogin no) is one of the first things you should do on any box, so ssh root@server shouldn’t even work, and that’s a good thing.

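Open the port in the firewall. Match it to the same source restrictions your real SSH port has; don’t fling 1022 open to the world for the duration. As written, the rule below allows any source, so narrow it with ufw’s from clause if port 22 is restricted: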

sudo ufw allow 1022/tcp comment "OOB sshd for upgrade"

Start a standalone sshd on it, with its own PID file and log so it’s easy to find and kill later:

sudo /usr/sbin/sshd -p 1022 -o PidFile=/tmp/sshd-1022.pid -E /tmp/sshd-1022.log

It daemonizes and returns immediately. Check the log says what you want:

$ sudo cat /tmp/sshd-1022.log
Server listening on 0.0.0.0 port 1022.
Server listening on :: port 1022.
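
Because the daemon was started with its own PID file, you can also confirm that the process behind it is still alive later on, for example between upgrade phases from your original session. A minimal sketch, assuming the PID file path from the command above; pidfile_process_alive is a hypothetical helper, and it cannot rule out the PID having been reused by an unrelated process.

```python
import os


def pidfile_process_alive(pidfile):
    """Return True if the PID recorded in pidfile belongs to a live process."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # missing, unreadable, or malformed PID file
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check, sends nothing
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but belongs to another user (e.g. root)
    return True


if __name__ == "__main__":
    # Path taken from the sshd command above.
    print(pidfile_process_alive("/tmp/sshd-1022.pid"))
```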

Then, and this is the part people skip, actually connect to it from another machine before you start the upgrade:

$ ssh user@server -p 1022

A listening socket is not proof that a login works. The firewall rule might not match your source IP, the key exchange might fail, the port might be NAT’d somewhere you didn’t expect. You want to find that out now, with a perfectly healthy box, not in fifteen minutes when the main sshd has jammed and this is supposed to be your way back in. If ssh -p 1022 lands you at a shell, you have a real second door. Leave your original session open too; that’s your third.
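
When you are doing this across a fleet, the same end-to-end check can be scripted from the laptop side. The sketch below is one way to do it under a few assumptions: key-based authentication is set up (BatchMode makes ssh fail rather than prompt), the host key has already been accepted, and the function names are made up for illustration. It counts a login as working only if a remote no-op command actually ran.

```python
import subprocess


def build_login_check(user, host, port=1022, timeout=5):
    """Build an ssh command that succeeds only if a full login works."""
    return [
        "ssh",
        "-p", str(port),
        "-o", "BatchMode=yes",                 # never prompt; fail instead
        "-o", "ConnectTimeout=%d" % timeout,   # give up quickly if unreachable
        "%s@%s" % (user, host),
        "true",                                # remote no-op, then exit
    ]


def login_works(user, host, port=1022, timeout=5):
    """Return True if ssh authenticated and the remote command exited 0."""
    result = subprocess.run(
        build_login_check(user, host, port, timeout),
        capture_output=True,
    )
    return result.returncode == 0
```

A call such as login_works("user", "server") returning True is the scripted equivalent of landing at a shell on port 1022. A False result means something between the listening socket and a usable session is not working yet.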

This spare sshd is a separate process from the one being upgraded, so when openssh-server does its mid-upgrade re-exec dance on port 22, the daemon on 1022 keeps answering.

Even the backup goes dark

Most of the time the port-1022 daemon rides straight through. On a handful of boxes, though, I watched both doors go dark at the same moment, port 22 and 1022, for several minutes, right when the upgrade was swapping openssh and its libraries.

The instinct is to assume the box is gone. It isn’t. Every time, it kept answering pings and kept serving traffic on 80 and 443 the entire time. The network stack was completely fine. The only thing temporarily unavailable was the ability to get a new SSH session, and it came back on its own once dpkg finished with openssh.

Two rules came out of that:

Don’t systemctl restart ssh to “fix” it. If you restart sshd from your spare session while dpkg is mid-unpack, you can land on a half-written config and turn a temporary outage into a permanent one. Wait it out. The clean fix is the reboot at the end of the upgrade anyway, which brings up a fresh sshd parent running the new binary.

Don’t use SSH to decide whether the upgrade finished. If your “is it done yet” check is an SSH command, an unreachable sshd looks identical to a dead box, and you’ll convince yourself you’ve bricked it. I run the upgrade detached and write the exit code to a file, then I only care about that file:

tmux new-session -d -s upg \
  'sudo do-release-upgrade -f DistUpgradeViewNonInteractive; echo "RC=$?" > ~/upgrade.rc'

The real liveness signal during the blind window isn’t SSH at all, it’s ICMP and whatever the box normally serves on 80/443. SSH is the one service guaranteed to be unreliable precisely because it’s the one being upgraded.
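
Both signals are simple to check without involving SSH. Here is a rough sketch, assuming the box serves something on 80 or 443 and that the result file uses the RC=<n> format written by the tmux command above; the function names are hypothetical. The TCP check runs from your laptop, while the result file has to be read on the box itself, from a session that survived or after the reboot.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def box_alive(host, ports=(80, 443)):
    """Treat the box as alive if any of its normal service ports answers."""
    return any(tcp_port_open(host, port) for port in ports)


def parse_upgrade_rc(text):
    """Parse the 'RC=<n>' line from the result file; None if absent or invalid."""
    for line in text.splitlines():
        if line.startswith("RC="):
            try:
                return int(line[3:])
            except ValueError:
                return None
    return None
```

A None from parse_upgrade_rc means the upgrade has not written its result yet, which is a different state from a non-zero exit code and worth keeping separate in any script built on top of this.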

The rule

Connectivity is paramount, and SSH is the one service that’s structurally guaranteed to break during the exact operation where you most need it. So assume it will, and have a second door you’ve already walked through before you start.

Pre-launch your own sshd on another port, test it from your laptop, run the upgrade detached so a dropped session can’t kill it, and capture the exit code to a file instead of polling over SSH. None of it is much work, and all of it is a lot cheaper than fighting a cloud provider’s web console that never seems to work, just to get back into a box that was never actually down.

Validated on Ubuntu 22.04 jammy: openssh-server 1:8.9p1-3ubuntu0.15, ubuntu-release-upgrader-core 1:22.04.20, systemd 255. The _sshMagic and is_child_of_process_name code is from ubuntu-release-upgrader-core.