Linux debugging tools I use daily

Every server I run, including the fleet behind Oh Dear’s uptime checks, has the same set of debugging tools installed before anything goes wrong. Not because the application needs them, but because the one time you reach for strace is at 2am, the box is misbehaving, and apt install is the last thing you want to be doing while you work out what broke.

So I keep a small toolbox baked into the base image. These are the ones I reach for, grouped by the kind of question they answer, each with a real example.

Everything below is from a stock Ubuntu 24.04 box with the tools installed. Nothing is faked; where I’ve trimmed output, I say so.

When you suspect the network

Is DNS even resolving? dig +short cuts the noise. No headers, no authority section, just the answer:

$ dig +short ma.ttias.be A
104.26.8.203
172.67.71.51
104.26.9.203

host gives you the same thing in a sentence, which is sometimes all you want:

$ host ma.ttias.be
ma.ttias.be has address 104.26.8.203

When a server is reachable from one place but not another, mtr is traceroute and ping in one, and it keeps sampling so you can see where the loss starts. Report mode gives you something you can paste into a ticket:

$ mtr -rwc 2 1.1.1.1
HOST: ...             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 172.17.0.1      0.0%     2    0.1   0.7   0.1   1.2   0.8
  2.|-- one.one.one.one 0.0%     2   22.6  25.9  22.6  29.2   4.7

If mtr greets you with “unable to get raw sockets”, that’s the permissions thing I ran into on macOS years ago; it needs root or the CAP_NET_RAW capability. tracepath is the unprivileged cousin for when you can’t get either.

Before blaming the application, check the socket. nc answers “can I even reach this port” in one line:

$ nc -zv ma.ttias.be 443
Connection to ma.ttias.be (104.26.8.203) 443 port [tcp/*] succeeded!
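The same check loops nicely when you want to rule out several ports in one go. A sketch, assuming an nc that supports -z (connect only, send nothing) and -w (timeout in seconds), as the OpenBSD netcat does; check_ports is a made-up helper name.

```shell
# Hypothetical helper: report which of the given TCP ports accept a connection.
check_ports() {
  host=$1
  shift
  for port in "$@"; do
    # -z: just test the connection; -w 3: give up after three seconds.
    if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
      printf '%s:%s open\n' "$host" "$port"
    else
      printf '%s:%s unreachable\n' "$host" "$port"
    fi
  done
}

# Usage: check_ports ma.ttias.be 80 443
```

"Unreachable" here covers refused, filtered, and timed out alike; nc -zv on the single port tells you which.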

whois lives in this group too, for the unglamorous “who owns this domain and when does it expire” question that turns out to be the actual answer more often than I’d like.

And then there’s the one I trust when the logs and the client disagree: tcpdump. The packets don’t lie.

$ tcpdump -nni any -c 4 icmp
... IP 172.17.0.2 > 1.1.1.1: ICMP echo request, id 1, seq 2, length 64
... IP 1.1.1.1 > 172.17.0.2: ICMP echo reply, id 1, seq 2, length 64

tcpdump shows you the headers. When you care about the payload, ngrep is grep for the wire: match a string across live traffic and print the packets that contain it.

$ ngrep -d any -W byline "GET" tcp dst port 80
T 172.17.0.2:43090 -> 1.1.1.1:80 [AP]
GET / HTTP/1.1.
Host: 1.1.1.1.
User-Agent: checker/1.0.

For HTTP specifically I’ve also leaned on httpry. And for the live “what’s eating my bandwidth right now” view there’s iftop (per connection) and nethogs (per process). Both are interactive and both want a terminal, so there’s nothing to paste, but they’re the fastest way to find the one process saturating a link.

Processes and syscalls

htop is what I open first on any box. It’s top with colour, scrolling, a tree view on F5, and the ability to kill a process without leaving the screen. Nothing to paste, you just look at it, but it’s the dashboard I live in.

When I want the parent-and-child story instead, pstree shows who spawned what, which is how you find a runaway worker’s real parent before you kill the wrong thing.

lsof answers a surprising range of questions. “What’s listening on this port” is one line:

$ lsof -i -P -n
COMMAND  PID USER   FD   TYPE DEVICE NODE NAME
nc      1035 root    3u  IPv4 297857  TCP *:8080 (LISTEN)

It also answers “why can’t I unmount this disk” (something still has a file open on it) and “this deleted log file is still eating space” (a process is holding the fd open). It’s the same flavour of poking at a live process as reading its environment variables, which I’ve written about before.
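For the deleted-but-still-open case, lsof +L1 lists open files whose link count has dropped below one. What it reports can be sketched straight from /proc, which also serves as a fallback on a box where lsof is missing. deleted_open_files is a hypothetical helper name, and it only sees other users’ processes when run as root.

```shell
# Hypothetical helper: walk /proc and print every file descriptor whose
# target has been unlinked. The kernel marks those with a " (deleted)" suffix.
deleted_open_files() {
  for fd in /proc/[0-9]*/fd/*; do
    # readlink fails for processes we may not inspect or that just exited; skip them.
    target=$(readlink "$fd" 2>/dev/null) || continue
    case $target in
      *' (deleted)') printf '%s -> %s\n' "$fd" "$target" ;;
    esac
  done
}

# Usage: deleted_open_files | grep /var/log
```

The PID is in the path, so the next step is usually restarting or signalling that process so it lets go of the file.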

When something hangs and the logs are silent, strace shows you every system call a process makes. strace -c runs a command and summarises where the time and the errors went:

$ strace -c ls /
% time     seconds  usecs/call     calls    errors syscall
 23.52    0.000302         151         2           getdents64
  4.05    0.000052          26         2         2 statfs
  2.26    0.000029          14         2         2 ioctl

That errors column is the gold. A process failing to openat a config file it can’t find shows up there immediately, even when the application swallows the error. strace is probably the tool I’ve written about most, including the time it refused to attach because of ptrace_scope and the related PHP-FPM ptrace peekdata error. If you’ve never used it, those two are a fine place to start.
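Pulling just the failures out of a summary is a one-line filter, because rows with an entry in the errors column have six fields instead of five. A sketch, assuming the column layout in the strace -c output above; failing_syscalls is a hypothetical helper name.

```shell
# Hypothetical helper: read a `strace -c` summary on stdin and print
# "syscall errors calls" for every syscall that failed at least once.
failing_syscalls() {
  # Data rows start with a percentage; the header and the total line are skipped.
  awk 'NF == 6 && $1 ~ /^[0-9.]+$/ && $6 != "total" { print $6, $5, $4 }'
}

# strace writes its summary to stderr, so swap the streams before piping:
# Usage: strace -c ls / 2>&1 >/dev/null | failing_syscalls
```

Once a syscall stands out, re-running with strace -e trace= and that syscall’s name shows the individual failing calls and their arguments.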

For the plain “which process is hogging memory” question I still reach for ps, sorted by memory use, the way I’ve written up before. The psmisc package that gives you pstree also ships fuser and killall, for “what’s using this file” and “kill every instance of X”.

Disk and I/O

Disk is the bottleneck people check last and should check first.

iostat -xz 1 is my first look. The %util column and the two await columns (r_await and w_await) tell you whether the disk is the problem before you go blaming the database (columns trimmed to the ones I read):

$ iostat -xz 1 1
Device   r/s   w/s   wkB/s  r_await  w_await  %util
vda     0.06  0.16   12.58     0.93    10.40    0.03
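If you want that judgement scripted, for a cron check or a quick loop over a fleet, the %util column can be picked out by name rather than by position, since the column set differs between sysstat versions. A sketch under that assumption; busy_disks and the 80% default are hypothetical choices, not a recommendation.

```shell
# Hypothetical helper: read `iostat -x` output on stdin and print every
# device whose %util is at or above the threshold (default 80).
busy_disks() {
  threshold=${1:-80}
  awk -v limit="$threshold" '
    # Find the %util column from the header line instead of hard-coding it.
    $1 ~ /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
    col && NF >= col && $col + 0 >= limit { print $1, $col }
  '
}

# Usage: iostat -xz 1 3 | busy_disks 80
```

Keep in mind that the first report iostat prints is the average since boot, so a disk that only just got busy shows up in the later samples.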

pidstat narrows it to a single process over an interval. By default it reports CPU; add -d for per-process disk I/O. Here it is catching a busy one red-handed:

$ pidstat -p 1123 1 1
  UID    PID   %usr %system   %CPU  CPU  Command
    0   1123  33.66   65.35  99.01    1  yes

iotop is top for disk I/O: which process is reading or writing right now. It needs kernel task-delay accounting and root, so it isn’t always available (a container will tell you task_delayacct is 0), but on a real server it’s the quickest way to find the process thrashing a disk. iostat, pidstat, and sar all come from sysstat, which I lean on hard; I went deep on measuring Linux performance without fooling yourself a while back.

For “what’s eating all the disk space”, ncdu is du with a navigable interface. Point it at a directory, let it scan, then walk the tree to find the 40GB of logs nobody rotated. Much nicer than du -sh * and squinting at the output.
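When you need something pasteable for a ticket rather than an interactive view, a sorted du gets most of the way there. A rough sketch, assuming a du that accepts -d for depth (GNU and BSD both do); biggest_dirs is a hypothetical helper name.

```shell
# Hypothetical helper: largest entries one level below a directory, biggest first.
# Sizes are in KiB; the first line is the total for the directory itself.
biggest_dirs() {
  # -x stays on one filesystem, -k forces KiB so a plain numeric sort works.
  du -x -k -d 1 "${1:-.}" 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Usage: biggest_dirs /var 5
```

Run it again on whichever directory tops the list to walk down the tree by hand.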

The two that just make life easier

jq for anything JSON, which is most things now. It slices an API response without a throwaway script. Pulling just the failing checks out of a response:

$ curl -s ... | jq '.checks[] | select(.ok == false) | .name'
"http"

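The same filter is easy to try without an API to hand. The payload below is a hypothetical stand-in for the elided response, with field names taken from the filter above; failing_checks is a made-up wrapper. It adds -r, which prints raw strings rather than JSON-quoted ones, usually what you want when feeding a shell loop.

```shell
# Hypothetical payload standing in for the elided API response.
response='{"checks":[{"name":"http","ok":false},{"name":"dns","ok":true},{"name":"tls","ok":false}]}'

# Same filter as above; -r drops the JSON quoting around each name.
failing_checks() {
  jq -r '.checks[] | select(.ok == false) | .name'
}

# Prints the names of the two failing checks, http and tls, one per line.
printf '%s\n' "$response" | failing_checks
```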
And tree, the least serious tool on the list, for seeing the shape of a directory at a glance:

$ tree /tmp/d
/tmp/d
`-- a
    |-- b
    |   `-- y.log
    `-- x.conf

Why a toolbox

None of these are exotic, and that’s the point. The value isn’t any single tool, it’s having all of them already there so the 2am version of you goes straight from “something’s wrong” to “here’s what’s wrong” without a detour through the package manager. The list is two dozen packages. The payoff is every incident after.