Every now and then, you'll encounter a problem with an application or a script that is not clear straight away. After the normal troubleshooting, it can be helpful to see the actual system calls that occur when that script executes.
Using a tool like strace (manpage) can help you in identifying what is causing the system to slow down or misbehave.
In this blogpost I’ll show you some examples where strace can be useful for you. Most of it will be with PHP code but they’re easy enough anyone can understand them.
Installing strace
Strace isn’t installed by default on most distributions. To install, do a simple yum install strace when on CentOS/Red Hat or apt-get install strace on Debian/Ubuntu systems.
Getting output from strace
You can use strace in two different ways. You can attach it to an already running process or you can use it to start a custom application or script and follow all system calls from the very beginning.
In short, here’s how it goes. If you want to start your application and troubleshoot it from beginning to end, you do this:
~# strace -f $command ~# strace -f php -q somefile.php
Or you want to attach to a running process, use this:
~# strace -f -p $pid ~# strace -f -p 8151
The -f parameter tells strace to follow any children or processes that are spawned/forked from the application.
Standard usage of strace
By default, strace will show you -all- system calls that your application or script is performing. That can get pretty overwhelming, but it’s a good place to start. Take for instance the following simple script called ‘test1.php’.
<?php
/* Simple buggy script */
for ($i = 0; $i < 5; $i++)
sleep(1);
?>
The command above will simply iterate over the loop and sleep (do absolutely nothing) for 1 second on each iteration. So, this would do nothing for exactly 5 seconds. Best script I’ve ever written! ;-)
If you were to strace the execution of the script, it would look like this.
~# strace php -q test1.php execve("/usr/bin/php", ["php", "-q", "test1.php"], [/* 23 vars */]) = 0 brk(0) = 0xa036000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 ... nanosleep({1, 0}, {1, 0}) = 0 ... nanosleep({1, 0}, {1, 0}) = 0 ... nanosleep({1, 0}, {1, 0}) = 0 ...
At first, it’s overwhelming. You don’t need to understand every system call (sure, it helps, but the names of most functions explain the actions anyhow), but the execution can show you a few remarkable things straight away.
When it’s running, you’ll see the application hang every second with the nanosleep({1, 0}, {1, 0}) = 0 message. The nanosleep()
is the system call for the sleep()
method in PHP. Without having seen the source code of the application, this tells you the application specifically requested to sleep() there and waited.
But since the output was a bit too much to work with, let’s look at filtering it.
Filtering the strace output to something more meaningfull
Have a look at the following simple script as a new example.
<?php
$fp = fsockopen('www.google.be', 80, $errno, $errstr, 30);
?>
It opens a socket to www.google.be over port 80, a very simple action.
The same kind of action would occur when connecting to remote MySQL databases, accessing a remote API via curl/fsockopen/file_get_contents/…
If you strace it, the network connectivity won’t immediately be obvious because of all the other sysetm calls. That’s why you can pass some extra arguments to strace to make that more clear.
~# strace -e trace=network php -q test2.php ... connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("193.239.211.254")}, 28) = 0 send(3, "<b\1\0\0\1\0\0\0\0\0\0\3www\6google\2be\0\0\1\0\1", 31, MSG_NOSIGNAL) = 31 recvfrom(3, "<b\201\200\0\1\0\2\0\4\0\4\3www\6google\2be\0\0\1\0\1\300"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("193.239.211.254")}, [16]) = 219 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3 connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("74.125.79.94")}, 16) = -1 EINPROGRESS (Operation now in progress) getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
This shows you a few specific things. The first connect, send and recvfrom are to translate the hostname “www.google.be” to the IP address “74.125.79.94”, so DNS resolving.
The last connect
is the actual connection attempt based on the IP address to the Google systems. It shows you the IP, the port and the state (Operation now in progress).
If I were to change my script to connect to port 85 instead of 80, the results would be quite different, since the connection to port 85 will not succeed.
~# strace -r -e trace=network php -q test2.php ... 0.000107 connect(3, {sa_family=AF_INET, sin_port=htons(85), sin_addr=inet_addr("74.125.79.94")}, 16) = -1 EINPROGRESS 20.999554 getsockopt(3, SOL_SOCKET, SO_ERROR, [110], [4]) = 0
I’ve changed the strace call a bit, by adding the -r
parameter.
That shows you timestamps of how long that specific call took relative to the previous call. By stracing, you will notice immediately that your application or script is waiting for the connect() call to finish.
That may not immediately be obvious from just executing the script via php, but it does show clearly when using strace. In this case, it took exactly 20s for that connection to timeout. Either that’s because your connection is blocked by a firewall, the service on the other end did not respond or took too long to respond.
Other means of filtering the output are:
- -e trace=file: show all system calls that take a file as argument. This can show you where a lot of disk I/O may occur. Every open, stat, chmod, unlink, … shows up there;
- -e trace=process: track forks of the script
- -e trace=open: show all calls that invoke the open() method.
You can substitute the -e trace=xxx with any kind of system call. That can be open, close, read, write, stat, chmod, unlink, …
Spotting query output in strace
When using strace to trace an Apache process for instance, you can often find output like this.
munmap(0xb7973000, 1314816) = 0 brk(0x1027b000) = 0x1027b000 poll([{fd=740, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) write(740, "\200\0\0\0\3SELECT `something` "..., 132) = 132 read(740, "\1\0\0\1!n\0\0\2\3def\fdbname\35something"..., 16384) = 3931 mmap2(NULL, 1314816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7973000 ... munmap(0xb77f0000, 1314816) = 0 poll([{fd=740, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) write(740, "\200\0\0\0\3SELECT `something` "..., 132) = 132 read(740, "\1\0\0\1!n\0\0\2\3def\fdbname\35something"..., 16384) = 3931 mmap2(NULL, 1314816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77f0000
If you look at (parts of) the strace above, you will notice the lines in bold. In this case, a script kept on looping and launching a query to the backend database. A loop like that only shows up as very high CPU usage of an Apache process, with strace you can gain the insights in to why that process is looping or causing a high load.
Again, you don’t need to understand every call that strace will output, but you will notice the SQL statements that are sent in plain text over the wire.
Why use strace?
I often fall back to strace to troubleshoot applications that are running for an abnormally long time (using strace -p PID, where PID is the process ID of that particular script or application). It’s very useful to be able to attach to a running process and see what it is doing. Even when you don’t fully understand the output, you do notice repetitive tasks very easily, which could indicate the program is in some kind of loopt.
In hosting environments, it’s not uncommon for me to strace certain Apache or FastCGI processes to detect why it’s causing an abnormally high load or why it’s running for an abnormally long time.
When you’re debugging applications that appear to stop or crash halfway with no immediate error message (because errors are being suppressed somewhere in the middle of application execution), stracing it could be a good way to see the final system calls before the application halts.
Other resources
If you want to read more, I suggest taking a look at the following articles.
- 5 simple ways to troubleshoot with strace
- Fun with strace and the GDB Debugger
- Why would brk() in strace output be taking several seconds?
If you have any other useful strace-tips, please share them in the comments!