Using Webserver Access Logs As A Database Storage System



Mattias Geniar, May 02, 2015


I used this hack a few days ago when I launched the Drupal EngineHack Detection Website, and it’s serving its purpose just fine.

## My use case

The EngineHack site scans a website and tells the user if it has been hacked or not. So for that particular tool, I wanted to log the results of those scans. Most importantly, I wanted to log:

* The timestamp of each scan that was performed
* The URL of the site that was scanned
* Whether the Drupal site was compromised or not
* Optionally, the IP of the scanner, in case of abuse/DoS hammering

Traditionally, I would create either a table in an RDBMS like MySQL or a simple key/value system like MongoDB and store the results in there. But I didn't want to spend much time dealing with SQL injection, data validation, …

And storing things inside a MySQL table isn't always that practical: I had no GUI, so everything had to be done at the CLI. Creating tables, querying data, … it all sounded like a lot of work for a tool this simple.

There must be an easier system, right?

## Access Logs as a Storage Method

Everything you do in a browser gets logged on the webserver. Every timestamp, every URL and every GET parameter. I can use that to serve my purpose!
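As a reminder of just how much that covers: a stock combined log format (the Apache example below, nginx's default is equivalent) already records the client IP, the timestamp, the full request line including the query string, the status code and the response size.

```
# Apache's stock "combined" format: client IP, identity, user, timestamp,
# request line (method + path + query string), status, size, referer, user agent.
# The log path is just an example.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog /var/log/httpd/access.log combined
```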


Most of what I wanted to log was already present in the logs. I had the timestamp and the IP of each scan.

All that was left was the URL of the Drupal site that was scanned and the result: compromised, yes or no.

I solved that by including a hidden 1px by 1px image in the result page. The URL looked like this:

        <img src="/check_pixel.png?url=http://www.domain.tld&compromised=false"
        

width="1px” height="1px” />

Nobody notices this in the browser. It's the same kind of technique many trackers use in mailing tools to check the open rates of newsletters.
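The requests get logged whether the pixel file exists or not, but serving a real file keeps the logs free of 404 noise. If ImageMagick is installed, a 1x1 transparent PNG is a one-liner (the webroot path is just an example):

```
# create a 1x1 transparent PNG to serve as the tracking pixel
convert -size 1x1 xc:none /var/www/html/check_pixel.png
```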
            
All I had to do now was check my access logs for `GET` requests to that particular .png file and I had everything: the **IP, the timestamp, which site got scanned and what the result was**.
            
```
...
10.0.1.1 [28/Apr/2015:11:19:54] "GET /check_pixel.png?url=http://some.domain.tld&compromised=false HTTP/1.1" 200 901
10.0.1.1 [28/Apr/2015:11:20:05] "GET /check_pixel.png?url=http://www.domain.tld&compromised=true HTTP/1.1" 200 901
...
```

Perfect!

## Querying the dataset

Granted, in the long run, a SQL statement is easier than this. But since I live mostly at the CLI, this feels more natural to me.

How many sites have been scanned?
            
```
$ grep 'check_pixel.png' scan_results.log | sort | uniq | wc -l
843
```

Note that I'm not using `grep -c` to count: since some sites have been checked multiple times, I only want the unique values.
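Strictly speaking, `uniq` on full log lines only collapses exact duplicates, timestamp and IP included. If you'd rather count each scanned site exactly once, no matter when or how often it was checked, keying on the `url=` parameter alone would do it, something like:

```
# count distinct scanned sites, keyed on the url= parameter only
$ grep 'check_pixel.png' scan_results.log | grep -o 'url=[^& ]*' | sort -u | wc -l
```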
            
How many were compromised?
            
```
$ grep 'compromised=true' scan_results.log | sort | uniq | wc -l
9
```

Which sites have been scanned?
            
```
$ awk '{print $7}' scan_results.log | sort | uniq
...
/check_pixel.png?url=http://domain.tld&compromised=false
/check_pixel.png?url=http://otherdomain.tld&compromised=false
...
```

Which were compromised?
            
```
$ awk '{print $7}' /var/www/enginehack.ma.ttias.be/results/scan_results.log | grep 'compromised=true' | sort | uniq
...
/check_pixel.png?url=http://domain.tld&compromised=true
/check_pixel.png?url=http://otherdomain.tld&compromised=true
...
```
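And if you only want the bare URLs of the compromised sites, without the pixel path and the query-string noise around them, a small sed step on top would strip that out:

```
# reduce each hit to just the URL of a compromised site
$ awk '{print $7}' scan_results.log | grep 'compromised=true' | sed 's/.*url=//; s/&compromised=true$//' | sort -u
```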

I have all the queries I need, right there at the CLI.
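Should the data ever need to move somewhere more structured, a rough awk one-liner, assuming the standard combined log layout (timestamp in field 4, request path in field 7), could turn the log into CSV with timestamp, scanned URL and result:

```
# emit timestamp, scanned URL and compromised flag as CSV
$ awk '{ ts=substr($4, 2); split($7, q, /[?&]/);
         print ts "," substr(q[2], 5) "," substr(q[3], 13) }' scan_results.log
```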
            
## Logrotate

One downside to this system is that it lacks several of the [ACID][2] properties. The most important one, to me at least, is durability.

I've got logrotate configured to rotate all logs every night and keep them for 7 days. That would mean my scan logs would also be deleted after 7 days, once the webserver access logs are cleared.
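A typical logrotate stanza that produces that behaviour looks something like this (paths and options are illustrative):

```
# rotate the access logs nightly and keep 7 rotations
/var/log/httpd/*access.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
}
```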
            
A simple script takes care of that.
            
```
#!/bin/bash
grep 'check_pixel.png' /path/to/access.log >> /path/to/permanent/logs/scan_results.log
```

That runs every night, before logrotate. It takes the results from the current log and safely appends them to the permanent scan log. Easy.
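Scheduling it is nothing more than a cron entry along these lines (the time and script path are examples); just pick a moment that falls before whenever logrotate runs on the box:

```
# archive the scan results at 03:00, before logrotate kicks in
0 3 * * * /path/to/archive_scan_results.sh
```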
            
## Benefits and downsides

For me, this technique worked flawlessly. My benefits:

* No complex code to store data in MySQL tables
* Moved the concurrency issue for logging results to the webserver
* Use proven technology for logging requests

I could live with the downsides as well.

* No easy querying of data, everything happens with grep/sed/awk at the CLI
* Consistency of data is solved by a simple cron task

And there we have it. This implementation took me 30 seconds to make and it gets me, at least for my purposes, the same results as a relational database would. Implementing the MySQL solution, since it's been a while for me, would have taken 30 minutes or more. Not to mention the security angle of SQL injection, sanitising data, ...
                                
Glad I didn't have to do that.
                                
## Caveats

Obviously, in the long run, I should have stored this in a MySQL table. It would allow for a much better storage system.

This #OpsHack worked for me because my dataset is simple. The number of possible permutations of my data is incredibly small. As soon as the complexity of the data increases, using the access logs to store anything is no longer an option.

Just like my other #OpsHack [(abusing Zabbix to monitor the HackerNews submissions)][3], this was the easiest solution for me.
                                
                                <blockquote class="twitter-tweet" data-cards="hidden" lang="en">
                                  <p lang="en" dir="ltr">
                                    <a href="https://twitter.com/hashtag/OpsHack?src=hash">#OpsHack</a>: Using Webserver Access Logs As A Database Storage System <a href="https://t.co/Z34UjpYzY3">https://t.co/Z34UjpYzY3</a>
                                  </p>
                                  
                                  <p>
                                    &mdash; ma.ttias.be (@mattiasgeniar) <a href="https://twitter.com/mattiasgeniar/status/594480240841326592">May 2, 2015</a>
                                  </p>
                                </blockquote>

