WARNING: This post was originally published in 2011 and hasn't been updated since.
The tips, techniques and technology explained here may be outdated. If you spot any errors, please let me know in the comments so I can adjust the article. Thanks!
I was impressed by the iPad 2 coverage on Engadget today. Not because of the iPad, but because the site was being hammered by thousands of requests and refreshes, and it never failed. At first glance, it seemed they were not caching anything either, as a refresh would immediately show you new content on the page, which updated every 30 seconds or so. That made me curious: how do they do it?
The live coverage included a lot of pictures of the iPad, along with all the presentation slides and videos they used. Instead of hosting those themselves, they placed them on Akamai's Content Delivery Network (CDN).
It was a smart move to host those elsewhere, as Akamai can place the files on multiple servers worldwide and have the one geographically closest to you serve the file. That removes load from your own systems and saves you bandwidth.
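Not only that, hosting them on different domain names helps because browsers limit the number of parallel connections they open per hostname, so spreading files across several hostnames lets your browser download more of them in parallel. Splitting content across multiple domains has a DNS "penalty" (your browser needs to do an extra DNS lookup for each one), but depending on your setup it can improve page loading times considerably.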
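To make the idea concrete, here is a minimal Python sketch of one way to spread assets across several hostnames. The hostnames (`img1.example.com` and friends) and the `shard_url` function are hypothetical and purely for illustration; I have no idea how Engadget or Akamai actually assign files to hostnames. The point is only that the mapping should be deterministic, so the same file always lands on the same hostname and browser caches stay useful.

```python
import zlib

# Hypothetical shard hostnames, for illustration only.
SHARDS = ["img1.example.com", "img2.example.com", "img3.example.com"]


def shard_url(path, shards=SHARDS):
    """Map an asset path to one shard hostname, deterministically."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    index = zlib.crc32(path.encode("utf-8")) % len(shards)
    return "http://%s/%s" % (shards[index], path.lstrip("/"))


if __name__ == "__main__":
    # Made-up asset paths; each one always maps to the same hostname.
    for asset in ["/photos/ipad2-01.jpg", "/photos/ipad2-02.jpg", "/slides/keynote-01.png"]:
        print(shard_url(asset))
```

More shards means more parallel downloads but also more DNS lookups, so this is a trade-off you would have to measure on your own pages.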
As reported by Netcraft, Engadget uses Citrix Netscaler as an all-in-one (expensive) package for load balancing and content caching. It will also compress pages before sending them to the client, allowing them to be downloaded faster. Since that happens on a separate system (the Citrix Netscaler), it does not cause load on the actual webservers.
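To get a feel for what compression buys you on a repetitive HTML page, here is a small Python sketch using gzip from the standard library. It only illustrates the general effect; I am assuming gzip because it is the usual choice for HTTP compression, not because I know how the Netscaler was configured. The `compression_ratio` helper and the sample markup are made up for this example.

```python
import gzip


def compression_ratio(body: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    if not body:
        raise ValueError("body must not be empty")
    return len(gzip.compress(body)) / len(body)


if __name__ == "__main__":
    # Made-up markup that repeats a lot, like a long live-blog page would.
    html = b"<li class='update'>New photo from the keynote</li>\n" * 500
    ratio = compression_ratio(html)
    print("original: %d bytes, ratio after gzip: %.3f" % (len(html), ratio))
```

Real pages are less repetitive than this sample, so expect a smaller gain in practice.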
The downside is that it makes "guessing" what is behind this system very hard. I've tried several methods of determining how many webnodes are behind it, but all HTTP headers seem to be returned in the same way, so I cannot differentiate between nodes.
# curl -I www.engadget.com
HTTP/1.1 200 OK
Date: Thu, 03 Mar 2011 00:05:56 GMT
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
Set-Cookie: GEO-[SNIP]; expires=Fri, 04-Mar-2011 00:05:56 GMT; path=/
This can be caused either by source-based load balancing (directing all my requests to the same backend webnode), or by the Citrix box simply rewriting the headers.
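For anyone who wants to repeat the header comparison, this is roughly how it can be automated in Python. It is a sketch under my own assumptions: the list of "volatile" headers to ignore is my guess at what changes on every response anyway, and `fingerprint`, `count_distinct` and `fetch_headers` are names I made up for this example. If the count stays at 1 no matter how often you ask, you are in the situation described above.

```python
import urllib.request

# Headers that differ on every response and say nothing about the backend
# (an assumption; extend this set as needed).
VOLATILE = {"date", "set-cookie", "expires", "age", "content-length"}


def fingerprint(headers):
    """Reduce one response's (name, value) pairs to an order-independent value."""
    return frozenset(
        (name.lower(), value.strip())
        for name, value in headers
        if name.lower() not in VOLATILE
    )


def count_distinct(responses):
    """Count how many different header fingerprints appear in the responses."""
    return len({fingerprint(headers) for headers in responses})


def fetch_headers(url, timeout=10):
    """Send a HEAD request (like curl -I) and return the response headers."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return list(response.headers.items())


if __name__ == "__main__":
    # Canned samples so the demo runs offline; "X-Backend" is a made-up header.
    samples = [
        [("Date", "Thu, 03 Mar 2011 00:05:56 GMT"), ("Server", "Apache")],
        [("Date", "Thu, 03 Mar 2011 00:05:57 GMT"), ("Server", "Apache")],
        [("Server", "Apache"), ("X-Backend", "web02")],
    ]
    print(count_distinct(samples))
```

To run it against a live site, collect the samples with something like `[fetch_headers("http://www.example.com/") for _ in range(20)]` instead of the canned list.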
No Intrusion Prevention System
To test for this, I used the Active Filtering Detection tool from PureHacking and tried classic null-byte terminated strings and the phf vulnerability. None were (actively) blocked. Perhaps my IP would have been blocked after a few hours, once some poor bloke went through the IDS logs.
I'm curious to know how more information could be gathered about their server infrastructure. I believe the Citrix Netscaler to be a huge blocking factor, as it pretty much "anonymizes" all HTTP traffic from their backend webnodes. That leaves me with a few unanswered questions.
- Do they run a default Apache? Headers claim so, but headers are easily modified.
- How many webservers do they have serving web content?
- Do they cache pages, and if so: for how long?
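The caching question can at least be partly probed from the outside by reading the Cache-Control header, like the one in the curl output earlier. Below is a minimal Python sketch; `parse_cache_control` and `cache_lifetime` are names I made up, and the parsing is deliberately naive. Keep in mind that this only shows what clients and downstream caches are told. A box like the Netscaler could still cache pages internally, and that would be invisible here.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a {directive: argument} dict."""
    directives = {}
    # Naive split: a quoted argument that itself contains a comma would break this.
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, argument = part.partition("=")
        directives[name.strip().lower()] = argument.strip().strip('"') or None
    return directives


def cache_lifetime(value):
    """Seconds a shared cache may reuse the response without revalidating.

    Returns 0 if it must not, and None if the header does not say.
    """
    directives = parse_cache_control(value)
    if "no-store" in directives or "no-cache" in directives:
        return 0
    # s-maxage targets shared caches, so check it before max-age.
    for key in ("s-maxage", "max-age"):
        argument = directives.get(key)
        if argument and argument.isdigit():
            return int(argument)
    return None


if __name__ == "__main__":
    header = "no-cache, must-revalidate, post-check=0, pre-check=0"
    print(parse_cache_control(header))
    print(cache_lifetime(header))
```

For the header Engadget returned, `cache_lifetime` gives 0, which matches what I saw: every refresh showed fresh content.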
Perhaps some day, I'll find out. :-)