Profiling: Learning from Engadget’s infrastructure on traffic spike handling (i.e. the iPad 2 launch)

I was impressed by the iPad 2 coverage on Engadget today. Not because of the iPad, but because the site was being hammered by thousands of requests and refreshes, and it never failed. At first glance, it seemed they were not caching anything either, as a refresh would immediately show you new content on the page, which updated every 30 seconds or so. That made me curious: how do they do it?

Images, CSS & JavaScript via Akamai's CDN

The live coverage included a lot of pictures, both of the iPad itself and of the presentation slides and videos that were shown. Instead of hosting those themselves, Engadget placed them on Akamai's Content Delivery Network (CDN).

Their CSS and JavaScript files are also hosted by Akamai: the same CDN provider, but on a different domain name.

A smart move to host those elsewhere, as Akamai has the ability to place the files on multiple servers worldwide, and have the one geographically closest to you serve the file. It removes load from your systems and saves you bandwidth.

Not only that, hosting them on different domain names helps because browsers limit the number of simultaneous connections per hostname, so spreading files over several hostnames lets more of them download in parallel. Splitting content over multiple domains has a DNS "penalty" (your browser needs to do an extra DNS lookup for each domain), but depending on your setup it can improve page loading times considerably.
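To see how a page spreads its assets over hostnames, you could collect the asset URLs from its HTML and count them per host. Here's a minimal Python sketch of that idea. The URLs and the example.com hostnames are made up for illustration; they are not Engadget's or Akamai's real ones.

```python
from collections import Counter
from urllib.parse import urlsplit


def assets_per_host(urls):
    """Count how many asset URLs point at each hostname."""
    return Counter(urlsplit(url).hostname for url in urls)


# Hypothetical asset URLs, as you might scrape them from a page's HTML.
ASSET_URLS = [
    "http://img.cdn.example.com/photos/slide-01.jpg",
    "http://img.cdn.example.com/photos/slide-02.jpg",
    "http://static.cdn.example.com/css/main.css",
    "http://static.cdn.example.com/js/live.js",
    "http://www.example.com/favicon.ico",
]

for host, count in assets_per_host(ASSET_URLS).most_common():
    print(host, count)
```

The more hostnames show up here, the more parallel downloads the browser can start, at the cost of one DNS lookup per hostname.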

Citrix Netscaler

As reported by Netcraft, Engadget uses Citrix Netscaler as an all-in-one (expensive) package for load balancing and content caching. It also compresses pages before sending them to the client, so they download faster. Since this happens on a separate system (the Citrix Netscaler), it causes no load on the actual webservers: the dedicated Citrix box takes care of it.
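To get a feel for what compression buys you, here's a small Python sketch that gzips a chunk of repetitive HTML and compares the sizes. The markup is invented, and this only illustrates gzip in general: I don't know which algorithm or settings the Netscaler actually uses.

```python
import gzip


def compression_ratio(body: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(gzip.compress(body)) / len(body)


# Made-up, repetitive markup standing in for a live-blog page.
PAGE = (
    b"<html><body>"
    + b"<div class='update'><p>Live update text</p></div>" * 200
    + b"</body></html>"
)

print(f"{len(PAGE)} bytes raw, ratio {compression_ratio(PAGE):.2f}")
```

Real pages are less repetitive than this one, so expect a smaller gain, but HTML generally compresses well.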

The downside is that it makes "guessing" what is behind this system very hard. I've tried several methods of determining how many webnodes are behind it, but the HTTP headers always come back the same, so I cannot differentiate between nodes.

# curl -I
HTTP/1.1 200 OK
Date: Thu, 03 Mar 2011 00:05:56 GMT
Server: Apache/2.2
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
Set-Cookie: GEO-[SNIP]; expires=Fri, 04-Mar-2011 00:05:56 GMT; path=/
Content-Type: text/html

This can be caused either by source-based load balancing (directing all my requests to the same backend webnode) or by the Citrix box simply rewriting the headers.

To test, I've used tools like lbd (Load Balancing Detector) and Halberd. Both work by making many requests and looking for differences in the HTTP response headers.
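The general idea of comparing headers across many responses can be sketched in a few lines of Python. This is not how lbd or Halberd are implemented, just a minimal illustration: the function names and the list of headers to ignore are my own assumptions, and the third sample response (with a different Server header) is hypothetical.

```python
# Headers that change on every response and so cannot identify a backend here.
VOLATILE = {"date", "set-cookie", "expires", "age"}


def fingerprint(headers):
    """Reduce one response's headers to a hashable, comparable fingerprint."""
    return tuple(sorted(
        (name.lower(), value.strip())
        for name, value in headers.items()
        if name.lower() not in VOLATILE
    ))


def distinct_fingerprints(responses):
    """Return the set of distinct header fingerprints seen across responses."""
    return {fingerprint(headers) for headers in responses}


# Two responses like the one above, differing only in the Date header.
first = {"Date": "Thu, 03 Mar 2011 00:05:56 GMT", "Server": "Apache/2.2",
         "Content-Type": "text/html"}
second = {"Date": "Thu, 03 Mar 2011 00:05:57 GMT", "Server": "Apache/2.2",
          "Content-Type": "text/html"}
# Hypothetical response from a node that identifies itself differently.
third = {"Date": "Thu, 03 Mar 2011 00:05:58 GMT", "Server": "Apache/2.2.3",
         "Content-Type": "text/html"}

print(len(distinct_fingerprints([first, second])))
print(len(distinct_fingerprints([first, second, third])))
```

If every request yields the same fingerprint, as happened to me with Engadget, you learn nothing about the number of nodes: either you keep hitting the same one, or something in front is normalizing the headers.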

No Intrusion Prevention System

This kind of surprised me, as I would have guessed they had an IPS in place. Perhaps there's an IDS, but an IPS could protect them from common scanning techniques and vulnerability probes.

To test for this, I used the Active Filtering Detection tool from PureHacking and tried classic null-byte terminated strings and the phf vulnerability. None were (actively) blocked. Perhaps my IP would have been blocked a few hours later, when some poor bloke goes through the IDS logs.

Gief moar.

I'm curious to know how more information could be gathered on their server infrastructure. I believe the Citrix Netscaler to be a huge blocking factor, as it pretty much "anonymizes" all HTTP traffic from their backend webnodes. I still have a few unanswered questions:

  • Do they run a default Apache? The headers claim so, but headers are easily modified.
  • How many webservers do they have serving web content?
  • Do they cache pages, and if so: for how long?
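On the caching question, the one thing visible from the outside is the Cache-Control header. Here's a minimal Python sketch that reads it; the function names are my own. Note that it only shows what browsers and proxies are told to do, not what the Netscaler caches internally.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a {directive: argument} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def cache_lifetime(value):
    """Seconds a response may be reused without revalidation.

    Returns 0 if no-store or no-cache is present, None if no lifetime is given.
    """
    directives = parse_cache_control(value)
    if "no-store" in directives or "no-cache" in directives:
        return 0
    for key in ("s-maxage", "max-age"):
        arg = directives.get(key)
        if arg and arg.isdigit():
            return int(arg)
    return None


# The header value from the curl output above.
print(cache_lifetime("no-cache, must-revalidate, post-check=0, pre-check=0"))
```

For the header Engadget returned, this reports a lifetime of 0: clients are told to revalidate every time, which matches the "refresh always shows new content" behaviour I saw.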

Perhaps some day, I'll find out. :-)
