Profiling: Learning from Engadget’s infrastructure on traffic spike handling (e.g. the iPad 2 launch)

WARNING: This post was originally published in 2011 and hasn't been updated since.
The tips, techniques and technology explained here may be outdated. If you spot any errors, please let me know in the comments so I can adjust the article. Thanks!

I was impressed by the iPad 2 coverage on Engadget today. Not because of the iPad, but because the site was being hammered by thousands of requests and refreshes, and it never failed. At first glance, it seemed they were not caching anything either: a refresh would immediately show new content on the page, which updated every 30 seconds or so. That made me curious: how do they do it?

Images, CSS & JavaScript via Akamai's CDN

The live coverage included a lot of pictures, of the iPad and of all the presentation slides and videos that were shown. Instead of hosting those themselves, they placed them on Akamai's Content Delivery Network (CDN).

Their CSS and JavaScript files are also hosted by Akamai: the same CDN provider, but on a different domain name.

A smart move to host those elsewhere, as Akamai has the ability to place the files on multiple servers worldwide, and have the one geographically closest to you serve the file. It removes load from your systems and saves you bandwidth.

Not only that, hosting them on different domain names helps because browsers limit the number of simultaneous connections per hostname, so spreading files over several hostnames lets more downloads run in parallel. Splitting content across multiple domains has a DNS "penalty" (your browser needs to do an extra DNS lookup for each domain), but depending on your setup this can improve page loading times greatly.
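To make the idea of splitting assets over several hostnames concrete, here is a minimal sketch. The hostnames and asset paths are made up for illustration (I don't know how Engadget assigns files to domains); the only point is that each file should always map to the same hostname, otherwise the browser cache is defeated.

```python
import zlib

# Hypothetical shard hostnames, not Engadget's or Akamai's real ones.
SHARD_HOSTS = ["static1.example.com", "static2.example.com", "static3.example.com"]


def shard_url(path):
    """Map an asset path to one shard host, always the same host for a given path."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    index = zlib.crc32(path.encode("utf-8")) % len(SHARD_HOSTS)
    return "http://" + SHARD_HOSTS[index] + path


# Hypothetical asset paths.
for asset in ["/css/main.css", "/js/app.js", "/img/ipad2-01.jpg"]:
    print(shard_url(asset))
```

Because the mapping depends only on the path, a given image is always requested from the same hostname, so it stays cacheable while the page as a whole is fetched over more parallel connections.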

Citrix Netscaler

As reported by Netcraft, Engadget uses Citrix Netscaler as an all-in-one (expensive) package for load balancing and content caching. It will also compress pages before sending them to the client, allowing them to be downloaded faster. Since this happens on a separate system (the dedicated Citrix Netscaler box), it does not cause load on your actual webservers.

The downside is it makes "guessing" what is behind this system very hard. I've tried several methods of determining how many webnodes are behind it, but all HTTP headers seem to be returned in the same way, so I cannot differentiate between nodes.
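To get a feel for why compressing pages is worth offloading, here is a small sketch that gzips a repetitive HTML page and compares sizes. The page content is invented, and this says nothing about how the Netscaler itself is configured; it only shows the kind of saving compression gives on text.

```python
import gzip


def gzip_sizes(body):
    """Return (uncompressed_size, compressed_size) in bytes for a response body."""
    compressed = gzip.compress(body)
    # Sanity check: decompressing must give back the original bytes.
    assert gzip.decompress(compressed) == body
    return len(body), len(compressed)


# Invented page: liveblog-style markup repeats a lot, which compresses well.
page = ("<html><body>"
        + "<p>Live update from the keynote.</p>" * 200
        + "</body></html>").encode("utf-8")

raw, packed = gzip_sizes(page)
print(raw, "bytes uncompressed")
print(packed, "bytes gzip-compressed")
```

The compression itself costs CPU time on every response, which is exactly the work a dedicated box takes away from the webservers.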

# curl -I
HTTP/1.1 200 OK
Date: Thu, 03 Mar 2011 00:05:56 GMT
Server: Apache/2.2
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
Set-Cookie: GEO-[SNIP]; expires=Fri, 04-Mar-2011 00:05:56 GMT; path=/
Content-Type: text/html

This can either be caused by source-based load balancing (directing my requests to the same backend webnode for all requests), or because the Citrix box simply rewrites them.
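The first explanation, source-based load balancing, can be sketched in a few lines. The backend names and the modulo scheme are assumptions for illustration; I don't know which algorithm the Netscaler actually uses.

```python
import ipaddress

# Hypothetical backend webnodes.
BACKENDS = ["web01", "web02", "web03", "web04"]


def pick_backend(client_ip):
    """Pick a backend based only on the client's source IP address."""
    # Turn the IP into an integer and spread clients over the backends.
    ip_value = int(ipaddress.ip_address(client_ip))
    return BACKENDS[ip_value % len(BACKENDS)]


# 192.0.2.10 is a documentation address, used here as a stand-in for my own IP.
for _ in range(3):
    print(pick_backend("192.0.2.10"))
```

Every request from the same address lands on the same node, so probing from a single IP would never reveal the other backends, no matter how many requests I send.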

To test, I've used tools like lbd (Load Balancing Detector) and Halberd. They work by making many requests and looking for differences in the HTTP response headers.
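The general idea of comparing response headers can be sketched as follows. This is a simplified illustration, not the actual algorithm of lbd or Halberd: it fingerprints each response on its stable headers and counts how many distinct fingerprints show up. The first sample is the response shown above; the second is a made-up follow-up response one second later.

```python
import hashlib

# Headers that change on every request and say nothing about the backend.
VOLATILE = {"date", "set-cookie", "expires", "age"}


def fingerprint(headers):
    """Hash the headers that should be identical for every response from one node."""
    stable = sorted(
        (name.lower(), value.strip())
        for name, value in headers.items()
        if name.lower() not in VOLATILE
    )
    return hashlib.sha256(repr(stable).encode("utf-8")).hexdigest()[:12]


def count_distinct(responses):
    """Count how many different header fingerprints appear in a list of responses."""
    return len({fingerprint(headers) for headers in responses})


samples = [
    {"Date": "Thu, 03 Mar 2011 00:05:56 GMT", "Server": "Apache/2.2", "Content-Type": "text/html"},
    # Hypothetical second response, identical apart from the timestamp.
    {"Date": "Thu, 03 Mar 2011 00:05:57 GMT", "Server": "Apache/2.2", "Content-Type": "text/html"},
]
print(count_distinct(samples))
```

If every response produces the same fingerprint, as in my tests, this approach cannot tell the nodes apart. A different `Server` version or header order on one node would show up as a second fingerprint.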

No Intrusion Prevention System

This kind of surprised me, as I would have guessed they have an IPS in place. Perhaps there's an IDS, but an IPS could actively protect them from common scanning techniques and vulnerability probes.

To test for this, I used the Active Filtering Detection tool from PureHacking, and tried classic probes such as null-byte terminated strings and the phf vulnerability. None were (actively) blocked. Perhaps my IP would have been blocked after a few hours, when some poor bloke goes through the IDS logs.

Gief moar.

I'm curious to know how more information can be gathered on their server infrastructure. I believe the Citrix Netscaler to be a huge blocking factor, as it pretty much "anonymizes" all HTTP traffic from their backend webnodes. I still have a few questions left unanswered.

  • Do they run a default Apache? Headers claim so, but headers are easily modified.
  • How many webservers do they have serving web content?
  • Do they cache pages, and if so: for how long?
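On the caching question, the only hard evidence I have is the Cache-Control header from the response above. A minimal sketch of reading it is shown here; the parser is deliberately naive (it splits on commas and would mishandle quoted values that contain commas) and the function name is my own.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a {directive: argument} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        # Directives without an argument (like no-cache) get None.
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


# Header value taken from the curl output earlier in this post.
header = "no-cache, must-revalidate, post-check=0, pre-check=0"
parsed = parse_cache_control(header)
print(parsed)

# no-cache: a cache must revalidate with the origin before reusing a stored copy.
# no-store: a cache must not keep a copy at all.
reusable = "no-cache" not in parsed and "no-store" not in parsed
print("reusable without revalidation:", reusable)
```

This only tells me what browsers and downstream proxies are asked to do. Whether the Netscaler keeps its own short-lived copy of the page in front of the webservers is not visible from the headers, so the question stays open.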

Perhaps some day, I'll find out. :-)

Want to get in touch? Find me as @mattiasgeniar on Twitter or via the contact-page on this blog.