A pretty long presentation (it runs for about an hour) by Aber Whitcomb, CTO of MySpace, and their senior lead for operations & engineering. They go into great detail about their caching system, database design, servers and clusters, software deployment, their own distributed file system, …
**The presentation itself**
It’s presented using Silverlight, but it does offer a neat way to jump straight to the part of the presentation that interests you most. There’s also a long Q&A afterwards, where the audience asks some good questions, such as “What impact did the move to 64-bit (and the extra RAM it allows) have on your system architecture?”, “Is memcached used?” and “How is synchronization handled between servers?”.
You can view the 2007 presentation here, or an older one from 2006 (the differences between the two are also interesting to note).
**Global overview**
It’s interesting to see how they handle 100 000 pageviews per second, and what that means for the server architecture and the overall network layout. They do it with 6 000 web servers, 250 database servers, 600 ad servers (their main source of income) and a couple of hundred cache servers (each serving content from 16 GB of RAM). The data is stored on a SAN containing 7 000 hard disks, delivering 70 000 Mb/s of throughput (that’s 70 gigabit per second!).
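Just to put those numbers in perspective (my own back-of-the-envelope arithmetic, not part of the talk), the per-machine load works out surprisingly modest:

```python
# Rough figures derived from the numbers quoted in the presentation.
pageviews_per_sec = 100_000
web_servers = 6_000
san_disks = 7_000
san_throughput_mbit = 70_000  # 70 gigabit per second in total

print(f"{pageviews_per_sec / web_servers:.1f} pageviews/s per web server")  # ~16.7
print(f"{san_throughput_mbit / san_disks:.1f} Mb/s per SAN disk")           # ~10.0
```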
**Software development & deployment**
They also go into greater detail about the software used to develop a site of this size (which at times sounds like one big Microsoft commercial), and about how source code is deployed to all of these servers (using a tool called CodeSpew).
To deploy new source code across their web server farm, they use UDP for the outgoing announcements, with a file hash to check whether a file needs updating (lost packets don’t matter at that stage), and TCP for the actual transfer, because you obviously want the source code itself to arrive without packet loss. They can push all the code to the entire “farm” within 10 seconds, which is pretty impressive.
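A minimal sketch of that push mechanism (my own illustration, not the actual CodeSpew code; the broadcast address, port and choice of hash are made up):

```python
import hashlib
import socket

CHUNK = 64 * 1024

def file_hash(path: str) -> str:
    """Hash a file so every server can cheaply tell whether its copy is stale."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()

def announce(path: str, broadcast_addr: str = "255.255.255.255", port: int = 9999) -> None:
    """Broadcast 'path|hash' over UDP; a lost datagram only delays the update."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.sendto(f"{path}|{file_hash(path)}".encode(), (broadcast_addr, port))
    sock.close()

def needs_update(local_path: str, announced_hash: str) -> bool:
    """Compare the announced hash against the local copy; a missing file is stale."""
    try:
        return file_hash(local_path) != announced_hash
    except FileNotFoundError:
        return True

# A server whose needs_update() returns True would then fetch the file over a
# plain TCP connection (e.g. HTTP), so the actual bytes arrive reliably.
```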
**Media content (images, movies & music)**
Another part covers media content and delivery, because MySpace has evolved into a far more media-rich portal than before. They serve more than 1 billion images, taking up roughly 80 TB of data. With 60 000 new videos coming in every day, they’re already at 60 TB of video storage. And with 25 million songs, they’re close to 150 TB of data for MP3s alone. All of this needs to be streamed to users 24 hours a day, with enough bandwidth to avoid constant buffering.
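Turning those totals into average object sizes (my own arithmetic, based purely on the figures above):

```python
TB = 10**12  # decimal terabytes

images, image_storage = 1_000_000_000, 80 * TB
songs, song_storage = 25_000_000, 150 * TB

print(f"~{image_storage / images / 1000:.0f} KB per image on average")  # ~80 KB
print(f"~{song_storage / songs / 10**6:.0f} MB per song on average")    # ~6 MB
```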
That means processing uploaded images (resizing, creating thumbnails, storing the data), re-encoding videos to a uniform file format, and processing music, including generating thumbnails from uploaded videos on the fly. Each type of media has its own “farm” of servers to process the files accordingly.
Depending on the type of media, the processed files are deployed to one of three places: MySpace’s own DFS (Distributed File System), a CDN (Content Delivery Network) FTP server, or a specific application.
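As a rough illustration of what one of those media farms might do (my own sketch, using the Pillow library; the route_to() step mirrors the three destinations above, but the function itself is invented):

```python
from PIL import Image  # Pillow, used here purely for illustration

THUMB_SIZE = (120, 120)

def process_image(path: str) -> str:
    """Create a small JPEG thumbnail for an uploaded image."""
    thumb_path = path + ".thumb.jpg"
    with Image.open(path) as img:
        img = img.convert("RGB")
        img.thumbnail(THUMB_SIZE)        # downscale in place, keeping aspect ratio
        img.save(thumb_path, "JPEG")
    return thumb_path

def route_to(destination: str, path: str) -> None:
    """Placeholder for shipping the result to the DFS, the CDN FTP server
    or a specific application, as described above."""
    print(f"would upload {path} to {destination}")

if __name__ == "__main__":
    thumb = process_image("upload_1234.jpg")  # hypothetical uploaded file
    route_to("dfs", thumb)
```

The video and music farms would follow the same pattern, just with re-encoding or audio processing instead of resizing.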
**MySpace Distributed File System (DFS)**
Faced with the problem of serving massive amounts of data at high speed, they felt the need to develop a DFS of their own. Serving 2 billion objects isn’t a big deal if only 5% of them are requested at any given time; MySpace’s problem is that far more of its content gets accessed on a daily basis, so a simple caching algorithm wouldn’t work here. It gets more complicated still: most MySpace users only listen to the first 7 seconds of an MP3 and then click through to the next one, yet in the background a separate process was started for each MP3 to serve the full file to the user.
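To put that waste into numbers (my own illustration: only the 7-second figure comes from the talk, the 128 kbps bitrate and the ~6 MB average file size are assumptions):

```python
BITRATE_KBPS = 128          # assumed typical MP3 bitrate
AVG_FILE_BYTES = 6 * 10**6  # assumed average MP3 size (~6 MB)

seconds_listened = 7
bytes_needed = BITRATE_KBPS * 1000 // 8 * seconds_listened  # ~112 KB

print(f"bytes actually needed: ~{bytes_needed // 1000} KB")
print(f"bytes fetched per play: ~{AVG_FILE_BYTES // 10**6} MB")
print(f"wasted: ~{100 - 100 * bytes_needed / AVG_FILE_BYTES:.0f}% of every full-file fetch")
```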
The DFS is an object-oriented file store that creates near-infinite storage by spreading data horizontally. It also scales nicely, since it can be upgraded gradually or cluster by cluster. Combined with a caching system of their own and a storage layer exposed over HTTP, this allows for quick read times on their storage platform (called Sledgehammer).
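Here’s a minimal sketch of what such a cache-in-front-of-the-storage-layer read path could look like (my own interpretation; the talk doesn’t go into Sledgehammer’s internals, and in the real system this interface would be exposed over HTTP):

```python
import os
from typing import Optional

class ObjectStore:
    """Toy object store: objects live as flat files keyed by object id."""

    def __init__(self, root: str, cache_limit: int = 1000):
        self.root = root
        self.cache: dict[str, bytes] = {}  # hot objects kept in RAM
        self.cache_limit = cache_limit

    def get(self, object_id: str) -> Optional[bytes]:
        # 1. Serve from the in-memory cache when possible.
        if object_id in self.cache:
            return self.cache[object_id]
        # 2. Otherwise read from the (horizontally spread) storage layer.
        path = os.path.join(self.root, object_id)
        if not os.path.exists(path):
            return None
        with open(path, "rb") as f:
            data = f.read()
        # 3. Populate the cache (very naive eviction: drop everything when full).
        if len(self.cache) >= self.cache_limit:
            self.cache.clear()
        self.cache[object_id] = data
        return data
```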