Google opens up



Mattias Geniar, June 02, 2008


Google, the world's most secretive IT powerhouse, is releasing more information about its data centers, spread across America, Europe, Africa & Asia. We now know it runs 36 data centers in total, with 150 racks per data center and 40 servers per rack: 36 × 150 × 40, a whopping 216,000 servers.

The largest commercial providers in Belgium (Nucleus? ;-) ) manage at most around 1,000 servers, of which at least 75% is pure colocation, meaning the provider doesn't have to maintain those machines itself. How on earth do you install, manage & deploy 216,000 (two hundred and sixteen thousand!) servers? And how do you keep all of that running, and profitable, by selling a few ads?

All fairly mind-boggling, especially when you realise that Google has only been around since '96 and started out, bless them, with just a handful of servers… Now, let's talk numbers! Google divides its data centers into "clusters", which are subdivided into racks, which in turn consist of servers.

In each cluster’s first year, it’s typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will “go wonky,” with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span. And there’s about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover.

Their approach of using cheap hardware, combined with rock-solid, home-grown software (such as Google File System, BigTable and MapReduce), seems to work despite those enormous failure rates. Below is some copy/paste from CNET, who got to attend Google's presentation.

GFS (Google File System) stores each chunk of data, typically 64MB in size, on at least three machines called chunkservers; master servers are responsible for backing up data to a new area if a chunkserver failure occurs. “Machine failures are handled entirely by the GFS system, at least at the storage level,” Dean said.
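To make that replication idea concrete, here is a minimal sketch in Python of what a master could do when a chunkserver drops out. This is purely illustrative, based only on the quote above; the class and method names are my own, not Google's actual API.

```python
import random

REPLICATION_FACTOR = 3          # each chunk lives on at least three chunkservers
CHUNK_SIZE = 64 * 1024 * 1024   # chunks are typically 64 MB

class ToyGFSMaster:
    """Toy GFS-style master: tracks which chunkservers hold which chunk."""

    def __init__(self, chunkservers):
        self.chunkservers = set(chunkservers)   # e.g. {"cs-001", "cs-002", ...}
        self.replicas = {}                      # chunk_id -> set of chunkservers

    def place_chunk(self, chunk_id):
        # Pick three distinct chunkservers for a new chunk.
        self.replicas[chunk_id] = set(
            random.sample(sorted(self.chunkservers), REPLICATION_FACTOR)
        )

    def handle_chunkserver_failure(self, failed):
        # The master notices a dead chunkserver and re-replicates every chunk
        # that dropped below the replication factor onto a healthy server.
        self.chunkservers.discard(failed)
        for chunk_id, servers in self.replicas.items():
            servers.discard(failed)
            while len(servers) < REPLICATION_FACTOR and (self.chunkservers - servers):
                servers.add(random.choice(sorted(self.chunkservers - servers)))
```

The point is simply that a machine dying never means data loss: the master notices, and copies the affected chunks somewhere else, entirely at the storage layer.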

To provide some structure to all that data, Google uses BigTable. Commercial databases from companies such as Oracle and IBM don't cut the mustard here. For one thing, they don't operate at the scale Google demands, and if they did, they'd be too expensive, Dean said.

BigTable, which Google began designing in 2004, is used in more than 70 Google projects, including Google Maps, Google Earth, Blogger, Google Print, Orkut, and the core search index. The largest BigTable instance manages about 6 petabytes of data spread across thousands of machines, Dean said.
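The BigTable data model itself is fairly simple: a sparse, sorted map from (row key, column, timestamp) to a value. A rough Python sketch of that idea, purely to show the shape of the thing and nothing like Google's real implementation:

```python
from collections import defaultdict
import time

class ToyBigTable:
    """Sparse map: (row key, "family:qualifier" column, timestamp) -> value."""

    def __init__(self):
        # row key -> column -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        cells = self.rows[row_key][column]
        cells.append((ts, value))
        cells.sort(reverse=True)                  # keep the newest version first

    def get(self, row_key, column):
        cells = self.rows[row_key][column]
        return cells[0][1] if cells else None     # point read of the latest version

# Hypothetical usage, loosely modelled on a web-crawl table:
table = ToyBigTable()
table.put("com.example.www", "contents:html", "<html>…</html>")
table.put("com.example.www", "anchor:cnet.com", "Google at scale")
print(table.get("com.example.www", "contents:html"))
```

The real system shards that map across thousands of machines, which is how a single instance can hold petabytes.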

MapReduce, the first version of which Google wrote in 2003, gives the company a way to actually make something useful of its data. For example, MapReduce can find how many times a particular word appears in Google’s search index; a list of the Web pages on which a word appears; and the list of all Web sites that link to a particular Web site.

With MapReduce, Google can relatively quickly build an index that shows which Web pages all contain the terms "new," "york," and "restaurants." "You need to be able to run across thousands of machines in order for it to complete in a reasonable amount of time," Dean said.
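The word-count / inverted-index example above is the classic MapReduce illustration. Here's a single-machine sketch of the map and reduce phases in Python (the real thing, of course, shards both phases across thousands of workers; the function names and sample documents are mine):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc_id, text):
    # Map: emit a (word, doc_id) pair for every word in the document.
    for word in text.lower().split():
        yield word, doc_id

def reduce_phase(pairs):
    # Shuffle + reduce: group the emitted pairs by word into an inverted index.
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return index

docs = {
    "page1": "new york restaurants",
    "page2": "restaurants in new york city",
    "page3": "new music releases",
}

pairs = chain.from_iterable(map_phase(d, t) for d, t in docs.items())
index = reduce_phase(pairs)

# Pages containing all of "new", "york" and "restaurants":
print(index["new"] & index["york"] & index["restaurants"])   # {'page1', 'page2'}
```

Answering the "new york restaurants" question then boils down to intersecting a few posting lists, which is cheap once the index exists.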

For the full article (which is well worth reading, by the way), head over to the CNET blog. Quite the exciting life over there at Google Inc. :-)


