If a URL is never indexed, and no one visits it, does it even exist?
These kinds of existential nerd-questions make you wonder about virgin URLs. Do URLs that have never been visited even exist? And what exactly happens when you hit publish on your blog and send your freshly written words out onto the World Wide Web?
I’ll dive deeper into the access logs and find out what happens in the first 10 minutes when a new URL is introduced onto the web.
Disclaimer: testing with a WordPress website
I’m testing with my own website, which is WordPress based. WordPress is pretty smart when it comes to publishing new posts: it pings the right search engines and social networks (if enabled) and lets the world know you have newly written content.
If you’re publishing with a static site generator or just plain ol’ HTML, your mileage may vary.
These tests were conducted without any social network pinging, but I did leave the default “Update Services” list enabled. And as it turns out, that’s a pretty damn long list.
The first 5 hits
You’ll probably see these as the first hits in your access logs:
- Your own: don’t deny it, you’re reading your blog post again. Still doubting if you should have hit Publish after all.
- Tiny Tiny RSS hits: this self-hosted RSS reader is crazy popular. Chances are, you’ll see quite a few of these hits, since every self-hosted TT-RSS does its own fetching.
- Crazy requests from OVH: this French hosting provider has a lot of servers. I can’t tell why these hits even come, but they’re masquerading as legit users with “valid” User-Agents. Pretty sure these are crawlers for some service, but haven’t figured out which one.
- GoogleBot: within 10 minutes, Google will have come by and indexed your newly written page.
- Feedburner & other RSS aggregators: since these all run automatically, they are naturally among your first visitors. Every RSS reader that’s subscribed to your feed will come crawling by. Most of the popular ones have received a ping from WordPress notifying them that new content has been published.
The requests look like this.
$ tail -f access.log

You:
126.96.36.199 "GET /your-url/ HTTP/1.1" 200 11395 "-" "Mozilla/5.0"

Tiny Tiny RSS feeds:
188.8.131.52 "GET /your-url/ HTTP/1.1" 200 11250 "-" "Tiny Tiny RSS (http://tt-rss.org/)"

Feedburner:
184.108.40.206 "GET /your-url/?utm_source=feedburner&utm_medium=feed HTTP/1.1" 206 11250 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64)"

GoogleBot:
220.127.116.11 "GET /your-url/ HTTP/1.1" 200 11250 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
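If you want to tally hits like these per User-Agent yourself, a minimal Python sketch could look like this. It assumes the abbreviated line layout shown above (IP, request, status, size, referer, User-Agent); a real access log usually carries extra fields such as a timestamp, so the pattern would need adjusting. The names `parse_line` and `count_agents` are made up for this example.

```python
import re
from collections import Counter

# Matches the abbreviated log lines shown in this post:
# IP "METHOD PATH PROTOCOL" STATUS BYTES "REFERER" "USER-AGENT"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) "(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def count_agents(lines, path_prefix):
    """Count hits per User-Agent for requests whose path starts with path_prefix."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and entry["path"].startswith(path_prefix):
            counts[entry["agent"]] += 1
    return counts


# The four example requests from this post.
sample = [
    '126.96.36.199 "GET /your-url/ HTTP/1.1" 200 11395 "-" "Mozilla/5.0"',
    '188.8.131.52 "GET /your-url/ HTTP/1.1" 200 11250 "-" "Tiny Tiny RSS (http://tt-rss.org/)"',
    '184.108.40.206 "GET /your-url/?utm_source=feedburner&utm_medium=feed HTTP/1.1" 206 11250 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64)"',
    '220.127.116.11 "GET /your-url/ HTTP/1.1" 200 11250 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

for agent, hits in count_agents(sample, "/your-url/").most_common():
    print(hits, agent)
```

Matching on the path prefix rather than the exact path keeps the Feedburner hit, which carries a query string, in the same tally.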
The first few hits aren’t really special. The extravaganza comes in the next part.
Google is crazy fast
Considering the volume of data they process on a daily basis and the number of search queries they get per second, it’s hard to imagine there’s room to index more.
After a single minute, the blog post shows up as part of the Google+ social network search results.
And within 20 minutes, Google’s “normal” search results have the page completely indexed.
This just blows my mind.
Social Sharing: Bot Heaven
You want to drive traffic to your new blog post, so you share it on your Facebook. And your Twitter. And LinkedIn. And Google+.
As soon as you share your new URL on social media, your server receives multiple HTTP requests in order to create a description, title and cover image for that social network.
Each of those services fetches your URL, which means at least one HTTP call to get your content. Probably more, as they continue to fetch images, stylesheets, check your robots.txt, …
- "Mozilla/5.0 (compatible; redditbot/1.0; +http://www.reddit.com/feedback)": the Reddit Bot, fetching an image + title suggestions.
- "Mozilla/5.0 (TweetmemeBot/4.0; +http://datasift.com/bot.html)": yet another Twitter bot.
- "Twitterbot/1.0": the actual Twitter bot, used to create the Twitter Cards.
- "(Applebot/0.1; +http://www.apple.com/go/applebot)": the Siri & Spotlight searchbot, soon to power the News App on iOS 9.
- "Google (+https://developers.google.com/+/web/snippet/)": Google’s snippet generator for services like Google+.
- "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)": Facebook fetching your URL to generate a preview in the News Feed.
- "Googlebot-Image/1.0": if your post contained images, this little bot just came by to index them. It’s a separate request from the Googlebot that came by earlier.
Besides those well-known bots, you’ll see a flurry of bots you’ve never heard of or thought were extinct: Zemanta Aggregator, BingBot, Slackbot-LinkExpanding, MetaDataParser, BazQux, Apache-HttpClient (Java implementations for HTTP fetchers), Ahrefs Bot, BrandWatch bot, Baiduspider, worldwebheritage.org, MJ12bot, YandexBot, FlipboardProxy, Lynx/Curl (for monitoring), …
It’s a bot party and everyone’s invited!
Within 10 minutes, I had a total of 33 bots visit the new URL. Most of these I have never heard of.
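A count like that can be approximated with a short script. The sketch below is not the method used for this post; it assumes the standard Apache/Nginx “combined” log format (which includes a timestamp), a log that is in chronological order, and a hypothetical substring heuristic (`BOT_HINTS`) for spotting automated clients. The sample lines are made up for illustration, using User-Agents mentioned in this post.

```python
import re
from datetime import datetime, timedelta

# Assumption: Apache/Nginx "combined" log format, e.g.
# 192.0.2.1 - - [01/Jan/2016:10:00:00 +0000] "GET / HTTP/1.1" 200 123 "-" "agent"
COMBINED_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical heuristic: substrings that usually indicate an automated client.
BOT_HINTS = ("bot", "spider", "crawl", "rss", "feed", "externalhit", "httpclient", "curl")


def looks_like_bot(agent):
    """Guess whether a User-Agent string belongs to an automated client."""
    agent = agent.lower()
    return any(hint in agent for hint in BOT_HINTS)


def bots_in_window(lines, path_prefix, minutes=10):
    """Distinct bot User-Agents hitting path_prefix within `minutes` of its first hit."""
    first_hit = None
    agents = set()
    for line in lines:
        m = COMBINED_RE.match(line)
        if not m or not m["path"].startswith(path_prefix):
            continue
        when = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
        if first_hit is None:
            first_hit = when  # the first request for this URL starts the clock
        if when - first_hit <= timedelta(minutes=minutes) and looks_like_bot(m["agent"]):
            agents.add(m["agent"])
    return agents


# Made-up sample lines, for illustration only.
sample_log = [
    '192.0.2.1 - - [01/Jan/2016:10:00:00 +0000] "GET /your-url/ HTTP/1.1" 200 11395 "-" "Mozilla/5.0"',
    '192.0.2.2 - - [01/Jan/2016:10:00:30 +0000] "GET /your-url/ HTTP/1.1" 200 11250 "-" "Twitterbot/1.0"',
    '192.0.2.3 - - [01/Jan/2016:10:04:00 +0000] "GET /your-url/ HTTP/1.1" 200 11250 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"',
    '192.0.2.4 - - [01/Jan/2016:10:25:00 +0000] "GET /your-url/ HTTP/1.1" 200 11250 "-" "Googlebot-Image/1.0"',
]

for agent in sorted(bots_in_window(sample_log, "/your-url/")):
    print(agent)
```

Keep in mind that User-Agent matching can only undercount: bots that masquerade as regular browsers, like the OVH hits mentioned earlier, slip straight through a heuristic like this.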
There’s No Such Thing As a Virgin URL
Checking my own access logs, I don’t believe there are URLs that have never been visited.
They may not have been visited by real humans, but they’ve been indexed at least a dozen times. By services you don’t know. That treat your data in ways you don’t want to know.
It seems strange to have all these kinds of services do their own indexing. Imagine the overhead and waste of bandwidth, CPU cycles and disk space each of these bots is consuming. Since most of them are offering competing services, they’ll never work together and join forces.
The amount of bot crawling will only increase. On small to medium sized sites, bot traffic can exceed the regular visitors by a factor of 3 or 4.
That means if Google Analytics is showing you an average of 1,000 pageviews a day, you can safely assume your server is handling another 3,000 to 4,000 requests on top of that, just to serve the bots, RSS aggregators and link fetchers.
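As a back-of-the-envelope version of that estimate, here is a tiny Python sketch. The factor of 3 to 4 is the rough figure from the paragraph above, not a measured constant, and `estimate_bot_requests` is a hypothetical helper name.

```python
def estimate_bot_requests(human_pageviews, factor_low=3, factor_high=4):
    """Rough range of daily bot requests, given daily human pageviews."""
    return human_pageviews * factor_low, human_pageviews * factor_high


low, high = estimate_bot_requests(1000)
print(f"Roughly {low} to {high} bot requests per day")
```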
It’s a crazy world we live in.