I've discovered this:
- GoogleBot is obviously not just a handful of servers. It is a very large distributed cluster with hundreds of machines: my site is crawled from more than 900 different Google IP addresses every day.
- I've identified 7 different GoogleBot crawling clusters.
- They seem to connect to my site from 6 different locations.
- Almost all of them are in the USA, but one location is in Europe.
Origin IP
With access to your web site's log files, you can grep for the string "http://www.google.com/bot.html" in the User-Agent field and find out which IPs GoogleBot uses when it pays you a visit. Some malicious crawlers fake their User-Agent to pose as GoogleBot, but they're easily spotted. Google Inc. owns the Autonomous System AS15169, and its connections come from there. In my case, I got connections from the IP ranges below during the last six months:
66.249.72.xxx
66.249.73.xxx
66.249.74.xxx
66.249.75.xxx
66.249.76.xxx
66.249.78.xxx
66.249.80.xxx
66.249.81.xxx
66.249.82.xxx
66.249.83.xxx
66.249.84.xxx
66.249.85.xxx
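The log search described above can be sketched in a few lines of Python. This is a minimal sketch, not the exact procedure used for this post: it assumes the common "combined" access log format (client IP as the first field, User-Agent as the last quoted field), and the function name `googlebot_networks` and the sample log lines are made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: "combined" log format, client IP first, User-Agent as the
# last quoted field on the line.
LOG_LINE = re.compile(r'^(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) .*"(?P<agent>[^"]*)"\s*$')
MARKER = "http://www.google.com/bot.html"


def googlebot_networks(lines):
    """Map each /24 network (written 'a.b.c.xxx') to the set of client IPs
    whose last quoted field contains the GoogleBot marker string."""
    networks = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if match and MARKER in match.group("agent"):
            ip = match.group("ip")
            prefix = ip.rsplit(".", 1)[0]  # drop the last octet
            networks[prefix + ".xxx"].add(ip)
    return dict(networks)


# Hypothetical log lines, for illustration only.
sample = [
    '66.249.72.5 - - [01/Jun/2013:10:00:00 +0200] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.72.9 - - [01/Jun/2013:10:00:05 +0200] "GET /a HTTP/1.1" 200 128 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.78.3 - - [01/Jun/2013:10:01:00 +0200] "GET /b HTTP/1.1" 200 256 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.7 - - [01/Jun/2013:10:02:00 +0200] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64)"',
]

for network, ips in sorted(googlebot_networks(sample).items()):
    print(network, len(ips))
```

To run it against a real log, pass an open file object instead of `sample`; the function only iterates over lines. Matching the marker string alone does not prove a visitor is GoogleBot, so the result is a list of candidates to verify.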
Google advises that the way to find out if an Origin IP belongs to GoogleBot is to do a reverse DNS resolution and look for crawl-xx-xxx-xx-xxx.googlebot.com in the result.
Applying that check, the final list to work with is:
66.249.72.xxx | 1.72.249.66.in-addr.arpa domain name pointer crawl-66-249-72-1.googlebot.com.
66.249.73.xxx | 1.73.249.66.in-addr.arpa domain name pointer crawl-66-249-73-1.googlebot.com.
66.249.74.xxx | 1.74.249.66.in-addr.arpa domain name pointer crawl-66-249-74-1.googlebot.com.
66.249.75.xxx | 1.75.249.66.in-addr.arpa domain name pointer crawl-66-249-75-1.googlebot.com.
66.249.76.xxx | 1.76.249.66.in-addr.arpa domain name pointer crawl-66-249-76-1.googlebot.com.
66.249.77.xxx | 1.77.249.66.in-addr.arpa domain name pointer crawl-66-249-77-1.googlebot.com.
66.249.78.xxx | 1.78.249.66.in-addr.arpa domain name pointer crawl-66-249-78-1.googlebot.com.
I'm leaving out the networks 66.249.80.xxx, 66.249.81.xxx, etc. because they seem to be used by Feedfetcher-Google and Mediapartners-Google (AdSense), which are outside the scope of this post.
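The reverse DNS check can also be scripted. The sketch below uses Python's standard `socket` module; the function name `is_googlebot` and its injectable `reverse` and `forward` parameters are my own choices for illustration. It also adds a forward lookup to confirm that the returned name resolves back to the same IP, an extra precaution that is not part of the steps described above.

```python
import socket


def is_googlebot(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Return True if `ip` reverse-resolves to a *.googlebot.com name
    and that name resolves back to `ip`."""
    try:
        hostname = reverse(ip)[0]  # (hostname, aliases, addresses)
    except OSError:
        return False  # no PTR record or lookup failure
    if not hostname.rstrip(".").endswith(".googlebot.com"):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses


if __name__ == "__main__":
    # Needs working DNS; the result depends on the live records.
    print(is_googlebot("66.249.72.1"))
```

The resolver functions are passed as parameters only so that the logic can be exercised without network access; in normal use the defaults are enough.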
Latency = Hint
Nowadays it is tricky to know where an IP is located when it belongs to a big network. Anycast routing (like that used by the popular Google Public DNS service at 8.8.8.8) makes it hard to be certain. Google Inc. IP addresses are administratively registered in Mountain View, California, and without any further analysis that is the location you will conclude.
But when I ping those networks from my server (Paris, France), write down the round trip times in a table, and take a look at the Google Data Centers map, I can guess an approximate geographic location for each of those GoogleBot clusters:
IPv4 Network | Ping Round Trip | Location |
66.249.72.2 | 92 ms | USA East Coast ? |
66.249.73.2 | 114 ms | USA Mid West ? |
66.249.74.2 | 152 ms | USA West Coast ? |
66.249.75.2 | 96 ms | USA East Coast ? |
66.249.76.2 | (Not active since 2013-05-29) | Unknown |
66.249.77.2 | 274 ms | Unknown (Not USA nor Europe ?) |
66.249.78.2 | 13 ms | Dublin, Ireland ? |
Round trip time in milliseconds is not an accurate way to place a system on the map, but the question I'm trying to answer here is whether GoogleBot is in California or not. As you can see, there is no short answer, but at least we know that it is spread across different locations within the States and Europe.
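One way to see why latency is only a hint: a round trip time gives an upper bound on distance, never an exact position. The sketch below assumes a signal speed of roughly 200 km per millisecond in optical fiber (about two thirds of the speed of light in vacuum) and ignores routing detours and queuing delays, so the real distance is always shorter than the bound. The function name `max_distance_km` is mine; the round trip times are the ones from the table above.

```python
# Assumption: signals in optical fiber travel about 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def max_distance_km(rtt_ms):
    """Upper bound on the one-way distance for a given round trip time."""
    one_way_ms = rtt_ms / 2.0
    return one_way_ms * FIBER_KM_PER_MS


# Round trip times measured from Paris, taken from the table above.
measurements = {
    "66.249.72.2": 92,
    "66.249.73.2": 114,
    "66.249.74.2": 152,
    "66.249.75.2": 96,
    "66.249.77.2": 274,
    "66.249.78.2": 13,
}

for address, rtt in measurements.items():
    print(f"{address}: at most {max_distance_km(rtt):.0f} km from Paris")
```

Under these assumptions, 13 ms puts 66.249.78.2 within 1,300 km of Paris, which rules out the USA and is consistent with Dublin. At the other extreme, 274 ms gives a bound of 27,400 km, more than half the Earth's circumference, so that measurement alone excludes nothing.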