Google is complaining SmugMug is not reachable

Ferguson Registered Users Posts: 1,345 Major grins
edited June 1, 2013 in SmugMug Support
I got a warning email from Google that my site's robots.txt file had dropped below 67% reachable. Looking further, it appears to have escalated to 100% unreachable.

I only vaguely know what this means, but it sure sounds like something systemic at SmugMug. Is it?

I don't have a clue how to approach fixing it. And when I look at Webmaster Tools on Google, it also shows server errors starting back at the end of March, with the robots.txt problem starting in May.

Is there some action I should be taking?

Is this something configurable, or is this all your infrastructure?

Comments

  • Ferguson Registered Users Posts: 1,345 Major grins
    edited May 23, 2013
    Sorry, in case not obvious from the signature, the site is captivephotons.com which translates to LinwoodFerguson.Smugmug.com
  • mrneutron Registered Users Posts: 214 Major grins
    edited May 23, 2013
    Recently Google's web-crawling robot has been sending SmugMug customers over 3x the normal level of traffic. SmugMug restricts the level of bot traffic to prioritize actual usage, and when bots go over a certain level of traffic we respond with an HTTP "503" code, which advises Google's robot to come back later. This follows Google's recommendations for responding to overaggressive bot traffic. The downside is that Google's Webmaster Tools needlessly alerts users that their site is responding with "come back later". That would be a concern if it were happening with human traffic, but with bot traffic it's not a problem.
    Andy K
    SmugMug Support Hero
    help.smugmug.com
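The back-off described above can be sketched as a simple rate limiter: count recent bot requests and answer 503 with a Retry-After header once a limit is hit. The limit, window, and retry delay below are made-up numbers for illustration, not SmugMug's actual configuration.

```python
# Minimal sketch of throttling bot traffic with HTTP 503 + Retry-After.
# All numeric limits here are invented for illustration.
from collections import deque
import time

BOT_LIMIT = 100        # max bot requests per window (hypothetical)
WINDOW_SECONDS = 60

_bot_hits = deque()    # timestamps of recent bot requests

def respond_to_bot(now=None):
    """Return (status, headers) for an incoming bot request."""
    now = time.time() if now is None else now
    # Forget hits that have aged out of the window.
    while _bot_hits and now - _bot_hits[0] > WINDOW_SECONDS:
        _bot_hits.popleft()
    if len(_bot_hits) >= BOT_LIMIT:
        # Over the limit: ask the crawler to come back later.
        return 503, {"Retry-After": "120"}
    _bot_hits.append(now)
    return 200, {}
```

A well-behaved crawler that receives the 503 backs off and retries after the indicated delay, which is why Webmaster Tools reports the fetch as a failure even though nothing is actually broken.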
  • Ferguson Registered Users Posts: 1,345 Major grins
    edited May 23, 2013
    mrneutron wrote: »
    Recently Google's web-crawling robot has been sending SmugMug customers over 3x the normal level of traffic. SmugMug restricts the level of bot traffic to prioritize actual usage, and when bots go over a certain level of traffic we respond with an HTTP "503" code, which advises Google's robot to come back later. This follows Google's recommendations for responding to overaggressive bot traffic. The downside is that Google's Webmaster Tools needlessly alerts users that their site is responding with "come back later". That would be a concern if it were happening with human traffic, but with bot traffic it's not a problem.

    Well, from Google's perspective (i.e. the graphs I showed), it is not just being told to come back later; it has ramped up to a 100% failure rate.

    Should your subscribers be concerned about your capacity, given that it seems to be a choice between Google crawls and subscriber access? That's an ugly choice, considering how long and loud the complaints have been about Google search results on SmugMug sites. Yes, I know you can show me lots of highly ranked sites; I'm just saying that subscribers complaining about search results while Google is simultaneously complaining it can't access SmugMug is a bad look.
  • Ferguson Registered Users Posts: 1,345 Major grins
    edited May 25, 2013
    It's continuing, now going on 6 days, with the last 3 over a 75% failure rate.
  • mbonocore Registered Users Posts: 2,299 Major grins
    edited May 27, 2013
    Ferguson,

    Could you report your last 3 days of Google Webmaster Tools robots.txt errors? The more screenshots, the better.

    Thank you!

    Michael
  • Ferguson Registered Users Posts: 1,345 Major grins
    edited May 27, 2013
    mbonocore wrote: »
    Ferguson,

    Could you report your last 3 days of Google Webmaster Tools robots.txt errors? The more screenshots, the better.

    Thank you!

    Michael

    Thanks for staying after it.

    Not much better, and the server errors are getting worse (I'm not actually sure what they mean).
  • Ferguson Registered Users Posts: 1,345 Major grins
    edited May 27, 2013
    Does Fetch as Google help?
    I don't use this normally, so I'm not sure how to interpret it, but not one attempt at accessing my site from Google worked.
  • Ferguson Registered Users Posts: 1,345 Major grins
    edited May 27, 2013
    Fetch as Google
    Here's another site at approximately the same time on the same system, so there's something unusual about SmugMug. I also checked the robots crawl section and it shows no errors there at all.
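For what it's worth, the percentages in Google's warning are just the fraction of robots.txt fetch attempts that succeed over the reporting window. A hypothetical helper (mine, not anything Google publishes) shows how the 67% and 100% figures arise:

```python
def reachable_percent(statuses):
    """Percentage of robots.txt fetch attempts that succeeded.

    `statuses` is a list of HTTP status codes from successive fetch
    attempts; any 2xx counts as reachable, a 503 (or anything else)
    does not.
    """
    if not statuses:
        return 0.0
    ok = sum(1 for s in statuses if 200 <= s < 300)
    return 100.0 * ok / len(statuses)
```

So "dropped below 67% reachable" means roughly one fetch in three was getting the 503, and "100% unreachable" means every attempt in the window failed.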
  • mbonocore Registered Users Posts: 2,299 Major grins
    edited May 28, 2013
    Ferguson,

    My ops team looked into this and informed me that everything looks good from our end. We've logged about 2,800 bot page hits on www.captivephotons.com in the past week, about 1,600 of them from Googlebot. Another thing to try is to see if Webmaster Tools reports better results if you register as "www.captivephotons.com" instead of "captivephotons.com". Google might not be happy with an extra redirect.

    Can you try to register the captivephotons.com instead and let me know if this helps?

    Thanks!

    Michael
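The extra redirect Michael mentions is easy to observe by requesting the bare domain without following redirects and inspecting the Location header. A small sketch (the helper names are mine; `first_redirect` needs live network access, so only the offline string check is exercised here):

```python
import http.client
from urllib.parse import urlparse

def first_redirect(url):
    """Fetch `url` WITHOUT following redirects; return (status, Location).
    Requires network access -- shown only as a usage sketch."""
    p = urlparse(url)
    conn = http.client.HTTPConnection(p.hostname, p.port or 80, timeout=10)
    conn.request("GET", p.path or "/")
    resp = conn.getresponse()
    return resp.status, resp.getheader("Location")

def redirect_adds_www(requested_url, location_header):
    """True if the redirect target is the www. variant of the requested host."""
    req = urlparse(requested_url)
    loc = urlparse(location_header or "")
    return loc.hostname == "www." + (req.hostname or "")
```

If the bare domain answers with a 301/302 whose Location adds the "www.", Webmaster Tools registered against the bare domain sees an extra hop on every fetch.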
  • shandrew Administrators, Vanilla Admin Posts: 33 SmugMug Employee
    edited May 29, 2013
    Bonocore meant to write "Can you try to register www.captivephotons.com instead" on Google's webmaster tools.
    I work at SmugMug but these opinions are usually my own.
  • Ferguson Registered Users Posts: 1,345 Major grins
    edited May 29, 2013
    shandrew wrote: »
    Bonocore meant to write "Can you try to register www.captivephotons.com instead" on Google's webmaster tools.

    I got that part; I need to figure out how -- not sure if this is the analytics piece itself, or a registration I may have done and forgotten for search. Is this just the analytics piece and the code I insert for that?
  • shandrew Administrators, Vanilla Admin Posts: 33 SmugMug Employee
    edited May 30, 2013
    Go to Google Webmaster Tools (https://www.google.com/webmasters/tools), select "ADD A SITE", enter www.captivephotons.com, and verify it using the tag method or the analytics method (whichever you used before should work fine; for tag verification, you would need to replace the tag in Account Settings -> Advanced customization -> Head Tag).
    I work at SmugMug but these opinions are usually my own.
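After replacing the tag in the Head Tag box, you can confirm it is actually being served by fetching a page and scanning the HTML for the verification meta tag. A quick offline sketch using only the standard library (helper names are mine):

```python
from html.parser import HTMLParser

class _MetaFinder(HTMLParser):
    """Collects the content of google-site-verification meta tags."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if d.get("name") == "google-site-verification":
                self.tokens.append(d.get("content"))

def find_verification_tokens(html):
    """Return all google-site-verification token strings found in `html`."""
    finder = _MetaFinder()
    finder.feed(html)
    return finder.tokens
```

If the token Webmaster Tools gave you shows up in the list, the tag-method verification should succeed.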
  • Ferguson Registered Users Posts: 1,345 Major grins
    edited May 31, 2013
    shandrew wrote: »
    Go to Google Webmaster Tools (https://www.google.com/webmasters/tools), select "ADD A SITE", enter www.captivephotons.com, and verify it using the tag method or the analytics method (whichever you used before should work fine; for tag verification, you would need to replace the tag in Account Settings -> Advanced customization -> Head Tag).

    Thanks, I did figure it out, but I remain a bit confused about what I see.

    First, so far Fetch as Google does get robots.txt, and it showed no crawl errors, though it only has one day's data point so far. It does show two soft 404 errors, which is strange, but I think it's too early to tell.

    What I don't understand is that the account with "www" attached shows "Total Indexed" pages at zero (well, actually there are no data points over the last year, though there are data points for "blocked by robots" in that period). The account without "www" shows about 12,000 to 18,000 pages indexed (strangely decreasing, though). So all the indexing has been showing up under the non-"www" account, even though all the internal links on SmugMug use "www" (i.e. if you load a page and follow links from one to the next).

    I don't quite know what to make of that -- it seems as though Google in some form or fashion drops the "www" when it does its indexing, even though SmugMug requires it?
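On the "blocked by robots" data points: once the crawler does retrieve robots.txt, those counts come from applying its rules URL by URL, which the standard library can illustrate. The rules below are invented for illustration, not captivephotons.com's actual robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly (no network) and ask, per URL,
# whether a given crawler is allowed to fetch it -- the same decision
# that produces "blocked by robots" counts in Webmaster Tools.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

allowed = rp.can_fetch("Googlebot", "http://www.captivephotons.com/gallery/")
blocked = rp.can_fetch("Googlebot", "http://www.captivephotons.com/private/x")
```

Note the host in the URL matters only to the crawler's bookkeeping, not to the rule match, which may be part of why the www and non-www accounts report such different pictures.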
  • Ferguson Registered Users Posts: 1,345 Major grins
    edited June 1, 2013
    I'm still trying to figure this out.

    I added a separate webmaster account for www.captivephotons.com as well as without the www.

    The former is appropriately loading robots.txt, but it is not showing any indexing statistics.

    Crawl errors are 3 (soft 404s), and robots.txt is fine on the www site. It continues not to be loaded consistently on the site without www. This is after 3 days of data.

    But indexing status on the www site is zero across the board. The indexing status of the site without www continues to show reasonable data (circa 18,000), interestingly dropping by about 4,000 over the period the robots.txt fetches have been failing.

    Since Google allows you to specify that the domain (without www) be displayed with the www, I'm not at all sure I understand the difference between having them registered both ways.

    But it concerns me that one is trending down while continuing to show crawl errors, and the other is showing no indexing at all.