How to keep Googlebot from indexing slideshow.html ?
Jean-Marc Liotier
Joined: 2004-03-07
Posts: 80
Posted: Tue, 2006-05-16 12:41
In the forums I have found a few occurrences of talk about keeping Googlebot from visiting pages it has no use indexing, such as slideshow.html. But I have not found the definitive recipe...

Googlebot indexing slideshow.html is a serious problem for me: it is single-handedly bringing my server to a crawl. Other pages do not seem to induce significant load, but each Googlebot visit to slideshow.html produces one Apache process and one MySQL process, both pegged at maximum CPU usage. And since the Googlebot visits are quite frequent, the processes keep piling up as more are added before the others finish processing their queries, and most of them end up being killed by the Apache per-request CPU time ceiling I configured. Loads of 30 are not uncommon because of that...

I have tried the following robots.txt:

User-agent: googlebot
Crawl-delay: 192
Disallow:

User-agent: *
Disallow: slideshow.html

It seems that the "Crawl-delay" directive reduces the intensity of the problem, but it definitely does not solve it. I do not want to exclude Googlebot completely, though: I just want to keep it from indexing slideshow.html, which is useless to it and CPU intensive...

I am in the process of trying the following to check if the second stanza overrides the first:

User-agent: *
Disallow: slideshow.html

But I don't know yet if it will solve the problem. Has anyone managed to do that?
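On the question of which stanza wins: under the robots exclusion convention, a crawler follows only the group whose User-agent line names it, and falls back to the "*" group only when no named group matches. The sketch below feeds the first robots.txt from this post to Python's standard urllib.robotparser. That library is not Googlebot's own parser, so treat the result as an approximation; the sample path is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The first robots.txt from the post above.
ROBOTS_TXT = """\
User-agent: googlebot
Crawl-delay: 192
Disallow:

User-agent: *
Disallow: slideshow.html
"""

# Hypothetical slideshow path, modelled on the URLs Gallery produces.
SLIDESHOW = "/main.php/v/my_album/my_photo.jpg/slideshow.html"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own group, so the "*" group is not consulted for it.
# That group's empty Disallow permits everything.
print(parser.can_fetch("Googlebot", SLIDESHOW))

# This library parses Crawl-delay; whether a given crawler honours it is a
# separate matter.
print(parser.crawl_delay("Googlebot"))
```

If this reading is right, any rule meant to apply to Googlebot has to live in the googlebot stanza itself, or the googlebot stanza has to be dropped so that the "*" group applies.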


Posts: 13452
Crawl-delay isn't supported by Google. Using the Disallow line should work fine, but I think you may need to include the full path:
Disallow: /gallery2/slideshow.html
That should do the trick. I use this to prevent spidering of folders on my website, and Googlebot behaves as told.
h0bbel - Gallery Team
If you found my help useful, please consider donating to Gallery
http://h0bbel.p0ggel.org
Posts: 80
The reason why I used "Disallow: slideshow.html" is that the URLs in the access log look like "/main.php/v/my_album/my_sub_album/my_photo.jpg/slideshow.html". I hoped that "Disallow: slideshow.html" would match anything ending in "slideshow.html", although I have not found anything conclusive documenting that sort of behavior.
Anyway I don't quite understand how "Disallow: /gallery2/slideshow.html" is supposed to match something like "/main.php/v/my_album/my_sub_album/my_photo.jpg/slideshow.html"... Is there something I missed ?
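The mismatch can be checked mechanically. In the original robots.txt convention a Disallow value is compared against the start of the URL path, so a value that is not a prefix of the path never matches. A minimal sketch with Python's standard urllib.robotparser, which implements plain prefix matching without any wildcard extensions (the helper name is_allowed is made up for this example):

```python
from urllib.robotparser import RobotFileParser


def is_allowed(disallow_value, path, agent="Googlebot"):
    """Return True if `path` may be fetched under a single Disallow rule."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    return parser.can_fetch(agent, path)


# URL shape taken from the access log quoted above.
LOGGED = "/main.php/v/my_album/my_sub_album/my_photo.jpg/slideshow.html"

for rule in ("slideshow.html", "/gallery2/slideshow.html"):
    # Neither value is a prefix of LOGGED, so neither rule blocks it.
    print(rule, "->", "allowed" if is_allowed(rule, LOGGED) else "blocked")

# The full-path rule only covers a URL that starts with exactly that path.
print(is_allowed("/gallery2/slideshow.html", "/gallery2/slideshow.html"))
```

Both rules leave the logged URL crawlable, which is consistent with the load problem persisting.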
Posts: 13452
My bad, I didn't really consider the URL properly.
As far as I can tell from the robots.txt documentation, you can't disallow the way you want. You might have to do some custom rewrite rules to be able to do this.
h0bbel - Gallery Team
Posts: 80
That is what I feared... Thanks for the confirmation ! Well I guess I'll have to refresh my mod_rewrite-fu a bit...
I can't believe that I am the only one to report this problem: if Googlebot hogs my server when spidering slideshow.html, then it probably does the same on everyone's server...
Posts: 80
Actually there is probably no need to play with mod_rewrite directly, as the nifty rewrite module can handle the details itself.
The "View Slideshow" rewrite target is currently "v/%path%/slideshow.html". The constant slideshow URL mark ("/slideshow.html") is on the right side of the variable path ("%path%") and this is why I could not express the slideshow ban in robots.txt syntax. But if I reverse this order a simple "Disallow: /v/slideshow/" line will work.
So let's try "v/slideshow/%path%" as a rewrite target for "View Slideshow"...
Posts: 80
The rewrite works. The URL resulting from the "v/slideshow/%path%" rewrite target is actually "/main.php/v/slideshow/%path%".
So here is the robots.txt line I will use :
Now let's wait and see if it has any effect on Googlebot...
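The robots.txt line itself has not survived in this copy of the thread. Given the URL shape reported above, it was presumably a Disallow on the /main.php/v/slideshow/ prefix; that value is an inference, not a quotation. A quick sketch with Python's urllib.robotparser, using made-up album and photo names, shows why moving the constant part of the URL to the front makes the ban expressible:

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: reconstructed from the "/main.php/v/slideshow/%path%" URL shape;
# the exact line posted in the thread is not available.
ROBOTS_TXT = """\
User-agent: *
Disallow: /main.php/v/slideshow/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def googlebot_may_fetch(path):
    """Return True if the rules above let Googlebot fetch `path`."""
    return parser.can_fetch("Googlebot", path)


# Hypothetical paths: a slideshow under the new rewrite target, a photo page,
# and a slideshow under the old rewrite target.
new_slideshow = "/main.php/v/slideshow/my_album/my_photo.jpg"
photo_page = "/main.php/v/my_album/my_photo.jpg.html"
old_slideshow = "/main.php/v/my_album/my_photo.jpg/slideshow.html"

print(googlebot_may_fetch(new_slideshow))  # blocked: path starts with the rule
print(googlebot_may_fetch(photo_page))     # still crawlable
print(googlebot_may_fetch(old_slideshow))  # old-style URL is not covered
```

One side effect worth keeping in mind: a top-level album that happened to be named "slideshow" would fall under the same prefix and be excluded as well.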
Posts: 80
I am quite happy to report that my solution is working ! I put it in the wiki : http://codex.gallery2.org/index.php/How_to_keep_robots_off_CPU_intensive_pages_that_are_useless_to_them
Now in the absence of a centralized multisite administration tool I have to deploy it on each gallery on my host... But that is another story...
Posts: 13452
Excellent. I renamed it and moved it to http://codex.gallery2.org/index.php/Gallery2:How_to_keep_robots_off_CPU_intensive_pages and added it to http://codex.gallery2.org/index.php/Gallery2:How_Tos#Performance
Thanks.
h0bbel - Gallery Team
Posts: 80
Now the question is : should that be the default behavior ? I believe it should be because I see no value in letting robots index slideshow.html : to an indexing robot this page is totally useless since the information it provides is redundant. Only album pages and photo pages (both including names, titles, summaries, descriptions, keywords and comments) contain information useful to search engines. Incidentally, they are the only items shown in the Google site map produced by the eponymous module. What is good for Google is probably good for the rest of the world...
Posts: 10
I've added a link to the 'CPU intensive pages' page from http://codex.gallery2.org/index.php/Gallery2:Performance_Tips ; hopefully that will help more people find it. (I had been googling for this for months before finding this tip.)
Jean-Marc: honestly, the performance is so bad (slideshow.html DoS'd my server every time Google or Yahoo indexed me) that I think slideshow should be turned off until it is robot-proofed or its performance is otherwise improved. It is unacceptable to ship software that, by default, amounts to a DoS attack on a modern server whenever it gets indexed.
I've added a note to that effect to http://sourceforge.net/tracker/index.php?func=detail&aid=1309306&group_id=7130&atid=107130 but I have no idea if that will get read/dealt with/etc.
Posts: 500
Upgrading to latest 2.2 SVN/nightly snapshot version should improve performance _anyway_, especially with exif enabled.
linkfelhő oldal | My gallery 2 about dogs
Posts: 10
I've fixed it now, for me personally, so I don't need to upgrade to SVN. Aside from that one page, my site's performance is fine, even when getting hammered by Google. My concern is fixing it before the next release, so that more people aren't bitten by this; the current situation is really not good, either for the users or for Gallery's reputation.