Preventing a server meltdown and saving resources: Block all shit bots

Oct 11th, 2012 | Category: Linux / FreeBSD

Alright, I woke up today to a load average of 75. Seventy-fucking-FIVE. I didn't even know that shit was possible, and I'm pretty sure the guys at the DC were roasting marshmallows over my glowing box. I couldn't believe it, so into my server I went, waiting 5 minutes for every command I typed, and shut down apache, nginx, and mysql, since those are usually at the center of a shit storm like this (it is a webserver after all, running 100+ websites). Anyways, in my access logs I just saw a ton of entries showing the bullshit Baidu spider crawling away. Leave it to the Chinese to fuck with my shit. I've seen a lot of pings from Baidu since forever and never thought too much of it, but I finally decided to do something about this parasite, and while I was at it, any other parasite that was leeching my server dry.

If you don't know what a bot/spider is, then there's a great website called Google, and Google has all your answers.

Anyways, I wanted a way to block Baidu and Yandex and all the other shit spiders that rip my server to pieces. Sorry, people in China, but A.) your search engines suck balls and you are getting penalized because of it, and B.) your government sucks even more balls and blocks your access to Google and any other search engine that might let people learn how much balls they suck. Consequently, you'll no longer be able to search my shit. I know, I know, my loss, but somehow I'm just going to have to learn to deal with it. I'll shed a single tear in your absence.

Anyways, back to the problem at hand... I needed a way to block these bots server-wide. I have 100+ vhost entries in my vhost.conf file, and while at first I used sed to insert the necessary code into each VirtualHost entry, I found a much easier way to do it at a global level. You'll need to do 2 things:

  1. Create a robots.txt file in /var/www/ or /usr/home/ or wherever tickles your fancy and put the following in it:

User-agent: Baiduspider
Disallow: /

User-agent: Yandex
Disallow: /

User-agent: Exabot
Disallow: /

User-agent: Cityreview
Disallow: /

User-agent: Dotbot
Disallow: /

User-agent: Sogou
Disallow: /

User-agent: Sosospider
Disallow: /

User-agent: Twiceler
Disallow: /

User-agent: Java
Disallow: /

User-agent: YandexBot
Disallow: /

User-agent: bot
Disallow: /

User-agent: spider
Disallow: /

User-agent: crawl
Disallow: /

User-agent: NG 1.x (Exalead)
Disallow: /

User-agent: MJ12bot
Disallow: /

  2. Put the following in your modules.conf file (if that's how you have it set up), or in your mods-enabled/alias.conf file, or your apache2.conf file, or even your httpd.conf file, depending on how your server is laid out:

Alias /robots.txt /var/www/robots.txt
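
Depending on your Apache version, you may also need to explicitly allow access to the directory the file sits in, otherwise the alias will just throw 403s. A minimal sketch, assuming the file lives at /var/www/robots.txt like above (this is Apache 2.2 syntax; on Apache 2.4 the Order/Allow pair becomes a single "Require all granted" line):

<Directory /var/www>
    # let everyone fetch the aliased robots.txt
    Order allow,deny
    Allow from all
</Directory>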

Restart apache and voila, you are home free. Load average before this: seventy-fucking-five (75). Load average now: 1.03.
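
For the restart and a quick sanity check, something along these lines does the job (the exact restart command depends on your distro, and example.com is just a stand-in for one of your domains):

# make sure the config still parses
apachectl configtest

# restart apache (or: service apache2 restart, /etc/init.d/httpd restart, etc.)
apachectl restart

# confirm the alias is actually being served
curl http://example.com/robots.txt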

 

Bonus section:

If for whatever reason you do want to use mod_rewrite to do this, go into the directory holding your vhost.conf file (or whatever file contains all of your VirtualHost entries for all of your domains) and use sed like so:

sed -i "s/80>/80>\nRewriteEngine on\nRewriteRule ^\/robots.txt$ \/var\/www\/robots.txt [NC,L]\n/g" vhost.conf
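
After it runs, the top of each VirtualHost entry should end up looking something like this (example.com and the DocumentRoot are obviously placeholders for your own values):

<VirtualHost *:80>
RewriteEngine on
RewriteRule ^/robots.txt$ /var/www/robots.txt [NC,L]

    ServerName example.com
    DocumentRoot /var/www/example.com
</VirtualHost>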

*WARNING: BACK UP YOUR FILE FIRST SO YOU CAN UNDO THIS SHIT IF SOMETHING GOES WRONG*

Less intense warning: You might want to run the above command with -e instead of -i first, so sed just prints the result to stdout and you can preview the changes without touching the file.
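
In other words, something like this, which leaves vhost.conf alone and just dumps the rewritten version to your terminal for inspection:

sed -e "s/80>/80>\nRewriteEngine on\nRewriteRule ^\/robots.txt$ \/var\/www\/robots.txt [NC,L]\n/g" vhost.conf | less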

Again, this all depends on how you have your server set up. If 80 is the port that all your vhosts map to, then great. If you are running nginx in front like me, then it's going to be 8080, so you'll need to change those two 80's to two 8080's. If you have some domains running SSL on 443, then you'll need to rerun this with 443 in place of 80. Oh, and you should be running this from bash; I have no idea if this syntax will work in tcsh or csh. Of course you'll still need that robots.txt file mentioned above.
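
For example, an SSL pass would look like this, assuming those vhost entries open with <VirtualHost *:443> (same back-up-first warning applies):

sed -i "s/443>/443>\nRewriteEngine on\nRewriteRule ^\/robots.txt$ \/var\/www\/robots.txt [NC,L]\n/g" vhost.conf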

 

Hope that helps, enjoy!


3 Comments to “Preventing a server meltdown and saving resources: Block all shit bots”

  1. Chris B says:

    Good POST. I am getting totally pissed off with crap search engines and comment SPAMMERS trying to feed their foreign-language shit onto my sites in the hope Baidu or Yandex pick up the shitty crap.

  2. Simon says:

    Baidu honors robots.txt. But I have noticed several other crawlers masquerading as Baidu that do not honor it. Are you still seeing any traffic that claims to be Baidu? I still get quite a lot of traffic from the 61.135.190.0/24 subnet that claims to be Baidu, but it doesn’t look like that subnet is owned by them. Just have to block these aberrants using .htaccess.
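
    For reference, a minimal .htaccess sketch for dropping a fake-Baidu subnet like the one Simon mentions (Apache 2.2 syntax; on 2.4 you would use Require directives instead, and the subnet here is just the example from his comment):

    # deny the subnet masquerading as Baidu, allow everyone else
    Order allow,deny
    Deny from 61.135.190.0/24
    Allow from all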

  3. admin says:

    Not 100% sure... Since then I've put a custom solution in place that will auto-ban IPs if they exceed thresholds that I deem safe and normal.
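
    That custom solution isn't published here, but the general idea can be sketched in a few lines of shell. Everything below is illustrative only: the log path, the 500-request threshold, and the use of iptables are assumptions, not the actual setup.

    #!/bin/sh
    # Sketch: ban any client IP with more than THRESHOLD hits in the access log.
    LOG=/var/log/apache2/access.log
    THRESHOLD=500

    # first field of a common/combined log line is the client IP
    awk '{print $1}' "$LOG" | sort | uniq -c | \
    while read count ip; do
        if [ "$count" -gt "$THRESHOLD" ]; then
            # drop further traffic from this IP (re-runs will add duplicate rules)
            iptables -I INPUT -s "$ip" -j DROP
        fi
    done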
