Many bad bots either carry wellknown bad user-agent names (well, most modern bad bots don't
and hide behind faked UA strings (either other bots' names or user browser UA stings
(like "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" et al.),
but some old silly ones still do, like "Internet Explore 5.x",
"Mozilla/3.0 (compatible; Indy Library)") or ignore the
www.robotstxt.org/wc/exclusion.html robots.txt standard.
Originally, the Indy Library is a programming library which is
available at http://www.nevrona.com/Indy or http://indy.torry.net
under an Open Source license. This library is included with Borland
Delphi 6, 7, C++Builder 6, plus all of the Kylix versions.
Unfortunately, this library is hi-jacked and abused by some Chinese spam bots.
All recent user-agents with the unmodified "Indy Library" string were
of Chinese origin.
The following may help against rude and robots.txt-ignorant bots. However, there are more bad but smarter bots out there, which need more sophisticated countermeasures.
User-agent: * Disallow: /bot-trap/
<a href="/bot-trap/"><img src="images/pixel.gif" border="0" alt=" " width="1" height="1"></a>and wait watching your webserver's log to see who gets trapped.
<html>
<head><title> </title></head>
<body>
<p>There is nothing here to see. So what are you doing here ?</p>
<p><a href="http://your.domain.tld/">Go home.</a></p>
<?php
/* whitelist: end processing end exit */
if (preg_match("/10\.22\.33\.44/",$_SERVER['REMOTE_ADDR'])) { exit; }
if (preg_match("Super Tool",$_SERVER['HTTP_USER_AGENT'])) { exit; }
/* end of whitelist */
$badbot = 0;
/* scan the blacklist.dat file for addresses of SPAM robots
to prevent filling it up with duplicates */
$filename = "../blacklist.dat";
$fp = fopen($filename, "r") or die ("Error opening file ... <br>\n");
while ($line = fgets($fp,255)) {
$u = explode(" ",$line);
$u0 = $u[0];
if (preg_match("/$u0/",$_SERVER['REMOTE_ADDR'])) {$badbot++;}
}
fclose($fp);
if ($badbot == 0) { /* we just see a new bad bot not yet listed ! */
/* send a mail to hostmaster */
$tmestamp = time();
$datum = date("Y-m-d (D) H:i:s",$tmestamp);
$from = "badbot-watch@domain.tld";
$to = "hostmaster@domain.tld";
$subject = "domain-tld alert: bad robot";
$msg = "A bad robot hit $_SERVER['REQUEST_URI'] $datum \n";
$msg .= "address is $_SERVER['REMOTE_ADDR'], agent is $_SERVER['HTTP_USER_AGENT']\n";
mail($to, $subject, $msg, "From: $from");
/* append bad bot address data to blacklist log file: */
$fp = fopen($filename,'a+');
fwrite($fp,"$_SERVER['REMOTE_ADDR'] - - [$datum] \"$_SERVER['REQUEST_METHOD'] $_SERVER['REQUEST_URI'] $_SERVER['SERVER_PROTOCOL']\" $_SERVER['HTTP_REFERER'] $_SERVER['HTTP_USER_AGENT']\n");
fclose($fp);
}
?>
</body>
</html>
<?php include($DOCUMENT_ROOT . "/blacklist.php"); ?>Here is the text of the blacklist.php include file to exclude the bad ones:
<?php
$badbot = 0;
/* look for the IP address in the blacklist file */
$filename = "../blacklist.dat";
$fp = fopen($filename, "r") or die ("Error opening file ... <br>\n");
while ($line = fgets($fp,255)) {
$u = explode(" ",$line);
$u0 = $u[0];
if (preg_match("/$u0/",$_SERVER['REMOTE_ADDR'])) {$badbot++;}
}
fclose($fp);
if ($badbot > 0) { /* this is a bad bot, reject it */
sleep(12);
print ("<html><head>\n");
print ("<title>Site unavailable, sorry</title>\n");
print ("</head><body>\n");
print ("<center><h1>Welcome ...</h1></center>\n");
print ("<p><center>Unfortunately, due to abuse, this site is temporarily not available ...</center></p>\n");
print ("<p><center>If you feel this in error, send a mail to the hostmaster at this site,<br>
if you are an anti-social ill-behaving SPAM-bot, then just go away.</center></p>\n");
print ("</body></html>\n");
exit;
}
?>
[Please don't try to locate and access my /bot-trap, as your IP address will be locked
out immediately...].
Another method than the PHP stuff above (and working even better for your entire site) is to enter bad bots into the .htaccess file as follows (if you have access to your .htaccess and the Apache's configuration allows the following commands):
SetEnvIfNoCase User-Agent "Indy Library" bad_bot SetEnvIfNoCase User-Agent "Internet Explore 5.x" bad_bot SetEnvIf Remote_Addr "195\.154\.174\.[0-9]+" bad_bot SetEnvIf Remote_Addr "211\.101\.[45]\.[0-9]+" bad_bot Order Allow,Deny Allow from all Deny from env=bad_bot
To automate this process, you may adapt the techniques of (4) to
write to your .htaccess file automatically out of your /bot-trap.
But you have to be sure about what you are really doing: if you screw up
your .htaccess, your site will be dead.
If you have full access to your .htaccess, there are much more possibilities
you can do. You may read
www.webmasterworld.com/forum23/1281.htm for more details.
Here is a list of spiders and bots seen so far on kloth.net and kloth.org ignoring the robots.txt standard.
Comments and amendments are appreciated.
Please let me know, if there are other methods.
Send your comments to
<
hostmaster at kloth.net
>