(This is a part of the Spambot Beware site)
This section covers detecting spambots - why you would want to, and some ways to do it. Most of the more advanced detection tricks require access to CGI and your raw access logs.
Why do you want to detect spambots? To put it simply, knowledge is power. Besides, it's always nice to know when people are abusing your site by running a spambot through it. Detecting them also helps you refine your anti-spambot tricks, by knowing where and how often they strike. It also makes it easier to refine your pages so that normal users are not affected as much by the spambots.
One of the best ways to detect spambots is to have more than one email address, then look carefully at your spam and see which address it was sent to. These days, there are a number of free email services you can use. (Note, however, that most of these services have spam filtering, so you may not receive the spam even if it is sent.) Take an email account that you do not use much, and put its address on a webpage. Don't give it out anywhere else. When you receive spam at this address, you know how and where the spammer got it.
A "plussed" email address may not be available on all systems. When in doubt, send yourself some email to test it! A plussed email address is one in which a plus sign, followed by some other letters, is added after the username. The email is still delivered as if the plus sign and the other letters were not there. You can then look at your email and see what comes after the plus. For example, if your email address were "bill@abcdefg.com", you could also use "bill+spamtrap@abcdefg.com", "bill+monkey@abcdefg.com", or even "bill+FromMyWebpage@abcdefg.com".
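To see which plussed address a piece of spam was sent to, you can pull the tag out of the recipient address. Here is a minimal Python sketch of that idea. The function names `plus_tag` and `tally_tags` are made up for this example, and it assumes you have already collected the recipient addresses from your mail.

```python
from collections import Counter


def plus_tag(address):
    """Return the text after the plus sign in the username, or None."""
    local, _, _domain = address.rpartition("@")
    if "+" not in local:
        return None
    return local.split("+", 1)[1]


def tally_tags(recipients):
    """Count how many messages arrived for each plus tag."""
    return Counter(tag for tag in map(plus_tag, recipients) if tag)
```

For example, `plus_tag("bill+spamtrap@abcdefg.com")` gives back `"spamtrap"`, which tells you the address came from wherever you published the "spamtrap" variant.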
Another way to do this is to use SMTP comments in the email address. The comment occurs in parentheses, and appears like this: "bill(spamtrap)@abcdefg.com" or "bill@abcdefg(spamtrapper2).com", both of which are the same as "bill@abcdefg.com" - the items in the parentheses are ignored, similar to the plus method above. Note that some spambots may not pick up the email address if it has a parenthesis in it, but most probably will.
A very nice way to detect not only where a spambot is getting your email addresses, but when, is to use dynamic email addresses. The idea is to have the addresses on the page change over time, allowing you to tell when an address was taken from the page. Using the plussed email is one way - another, if you own a site, is to create random email addresses to populate it. The following program is run as a cron job, and changes the email address to a pseudo-random name every hour. The actual web page is changed, so that spam can be tracked to the very hour in which the address was stolen. Note that the email addresses generated are of the form XXYY@, where XX is a random alphabetic string, and YY is the hour (from 0-23). The source code:
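What follows is not the original program, but a minimal Python sketch of the same idea. It makes several assumptions: the page marks the address with the comment pair `<!--trap-->` and `<!--/trap-->` (a convention invented for this example), your mail server accepts mail for any username at your domain, and the script is started hourly by cron with the page path and domain as arguments.

```python
import random
import re
import string
import sys
import time

# Hypothetical markers around the address in the web page.
MARKER = re.compile(r"(<!--trap-->).*?(<!--/trap-->)", re.S)


def make_address(domain, hour, length=6, rng=random):
    """Build an XXYY@domain address: random letters, then the hour (0-23)."""
    name = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return f"{name}{hour}@{domain}"


def update_page(html, address):
    """Replace whatever sits between the markers with the new address."""
    return MARKER.sub(lambda m: m.group(1) + address + m.group(2), html)


# Usage (from cron, once an hour): script.py /path/to/page.html abcdefg.com
if __name__ == "__main__" and len(sys.argv) == 3:
    page_path, domain = sys.argv[1], sys.argv[2]
    with open(page_path) as page:
        html = page.read()
    address = make_address(domain, time.localtime().tm_hour)
    with open(page_path, "w") as page:
        page.write(update_page(html, address))
```

You would also want to keep a record of which address was published when, since the hour alone does not tell you the day.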
By using a page that only spambots are likely to visit, you can keep track of them not only by checking the access log for that page, but also by having a small CGI script that writes a log of all the viewers of the page. You could also have the script triggered by certain actions on a page, or even by checking the path that was taken to get there. See the section on luring spambots for more ideas.
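A logging script of the kind described above might look like the following Python sketch. The log location and the function names are assumptions made for this example; the environment variables REMOTE_ADDR, HTTP_USER_AGENT and HTTP_REFERER are the standard ones a web server passes to CGI programs.

```python
#!/usr/bin/env python3
import os
import time

LOG_PATH = "/tmp/spambot-trap.log"  # hypothetical location, change to suit


def format_entry(environ, now=None):
    """Build one tab-separated log line: time, address, agent, referer."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    keys = ("REMOTE_ADDR", "HTTP_USER_AGENT", "HTTP_REFERER")
    values = [environ.get(key, "-").replace("\t", " ") for key in keys]
    return "\t".join([stamp] + values)


def log_visit(environ, log_path=LOG_PATH):
    """Append the visitor's details to the trap log."""
    with open(log_path, "a") as log:
        log.write(format_entry(environ) + "\n")


# Only act when actually run by the web server as a CGI program.
if __name__ == "__main__" and "GATEWAY_INTERFACE" in os.environ:
    log_visit(os.environ)
    print("Content-Type: text/html")
    print()
    print("<html><body><p>Nothing to see here.</p></body></html>")
```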
Some spambots are bold enough (or dumb enough) to announce their presence. Look in your access logs for suspicious looking USER_AGENT fields. One clue is a small number of different IP addresses using a particular agent. You will also see some entries for "good robots", like these:
For an excellent list of web robots, take a look at The Web Robots Database. Notice how the Inktomi robot even leaves a web address about itself - very nice. If only they all did that...
For a good list of spambots and ways to configure your webserver to do something about them, see Protect your Webserver From Spam Harvesters.
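The user agent check described above (few IP addresses behind one agent) can be sketched in Python as follows. This assumes your access log is in the "combined" format, with the referer and user agent as the last two quoted fields; the function name is made up for this example.

```python
import re
from collections import defaultdict

# Combined log format: ip, two ids, [time], "request", status, size,
# "referer", "user agent".
LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)


def agents_by_ip_count(lines):
    """Return (agent, distinct IPs, hits), fewest distinct IPs first."""
    ips = defaultdict(set)
    hits = defaultdict(int)
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ip, agent = match.groups()
        ips[agent].add(ip)
        hits[agent] += 1
    rows = [(agent, len(ips[agent]), hits[agent]) for agent in ips]
    return sorted(rows, key=lambda row: (row[1], -row[2]))
```

Agents near the top of the list with many hits from one or two addresses are the ones worth a closer look. Keep in mind that a spambot can send any agent string it likes, so a normal-looking agent proves nothing.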
Spambots tend to be very impolite robots - not only do they tend to ignore robots.txt, but they greedily grab many web pages, without even waiting a bit (most robots have a small delay between "fetches" to avoid slowing down the server). This can be used to your advantage by looking at your access logs for not only IP addresses that hit many of your pages, but that did so in a short amount of time. Basically, spambots will have a high number of hits, and a short time between hits.
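Here is a rough Python sketch of that check: it looks for addresses with many hits and a short average time between them. The thresholds (`min_hits`, `max_avg_gap`) are arbitrary values picked for this example, and the log is assumed to use the usual `[30/Mar/2003:10:00:00 -0500]` style of timestamp.

```python
import re
from collections import defaultdict
from datetime import datetime

LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')
STAMP = "%d/%b/%Y:%H:%M:%S %z"


def fast_hitters(lines, min_hits=20, max_avg_gap=2.0):
    """Return (ip, hits, average seconds between hits) for greedy clients."""
    times = defaultdict(list)
    for line in lines:
        match = LINE.match(line)
        if match:
            ip, stamp = match.groups()
            times[ip].append(datetime.strptime(stamp, STAMP))
    suspects = []
    for ip, stamps in times.items():
        if len(stamps) < max(min_hits, 2):
            continue  # too few hits to judge
        stamps.sort()
        span = (stamps[-1] - stamps[0]).total_seconds()
        avg_gap = span / (len(stamps) - 1)
        if avg_gap <= max_avg_gap:
            suspects.append((ip, len(stamps), avg_gap))
    return sorted(suspects, key=lambda row: row[2])
```

A real browser loading a page with many images can also produce a quick burst of hits, so you may want to count only requests for pages, not images.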
Spambots do not care about images (most robots do not, actually). Since they cannot view them, there is no point in downloading them, as it only wastes the spambot's time. This is another good way to detect robots: take a look at the access log and see who is not loading images. Most users nowadays use a graphical browser and browse with images turned on. If you have only a few small images, your users are much more likely to leave the images turned on. Here's a small script to list the users who are not loading images:
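What follows is not the original script, but a minimal Python sketch of the same check. It assumes that image URLs end in .gif, .jpg, .jpeg or .png, and the `min_pages` threshold is an arbitrary choice for this example.

```python
import re
from collections import defaultdict

LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)')
IMAGE = re.compile(r"\.(gif|jpe?g|png)(\?|$)", re.I)


def clients_without_images(lines, min_pages=3):
    """List addresses that fetched several pages but never an image."""
    pages = defaultdict(int)
    images = defaultdict(int)
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue
        ip, url = match.groups()
        if IMAGE.search(url):
            images[ip] += 1
        else:
            pages[ip] += 1
    return sorted(
        ip for ip, count in pages.items()
        if count >= min_pages and images[ip] == 0
    )
```

Note that a returning visitor may already have your images in the browser cache, so treat a short visit without image requests as a hint, not proof.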
The path a user takes through a site can be very instructive. Most users tend to follow a few common paths. Spambots can be detected because they do not follow links based on the content of the link, like a real user would, but merely based on its placement. The path will usually be linear: each link on the main page will be hit in order, then each on the second page, etc. Detecting a spambot becomes easy by looking for that path. Here is a small script that does just that:
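What follows is not the original script, but a rough Python sketch of one way to test for a linear path. It only checks the first level (the links on one page), and it assumes you supply the page's links yourself, in the order they appear. The function name and the `min_links` threshold are made up for this example.

```python
def follows_page_order(requested, page_links, min_links=3):
    """True if the visitor fetched the page's links in page order.

    requested  - URLs one visitor asked for, in the order requested
    page_links - the links on the page, in the order they appear
    """
    wanted = set(page_links)
    seen = []
    for url in requested:
        # Keep only the first visit to each link found on the page.
        if url in wanted and url not in seen:
            seen.append(url)
    # Linear means: the links visited are exactly the first N on the page.
    return len(seen) >= min_links and seen == page_links[:len(seen)]
```

A real user might click the first few links in order too, so this works best on pages with many links, where following all of them in sequence is unlikely.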
Here is a small program to show you the path a certain user has taken through a site:
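What follows is not the original program, but a minimal Python sketch that does the same job, treating one IP address as one user. It assumes the usual access log layout, where each line starts with the address and the request is the first quoted field; the function name is made up for this example.

```python
import re
import sys

LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)')


def path_for(lines, ip):
    """Return (timestamp, URL) pairs for one address, in log order."""
    matches = (LINE.match(line) for line in lines)
    return [
        (m.group(2), m.group(3))
        for m in matches
        if m and m.group(1) == ip
    ]


# Usage: script.py access_log 1.2.3.4
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as log:
        for stamp, url in path_for(log, sys.argv[2]):
            print(stamp, url)
```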
This is similar to the idea above of detection by using CGI traps. In this case, just search (grep is good) the access log for the "trap" URL. See the page on luring spambots for some ideas on how to limit URLs to only spambots.
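If you would like a count per visitor in place of raw grep output, a small Python sketch along these lines would do. The function name is made up for this example, and the trap URL you pass in is whatever path you chose for your own trap page.

```python
import re
from collections import Counter

LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)')


def trap_visitors(lines, trap_url):
    """Return (ip, hits) for everyone who requested the trap URL."""
    hits = Counter()
    for line in lines:
        match = LINE.match(line)
        # Ignore any query string when comparing against the trap URL.
        if match and match.group(2).split("?", 1)[0] == trap_url:
            hits[match.group(1)] += 1
    return hits.most_common()
```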
Written by Greg Sabino Mullane (greg "at" turnstep.com). Last update March 30, 2003.