http://qs321.pair.com?node_id=258370
Category: HTML Utility
Author/Contact Info David A. Desrosiers, aka hacker
desrod at gnu-designs dot com
Description: Can-o-Raid is an offensive CGI that will pollute web-based email address harvesters' data stores with thousands upon thousands of fake (nonexistent) email addresses. The script is re-entrant, but doesn't look like it to the harvesters.

It generates a page of fake email addresses, which all "look" perfectly valid, but aren't. Many of the addresses shown in the page are mailto links, which lead nowhere, and others LOOK like mailto links, but are actually hrefs back into the script itself, trapping the harvester. A recent scan of my web logs shows one harvester making 21,598 hits to this page in a night, which is roughly 4,319,600 fake email addresses that I stuffed their system with.
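
As a rough cross-check of that figure, the sketch below works through the arithmetic. The quoted total corresponds to 200 addresses per hit, while the loop bounds in the code posted further down (3 to 12 paragraphs per page, 10 to 19 addresses per paragraph) give a lower average. The helper name `addresses_per_page_range` is made up for illustration, and the script that produced the logged hits may have used different bounds.

```perl
use strict;
use warnings;

# Loop bounds copied from the posted script:
#   paragraphs per page:     int(rand 10)+3   -> 3..12
#   addresses per paragraph: int(rand 10)+10  -> 10..19
# Hypothetical helper, not part of the original script.
sub addresses_per_page_range {
    my ($min_para, $max_para, $min_addr, $max_addr) = (3, 12, 10, 19);
    # Both counts are drawn uniformly and independently, so the expected
    # total is the product of the two means.
    my $mean = (($min_para + $max_para) / 2) * (($min_addr + $max_addr) / 2);
    return ($min_para * $min_addr, $mean, $max_para * $max_addr);
}

my $hits = 21_598;
my ($min, $mean, $max) = addresses_per_page_range();
printf "per page: min %d, mean %.2f, max %d\n", $min, $mean, $max;
printf "quoted total implies %d addresses per hit\n", 4_319_600 / $hits;
printf "total at the mean page size: about %d\n", $hits * $mean;
```

Either way the harvester ends up with millions of junk addresses from a single night.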

The benefit of this script is that those fake email addresses will eventually outnumber the "real" email addresses they have stored. If they sell their collection of email addresses to someone else, most of their collection will be invalid junk. Eventually they'll have to delete their entire database of email addresses and start again. Also, trying to deliver to a nonexistent address at a nonexistent domain will slow down the delivery with millions of bogus DNS queries.

You can see this in action here. Hit reload a few times, and look VERY closely at some of those links.

This can certainly be improved and probably refactored; patches are welcome. I've forgotten where I got the idea for this, so apologies to whoever started me down this path, but here's the code thus far. Enjoy.

Update: Reduced the number of unnecessary comments (thanks halley)

Update: Added LAI's fix using map();

use strict;
use Data::Random::WordList;
use CGI qw/:standard/;
my $cgi         = CGI->new();

my ($punct,             # punctuation, . ! ? or :
    @punct,
    $punc,
    $tld);              # top-level domain

my $log         = "/webroot/logs/raid";
my $wordlist    = "/usr/share/dict/american-english";

#########################################################
# To activate this in Apache, add these two lines to your
# httpd.conf in the appropriate section:
#
# AddHandler cgi-script .cgi .pl
# AliasMatch ^/raid/.* /path/to/this/cgi/raid.pl
#             ^^^^^
my $ap_alias    = "/raid/";

#########################################################
# list of domains hosted on this machine that will always
# point to this script:
my @domains     = qw/www.foo.bar foo.bar foo.com/;

# Throttle
# sleep(10);

# wrapped for Perl Monks; /x makes the regex ignore the line breaks
exit if ($ENV{HTTP_USER_AGENT} =~ /google|Googlebot|
                                   inktomi|search|
                                   altavista|wget|htdig/ix);

# ">>" appends, and creates the log if it does not exist yet
if (open(my $log_fh, '>>', $log)) {
        my $time = localtime;
        print $log_fh "$time $ENV{'REMOTE_ADDR'} " .
            "$ENV{'HTTP_HOST'}$ENV{'REQUEST_URI'}" .
            " $ENV{'HTTP_REFERER'} \"$ENV{'HTTP_USER_AGENT'}\"\n";
        close $log_fh;
}

$punct[1]       = ".";
$punct[2]       = "!";
$punct[3]       = "\?";
$punct[4]       = ":"; 

my $numurl = my @url = map { "${_}${ap_alias}" } @domains;

#########################################################
# Select 'n' words at random from the list, sorted
my $wl = Data::Random::WordList->new(wordlist => $wordlist);
my @word = $wl->get_words(2000);
$wl->close();

my $wordnum     = @word;

# Create a random title from those random words in the list
my $title       = $word[int(rand $wordnum)] . " " 
                . $word[int(rand $wordnum)] . " " 
                . $word[int(rand $wordnum)] . " " 
                . $word[int(rand $wordnum)];

print $cgi->header(), start_html(-title    => "$title",
                                 -bgcolor  => '#ffffff');

my $para        = int(rand 10)+3;
my $pagenum     = 0;

while($pagenum < $para) {
        $pagenum++;
        my $words_in_page = int(rand 80)+10;
        my $total_words = 0;
        while($total_words < $words_in_page) {
                $total_words++;
                my $prword = $word[int(rand $wordnum)];
                print "$prword";
                if((rand 10)<1) {
                        $punc = $punct[int(rand 4)+1];
                        print $punc . br() . "\n";
                }
                print " ";
        }
        print br(), "\n";
        my $num_addr = int(rand 10)+10;
        my $pres_addr = 0;
        while ($pres_addr < $num_addr) {
                my $urlpos;
                $pres_addr++;
                my $name = $word[int(rand $wordnum)];
                my $d1   = $word[int(rand $wordnum)];
                my $d2   = $word[int(rand $wordnum)];
                if((rand 4)>1) {
                        if((rand 3)>1) {
                                $tld = "com";
                        } else {
                                $tld = "net";
                        }
                } else {
                        $tld = "org";
                }
                my $mailaddr = $name . '@' . 
                               $d1 . $d2 . "." . $tld;

                if((rand 4)>3) {
                        my $urlh = "http://";
                        my $urlb = $url[int(rand $numurl)];
                        my $urlt = 
                                 $word[int(rand $wordnum)];

                        if ((rand 5)>1) {
                                $urlt .= ".html";
                        } else {
                                $urlt .= "/";
                        }
                        $urlpos = $urlh . $urlb . $urlt;

                } else {
                        $urlpos = "mailto:" . $mailaddr;
                }
                print a({-href => "$urlpos"}, 
                      "$mailaddr"), br(), "\n";
        }
}
print end_html(), "\n";
exit;
Replies are listed 'Best First'.
Re: Can-o-Raid v1.0
by halley (Prior) on May 15, 2003 at 13:30 UTC

    Spammers sometimes bumble across real addresses by randomly creating names in a way similar to yours. They don't care if it's a low-yield enterprise, as long as it's not a zero-yield enterprise. You may be slowing their spiders, but are you assisting their name guessing?

    Your legit-spider check is a multiple line regex without the /x modifier. Are you sure you're matching what you think you're matching? And what happens when spammers adjust their spiders to spoof a friendly google spider?
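
halley's point about the missing /x can be checked with a small sketch. It compiles the same alternation twice, wrapped across lines as in the post, once with /i and once with /ix. The helper `is_friendly` and the user-agent strings are made up for illustration.

```perl
use strict;
use warnings;

# Without /x the line break and indentation become part of the
# alternatives that follow them, so "inktomi" and "altavista" only
# match when preceded by a newline and that exact run of spaces.
my $wrapped_plain = qr/google|Googlebot|
                       inktomi|search|
                       altavista|wget|htdig/i;

# With /x, whitespace inside the pattern is ignored.
my $wrapped_x     = qr/google|Googlebot|
                       inktomi|search|
                       altavista|wget|htdig/ix;

# Hypothetical helper: 1 if the agent matches the pattern, else 0.
sub is_friendly {
    my ($re, $agent) = @_;
    return (defined($agent) && $agent =~ $re) ? 1 : 0;
}

for my $agent ('inktomi-crawler', 'altavista-bot', 'wget/1.8') {
    printf "%-16s plain=%d x=%d\n", $agent,
        is_friendly($wrapped_plain, $agent),
        is_friendly($wrapped_x,     $agent);
}
```

Alternatives that sit on the first line, or that follow another alternative on the same line (such as "wget"), match either way, which is why the problem is easy to miss.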

    Tiny nit: your comment says you're choosing 1000 words but you pick 2000 instead. Don't include any magic numbers in comments, they're just prone to become stale as you adjust the code. This falls along the lines of my maxim, "Strategy in comments, tactics in code." The code snippet in question is already near-literate, so why babble? Save the comments for things that aren't visually obvious: the setup is heavy with comments but the main loop needs a little help.

    --
    [ e d @ h a l l e y . c c ]

Re: Can-o-Raid v1.0
by LAI (Hermit) on May 15, 2003 at 16:37 UTC

    Well done, hacker. I've seen this sort of thing before... can't remember where... but the more the merrier! One patch for you: As stated, your script only ever uses the first url in the list for http:// links. This is because $numurl is set to the size of @url before @url is populated. This fixes it, plus condenses your foreach into a map, which is sexy.

    #########################################################
    #my @url;

    my $time = localtime;
    print LOG "$time $ENV{'REMOTE_ADDR'} $ENV{'HTTP_HOST'}$ENV{'REQUEST_URI'}" .
        " $ENV{'HTTP_REFERER'} \"$ENV{'HTTP_USER_AGENT'}\"\n";
    close LOG;

    $punct[1] = ".";
    $punct[2] = "!";
    $punct[3] = "\?";
    $punct[4] = ":";

    #foreach (@domains) {
    #        push @url, "${_}${ap_alias}";
    #}

    # new code from LAI here:
    my $numurl = my @url = map { "${_}${ap_alias}" } @domains;
    #########################################################

    LAI

    __END__
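
A minimal sketch of why the order matters. The "early" half is an assumed reconstruction of the pre-fix behaviour LAI describes (count taken before the array is filled), not the original code, and the `$early_*` names are made up.

```perl
use strict;
use warnings;

my $ap_alias = "/raid/";
my @domains  = qw/www.foo.bar foo.bar foo.com/;

# Assumed pre-fix order: the count is taken while the array is still empty.
my @early_url;
my $early_count = @early_url;                  # 0, nothing pushed yet
push @early_url, "${_}${ap_alias}" for @domains;
# rand treats 0 as 1, so int(rand $early_count) is always 0 and only
# $early_url[0] would ever be picked.

# LAI's form: a list assignment evaluated in scalar context yields the
# number of elements on the right-hand side, so the count is taken
# after map has produced the list.
my $numurl = my @url = map { "${_}${ap_alias}" } @domains;

print "early count: $early_count, fixed count: $numurl\n";
print "last url: $url[-1]\n";
```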
Re: Can-o-Raid v1.0
by bronto (Priest) on May 16, 2003 at 07:41 UTC

    ++hacker, very good job.

    Just a question: how do you make sure that the domain names in generated mailto: links don't point to any existent domain? Since the address@domain.name contains real words (random, but real) I wouldn't be happy if, in a moment of bad luck, your script would generate my e-mail address to feed it to spammers :-)

    Ciao!
    --bronto


    The very nature of Perl to be like natural language--inconsistant and full of dwim and special cases--makes it impossible to know it all without simply memorizing the documentation (which is not complete or totally correct anyway).
    --John M. Dlugosz

      I'm thinking it mightn't be a terrible idea to run a cron script each night (or more often, if you've got cycles to spare) that generates a bunch of fake domains, checks them for nonexistence, and dumps them into a text file. Then you could use that as a dictionary. I'd say that would be too much overhead for an on-the-fly job, especially one designed to be hit a lot, but keeping that file filled and fresh wouldn't be hard.

      LAI

      __END__
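
The nightly job LAI describes could be sketched roughly as below. Both helper names and the output path are hypothetical. The default check uses Perl's built-in gethostbyname, and a name that fails to resolve is only a hint that the domain is unused, not proof that it is unregistered.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the candidates the resolver cannot find.
# $resolves is a code ref so the lookup can be swapped out or stubbed.
sub unresolvable_domains {
    my ($candidates, $resolves) = @_;
    $resolves ||= sub { defined gethostbyname($_[0]) };
    return grep { !$resolves->($_) } @$candidates;
}

# Hypothetical helper: write one domain per line for the CGI to read later.
sub write_domain_list {
    my ($path, @domains) = @_;
    open(my $fh, '>', $path) or die "Cannot write $path: $!";
    print {$fh} "$_\n" for @domains;
    close($fh) or die "Cannot close $path: $!";
    return scalar @domains;
}

# Example use from cron (path is a placeholder):
# my @fake = unresolvable_domains(\@candidate_domains);
# write_domain_list('/path/to/fake-domains.txt', @fake);
```

The CGI would then pick domains from that file instead of gluing two dictionary words together on every request.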

      I don't know how it is in the US, but in Germany any domain under the TLD "de" must have at least three characters.

      So any domain name from aa.de to zz.de would look valid, but has never been registered and never will be, so there are no DNS records for them. OK, once one knows about those country-specific details, every parser would check for such limitations as well. But anyway, what is the worth of a "spammers' mail list spammer" if the use of Email::Valid and Co. will ensure that such a list is clean?

      Have a nice day
      All decision is left to your taste
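
To illustrate how cheaply a harvester could apply such a rule, here is a small sketch. The three-character minimum is taken from the post above as stated and has not been checked against registry policy, and the function name is made up.

```perl
use strict;
use warnings;

# Hypothetical check: 1 if the label directly before ".de" is shorter
# than three characters (the rule as described in the post), else 0.
sub violates_de_length_rule {
    my ($address) = @_;
    # Capture the last label before the trailing ".de".
    return 0 unless $address =~ /[.\@]([^.\@]+)\.de\z/i;
    return length($1) < 3 ? 1 : 0;
}

for my $addr ('someone@aa.de', 'someone@mail.zz.de', 'someone@abc.de') {
    printf "%-20s %s\n", $addr,
        violates_de_length_rule($addr) ? 'rejected' : 'kept';
}
```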

Re: Can-o-Raid v1.0
by smitz (Chaplain) on May 15, 2003 at 13:56 UTC
    This *&!"£*% rules!
    Awesome job, ++
    I'm off to install it on every web server I have EVER had access to, now what was that geocities password...

    Smitz
Re: Can-o-Raid v1.0
by newrisedesigns (Curate) on May 15, 2003 at 19:32 UTC

    Nifty and well-executed, but what if you don't have a bunch of domains to bounce around on?

    KillSpam does about the same thing, but generates self-referencing URLs that will present the spider with endless pages of itself. It removes the reliance on having multiple domains.

    John J Reiser
    newrisedesigns.com

      You didn't look close enough at the mailto tags =)

      Note that the domains listed in @domains are LOCAL domains to protect against, not domains referenced in the urls matched under it. This script is a honeypot, and unless the harvester is smart, they can't get out of it. It's completely self-reentrant.

Re: Can-o-Raid v1.0
by Aristotle (Chancellor) on May 16, 2003 at 05:41 UTC
Re: Can-o-Raid v1.0
by ok (Beadle) on May 16, 2003 at 13:16 UTC
    Beautiful concept, but who's paying for the bandwidth?