http://qs321.pair.com?node_id=258370
Category: HTML Utility
Author/Contact Info David A. Desrosiers, aka hacker
desrod at gnu-designs dot com
Description: Can-o-Raid is an offensive CGI that will pollute web-based email address harvester's data stores with thousands upon thousands of fake (non-existant) email addresses. The script is re-entrant, but doesn't look like it to the harvesters.

What it does, is generate a page of fake email addresses, which all "look" perfectly valid, but aren't. Many of the addresses shown in the page are mailto links, which lead nowhere, and others LOOK like mailto links, but are actually hrefs back into the script itself, trapping the harvester. A recent scan of my web logs shows one harvester getting 21,598 hits to this page in a night, which is roughly 4,319,600 fake email addresses that I stuffed their system with.

The benefit of this script, is that those fake email addresses will eventually overpopulate the "real" email addresses they have stored. If they sell their collection of email addresses to someone else, most of their collection will be junk, invalid. Eventually they'll have to delete their entire database of email addresses, and start again. Also, trying to deliver to a non-existant domain with a non-existant email address will slow down the delivery with millions of bogus DNS queries.

You can see this in action here. Hit reload a few times, and look VERY closely at some of those links.

This can certainly be improved and probably refactored, patches are welcome. I've forgotten where I got the idea for this, so apologies to whomever started me down this path, but here's the code thus far. Enjoy.

Update: Reduced the number of unnecessary comments (thanks halley)

Update: Added LAI's fix using map();

use strict;
use Data::Random::WordList;
use CGI qw/:standard/;
my $cgi         = CGI->new();

my ($punct,             # punctuation, . ! ? or :
    @punct,
    $punc,
    $tld);              # top-level domain

my $log         = "/webroot/logs/raid";
my $wordlist    = "/usr/share/dict/american-english";

#########################################################
# To activate this in Apache, add these two lines to your
# httpd.conf in the appropriate section:
#
# AddHandler cgi-script .cgi .pl
# AliasMatch ^/raid/.* /path/to/this/cgi/raid.pl
#             ^^^^^
my $ap_alias    = "/raid/";

#########################################################
# list of domains hosted on this machine that will always
# point to this script:
my @domains     = qw/www.foo.bar foo.bar foo.com/;

# Throttle
# sleep(10);

# wrapped for Perl Monks, this is not a multi-line regex
exit if ($ENV{HTTP_USER_AGENT} =~ /google|Googlebot|
                                   inktomi|search|
                                   altavista|wget|htdig/i);

if (-e $log ) { 
        open LOG, ">>$log";
} else {
        open LOG, ">$log";
}

my $time                = localtime;
print LOG "$time $ENV{'REMOTE_ADDR'}
      $ENV{'HTTP_HOST'}$ENV{'REQUEST_URI'}" .
    " $ENV{'HTTP_REFERER'} \"$ENV{'HTTP_USER_AGENT'}\"\n";
close LOG;

$punct[1]       = ".";
$punct[2]       = "!";
$punct[3]       = "\?";
$punct[4]       = ":"; 

my $numurl = my @url = map { "${_}${ap_alias}" } @domains;

#########################################################
# Select 'n' words at random from the list, sorted
my $wl = new Data::Random::WordList(
         wordlist => '/usr/share/dict/american-english');
my @word = $wl->get_words(2000);
$wl->close();

my $wordnum     = @word;

# Create a random title from those random words in the list
my $title       = $word[int(rand $wordnum)] . " " 
                . $word[int(rand $wordnum)] . " " 
                . $word[int(rand $wordnum)] . " " 
                . $word[int(rand $wordnum)];

print $cgi->header(), start_html(-title    => "$title",
                                 -bgcolor  => '#ffffff');

my $para        = int(rand 10)+3;
my $pagenum     = 0;

while($pagenum < $para) {
        $pagenum++;
        my $words_in_page = int(rand 80)+10;
        my $total_words = 0;
        while($total_words < $words_in_page) {
                $total_words++;
                my $prword = $word[int(rand $wordnum)];
                print "$prword";
                if((rand 10)<1) {
                        $punc = $punct[int(rand 4)+1];
                        print $punc . br() . "\n";
                }
                print " ";
        }
        print br(), "\n";
        my $num_addr = int(rand 10)+10;
        my $pres_addr = 0;
        while ($pres_addr < $num_addr) {
                my $urlpos;
                $pres_addr++;
                my $name = $word[int(rand $wordnum)];
                my $d1   = $word[int(rand $wordnum)];
                my $d2   = $word[int(rand $wordnum)];
                if((rand 4)>1) {
                        if((rand 3)>1) {
                                $tld = "com";
                        } else {
                                $tld = "net";
                        }
                } else {
                        $tld = "org";
                }
                my $mailaddr = $name . '@' . 
                               $d1 . $d2 . "." . $tld;

                if((rand 4)>3) {
                        my $urlh = "http://";
                        my $urlb = $url[int(rand $numurl)];
                        my $urlt = 
                                 $word[int(rand $wordnum)];

                        if ((rand 5)>1) {
                                $urlt .= ".html";
                        } else {
                                $urlt .= "/";
                        }
                        $urlpos = $urlh . $urlb . $urlt;

                } else {
                        $urlpos = "mailto:" . $mailaddr;
                }
                print a({-href => "$urlpos"}, 
                      "$mailaddr"), br(), "\n";
        }
}
print end_html(), "\n";
exit;