norobotlog

by quartertone (Initiate)
on Sep 06, 2004 at 01:42 UTC ( #388697=sourcecode )
Category: Utility Scripts/Text Processing/Miscellaneous
Author/Contact Info Gary C. Wang (gary at quartertone.net)
www.quartertone.net
Description: I always look at my Apache server log files from the command line, and it always bothered me to see "GET /robots.txt" contaminating the logs. It was frustrating trying to visually determine which requests came from crawlers and which from actual users. So I wrote this little utility, which filters out all requests made from IP addresses that grab "robots.txt". I suspect there are GUI log parsers that provide the same functionality, but 1) I don't need something that heavy, 2) I like to code, 3) imageekwhaddyawant.
#!/usr/bin/perl
use strict;
use warnings;
# Apache logs robots filter-outer
# Author: Gary C. Wang
# Contact: gary@quartertone.net
# Website: www.quartertone.net
# Filename: norobotlog
#
# Usage: norobotlog [logfile_name]
#
# This script parses Apache log files and 
#   filters out entries from IP addresses 
#   that request "robots.txt" file, commonly
#   associated with webcrawlers and site indexers.
# Prior to usage, check regexp to make sure it matches your log format.
# My log format is something like:
#  192.168.0.xx - - [11/Jul/2004:22:25:22 -0400] "GET /robots.txt HTTP/1.0" 200 78

my %robots;
my $ip_ptn = '((\d{1,3}\.){3}\d{1,3})'; # this regexp matches IP addresses
my @file = <>; # read every line from the named log file(s), or from stdin if none is given

# First, find out which IPs are associated with crawlers
foreach (@file) {
    # ----- Adjust this pattern to match your log file -----
    $robots{$1} ++ if m/^$ip_ptn .+?robots\.txt/;
}

# Then weed those out, printing only lines from IPs that never requested robots.txt
foreach (@file) {
    # Anchored at line start so it picks up the same IP field as the first pass
    if (m/^$ip_ptn /) {
        print unless exists $robots{$1};
    }
}
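To try the two-pass idea without a real log file, here is a minimal sketch that wraps the same logic in a subroutine and runs it on a few made-up lines in the log format shown in the comments above. The name filter_robots and the sample data are hypothetical and not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): takes a list of
# log lines and returns those whose client IP never requested robots.txt.
sub filter_robots {
    my @lines = @_;
    my $ip_ptn = qr/((?:\d{1,3}\.){3}\d{1,3})/;   # dotted-quad IPv4 address

    # Pass 1: collect the IPs that fetched robots.txt
    my %robots;
    for (@lines) {
        $robots{$1}++ if m/^$ip_ptn .+?robots\.txt/;
    }

    # Pass 2: keep lines whose leading IP is not in that set.
    # Lines with no leading IP are dropped, as in the original script.
    return grep { m/^$ip_ptn / && !exists $robots{$1} } @lines;
}

# Made-up sample data in the log format described above
my @sample = (
    qq{10.0.0.1 - - [11/Jul/2004:22:25:22 -0400] "GET /robots.txt HTTP/1.0" 200 78\n},
    qq{10.0.0.1 - - [11/Jul/2004:22:25:23 -0400] "GET /index.html HTTP/1.0" 200 512\n},
    qq{10.0.0.2 - - [11/Jul/2004:22:26:01 -0400] "GET /index.html HTTP/1.1" 200 512\n},
);

print filter_robots(@sample);
```

With this sample, both lines from 10.0.0.1 are dropped because that address fetched robots.txt once, and only the 10.0.0.2 line is printed. Because every line is held in memory, a very large log might be better handled by reading the file twice instead.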
Replies are listed 'Best First'.
Re: norobotlog
by sintadil (Pilgrim) on Sep 11, 2004 at 13:46 UTC

    It may be a good idea to include other bot patterns, like the Googlebot and other search engine bots. Otherwise, this can be simplified to an egrep command, which is what I'd use anyway.
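One way the suggestion about other bot patterns might look, as a rough sketch only: it assumes the Apache "combined" log format, where the User-Agent is the last quoted field, whereas the sample line in the original script is in the shorter common format with no User-Agent. The helper names, the pattern list, and the sample lines are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: "combined" log format, with the User-Agent as the last
# double-quoted field on each line.
my $ip_ptn = qr/((?:\d{1,3}\.){3}\d{1,3})/;

# Illustrative list only; extend it for the crawlers you actually see.
my $bot_ua = qr/Googlebot|Slurp|msnbot/i;

# Hypothetical helper: returns 1 if the line's User-Agent looks like a bot.
sub is_bot_agent {
    my ($line) = @_;
    my ($agent) = $line =~ /"([^"]*)"\s*$/;
    return defined $agent && $agent =~ $bot_ua ? 1 : 0;
}

# Same two-pass filter as the original, but an IP is also flagged when any
# of its requests carries a bot-like User-Agent.
sub filter_robots {
    my @lines = @_;
    my %robots;
    for (@lines) {
        next unless m/^$ip_ptn /;
        my $ip = $1;
        $robots{$ip}++ if m/robots\.txt/ || is_bot_agent($_);
    }
    return grep { m/^$ip_ptn / && !exists $robots{$1} } @lines;
}

# Made-up sample lines in combined format
my @sample = (
    qq{10.0.0.3 - - [11/Jul/2004:22:27:00 -0400] "GET /index.html HTTP/1.1" 200 512 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"\n},
    qq{10.0.0.3 - - [11/Jul/2004:22:27:05 -0400] "GET /about.html HTTP/1.1" 200 300 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"\n},
    qq{10.0.0.4 - - [11/Jul/2004:22:28:00 -0400] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"\n},
);

print filter_robots(@sample);
```

Here the 10.0.0.3 lines are removed even though that address never asked for robots.txt, and only the 10.0.0.4 line is printed. User-Agent strings are easy to spoof, so this complements the robots.txt check rather than replacing it.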
