auden.pl: a poetry scraper

by eweaverp (Scribe)
on Jul 15, 2004 at 19:42 UTC
Category: Fun Stuff
Description: Grabs the complete works of any poet on plagiarist.com and writes them to a text file, alphabetized by title. Includes _PAGEBREAK_ markers for your favorite word processor's find/replace command.

Quick and dirty.

Invoke with ./auden.pl NumberOfPoetAccordingToPlagiaristCom OutputFilePrefix

e.g.: ./auden.pl 91 jeffers_robinson
#!/usr/bin/perl

use strict;
use warnings;

my $poet = $ARGV[0];
my $name = $ARGV[1];

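if (!$poet || !$name) {
    # Both arguments are required: the poet number and the output file prefix.
    print STDERR "Usage: $0 NumberOfPoetAccordingToPlagiaristCom OutputFilePrefix\n";
    exit(1);
}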

my $flag = 1;
my $count = 1;
my %hash;

print "Going to retrieve poet number $poet, who you say is $name.\n";

while ($flag) {

    $flag = 0;

    if ($count > 1) {
        print "On page $count.\n";
        $poet = $ARGV[0] . "\/$count";
    }

    unlink($name . "_poet.html");
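    # List-form system() bypasses the shell, so no quoting or escaping is needed.
    system("wget", "-q", "-O", $name . "_poet.html", "http://plagiarist.com/poetry/poets/$poet/");
    open(FILE, "<", $name . "_poet.html") or die "Cannot open ${name}_poet.html: $!";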

    my $line;
    $flag = 0;
    foreach $line (<FILE>) {
        if ($line =~ m{<li><a href="http://plagiarist\.com/poetry/(\d+)/">}) {
            $flag = 1;
            print "           Getting poem number $1.\n";
            # Remove the same temp file that wget writes below.
            unlink($name . "_temp.html");
            system("wget", "-q", "-O", $name . "_temp.html", "http://plagiarist.com/poetry/$1/");
            open(POEM, "<", $name . "_temp.html") or die "Cannot open ${name}_temp.html: $!";

            {
                local $/ = undef;
                my $poem = <POEM>;
                my ($match) = $poem =~ m{<div id="poem">(.*?)</div>}ms;
                my ($title) = $match =~ m/<title>(.*?)<\/title>/ms;
                $match =~ s/(?:<.*?>|\<!--|-->|\n\n\n)//mg;
                $match =~ s/^(?:Submitted by.*|    poem|    )$//mg;

                $hash{$title} .= $match;

            }

            close(POEM);

        }
    }

    close(FILE);
    
    $count ++;
}

print "Writing it down...\n";

unlink($name . ".txt");
open(OUTPUT, ">", $name . ".txt") or die "Cannot write $name.txt: $!";

my $title;

foreach $title (sort(keys(%hash))) {
    print "       $title\n";
    print OUTPUT "_PAGEBREAK_\n\n$hash{$title}\n";
}

close(OUTPUT);

unlink($name . "_temp.html");
unlink($name . "_poet.html");
Replies are listed 'Best First'.
Re: auden.pl: a plagiarist.com poetry scraper
by nmcfarl (Pilgrim) on Jul 16, 2004 at 17:01 UTC

    A resource I didn't know existed until I had a tool, your script, to use it effectively. Cool.

    Along the lines of constructive criticism, the excessive use of '\' makes the script a little hard to read at points. Using q{} and qq{}, and above all alternate delimiters on your m// (that is, going with m{} or m||), could make things a good bit more readable. See Regexp Quote-Like Operators.
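A minimal sketch of that suggestion, applied to the link-matching pattern from the script. The sample $line is made up for illustration (it only assumes the index page links look the way the script's regex expects), and the variable names are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical sample line, shaped like the index-page links the script looks for.
my $line = '<li><a href="http://plagiarist.com/poetry/1234/">A Poem</a>';

# Original style: "/" delimiters force every slash in the URL to be escaped.
my ($escaped) = $line =~ m/<li><a href=\"http\:\/\/plagiarist\.com\/poetry\/(\d+)\/\">/;

# Same pattern with m{} delimiters: slashes and quotes need no backslashes.
my ($readable) = $line =~ m{<li><a href="http://plagiarist\.com/poetry/(\d+)/">};

# qq{} builds the interpolated URL without escaping slashes or quotes.
my $url = qq{http://plagiarist.com/poetry/$readable/};

print "escaped:  $escaped\n";
print "readable: $readable\n";
print "url:      $url\n";
```

Both patterns capture the same poem number; only the delimiters differ, so the second form can be dropped in without changing what the script matches.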
