http://qs321.pair.com?node_id=11115421

Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:

I want to chip in where I can on documentation efforts, and there's a lot of opportunity now. I noticed a couple of typos as I was going through WWW::Mechanize::Chrome and asked Corion if he would mind me getting involved with it. He said sure, so I've been working through the examples and then the links. Almost by necessity, one has to take a step back and make comparisons to WWW::Mechanize. I used it for years, but it has been 4 years since I last did, so anything I once knew about forms or the like is completely out the window. I selected my best candidate from the examples and wrapped it in my logging scheme, which is at least verbose in other cases. Here's the bash invocation followed by the source:

$ ./2.quotes.pl Fargo >1.txt
No matches for "Fargo" were found.
$ ll 1.txt
-rw-r--r-- 1 hogan hogan 128858 Apr 12 22:02 1.txt
$ cat 2.quotes.pl
#!/usr/bin/perl -w
use strict;
use 5.016;
use WWW::Mechanize;
use Getopt::Long;
use Text::Wrap;
use Log::Log4perl;
use Data::Dump;

my $log_conf = "/home/hogan/Documents/hogan/logs/conf_files/3.conf";
Log::Log4perl::init($log_conf);
my $logger = Log::Log4perl->get_logger();

#$logger->level('DEBUG');
my $match  = undef;
my $random = undef;
GetOptions(
    "match=s" => \$match,
    "random"  => \$random,
) or exit 1;

my $movie = shift @ARGV or die "Must specify a movie\n";

my $quotes_page = get_quotes_page($movie);
my @quotes      = extract_quotes($quotes_page);

if ($match) {
    $match  = quotemeta($match);
    @quotes = grep /$match/i, @quotes;
}

if ($random) {
    print $quotes[ rand @quotes ];
}
else {
    print join( "\n", @quotes );
}

sub get_quotes_page {
    my $movie = shift;

    my $mech = WWW::Mechanize->new;
    $mech->get("https://www.imdb.com/search/name-text/");
    $mech->success or die "Can't get the search page";

    open my $fh, '>', '/home/hogan/Documents/hogan/logs/1.form-log.txt'
      or die "Couldn't open logfile 'form-log.txt': $!";
    $mech->dump_forms($fh);

    my $ret1 = $mech->submit_form(
        form_number => 2,
        fields      => {
            title    => $movie,
            restrict => "Movies only",
        },
    );
    $logger->info("return1 is $ret1");

    # dd $ret1; # yikes
    if ( $ret1->is_success ) {
        $logger->info("Supposedly successful so far");
        print $ret1->decoded_content;
    }
    else {
        print STDERR $ret1->status_line, "\n";
    }

    my @links = $mech->find_all_links( url_regex => qr[^/Title] )
      or die "No matches for \"$movie\" were found.\n";

    # Use the first link
    my ( $url, $title ) = @{ $links[0] };

    warn "Checking $title...\n";

    $mech->get($url);
    my $link = $mech->find_link( text_regex => qr/Memorable Quotes/i )
      or die qq{"$title" has no quotes in IMDB!\n};

    warn "Fetching quotes...\n\n";
    $mech->get( $link->[0] );

    return $mech->content;
}

sub extract_quotes {
    my $page = shift;

    # Nibble away at the unwanted HTML at the beginning...
    $page =~ s/.+Memorable Quotes//si;
    $page =~ s/.+?(<a name)/$1/si;

    # ... and the end of the page
    $page =~ s/Browse titles in the movie quotes.+$//si;
    $page =~ s/<p.+$//g;

    # Quotes separated by an <HR> tag
    my @quotes = split( /<hr.+?>/, $page );

    for my $quote (@quotes) {
        my @lines = split( /<br>/, $quote );
        for (@lines) {
            s/<[^>]+>//g;    # Strip HTML tags
            s/\s+/ /g;       # Squash whitespace
            s/^ //;          # Strip leading space
            s/ $//;          # Strip trailing space
            s/&#34;/"/g;     # Replace HTML entity quotes

            # Word-wrap to fit in 72 columns
            $Text::Wrap::columns = 72;
            $_ = wrap( '', ' ', $_ );
        }
        $quote = join( "\n", @lines );
    }

    return @quotes;
}
__END__
$

When we have a look at what forms were available, we have:

$ cat /home/hogan/Documents/hogan/logs/1.form-log.txt
GET https://www.imdb.com/find [nav-search-form]
  navbar-search-category-select=<UNDEF> (checkbox) [*<UNDEF>/off|on]
  q= (text)
  <NONAME>=<UNDEF> (submit)
  ref_=nv_sr_sm (hidden readonly)

POST https://www.imdb.com/search/title-text/
  type=plot (option) [*plot/Plot|quotes/Quotes|trivia/Trivia|goofs/Goofs|crazy_credits/Crazy Credits|location/Filming Locations|soundtracks/Soundtracks|versions/Versions]
  query= (search)
  <NONAME>=<UNDEF> (submit)

POST https://www.imdb.com/search/name-text/
  type=bio (option) [*bio/Biographies|quotes/Quotes|trivia/Trivia]
  query= (search)
  <NONAME>=<UNDEF> (submit)
$
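To poke at the form list without hitting the network every time, here is a minimal sketch that parses a cut-down copy of the page with HTML::Form, the module WWW::Mechanize uses for forms. The HTML string is my own hypothetical stand-in, reduced to the actions and field names from the dump. It prints each form under both counting schemes, since WWW::Mechanize documents form_number as starting at 1 while the Perl array starts at 0. It also lists the field names, which in the dump are type and query rather than the title and restrict that the script passes.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical, cut-down stand-in for the IMDb page: only the form
# skeletons (actions and field names taken from the dump) are kept.
my $html = <<'HTML';
<form id="nav-search-form" method="get" action="https://www.imdb.com/find">
  <input type="text" name="q">
</form>
<form method="post" action="https://www.imdb.com/search/title-text/">
  <select name="type">
    <option value="plot" selected>Plot</option>
    <option value="quotes">Quotes</option>
  </select>
  <input type="search" name="query">
</form>
<form method="post" action="https://www.imdb.com/search/name-text/">
  <select name="type">
    <option value="bio" selected>Biographies</option>
    <option value="quotes">Quotes</option>
  </select>
  <input type="search" name="query">
</form>
HTML

# The second argument is the base URI used to resolve form actions
my @forms = HTML::Form->parse( $html, 'https://www.imdb.com/search/name-text/' );

# The Perl array counts from 0; Mechanize's form_number counts from 1
for my $i ( 0 .. $#forms ) {
    printf "array index %d / form_number %d: %s %s\n",
        $i, $i + 1, $forms[$i]->method, $forms[$i]->action;
}

# Field names of the title-text form (unnamed inputs are skipped)
my $title_form  = $forms[1];
my @field_names = grep { defined } map { $_->name } $title_form->inputs;
print "fields: @field_names\n";
```

If that reading is right, form_number => 2 already points at the title-text form, and the mismatch is in the field names rather than in the form index.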

The way I count it, with zero-indexing, we want form 1 rather than form 2. Either way, this is about where I lose the thread. Time and again I have trouble with the output overwhelming the terminal. With the log from Log4perl, I have:

2020/04/12 22:20:23 INFO return1 is HTTP::Response=HASH(0x5653c7bfc2e8)
2020/04/12 22:20:23 INFO Supposedly successful so far
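The first log line is just the default stringification of the response object, which tells me nothing. A minimal sketch of what I could log instead, assuming a one-line summary is enough: summarize_response is a hypothetical helper of mine, and the HTTP::Response built by hand here only stands in for what submit_form returns.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built stand-in for the object submit_form returns
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    '<html><head><title>Search results</title></head><body>...</body></html>',
);

# Hypothetical helper: condense a response into one loggable line
# instead of printing the whole decoded page.
sub summarize_response {
    my ($res) = @_;
    my $body = $res->decoded_content // '';
    my ($title) = $body =~ m{<title[^>]*>\s*(.*?)\s*</title>}si;
    return sprintf 'status=%s type=%s length=%d title=%s',
        $res->status_line,
        scalar( $res->content_type ),
        length($body),
        $title // '(none)';
}

print summarize_response($res), "\n";
```

In the script this would replace the interpolated object, along the lines of $logger->info( summarize_response($ret1) ), with the full page written to a file only when I actually want to read it.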

As far as I can tell, decoding this return value gives the whole page as HTML, splashed out onto STDOUT, and this leaves me confused and sifting through stuff meant for machines. I've tried passing other movies from the IMDb top 100 movie quotes as the argument.

$ ./2.quotes.pl Fargo >1.txt
No matches for "Fargo" were found.
$ ./2.quotes.pl Jaws >1.txt
No matches for "Jaws" were found.

How gratifying it would be to see:

We're gonna need a bigger boat.

Q1) What do I need to do to get this script working?

Q2) What is the relationship between WWW::Mechanize and modules like WWW::Mechanize::Gzip and WWW::Mechanize::Chrome? The former uses this line:

use base qw(WWW::Mechanize);

, while the latter seems to reference its "base" module in the raw source. Does either "inherit" anything from its base?

Update: I botched the second half of this question as I made the comparison. What I meant to ask was:

Q2) ..., while the latter seems to lack reference to its "base" module in the raw source.
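To make Q2 concrete, here is a minimal sketch of the two patterns I am trying to tell apart, using hypothetical packages rather than the real modules. My::Agent::Sub declares a parent the way use base does, while My::Agent::Lookalike offers the same method names without inheriting. I have not verified which pattern WWW::Mechanize::Chrome follows; the same isa check could be run against the real class names.

```perl
use strict;
use warnings;

# Hypothetical stand-ins, not the real modules.
package My::Agent {
    sub new { my ($class) = @_; return bless {}, $class }
    sub get { return 'My::Agent::get' }
}

package My::Agent::Sub {
    # Same effect on @ISA as "use base qw(My::Agent)"
    use parent -norequire, 'My::Agent';
}

package My::Agent::Lookalike {
    # Same interface, but no parent class declared
    sub new { my ($class) = @_; return bless {}, $class }
    sub get { return 'My::Agent::Lookalike::get' }
}

package main;

for my $class (qw(My::Agent::Sub My::Agent::Lookalike)) {
    no strict 'refs';    # needed to look up @ISA by package name
    printf "%-22s isa My::Agent: %-3s \@ISA=(%s) get() -> %s\n",
        $class,
        ( $class->isa('My::Agent') ? 'yes' : 'no' ),
        join( ',', @{"${class}::ISA"} ),
        $class->new->get;
}
```

The subclass gets new and get from its parent through @ISA, while the lookalike has to define everything itself.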

Q3) If I went to the link and typed in Jaws, would I get 128k worth of html?

$ ll 1.txt
-rw-r--r-- 1 hogan hogan 128858 Apr 12 22:02 1.txt

Q4) Doesn't "scraping" connote going after an entire class of files like images, or have I done it here without even trying?

Thanks for your comment,