http://qs321.pair.com?node_id=11115421

Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:

I want to chip in where I can on documentation efforts, and there's a lot of opportunity now. I noticed a couple of typos as I was going through WWW::Mechanize::Chrome and asked Corion if he would mind me getting involved with it. He said sure, so I've been working through the examples and then the links. Almost by necessity, one has to take a step back and make comparisons to WWW::Mechanize. I used it for years, but I haven't touched it in 4 years, so anything I once knew about forms or the like is completely out the window. I selected my best candidate from the examples and wrapped it in the logging scheme that is at least verbose in other cases. Here's the bash invocation followed by the source:

$ ./2.quotes.pl Fargo >1.txt
No matches for "Fargo" were found.
$ ll 1.txt
-rw-r--r-- 1 hogan hogan 128858 Apr 12 22:02 1.txt
$ cat 2.quotes.pl
#!/usr/bin/perl -w
use strict;
use 5.016;
use WWW::Mechanize;
use Getopt::Long;
use Text::Wrap;
use Log::Log4perl;
use Data::Dump;

my $log_conf = "/home/hogan/Documents/hogan/logs/conf_files/3.conf";
Log::Log4perl::init($log_conf);
my $logger = Log::Log4perl->get_logger();
#$logger->level('DEBUG');

my $match  = undef;
my $random = undef;
GetOptions(
    "match=s" => \$match,
    "random"  => \$random,
) or exit 1;

my $movie = shift @ARGV or die "Must specify a movie\n";

my $quotes_page = get_quotes_page($movie);
my @quotes      = extract_quotes($quotes_page);

if ($match) {
    $match  = quotemeta($match);
    @quotes = grep /$match/i, @quotes;
}

if ($random) {
    print $quotes[ rand @quotes ];
}
else {
    print join( "\n", @quotes );
}

sub get_quotes_page {
    my $movie = shift;

    my $mech = WWW::Mechanize->new;
    $mech->get("https://www.imdb.com/search/name-text/");
    $mech->success or die "Can't get the search page";
    open my $fh, '>', '/home/hogan/Documents/hogan/logs/1.form-log.txt'
      or die "Couldn't open logfile 'form-log.txt': $!";
    $mech->dump_forms($fh);

    my $ret1 = $mech->submit_form(
        form_number => 2,
        fields      => {
            title    => $movie,
            restrict => "Movies only",
        },
    );
    $logger->info("return1 is $ret1");

    # dd $ret1; # yikes
    if ( $ret1->is_success ) {
        $logger->info("Supposedly successful so far");
        print $ret1->decoded_content;
    }
    else {
        print STDERR $ret1->status_line, "\n";
    }

    my @links = $mech->find_all_links( url_regex => qr[^/Title] )
      or die "No matches for \"$movie\" were found.\n";

    # Use the first link
    my ( $url, $title ) = @{ $links[0] };

    warn "Checking $title...\n";

    $mech->get($url);
    my $link = $mech->find_link( text_regex => qr/Memorable Quotes/i )
      or die qq{"$title" has no quotes in IMDB!\n};

    warn "Fetching quotes...\n\n";
    $mech->get( $link->[0] );

    return $mech->content;
}

sub extract_quotes {
    my $page = shift;

    # Nibble away at the unwanted HTML at the beginning...
    $page =~ s/.+Memorable Quotes//si;
    $page =~ s/.+?(<a name)/$1/si;

    # ... and the end of the page
    $page =~ s/Browse titles in the movie quotes.+$//si;
    $page =~ s/<p.+$//g;

    # Quotes separated by an <HR> tag
    my @quotes = split( /<hr.+?>/, $page );

    for my $quote (@quotes) {
        my @lines = split( /<br>/, $quote );
        for (@lines) {
            s/<[^>]+>//g;    # Strip HTML tags
            s/\s+/ /g;       # Squash whitespace
            s/^ //;          # Strip leading space
            s/ $//;          # Strip trailing space
            s/&#34;/"/g;     # Replace HTML entity quotes

            # Word-wrap to fit in 72 columns
            $Text::Wrap::columns = 72;
            $_ = wrap( '', ' ', $_ );
        }
        $quote = join( "\n", @lines );
    }

    return @quotes;
}
__END__
$

When we have a look at what forms were available, we have:

$ cat /home/hogan/Documents/hogan/logs/1.form-log.txt
GET https://www.imdb.com/find [nav-search-form]
  navbar-search-category-select=<UNDEF> (checkbox) [*<UNDEF>/off|on]
  q= (text)
  <NONAME>=<UNDEF> (submit)
  ref_=nv_sr_sm (hidden readonly)

POST https://www.imdb.com/search/title-text/
  type=plot (option) [*plot/Plot|quotes/Quotes|trivia/Trivia|goofs/Goofs|crazy_credits/Crazy Credits|location/Filming Locations|soundtracks/Soundtracks|versions/Versions]
  query= (search)
  <NONAME>=<UNDEF> (submit)

POST https://www.imdb.com/search/name-text/
  type=bio (option) [*bio/Biographies|quotes/Quotes|trivia/Trivia]
  query= (search)
  <NONAME>=<UNDEF> (submit)
$

The way I count it, we want the 1st one instead of the second, with zero-indexing. Either way, this is about where I lose the handle on it. I have trouble time and again with the output overwhelming the terminal. With the log from Log4perl, I have:
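The counting question can be tried offline. The sketch below is only an illustration: the HTML is a hand-written stand-in modelled on the form dump (not IMDb's real markup, and the search inputs are simplified to plain text inputs), and it uses HTML::Form, which WWW::Mechanize loads for its form handling. WWW::Mechanize documents form_number as counting from 1, while the list returned by HTML::Form->parse is an ordinary zero-based Perl array. Note also that the dumped forms expose fields named type and query, not title and restrict.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Form;

# Hand-written stand-in for the three dumped forms (assumption: simplified markup).
my $html = <<'HTML';
<form action="https://www.imdb.com/find" method="get" id="nav-search-form">
  <input type="text" name="q">
</form>
<form action="https://www.imdb.com/search/title-text/" method="post">
  <select name="type">
    <option value="plot" selected>Plot</option>
    <option value="quotes">Quotes</option>
  </select>
  <input type="text" name="query">
  <input type="submit">
</form>
<form action="https://www.imdb.com/search/name-text/" method="post">
  <select name="type">
    <option value="bio" selected>Biographies</option>
    <option value="quotes">Quotes</option>
  </select>
  <input type="text" name="query">
  <input type="submit">
</form>
HTML

my @forms = HTML::Form->parse( $html, 'https://www.imdb.com/search/name-text/' );

# Array index 1 is the second form on the page, i.e. the one that
# form_number => 2 refers to in WWW::Mechanize.
my $form = $forms[1];
$form->value( type  => 'quotes' );
$form->value( query => 'Jaws' );

# click() only builds the HTTP::Request; nothing is sent over the network.
my $request = $form->click;
print scalar(@forms), " forms found\n";
print $request->method, ' ', $request->uri, "\n";
print $request->content, "\n";
```

Printing the request this way shows which fields would be posted before any real submission is attempted.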

2020/04/12 22:20:23 INFO return1 is HTTP::Response=HASH(0x5653c7bfc2e8)
2020/04/12 22:20:23 INFO Supposedly successful so far

As far as I can tell, decoding this return value yields the whole page as HTML, splashed out onto STDOUT, which leaves me confused and sifting through stuff meant for machines. I've tried passing other movies from the imdb top 100 quote movies as the argument:
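One way to keep the terminal readable is to log a short summary of the HTTP::Response instead of printing its decoded content. The sketch below builds a response by hand so that it runs offline; in the script the object would be the one returned by submit_form. The page body and the summary format are made up for illustration.

```perl
use strict;
use warnings;
use HTTP::Response;

# Stand-in response (assumption: invented body) so that no network is needed.
my $body = "<html><head><title>Demo page</title></head><body>"
         . ( "<p>filler</p>" x 1000 )
         . "</body></html>";
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $body,
);

my $html = $res->decoded_content;

# Pull out just the page title rather than dumping the whole document.
my ($title) = $html =~ m{<title>(.*?)</title>}is;

my $summary = sprintf "%s, %d characters, title: %s",
    $res->status_line, length $html, $title // '(none)';
print "$summary\n";
```

In the script above, the same string could go to `$logger->info($summary)`, with the full HTML written to a file only when it is actually needed.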

$ ./2.quotes.pl Fargo >1.txt
No matches for "Fargo" were found.
$ ./2.quotes.pl Jaws >1.txt
No matches for "Jaws" were found.

How gratifying it would be to see:

We're gonna need a bigger boat.

Q1) What do I need to do to get this script working?

Q2) What is the relationship between WWW::Mechanize and modules like WWW::Mechanize::Gzip and WWW::Mechanize::Chrome? The former uses this line:

use base qw(WWW::Mechanize);

, while the latter seems to reference its "base" module in the raw source. Does either "inherit" anything from its base?

Update: I botched the second half of this question as I made the comparison. What I meant to ask was:

Q2) ..., while the latter seems to lack reference to its "base" module in the raw source.

Q3) If I went to the link and typed in Jaws, would I get 128k worth of html?

$ ll 1.txt
-rw-r--r-- 1 hogan hogan 128858 Apr 12 22:02 1.txt

Q4) Doesn't "scraping" connote going after an entire class of files like images, or have I done it here without even trying?

Thanks for your comment,

Replies are listed 'Best First'.
Re: running an example script with WWW::Mechanize* module
by Corion (Patriarch) on Apr 13, 2020 at 06:52 UTC

    WWW::Mechanize::Chrome doesn't inherit from WWW::Mechanize, but it strives to provide the same API as WWW::Mechanize where possible/applicable.

    In some situations, I've deviated from the WWW::Mechanize API unfortunately, instead of either using a different method name or expanding the API in a compatible way...
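The difference between inheriting from a class and merely offering the same method names can be checked with isa and can. The sketch below uses three made-up packages (My::Agent, My::Agent::Child, My::Agent::Lookalike); they are hypothetical stand-ins and not the real modules.

```perl
use strict;
use warnings;

package My::Agent;               # hypothetical base class
sub new { my ($class) = @_; return bless {}, $class }
sub get { return 'My::Agent::get' }

package My::Agent::Child;        # inherits; "use base" sets up @ISA like this
our @ISA = ('My::Agent');

package My::Agent::Lookalike;    # same method names, no inheritance
sub new { my ($class) = @_; return bless {}, $class }
sub get { return 'My::Agent::Lookalike::get' }

package main;

for my $class (qw(My::Agent::Child My::Agent::Lookalike)) {
    my $obj = $class->new;
    printf "%-22s isa My::Agent: %-3s get() comes from %s\n",
        $class, ( $obj->isa('My::Agent') ? 'yes' : 'no' ), $obj->get;
}
```

The child class has no methods of its own, yet new and get still work because they are found through @ISA. The lookalike answers to the same calls only because it defines them itself. The same isa test can be pointed at installed modules, for example `WWW::Mechanize->isa('LWP::UserAgent')`.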

      WWW::Mechanize::Chrome doesn't inherit from WWW::Mechanize, but it strives to provide the same API as WWW::Mechanize where possible/applicable.

      Okay, and this is because you don't have this in WMC:

      use base qw(WWW::Mechanize);

      right? I have to wonder if you considered other namespaces to put this, in particular when I found out that WWW::Mechanize inherits from LWP::UserAgent:

      $ pwd
      /usr/local/share/perl/5.26.1/WWW/Mechanize
      $ ..
      $ ls
      Mechanize Mechanize.pm
      $ cat Mechanize.pm | more
      package WWW::Mechanize;
      #ABSTRACT: Handy web browsing in a Perl object
      use strict;
      use warnings;
      our $VERSION = '1.96';
      use Tie::RefHash;
      use HTTP::Request 1.30;
      use LWP::UserAgent 5.827;
      use HTML::Form 1.00;
      use HTML::TokeParser;
      use Scalar::Util qw(tainted);
      use base 'LWP::UserAgent';

      In some situations, I've deviated from the WWW::Mechanize API unfortunately, instead of either using a different method name or expanding the API in a compatible way...

      From all of this background reading of what happened over the last 20 years of the internet and perl, the idea that one would want to faithfully represent in every detail what worked in 2003 with what works in 2015 seems like folly. I'll try out the new stuff and see how I do with it.

      I hauled out one of my favorite WMG scripts only to find that it doesn't populate values correctly anymore, so I'm ready to start using newer tools. I have achieved such a minor amount of success. Between readmore tags I'll post the older script that I'm trying to modernize:

      I'm fairly confident that it behaved and produced accurate results. (There is a chance that it was a script that I was trying to extend and lost my way. I can't always tell them apart.) Now let's look at how far I've gotten with WMC:

      #! /usr/bin/perl
      use warnings;
      use strict;
      use WWW::Mechanize::Chrome;
      use HTML::TableExtract qw(tree);
      use open ':std', OUT => ':utf8';
      use Prompt::Timeout;
      use constant TIMEOUT  => 3;
      use constant MAXTRIES => 30;

      ## redesign for solar eclipse of aug 21, 2017
      ### begin 2020 rewrite
      ### with WWW::Mechanize::Chrome
      ### and Log::Log4perl
      use Log::Log4perl;
      use Data::Dump;
      use 5.016;

      my $log_conf3 = "/home/hogan/Documents/hogan/logs/conf_files/3.conf";
      my $log_conf4 = "/home/hogan/Documents/hogan/logs/conf_files/4.conf";

      #Log::Log4perl::init($log_conf3); #debug
      Log::Log4perl::init($log_conf4);    #info
      my $logger        = Log::Log4perl->get_logger();
      my $current_level = $logger->level();
      $logger->info("script begins with $current_level");

      my $a = 'b';
      for my $i ( 1 .. 2 ) {
          say "i is $i";
          $logger->info("i is $i");
          my $site = 'http://www.fourmilab.ch/yoursky/cities.html';
          my $mech = WWW::Mechanize::Chrome->new( headless => 1, );
          $mech->get($site);
          $mech->follow_link( text_regex => qr/Portland OR/i );
          say "We are at " . $mech->uri;
          if ( $mech->success() ) {
              open my $gh, '>', "$a.form-log.txt"
                or warn "Couldn't open logfile $a.form-log.txt $!";
              $mech->dump_forms($gh);
              say $gh "=========";
          }
          my $guess = 2458960;    #Earth day 2020 in julian days
          $mech->form_number($i);
          say "$i works" if $mech->success();
          say $mech->current_form->{name};    # ??
          say "current form has a name" if $mech->success();

          ## syntax that used to work with WWW::Mechanize
          # $mech->set_fields(qw'date 2');
          #$mech->set_fields();
          $mech->field( date => '2' );    ## analogs to set set_fields in WM
          say "first field set succeeded" if $mech->success();
          $mech->field( jd => $guess );
          say "second field set succeeded" if $mech->success();
          $mech->click_button( value => "Update" );    # this seems similar to WM
          say "clickbutton succeeded" if $mech->success();
          my $string = $mech->uri;
          $logger->info("We are at $string") if $mech->success();

          ## get a screenshot of how far we made it
          my $page_png = $mech->content_as_png();
          my $base = '/home/hogan/5.scripts/1.corion./template_stuff/aimages';
          my $fn   = $base . "/$a.png";
          open my $fh, '>', $fn
            or die "Couldn't create '$fn': $!";
          binmode $fh, ':raw';
          print $fh $page_png;
          close $fh;
          print "exiting show_screen with letter $a\n";
          my $n = 2;
          $logger->info("sleeping for $n seconds ====================");
          $mech->sleep($n);
          $a++;
      }

      Terminal output:

      $ ./5.pluto.pl
      script begins with 20000
      i is 1
      i is 1
      Connected to ws://127.0.0.1:43601/devtools/browser/0c9af664-f568-4f8a-bd7f-496dd9593030
      We are at http://www.fourmilab.ch/cgi-bin/Yoursky?z=1&lat=45.5183&ns=North&lon=122.676&ew=West
      1 works
      Use of uninitialized value in say at ./5.pluto.pl line 55.
      current form has a name
      <no text> at /usr/local/share/perl/5.26.1/WWW/Mechanize/Chrome.pm line 3779.
      <no text> at /usr/local/share/perl/5.26.1/WWW/Mechanize/Chrome.pm line 3779.
      <no text> at /usr/local/share/perl/5.26.1/WWW/Mechanize/Chrome.pm line 3779.
      3 elements found for input with name 'date' at ./5.pluto.pl line 63.
      $

      The log output shows that I've been jimmying with the loop values that I'm using to figure out which form I need. It turns out, it is not zero-based in this context. Zero bombs out.

      2020/04/18 17:18:49 INFO script begins with 20000
      2020/04/18 17:18:49 INFO Connected to ws://127.0.0.1:35721/devtools/browser/37b006e1-5a92-4191-b50f-7475ff4d12d9
      2020/04/18 17:24:03 INFO script begins with 20000
      2020/04/18 17:24:03 INFO i is 0
      2020/04/18 17:24:03 INFO Connected to ws://127.0.0.1:38435/devtools/browser/79feea55-b183-4da4-9111-b43ddb825bdd
      2020/04/18 17:26:33 INFO script begins with 20000
      2020/04/18 17:26:33 INFO i is 1
      2020/04/18 17:26:33 INFO Connected to ws://127.0.0.1:43601/devtools/browser/0c9af664-f568-4f8a-bd7f-496dd9593030

      Whilst far short of a grand opus or masterpiece, this script does something that I couldn't manage with WM, namely, effective logging. WMC requires Log::Log4perl, so that functionality comes along with it. The disadvantage is that you've got to get it installed, which has been a (fixable) problem for some. It didn't want to install with my strawberry perl on windows 10, which I house on another partition. Two ways to solve the problem are given in getting Log::Log4perl to install on windows strawberry perl.

      So, where am I stuck? Well, this is hot off the press and represents several similar attempts. It's nice to be using WMC and log4perl to figure this out. You can't be reading the same things the machines do as it overwhelms STDOUT. My partial results are encouraging, and this seems very much like a problem of getting forms and fields set and selected with new methods calls. Here is the uri we're looking at. It's a fun site, and you can readily enter your own information.

      I do have data from the formdump:

      [FORM] request /cgi-bin/Yoursky
      [INPUT (submit)] <no name>
      [INPUT (radio)] date
      [INPUT (radio)] date
      [INPUT (text)] utc
      [INPUT (radio)] date
      [INPUT (text)] jd
      [INPUT (text)] lat
      [INPUT (radio)] ns
      [INPUT (radio)] ns
      [INPUT (text)] lon
      [INPUT (radio)] ew
      [INPUT (radio)] ew
      [INPUT (checkbox)] coords
      [INPUT (checkbox)] moonp
      [INPUT (checkbox)] deep
      [INPUT (text)] deepm
      [INPUT (checkbox)] consto
      [INPUT (checkbox)] constn
      [INPUT (checkbox)] consta
      [INPUT (checkbox)] consts
      [INPUT (checkbox)] constb
      [INPUT (text)] limag
      [INPUT (checkbox)] starn
      [INPUT (text)] starnm
      [INPUT (checkbox)] starb
      [INPUT (text)] starbm
      [INPUT (checkbox)] flip
      [INPUT (text)] imgsize
      [INPUT (text)] fontscale
      [SELECT (select-one)] scheme
      [INPUT (checkbox)] edump
      [TEXTAREA (textarea)] elements

      Fishing for tips. Thanks, Corion for your response and this considerable achievement:

      $ wc -l $(locate Chrome.pm)
        5761 /home/hogan/Documents/repos/wmc/WWW-Mechanize-Chrome/lib/WWW/Mechanize/Chrome.pm
        5708 /usr/local/share/perl/5.26.1/WWW/Mechanize/Chrome.pm
       11469 total
      $

        There is a misunderstanding of $mech->success: this method only reflects whether the last HTTP response from the server is considered an error or not. It does not reflect whether the last operation on $mech was successful or not. WWW::Mechanize::Chrome usually signals errors by calling die.
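Since failures surface as exceptions rather than through success, the usual Perl pattern is to wrap each step in eval and inspect $@. The sketch below uses a made-up helper, set_field_or_die, as a stand-in for a call that dies on failure; it is hypothetical and not part of WWW::Mechanize::Chrome.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a method that dies when a field is ambiguous.
sub set_field_or_die {
    my ( $name, $matches ) = @_;
    die "$matches elements found for input with name '$name'\n"
        if $matches != 1;
    return 1;
}

my @report;
for my $case ( [ jd => 1 ], [ date => 3 ] ) {
    my ( $name, $matches ) = @$case;

    # eval returns 1 only if the call inside did not die.
    my $ok = eval { set_field_or_die( $name, $matches ); 1 };
    push @report, $ok
        ? "field '$name' set\n"
        : "field '$name' failed: $@";
}
print @report;
```

With this pattern a message like "succeeded" is printed only when the step really completed, instead of echoing the status of the last HTTP response.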

        I haven't run your code, but the log output suggests that the form you're looking at has no name:

        say $mech->current_form->{name}; # ??
        # Use of uninitialized value in say at ./5.pluto.pl line 55.

        The form is not great, because it really contains three fields with the same name date, so you will have to fetch the individual fields and explicitly set them:

        # largely untested
        my @date_fields = $mech->selector('.//*[@name="date"]', node => $mech->current_form );
        $mech->set_field( $date_fields[1] => $guess );

        In the next version, I'll actually implement the arrayref form of ->set_fields() for values of index larger than one :) But that means breaking my (incompatible) API to restore the WWW::Mechanize API so I'll have to look carefully there.

        $mech->set_fields( $name => [ 'foo', 2 ] );
Re: running an example script with WWW::Mechanize* module
by Aldebaran (Curate) on Apr 19, 2020 at 01:18 UTC
    Q1) What do I need to do to get this script working?

    Looking at WWW::Mechanize, I've realized how old it is in internet years (today is 18-04-2020). I couldn't figure out just when it hit the scene, but I did find a treatment of it at this link from 2002. It has a similar treatment of imdb, but the site has changed completely since then. I was further impressed by its age by this:

    my %known_agents = (
        'Windows IE 6' => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',

    Over the last several days, I've read up on the Browser_wars. I guess I was there for them, but I wasn't really aware that chrome had displaced IE as the most popular browser. Now I see that chrome and firefox are leading the field, so lucky for us perlers that we have tools that spare us from tossing our hands up and saying that we can't deal with javascript-enabled sites, which seem to be a lot of them, judging by what I saw poking around again. The imdb site is javascript-enabled, so WWW::Mechanize is poorly suited to the script in the original post, at least now.

    Q4) Doesn't "scraping" connote going after an entire class of files like images, or have I done it here without even trying?

    I think one obviates this by logging in:

    # get login data using Config::Tiny
    use Config::Tiny;
    my $ini_path = '/home/hogan/Documents/html_template_data/3.values.ini';
    say "ini path is $ini_path";
    my $sub_hash = "perlmonks";
    my $Config   = Config::Tiny->new;
    $Config = Config::Tiny->read( $ini_path, 'utf8' );

    my $username = $Config->{$sub_hash}{'username'};
    my $password = $Config->{$sub_hash}{'password'};
    say "values are $username $password ";

    If the site doesn't like what your script is doing when you're signed in, they can let you know. I really would like to work up this example, but with WMC instead. My partner and I will watch Netflix, HBO, Amazon, and we're always trying to match the actors up to where we've seen them last, so I will get on my android and make the actual keystrokes on other occasions.

    So I would like to have something I could work up to launch on android, but with WMC, which should have the advantages of being on a native platform.

      "If the site doesn't like what your script is doing when you're signed in, they can let you know. I really would like to work up this example, but with WMC instead. My partner and I will watch Netflix, HBO, Amazon, and we're always trying to match the actors up to where we've seen them last, so I will get on my android and make the actual keystrokes on other occasions."

      Have you looked at the many utilities on CPAN for scraping data from IMDB? While packages exist, here is a short (sub-optimal) proof of concept for accessing the data, just to get you started:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use feature 'say';
      use Mojo::URL;
      use Mojo::Util qw(trim);
      use Mojo::UserAgent;

      my $imdburl = 'http://www.imdb.com/search/title?title=Caddyshack';

      # pretend to be a browser
      my $uaname = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36';
      my $ua = Mojo::UserAgent->new;
      $ua->max_redirects(5)->connect_timeout(20)->request_timeout(20);
      $ua->transactor->name( $uaname );

      # find search results
      my $dom = $ua->get( $imdburl )->res->dom;

      # assume first match
      my $filmurl = $dom->find('a[href^=/title]')->first->attr('href');

      # extract film id
      my $filmid = Mojo::URL->new( $filmurl )->path->parts->[-1];

      # get details of film
      $dom = $ua->get( "https://www.imdb.com/title/$filmid/" )->res->dom;

      # print film details
      say trim( $dom->at('div.title_wrapper > h1')->text ) . ' (' . trim( $dom->at('#titleYear > a')->text ) .')';

      # print actor/character names
      foreach my $cast ( $dom->find('table.cast_list > tr:not(:first-child)')->each ){
          say trim ($cast->at('td:nth-of-type(2) > a')->text ) . ' as ' . trim ( $cast->at('td.character')->all_text );
      }

      Output:

      Caddyshack (1980)
      Chevy Chase as Ty Webb
      Rodney Dangerfield as Al Czervik
      Ted Knight as Judge Elihu Smails
      Michael O'Keefe as Danny Noonan
      Bill Murray as Carl Spackler
      Sarah Holcomb as Maggie O'Hooligan
      Scott Colomby as Tony D'Annunzio
      Cindy Morgan as Lacey Underall
      Dan Resin as Dr. Beeper
      Henry Wilcoxon as The Bishop
      Elaine Aiken as Mrs. Noonan
      Albert Salmi as Mr. Noonan
      Ann Ryerson as Grace
      Brian Doyle-Murray as Lou Loomis
      Hamilton Mitchell as Motormouth

      See also Mojo::UserAgent, Mojo::DOM, Mojo::URL, and ojo (should you want one-liners). You could adapt the above to print all matches and prompt for which one you want rather than assuming the first one (since there may be remakes, sequels, prequels, etc.), or to let you select an actor and return the details of all the other films/shows they have been in.

      Update: If you would prefer some sort of web interface to the results wrap the above around Mojolicious::Lite

        marto, thx for your reply...your script works right out of the gate. I believe that this is the second time I've followed Mojo:: based scripts you've posted, finding them useful in both circumstances. Also thanks for the tip to have a gander on cpan. It was interesting for me to look at the source, and clearly, if I wanted to pursue IMDB further, that would be the route.

        Let me tell you why I would like to shift away from IMDB. It's that when I save the dom file to disc, it's 2.2 megs, which is nowhere near what I can lay eyes on and understand. Machines get it with the help of javascript, but I am only intermediate at best in my understanding of any of the matters I am writing about now.

        I spent some time looking at Mojo:: beginning with:

        #!/usr/bin/env perl
        use Mojolicious::Lite;

        get '/' => sub {
            my $c = shift;
            $c->render(text => 'Hello World!');
        };

        app->start;
        __END__

        I'm not sure what it all means, but it seems to work and indicate that I have the capabilities I might expect.

        $ ./1.mojo.pl daemon
        [2020-04-25 19:20:12.17317] [2970] [info] Listening at "http://*:3000"
        Server available at http://127.0.0.1:3000
        ^C$

        I had a bit of aha moment when comparing the logs for WMC:

        2020/04/29 14:29:32 DEBUG Connecting to ws://127.0.0.1:37749/devtools/browser/58619c4b-5292-4d3a-a0f7-6b69c01c73dc

        I don't know how to understand code short of working it and seeing the outcome. So I fiddle around with the examples I can work and then try to re-train the script on a different target, a smaller one that I might be able to understand and that furthers my goals with what I want to do for web automation.

        If you would prefer some sort of web interface to the results wrap the above around Mojolicious::Lite

        I'm hoping that I can get an event clicked on this page maybe with the Mojo:: family of tools. It's the site I've always gone to for ephemeral data and is further set for Portland, OR. I'm looking to get the radio button for julian day pressed and the value for jd populated by:

        my $julian_day = 2458960;

        , updated, and then I want to extract all the values from the table, but with a particular emphasis on getting whether the Sun is up or not at that precise time.
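Before sending a Julian day to the site, it may help to check which moment it denotes. The sketch below converts between Julian day and UTC using the standard relation that JD 2440587.5 is 1970-01-01 00:00:00 UTC. The helper names jd_to_utc and utc_to_jd are made up for this example, and leap seconds are ignored.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Julian day of the Unix epoch, 1970-01-01 00:00:00 UTC.
use constant JD_UNIX_EPOCH => 2440587.5;

# Hypothetical helper: Julian day -> formatted UTC timestamp.
sub jd_to_utc {
    my ($jd) = @_;
    my $epoch = ( $jd - JD_UNIX_EPOCH ) * 86_400;    # days -> seconds
    return strftime( '%Y-%m-%d %H:%M:%S UTC', gmtime($epoch) );
}

# Hypothetical helper: Unix epoch seconds -> Julian day.
sub utc_to_jd {
    my ($epoch) = @_;
    return $epoch / 86_400 + JD_UNIX_EPOCH;
}

print jd_to_utc(2458960), "\n";    # prints 2020-04-20 12:00:00 UTC
```

Julian days start at noon, so a whole-number value always lands at 12:00 UTC. Pinning down a precise local moment, such as whether the Sun is up in Portland at a given time, needs the fractional part as well.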

        Corion and I have been trying to crack this with WMC, and we're not quite there. Here's how this button looks when I ask google chrome's inspector about it:

        full XPath: /html/body/form/center/table/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/input
        selector: body > form > center > table > tbody > tr:nth-child(1) > td > table > tbody > tr:nth-child(3) > td:nth-child(1) > a
        XPath: /html/body/form/center/table/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/a
        JSPath: document.querySelector("body > form > center > table > tbody > tr:nth-child(1) > td > table > tbody > tr:nth-child(3) > td:nth-child(1) > a")

        Another representation. I think this is what the DOM looks like after data dumped:

        [ "tag", "input", { name => "date", onclick => 0, type => "radio", value => 2 }, 'fix', ],
        ["text", " ", 'fix'],
        [ "tag", "a", { href => "/yoursky/help/controls.html#Julian" }, 'fix',
          ["text", "Julian day:", 'fix'],
        ],
        ["text", "\n", 'fix'],
        ],
        ["text", "\n", 'fix'],
        [ "tag", "td", {}, 'fix',
          ["text", "\n", 'fix'],
          [ "tag", "input",
            { name => "jd", onchange => "document.request.date[2].checked=true;", size => 20, type => "text", value => 2458963.36684, },

        Put simply, can Mojolicious do this?
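A rough sketch of how the Mojo tools could approach this without a browser follows. It rests on several assumptions: the HTML fragment is hand-written from the dumps above (the values 0 and 1 for the first two radio buttons are guesses), the form is assumed to submit by POST, and only two of the form's many fields are filled in. A real submission would probably need the rest (lat, lon and so on). The transaction is only built here, never sent.

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::UserAgent;

# Hand-written fragment modelled on the form dump (assumption: not the real markup).
my $html = <<'HTML';
<form name="request" action="/cgi-bin/Yoursky" method="post">
  <input type="radio" name="date" value="0" checked>
  <input type="radio" name="date" value="1">
  <input type="text"  name="utc" value="">
  <input type="radio" name="date" value="2">
  <input type="text"  name="jd" value="2458963.36684">
  <input type="submit" value="Update">
</form>
HTML

my $dom  = Mojo::DOM->new($html);
my $form = $dom->at('form[name="request"]');

# Without a browser there is no button to press: choosing a radio button
# just means sending its value under the shared field name.
my $julian_radio = $form->at('input[type="radio"][name="date"][value="2"]');
my %fields = (
    date => $julian_radio->attr('value'),
    jd   => 2458960,
);

my $url = 'http://www.fourmilab.ch' . $form->attr('action');
my $ua  = Mojo::UserAgent->new;

# build_tx prepares the request without sending it.
my $tx = $ua->build_tx( POST => $url => form => \%fields );

print $tx->req->method, ' ', $tx->req->url, "\n";
print $tx->req->body, "\n";
```

Swapping build_tx for `$ua->post(...)` would send the request, and the returned page could then be walked with `$tx->res->dom` in the same way marto's script does. The onchange handler in the dump only ticks the radio button inside a browser, so a script has to supply the date value itself.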