PerlMonks  

Re^2: running an example script with WWW::Mechanize* module

by marto (Cardinal)
on Apr 19, 2020 at 09:02 UTC (#11115788)


in reply to Re: running an example script with WWW::Mechanize* module
in thread running an example script with WWW::Mechanize* module

"If the site doesn't like what your script is doing when you're signed in, they can let you know. I really would like to work up this example, but with WMC instead. My partner and I will watch Netflix, HBO, Amazon, and we're always trying to match the actors up to where we've seen them last, so I will get on my android and make the actual keystrokes on other occasions."

Have you looked at the many utilities on CPAN for scraping data from IMDB? While packages exist, here is a short (suboptimal) proof of concept for accessing the data, just to get you started:

#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::URL;
use Mojo::Util qw(trim);
use Mojo::UserAgent;

my $imdburl = 'http://www.imdb.com/search/title?title=Caddyshack';

# pretend to be a browser
my $uaname = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36';
my $ua = Mojo::UserAgent->new;
$ua->max_redirects(5)->connect_timeout(20)->request_timeout(20);
$ua->transactor->name( $uaname );

# find search results
my $dom = $ua->get( $imdburl )->res->dom;

# assume first match
my $filmurl = $dom->find('a[href^=/title]')->first->attr('href');

# extract film id
my $filmid = Mojo::URL->new( $filmurl )->path->parts->[-1];

# get details of film
$dom = $ua->get( "https://www.imdb.com/title/$filmid/" )->res->dom;

# print film details
say trim( $dom->at('div.title_wrapper > h1')->text ) . ' (' . trim( $dom->at('#titleYear > a')->text ) .')';

# print actor/character names
foreach my $cast ( $dom->find('table.cast_list > tr:not(:first-child)')->each ){
    say trim ($cast->at('td:nth-of-type(2) > a')->text ) . ' as ' . trim ( $cast->at('td.character')->all_text );
}

Output:

Caddyshack (1980)
Chevy Chase as Ty Webb
Rodney Dangerfield as Al Czervik
Ted Knight as Judge Elihu Smails
Michael O'Keefe as Danny Noonan
Bill Murray as Carl Spackler
Sarah Holcomb as Maggie O'Hooligan
Scott Colomby as Tony D'Annunzio
Cindy Morgan as Lacey Underall
Dan Resin as Dr. Beeper
Henry Wilcoxon as The Bishop
Elaine Aiken as Mrs. Noonan
Albert Salmi as Mr. Noonan
Ann Ryerson as Grace
Brian Doyle-Murray as Lou Loomis
Hamilton Mitchell as Motormouth

See also Mojo::UserAgent, Mojo::DOM, Mojo::URL, and ojo (should you want one-liners). You could adapt the above to print all matches and prompt for which one you want rather than assuming the first (since there may be remakes, sequels, prequels, etc.), or to let you select an actor and return the details of all the other films and shows they have been in.
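A minimal sketch of the "print all matches and pick one" adaptation, run offline against a made-up fragment rather than the live site. The sample HTML, the placeholder ids (tt0000001, tt0000002) and the helper pick_match() are assumptions for illustration only; the real search page will have different markup around the links.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;
use Mojo::URL;

# Hypothetical stand-in for the search results page (placeholder ids).
my $html = <<'HTML';
<div>
  <a href="/title/tt0000001/">Caddyshack</a>
  <a href="/title/tt0000002/">Caddyshack II</a>
  <a href="/name/nm0000001/">Some Actor</a>
</div>
HTML

my $dom = Mojo::DOM->new($html);

# Collect every title link instead of assuming the first one.
my @matches = $dom->find('a[href^="/title"]')->map(
    sub {
        return {
            id    => Mojo::URL->new( $_->attr('href') )->path->parts->[-1],
            title => $_->text,
        };
    }
)->each;

# Show a numbered list of candidates.
printf "%d: %s (%s)\n", $_ + 1, $matches[$_]{title}, $matches[$_]{id}
    for 0 .. $#matches;

# Hypothetical helper: map a 1-based choice to a match, or nothing if invalid.
sub pick_match {
    my ( $matches, $choice ) = @_;
    return unless defined $choice && $choice =~ /^\d+$/;
    return if $choice < 1 || $choice > @$matches;
    return $matches->[ $choice - 1 ];
}

# Only prompt when run from a terminal.
if ( -t STDIN ) {
    print 'Which one? ';
    chomp( my $choice = <STDIN> // '' );
    my $film = pick_match( \@matches, $choice );
    say $film ? "Selected $film->{id}" : 'No such match';
}
```

The selected id could then be fed into the "get details of film" step of the script above in place of $filmid.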

Update: If you would prefer some sort of web interface to the results, wrap the above in Mojolicious::Lite.
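Roughly what such a wrapper could look like, as a sketch only. The route name /cast and the helper lookup_cast are made up for this example, and the helper returns canned data (two names taken from the output above) so that it runs without touching IMDB; in a real app it would run the Mojo::UserAgent lookup instead.

```perl
#!/usr/bin/env perl
use Mojolicious::Lite;

# Hypothetical helper standing in for the scraping code above.
# It returns canned data here; swap in the real lookup as needed.
helper lookup_cast => sub {
    my ( $c, $title ) = @_;
    return [
        { actor => 'Chevy Chase', character => 'Ty Webb' },
        { actor => 'Bill Murray', character => 'Carl Spackler' },
    ];
};

# GET /cast?title=Caddyshack returns the title and cast as JSON.
get '/cast' => sub {
    my $c     = shift;
    my $title = $c->param('title') // 'Caddyshack';
    $c->render( json => { title => $title, cast => $c->lookup_cast($title) } );
};

app->start;
```

Saved as, say, cast.pl, it can be started with "./cast.pl daemon" and queried at http://127.0.0.1:3000/cast?title=Caddyshack, or tried without a server via "./cast.pl get /cast?title=Caddyshack".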

Replies are listed 'Best First'.
Re^3: running an example script with WWW::Mechanize* module
by Aldebaran (Deacon) on Apr 30, 2020 at 04:32 UTC

    marto, thx for your reply...your script works right out of the gate. I believe that this is the second time I've followed Mojo:: based scripts you've posted, finding them useful in both circumstances. Also thanks for the tip to have a gander on cpan. It was interesting for me to look at the source, and clearly, if I wanted to pursue IMDB further, that would be the route.

    Let me tell you why I would like to shift away from IMDB. It's that when I save the dom file to disc, it's 2.2 megs, which is nowhere near what I can lay eyes on and understand. Machines get it with the help of javascript, but I am only intermediate at best in my understanding of any of the matters I am writing about now.

    I spent some time looking at Mojo:: beginning with:

    #!/usr/bin/env perl
    use Mojolicious::Lite;

    get '/' => sub {
      my $c = shift;
      $c->render(text => 'Hello World!');
    };

    app->start;
    __END__

    I'm not sure what it all means, but it seems to work and indicate that I have the capabilities I might expect.

    $ ./1.mojo.pl daemon
    [2020-04-25 19:20:12.17317] [2970] [info] Listening at "http://*:3000"
    Server available at http://127.0.0.1:3000
    ^C$

    I had a bit of an aha moment when comparing this with the logs for WMC:

    2020/04/29 14:29:32 DEBUG Connecting to ws://127.0.0.1:37749/devtools/browser/58619c4b-5292-4d3a-a0f7-6b69c01c73dc

    I don't know how to understand code short of working it and seeing the outcome. So I fiddle around with the examples I can work and then try to re-train the script on a different target, a smaller one that I might be able to understand and that furthers my goals with what I want to do for web automation.

    "If you would prefer some sort of web interface to the results wrap the above around Mojolicious::Lite"

    I'm hoping that I can get an event clicked on this page, maybe with the Mojo:: family of tools. It's the site I've always gone to for ephemeris data, and it is already set for Portland, OR. I'm looking to get the radio button for Julian day pressed and the value for jd populated by:

    my $julian_day = 2458960;

    then have the page updated, and then extract all the values from the table, with particular emphasis on whether the Sun is up or not at that precise time.

    Corion and I have been trying to crack this with WMC, and we're not quite there. Here's how this button looks when I ask google chrome's inspector about it:

    full XPath: /html/body/form/center/table/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/input
    selector: body > form > center > table > tbody > tr:nth-child(1) > td > table > tbody > tr:nth-child(3) > td:nth-child(1) > a
    XPath: /html/body/form/center/table/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/a
    JSPath: document.querySelector("body > form > center > table > tbody > tr:nth-child(1) > td > table > tbody > tr:nth-child(3) > td:nth-child(1) > a")

    Another representation. I think this is what the DOM looks like after being data dumped:

    [
      "tag",
      "input",
      { name => "date", onclick => 0, type => "radio", value => 2 },
      'fix',
    ],
    ["text", " ", 'fix'],
    [
      "tag",
      "a",
      { href => "/yoursky/help/controls.html#Julian" },
      'fix',
      ["text", "Julian day:", 'fix'],
    ],
    ["text", "\n", 'fix'],
    ],
    ["text", "\n", 'fix'],
    [
      "tag",
      "td",
      {},
      'fix',
      ["text", "\n", 'fix'],
      [
        "tag",
        "input",
        {
          name => "jd",
          onchange => "document.request.date[2].checked=true;",
          size => 20,
          type => "text",
          value => 2458963.36684,
        },

    Put simply, can Mojolicious do this?

      "Let me tell you why I would like to shift away from IMDB. It's that when I save the dom file to disc, it's 2.2 megs, which is nowhere near what I can lay eyes on and understand. Machines get it with the help of javascript, but I am only intermediate at best in my understanding of any of the matters I am writing about now."

      One of the nice things about Mojo::DOM is its support for CSS selectors (see the "Learning Web Technologies" section of the Mojo docs). You don't have to figure these out for yourself: you can use your browser's developer tools to click on things and copy their CSS selector/path. Searching for a tutorial for whatever browser you use should turn up many videos and tutorials demonstrating this. The copied selectors aren't always optimal; looking at the HTML source often points to much shorter ones. Mojo::UserAgent makes it fairly simple to send data to web interfaces, and the returned object contains the resulting DOM (->res->dom above), which you can then use to display or capture whatever data you like. Give it a shot and let me know if you have any problems.
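      To make both points concrete, here is a rough sketch that runs offline. The HTML fragment is reconstructed from the dump quoted earlier in this thread, not copied from the live page. The POST method, the bare /cgi-bin/Yoursky target and the idea that sending only date and jd is enough are all assumptions; the real form very likely expects its other fields (latitude, longitude and so on) as well, and may use GET. The request is only built and printed, never sent.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;
use Mojo::UserAgent;

# Fragment modelled on the dump quoted in this thread (an assumption,
# the real page is much larger).
my $html = <<'HTML';
<form name="request">
<table><tr>
<td><input type="radio" name="date" value="2"> <a href="/yoursky/help/controls.html#Julian">Julian day:</a></td>
<td><input type="text" name="jd" value="2458963.36684" size="20"></td>
</tr></table>
</form>
HTML

my $dom = Mojo::DOM->new($html);

# Attribute selectors are far shorter than the copied nth-child chains.
my $radio = $dom->at('input[type=radio][name=date][value="2"]');
my $jd    = $dom->at('input[name=jd]');
say 'radio value: ', $radio->attr('value');
say 'current jd:  ', $jd->attr('value');

# Mojo does not run JavaScript, so nothing is "clicked". Instead, send the
# values the browser would submit once the Julian day radio is selected.
my $julian_day = 2458960;
my $ua         = Mojo::UserAgent->new;
my $url        = 'https://www.fourmilab.ch/cgi-bin/Yoursky';    # assumed form target
my $tx = $ua->build_tx( POST => $url => form => { date => 2, jd => $julian_day } );

# Inspect the request without sending it.
say $tx->req->method, ' ', $tx->req->url;
say $tx->req->body;

# To actually send it and get the resulting DOM:
#   my $result_dom = $ua->start($tx)->res->dom;
```

      Building the transaction first makes it easy to compare the body with what the browser's network tab shows for a manual submission before anything goes to the site.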

        "One of the nice things about Mojo::DOM..."

        I hadn't been looking there but found at the bottom a simple way to get the DOM into lexical perl that guys like me can understand. I don't get any buttons pushed here, but I'm so pleased with this script that I'm gonna post it. It represents my best achievement yet in getting the DOM information in a format I can read and not blowing me out on STDOUT using Data::Dump.

        $ ./3.mojo_fermi.pl >3.txt
        Wide character in print at /usr/local/share/perl/5.26.1/Log/Log4perl/Appender/File.pm line 313.
        Wide character in print at /usr/local/share/perl/5.26.1/Log/Log4perl/Appender/Screen.pm line 41.
        $ cat 3.mojo_fermi.pl
        #!/usr/bin/perl
        use strict;
        use warnings;
        use Mojo::URL;
        use Mojo::Util qw(dumper);
        use Mojo::UserAgent;
        use Data::Dump;
        use Log::Log4perl;
        use 5.016;
        use Mojo::DOM;

        my $log_conf3 = "/home/hogan/Documents/hogan/logs/conf_files/3.conf";
        my $log_conf4 = "/home/hogan/Documents/hogan/logs/conf_files/4.conf";

        #Log::Log4perl::init($log_conf3); #debug
        Log::Log4perl::init($log_conf4); #info
        my $logger = Log::Log4perl->get_logger();
        $logger->info("$0");

        my $site = 'https://www.fourmilab.ch/cgi-bin/Yoursky?z=1&lat=45.5183&ns=North&lon=122.676&ew=West';

        # pretend to be a browser
        my $uaname = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36';
        my $ua = Mojo::UserAgent->new;
        $ua->max_redirects(5)->connect_timeout(20)->request_timeout(20);
        $ua->transactor->name($uaname);

        # find search results
        my $dom = $ua->get($site)->res->dom;

        # dd $dom; #overwhelms STDOUT
        say "===========";
        my @nodes = @$dom;

        # c-style for is good for array output with index
        for ( my $i = 0 ; $i < @nodes ; $i++ ) {
          $logger->info("i is $i ==============");
          $logger->info("$nodes[$i]");
        }
        sleep 2; #good hygiene
        __END__
        $

        I would excerpt my beautiful, straight, demarcated logs, but they're covered in symbols that won't render well here.

        "Give it a shot and let me know if you have any problems."

        Thx, marto, I'll keep after it....
