PerlMonks  

Re: running an example script with WWW::Mechanize* module

by Aldebaran (Deacon)
on Apr 19, 2020 at 01:18 UTC (#11115761)


in reply to running an example script with WWW::Mechanize* module

Q1) What do I need to do to get this script working?

Looking at WWW::Mechanize, I've realized how old it is in internet years (today is 18-04-2020). I couldn't figure out just when it hit the scene, but I did find a treatment of it at this link from 2002. It has a similar treatment of imdb, but the site has changed completely since then. I was further impressed by its age by this:

my %known_agents = (
    'Windows IE 6' => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',

Over the last several days, I've read up on the browser wars. I guess I was there for them, but I wasn't really aware that Chrome had displaced IE as the most popular browser. Now I see that Chrome and Firefox are leading the field, so it's lucky for us Perlers that we have tools and don't have to throw up our hands and say that we can't deal with JavaScript-enabled sites, which seems to be a lot of them, judging by my poking around. The imdb site is JavaScript-enabled, so the script in the original post is a poor fit for that module, at least now.

Q4) Doesn't "scraping" connote going after an entire class of files like images, or have I done it here without even trying?

I think one obviates this by logging in:

# get login data using Config::Tiny
use Config::Tiny;
my $ini_path = qw( /home/hogan/Documents/html_template_data/3.values.ini );
say "ini path is $ini_path";
my $sub_hash = "perlmonks";
my $Config = Config::Tiny->new;
$Config = Config::Tiny->read( $ini_path, 'utf8' );
my $username = $Config->{$sub_hash}{'username'};
my $password = $Config->{$sub_hash}{'password'};
say "values are $username $password ";
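For anyone who hasn't used Config::Tiny, the snippet above expects an INI file with a [perlmonks] section holding username and password keys. Here is a minimal sketch of that layout, parsed from a string instead of the real file. The credential values are placeholders I made up, not anything from the actual 3.values.ini:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Config::Tiny;

# Placeholder INI text standing in for the real file;
# the username and password values are invented for illustration.
my $ini_text = <<'INI';
[perlmonks]
username=example_monk
password=not-a-real-password
INI

# read_string parses INI text directly, so no file on disk is needed.
my $config = Config::Tiny->read_string($ini_text)
    or die 'could not parse INI text: ' . Config::Tiny->errstr;

my $section  = 'perlmonks';
my $username = $config->{$section}{username};
my $password = $config->{$section}{password};

# Report the username, but only the length of the password,
# so the secret itself never lands in the terminal scrollback.
say "username is $username";
say 'password is ' . length($password) . ' characters long';
```

Printing only the length of the password is a deliberate change from the snippet above, which echoes the password itself.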

If the site doesn't like what your script is doing when you're signed in, they can let you know. I really would like to work up this example, but with WMC instead. My partner and I will watch Netflix, HBO, Amazon, and we're always trying to match the actors up to where we've seen them last, so I will get on my android and make the actual keystrokes on other occasions.

So I would like to work up something I could launch on Android, but with WMC (WWW::Mechanize::Chrome), which should have the advantage of running on a native platform.

Replies are listed 'Best First'.
Re^2: running an example script with WWW::Mechanize* module
by marto (Cardinal) on Apr 19, 2020 at 09:02 UTC

    "If the site doesn't like what your script is doing when you're signed in, they can let you know. I really would like to work up this example, but with WMC instead. My partner and I will watch Netflix, HBO, Amazon, and we're always trying to match the actors up to where we've seen them last, so I will get on my android and make the actual keystrokes on other occasions."

    Have you looked at the many utilities on CPAN for scraping data from IMDB? While packages exist, here is a short (suboptimal) proof of concept for accessing the data, just to get you started:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature 'say';
    use Mojo::URL;
    use Mojo::Util qw(trim);
    use Mojo::UserAgent;

    my $imdburl = 'http://www.imdb.com/search/title?title=Caddyshack';

    # pretend to be a browser
    my $uaname = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36';
    my $ua = Mojo::UserAgent->new;
    $ua->max_redirects(5)->connect_timeout(20)->request_timeout(20);
    $ua->transactor->name( $uaname );

    # find search results
    my $dom = $ua->get( $imdburl )->res->dom;

    # assume first match
    my $filmurl = $dom->find('a[href^=/title]')->first->attr('href');

    # extract film id
    my $filmid = Mojo::URL->new( $filmurl )->path->parts->[-1];

    # get details of film
    $dom = $ua->get( "https://www.imdb.com/title/$filmid/" )->res->dom;

    # print film details
    say trim( $dom->at('div.title_wrapper > h1')->text ) . ' (' . trim( $dom->at('#titleYear > a')->text ) .')';

    # print actor/character names
    foreach my $cast ( $dom->find('table.cast_list > tr:not(:first-child)')->each ){
        say trim ($cast->at('td:nth-of-type(2) > a')->text ) . ' as ' . trim ( $cast->at('td.character')->all_text );
    }

    Output:

    Caddyshack (1980)
    Chevy Chase as Ty Webb
    Rodney Dangerfield as Al Czervik
    Ted Knight as Judge Elihu Smails
    Michael O'Keefe as Danny Noonan
    Bill Murray as Carl Spackler
    Sarah Holcomb as Maggie O'Hooligan
    Scott Colomby as Tony D'Annunzio
    Cindy Morgan as Lacey Underall
    Dan Resin as Dr. Beeper
    Henry Wilcoxon as The Bishop
    Elaine Aiken as Mrs. Noonan
    Albert Salmi as Mr. Noonan
    Ann Ryerson as Grace
    Brian Doyle-Murray as Lou Loomis
    Hamilton Mitchell as Motormouth

    See also Mojo::UserAgent, Mojo::DOM, Mojo::URL, and ojo (should you want one-liners). You could adapt the above to print all matches and prompt for which one you want, rather than assuming the first one (since there are remakes, sequels, prequels, etc.), or to let you select an actor and return the details of all the other films and shows they have been in.
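The "print all matches and prompt" idea could look roughly like the sketch below. It runs against a made-up HTML snippet rather than the live site, so the markup, the film ids, and the helper name title_matches are all assumptions for illustration. The real search page is laid out differently and changes over time.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Invented stand-in for a search results page; ids and markup are not real.
my $html = <<'HTML';
<div class="results">
  <h3><a href="/title/tt0000001/">Caddyshack</a> (1980)</h3>
  <h3><a href="/title/tt0000002/">Caddyshack II</a> (1988)</h3>
  <p><a href="/title/tt0000001/">Caddyshack</a> poster link</p>
  <a href="/name/nm0000001/">Some Actor</a>
</div>
HTML

# Collect every link whose href starts with /title, keeping one entry
# per film id in the order the links appear on the page.
sub title_matches {
    my ($dom) = @_;
    my ( @matches, %seen );
    for my $link ( $dom->find('a[href^="/title"]')->each ) {
        my ($id) = $link->attr('href') =~ m{^/title/([^/?]+)};
        next unless defined $id;
        next if $seen{$id}++;
        push @matches, { id => $id, title => $link->all_text };
    }
    return @matches;
}

my @matches = title_matches( Mojo::DOM->new($html) );

# Print a numbered menu of the candidates.
say sprintf( '%d) %s [%s]', $_ + 1, $matches[$_]{title}, $matches[$_]{id} )
    for 0 .. $#matches;

# Prompt only when run from a terminal; otherwise fall back to the first match.
my $choice = 1;
if ( -t STDIN ) {
    print 'Which one? ';
    my $answer = <STDIN> // '';
    $choice = $1
        if $answer =~ /^\s*(\d+)\s*$/ && $1 >= 1 && $1 <= @matches;
}
say 'Selected: ' . $matches[ $choice - 1 ]{title};
```

In the script above, the same loop would run over the DOM returned by the search request, and the chosen id would replace $filmid.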

    Update: If you would prefer some sort of web interface to the results wrap the above around Mojolicious::Lite

      marto, thx for your reply...your script works right out of the gate. I believe that this is the second time I've followed Mojo:: based scripts you've posted, finding them useful in both circumstances. Also thanks for the tip to have a gander on cpan. It was interesting for me to look at the source, and clearly, if I wanted to pursue IMDB further, that would be the route.

      Let me tell you why I would like to shift away from IMDB. It's that when I save the dom file to disc, it's 2.2 megs, which is nowhere near what I can lay eyes on and understand. Machines get it with the help of javascript, but I am only intermediate at best in my understanding of any of the matters I am writing about now.

      I spent some time looking at Mojo:: beginning with:

      #!/usr/bin/env perl
      use Mojolicious::Lite;

      get '/' => sub {
        my $c = shift;
        $c->render(text => 'Hello World!');
      };

      app->start;
      __END__

      I'm not sure what it all means, but it seems to work and indicates that I have the capabilities I might expect.

      $ ./1.mojo.pl daemon
      [2020-04-25 19:20:12.17317] [2970] [info] Listening at "http://*:3000"
      Server available at http://127.0.0.1:3000
      ^C$

      I had a bit of an aha moment when comparing the logs for WMC:

      2020/04/29 14:29:32 DEBUG Connecting to ws://127.0.0.1:37749/devtools/browser/58619c4b-5292-4d3a-a0f7-6b69c01c73dc

      I don't know how to understand code short of working it and seeing the outcome. So I fiddle around with the examples I can work and then try to re-train the script on a different target, a smaller one that I might be able to understand and that furthers my goals with what I want to do for web automation.

      "If you would prefer some sort of web interface to the results wrap the above around Mojolicious::Lite"

      I'm hoping that I can trigger a click event on this page, maybe with the Mojo:: family of tools. It's the site I've always gone to for ephemeris data, and it is already set for Portland, OR. I'm looking to get the radio button for Julian day pressed and the value for jd populated by:

      my $julian_day = 2458960;

      and then updated. After that I want to extract all the values from the table, with particular emphasis on whether the Sun is up or not at that precise time.

      Corion and I have been trying to crack this with WMC, and we're not quite there. Here's how this button looks when I ask google chrome's inspector about it:

      full XPath: /html/body/form/center/table/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/input
      selector: body > form > center > table > tbody > tr:nth-child(1) > td > table > tbody > tr:nth-child(3) > td:nth-child(1) > a
      XPath: /html/body/form/center/table/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/a
      JSPath: document.querySelector("body > form > center > table > tbody > tr:nth-child(1) > td > table > tbody > tr:nth-child(3) > td:nth-child(1) > a")

      Here is another representation. I think this is what the DOM looks like after being dumped:

          [
            "tag",
            "input",
            { name => "date", onclick => 0, type => "radio", value => 2 },
            'fix',
          ],
          ["text", " ", 'fix'],
          [
            "tag",
            "a",
            { href => "/yoursky/help/controls.html#Julian" },
            'fix',
            ["text", "Julian day:", 'fix'],
          ],
          ["text", "\n", 'fix'],
        ],
        ["text", "\n", 'fix'],
        [
          "tag",
          "td",
          {},
          'fix',
          ["text", "\n", 'fix'],
          [
            "tag",
            "input",
            {
              name => "jd",
              onchange => "document.request.date[2].checked=true;",
              size => 20,
              type => "text",
              value => 2458963.36684,
            },

      Put simply, can Mojolicious do this?

        "Let me tell you why I would like to shift away from IMDB. It's that when I save the dom file to disc, it's 2.2 megs, which is nowhere near what I can lay eyes on and understand. Machines get it with the help of javascript, but I am only intermediate at best in my understanding of any of the matters I am writing about now."

        One of the nice things about Mojo::DOM is its support for CSS selectors (see the Mojo docs section Learning Web Technologies). You don't have to figure these out for yourself; you can use your browser's developer tools to click on things and copy their CSS selector/path. Searching for a tutorial for whatever browser you use should turn up many videos and tutorials demoing this sort of thing. The copied selectors aren't always optimal; just looking at the HTML source can often point to much shorter ones. Mojo::UserAgent makes it fairly simple to send data to web interfaces, and the returned object contains the resulting DOM (->res->dom above), which you can then use to display or capture whatever data you like. Give it a shot and let me know if you have any problems.
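As a rough illustration of the two points above (shorter selectors, and sending form data instead of clicking), here is a sketch built only from the attributes visible in the dump earlier in the thread: a radio input named date with value 2 and a text input named jd. The surrounding form markup, its method, and its action are placeholders I invented. Whether the real page accepts a plain form submission this way is an assumption that would need checking against its actual HTML.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;
use Mojo::UserAgent;

# Cut-down stand-in for the form, using the field names from the dump.
# The method and action are placeholders; the real page has more fields.
my $html = <<'HTML';
<form name="request" method="post" action="/placeholder-action">
  <input type="radio" name="date" value="2"> Julian day:
  <input type="text" name="jd" value="2458963.36684" size="20">
</form>
HTML

my $dom = Mojo::DOM->new($html);

# Short attribute selectors in place of the long path copied from the inspector.
my $radio_value = $dom->at('input[type="radio"][name="date"]')->attr('value');
my $current_jd  = $dom->at('input[name="jd"]')->attr('value');
say "radio value: $radio_value, jd currently: $current_jd";

# No click is needed for a plain form: the server only sees the submitted
# name/value pairs, so choosing the radio button amounts to sending date=2.
my $julian_day = 2458960;
my %form = ( date => $radio_value, jd => $julian_day );
say "$_=$form{$_}" for sort keys %form;

# Optional live request: pass the form's real action URL as the first argument.
if ( my $url = shift @ARGV ) {
    my $ua  = Mojo::UserAgent->new;
    my $res = $ua->post( $url => form => \%form )->result;
    die 'request failed: ' . $res->message unless $res->is_success;

    # Dump the text of every table cell in the response.
    say $_->all_text for $res->dom->find('td')->each;
}
```

Run without arguments, it only parses the sample markup and prints the parameters it would send. If the real form uses GET rather than POST, the same hash can be passed to get instead.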
