PerlMonks  

need help determining which web browsing module to use

by Special_K (Monk)
on Nov 12, 2020 at 21:06 UTC ( [id://11123625]=perlquestion )

Special_K has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to write a script to scrape a webpage, but I am not a web developer and am having trouble figuring out which module I need to install. Using Firefox's Web Developer Inspector tool, I see that an object I have highlighted has a hierarchical position in the webpage's HTML and sequence of <div> tags, with the object's identifying tag being:

<div class="grouped-item product-purchase-wrapper-7117">

There are multiple such objects on the webpage, each with their own unique number in place of "7117", but with all other text in the div identical to the example above (and at the same level of hierarchy in the webpage). What I would like, if it exists, is a module that will read the webpage's hierarchy into a data structure, allow me to specify a base path to a specific point in the webpage's hierarchy, and then allow me to iterate over all objects/div sections at that level of hierarchy and below. Which module should I use?

Replies are listed 'Best First'.
Re: need help determining which web browsing module to use
by marto (Cardinal) on Nov 13, 2020 at 03:06 UTC

    If you don't need JavaScript support, I'd use Mojo. Firstly an experiment to find the stuff:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Mojo::DOM;
    use feature 'say';

    my $html = '<div class="grouped-item product-purchase-wrapper-1">One</div><div class="grouped-item product-purchase-wrapper-7117">7117</div>';

    my $dom = Mojo::DOM->new( $html );

    # find each div with a class beginning grouped-item product-purchase-wrapper
    foreach my $div ( $dom->find('div[class^="grouped-item product-purchase-wrapper"]')->each ){
        say $div->text;
    }

    Prints:

    One
    7117

    Getting it from some live site:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Mojo::UserAgent;
    use feature 'say';

    my $ua  = Mojo::UserAgent->new;
    my $url = 'https://urlgoeshere';

    # fetch the page and parse the response body into a Mojo::DOM object
    my $dom = $ua->get( $url )->res->dom;

    # find each div with a class beginning grouped-item product-purchase-wrapper
    foreach my $div ( $dom->find('div[class^="grouped-item product-purchase-wrapper"]')->each ){
        say $div->text;
    }

    See Mojo::DOM, Mojo::UserAgent, Super search for more mojo goodness.

    Update: if you do need JavaScript support I'd suggest automating Chrome using WWW::Mechanize::Chrome, and using the xpath method.

Re: need help determining which web browsing module to use
by GrandFather (Saint) on Nov 13, 2020 at 00:00 UTC

    There are probably two components you need to think about, fetching the page and processing it, and there are a multitude of ways to go about each. The old school way might be to use LWP::Simple to fetch the page and HTML::Tree to parse and manipulate the contents. A more modern approach might be to use one of the WWW::Mechanize::xxx suite of modules such as WWW::Mechanize::Firefox.
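
    For illustration, the old school route could look roughly like the sketch below. It assumes HTML::TreeBuilder (part of the HTML::Tree distribution) is installed. The helper name wrapper_texts, its prefix parameter, and the sample markup are made up for this example; the sample is modelled on the div in the question. With no arguments the script parses the sample, and with a URL as its first argument it fetches that page instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use LWP::Simple qw(get);
use HTML::TreeBuilder;

# Return the text of every <div> whose class attribute starts with $prefix.
# (wrapper_texts and $prefix are hypothetical names used for illustration.)
sub wrapper_texts {
    my ( $html, $prefix ) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my @texts = map { $_->as_text }
        $tree->look_down( _tag => 'div', class => qr/^\Q$prefix\E/ );
    $tree->delete;    # release the parsed tree
    return @texts;
}

# Sample markup modelled on the question; pass a URL as the first
# argument to fetch a live page instead.
my $sample = '<div class="grouped-item product-purchase-wrapper-1">One</div>'
           . '<div class="grouped-item product-purchase-wrapper-7117">7117</div>';

my $html = @ARGV ? get( $ARGV[0] ) : $sample;
die "Could not fetch $ARGV[0]\n" unless defined $html;

say for wrapper_texts( $html, 'grouped-item product-purchase-wrapper-' );
```

    Run against the sample markup, this prints One and 7117 on separate lines. Keeping the parsing in its own sub means the same code works whether the HTML comes from LWP::Simple, a saved file, or some other fetcher.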

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

      "...such as WWW::Mechanize::Firefox."

      Just an FYI, this has been difficult for some time due to the dependency on an old, unsupported Firefox. See Important Notice.

        Yes, changes to Firefox caused problems! I use some Firefox plug-ins that aren't compatible with current releases (like SQLite Mgr). So I installed Waterfox and use it to run these older Firefox plug-ins. Whether or not Mechanize::Firefox would work with Waterfox, I don't know. It might. A more promising avenue is probably Mechanize::Chrome. I am having trouble getting that critter built so I can't say how well it works or doesn't work.

        Update: The default cpan install of version 0.60 hangs on one test. However, Corion helped me, and running
        cpan CORION/WWW-Mechanize-Chrome-0.61.tar.gz appears to work to force a 0.61 installation. Some test cases don't pass, but I don't think that matters. At least installation of version 0.61 does not "hang", and it gets back to a command prompt.

        As another addendum, I don't recommend Waterfox for browsing. I currently just use it to run an SQLite tool that I like and know how to use. I've never bothered to fiddle with the command line tool for SQLite. The GUI tool gives the info that I need and I can test out SQL commands. Once I have a working SQLite command, implementing that in Perl code is fairly straightforward - the magic of the Perl DBI!

Re: need help determining which web browsing module to use
by Marshall (Canon) on Nov 14, 2020 at 19:16 UTC
    I am certainly no expert on HTML, and I am not sure exactly what you have; one line doesn't tell me much. Over the years I've written a few web scrapers with LWP and a couple with WWW::Mechanize. As long as the webpage is serving up just HTML instead of JavaScript, you can use the base WWW::Mechanize module. That's been the case so far in my current applications. If the webpage requires executing JavaScript code, then Perl cannot do that alone, and you will need WWW::Mechanize::Chrome or similar. Perl then controls the browser and has the browser execute the JavaScript code; Mechanize sees the result of what the browser's JavaScript code did.
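
    One rough way to tell which case applies is to fetch the page with the base WWW::Mechanize module and check whether the text you are after is already in the HTML as served. The sketch below assumes that approach. The helper name in_static_html and the default marker string are made up for illustration, and the URL comes from the command line, so nothing is fetched unless you supply one.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use WWW::Mechanize;

# True if $marker occurs in the HTML as served, before any JavaScript runs.
# (in_static_html is a hypothetical helper name.)
sub in_static_html {
    my ( $html, $marker ) = @_;
    return index( $html, $marker ) >= 0;
}

# Usage: script.pl URL [MARKER]
my ( $url, $marker ) = @ARGV;
if ( defined $url ) {
    $marker //= 'product-purchase-wrapper-';    # class fragment from the question
    my $mech = WWW::Mechanize->new( autocheck => 1 );    # die on HTTP errors
    $mech->get($url);
    say in_static_html( $mech->content, $marker )
        ? 'Marker found: the base WWW::Mechanize module is probably enough'
        : 'Marker missing: the content is probably added by JavaScript';
}
```

    A missing marker is only a hint, since the site may also serve different HTML to scripts than to browsers.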

    I would start by reading Cpan Mech Docs and then take a look at some Mech examples. Then I would start "hacking" and experimenting and see how far you can get with the base Mech module. If you are using a public, heavily trafficked web site, then show us the URL.

    Also be aware of the potential impact that your code could have on the target web site. I have one application that "beats up" one web site pretty good, but I have an agreement with the site owner about what hours and what days my application can run. This is an important consideration if you are going to retrieve a lot of data.
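
    If you have no such agreement, one option for going easy on a site is LWP::RobotUA from libwww-perl, which consults the site's robots.txt and enforces a minimum delay between requests to the same host. A minimal sketch follows; the agent name, contact address, and the 10 second delay are placeholder assumptions, and the URLs come from the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;

# Agent name and contact address are placeholders; substitute your own.
my $ua = LWP::RobotUA->new( 'my-scraper/0.1', 'me@example.com' );

# Minimum gap between requests to the same host, in minutes (10 seconds here).
$ua->delay( 10 / 60 );

# URLs come from the command line; with none given, nothing is fetched.
for my $url (@ARGV) {
    my $res = $ua->get($url);    # sleeps as needed and honours robots.txt
    printf "%s %s\n", $res->code, $url;
}
```

    The delay is only a floor. An arrangement with the site owner about hours and days, as described above, still has to be respected by whatever schedules the script.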

    Update: s/Java/Javascript/; #Completely different things!

      'Perl controls the browser and has the browser execute the Java code'

      JavaScript, not Java.

        Thanks for the correction!
Re: need help determining which web browsing module to use
by perlfan (Vicar) on Nov 19, 2020 at 06:05 UTC
