PerlMonks  

Web Scraping with Find / Replace

by sjfranzen (Initiate)
on Dec 01, 2016 at 22:00 UTC ( [id://1177088]=perlquestion )

sjfranzen has asked for the wisdom of the Perl Monks concerning the following question:

I am sure this is easy for you monks, but this new guy would love your help. I need to read a set of web pages, extract the content of a <div class="main-content"> element, then find all relative links and replace them with fully qualified links (i.e. add the domain to the link), and finally save the result to a file. Thanks in advance.
use strict;
use warnings;
use feature 'say';    # a better "print"
use Mojo::UserAgent;

######################################################
my $insert_str = "https://www.somesite.com";

# get the pages to fetch from the links.txt file
open(my $links, '<', 'links.txt') or die "couldn't open the file: $!";

my $ua = Mojo::UserAgent->new;

# loop through all of the urls
while (my $record = <$links>) {
    chomp $record;    # strip the trailing newline before using the URL
    next unless length $record;
    say "Getting web site info for: $record";

    # determine the new file name by the subdirectory / path since all
    # fetched pages will be index.html
    my $newFileName = (substr $record, (rindex($record, "/", (rindex($record, "/") - 1)) + 1), (rindex($record, "/") - rindex($record, "/", (rindex($record, "/") - 1)) - 1)) . '.html';
    say "Should save the information to a new file as $newFileName";

    # get the page contents (is_success lives on the response, not on the DOM)
    my $response = $ua->get($record)->res;
    if ($response->is_success) {
        # Find the <div class="main-content">
        my $content = $response->dom->at('.main-content');
        # TODO Replace all of the links with fully qualified url's
        # TODO Save the master_content to a file with the same file name
    }
    # else {
    #     die $response->status_line;
    #     # TODO Send an email to admin letting them know of the issue
    # }
}    # end of while loop
close($links);
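The first TODO in the script (replacing relative links with fully qualified ones) could be approached roughly as below. This is a minimal sketch, not the poster's code: it assumes Mojolicious is installed, `absolutize_links` is a hypothetical helper name, and the sample fragment and base URL are made up for illustration. It relies on `Mojo::URL`'s `to_abs` to resolve each `href` or `src` against the page URL.

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::URL;

# absolutize_links() is a hypothetical helper, not part of Mojolicious.
# It resolves every href/src attribute in an HTML fragment against $base.
sub absolutize_links {
    my ($html, $base) = @_;
    my $dom      = Mojo::DOM->new($html);
    my $base_url = Mojo::URL->new($base);

    for my $attr (qw(href src)) {
        $dom->find("[$attr]")->each(sub {
            my $el  = shift;
            my $old = $el->attr($attr) // '';
            # to_abs returns an unchanged clone for URLs that are already absolute
            my $abs = Mojo::URL->new($old)->to_abs($base_url);
            $el->attr($attr => $abs->to_string);
        });
    }
    return $dom->to_string;
}

# Illustrative input only; the real fragment would come from at('.main-content')
my $fragment = '<div class="main-content">'
             . '<a href="/about/">About</a> '
             . '<a href="https://example.org/x">Elsewhere</a>'
             . '</div>';
print absolutize_links($fragment, 'https://www.somesite.com/products/'), "\n";
```

Passing the URL of the fetched page as the base, rather than only the domain, lets path-relative links such as `page2.html` resolve correctly as well.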

Replies are listed 'Best First'.
Re: Web Scraping with Find / Replace (Mojo::DOM)
by beech (Parson) on Dec 01, 2016 at 22:19 UTC
      Thank you for your response. Unfortunately I do not understand your approach or how to include it in my script.

        Well,

        If you add a base tag to the HTML content, there is no need to rewrite relative links into absolute links; it's a shortcut provided by HTML.

        The spew part of the code does that with a helper module for creating a file

        The second part shows creating/modifying a base tag with Mojo, which will HTML-escape the URL.
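The base tag idea described above might look something like the following. This is a sketch under assumptions, not the code from the original reply: it assumes Mojolicious is installed, `wrap_with_base` is a hypothetical helper name, the minimal page skeleton is invented for illustration, and a plain `open`/`print` to a temporary file stands in for the file helper module mentioned in the reply.

```perl
use strict;
use warnings;
use Mojo::DOM;
use File::Temp qw(tempfile);

# wrap_with_base() is a hypothetical helper: it places an extracted fragment
# inside a minimal page whose <base> tag points at the original site, so the
# browser resolves relative links without any rewriting.
sub wrap_with_base {
    my ($fragment, $base) = @_;
    my $dom = Mojo::DOM->new(
        '<!DOCTYPE html><html><head><base></head><body></body></html>');
    # Attribute values are escaped by Mojo::DOM when the tree is rendered
    $dom->at('base')->attr(href => $base);
    $dom->at('body')->content($fragment);
    return $dom->to_string;
}

# Illustrative input only
my $page = wrap_with_base(
    '<div class="main-content"><a href="/about/">About</a></div>',
    'https://www.somesite.com/',
);

# Save the result; a temporary file is used here so the sketch has no side effects
my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
binmode $fh, ':encoding(UTF-8)';
print {$fh} $page;
close $fh or die "couldn't close $path: $!";
print "Saved ", length($page), " characters to $path\n";
```

In the original script the temporary file would be replaced by the name computed in `$newFileName`, and the fragment by the element found with `at('.main-content')`.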
