Web Scraping with Find / Replace

sjfranzen has asked for the wisdom of the Perl Monks concerning the following question:

I am sure this is easy for you monks, but this new guy would love your help. I need to read a set of web pages, extract the content of the <div class="main-content"> element, then find all relative links and replace them with fully qualified links (i.e. add the domain to the link), and finally save the result to a file. Thanks in advance.


use strict;
use warnings;
use feature 'say'; # a better "print"
use Mojo::UserAgent;

######################################################
my $insert_str = "https://www.somesite.com";

# get the pages to fetch from the links.txt file
# ("or" rather than "||", so that die applies to the open call)
open(my $links, '<', 'links.txt') or die "couldn't open the file: $!";

my $ua = Mojo::UserAgent->new;

# loop through all of the urls
while (my $record = <$links>) {
  chomp $record;               # drop the trailing newline so the URL is valid
  next unless length $record;  # skip blank lines
  say "Getting web site info for: $record";

  # determine the new file name by the subdirectory / path since all fetched pages will be index.html
  # (the text between the last two slashes of the URL)
  my $last = rindex($record, "/");
  my $prev = rindex($record, "/", $last - 1);
  my $newFileName = substr($record, $prev + 1, $last - $prev - 1) . '.html';

  say "Should save the information to a new file as $newFileName";

  # get the page contents (the response object, not the DOM, has is_success)
  my $response = $ua->get($record)->res;

  if ($response->is_success) {
    # Find the <div class="main-content">
    my $content = $response->dom->at('.main-content');

    #TODO Replace all of the links with fully qualified url's

    #TODO Save the master_content to a file with the same file name

  }
  # else {
  #      die $response->code . ' ' . $response->message;
  #      #TODO Send an email to admin letting them know of the issue
  #  }
  # end of while loop
}
close($links);
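
The file name derivation in the script above takes the text between the last two slashes of the URL. A minimal sketch of the same idea using Mojo::URL's parsed path, in case it is easier to read; `file_name_for` is a hypothetical helper name and the URLs are made up for illustration:

```perl
use strict;
use warnings;
use Mojo::URL;

# Hypothetical helper: name the output file after the last directory
# in the URL path, e.g. ".../docs/intro/index.html" -> "intro.html".
sub file_name_for {
    my ($url) = @_;
    my $path  = Mojo::URL->new($url)->path;
    my @parts = @{ $path->parts };

    # Without a trailing slash the last part is a file name (index.html), so drop it.
    pop @parts unless $path->trailing_slash;

    # Fall back to "index" when the URL has no directory at all (an assumption).
    my $name = @parts ? $parts[-1] : 'index';
    return "$name.html";
}

print file_name_for('https://www.somesite.com/docs/intro/'), "\n";
print file_name_for('https://www.somesite.com/docs/intro/index.html'), "\n";
```

Both calls should give the same name, since the directory is the same in each URL.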

Re: Web Scraping with Find / Replace (Mojo::DOM) by beech (Parson) on Dec 01, 2016 at 22:19 UTC
Hi,

Both TODOs ("Replace all of the links with fully qualified url's" and "Save the master_content to a file with the same file name") are covered by this. Here you go:

`use Path::Tiny qw/ path /; path( $newFileName )->spew_utf8( qq{<base href="$insert_str">}, $content );`

You might need to HTML-escape $insert_str; you could use Mojo for that part:

`$ perl -Mojo -e " $dom = x(q{<base>}); $dom->at(q{base})->attr(qw{href http://example.com/?&}); print $dom "`
`<base href="http://example.com/?&amp;">`

See Path::Tiny, https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base, https://metacpan.org/pod/ojo#x
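
A minimal sketch of how the two pieces above might fit together inside the original loop. It runs on an inline HTML string instead of a fetched page; the markup, the query string in `$insert_str`, and the helper name `with_base` are assumptions made for illustration:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: return the extracted content with a <base> tag in front.
# The tag is built through Mojo::DOM so the href value gets HTML-escaped.
sub with_base {
    my ($insert_str, $content) = @_;
    my $base = Mojo::DOM->new('<base>');
    $base->at('base')->attr(href => $insert_str);
    return "$base$content";
}

# Stand-in for a fetched page (made-up markup).
my $html = '<html><body><div class="main-content">'
         . '<a href="/about">About</a></div></body></html>';
my $content = Mojo::DOM->new($html)->at('.main-content');

print with_base('https://www.somesite.com/?a=1&b=2', $content), "\n";
```

In the script, the returned string could then be handed to `path( $newFileName )->spew_utf8(...)` in place of the hand-written `qq{<base ...>}` string.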
Re^2: Web Scraping with Find / Replace (Mojo::DOM) by sjfranzen (Initiate) on Dec 02, 2016 at 16:25 UTC
Thank you for your response. Unfortunately I do not understand your approach or how to include it in my script.
Re^3: Web Scraping with Find / Replace (Mojo::DOM) by beech (Parson) on Dec 04, 2016 at 20:55 UTC
Well, if you add a base tag to the HTML content, there is no need to rewrite relative links into absolute links; it's a shortcut provided by HTML.

The spew part of the code does that, using a helper module (Path::Tiny) to create the file.

The second part shows creating/modifying a base tag with Mojo, which will HTML-escape the URL.
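
For comparison, here is a minimal sketch of the explicit rewrite the original question asked for, in case a base tag is not an option. It is not part of the reply above. It resolves each `href` and `src` against the URL the page was fetched from, so page-relative links resolve against the page's own directory rather than the site root. `absolutize_links` is a hypothetical helper name, and the markup and URLs are made up:

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::URL;

# Hypothetical helper: rewrite every href/src under $dom to an absolute URL,
# resolved against $page_url (the address the page was fetched from).
sub absolutize_links {
    my ($dom, $page_url) = @_;
    my $base = Mojo::URL->new($page_url);
    for my $attr (qw(href src)) {
        for my $el ($dom->find("[$attr]")->each) {
            my $url = Mojo::URL->new($el->attr($attr));
            next if $url->is_abs;    # already has a scheme, leave it alone
            $el->attr($attr => $url->to_abs($base)->to_string);
        }
    }
    return $dom;
}

my $dom = Mojo::DOM->new(
    '<div class="main-content"><a href="/about">About</a> <img src="pics/logo.png"></div>'
);
print absolutize_links($dom, 'https://www.somesite.com/docs/intro/index.html'), "\n";
```

In the original loop this could be called as `absolutize_links($content, $record)` before saving, assuming `$record` has been chomped.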
