I am sure this is easy for you monks, but this new guy would love your help. I need to read a set of web pages, extract the content of one element from each (the <div class="main-content"> in the code below), find all relative links and replace them with fully qualified links (i.e. add the domain to the link), and finally save the result to a file. Thanks in advance.
use feature 'say'; # a better "print"
use Mojo::UserAgent; # provides the user agent created below
######################################################
my $insert_str = "https://www.somesite.com";
#get the pages to fetch from the links.txt file
open (LINK, '<', 'links.txt') or die "couldn't open links.txt: $!";
my $ua = Mojo::UserAgent->new;
#loop through all of the urls
while (my $record = <LINK>) {
chomp $record; # drop the trailing newline so the URL is clean
say "Getting web site info for: $record";
#determine the new file name by the subdirectory / path since all fetched pages will be index.html
$newFileName = (substr $record, (rindex($record, "/", (rindex($record, "/") -1)) + 1), (rindex($record, "/") - rindex($record, "/", (rindex($record, "/") -1)) -1)) . '.html';
print("Should save the information to a new file as $newFileName\n");
#get the page contents
my $response = $ua->get($record)->res; # keep the response object; is_success lives here, not on the DOM
if ($response->is_success) {
#Find the <div class="main-content">
my $content = $response->dom->at('.main-content');
#TODO Replace all of the links with fully qualified url's
#TODO Save the master_content to a file with the same file name
}
# else {
# die $response->status_line;
# #TODO Send an email to admin letting them know of the issue
# }
#end of while loop
}
close(LINK);
Re: Web Scraping with Find / Replace (Mojo::DOM)
by beech (Parson) on Dec 01, 2016 at 22:19 UTC
Hi,
#TODO Replace all of the links with fully qualified url's
#TODO Save the master_content to a file with the same file name
Here you go:
use Path::Tiny qw/ path /;
path( $newFileName )->spew_utf8( qq{<base href="$insert_str">}, $content );
You might need to html-escape $insert_str ... could use Mojo for that part
$ perl -Mojo -e " $dom = x(q{<base>}); $dom->at(q{base})->attr(qw{href http://example.com/?&}); print $dom "
<base href="http://example.com/?&amp;">
See Path::Tiny, https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base, https://metacpan.org/pod/ojo#x
Thank you for your response. Unfortunately I do not understand your approach or how to include it in my script.
Well,
If you add a base tag to the HTML content, then there is no need to rewrite relative links into absolute links; it's a shortcut provided by HTML: the browser resolves every relative link against the base href.
The spew part of the code does that, using a helper module (Path::Tiny) to create the file.
The second part shows creating/modifying a base tag with Mojo, which will HTML-escape the URL.
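To make the two parts concrete, here is a minimal sketch of how they could fit together inside the if block of your loop. The URL and the inline HTML are made-up stand-ins for $insert_str and the fetched page, so treat it as an illustration rather than a drop-in patch:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical stand-ins for the script's $insert_str and the fetched page.
my $insert_str = 'https://www.somesite.com/?a=1&b=2';
my $page = Mojo::DOM->new(
    '<div class="main-content"><a href="/about/">About</a></div>');

# Same step as in the loop: pick out the main-content element.
my $content = $page->at('.main-content');

# Build the <base> tag through Mojo::DOM so the URL gets HTML-escaped.
my $base = Mojo::DOM->new('<base>');
$base->at('base')->attr(href => $insert_str);

# Stringify both pieces; this string is what would be saved to the file.
my $html = "$base\n$content\n";
print $html;
```

In the real script the last step would be path( $newFileName )->spew_utf8( $html ) instead of print, and $content would come from the response you already fetched.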
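If you would rather rewrite each link than rely on a base tag (which is what the original TODO asked for), a different technique is to resolve every href against the site URL with Mojo::URL. This is only a sketch: the inline HTML and other.example are invented for the demonstration, and it handles href attributes only, so src attributes on images would need the same treatment.

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::URL;

my $insert_str = 'https://www.somesite.com';   # same role as in the script
my $base       = Mojo::URL->new("$insert_str/");

# Made-up page content standing in for a fetched response.
my $page = Mojo::DOM->new(<<'HTML');
<div class="main-content">
<a href="/about/">About</a>
<a href="https://other.example/x">External</a>
</div>
HTML

my $content = $page->at('.main-content');

# Visit every element that carries an href and resolve it against the base.
# URLs that are already absolute come back unchanged from to_abs.
$content->find('[href]')->each(sub {
    my $el  = shift;
    my $abs = Mojo::URL->new($el->attr('href'))->to_abs($base);
    $el->attr(href => $abs->to_string);
});

print "$content\n";
```

After this runs, "$content" can be written to the file directly and no base tag is needed.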