PerlMonks  

Scraping a website - Unterminated quoted string

by Staralfur (Novice)
on May 04, 2017 at 18:41 UTC ( [id://1189518]=perlquestion )

Staralfur has asked for the wisdom of the Perl Monks concerning the following question:

#!/usr/bin/perl -w
use strict;
use warnings;

my $url  = "https://deutsch.rt.com/inland/";
my $path = "/home/me/RT/rt_scrape.html";
open PATH, "> $path" or die $!;
my $counter;

while (defined $url) {
    my %hash;
    my $html = qx(curl "$url");
    undef($url);
    if ($html =~ m/href="(\/listing\/category.inland\/prepare\/last-news\/\d+">)Weiter/) {
        $url = "www.deutsch.rt.com" . $1;
    }
    my @row = split(/\n/,$html);
    foreach (@row){
        if ($html =~ m/href="(\/inland\/\d+\-.*?\/" class="cover__link link ">)/g) {
            my $articles_url = "https://deutsch.rt.com/" . $1;
            $hash{$articles_url} = 1;
        }
    }
    foreach (keys %hash){
        $counter++;
        my $article = qx(curl $_);
        open PATH, "> /home/me/RT/$counter.txt" or die $!;
        print "\n\tFetching\n$_\n";
        print PATH "$article";
        close PATH;
    }
}

I am a Perl newbie and I am trying to scrape the archive with all of the inland articles. Some articles are downloaded, but the files are empty.

The error message in the terminal is:

    Fetching
    https://deutsch.rt.com//inland/50091-bundesregierung-giftgasvorwurfe-assad-saudi-arabien/" class="cover__link link ">
    sh: 1: Syntax error: Unterminated quoted string

So my script starts downloading the articles, but then something happens.. Can you please help me to write a functional script?

Replies are listed 'Best First'.
Re: Scraping a website - Unterminated quoted string
by huck (Prior) on May 04, 2017 at 18:55 UTC

    In this line

    if ($html =~ m/href="(\/listing\/category.inland\/prepare\/last-news\/\d+">)Weiter/) {

    your capture puts text into $1 that is not part of the URL, including a double quote. That stray quote then reaches the shell through qx(curl ...), which is why sh reports an unterminated quoted string. Try this:

    if ($html =~ m/href="(\/listing\/category.inland\/prepare\/last-news\/\d+)">Weiter/) {

    The same applies at

    if ($html =~ m/href="(\/inland\/\d+\-.*?\/" class="cover__link link ">)/g)

    where you could try

    if ($html =~ m/href="(\/inland\/\d+\-.*?\/)" class="cover__link link ">/g)
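    The effect of moving the closing parenthesis can be checked offline. The following is a minimal sketch, assuming the page contains an anchor like the sample string below (the sample markup is reconstructed from the error message and is an assumption, not a copy of the live page):

```perl
use strict;
use warnings;

# Assumed sample of the markup, reconstructed from the error message.
my $html = '<a href="/inland/50091-bundesregierung-giftgasvorwurfe-assad-saudi-arabien/" class="cover__link link ">';

# Original pattern: the capture group closes after the attribute text.
my ($bad)  = $html =~ m/href="(\/inland\/\d+\-.*?\/" class="cover__link link ">)/;

# Corrected pattern: the capture group closes before the closing double quote.
my ($good) = $html =~ m/href="(\/inland\/\d+\-.*?\/)" class="cover__link link ">/;

# Count the double quotes that the original capture would hand to the shell.
my $quotes = () = $bad =~ /"/g;

print "bad:    $bad\n";
print "good:   $good\n";
print "quotes: $quotes\n";
```

    An odd number of double quotes in the captured text is what leaves the shell with an unterminated string once the text is interpolated into qx(curl ...).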

Re: Scraping a website - Unterminated quoted string
by kennethk (Abbot) on May 04, 2017 at 21:59 UTC
    In general, parsing HTML from the wild using regular expressions is an exercise in frustration. I'd highly recommend pulling down HTML::Tree.

    Also, there's no real reason to shell out to curl. I use LWP::UserAgent, though for low barrier to entry you may prefer LWP::Simple.
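    The two suggestions above could be combined roughly as follows. This is only a sketch, assuming LWP::UserAgent, HTML::TreeBuilder (from the HTML::Tree distribution) and URI are installed from CPAN; the sub names fetch_page and article_links are made up for illustration, and the cover__link class is taken from the original regex rather than verified against the live site:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;
use URI;

# Fetch a page and return its decoded body, or die with the HTTP status line.
sub fetch_page {
    my ($ua, $url) = @_;
    my $res = $ua->get($url);
    die "GET $url failed: ", $res->status_line, "\n" unless $res->is_success;
    return $res->decoded_content;
}

# Return the absolute URLs of all <a> elements whose class contains cover__link.
sub article_links {
    my ($html, $base) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my %seen;
    for my $link ($tree->look_down(_tag => 'a', class => qr/\bcover__link\b/)) {
        my $href = $link->attr('href') or next;
        # Resolve relative hrefs against the page URL instead of concatenating strings.
        $seen{ URI->new_abs($href, $base)->as_string } = 1;
    }
    $tree->delete;    # free the parse tree
    return sort keys %seen;
}

# Touch the network only when a URL is given on the command line.
if (@ARGV) {
    my $ua = LWP::UserAgent->new(timeout => 20);
    print "$_\n" for article_links(fetch_page($ua, $ARGV[0]), $ARGV[0]);
}
```

    Because the href is read from the parsed attribute, no quotes or class text can leak into the URL, and no shell is involved in the download.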

    With regard to your output, lexical file handles and three-argument open would be better practice (there is nothing wrong with what you are doing per se). So you might replace

    open PATH, "> /home/me/RT/$counter.txt" or die $!;
    print "\n\tFetching\n$_\n";
    print PATH "$article";
    close PATH;

    with

    open my $path, '>', "/home/me/RT/$counter.txt" or die $!;
    print "\n\tFetching\n$_\n";
    print $path $article;

    The file is closed automatically when the handle goes out of scope, the three-argument form avoids some escaping issues in the file name that the two-argument form cannot, and there is no need to quote the article content before printing it.
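    As a self-contained illustration of the same idea, here is a small sketch. The sub name save_article is hypothetical, a temporary directory stands in for /home/me/RT, and the explicit close with an error check is an addition that the advice above does not require:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Write $content to $file_path using three-argument open and a lexical handle.
sub save_article {
    my ($file_path, $content) = @_;
    open my $fh, '>', $file_path
        or die "Unable to open [$file_path]: $!";
    print {$fh} $content;
    close $fh
        or die "Unable to close [$file_path]: $!";
    return $file_path;
}

# Usage: a temporary directory stands in for the real output directory.
my $dir     = tempdir(CLEANUP => 1);
my $counter = 1;
my $saved   = save_article(
    File::Spec->catfile($dir, "$counter.txt"),
    "<html>example</html>\n",
);
print "Saved ", (-s $saved), " bytes to $saved\n";
```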

    Lastly, if you are scraping, you may be violating terms of service, so please check on that for the site you are accessing. At the least, you should put a sleep in there to be polite.
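    One way to add the pause is to sleep between requests rather than before the first one. A rough sketch, assuming a one-second delay is acceptable for the site (the delay, the sub name fetch_politely and the example.com URLs are all placeholders):

```perl
use strict;
use warnings;

# Call $fetch->($url) for each URL, sleeping $delay seconds between requests.
sub fetch_politely {
    my ($urls, $fetch, $delay) = @_;
    my %page;
    my $first = 1;
    for my $url (@$urls) {
        sleep $delay unless $first;    # no need to wait before the first request
        $first = 0;
        $page{$url} = $fetch->($url);
    }
    return \%page;
}

# Usage with a stand-in fetcher; a real script would call its HTTP client here.
my $pages = fetch_politely(
    [ 'https://example.com/a', 'https://example.com/b' ],
    sub { my ($url) = @_; return "content of $url" },
    1,
);
print "$_ => $pages->{$_}\n" for sort keys %$pages;
```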


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: Scraping a website - Unterminated quoted string
by Discipulus (Canon) on May 05, 2017 at 07:50 UTC
    hello Staralfur and welcome to the monastery and to the wonderful world of Perl

    Since you are a newbie, permit me to suggest something. First, as already said, get into the habit of using the three-argument form of open with lexical filehandles: open my $fh, '<', $file_path or die "Unable to open [$file_path] $!". If you keep the path in a variable such as $file_path, you can also print it in the die message, using square brackets to make sure there are no typos or stray whitespace in it. In addition to $! you might also want to print $^E, the extended OS error. See both in perlvar.

    Now about your script: this is not scraping, it is... curling ;=)

    Scraping the web is a black art, and I'm still a newbie at it, but besides basic tasks accomplished via LWP::UserAgent you can use App::scrape (fixed link thanks to kennethk) by our dear brother Corion, or Web::Scraper by Miyagawa, the genial author of Plack / PSGI.

    You can read about Perl web scraping at my homenode, in the scraping link section.

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

Node Type: perlquestion [id://1189518]
Approved by davies
Front-paged by Corion