PerlMonks

Question regarding web scraping

by Lisa1993 (Acolyte)
on Oct 22, 2016 at 13:52 UTC

Lisa1993 has asked for the wisdom of the Perl Monks concerning the following question:

Hello everybody,

I am very new to Perl (I have only been using it intermittently for a couple of weeks), and I'm having a problem with a very basic web-scraping programme.

To cut a long story short, I am taking a university module where the instructor has told us to scrape comments from the website Reddit, so that we can analyse the comments using computational linguistics methods.

He has given us the code below to use. The goal is to get the programme to search for every comment on the page and then to save the comments only in an html file.
    use LWP::Simple;

    $URL = 'https://www.reddit.com/r/unitedkingdom/comments/58m2hs/i_daniel_blake_is_released_today/';
    $CONTENT = get($URL);

    while ($CONTENT =~ <div class=\"usertext-body may-blank-within md-container \"><div class=\"md\">(.+?)<\/div> <\/div><\/form><ul class=\"flat-list buttons\"> //gs ) {
        $x = "$1";
        $y = "$y $x";
    }

    open(OUT, "">C:/Users/user/perl_tests/reddittest.txt");
    print OUT "$y";
    close(OUT);
However, when I try to run the code (from the Windows command prompt) I get the following error message:
    Microsoft Windows [Version 6.1.7601]
    Copyright (c) 2009 Microsoft Corporation. All rights reserved.

    C:\Users\user>perl perl_tests/red1.pl
    Scalar found where operator expected at perl_tests/red1.pl line 12, near "$x = " $1"
      (Might be a runaway multi-line "" string starting on line 7)
      (Missing operator before $1?)
    String found where operator expected at perl_tests/red1.pl line 12, near "$y = " "
      (Missing semicolon on previous line?)
    Scalar found where operator expected at perl_tests/red1.pl line 12, near "$y $x"
      (Missing operator before $x?)
    String found where operator expected at perl_tests/red1.pl line 12, near "open(OUT, ""
      (Missing semicolon on previous line?)
    String found where operator expected at perl_tests/red1.pl line 15, near "open(OUT, "">C:/Users/user/perl_tests/reddittest.txt""
    Can't modify numeric lt (<) in scalar assignment at perl_tests/red1.pl line 12, near "$x = "$1"
    syntax error at perl_tests/red1.pl line 12, near "$x = "$1"
    Execution of perl_tests/red1.pl aborted due to compilation errors.
I know that there is probably a very "basic" error in the code, but I'm too inexperienced as of yet to know how to fix it. Any help would be greatly appreciated. Thank you!

Replies are listed 'Best First'.
Re: Question regarding web scraping
by hippo (Bishop) on Oct 22, 2016 at 14:37 UTC
    while ($CONTENT =~ <div class=\"usertext-body may-blank-within md-container \"><div class=\"md\">(.+?)<\/div> <\/div><\/form><ul class=\"flat-list buttons\"> //gs )

    This line gives an error. The thing to the right side of the =~ should be a match (a regular expression) but it is not. Perhaps move one of the terminating slashes to the beginning of the expression instead of having 2 at the end.
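    To make that concrete, here is a minimal, self-contained illustration with made-up HTML in place of the downloaded page. Once the opening slash is in place, the right-hand side of =~ is a valid match operator:

```perl
use strict;
use warnings;

# Toy HTML standing in for the fetched page (hypothetical content).
my $content = '<div class="md">first</div> junk <div class="md">second</div>';

# With the opening slash present, this is a valid match; /g finds
# every occurrence and /s lets . match across newlines.
my @comments;
while ( $content =~ /<div class="md">(.+?)<\/div>/gs ) {
    push @comments, $1;
}
print join( "\n", @comments ), "\n";
```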

    Addendum: This line:

    open(OUT, "">C:/Users/user/perl_tests/reddittest.txt");

    is also badly formed. See open for examples of how to call this function correctly.
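    A short sketch of the recommended three-argument form, with a lexical filehandle, an explicit '>' mode, and an "or die" check (the filename here is just for illustration):

```perl
use strict;
use warnings;

my $path = 'reddittest.txt';    # hypothetical output path

# Three-argument open: mode and path are separate arguments, and the
# lexical filehandle closes automatically when it goes out of scope.
open( my $out, '>', $path ) or die "Cannot open $path: $!";
print {$out} "scraped text would go here\n";
close($out) or die "Cannot close $path: $!";
```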

      Thank you very much!
Re: Question regarding web scraping
by Corion (Patriarch) on Oct 22, 2016 at 14:39 UTC

    The following is a malformed regular expression:

    while ($CONTENT =~ <div class=\"usertext-body may-blank-within md-container \"><div class=\"md\">(.+?)<\/div><\/div><\/form><ul class=\"flat-list buttons\"> //gs )

    It is at least missing the opening m/ (or bare /) of the match operator.

    Personally, I suggest that you do the content extraction by using HTML::TreeBuilder and XPath or CSS selectors (via HTML::TreeBuilder::XPath and HTML::Selector::XPath).

    Also note that Reddit has an API available, so you maybe don't need to scrape at all but can get the comments in a machine readable format directly.

    Also note that on CPAN, there are many Reddit modules available, and it seems that Reddit::Client is using the Reddit API.

      Thank you very much! I will look into these alternatives. Thanks again for your suggestions.
Re: Question regarding web scraping
by Athanasius (Archbishop) on Oct 22, 2016 at 14:58 UTC

    Hello Lisa1993, and welcome to the Monastery!

    As Corion says, you’d be better off using a dedicated module to extract the HTML you want. But in the meantime...

    hippo has identified the syntax errors in your code. But, even when these are fixed, the regular expression won’t match any of the content on the web page in question. To get it to match, I had to tweak it in two places:

    use strict;
    use warnings;
    use LWP::Simple;

    my $URL = 'https://www.reddit.com/r/unitedkingdom/comments/58m2hs/'
            . 'i_daniel_blake_is_released_today/';
    my $CONTENT = get($URL);

    my $regex = '<div class="usertext-body may-blank-within md-container ">'
              . '<div class="md">(.+?)</div>\s*</div>'
              . '</form><ul class="flat-list buttons">';

    my $x     = '';
    my $count = 0;

    while ($CONTENT =~ m{$regex}gs) {
        $x .= $1;
        ++$count;
    }

    print $x;
    print "Count: $count\n";

    First, you need to allow for (optional) whitespace between the two closing </div> tags. Second, you need to remove the space at the end of the regex. Also note that it isn’t necessary to escape the quotation character, and you can avoid escaping forward slashes by changing the regex delimiter (as shown above).
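    A small self-contained sketch of those two points, using made-up HTML fragments: the m{...} delimiters mean the slashes in </div> need no escaping, and \s* tolerates the optional whitespace between the two closing tags:

```perl
use strict;
use warnings;

# Two made-up fragments: with and without a space between the closing tags.
my @fragments = (
    '<div class="md">hello</div> </div>',
    '<div class="md">world</div></div>',
);

my @got;
for my $html (@fragments) {
    # \s* matches zero or more whitespace characters, so both
    # fragments match the same pattern.
    if ( $html =~ m{<div class="md">(.+?)</div>\s*</div>} ) {
        push @got, $1;
    }
}
print "@got\n";
```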

    With these changes, I get 80 matches.

    Also note the inclusion of use strict and use warnings, and the declaration of variables using my. This is basic good practice in Perl.

    Hope that helps,

    Athanasius

      That's brilliant! You've made my day! Thank you very much.

      Can I just ask too, is there any way to run the script for multiple URLs at once? Or would I need a more complicated programme for that?

      Thanks again!

        It's trivial: wrap part of your code in a for() loop, and turn the single scalar $URL into an array, @URLS, that contains a list of URLs instead. The for() loop iterates over this list. Note that this assumes the regex is the same for all URLs. Untested:

        use strict;
        use warnings;
        use LWP::Simple;

        my @URLS = qw(
            http://one.example.com
            http://two.example.com
            http://three.example.com
        );

        my $regex = '<div class="usertext-body may-blank-within md-container ">'
                  . '<div class="md">(.+?)</div>\s*</div>'
                  . '</form><ul class="flat-list buttons">';

        for my $URL (@URLS) {
            my $CONTENT = get($URL);
            my $x     = '';
            my $count = 0;

            while ($CONTENT =~ m{$regex}gs) {
                $x .= $1;
                ++$count;
            }

            print "---$URL---\n";
            print $x;
            print "Count: $count\n";
        }
Re: Question regarding web scraping
by marto (Cardinal) on Oct 23, 2016 at 08:29 UTC

    An alternative would be to request a JSON object rather than the rendered web page, do this by appending .json to the end of your URL like so:

    https://www.reddit.com/r/unitedkingdom/comments/58m2hs/i_daniel_blake_is_released_today/.json

    I've no idea what tool you're using further down the line for analysis, but HTML seems like an odd format in which to store such data. Here is a short example that simply prints the name of the poster and the comment:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Mojo::UserAgent;

    my $url = 'https://www.reddit.com/r/unitedkingdom/comments/58m2hs/i_daniel_blake_is_released_today/.json';

    my $ua   = Mojo::UserAgent->new;
    my $data = $ua->get( $url )->res->json;

    foreach my $comment ( @{$data} ) {
        foreach my $child ( @{ $comment->{'data'}->{'children'} } ) {
            print $child->{'data'}->{'author'} . " posted:" . $/;
            print $child->{'data'}->{'body'} . "\n" if ( $child->{'data'}->{'body'} );
        }
    }

    You'll need the Mojo::UserAgent module:

    # install via cpan
    cpan Mojo::UserAgent
    # or cpanm
    cpanm Mojo::UserAgent

    From the brief example above you can see how to get just what you want, or add some other bells and whistles. The example isn't particularly pretty in its output; I'll leave that as an exercise for you. You can examine the JSON in the browser (some plugins exist to prettify the content) or you can use something like json_pp to print it from the command line.
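    If you'd rather experiment with only core modules first, JSON::PP (which ships with Perl and is the parser behind json_pp) can decode the same structure. A minimal sketch using a hand-made snippet shaped like a tiny part of Reddit's comment listing:

```perl
use strict;
use warnings;
use JSON::PP;

# Hand-made JSON shaped like (a tiny part of) Reddit's comment listing.
my $json = '[{"data":{"children":[{"data":{"author":"someone","body":"a comment"}}]}}]';

my $data = JSON::PP->new->decode($json);

# Walk the listing exactly as in the Mojo::UserAgent example above.
for my $listing ( @{$data} ) {
    for my $child ( @{ $listing->{data}{children} } ) {
        print "$child->{data}{author}: $child->{data}{body}\n";
    }
}
```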

    Update: So I read some other comments you made; if you're trying to do this for various sub-reddits you can easily adapt the above example to:

    • For each sub reddit url (append .json)
    • Get each thread
    • Follow the existing code to print comments (or save to a file)
    • sleep for a few seconds...
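    A rough, untested sketch of that outer loop (the subreddit URLs are placeholders, and the actual fetch is left as a comment):

```perl
use strict;
use warnings;

# Placeholder subreddit URLs -- substitute your own list.
my @subreddits = (
    'https://www.reddit.com/r/unitedkingdom',
    'https://www.reddit.com/r/perl/',
);

# Append .json as described above, ensuring exactly one trailing slash.
sub json_url {
    my ($url) = @_;
    $url .= '/' unless $url =~ m{/\z};
    return $url . '.json';
}

for my $subreddit (@subreddits) {
    my $url = json_url($subreddit);
    print "would fetch: $url\n";
    # ...fetch $url with Mojo::UserAgent, walk each thread, and print
    # or save the comments as in the example above, then:
    # sleep 3;    # be polite between requests
}
```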
