PerlMonks  

In general, parsing HTML from the wild using regular expressions is an exercise in frustration. I'd highly recommend pulling down HTML::Tree.
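To give a feel for it, here is a minimal sketch using HTML::TreeBuilder (part of the HTML::Tree distribution). The sample markup and the "article" class name are made up for illustration; your page will have its own structure.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # ships in the HTML::Tree distribution

# Hypothetical sample markup; a real page would come from your fetch step.
my $html = <<'HTML';
<html><body>
  <div class="article"><p>First "quoted" paragraph.</p><p>Second one.</p></div>
  <div class="sidebar"><p>Not wanted.</p></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Scalar context: look_down returns the first matching element (or undef).
my $article = $tree->look_down(_tag => 'div', class => 'article');

# List context: look_down returns every matching <p> under that element.
my @paragraphs = $article
    ? map { $_->as_text } $article->look_down(_tag => 'p')
    : ();

print "$_\n" for @paragraphs;

$tree->delete;   # free the tree explicitly
```

Note that the quotes inside the paragraph text need no special handling, since the parser deals with the markup and the text never passes through a shell.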

Also, there's no real reason to shell out to curl. I use LWP::UserAgent, though for low barrier to entry you may prefer LWP::Simple.
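A rough sketch of the fetch with LWP::UserAgent might look like the following. The sub name fetch_page, the agent string, and the timeout are my own choices for illustration, not anything from your script.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Fetch a URL and return the decoded body, or die with the HTTP status line.
# fetch_page is a hypothetical helper name.
sub fetch_page {
    my ($url) = @_;
    my $ua = LWP::UserAgent->new(
        timeout => 30,
        agent   => 'my-scraper/0.1',   # hypothetical agent string
    );
    my $response = $ua->get($url);
    die "GET $url failed: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my $html = fetch_page($ARGV[0]);
    printf "Fetched %d characters\n", length $html;
}
```

With LWP::Simple the same fetch shrinks to a single get($url) call, which returns undef on failure, at the cost of losing the status line and the per-request options.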

With regard to your output, lexical file handles and three-argument open would be better practice (there is nothing wrong with what you are doing per se). So you might replace

    open PATH, "> /home/me/RT/$counter.txt" or die $!;
    print "\n\tFetching\n$_\n";
    print PATH "$article";
    close PATH;
with
    open my $path, '>', "/home/me/RT/$counter.txt" or die $!;
    print "\n\tFetching\n$_\n";
    print $path $article;

The file closes automatically when $path goes out of scope, three-argument open avoids some escaping problems in file names that two-argument open cannot handle, and there is no need to quote the article content before printing it.
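If you want to see the scoping behavior for yourself, here is a small self-contained sketch. It writes to a temporary directory as a stand-in for /home/me/RT, and the article text is a placeholder.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

my $dir     = tempdir(CLEANUP => 1);   # stand-in for /home/me/RT
my $counter = 1;
my $article = "Some article text\n";   # placeholder content
my $file    = File::Spec->catfile($dir, "$counter.txt");

{
    open my $path, '>', $file or die "Cannot open $file: $!";
    print {$path} $article;
}   # $path goes out of scope here, so the handle is flushed and closed

# Read the file back to confirm the write completed.
open my $in, '<', $file or die "Cannot open $file: $!";
my $read_back = do { local $/; <$in> };
close $in;

print $read_back;
```

One caveat: the implicit close cannot report errors, so if you care about a full disk or similar, an explicit close $path or die $! is still worth having.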

Lastly, if you are scraping, you may be violating terms of service, so please check on that for the site you are accessing. At the least, you should put a sleep in there to be polite.
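As a sketch of the polite version, something along these lines would pause between requests. The helper name fetch_politely, the one-second delay, and the example.com URLs are all assumptions; the fetcher here is a stand-in so the snippet runs without network access.

```perl
use strict;
use warnings;

# Call $fetch on each URL, sleeping $delay seconds between requests.
# fetch_politely is a hypothetical helper name.
sub fetch_politely {
    my ($fetch, $delay, @urls) = @_;
    my @results;
    for my $i (0 .. $#urls) {
        sleep $delay if $i > 0 && $delay > 0;   # no pause before the first request
        push @results, $fetch->($urls[$i]);
    }
    return @results;
}

# Stand-in fetcher; swap in your real fetch routine.
my @pages = fetch_politely(
    sub { "fetched $_[0]" },
    1,
    'http://example.com/a',
    'http://example.com/b',
);
print "$_\n" for @pages;
```

A suitable delay depends on the site, so treat the value as a starting point and defer to whatever its terms or robots.txt ask for.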


#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.


In reply to Re: Scraping a website - Unterminated quoted string by kennethk
in thread Scraping a website - Unterminated quoted string by Staralfur
