PerlMonks  

A regex on the same content fails and works, with conditions

by hacker (Priest)
on Oct 22, 2007 at 00:56 UTC ( [id://646338]=perlquestion )

hacker has asked for the wisdom of the Perl Monks concerning the following question:

I'm banging my head against the wall on this one, and I don't understand why I'm getting these results.

I have a script I wrote that grabs an XML feed from a news site, extracts <link>, <pubDate> and <title> from the feed (via XML::Simple), follows the link referenced in the news feed to the original article, and then pulls the content out of the body of the article.

As part of the "final article" body extraction, I'm also trying to pull the author's name out of the HTML content itself, using a fairly simple regex.

While testing this, my regex stopped working, and I tried to debug it by writing the contents of $html to a local file, and examining that file.

What I have looks like this, for the relevant section:

    my $req = HTTP::Request->new(GET => $link) or die $!;
    my $res = $ua->request($req);
    my $html = $res->content;

    # write_file() comes from File::Slurp
    # $item_id is the article ID extracted from <link>
    write_file($item_id, {binmode => ':raw' }, $html);

    # Original source string looks like:
    # <a href="http://news.example.com/?author=John_Smith">John Smith</a>
    my ($other, $author) = $html =~ /\?author=(.*?)">(.*)<\/a>/;

    # $author is blank, empty here, why?
    print "AUTHOR: $author\n";

    my $new_html = read_file($item_id);
    my ($n_other, $n_author) = $new_html =~ /\?author=(.*?)">(.*)<\/a>/;

    # Now $n_author contains the right name, "Mike Smith" for
    # example.
    print "AUTHOR: $n_author\n";

The problem I'm having is that when I read the remote content into $html, via $res->content, and try to extract $author from it, it fails.

When I write $html to disk, then IMMEDIATELY read that same physical file back from disk into a new scalar ($new_html above), and then run the same exact regex across it, it works fine.

WHY?!

Replies are listed 'Best First'.
Re: A regex on the same content fails and works, with conditions
by GrandFather (Saint) on Oct 22, 2007 at 01:20 UTC

    I suspect a line end or character encoding issue of some sort. Especially as you:

    write_file($item_id, {binmode => ':raw' }, $html);

    but then:

    my $new_html = read_file($item_id);

    Can you copy and paste the chunk that the regex is not matching and that you expect ought to, or at least the version that is matching? If it's publicly accessible, can you post a URL too?
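One way to test the line-end or encoding theory is to compare the two scalars character by character and print the first place they diverge with invisible characters escaped. The following is only a sketch: the helper names first_difference and visible are made up for illustration, and the two sample strings (with an invented CRLF difference) stand in for $html and $new_html.

```perl
use strict;
use warnings;

# Return the offset of the first character at which two strings differ,
# or -1 if they are identical.
sub first_difference {
    my ($left, $right) = @_;
    return -1 if $left eq $right;
    my $shorter = length($left) < length($right) ? length($left) : length($right);
    for my $i (0 .. $shorter - 1) {
        return $i if substr($left, $i, 1) ne substr($right, $i, 1);
    }
    return $shorter;    # one string is a prefix of the other
}

# Render a string with every non-printable or non-ASCII character escaped.
sub visible {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7e])/sprintf('\\x{%02x}', ord $1)/ge;
    return $text;
}

# Stand-ins for $html and $new_html; the CRLF is an invented difference.
my $from_socket = qq{<a href="?author=John_Smith">John\r\nSmith</a>};
my $from_disk   = qq{<a href="?author=John_Smith">John\nSmith</a>};

my $offset = first_difference($from_socket, $from_disk);
if ($offset >= 0) {
    printf "first difference at offset %d\n", $offset;
    printf "  socket: %s\n", visible(substr $from_socket, $offset, 10);
    printf "  disk:   %s\n", visible(substr $from_disk,   $offset, 10);
}
```

If the helper reports -1 for the real $html and $new_html, the two strings are identical and the difference has to be somewhere else in the script.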


    Perl is environmentally friendly - it saves trees

      I added binmode, because I wanted to be sure the data coming off the socket was EXACTLY the same as what was expected.

      The regex fails in the first case, whether I use binmode or not, in write_file().

Re: A regex on the same content fails and works, with conditions
by ikegami (Patriarch) on Oct 22, 2007 at 05:45 UTC

    You called the snippet "the relevant section". Why don't you give us a snippet that actually gives the error instead of what you *think* is the relevant section? The one you posted doesn't run as is.

Re: A regex on the same content fails and works, with conditions
by Gangabass (Vicar) on Oct 22, 2007 at 01:54 UTC

    Hm... Why do you use binmode? I think you receive text data, so there is no need for binmode.

    If you remove the binmode option, how does the source string look in your file?

Re: A regex on the same content fails and works, with conditions
by hacker (Priest) on Oct 22, 2007 at 16:29 UTC

    I managed to solve this using some HTML::Element fu:

    (my $author) = map $_->as_text, $t->look_down(_tag => 'a', href => qr{^http://news\.example\.com/\?author=});
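For readers who want to run that line, here is a self-contained sketch. It assumes $t is an HTML::TreeBuilder tree built from the fetched page, which the post does not show, and the sample markup is invented.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented sample markup standing in for the fetched article.
my $html = <<'HTML';
<html><body>
<p>By <a href="http://news.example.com/?author=John_Smith">John Smith</a></p>
<p><a href="http://news.example.com/?section=world">World</a></p>
</body></html>
HTML

# Assumption: $t in the post is a tree like this one.
my $t = HTML::TreeBuilder->new_from_content($html);

# In list context look_down returns every matching element; the list
# assignment keeps the text of the first one.
my ($author) = map { $_->as_text }
    $t->look_down(_tag => 'a', href => qr{^http://news\.example\.com/\?author=});

$t->delete;    # release the tree's memory explicitly

print "AUTHOR: ", (defined $author ? $author : '(not found)'), "\n";
```

Because the parser works on elements and attributes rather than raw text, line breaks inside the link text or elsewhere in the markup do not affect the match.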

    But while testing this, it appears the upstream site has some blocking/throttling mechanisms, so now I can't test it because they're throwing back pages indicating I'm "reading articles faster than a human can read" (my code had a 10-second delay in it).

    Now I'm adding randomization across an array of anonymous proxies to try to alleviate that blocking, but the list of proxies is not reliable.

    Too many yaks to shave in one day.

      When a site tells you that you are hitting it too hard, it is pretty darn rude to try to thwart them by going through anonymous proxies. Instead of wasting your time trying to violate their attempts to control access to their site, why don't you just reduce the frequency of your hammering them while you compose a polite letter asking for permission (if the second step is even required)?
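Reducing the request frequency can be as simple as enforcing a minimum gap between fetches. This is a rough sketch only: make_throttle is a made-up helper, the 60-second gap is an arbitrary example value, and the demo uses a fake clock so that it runs instantly. Called with just the gap, the helper falls back to time() and sleep().

```perl
use strict;
use warnings;

# Build a throttle that enforces at least $min_gap seconds between calls.
# $clock and $sleeper are injectable only so the sketch can be exercised
# without really waiting; they default to time() and sleep().
sub make_throttle {
    my ($min_gap, $clock, $sleeper) = @_;
    $clock   ||= sub { time };
    $sleeper ||= sub { sleep $_[0] };
    my $last;    # time of the previous request, undef before the first
    return sub {
        my $now  = $clock->();
        my $wait = defined $last ? $last + $min_gap - $now : 0;
        $wait = 0 if $wait < 0;
        $sleeper->($wait) if $wait > 0;
        $last = $now + $wait;
        return $wait;
    };
}

# Demo with a fake clock: "sleeping" just advances the fake time.
my $fake_now = 1000;
my @slept;
my $throttle = make_throttle(
    60,
    sub { $fake_now },
    sub { push @slept, $_[0]; $fake_now += $_[0] },
);

for my $link (qw(http://news.example.com/a http://news.example.com/b)) {
    my $waited = $throttle->();
    print "waited ${waited}s, fetching $link\n";    # the real request would go here
}
```

In a real script the call would be make_throttle($seconds) with a gap the site is comfortable with.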

      - tye        

Re: A regex on the same content fails and works, with conditions
by snopal (Pilgrim) on Oct 22, 2007 at 13:07 UTC

    I get the impression that the data in $html and the data read back from the file named by $item_id are not exactly the same. One thing that might differ is newlines. By default, "." in a regular expression does not match a newline; the /s modifier tells it to.

    my ($other, $author) = $html =~ /\?author=(.*?)">(.*)<\/a>/s;
    #----------------------------------------------------------^

    Just a stab in the dark here, assuming all other things are equal.
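To see the effect of /s in isolation, here is a small sketch. The sample string is invented and puts a newline inside the link text, which may or may not be what the real page contains.

```perl
use strict;
use warnings;

# Invented sample: the author's name is split across two lines.
my $html = qq{<a href="http://news.example.com/?author=John_Smith">John\nSmith</a>};

# Without /s, "." stops at the newline, so the match fails.
my ($plain_other, $plain_author) = $html =~ /\?author=(.*?)">(.*)<\/a>/;

# With /s, "." also matches the newline.
my ($s_other, $s_author) = $html =~ /\?author=(.*?)">(.*)<\/a>/s;

# A negated character class crosses newlines without /s and cannot run
# past the next tag, which a greedy (.*) under /s may do.
my ($c_other, $c_author) = $html =~ /\?author=([^"]*)">([^<]*)<\/a>/;

printf "no /s: %s\n", defined $plain_author ? 'matched' : 'no match';
printf "/s:    %s\n", defined $s_author     ? 'matched' : 'no match';
printf "class: %s\n", defined $c_author     ? 'matched' : 'no match';
```

On a full page, the greedy (.*) combined with /s would capture everything up to the last </a> in the document, so the character-class form is usually the safer of the two.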
