I'm banging my head against the wall on this one, and I don't understand why I'm getting these results.

I wrote a script that grabs an XML feed from a news site, extracts <link>, <pubDate> and <title> from the feed (via XML::Simple), follows the link referenced in the feed to the original article, and then pulls the content out of the body of the article.

As part of the "final article" body extraction, I'm also trying to pull the author's name out of the HTML content itself, using a fairly simple regex.

While testing this, my regex stopped working, so I tried to debug it by writing the contents of $html to a local file and examining that file.

What I have looks like this, for the relevant section:

  my $req   = HTTP::Request->new(GET => $link) or die $!;
  my $res   = $ua->request($req);
  my $html  = $res->content;

  # write_file() comes from File::Slurp
  # $item_id is the article ID extracted from <link>
  write_file($item_id,  {binmode => ':raw' }, $html);

  # Original source string looks like:
  # <a href="http://news.example.com/?author=John_Smith">John Smith</a>
  my ($other, $author) = $html =~ /\?author=(.*?)">(.*)<\/a>/;

  # $author is empty here. Why?
  print "AUTHOR: $author\n";

  my $new_html = read_file($item_id);

  my ($n_other, $n_author) = $new_html =~ /\?author=(.*?)">(.*)<\/a>/;

  # Now $n_author contains the right name, "John Smith" for
  # example.
  print "AUTHOR: $n_author\n";

The problem is that when I read the remote content into $html via $res->content and try to extract $author from it, the match fails.

When I write $html to disk, then IMMEDIATELY read that same physical file back from disk into a new scalar ($new_html above), and then run the same exact regex across it, it works fine.

WHY?!
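
For anyone who wants to poke at this, here is a minimal sketch of how the two scalars could be compared directly instead of by eye. The helper names describe_string() and first_difference() are hypothetical, and the two sample strings are stand-ins I made up for illustration, not the real page content; in the script above you would pass $html and $new_html instead. It uses only core Perl and does not claim to identify the cause.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: summarise a string so two copies can be compared.
sub describe_string {
    my ($s) = @_;
    return {
        length  => length($s),
        is_utf8 => utf8::is_utf8($s) ? 1 : 0,   # internal UTF-8 flag
    };
}

# Hypothetical helper: index of the first differing character,
# or -1 if the two strings are identical.
sub first_difference {
    my ($x, $y) = @_;
    return -1 if $x eq $y;
    my $max = length($x) < length($y) ? length($x) : length($y);
    for my $i (0 .. $max - 1) {
        return $i if substr($x, $i, 1) ne substr($y, $i, 1);
    }
    return $max;    # one string is a prefix of the other
}

# Stand-in data (assumption): replace with $html and $new_html.
my $fetched = qq{<a href="http://news.example.com/?author=John_Smith">John Smith</a>\r\n};
my $reread  = qq{<a href="http://news.example.com/?author=John_Smith">John Smith</a>\n};

local $Data::Dumper::Useqq = 1;    # show control characters as escapes
print Dumper(describe_string($fetched), describe_string($reread));

my $pos = first_difference($fetched, $reread);
if ($pos >= 0) {
    # Dump a small window around the first difference in each string.
    my $start = $pos > 20 ? $pos - 20 : 0;
    print "First difference at offset $pos\n";
    print Dumper(substr($fetched, $start, 40), substr($reread, $start, 40));
}
else {
    print "Strings are identical\n";
}
```

If the two strings turn out to be identical by this check, the difference would have to be somewhere other than the content itself, for example in code between the fetch and the first match that is not shown here.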

In reply to A regex on the same content fails and works, with conditions by hacker
