http://qs321.pair.com?node_id=485882

xdg has asked for the wisdom of the Perl Monks concerning the following question:

I tried Super Search to see if this had been discussed, but most of the deluge of RSS questions seems to consist of "I'm trying to scrape RSS and I'm clueless, please help", so I gave up in frustration. Apologies if I missed something obvious somewhere.

I'm not clueless and I've been working with RSS for a while now (cf. Code for Perlmonks XML to RSS), and I'm a little frustrated with the various incompatibilities and breakage that I encounter when dealing with people's feeds. I'm currently using a combination of XML::RSS and XML::RAI -- though largely because that's what I started with. So my questions are these:

  1. What modules for RSS parsing have people found to be the most robust and stable (given unreliable, non-standard input feeds)?

  2. What modules best parse all the various feed standards? (E.g. XML::RSS docs are inconsistent about RSS 2.0 support)

  3. What modules best produce all the various feed standards?

  4. What pre-processing have people found helpful in cleaning up non-standard feeds to keep XML::Parser and the like from giving up on errors?

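On that last point, I'll share my own helpful snippet. I'm currently doing a rather hackish bit with a regex and HTML::Entities::Numbered to fix up some of the broken encodings that I commonly find in various feeds and that were breaking XML::Parser. The regex adds the missing trailing semicolon to numeric entities, and name2decimal_xml converts named entities to numeric ones. YMMV.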

$content =~ s/(&#\d+);?/$1;/g;
$content = name2decimal_xml( $content );
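
To make the first step easier to try in isolation, here is a minimal sketch of just the regex part, using core Perl only. The helper name terminate_numeric_entities and the sample string are made up for illustration; the name2decimal_xml call is left out so that this runs without installing HTML::Entities::Numbered.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption, not part of any module):
# append the missing ';' to decimal character references such as "&#233".
# References that already end in ';' are left unchanged.
sub terminate_numeric_entities {
    my ($content) = @_;
    $content =~ s/(&#\d+);?/$1;/g;
    return $content;
}

# Made-up sample input with two unterminated references.
my $broken = 'caf&#233 and &#8220;quotes&#8221';
print terminate_numeric_entities($broken), "\n";
# prints: caf&#233; and &#8220;quotes&#8221;
```

Note that this only handles decimal references; hex references like &#xE9 are not touched by this pattern.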

Thanks,

-xdg

Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.