Extract Portion of HTML

pacohope has asked for the wisdom of the Perl Monks concerning the following question:

Hi. I have 300 HTML pages in various states of HTML compliance. I'm basically trying to strip out all the header and footer junk and get all the middle of the document, even with any crappy HTML it might have.

Documents look something like this:

<html>
--stuff--
<head>
--more stuff--
</head>
<body>
--still more stuff--
<div class="myBody">
--all the stuff I want, which might include div tags, too--
</div>
--yet more stuff--
</body>
</html>

I've tried a few things. I know that XML::XPath and XML::XPath::XMLParser get me to the right place. I have an XPath expression that seems to work most of the time. The problem is that I want all the tags and everything--just as it currently is in the file. When I use methods like findvalue() or string_value(), I get just the text without the tags.

I tried HTML::TokeParser::Simple, but I wasn't sure how to do this. I'm hoping I don't have to write some loop that iterates over all the tags and text and prints them out bit by bit. I just want to say "keep everything from this point in the tree on down...".

Ideally, I want to do this without first fixing crappy, non-compliant HTML. I have lots of <p> tags that are used to separate paragraphs (instead of <p>foo</p>). I also have lots of <meta ... > tags instead of <meta... />. These unclosed tags tend to give XML parsers heartburn. I'll preprocess with tidy to make things tidy if I have to.

Update

I got a good enough result by using XML::XPath, XML::XPath::NodeSet, and XML::Parser. The trick seemed to be disentangling XML::Parser and XML::XPath. That is, I needed my own parser object which I used with XML::XPath. The entire script is 200 lines because of the vagaries of my specific input. But here's what I think is the salient bit that worked:

$m::xpath = '/html/body/table/tr/td/div';
my $parser = XML::Parser->new(
  'NoLWP' => 1,
  'NoExpand' =>1,
  'Namespaces' => 0);
my $XP = XML::XPath->new( filename => $inputfile, parser => $parser );
my $body = $XP->findnodes_as_string($m::xpath);

I ended up cheating because I discovered that the XPath expression above gets me the right div. There was a bit more uniformity on the pages (at least the pages I cared about) than I realised.

Thanks for all the suggestions.

Re: Extract Portion of HTML by Tux (Canon) on Sep 19, 2011 at 11:35 UTC

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse_content ($html_text);
foreach my $d ($tree->look_down (_tag => "div", class => "myBody")) {
    # $d is the div node. Use whatever you like with/on it
    $d->as_text =~ m/something/ and $d->delete; # To remove it.
    }
$html_text = $tree->as_HTML (undef, " ", {});

Enjoy, Have FUN! H.Merijn

Re: Extract Portion of HTML by moritz (Cardinal) on Sep 19, 2011 at 12:01 UTC

use Mojo::DOM;
print Mojo::DOM->new->parse($string)->at('.myBody')

Works pretty well with crappy HTML too.

Updated: changed from `body` to `.myBody`, thanks to Corion++

Perl 6 - second systems done right

Re: Extract Portion of HTML by Anonymous Monk on Sep 19, 2011 at 11:44 UTC

use 5.010;  # enables say()

my $html = '<html>
--stuff--
<head>
--more stuff--
</head>
<body>
--still more stuff--
<div class="myBody">
--all the stuff I want, which might include div tags, too--
</div>
--yet more stuff--
</body>
</html>';

use Web::Query qw(wq);
say wq($html)->find('div.myBody')->html;
# <div class="myBody"> --all the stuff I want, which might include div tags, too-- </div>

use HTML::Query 'Query';
say Query(text => $html)->query('div.myBody')->as_HTML;
# <div class="myBody"> --all the stuff I want, which might include div tags, too-- </div>

use HTML::TreeBuilder qw();
say HTML::TreeBuilder->new_from_content($html)->look_down(_tag => 'div', class => 'myBody')->as_HTML(q{});
# <div class="myBody"> --all the stuff I want, which might include div tags, too-- </div>

Re^2: Extract Portion of HTML by Corion (Patriarch) on Sep 19, 2011 at 12:27 UTC

In addition, here is another variant on the same theme, using HTML::TreeBuilder::XPath together with HTML::Selector::XPath:

#!perl -w
use strict;

my $html = '<html>
--stuff--
<head>
--more stuff--
</head>
<body>
--still more stuff--
<div class="myBody">
--all the stuff I want, which might include div tags, too--
</div>
--yet more stuff--
</body>
</html>';

use HTML::Selector::XPath qw(selector_to_xpath);
use HTML::TreeBuilder::XPath;

my $t = HTML::TreeBuilder::XPath->new_from_content($html);
my $q = selector_to_xpath('div.myBody');
print $_->as_HTML for ($t->findnodes($q));
# <div class="myBody"> --all the stuff I want, which might include div tags, too-- </div>

The same should also be fairly simple using App::scrape, but the API currently does not allow for returning node elements (and the corresponding DOM tree), only plain text.

Re: Extract Portion of HTML by Rhandom (Curate) on Sep 20, 2011 at 14:44 UTC

I was interested in relative performance. Here are all of the modules listed so far in this thread, as well as App::scrape and Web::Scraper.

There is also a module called SGMLExtract, which I wrote but have generally only used internally at the company. SGMLExtract is a regex-based extractor, which means that it requires regularly formed HTML, not necessarily well-formed HTML. The other solutions are great for parsing documents that could be poorly formed, but I haven't come across a situation where I have a legitimate reason to scrape information from poorly formed HTML. (I do use HTML::TreeBuilder for a module I'll be releasing to CPAN sometime in the next year. TreeBuilder is awesome at "being a browser".)

I haven't released SGMLExtract to CPAN because I wasn't sure there is enough external demand (and we have here at least 7 modules filling the niche), but I could release it if there is enough interest. It is a whopping 90 lines of code with 0 dependencies.

Of all of the outputs, the SGMLExtract one is the only one that does what the OP requested, which is to pull the content of the div tag without the enclosing div. The Mojo::DOM one also failed to re-encapsulate the legacy bold tag.

#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw(cmpthese timethese);
use App::scrape qw(scrape);
use HTML::Query qw(Query);
use HTML::Selector::XPath qw(selector_to_xpath);
use HTML::TreeBuilder qw();
use HTML::TreeBuilder::XPath;
use Mojo::DOM;
use SGMLExtract qw(sgml_find sgml_extract);
use Web::Query qw(wq);
use Web::Scraper qw(process scraper);
use Debug;

my $html = q{<html>
--stuff--
<head>
--more stuff--
</head>
<body>
--still more stuff--
<div>Stuff I do not want</div>
<div class="myBody">
--all the stuff <b>I</b> want, which might include div tags, too--
</div>
--yet more stuff--
</body>
</html>
};

# appse and sgmle cheat because they go off relative position of the div - not the class name
sub m_appse { (scrape($html, ['div'], {class => 'myBody'}))[1]->[0] }
sub m_hselx { (HTML::TreeBuilder::XPath->new_from_content($html)->findnodes(selector_to_xpath('div.myBody')))[0]->as_HTML }
sub m_htmlq { Query(text => $html)->query('div.myBody')->as_HTML }
sub m_mojod { Mojo::DOM->new->parse($html)->at('.myBody')->text }
sub m_sgmle { sgml_extract(\$html, 'div', {all => 1, content => 1})->[1]->{'content'} }
sub m_sgmlf { sgml_find(\$html, 'div', {class => 'myBody'})->[0]->{'content'} }
sub m_treeb { HTML::TreeBuilder->new_from_content($html)->look_down(_tag => 'div', class => 'myBody')->as_HTML(q{}) }
sub m_webqy { wq($html)->find('div.myBody')->html }
sub m_websc { (scraper { process "div.myBody", key => 'TEXT' }->scrape($html))[0]->{'key'} }

debug m_appse(), m_hselx(), m_htmlq(), m_mojod(), m_treeb(), m_sgmle(), m_sgmlf(), m_webqy(), m_websc();

cmpthese timethese -1, {
    appse => \&m_appse,
    hselx => \&m_hselx,
    htmlq => \&m_htmlq,
    mojod => \&m_mojod,
    sgmle => \&m_sgmle,
    sgmlf => \&m_sgmlf,
    treeb => \&m_treeb,
    webqy => \&m_webqy,
    websc => \&m_websc,
};

__END__

debug: paul/bench.pl line 45
m_appse() = "--all the stuff I want, which might include div tags, too--";
m_hselx() = "<div class=\"myBody\"> --all the stuff <b>I</b> want, which might include div tags, too-- </div>";
m_htmlq() = "<div class=\"myBody\"> --all the stuff <b>I</b> want, which might include div tags, too-- </div>";
m_mojod() = "\n--all the stuff want, which might include div tags, too--\n";
m_treeb() = "<div class=\"myBody\"> --all the stuff <b>I</b> want, which might include div tags, too-- </div>";
m_sgmle() = "\n--all the stuff <b>I</b> want, which might include div tags, too--\n";
m_sgmlf() = "\n--all the stuff <b>I</b> want, which might include div tags, too--\n";
m_webqy() = "<div class=\"myBody\"> --all the stuff <b>I</b> want, which might include div tags, too-- </div>";
m_websc() = " --all the stuff I want, which might include div tags, too-- ";

         Rate  webqy  hselx  websc  appse  htmlq  treeb  mojod  sgmlf  sgmle
webqy   697/s     --    -4%    -5%   -37%   -47%   -54%   -72%   -97%   -98%
hselx   724/s     4%     --    -1%   -35%   -44%   -52%   -71%   -97%   -97%
websc   731/s     5%     1%     --   -34%   -44%   -51%   -70%   -97%   -97%
appse  1110/s    59%    53%    52%     --   -15%   -26%   -55%   -95%   -96%
htmlq  1305/s    87%    80%    78%    18%     --   -13%   -47%   -94%   -95%
treeb  1506/s   116%   108%   106%    36%    15%     --   -39%   -93%   -95%
mojod  2465/s   254%   240%   237%   122%    89%    64%     --   -89%   -91%
sgmlf 22330/s  3103%  2983%  2953%  1912%  1611%  1383%   806%     --   -22%
sgmle 28709/s  4018%  3864%  3825%  2486%  2100%  1807%  1065%    29%     --

my @a=qw(random brilliant braindead); print $a[rand(@a)];
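
The original question asks for everything inside the div, tags included, without the enclosing div itself. As a minimal sketch of one way to get that from a tolerant parser, the snippet below walks the children of the matched node with HTML::TreeBuilder and serializes each one. The helper name inner_html and the sample markup are assumptions made for illustration, not something any poster in the thread wrote. Note that look_down with class => 'myBody' compares the whole attribute value, so it would not match an element with several classes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Entities qw(encode_entities);

# Sample input (an assumption, modelled on the documents in the question).
my $html = <<'HTML';
<html><head><title>t</title></head>
<body>
<div>Stuff I do not want</div>
<div class="myBody">--all the stuff <b>I</b> want, with a <div>nested div</div>--</div>
</body></html>
HTML

# inner_html is a hypothetical helper: it returns the serialized children
# of the first <div> whose class attribute equals $class, or '' if none.
sub inner_html {
    my ($html_text, $class) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html_text);
    my $div  = $tree->look_down(_tag => 'div', class => $class);
    my $out  = '';
    if ($div) {
        for my $child ($div->content_list) {
            # Element children are objects; text children are plain
            # (already decoded) strings that must be re-escaped.
            $out .= ref $child
                ? $child->as_HTML('<>&', undef, {})
                : encode_entities($child, '<>&');
        }
    }
    $tree->delete;    # free the tree explicitly
    return $out;
}

print inner_html($html, 'myBody'), "\n";
```

Because the tree is rebuilt by the parser, the result is normalized markup (closed tags, parser-chosen whitespace) and not a byte-for-byte copy of the source file.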

