
I was interested in relative performance, so I benchmarked all of the modules listed so far in this thread, as well as App::scrape and Web::Scraper.

There is also a module called SGMLExtract, which I wrote but have generally only used internally at my company. SGMLExtract is a regex-based extractor, which means it requires regularly formed HTML, though not necessarily well-formed HTML. The other solutions are great for parsing documents that could be poorly formed, but I haven't come across a situation where I had a legitimate reason to scrape information from poorly formed HTML. (I do use HTML::TreeBuilder for a module I'll be releasing to CPAN sometime in the next year. TreeBuilder is awesome at "being a browser".) I haven't released SGMLExtract to CPAN because I wasn't sure there was enough external demand (and we have at least 7 modules here filling the niche), but I could release it if there is enough interest. It is a whopping 90 lines of code with 0 dependencies.
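To illustrate what a regex-based approach looks like on regularly formed HTML, here is a minimal sketch in core Perl. It is not SGMLExtract's code or API: the function name `inner_div_by_class` and its arguments are hypothetical, and it assumes quoted class attributes holding exactly one class name and balanced div tags.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of SGMLExtract): return the inner content of
# the first <div> whose class attribute equals $class, or undef if not found.
# Assumes regularly formed HTML: quoted attributes and balanced <div> tags.
sub inner_div_by_class {
    my ($html_ref, $class) = @_;

    pos($$html_ref) = 0;    # always start scanning from the beginning

    # Locate the opening tag that carries the wanted class.
    return
        unless $$html_ref =~ /<div\b[^>]*\bclass\s*=\s*(["'])\Q$class\E\1[^>]*>/gi;
    my $start = pos $$html_ref;    # offset just past the opening tag

    # Walk the following <div ...> and </div> tags, tracking nesting depth,
    # so that nested divs inside the wanted one are kept intact.
    my $depth = 1;
    while ($$html_ref =~ /<(\/?)div\b[^>]*>/gi) {
        $depth += $1 ? -1 : 1;
        next if $depth;
        my $end = $-[0];           # offset where the matching </div> starts
        pos($$html_ref) = undef;   # leave no stale match position behind
        return substr $$html_ref, $start, $end - $start;
    }

    pos($$html_ref) = undef;
    return;                        # unbalanced markup: give up
}

my $sample = q{<div>no</div><div class="myBody">a <div>nested</div> <b>b</b></div>};
print inner_div_by_class(\$sample, 'myBody'), "\n";
```

Passing a reference avoids copying the document on each call. The depth counter is what lets the wanted div contain other div tags; a single non-greedy regex would stop at the first closing tag.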

Of all of the outputs, the SGMLExtract ones are the only ones that do what the OP requested, which is to pull the content of the div tag, markup intact, without the enclosing div. The Mojo::DOM output also lost the legacy bold tag along with the text inside it.

#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw(cmpthese timethese);

use App::scrape qw(scrape);
use HTML::Query qw(Query);
use HTML::Selector::XPath qw(selector_to_xpath);
use HTML::TreeBuilder qw();
use HTML::TreeBuilder::XPath;
use Mojo::DOM;
use SGMLExtract qw(sgml_find sgml_extract);
use Web::Query qw(wq);
use Web::Scraper qw(process scraper);

use Debug;

my $html = q{<html>
--stuff--
<head>
--more stuff--
</head>
<body>
--still more stuff--
<div>Stuff I do not want</div>
<div class="myBody">
--all the stuff <b>I</b> want, which might include div tags, too--
</div>
--yet more stuff--
</body>
</html>
};

# appse and sgmle cheat because they go off relative position of the div - not the class name
sub m_appse { (scrape($html, ['div'], {class => 'myBody'}))[1]->[0] }
sub m_hselx { (HTML::TreeBuilder::XPath->new_from_content($html)->findnodes(selector_to_xpath('div.myBody')))[0]->as_HTML }
sub m_htmlq { Query(text => $html)->query('div.myBody')->as_HTML }
sub m_mojod { Mojo::DOM->new->parse($html)->at('.myBody')->text }
sub m_sgmle { sgml_extract(\$html, 'div', {all => 1, content => 1})->[1]->{'content'} }
sub m_sgmlf { sgml_find(\$html, 'div', {class => 'myBody'})->[0]->{'content'} }
sub m_treeb { HTML::TreeBuilder->new_from_content($html)->look_down(_tag => 'div', class => 'myBody')->as_HTML(q{}) }
sub m_webqy { wq($html)->find('div.myBody')->html }
sub m_websc { (scraper { process "div.myBody", key => 'TEXT' }->scrape($html))[0]->{'key'} }

debug m_appse(), m_hselx(), m_htmlq(), m_mojod(), m_treeb(), m_sgmle(), m_sgmlf(), m_webqy(), m_websc();

cmpthese timethese -1, {
    appse => \&m_appse,
    hselx => \&m_hselx,
    htmlq => \&m_htmlq,
    mojod => \&m_mojod,
    sgmle => \&m_sgmle,
    sgmlf => \&m_sgmlf,
    treeb => \&m_treeb,
    webqy => \&m_webqy,
    websc => \&m_websc,
};

__END__

debug: paul/bench.pl line 45
m_appse() = "--all the stuff I want, which might include div tags, too--";
m_hselx() = "<div class=\"myBody\"> --all the stuff <b>I</b> want, which might include div tags, too-- </div>";
m_htmlq() = "<div class=\"myBody\"> --all the stuff <b>I</b> want, which might include div tags, too-- </div>";
m_mojod() = "\n--all the stuff  want, which might include div tags, too--\n";
m_treeb() = "<div class=\"myBody\"> --all the stuff <b>I</b> want, which might include div tags, too-- </div>";
m_sgmle() = "\n--all the stuff <b>I</b> want, which might include div tags, too--\n";
m_sgmlf() = "\n--all the stuff <b>I</b> want, which might include div tags, too--\n";
m_webqy() = "<div class=\"myBody\"> --all the stuff <b>I</b> want, which might include div tags, too-- </div>";
m_websc() = " --all the stuff I want, which might include div tags, too-- ";

         Rate webqy hselx websc appse htmlq treeb mojod sgmlf sgmle
webqy   697/s    --   -4%   -5%  -37%  -47%  -54%  -72%  -97%  -98%
hselx   724/s    4%    --   -1%  -35%  -44%  -52%  -71%  -97%  -97%
websc   731/s    5%    1%    --  -34%  -44%  -51%  -70%  -97%  -97%
appse  1110/s   59%   53%   52%    --  -15%  -26%  -55%  -95%  -96%
htmlq  1305/s   87%   80%   78%   18%    --  -13%  -47%  -94%  -95%
treeb  1506/s  116%  108%  106%   36%   15%    --  -39%  -93%  -95%
mojod  2465/s  254%  240%  237%  122%   89%   64%    --  -89%  -91%
sgmlf 22330/s 3103% 2983% 2953% 1912% 1611% 1383%  806%    --  -22%
sgmle 28709/s 4018% 3864% 3825% 2486% 2100% 1807% 1065%   29%    --

my @a=qw(random brilliant braindead); print $a[rand(@a)];

In reply to Re: Extract Portion of HTML by Rhandom
in thread Extract Portion of HTML by pacohope
