PerlMonks  


I was interested in relative performance, so I benchmarked all of the modules mentioned so far in this thread, plus App::scrape and Web::Scraper.

There is also a module called SGMLExtract, which I wrote but have generally only used internally at my company. SGMLExtract is a regex-based extractor, which means it requires regularly formed HTML, though not necessarily well-formed HTML. The other solutions are great for parsing documents that could be poorly formed, but I haven't come across a situation where I had a legitimate reason to scrape information from poorly formed HTML. (I do use HTML::TreeBuilder for a module I'll be releasing to CPAN sometime in the next year; TreeBuilder is awesome at "being a browser".) I haven't released SGMLExtract to CPAN because I wasn't sure there was enough external demand (and we have at least 7 modules here filling the niche), but I could release it if there is enough interest. It is a whopping 90 lines of code with 0 dependencies.

Of all of the outputs, the SGMLExtract ones are the only ones that do what the OP requested, which is to pull the content of the div tag without the enclosing div. The Mojo::DOM one also failed to preserve the legacy bold tag: its output drops <b>I</b> entirely.
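To illustrate what "regex-based" extraction of a div's content can look like, here is a minimal sketch using only core Perl. It is not SGMLExtract's code or API; `div_content_by_class` is a hypothetical helper written for this example. It assumes regularly formed HTML: a double-quoted class attribute holding a single class name, and balanced div tags.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of SGMLExtract): return the inner content of
# the first <div> whose class attribute equals $class, or undef if none.
sub div_content_by_class {
    my ($html_ref, $class) = @_;
    pos($$html_ref) = 0;           # always scan from the beginning

    # Locate the opening tag. Assumes a double-quoted class attribute
    # holding exactly one class name.
    $$html_ref =~ m{<div\b[^>]*\bclass="\Q$class\E"[^>]*>}gi or return;
    my $start = pos $$html_ref;    # offset just past the opening tag

    # Walk the following <div> and </div> tags, tracking nesting depth
    # so that inner divs do not end the match early.
    my $depth = 1;
    while ($$html_ref =~ m{<(/?)div\b[^>]*>}gi) {
        $depth += $1 ? -1 : 1;
        next if $depth;
        # $-[0] is the offset where the matching </div> begins.
        my $content = substr $$html_ref, $start, $-[0] - $start;
        pos($$html_ref) = undef;   # leave no match position behind
        return $content;
    }
    pos($$html_ref) = undef;
    return;                        # unbalanced markup: no matching </div>
}

my $sample = q{<div>skip</div><div class="myBody">keep <b>this</b> <div>and this</div></div>};
print div_content_by_class(\$sample, 'myBody'), "\n";
# prints: keep <b>this</b> <div>and this</div>
```

The depth counter is what lets the wanted div contain other div tags. The trade-off is the one described above: this kind of scan is only reliable when the markup is regular, and it will return undef or the wrong span on unbalanced or unusually quoted HTML.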

#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw(cmpthese timethese);
use App::scrape qw(scrape);
use HTML::Query qw(Query);
use HTML::Selector::XPath qw(selector_to_xpath);
use HTML::TreeBuilder qw();
use HTML::TreeBuilder::XPath;
use Mojo::DOM;
use SGMLExtract qw(sgml_find sgml_extract);
use Web::Query qw(wq);
use Web::Scraper qw(process scraper);
use Debug;

my $html = q{<html>
--stuff--
<head>
--more stuff--
</head>
<body>
--still more stuff--
<div>Stuff I do not want</div>
<div class="myBody">
--all the stuff <b>I</b> want, which might include div tags, too--
</div>
--yet more stuff--
</body>
</html>
};

# appse and sgmle cheat because they go off relative position of the div - not the class name
sub m_appse { (scrape($html, ['div'], {class => 'myBody'}))[1]->[0] }
sub m_hselx { (HTML::TreeBuilder::XPath->new_from_content($html)->findnodes(selector_to_xpath('div.myBody')))[0]->as_HTML }
sub m_htmlq { Query(text => $html)->query('div.myBody')->as_HTML }
sub m_mojod { Mojo::DOM->new->parse($html)->at('.myBody')->text }
sub m_sgmle { sgml_extract(\$html, 'div', {all => 1, content => 1})->[1]->{'content'} }
sub m_sgmlf { sgml_find(\$html, 'div', {class => 'myBody'})->[0]->{'content'} }
sub m_treeb { HTML::TreeBuilder->new_from_content($html)->look_down(_tag => 'div', class => 'myBody')->as_HTML(q{}) }
sub m_webqy { wq($html)->find('div.myBody')->html }
sub m_websc { (scraper { process "div.myBody", key => 'TEXT' }->scrape($html))[0]->{'key'} }

debug m_appse(), m_hselx(), m_htmlq(), m_mojod(), m_treeb(), m_sgmle(), m_sgmlf(), m_webqy(), m_websc();

cmpthese timethese -1, {
    appse => \&m_appse,
    hselx => \&m_hselx,
    htmlq => \&m_htmlq,
    mojod => \&m_mojod,
    sgmle => \&m_sgmle,
    sgmlf => \&m_sgmlf,
    treeb => \&m_treeb,
    webqy => \&m_webqy,
    websc => \&m_websc,
};

__END__

debug: paul/bench.pl line 45
m_appse() = "--all the stuff I want, which might include div tags, too--";
m_hselx() = "<div class=\"myBody\"> --all the stuff <b>I</b> want, which might include div tags, too-- </div>";
m_htmlq() = "<div class=\"myBody\"> --all the stuff <b>I</b> want, which might include div tags, too-- </div>";
m_mojod() = "\n--all the stuff want, which might include div tags, too--\n";
m_treeb() = "<div class=\"myBody\"> --all the stuff <b>I</b> want, which might include div tags, too-- </div>";
m_sgmle() = "\n--all the stuff <b>I</b> want, which might include div tags, too--\n";
m_sgmlf() = "\n--all the stuff <b>I</b> want, which might include div tags, too--\n";
m_webqy() = "<div class=\"myBody\"> --all the stuff <b>I</b> want, which might include div tags, too-- </div>";
m_websc() = " --all the stuff I want, which might include div tags, too-- ";

         Rate webqy hselx websc appse htmlq treeb mojod sgmlf sgmle
webqy   697/s    --   -4%   -5%  -37%  -47%  -54%  -72%  -97%  -98%
hselx   724/s    4%    --   -1%  -35%  -44%  -52%  -71%  -97%  -97%
websc   731/s    5%    1%    --  -34%  -44%  -51%  -70%  -97%  -97%
appse  1110/s   59%   53%   52%    --  -15%  -26%  -55%  -95%  -96%
htmlq  1305/s   87%   80%   78%   18%    --  -13%  -47%  -94%  -95%
treeb  1506/s  116%  108%  106%   36%   15%    --  -39%  -93%  -95%
mojod  2465/s  254%  240%  237%  122%   89%   64%    --  -89%  -91%
sgmlf 22330/s 3103% 2983% 2953% 1912% 1611% 1383%  806%    --  -22%
sgmle 28709/s 4018% 3864% 3825% 2486% 2100% 1807% 1065%   29%    --


my @a=qw(random brilliant braindead); print $a[rand(@a)];

In reply to Re: Extract Portion of HTML by Rhandom
in thread Extract Portion of HTML by pacohope
