PerlMonks  

Extract text from repeated html parts

by vit (Friar)
on Aug 03, 2011 at 20:04 UTC ( [id://918373]=perlquestion )

vit has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I need to extract text from repeated HTML patterns. A good example is the site homeandfamilynetwork dot com.
The HTML is given below.
The repeating pattern would probably be:
<p class="entry-summary"> t e x t <div
Once I have this portion, I can strip the HTML and clean up the text easily with a filter.
I know that in general this is not an easy task, and someone may recommend, for example, HTML::ContentExtractor as a help. But if somebody has already done something similar, I would appreciate it if you shared the code with me.
<div class="article box medium">
<div class="header"><a class="category" href="/home-improvement/" rel="tag">Home Improvement</a><h2 class="entry-title first-item"><a href="http://www.homeandfamilynetwork.com/home-improvement/organization/get-that-closet-organized/740">Get That Closet Organized</a></h2> </div>
<p class="entry-summary"> <p>Cluttered closet? Can't find that coat you have because it's hidden under mountains of junk? We have the solution.<div class="footer"><a href="http://www.homeandfamilynetwork.com/blog/better-homes-and-gardens-decorating/161">Better Homes and Gardens ...</a> On <abbr title="January 13, 2011">January 13, 2011</abbr> </div>
</div>
<li>
<div class="article box medium">
<div class="header"><a class="category" href="/fitness/" rel="tag">Fitness</a><h2 class="entry-title"><a href="http://www.homeandfamilynetwork.com/fitness/weight-loss/8-diet-rules-you-should-be-breaking/741">8 Diet Rules You Should Be Breaking</a></h2> </div>
<p class="entry-summary">Trying to lose weight and listening to everything the internet or your friends say? Maybe it's time to stop listening and start eating. <div class="footer"><a href="http://www.homeandfamilynetwork.com/blog/fitnessmagazinecom/168">FitnessMagazine.com</a> On <abbr title="January 13, 2011">January 13, 2011</abbr> </div>
</div>
<li>
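The pattern described in the question (everything between <p class="entry-summary"> and the next <div) can be sketched with a plain regex in core Perl. This is only a rough illustration, assuming the markup stays as regular as the sample above; the helper name extract_summaries and the shortened sample string are made up for this example and are not from the site or the thread. A real HTML parser will generally cope better with markup that varies.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the text found between
# each <p class="entry-summary"> and the next <div, with tags stripped.
sub extract_summaries {
    my ($html) = @_;
    my @summaries;

    # Non-greedy match from the opening tag up to the following <div.
    # /s lets . cross newlines, /i ignores tag case, /g walks every match.
    while ($html =~ m{<p\s+class="entry-summary"\s*>(.*?)<div\b}gis) {
        my $text = $1;
        $text =~ s/<[^>]+>//g;     # drop any tags left inside the summary
        $text =~ s/\s+/ /g;        # collapse runs of whitespace
        $text =~ s/^\s+|\s+$//g;   # trim both ends
        push @summaries, $text;
    }
    return @summaries;
}

# Shortened sample modelled on the markup in the question (an assumption,
# not the full page).
my $html = <<'HTML';
<p class="entry-summary"> <p>Cluttered closet? We have the solution.<div class="footer">x</div>
<p class="entry-summary">Trying to lose weight? <div class="footer">y</div>
HTML

print "$_\n" for extract_summaries($html);
```

The sketch does not decode HTML entities and will miss summaries whose attributes are quoted or ordered differently, so treat it as a starting point for regular markup only.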

Replies are listed 'Best First'.
Re: Extract text from repeated html parts
by moritz (Cardinal) on Aug 03, 2011 at 20:19 UTC
    Something like
    use Mojo::DOM;

    my $dom = Mojo::DOM->new->parse($text);
    for ($dom->find('p.entry-summary')->each) {
        print $_->all_text;
    }

    (untested, but should work with only minor modifications).

Re: Extract text from repeated html parts
by bluescreen (Friar) on Aug 03, 2011 at 20:20 UTC
