PerlMonks  

Well, to be mildly critical, there are two (or more, depending on how you look at it) scenarios that we can/need to extract information from. Yours only matches one fixed version (yes, I know it was a deliberate decision :-). Now, for the record (and in rehearsal for that tutorial you suggested :-), I'll list the others:
# Main node on page (Top most)
<TD valign=middle>
  <H3>Name Space</H3>
  <FONT size=2>
    ' by '
    <A HREF="/index.pl?node_id=103111&lastnode_id=110166">George_Sherston</A>
    ' on Sep 04, 2001 at 13:33'
  </FONT>
</TD>

# Primary Reply (crazyinsomniacs pattern)
<TD colspan=2>
  <font size=2>
    <A HREF="/index.pl?node_id=110195&lastnode_id=110166">Re: Name Space</A>
    <BR>
    ' by '
    <A HREF="/index.pl?node_id=108447&lastnode_id=110166">demerphq</A>
    ' on Sep 04, 2001 at 15:46'
  </font>
</TD>

# Note that the <UL> tag is incorrectly nested with regards to the <FONT> tag
# Reply to a reply
<TD colspan=2>
  <UL>
    <font size=2>
      <A HREF="/index.pl?node_id=110244&lastnode_id=110166">Re: Re: Name Space</A>
      <BR>
      ' by '
      <A HREF="/index.pl?node_id=103111&lastnode_id=110166">George_Sherston</A>
      ' on Sep 05, 2001 at 01:58'
  </UL>
    </font>

# Reply to a reply of a reply
# each extra layer of depth has an extra <UL> tag inserted
<TD colspan=2>
  <UL>
    <UL>
      <font size=2>
        <A HREF="/index.pl?node_id=121046&lastnode_id=110166">Re: Re: Re: Name Space</A>
        <BR>
        ' by '
        <A HREF="/index.pl?node_id=103111&lastnode_id=110166">George_Sherston</A>
        ' on Oct 24, 2001 at 01:21'
    </UL>
  </UL>
      </font>
</TD>
Note the buggy HTML? :-)
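
Since each extra layer of reply depth adds one more <UL> before the <font> tag, the depth of a reply can be read straight off the cell's markup. Here is a minimal sketch of that idea, written in Python rather than Perl for illustration. The function name `ul_depth` is hypothetical, and a regex is used instead of a tree parser only because the mis-nested tags make a strict tree unreliable:

```python
import re


def ul_depth(td_html: str) -> int:
    """Count the <UL> tags that open before the first <font> tag in a cell.

    0 means a primary reply, 1 a reply to a reply, and so on.
    """
    # Everything before the first <font ...> tag is nesting markup.
    head = re.split(r"<font\b", td_html, maxsplit=1, flags=re.IGNORECASE)[0]
    # Each <UL> found there represents one extra level of reply depth.
    return len(re.findall(r"<ul\b", head, flags=re.IGNORECASE))


cell = (
    '<TD colspan=2> <UL> <UL> <font size=2> '
    '<A HREF="/index.pl?node_id=121046&lastnode_id=110166">'
    'Re: Re: Re: Name Space</A> <BR> </UL> </UL> </font> </TD>'
)
print(ul_depth(cell))
```

Counting only the tags that appear before <font> sidesteps the question of where the closing </UL> tags end up.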

So what I did was look for the content of the FONT tag. If it matches a 'fingerprint' for one of the following two forms (with or without the optional part):

<font size=2>
  # Optional part begins
  <A HREF="/index.pl?node_id=121046&lastnode_id=110166">Re: Re: Re: Name Space</A>
  <BR>
  # Optional part ends
  ' by '
  <A HREF="/index.pl?node_id=103111&lastnode_id=110166">George_Sherston</A>
  ' on Oct 24, 2001 at 01:21'
</font>
Then I do a few more checks to make sure it isn't a spurious match; if they pass, I consider it the title/author/date of the node. A bit of extraction of the tags' attributes and presto, we have the home node and post node ids. (With the exception of the main post, where we can only extract the title, not the ID.)
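
The fingerprint match described above could be sketched roughly as follows. This is only an illustration in Python, not the code actually used for this post. The pattern, the name `parse_byline`, and the returned keys are all assumptions, and the quotes around ' by ' and ' on ...' are treated as optional on the guess that they only mark text nodes in the listing. The extra sanity checks are not spelled out in the post, so they are left out here:

```python
import re

# Hypothetical fingerprint for the content of a <font size=2> block.
FINGERPRINT = re.compile(
    r"""
    (?:                                  # optional part: link to the post itself
        <A\s+HREF="[^"]*?[?&]node_id=(?P<post_id>\d+)[^"]*">\s*
        (?P<title>.*?)\s*</A>\s*<BR>\s*
    )?
    '?\s*by\s*'?\s*                      # the literal by (quotes assumed optional)
    <A\s+HREF="[^"]*?[?&]node_id=(?P<home_id>\d+)[^"]*">\s*
    (?P<author>.*?)\s*</A>\s*
    '?\s*on\s+(?P<date>\w{3}\s+\d{1,2},\s+\d{4}\s+at\s+\d{1,2}:\d{2})
    """,
    re.IGNORECASE | re.VERBOSE | re.DOTALL,
)


def parse_byline(font_content: str):
    """Return title/author/date and node ids, or None if there is no match."""
    m = FINGERPRINT.match(font_content.strip())
    if m is None:
        return None
    post_id = m.group("post_id")
    return {
        "title": m.group("title"),       # None when the optional part is absent
        "post_id": int(post_id) if post_id else None,
        "author": m.group("author"),
        "home_id": int(m.group("home_id")),
        "date": m.group("date"),
    }


reply = (
    '<A HREF="/index.pl?node_id=121046&lastnode_id=110166">'
    "Re: Re: Re: Name Space</A> <BR> ' by ' "
    '<A HREF="/index.pl?node_id=103111&lastnode_id=110166">'
    "George_Sherston</A> ' on Oct 24, 2001 at 01:21'"
)
print(parse_byline(reply))
```

When the optional link is missing, as in the main node, `title` and `post_id` come back as None, so the caller would have to take the title from the <H3> instead.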

This would be sooooo much easier if there were class attributes in the tags, such as <TD class="post">, but considering the buggy HTML, I suppose class attributes are low on the priority list. (BTW, I can't wait to join the PM dev team; I'd like to have a crack at cleaning up some of the HTML, now that I'm getting into parsing it :-)

Yves / DeMerphq
--
Have you registered your Name Space?


In reply to Re: Re: Re: (crazyinsomniac) Re: Extract info from HTML by demerphq
in thread Extract info from HTML by George_Sherston
