http://qs321.pair.com?node_id=124796


in reply to Re: (crazyinsomniac) Re: Extract info from HTML
in thread Extract info from HTML

Well, even though this wasn't addressed to me:

Mine will extract all the above information; just change the following lines:

print "($depth)$monkname posted '$title' on $date\n";
$hashref->{$monkname}->{$node_id} = { date => $date, title => $title, depth => $depth };
Then you can extract whatever you want.
$VAR1 = {
  'demerphq' => {
    '110238' => { 'depth' => '13', 'title' => 'Corions Name Space', 'date' => 'Sep 05, 2001 at 01:04' },
    'Home' => '108447',
    '110195' => { 'depth' => '12', 'title' => 'Re: Name Space', 'date' => 'Sep 04, 2001 at 15:46' }
  },
  'George_Sherston' => {
    'Home' => '103111',
    '124767' => { 'depth' => '13', 'title' => 'Re: Re: Name Space', 'date' => 'Nov 11, 2001 at 22:33' },
    'Name Space' => { 'depth' => '9', 'title' => 'Name Space', 'date' => 'Sep 04, 2001 at 13:33' },
    '121046' => { 'depth' => '14', 'title' => 'Re: Re: Re: Name Space', 'date' => 'Oct 24, 2001 at 01:21' },
    '117665' => { 'depth' => '13', 'title' => 'Re: TheOrbTwo\'s Name Space', 'date' => 'Oct 09, 2001 at 00:05' },
    '117303' => { 'depth' => '13', 'title' => 'Re: Re: Name Space', 'date' => 'Oct 07, 2001 at 03:57' },
    '110244' => { 'depth' => '13', 'title' => 'Re: Re: Name Space', 'date' => 'Sep 05, 2001 at 01:58' },
    '122854' => { 'depth' => '13', 'title' => 'Re: Re: Name Space', 'date' => 'Nov 02, 2001 at 08:07' }
  },
};
Note that the depths are as follows: 9 is the root node, 12 is a reply, 13 is a reply to a reply, and so on.
But a thought: you don't want the posts from just a fixed depth in the parse tree. That would, for instance, eliminate you from the list (you don't have a reply to yourself), as well as anyone who explained their name in a reply to another person's explanation. merphq would be an example, and I believe there are more as well.
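
As a rough sketch of that point, here is one way the structure in the dump could be read back without filtering on depth. The sample data is a trimmed copy of the dump, and posts_for is a hypothetical helper name, not part of the original script. It assumes every post is stored as a hash reference, so plain scalar entries such as 'Home' can be skipped.

```perl
use strict;
use warnings;

# Trimmed sample shaped like the dump above (assumption: same layout).
my $hashref = {
    demerphq => {
        Home   => '108447',    # plain scalar: home node id, not a post
        110238 => { depth => 13, title => 'Corions Name Space', date => 'Sep 05, 2001 at 01:04' },
        110195 => { depth => 12, title => 'Re: Name Space',     date => 'Sep 04, 2001 at 15:46' },
    },
    George_Sherston => {
        Home   => '103111',
        124767 => { depth => 13, title => 'Re: Re: Name Space', date => 'Nov 11, 2001 at 22:33' },
    },
};

# Hypothetical helper: return one record per post for a monk, at any depth.
sub posts_for {
    my ($data, $monk) = @_;
    my $nodes = $data->{$monk} or return;
    my @posts;
    for my $node_id (sort keys %$nodes) {
        my $post = $nodes->{$node_id};
        next unless ref($post) eq 'HASH';    # skips 'Home' => '108447'
        push @posts, { node_id => $node_id, %$post };
    }
    return @posts;
}

# Print every post for every monk, whatever its depth in the parse tree.
for my $monk (sort keys %$hashref) {
    for my $post (posts_for($hashref, $monk)) {
        print "$monk: ($post->{depth}) '$post->{title}' on $post->{date}\n";
    }
}
```

Keeping the depth in each record rather than filtering on it means the choice of which posts count as etymologies can be made later, once there is some way to tell them apart.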

Actually, one of the more interesting issues with this thread was accurately picking up all names from all levels. There is an annoying habit of <UL> tags messing up the pattern, and the main post is marked up differently as well.

Anyway, I'll revisit this a bit later :-)

Yves / DeMerphq
--
Have you registered your Name Space?

Re: Re: Re: (crazyinsomniac) Re: Extract info from HTML
by George_Sherston (Vicar) on Nov 12, 2001 at 15:09 UTC
    Those whom the gods would destroy, they first interest in parsing natural language... Really, in order to come up with a satisfactory solution to this, we're going to have to find a way to distinguish between the content of nodes. We need a script that can make an intelligent guess whether the node is a response or an etymology. This is a bit too rich for my blood, but I look forward to seeing it done :)

    George Sherston