PerlMonks  

Re: HTML::TreeBuilder: sort a Definition List (<dl>)

by skillet-thief (Friar)
on Sep 12, 2005 at 19:25 UTC ( [id://491343] )


in reply to HTML::TreeBuilder: sort a Definition List (<dl>)

I'm not sure I understand your question. The code you already have seems to do most of the tricky stuff, i.e., getting the data out of the HTML.

If I were doing this (but I'm not fast enough to just whip out code right now), I think I would detach the <dt> and <dd> objects from the tree as I read them (there are a couple of methods for doing this, IIRC). Then I would sort them as HTML::Element objects, using a big Schwartzian Transform. Once you have an array of sorted HTML::Element objects, you can reattach the whole thing to the <dl>.
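For readers who have not met the Schwartzian Transform, here is a minimal sketch in core Perl. The `[ term, definition ]` pairs are hypothetical stand-ins for the detached <dt>/<dd> objects, and `lc` is just one possible sort key; with real HTML::Element objects the key would come from the element instead.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for detached <dt>/<dd> pairs: [ term, definition ].
my @pairs = (
    [ 'E Definition', 'E - data' ],
    [ 'B Definition', 'B - data' ],
    [ 'A_definition', 'A data.'  ],
    [ 'C definition', 'C - data' ],
);

# Schwartzian Transform, read bottom to top:
#   1. wrap each item as [ sort_key, item ] so the key is computed only once,
#   2. sort on the precomputed key,
#   3. unwrap to get the original items back in sorted order.
my @sorted =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] }
    map  { [ lc $_->[0], $_ ] }
    @pairs;

print "$_->[0]: $_->[1]\n" for @sorted;
```

The point of the wrapping step is that an expensive key is calculated once per item rather than once per comparison.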

Assuming that is what you wanted to do... ;-)

Good luck.

sub jf { print substr($_[0], -1); jf( substr($_[0], 0, length($_[0])-1)) if length $_[0] > 1; } jf('gro.alubaf@yehaf');

Re^2: HTML::TreeBuilder: sort a Definition List (<dl>)
by Util (Priest) on Sep 13, 2005 at 01:54 UTC

    ++skillet-thief, I agree with your design; the code at the bottom implements it. It is slightly more complex, to handle the tags other than DT and DD that can exist in the DL.

    Notable issues in the OP code:

    • $tree->destroy should be $tree->delete.
    • You use $tree->parse without using $tree->eof! From the HTML::TreeBuilder docs:
      $root->eof()
      This signals that you're finished parsing content into this tree; this runs various kinds of crucial cleanup on the tree. This is called for you when you call $root->parse_file(...), but not when you call $root->parse(...). So if you call $root->parse(...), then you must call $root->eof() once you've finished feeding all the chunks to parse(...), and before you actually start doing anything else with the tree in $root.
      Using new_from_content or new_from_file would also prevent the problem.
    • You say:
      my ($dl) = $tree->look_down('_tag', 'dl');
      This means "scan *everywhere* in $tree to find all the DL tags, and put the first DL tag found into $dl". Why ask for them all and take the first? Instead, ask for *only* the first DL, by calling look_down in scalar context.
      my $dl = $tree->look_down('_tag', 'dl');
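The last two points can be sketched together as follows. The HTML string is a made-up two-list document, not the OP's data; the calls used (`new`, `parse`, `eof`, `look_down`, `as_text`, `delete`) are the documented HTML::TreeBuilder / HTML::Element methods, and the module must be installed from CPAN.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample document containing two definition lists.
my $html = '<html><body>'
         . '<dl><dt>B</dt><dd>b</dd></dl>'
         . '<dl><dt>A</dt><dd>a</dd></dl>'
         . '</body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);   # may be called repeatedly, once per chunk
$tree->eof;            # required after parse(), before using the tree

# List context: every matching element is returned.
my @all_dls = $tree->look_down( _tag => 'dl' );
my $count   = @all_dls;

# Scalar context: only the first match is returned.
my $first_dl   = $tree->look_down( _tag => 'dl' );
my $first_text = $first_dl->as_text;

print "found $count <dl> elements; first one contains: $first_text\n";

$tree->delete;         # free the tree once finished with it
```

With `HTML::TreeBuilder->new_from_content($html)` the explicit `parse` and `eof` calls are not needed.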

    Working, tested code:

        #!/usr/bin/perl -W
        use strict;
        use HTML::TreeBuilder;

        my $tree = HTML::TreeBuilder->new_from_content(<<'END') or die;
        <html>
        <head>
        <title>Glossary</title>
        </head>
        <body>
        <h1>Glossary</h1>
        <dl>
        <dt><b>E Definition</b></dt>
        <dd>E - data</dd>
        <p></p>
        <dt><b>B Definition</b></dt>
        <dd>B - data</dd>
        <p></p>
        <dt><b>A_definition</b></dt>
        <dd>A data.</dd>
        <p></p>
        <dt><b>C definition</b></dt>
        <dd>C - data</dd>
        <p></p>
        </dl>
        </body>
        </html>
        END

        my $dl = $tree->look_down( _tag => 'dl' );

        # Unlink all of $dl's children from $dl, and return them.
        my @dl_content = $dl->detach_content();

        # Group the tags into an AoA on the DT tag.
        # Text nodes are plain strings, not objects, so check ref() first.
        my @dt_tag_clusters;
        foreach (@dl_content) {
            push @dt_tag_clusters, [] if ref($_) && $_->tag() eq 'dt';
            die "Tags occurred before first DT" unless @dt_tag_clusters;
            push @{ $dt_tag_clusters[-1] }, $_;
        }

        # Sort the clusters
        @dt_tag_clusters =
            map  { $_->[1] }
            sort { $a->[0] cmp $b->[0] }
            map  { [ $_->[0]->as_HTML, $_ ] }
            @dt_tag_clusters;

        # Un-cluster the tags.
        @dl_content = map { @$_ } @dt_tag_clusters;

        # Replace the DL's content with the sorted tags.
        $dl->push_content( @dl_content );

        print $tree->as_HTML; # or use HTML::PrettyPrinter

        $tree = $tree->delete();

      Thanks everybody!
      @Util - perfect! Exactly what I was looking for!
      I really liked the way you created the clusters. Then it took me some time to understand the map-sort-map (until I found it in the cookbook) and the un-clustering (well, I didn't really understand that one, but I can take it as given).
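The un-clustering step is easier to see on plain data. In this small sketch the strings are hypothetical placeholders for HTML::Element objects; each inner array stands for one DT plus the tags that follow it.

```perl
use strict;
use warnings;

# Hypothetical clusters: an array of array references (AoA).
my @clusters = (
    [ 'dt:A', 'dd:A', 'p' ],
    [ 'dt:B', 'dd:B', 'p' ],
);

# Inside map, $_ is one array reference. @$_ dereferences it into its
# elements, and map joins all of those lists into a single flat list.
my @flat = map { @$_ } @clusters;

print join( ', ', @flat ), "\n";
```

So `map { @$_ }` simply turns a list of lists back into one list while keeping the order of the clusters.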

      Not only did you solve my problem, but you also greatly enhanced my understanding of Perl and added to my toolbox of solutions to common problems!

      One small note though:
      Mapping like this: map { [ $_->[0]->as_HTML, $_ ] } leads to problems when the dt element contains more tags (some of mine are links as well), so it's better to use map { [ $_->[0]->as_text, $_ ] }, or even to apply further processing to the text, such as lc and (at least in Germany) umlaut handling.
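One possible way to deal with case and umlauts together is to use a collation key as the sort key. This is only a sketch: the strings below are invented and stand in for the values that `$_->[0]->as_text` would return, and it relies on Unicode::Collate (which ships with Perl) in its default configuration rather than any German-specific tailoring.

```perl
use strict;
use warnings;
use utf8;                      # this source file contains UTF-8 literals
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Invented terms standing in for the as_text value of each <dt>.
my @terms = ( 'Zebra', 'banane', 'Äpfel', 'Apfelsine' );

my $collator = Unicode::Collate->new;

# Same map-sort-map shape as before, but the key is a collation sort key,
# so case and accents no longer dominate the ordering as they do with
# a plain cmp on the raw strings.
my @sorted =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] }
    map  { [ $collator->getSortKey($_), $_ ] }
    @terms;

print join( ', ', @sorted ), "\n";
```

A plain `cmp` on these strings would put the capitalised words first and the umlaut last; whether the default collation order matches the ordering you want for a German glossary is something to check against your own data.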

      More than happy,
      svenXY
