PerlMonks  

Converting and cleaning Word's HTML export to valid HTML

by projekt21 (Friar)
on Nov 08, 2001 at 16:39 UTC (#124049, snippet)
Description:

After discussing how to convert Word's HTML horror into valid, clean HTML in the node "Converting Word97 (or later) exported HTML to valid HTML" and in the CB, I decided to pursue a solution built on the excellent HTML::TreeBuilder module.

Its advantage over tidy and similar tools is that I can configure which tags, elements and attributes to drop. I also noticed that tidy gave up on some unusual Word files that HTML::TreeBuilder handled. The drawback, of course, is speed.
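
To make the difference between the three kinds of removal concrete, here is a minimal sketch on a made-up fragment. It is not the snippet's own code: the sample markup and the variable names are assumptions for illustration only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input, loosely resembling Word's export.
my $html = '<html><head><style>p { color: red }</style></head>'
         . '<body><p class="MsoNormal"><font face="Arial">Hello</font></p>'
         . '</body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;    # signal end of input before touching the tree

# Tag only: <font> disappears, its text stays where it was.
$_->replace_with_content for $tree->look_down(_tag => 'font');

# Element with content: <style> and everything inside it go away.
$_->delete for $tree->look_down(_tag => 'style');

# Attribute: setting it to undef removes it from the element.
$_->attr(class => undef) for $tree->look_down(class => qr/./);

my $out = $tree->as_HTML(undef, ' ', {});
$tree->delete;
print $out;
```

The paragraph text survives, while the font tag, the style block and the class attribute do not. This is the same split the configuration makes between tags, elements and attributes.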

For completeness, here are some pointers to alternative solutions suggested by fellow monks (thanks to all):

alex pleiner <alex@zeitform.de>
zeitform Internet Dienste

#### configuration ####
# The cleanup sub reads these lists as @Conf::..., so they are
# assigned as package variables in Conf, not as file lexicals.

# attributes to ignore
@Conf::ignore_attr =
    qw(bgcolor background color face style link alink 
       vlink text onblur onchange onclick ondblclick 
       onfocus onkeydown onkeyup onload onmousedown 
       onmousemove onmouseout onmouseover onmouseup
       onreset onselect onunload class xmlns:w xmlns:o 
       xmlns
      );

# tags to ignore (the tag is dropped, its content is kept)
@Conf::ignore_tags = 
    qw(font big small body dir html div span);

# tags to drop with content
@Conf::ignore_elements = 
    qw(script style head o:p);


############################################################
sub clean_up_htmltree {
############################################################

  my $input = shift;
  my $warn = 0; 
  my $htmlex; 
  use HTML::TreeBuilder;

  my $h = HTML::TreeBuilder->new;
  $h->ignore_unknown(0);
  $h->warn($warn);
  $h->parse($input);
  $h->eof;   # tell the parser the document is complete

  # drop all unwanted tags
  foreach (@Conf::ignore_tags) {
    if (lc($_) eq "html") {  # the root cannot be replaced in the tree,
      $htmlex = 1;           # so strip <html>...</html> from the output later
      next;
    }
    while (my $ok = $h->look_down('_tag', "$_")) { 
      $ok->replace_with_content; 
    }
  }

  # drop all unwanted elements (tags w/ content)
  foreach (@Conf::ignore_elements) {
    while (my $ok = $h->look_down('_tag', "$_")) { 
      $ok->detach; 
    }
  }

  # drop all unwanted attributes
  foreach my $attr (@Conf::ignore_attr) {
    while (my $ok = $h->look_down( sub { defined($_[0]->attr($attr)) } )) { 
      $ok->attr($attr, undef);
    }
  }

  # drop unwanted script code <![....]>
  foreach my $ok ( $h->look_down( sub { grep { /^<\s*!\[.+?\]\s*>$/ } $_[0]->content_list } ) ) {
    $ok->detach_content; 
  }


  my $output = $h->as_HTML(undef, " ", {}); 
  # params = entities to encode, indent, optional endtags
  $h = $h->delete(); # nuke it!
  if ($htmlex) {
    $output =~ s:^\s*<html>::m;
    $output =~ s:</html>\s*$::m;
  }
  return $output;
}
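
The pattern used for the `<![....]>` residue can be checked on its own, without building a tree. The sketch below assumes such markers reach the tree as plain text children, which is what the sub relies on. The sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Same pattern as in the cleanup sub above.
my $residue = qr/^<\s*!\[.+?\]\s*>$/;

# Hypothetical text nodes; the first two mimic Word's conditional markers.
my @samples = (
    '<![if !supportEmptyParas]>',
    '<![endif]>',
    '<!-- an ordinary comment -->',
    'plain text',
);

# Keep only the strings the pattern would flag for removal.
my @hits = grep { $_ =~ $residue } @samples;
print "$_\n" for @hits;
```

Only the two bracketed markers are flagged. Ordinary comments and plain text are left alone, because the pattern requires a `[` directly after `<!`.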