PerlMonks  


The problem arose when one of our customers wanted to feed formatted text from a Word 97 .doc file into an RDBMS to generate web pages with dynamic content. So I exported the text as HTML from Word, opened the file in a text editor, and was confronted with horrific HTML (you may know what I mean).

My first approach was to use HTML::Parser and a modified version of one of its example scripts to drop some tags (like <font>). HTML::Parser did a good job of that, but it left ugly things like <b><i> ... </b></i> in place, which isn't valid because the tags are closed in the wrong order.
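To make the nesting problem concrete, here is a minimal sketch of a stack-based check that reports tags closed in the wrong order. It is not part of the original script: `find_misnested_tags` is a hypothetical helper, it uses a plain regex rather than HTML::Parser, and it assumes simple markup (no comments, no `>` inside attribute values, every non-void element explicitly closed).

```perl
use strict;
use warnings;

# Hypothetical helper (not from HTML::Parser or HTML::TreeBuilder):
# returns a list of nesting problems found in a fragment of HTML.
# Assumes simple markup: no comments, no '>' inside attribute values.
sub find_misnested_tags {
    my ($html) = @_;
    my %void = map { $_ => 1 } qw(br hr img input meta link);
    my (@stack, @problems);

    while ($html =~ m{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>}g) {
        my ($closing, $tag) = ($1, lc $2);
        next if $void{$tag};    # void elements never get an end tag

        if (!$closing) {
            push @stack, $tag;  # remember the open tag
        }
        elsif (@stack && $stack[-1] eq $tag) {
            pop @stack;         # properly nested close
        }
        else {
            my $open = @stack ? "<$stack[-1]>" : "nothing";
            push @problems, "</$tag> closes while $open is open";
            # drop the matching opener, if any, so later checks stay meaningful
            for my $i (reverse 0 .. $#stack) {
                if ($stack[$i] eq $tag) { splice @stack, $i, 1; last }
            }
        }
    }
    push @problems, "<$_> is never closed" for @stack;
    return @problems;
}

print "$_\n" for find_misnested_tags('<p><b><i>text</b></i></p>');
```

For the fragment above, the check reports that `</b>` closes while `<i>` is open. Running something like this over the filtered output is one way to see whether a purely stream-based filter has left such pairs behind. A tree builder avoids the problem by construction, since it serializes from a properly nested tree.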

So I took a look at HTML::TreeBuilder and wrote the following sub to do the work. It works fine, but I would like to draw on my fellow monks' deeper knowledge.

Are there other ways to handle Word's HTML output and get valid HTML from it? Please give me some pointers (other than HTML Tidy, which can't be used). Thanks.

# ... snippet ...

# tags to ignore (the tag itself is removed, its content is kept)
my @ignore_tags = qw(font big small body dir html);

# tags to drop together with their content
my @ignore_elements = qw(script style head);

##########################################################
sub clean_up_htmltree {
##########################################################
    my $input = shift;
    my $warn  = 0;
    my $htmlex;

    use HTML::TreeBuilder;
    my $h = HTML::TreeBuilder->new;
    $h->ignore_unknown(0);
    $h->warn($warn);
    $h->parse($input);
    $h->eof;    # signal end of input so the tree is completed

    foreach (@ignore_tags) {
        # <html> is the root element; it is stripped from the output string below
        $htmlex = 1, next if lc($_) eq "html";
        while (my $ok = $h->look_down('_tag', "$_")) {
            $ok->replace_with_content;
        }
    }

    foreach (@ignore_elements) {
        while (my $ok = $h->look_down('_tag', "$_")) {
            $ok->detach;
        }
    }

    # arguments: entities to encode, indent, optional end tags
    my $output = $h->as_HTML(undef, " ", {});
    $h = $h->delete();    # nuke it!

    if ($htmlex) {
        $output =~ s:^\s*<html>::m;
        $output =~ s:</html>\s*$::m;
    }

    return $output;
}

alex pleiner <alex@zeitform.de>
zeitform Internet Dienste


In reply to Converting Word97 (or later) exported HTML to valid HTML by projekt21
