Re: Simplify HTML programatically
by ww (Archbishop) on Jun 07, 2006 at 21:16 UTC
Demoronizer, tidy, and various other apps have been discussed here. Scott7477 is just beginning work on a tutorial on this general topic (and I'm allegedly helping, though in fact planetscape has, IMO, done most of the work by pulling together a formidable collection of refs, cites, etc., while I may be of little value other than word-butchery). GrandFather has posted a WYSIWYG editor in CUFP which may be relevant and helpful. I could undoubtedly name many other relevant resources...
BUT...
Were I in your shoes, I would not allow "cut&pasting" of M$word (so-called) .html under any circumstances, for the very good reasons you outline. Perhaps even more important, I would not allow use of anything other than a very small subset of HTML tags, for a whole range of taste, security, and simplicity reasons... and just plain "old-fogey-ism."
On the other hand, the Monastery does provide minimally restrictive methods for a visitor to write some .html. Have you explored those methods?
FWIW, my collection of links etc. may be found at the top of planetscape's extra scratchpad.
HTH,
Update 2006-06-10: Or, better, inline here, in case I later forget and delete/move something...
HTML TIDY
Clean up your Web pages with HTML TIDY
HTML Tidy Library Project
Charlie's Tidy Add-ons, including a Perl Wrapper
HTML tidy, using XML::LibXML
Demoronizer
Demoronizer - A perl script to sanitize Microsoft's HTML
Demoronizer - Correct Moronic Microsoft HTML
Word HTML 2 Formatting Objects
WH2FO is a Java application that processes the HTML output created with Word 2000 and transforms it into an XML content file and an XSL stylesheet file. From these files, a standard XSLT processor can produce a file containing only XSL-FO markup. You can also apply a stylesheet that converts the XML back into HTML, discarding all the extra markup added by Word. Using an XSL-FO renderer, such as FOP, you can then render your document to PDF.
PMEdit
GrandFather's PerlMonks Editor - on CPAN: PMEdit-001.000104-1.pl
Miscellaneous
HTML::Parser has several example programs, such as hstrip.pl, which might be helpful
OpenOffice.org - free office suite and MSOffice alternative
Use OpenOffice to save in HTML format. OpenOffice creates much cleaner HTML, and the resulting file may still be run through HTML Tidy or a script such as hstrip.pl.
Word Processor Filters
gxml2html
Re: Simplify HTML programatically
by davidrw (Prior) on Jun 07, 2006 at 21:11 UTC
A simple PM search for "word html" yielded these:
They're all a few years old, but at a quick glance it looks like they could still be useful, or at least point in some directions... You should try a Super Search as well.
Re: Simplify HTML programatically
by Joost (Canon) on Jun 07, 2006 at 21:37 UTC
HTML Tidy has specific options to clean up Word-generated HTML. It's still not perfect, but it gets rid of the worst. Take special notice of the "word-2000" and "bare" options.
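Those options can be captured in a small Tidy configuration file. The option names below are real Tidy options, though the particular selection and values are just a sketch to adapt:

```
# tidy-word.conf -- sample configuration for cleaning Word-generated HTML
word-2000: yes      # strip the surplus markup Word 2000 emits
bare: yes           # replace smart quotes, em dashes, etc. with plain text
clean: yes          # replace presentational markup with CSS where possible
output-xhtml: yes   # emit well-formed XHTML
```

Run it with something like "tidy -config tidy-word.conf messy.html > clean.html".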
Re: Simplify HTML programatically
by trwww (Priest) on Jun 08, 2006 at 06:29 UTC
I use SAX for this. For me it provides the best performance/maintainability ratio.
Here is a standalone program that takes an HTML file name as an argument and prints the output to STDOUT. Its purpose is to strip all tags except p, div, and a tags.
The program driver sets up the SAX pipeline. It uses XML::Driver::HTML as the SAX driver and XML::SAX::Writer as the writer.
When parse is called on the driver, the Pipeline sends the driver's stream to some helper modules, to our filter module, and finally the writer.
In the custom filter, start_element is called for each tag. The code checks to see if the tag is either an a, div, or p tag, and if it is, it forwards the tag to the writer. Otherwise, the tag is ignored and removed from the stream.
The same work needs to happen in the end_element callback.
- benefits:
- Accepts non-well-formed HTML and outputs well-formed XML.
- Scales well; in theory it should use a constant amount of memory.
- High level of control over the output document.
- drawbacks:
- You need to learn SAX.
- If you remove a node from the stream in the start_element callback, you have to remove the closing tag in the end_element callback (this isn't so much a "drawback" as a reminder that you have to stay on your toes).
use warnings;
use strict;

use XML::SAX::Machines qw(Pipeline);
use XML::Driver::HTML;
use XML::Filter::SAX1toSAX2;
use XML::Filter::BufferText;
use XML::SAX::Writer;

my $output; # transformation target
my $writer = XML::SAX::Writer->new( Output => \$output );

my $machine = Pipeline(
    'XML::Filter::SAX1toSAX2' =>
    'XML::Filter::BufferText' =>
    'XML::Filter::HtmlTagStripper' =>
    $writer
);

my $html = XML::Driver::HTML->new(
    Handler => $machine,
    Source  => { SystemId => $ARGV[0] },
);

$html->parse();

print $output;
package XML::Filter::HtmlTagStripper;

use base qw| XML::SAX::Base |;

# For a tag like <marker language="foo" />:
#   $el->{Name}                             is 'marker'
#   $el->{Attributes}{'{}language'}         is the language attribute
#   $el->{Attributes}{'{}language'}{Value}  is 'foo'

sub start_element {
    my ($self, $el) = @_;
    if ( $el->{Name} =~ m/^(?:p|div|a)$/i ) {
        $self->SUPER::start_element($el);
    }
}

sub end_element {
    my ($self, $el) = @_;
    if ( $el->{Name} =~ m/^(?:p|div|a)$/i ) {
        $self->SUPER::end_element($el);
    }
}

1;
I know it's not the most un-rocket-sciencey thing in the world, but it's not too tricky, and once it clicks in your head and you realize that this is high-performance XML parsing, the possibilities are boggling.
Let's take a look at a run:
$ cat striptags.html
<html>
<head>
<title>Test Document</title>
</head>
<body>
<p>The first paragraph</p>
<p>the second paragraph</p>
<hr width="75%">
<div>last modified: WHENEVER</div>
</body>
</html>
$ perl striptags.pl striptags.html
Test Document<p>The first paragraph</p><p>the second paragraph</p><div>last modified: WHENEVER</div>
Enjoy,
Todd W.
For something of a simpler* solution, but in the same vein, there's HTML::TreeBuilder. HTML::Element provides all of the primitives that you really need for an operation like this: look_down to identify relevant elements, replace_with_content to "remove" a tag without removing what it contains, and delete to completely destroy all signs of a given element. I'm not up to writing an example right now, but it's truly simple. Give it a shot! It goes a long way, and the output is bound to be less of a mess than the input.
* edit: okay, I realized that some might be confused by this usage of simple, since trwww's example is pretty simple in itself. Mostly it's a matter of being allowed to think in terms of tree manipulations instead of opens and closes and stacking and de-stacking. The corresponding cost is in storage, but it's usually not worrisome.
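A minimal sketch of that HTML::TreeBuilder approach (assuming the module is installed; the input markup and the choice of tags to unwrap or delete are purely illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical messy input along the lines of Word's output.
my $messy = '<div><p><font size="2">Hello, <b>world</b>!</font></p>'
          . '<script>alert(1)</script></div>';

my $tree = HTML::TreeBuilder->new_from_content($messy);

# "Unwrap" font and b tags: keep their contents, drop the tags themselves.
for my $el ( $tree->look_down( _tag => qr/^(?:font|b)$/ ) ) {
    $el->replace_with_content->delete;
}

# Destroy script elements entirely, contents and all.
$_->delete for $tree->look_down( _tag => 'script' );

# Third argument {} forces all optional end tags (like </p>) to be printed.
my $clean = $tree->as_HTML( undef, undef, {} );
print "$clean\n";

$tree->delete;    # free the tree (pre-weak-reference HTML::Element habit)
```

The tree is rebuilt in memory, so this trades the SAX approach's constant memory for the convenience of thinking in whole-element operations.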
Re: Simplify HTML programatically
by hpavc (Acolyte) on Jun 08, 2006 at 01:15 UTC
I am afraid that to do this right, you would have to transform nearly the whole HTML fragment that Word generates. It is chock-full of divs, last I looked, most of which represent tons of waste, though some were integral to basic formatting.
Re: Simplify HTML programatically
by Rhandom (Curate) on Jun 08, 2006 at 15:00 UTC
Re: Simplify HTML programatically
by DaWolf (Curate) on Jun 09, 2006 at 04:31 UTC
I have to agree with ww. Since you have "JavaScript Power" on your website, it would be relatively simple to prevent "CTRL + C, CTRL + V" and other key combinations.
Incidentally, have you tried FCKEditor? AFAIK it's the best HTML on-line editor out there.
Just my two cents.
Agreed. FCKEditor handles WYSIWYG HTML editing from within a webpage quite well, and it handles MSWord HTML too.
Re: Simplify HTML programatically
by blahblah (Friar) on Jun 12, 2006 at 06:25 UTC
Re: Simplify HTML programatically
by freddo411 (Chaplain) on Jun 12, 2006 at 20:53 UTC
In a somewhat related vein, if you want to remove some of the special characters that MSWord creates, you can use the following code, which is based on the excellent demoronizer code.
sub nukeMSsmarts {
    my $s = shift;

    # Map incompatible CP-1252 characters to ASCII approximations
    $s =~ s/\x82/,/g;
    $s =~ s-\x83-<em>f</em>-g;
    $s =~ s/\x84/,,/g;
    $s =~ s/\x85/.../g;
    $s =~ s/\x88/^/g;
    $s =~ s-\x89-o/oo-g;
    $s =~ s/\x8B/</g;
    $s =~ s/\x8C/Oe/g;
    $s =~ s/\x91/'/g;
    $s =~ s/\x92/'/g;
    $s =~ s/\x93/"/g;
    $s =~ s/\x94/"/g;
    $s =~ s/\x95/*/g;
    $s =~ s/\x96/-/g;
    $s =~ s/\x97/--/g;
    $s =~ s-\x98-<sup>~</sup>-g;
    $s =~ s-\x99-<sup>TM</sup>-g;
    $s =~ s/\x9B/>/g;
    $s =~ s/\x9C/oe/g;

    # Now replace any remaining untranslated characters.
    $s =~ s/[\x00-\x08\x10-\x1F\x80-\x9F]/*/g;

    return $s;
}
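As an aside (not part of the code above), if the goal is correct characters rather than ASCII approximations, the core Encode module can decode the same CP-1252 repertoire into real Unicode; a sketch:

```perl
use strict;
use warnings;

use Encode qw(decode encode);

# Example bytes copied out of a Word document:
# "smart" quotes, an em dash, and an ellipsis.
my $bytes = "\x93Hello\x94 \x97 world\x85";

# Interpret the bytes as Windows-1252 instead of mapping them by hand.
my $text = decode('cp1252', $bytes);

# $text now holds real Unicode: U+201C, U+201D, U+2014, U+2026.
printf "U+%04X\n", ord($text);    # prints U+201C (the opening curly quote)

# Re-encode as UTF-8 for output.
my $utf8 = encode('UTF-8', $text);
```

This keeps the characters as what they actually are, which matters if the cleaned HTML will be served with a Unicode charset anyway.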
-------------------------------------
Nothing is too wonderful to be true
-- Michael Faraday
Re: Simplify HTML programatically
by Anonymous Monk on Nov 26, 2007 at 14:30 UTC
Many people here advocate a strategy that is absent from the original question, i.e. deciding which tags to remove. The original question wanted a technique to remove ALL tags, with some exceptions. I found this page while looking for a way to accomplish exactly the same thing.
I've read many warnings against parsing HTML with regular expressions, but for this task, are they still valid?
My predicament is that I'm changing the WYSIWYG editor on an old CMS system to FCK. However, for some of the messy old material I need to clean everything except some tags (the old editor used a strategy of "hugging the text", so there is A LOT of unknown stuff on hundreds and hundreds of pages).
Problem is, my brain is too small to formulate the regular expression needed.
/nic_tester
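The warnings do still apply in general (regexes can't track nesting, comments, CDATA, or attributes containing a literal '>'), but for a quick-and-dirty whitelist strip along the lines asked for, a rough sketch looks like this, in Perl since that's the local dialect; the same substitution pattern would carry over to JavaScript's String.replace. The tag list is only an example:

```perl
use strict;
use warnings;

# Hypothetical input in the old editor's "hug the text" style.
my $html = '<p class="f1">one <b>two</b> <i>three</i> <a href="x.html">link</a></p>';

# Tags to keep; every other tag is stripped (but its contents are kept).
my $keep = qr/p|a|img|br|table|tr|td/;

# Remove any open or close tag whose name is not on the whitelist.
# Caveat: this will mangle attributes that contain a literal '>'.
$html =~ s{</?(?!(?:$keep)\b)[a-zA-Z][^>]*>}{}g;

print "$html\n";
# prints: <p class="f1">one two three <a href="x.html">link</a></p>
```

The \b after the whitelist is what keeps, say, "span" from sneaking past because it contains "p". For anything beyond a one-off cleanup, a real parser (as in the reply below) is still the safer route.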
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $html_in = do { local $/; <DATA> };

my $p = HTML::TokeParser::Simple->new(\$html_in)
    or die qq{can't parse html\n};

my $html_out;

# Anchored so that, e.g., 'span' doesn't slip through by matching /p/.
my $re = qr/^(?:html|head|title|body|p|img)$/;

while (my $t = $p->get_token) {
    if (not $t->is_tag()) {
        $html_out .= $t->as_is;
    }
    elsif ($t->is_tag($re)) {
        $html_out .= $t->as_is;
    }
}

print qq{$html_out\n};
__DATA__
<html>
<head>
<title>title</title>
</head>
<body>
<p>one <b>two</b> <i>three</i></p>
<p><img src="four.gif" alt="img"> <a href="five.html">five</a></p>
<p><font>six</font></p>
</body>
</html>
output:
<html>
<head>
<title>title</title>
</head>
<body>
<p>one two three</p>
<p><img src="four.gif" alt="img"> five</p>
<p>six</p>
</body>
</html>
Post a new question if this isn't what you meant or if you want more information.
Many thanks for your reply.
Well, the snippet you posted is more or less what I need. However, I don't know Perl, and it looks more server-side than client-side. My need:
I need to clean all HTML tags from a string, with some exception tags. I can only define the exceptions, not the tags to clean. This must be accomplished client-side, preferably in JavaScript, possibly with the JavaScript DOM, even though then I'll be well out of my depth.
Background (read it or not; it's verbose):
With the arrival of Windows Vista, the ActiveX WYSIWYG HTML editor on a content management system I'm hosting stopped working. My job was to integrate FCK in place of the ActiveX control.
The old component never removed copy-pasted tags from Word etc.; it just "hugged the text" with its own font tags, thus hiding loads of garbage that's still in the DB. FCK cannot support these CSS font tags in a sufficiently user-friendly manner, so in the future I'm using h1, h2, h3 and such together with CSS. However, when a user wishes to edit content that was produced with the old editor, I fret that there might be inconsistencies between h1 (FCK) and font class="r1" (old WYSIWYG).
And further, if I exchange the normal text markup of the old editor (font class="f1") with that of FCK (nothing), all the junk that has been copy-pasted into the CMS and then hidden by hug-the-text font tags will suddenly surface. Thus, I want to nuke everything, on user command, except for stuff like links, line breaks, tables, images, paragraph tags, et cetera.
/nic