in reply to Re: Simplify HTML programatically
in thread Simplify HTML programatically
> I've read many warnings against parsing HTML with regular expressions, but for this task, are they still valid?

You can parse HTML with a regex but, imo, it's tricky. I always reach for a parser. There are many, and monks recommend different modules. fwiw, I tend to stick to HTML::TokeParser::Simple.
Perhaps something like this (it even has a regex):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    # Slurp the sample document from the __DATA__ section.
    my $html_in = do { local $/; <DATA> };

    my $p = HTML::TokeParser::Simple->new(\$html_in)
        or die qq{can't parse html\n};

    my $html_out;
    # Tags to keep; every other tag is dropped.
    my $re = qr/html|head|title|body|p|img/;

    while (my $t = $p->get_token) {
        if (not $t->is_tag()) {
            # Anything that is not a tag (text and so on) passes through untouched.
            $html_out .= $t->as_is;
        }
        elsif ($t->is_tag($re)) {
            # A start or end tag whose name matches the keep list.
            $html_out .= $t->as_is;
        }
    }
    print qq{$html_out\n};
    __DATA__
    <html>
    <head>
    <title>title</title>
    </head>
    <body>
    <p>one <b>two</b> <i>three</i></p>
    <p><img src="four.gif" alt="img"> <a href="five.html">five</a></p>
    <p><font>six</font></p>
    </body>
    </html>

output:

    <html>
    <head>
    <title>title</title>
    </head>
    <body>
    <p>one two three</p>
    <p><img src="four.gif" alt="img"> five</p>
    <p>six</p>
    </body>
    </html>

Post a new question if this isn't what you meant or if you want more information.
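
One thing to watch with the keep-list pattern in the script: it is unanchored, so it matches any string that merely contains one of the names. The sample data never trips over this, but a name like "span" or "pre" contains "p". Below is a minimal sketch in core Perl that compares the original pattern with an anchored variant. The names `@keep`, `$loose` and `$strict` are made up for illustration, and the sketch only tests plain strings. It assumes the module matches the pattern against the bare tag name, which I have not checked for every version of HTML::TokeParser::Simple.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tag names to keep (same list as in the script; @keep is a hypothetical name).
my @keep = qw(html head title body p img);

# The pattern as posted: unanchored, so a substring hit is enough.
my $loose = qr/html|head|title|body|p|img/;

# Anchored variant: the whole name has to be one of the listed tags.
my $alt    = join '|', map { quotemeta($_) } @keep;
my $strict = qr/^(?:$alt)$/i;

# Compare both patterns on a few tag names.
for my $name (qw(p img span pre b font)) {
    printf "%-5s loose=%-3s strict=%s\n",
        $name,
        ($name =~ $loose  ? 'yes' : 'no'),
        ($name =~ $strict ? 'yes' : 'no');
}
```

With the unanchored pattern, "span" and "pre" are reported as matches because they contain "p". The anchored pattern accepts only the six listed names. If the input may contain such tags, swapping in the anchored pattern looks like the safer choice, though I'd test it against your installed module version first.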