Re: Strip specific html sequence

Please see Parsing HTML/XML with Regular Expressions for why it is not a good idea to do this without a proper parser. In particular, look at the "spoiler" there for lots of cases of perfectly valid HTML that will not be fun to parse with a regex. Here's an example with Mojo::DOM that removes every div that has a div with class "blue" as a direct child:

use warnings;
use strict;
use Mojo::DOM;

my $html = <<'ENDHTML';
<html><head><title>Title</title></head>
<body>
<div><div>
</div></div><div><div class="blue"></div></div>
</body>
</html>
ENDHTML

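my $dom = Mojo::DOM->new($html);
# Find each div.blue that is a direct child of a div,
# then remove that parent div (and with it the div.blue).
$dom->find('div > div.blue')
    ->each(sub{ $_->parent->remove });
print $dom;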

__END__

<html><head><title>Title</title></head>
<body>
<div><div>
</div></div>
</body>
</html>

I had a quick look at "Git for Windows", and it happens to include HTML::Parser. tangent showed an example with that module in the thread linked above, and because it is a fairly old but good module, you will find lots of examples with it online as well. That Git distribution also appears to include cpan, so you could try installing Mojo::DOM.
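In case installing Mojo::DOM is not an option, here is a rough sketch of the same cleanup with HTML::Parser. It is not the example tangent posted, and it assumes the sequence to drop is exactly <div><div class="blue"></div></div> with nothing between the tags, which is stricter than the Mojo::DOM version above. The names $out, @pending, $state and handle are my own.

```perl
use warnings;
use strict;
use HTML::Parser;

my $html = <<'ENDHTML';
<html><head><title>Title</title></head>
<body>
<div><div>
</div></div><div><div class="blue"></div></div>
</body>
</html>
ENDHTML

my $out     = '';  # filtered document
my @pending = ();  # source text of a possible match, held back
my $state   = 0;   # number of tags of the sequence matched so far

sub handle {
    my ($event, $tag, $attr, $text) = @_;
    $tag  //= '';
    $text //= '';
    my $start_div = $event eq 'start' && $tag eq 'div';
    my $end_div   = $event eq 'end'   && $tag eq 'div';
    my $is_blue   = $start_div && ($attr->{class} // '') eq 'blue';

    # Advance through <div>, <div class="blue">, </div>, </div>
    if ($state == 1 && $is_blue) { push @pending, $text; $state = 2; return }
    if ($state == 2 && $end_div) { push @pending, $text; $state = 3; return }
    if ($state == 3 && $end_div) { @pending = (); $state = 0; return }  # drop it

    # Not a match after all: emit whatever was held back
    $out .= join '', @pending;
    @pending = ();
    $state   = 0;

    # Any opening div may be the start of a new match
    if ($start_div) { @pending = ($text); $state = 1 }
    else            { $out .= $text }
}

my $p = HTML::Parser->new(
    api_version => 3,
    default_h   => [ \&handle, 'event, tagname, attr, text' ],
);
$p->parse($html);
$p->eof;
$out .= join '', @pending;  # emit an unfinished match at end of input
print $out;
```

With the sample input, this should leave the first pair of nested divs alone and drop only the pair containing the "blue" div. Whitespace or other content between the tags would stop it from matching, so adjust the state checks if your real HTML differs.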


Re^2: Strip specific html sequence
by koober (Novice) on Dec 10, 2017 at 17:43 UTC

That's a lot of good news to take in; I could have looked for that first, eh? Many thanks. I get the hint and will abandon this path. I'm also late in realizing that I could follow another one: the HTML is generated by Perl anyway, and this bad bit is produced by two separate lines, hence my supposed shortcut of cleaning it up afterwards. I can also investigate a look-ahead to prevent these bits from being written.
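A minimal sketch of that last idea, assuming the generator can build the nested block in one place instead of in two separate lines. The helper name blue_block and the sample data are made up for illustration, and the content is assumed to be HTML that is already escaped.

```perl
use warnings;
use strict;

# Hypothetical helper: returns the whole nested block, or nothing at all
# when there is no content to put inside it.
sub blue_block {
    my ($content) = @_;
    return '' unless defined $content && length $content;
    return qq{<div><div class="blue">$content</div></div>};
}

my @items  = ('first', '', 'third');  # made-up data; the second item is empty
my $result = join '', map { blue_block($_) } @items;
print "$result\n";
```

The empty item produces no markup, so there is nothing to strip afterwards.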
