This uses HTML::TokeParser::Simple (there are many other parsers) and may help get you started.
It preserves your <BRK> 'tags'. Is that what you were after?
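#!/usr/bin/perl
use warnings;
use strict;
use HTML::Entities;
use HTML::TokeParser::Simple;

# Parse the input file token by token.
my $p = HTML::TokeParser::Simple->new(
    q{monk.html},
) or die qq{can't parse HTML: $!};

open my $fh_out, q{>:utf8}, q{out.txt}
    or die qq{can't open file to write: $!};

while (my $t = $p->get_token) {
    # A closing </p> or any <br> ends the current output line.
    if ($t->is_end_tag(q{p}) or $t->is_tag(q{br})) {
        print $fh_out qq{\n};
    }
    elsif ($t->is_text) {
        my $out = $t->as_is;
        # Trim leading and trailing whitespace.
        for ($out) {
            s/^\s+//;
            s/\s+$//;
        }
        # Skip empty text; test the length so a literal "0" is kept.
        next unless length $out;
        print $fh_out decode_entities($out);
    }
}

close $fh_out or die qq{can't close out.txt: $!};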
Output (long lines snipped):
JACOBS
FŐTANÁCSNOK INDÍTVÁNYA<BRK>
Az ismertetés napja: 2005. november 17.1(1)
C‑371/03. sz. ügy
Siegfried Aulinger<BRK>
kontra<this should be left in>
Bundesrepublik Deutschland
1.<BRK> Ebben az ügyben az...
Európai Gazdasági Közösség közötti...
az embargóról szóló rendelet)(2)...
Some numeric entities appear here (in the browser), e.g.
Ő; these aren't in the file.
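
If you want to try the token loop above without a file on disk, here is a minimal sketch that runs it over a string and returns the text. The helper name html_to_text and the sample input are made up for illustration. The sample writes the marker as &lt;BRK&gt;, which is only an assumption about how it appears in your HTML.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::Entities;
use HTML::TokeParser::Simple;

# Hypothetical helper: the same token loop, but reading from a string
# and returning the collected text instead of writing out.txt.
sub html_to_text {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new(string => $html)
        or die qq{can't parse HTML};
    my $text = q{};
    while (my $t = $p->get_token) {
        # A closing </p> or any <br> ends the current line.
        if ($t->is_end_tag(q{p}) or $t->is_tag(q{br})) {
            $text .= qq{\n};
        }
        elsif ($t->is_text) {
            my $out = $t->as_is;
            $out =~ s/^\s+//;
            $out =~ s/\s+$//;
            next unless length $out;    # skip whitespace-only text
            $text .= decode_entities($out);
        }
    }
    return $text;
}

# Made-up sample input, not taken from monk.html.
my $sample = q{<p>Siegfried Aulinger&lt;BRK&gt;</p>}
           . q{<p>kontra<br>Bundesrepublik Deutschland</p>};
print html_to_text($sample);
```

Only text tokens are printed, so anything the parser sees as a real tag is dropped. A marker survives here because it arrives as entity-encoded text and decode_entities turns it back into <BRK>.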