PerlMonks  

Re: Converting HTML to txt with HTML::Strip

by wfsp (Abbot)
on Oct 03, 2010 at 13:43 UTC ( [id://863178]=note )


in reply to Converting HTML to txt with HTML::Strip

This uses HTML::TokeParser::Simple (there are many other parsers) and may help get you started. It preserves your <BRK> 'tags'; is that what you were after?
#! /usr/bin/perl
use warnings;
use strict;

use HTML::Entities;
use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new(
  q{monk.html},
) or die qq{can't parse HTML: $!};

open my $fh_out, q{>:utf8}, q{out.txt}
  or die qq{can't open file to write: $!};

while (my $t = $p->get_token){
  # a closing </p> or any <br> becomes a line break
  if ($t->is_end_tag(q{p}) or $t->is_tag(q{br})){
    print $fh_out qq{\n};
  }
  elsif ($t->is_text){
    my $out = $t->as_is;
    # trim leading and trailing whitespace
    for ($out){
      s/^\s+//;
      s/\s+$//;
    }
    # skip text that was only whitespace (but keep a literal "0")
    next unless length $out;
    # turn entities such as &lt; into the characters they stand for
    print $fh_out decode_entities($out);
  }
}
output (long lines snipped)
JACOBS F&#336;TANÁCSNOK INDÍTVÁNYA<BRK> Az ismertetés napja: 2005. november 17.1(1) C&#8209;371/03. sz. ügy Siegfried Aulinger<BRK> kontra<this should be left in> Bundesrepublik Deutschland 1.<BRK>        Ebben az ügyben az... Európai Gazdasági Közösség közötti... az embargóról szóló rendelet)(2)...
Some numeric entities, e.g. &#336;, appear here (in the browser); they aren't in the file.
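As a small illustration of the decoding step on its own, here is a minimal sketch that runs decode_entities from HTML::Entities over one string. The sample string is made up for the example (it borrows a few fragments from the output above), and it assumes you want UTF-8 on STDOUT.

```perl
#! /usr/bin/perl
use warnings;
use strict;

use HTML::Entities;

# assumption: the terminal (or file) receiving STDOUT expects UTF-8
binmode STDOUT, q{:encoding(UTF-8)};

# made-up sample: numeric entities, a named entity and lt/gt references
my $raw = q{JACOBS F&#336;TAN&Aacute;CSNOK &lt;BRK&gt; C&#8209;371/03};

# in non-void context decode_entities returns a decoded copy
# and leaves $raw untouched
my $text = decode_entities($raw);

print qq{$text\n};
```

Here &#336; becomes the letter Ő (U+0150), &#8209; becomes a non-breaking hyphen (U+2011), and &lt;BRK&gt; comes out as a literal <BRK>.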

Replies are listed 'Best First'.
Re^2: Converting HTML to txt with HTML::Strip
by elef (Friar) on Oct 04, 2010 at 16:08 UTC
    Well, yes, the BRK tags should be preserved, with the lt and gt character references converted to < and > (everything that's "in the text", i.e. everything that isn't part of the HTML markup, should stay in).
    Frankly, most of your actual code went right over my head. I'm pretty new to perl and programming in general.
    I'm not sure what you mean about the numerical entities not being in the file. They are in the original HTML file and should be converted to the appropriate characters, e.g. 336 is the accented letter Ő.
    Either way, now I have a solution I'm happy with (the workaround I posted). It's not elegant, but it does everything I want it to so I think I'll stick with it.
    By the way, it's pretty surprising that there seems to be no foolproof HTML->txt converter module that would just let you provide a path to an HTML file and spit out a UTF-8 txt with the right line breaks, all the character entities decoded etc.
    I.e. instead of the 20 or so lines you and I posted, it should be
    #! /usr/bin/perl
    use warnings;
    use strict;
    use HTML::Convert;   # wished-for module, not a real one
    HTML::Convert('file.html');
    ... and you'd get file.txt created in the same folder.
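For what it's worth, the token loop posted above can be wrapped in a sub to get close to that one-call interface. What follows is only a rough sketch: html_file_to_txt is a hypothetical name, not something from CPAN, and it assumes the input file is UTF-8 encoded and has a .htm or .html extension. It handles only </p> and <br> as line breaks, so it is far from foolproof.

```perl
#! /usr/bin/perl
use warnings;
use strict;

use HTML::Entities;
use HTML::TokeParser::Simple;

# Hypothetical helper: writes file.txt next to file.html and returns its path.
sub html_file_to_txt {
    my ($html_path) = @_;

    # derive the output name; s/// returns 0 (false) if nothing matched
    (my $txt_path = $html_path) =~ s/\.html?\z/.txt/i
        or die qq{expected a .htm or .html file name: $html_path};

    # assumption: the input is UTF-8, so decode it while reading
    open my $fh_in, q{<:encoding(UTF-8)}, $html_path
        or die qq{can't read $html_path: $!};
    open my $fh_out, q{>:encoding(UTF-8)}, $txt_path
        or die qq{can't write $txt_path: $!};

    # the parser accepts an open filehandle as well as a file name
    my $p = HTML::TokeParser::Simple->new($fh_in)
        or die qq{can't parse $html_path};

    while (my $t = $p->get_token) {
        if ($t->is_end_tag(q{p}) or $t->is_tag(q{br})) {
            print {$fh_out} qq{\n};
        }
        elsif ($t->is_text) {
            # trim both ends, then skip text that was only whitespace
            (my $out = $t->as_is) =~ s/^\s+|\s+\z//g;
            next unless length $out;
            print {$fh_out} decode_entities($out);
        }
    }

    close $fh_in;
    close $fh_out or die qq{can't close $txt_path: $!};
    return $txt_path;
}
```

With that in place, the call would be html_file_to_txt('file.html'), which creates file.txt in the same folder.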
