Dump Text from HTML

OK, it's a little brutal, and written especially to convert the accented letters used in Italian, but it's fast, recognises images and scripts, and after all it dumps plain ASCII! Enjoy! SiG

#!/usr/bin/perl
# convert HTML to plain ASCII in a moment!
# ok, lynx does it better, but this is less than 1k!!!
# enjoy!
# baginov@hotmail.com
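print "Input File:\n";
$input_file = <STDIN>;
chomp($input_file);    # remove only the trailing newline
open (INF, "<", $input_file) or die "Cannot read $input_file: $!";
# turn a trailing .htm/.html into .txt; otherwise append .txt
# so that the input file is never overwritten
$input_file =~ s/\.html?$/.txt/i or $input_file .= ".txt";
open (OUF, ">", $input_file) or die "Cannot write $input_file: $!";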
while ($riga=<INF>)
{
$riga =~ s/<.>//g;
$riga =~ s/<\/.>//g;
$riga =~ s/<\/script>/----- Script -----\n/gi;
# [^>] keeps each match inside a single tag instead of running to the last ">"
$riga =~ s/<\/[^>]+>//g;
$riga =~ s/<img\b[^>]*>/\n-----------\n|  Image  |\n-----------\n/gi;
$riga =~ s/<script\b[^>]*>/----- Script -----/gi;
$riga =~ s/<br\b[^>]*>/\n/gi;
$riga =~ s/<[^>]+>//g;
$riga =~ s/\&nbsp;/ /g;
$riga =~ s/\&egrave;/e\'/g;
$riga =~ s/\&agrave;/a\'/g;
$riga =~ s/\&ugrave;/u\'/g;
$riga =~ s/\&igrave;/i\'/g;
$riga =~ s/\&eacute;/e\'/g;
$riga =~ s/\&iacute;/i\'/g;
$riga =~ s/\&ograve;/o\'/g;
$riga =~ s/\&lt;/</g;
$riga =~ s/\&gt;/>/g;
$riga =~ s/\&quot;/\"/g;
print OUF $riga;
}
close (INF);
close (OUF);

Replies are listed 'Best First'.
Re: Dump Text from HTML by OeufMayo (Curate) on Jul 18, 2001 at 14:07 UTC
And if you want something more reliable and a bit less brutal (requires HTML::Parser v.3 or higher): `perl -MHTML::Parser -e '$p=HTML::Parser->new(text_h=>[sub{print shift},"dtext"]);for(@ARGV){$p->parse_file($_)}' file.html` -- my $OeufMayo = new PerlMonger::Paris({http => 'paris.mongueurs.net'});
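For readers who find the one-liner hard to follow, here is roughly the same idea spelled out as a small script. It is only a sketch: the function name `html_to_text` and the sample string are made up for illustration, and it assumes an HTML::Parser 3.x release recent enough to provide `ignore_elements`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, version 3 or later

# Hypothetical helper: return the plain text of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        # "dtext" passes the callback text with entities already decoded
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    $p->ignore_elements(qw(script style));    # skip script/style bodies
    $p->parse($html);
    $p->eof;                                  # flush any buffered text
    return $text;
}

# Made-up sample input, not from the original post
print html_to_text('<p>one &lt; <b>two</b></p><script>alert(1)</script>'), "\n";
```

Collecting the text in a variable instead of printing it straight away makes it easy to post-process the result, for example to squeeze blank lines.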
Re: Re: Dump Text from HTML by Sigmund (Pilgrim) on Jul 26, 2001 at 14:05 UTC
Hello, and thanks for your reply. I'm just a Perl novice, carefully reading the Camel Book, and I wanted to trade elegance for portability! I'm now trying to optimize my code and make it work well enough to handle the subset of HTML I most often find in the pages I look at. I will post my results soon. I'm worried about giving this code to someone who doesn't have the HTML::Parser module, so I use regexps. Thanks again, and bye. SiG
Re: Dump Text from HTML by davorg (Chancellor) on Jul 18, 2001 at 13:54 UTC
Parsing HTML using regexes is a very bad idea. It will only ever work on a particular subset of HTML. There are far easier ways to achieve this using HTML::Parser or its subclasses. -- <http://www.dave.org.uk> Perl Training in the UK <http://www.iterative-software.com>
Re: Dump Text from HTML by alfie (Pilgrim) on Jul 18, 2001 at 12:50 UTC
You have some common mistakes in your script. Your matches are greedy; add a ? after your +, like this: `$riga =~ s/<\/.+?>//g;` Also, you assume that the opening bracket and the closing one are on the same line, which isn't necessarily the case, so adding the s modifier to your substitutions would help, too. And you forgot to substitute `&gt;` with > :) There is a lot of room for optimizing it, like using different delimiters to avoid having to escape the slash, or doing more substitutions in just one line, like the first two: `$riga =~ s!</?.>!!g;` I hope you get what I mean. Nice script anyway. -- use signature; signature(" So long\nAlfie");
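To make the point about greedy matching concrete, here is a small sketch comparing `.+` with `.+?` on one line of HTML. The sample line is invented for the demonstration and is not from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line, made up for this comparison
my $line = '<a href="x.html">link text</a> and more';

# Greedy: .+ runs from the first "<" to the LAST ">" on the line,
# so the text between the two tags is deleted as well.
(my $greedy = $line) =~ s/<.+>//g;

# Non-greedy: .+? stops at the first ">" it can reach, so each tag
# is removed on its own and the text between the tags survives.
(my $lazy = $line) =~ s/<.+?>//g;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

A negated character class such as `<[^>]+>` gives the same result as the non-greedy form here and is a common alternative.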
Re: Re: Dump Text from HTML by Sigmund (Pilgrim) on Jul 28, 2001 at 20:06 UTC
Just a question: how do I parse my HTML using the /s modifier if the input is read line by line with the angle operator? I mean, if I read one line using <INF>, how can I expect my script to look into the following line just by using /s? Thanks, bye. SiG
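One possible answer to this question, sketched under the assumption that the file is small enough to hold in memory: read the whole file into a single string ("slurp" it) by undefining `$/`, and only then apply the substitutions with /s. The names `slurp_file` and `strip_tags` are invented for the sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one string.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot read $path: $!";
    local $/;            # no record separator: <$fh> returns the whole file
    my $text = <$fh>;
    close($fh);
    return $text;
}

# Hypothetical helper: remove tags even when "<" and ">" are on different lines.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<.+?>//gs;    # /s lets "." match a newline too
    return $html;
}

# In-memory sample standing in for slurp_file($input_file)
my $sample = "Hello <a\n href=\"x.html\">world</a>!\n";
print strip_tags($sample);
```

With line-by-line reading, /s alone changes nothing, because each string handed to the regex still holds only one line.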
10x 4 reply by Sigmund (Pilgrim) on Jul 22, 2001 at 18:34 UTC
I just read in the Camel Book about the use of "?"! I'll post my progress. Thanks for all your comments. Others advise me to use HTML::Parser, but I don't want any module at all. After all, I trade efficiency for portability at a very good exchange rate! See ya. SiG
Re: Re: Dump Text from HTML by dentargiano (Initiate) on Jul 10, 2002 at 10:10 UTC
Hi everyone. As you may see, I'm a newbie in Perl. I have used the "Dump Text from HTML" code, but I still have problems with some tags and other symbols that I can't erase when I try to convert an HTML file to a text file. I also get a lot of blank space that I can't get rid of. Thanks. Dani
Re: Re: Dump Text from HTML by dentargiano (Initiate) on Jul 10, 2002 at 10:22 UTC
Hi. I have been using your code, Dump Text from HTML, but I still have problems when I try to convert an HTML file to a text file. First of all, I get a lot of blank space that I would like to reduce. Second, I have some tags like <FONT or <href that I want to erase. Finally, I want to erase all scripts and images. Could you send me any changes or improvements you have made to your code? Thank you for your help.
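A rough sketch of how these three requests could be handled with regexes alone, assuming the whole document has already been read into one string. The helper name `tidy_text` is made up, and the usual caveat from this thread applies: this only copes with fairly tame HTML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: expects the whole document in one string.
sub tidy_text {
    my ($html) = @_;
    $html =~ s/<script\b.*?<\/script>//gis;   # script blocks, contents included
    $html =~ s/<[^>]*>//g;                     # remaining tags: <FONT ...>, <a href=...>, <img ...>
    $html =~ s/[ \t]+/ /g;                     # runs of spaces or tabs become one space
    $html =~ s/^ +| +$//gm;                    # trim the start and end of each line
    $html =~ s/\n{3,}/\n\n/g;                  # keep at most one blank line in a row
    return $html;
}

# Made-up sample input
print tidy_text("<FONT size=2>Hi</FONT>   there\n\n\n\n<script>\nx();\n</script><img src=a.gif>\nBye  \n");
```

Images are simply dropped here rather than replaced by a placeholder box as in the original script.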
Re: Dump Text from HTML by Anonymous Monk on Jan 13, 2009 at 17:27 UTC


	PerlMonks