
Dump Text from HTML

by Sigmund (Pilgrim)
on Jul 18, 2001 at 11:39 UTC ( #97570=snippet )
Description: OK, it's a little brutal, and written especially to convert the accented letters used in Italian, but it's fast, recognises images and scripts, and after all it dumps plain ASCII!!! Enjoy! SiG
# convert HTML to plain ASCII in a moment!
# ok, lynx does it better, but this is less than 1k!!!
# enjoy!
print "Input File:\n";
$input_file = <STDIN>;
chomp $input_file;    # drop the trailing newline from the typed name
open (INF,"< $input_file") or die "Cannot open $input_file: $!";
# page.htm or page.html becomes page.txt; any other name gets .txt appended,
# so the output never overwrites the input
$input_file .= ".txt" unless $input_file =~ s/\.html?$/.txt/i;
open (OUF,"> $input_file") or die "Cannot write $input_file: $!";
while ($riga=<INF>)
{
    $riga =~ s/<.>//g;
    $riga =~ s/<\/.>//g;
    $riga =~ s/<\/(script|SCRIPT)>/\-\-\-\-\- Script \-\-\-\-\-\n/g;
    $riga =~ s/<\/.+>//g;
    $riga =~ s/<(img|IMG).+>/\n-----------\n\|  Image  \|\n-----------\n/g;
    $riga =~ s/<(script|SCRIPT).+>/\-\-\-\-\- Script \-\-\-\-\-/g;
    $riga =~ s/<br>/\n/g;
    $riga =~ s/<.+>//g;
    # entities: non-breaking space, Italian accented letters, < and "
    $riga =~ s/\&nbsp;/ /g;
    $riga =~ s/\&egrave;/e\'/g;
    $riga =~ s/\&agrave;/a\'/g;
    $riga =~ s/\&ugrave;/u\'/g;
    $riga =~ s/\&igrave;/i\'/g;
    $riga =~ s/\&eacute;/e\'/g;
    $riga =~ s/\&iacute;/i\'/g;
    $riga =~ s/\&ograve;/o\'/g;
    $riga =~ s/\&lt;/</g;
    $riga =~ s/\&quot;/\"/g;
    print OUF $riga;
}
close (INF);
close (OUF);
Replies are listed 'Best First'.
Re: Dump Text from HTML
by OeufMayo (Curate) on Jul 18, 2001 at 14:07 UTC

    And if you want something more reliable and a bit less brutal:
    (requires HTML::Parser v3 or higher)

    perl -MHTML::Parser -e '$p=HTML::Parser->new(text_h=>[sub{print shift},"dtext"]);for(@ARGV){$p->parse_file($_)}' file.html

    --
    my $OeufMayo = new PerlMonger::Paris({http => ''});
      Hello, and thanks for your reply. I'm just a Perl novice, carefully reading the Camel book, and I wanted to trade elegance for portability! I'm now trying to optimize my code and make it work well enough to translate the HTML subset I most often find in the pages I look at. I will post my results soon. I'm worried about giving this code to someone who doesn't have the HTML::Parser module, so I use regexps. Thanks again, and bye. SiG
Re: Dump Text from HTML
by davorg (Chancellor) on Jul 18, 2001 at 13:54 UTC
Re: Dump Text from HTML
by alfie (Pilgrim) on Jul 18, 2001 at 12:50 UTC
    You have some common mistakes in your script:
    Your matches are greedy; add a ? after your +, like this:
    $riga =~ s/<\/.+?>//g;
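The difference between the greedy and the non-greedy form can be seen on a single line that holds more than one tag. This is only a minimal sketch: the sample line and the variable names are made up for illustration and do not come from the original script.

```perl
use strict;
use warnings;

# Hypothetical sample line; not taken from the original post.
my $line = '<b>bold</b> and <i>italic</i> text';

# Greedy: .+ runs from the first '<' to the LAST '>' on the line,
# so the text between the tags is deleted along with them.
(my $greedy = $line) =~ s/<.+>//g;

# Non-greedy: .+? stops at the first '>' it can reach, so only the tags go.
(my $lazy = $line) =~ s/<.+?>//g;

print "greedy: [$greedy]\n";    # greedy: [ text]
print "lazy:   [$lazy]\n";      # lazy:   [bold and italic text]
```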
    Also, you assume that the opening and closing brackets are on the same line, which isn't necessarily the case. So adding the s modifier to your substitutions would help, too. And you forgot to substitute &gt; with > :)

    There is a lot of room for optimizing it, like using different delimiters to avoid having to escape the slash, or doing several substitutions in one line, like the first two:

    $riga =~ s!</?.>!!g;
    I hope you get what I mean, nice script anyway.
    use signature; signature(" So long\nAlfie");
      Just a question: how do I parse my HTML code using the /s modifier if the input is read line by line with the angle operator? I mean, if I read one line using <INF>, how can I expect my script to look into the following line just by using /s? Thanks, bye. SiG
      I just read in the Camel book about the use of "?"!!! I'll post my progress. Thanks for all your comments. Others advise me to use HTML::Parser, but I don't want to depend on any module at all. After all, I'm trading efficiency for portability at a very good exchange rate!! See ya. SiG
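One common way to approach this question, offered as a sketch and not as an answer given in the thread: /s only changes what '.' matches, so the string being matched has to contain the following lines in the first place. Undefining the input record separator $/ makes the angle operator return the whole file as one string. The helper names strip_tags and slurp, and the sample text, are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: strip tags from a whole document held in one string.
sub strip_tags {
    my ($html) = @_;
    # /s lets '.' match newlines, so a tag split across lines still matches.
    $html =~ s/<.+?>//gs;
    return $html;
}

# Hypothetical helper: read a whole file at once ("slurp mode").
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;               # no record separator: <$fh> returns everything
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Demo on an in-memory string whose first tag spans two lines.
my $sample = "<a\n href=\"x.html\">link</a> text\n";
print strip_tags($sample);    # prints "link text" and a newline
```

With a real file, the two helpers would be combined as print strip_tags(slurp('page.html')); where page.html is a placeholder name. The cost is that the whole file sits in memory at once, which is rarely a concern for ordinary web pages.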
      Hi everyone. As you may see, I'm a newbie in Perl. I have used the "Dump Text from HTML" code, but I still have problems with some tags and other symbols that I can't erase when I try to convert an HTML file to a text file. I also get a lot of blank space that I can't trim. Thanks. Dani
      Hi. I have been using your code, Dump Text from HTML, but I still have problems when I try to convert an HTML file to a text file. First, I get a lot of blank space that I would like to trim. Second, there are some tags, like <FONT or <href, that I want to erase. Finally, I want to erase all scripts and images. Could you send me any changes or improvements you have made to your code? Thank you for your help.
Re: Dump Text from HTML
by Anonymous Monk on Jan 13, 2009 at 17:27 UTC