PerlMonks  

Re: Parsing/Extracting Data from HTML.

by juahonen (Novice)
on Mar 23, 2000 at 18:34 UTC


in reply to Parsing/Extracting Data from HTML.

Perl can convert HTML to text too...

$htmltext =~ s/<(.*)>//g;

...will replace all tags with emptiness.

If you wish to convert br's and p's to newlines before they are stripped, add:
$htmltext =~ s/<(br|p)>/\n\n/ig;
before the first command.

Of course, you'll lose all formatting. This method is not guaranteed to properly strip comments.
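
A minimal sketch of the two steps in the order described above, in case it helps. The sub name html_to_text and the sample string are made up for illustration, and the tag pattern here is swapped to a negated character class, <[^>]*>, rather than the <(.*)> form shown earlier, so that each tag is removed on its own.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original post.
sub html_to_text {
    my ($html) = @_;

    # Step 1: turn <br> and <p> into blank lines while the tags still exist.
    # This only matches the bare tags, not forms with attributes.
    $html =~ s/<(br|p)>/\n\n/ig;

    # Step 2: strip whatever tags remain. [^>]* cannot run past a closing
    # bracket, so the match ends at the first '>' after each '<'.
    $html =~ s/<[^>]*>//g;

    return $html;
}

# Made-up sample input.
print html_to_text("First line<br>Second <b>bold</b> line"), "\n";
```

Running the substitutions the other way round would strip the br's and p's before they could be turned into newlines.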

Replies are listed 'Best First'.
RE: Re: Parsing/Extracting Data from HTML.
by chromatic (Archbishop) on Mar 23, 2000 at 20:48 UTC
    No, don't do that. It's too greedy:
    my $string = "<first><second>blahblah<third>\n";
    $string =~ s/<(.*)>//g;
    print $string;
    Result: (Hey, it's blank!)

    If you really want to do it this way, use: $string =~ s/<[^>]*?>//g; The question mark makes the asterisk non-greedy, so it stops at the first closing angle bracket instead of slurping up every character (angle brackets included) to the end of the line and then backtracking to pick up that last angle bracket. Of course, the negated character class does the same job on its own, since [^>] can never match past a closing bracket. Just be more specific.
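
To make the difference concrete, here is a small sketch that runs the greedy, non-greedy, and negated-class patterns over the same test string. The helper name strip_with is made up for this example, and the multi-line sample at the end is an assumption added to show one case where the last two differ.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every match of $re from a copy of $text.
sub strip_with {
    my ($re, $text) = @_;
    $text =~ s/$re//g;
    return $text;
}

my $input = "<first><second>blahblah<third>\n";

# Greedy: .* runs to the end of the line, then backs up to the last '>',
# so everything from the first '<' to the last '>' is one match.
my $greedy  = strip_with(qr/<(.*)>/, $input);    # "\n"

# Non-greedy: .*? stops at the first '>' it can reach.
my $lazy    = strip_with(qr/<.*?>/, $input);     # "blahblah\n"

# Negated class: [^>] can never match a '>', greedy or not.
my $negated = strip_with(qr/<[^>]*>/, $input);   # "blahblah\n"

for my $result ($greedy, $lazy, $negated) {
    (my $shown = $result) =~ s/\n/\\n/g;         # make the newline visible
    print "'$shown'\n";
}

# Made-up sample with a tag split across two lines.
my $multiline = "<a\nhref='x'>link</a>";
print strip_with(qr/<[^>]*>/, $multiline), "\n"; # "link"
```

The negated class also matches across a newline inside a tag, which the dot does not do without the /s modifier, so on the last sample <.*?> would leave the opening tag in place.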
