PerlMonks  

(jcwren) Re: Text::Balanced woes..

by jcwren (Prior)
on May 27, 2002 at 02:01 UTC ( [id://169457]=note )


in reply to Text::Balanced woes..

It appears that Text::Balanced's extract_tagged does not, by default, cope with leading non-whitespace characters that are not balanced tag pairs.

The example below works as advertised. However, put ANY leading non-whitespace character or word in front of the opening <B>, and it stops working. This doesn't seem terribly useful, unless you know you're parsing complete HTML.

Note that in list context, a successful parse returns 6 items. See the docs for which element is which.

#!/usr/bin/perl -w

use Text::Balanced qw (extract_tagged);
use strict;

my $text = " <B>for</B> some trailing text";

my @a = extract_tagged ($text);

print scalar (@a), "\n";
print "$_\n" for (@a);

exit 0;
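
For what it's worth, the Text::Balanced docs describe an optional fourth argument to extract_tagged: a prefix pattern to skip before the opening tag. When it is omitted, only optional whitespace is skipped, which would explain the failure on leading text. Here is a minimal sketch of that idea. The sample string, the '[^<]*' prefix and the @labels names are assumptions made up for illustration, not anything from the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Sample input with leading non-whitespace text (made up for this sketch).
my $text = "some leading text <B>for</B> some trailing text";

# Arguments: text, opening-tag pattern, closing-tag pattern, prefix pattern.
# undef keeps the default tag patterns; '[^<]*' is an assumed prefix that
# skips everything up to the first '<'.
my @parts = extract_tagged($text, undef, undef, '[^<]*');

# Per the docs, list context yields: extracted text, remainder, prefix,
# opening tag, contents, closing tag. The label names here are hypothetical.
my @labels = qw(extracted remainder prefix open contents close);
for my $i (0 .. $#parts) {
    my $value = defined $parts[$i] ? $parts[$i] : 'undef';
    printf "%-9s [%s]\n", $labels[$i], $value;
}
```

Element 4 is the text between the tags. A prefix this loose will happily skip over anything that isn't a '<', so it is only a starting point.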

--Chris


Replies are listed 'Best First'.
Re: (jcwren) Re: Text::Balanced woes..
by u914 (Pilgrim) on May 27, 2002 at 03:05 UTC
    I see...
    thank you very much, Chris.... i never would have thought to try a test excluding everything before the tag...

    You're right that Text::Balanced isn't very useful like this. i can't imagine that the author meant it to be this way.

    You can see what i'm trying to do (well, actually i'll be parsing a href links out in an effort to combat chatroom spambots). Is there another method you'd suggest?

    It was looking at the docs that convinced me that Text::Balanced was the right thing for me... i'd be using the 5th (#4) element.. i haven't found another lib that'll supply just the stripped URL inside the tag yet... and while Perl seems super-cool for text handling (i'm a duffer), i'd rather not reinvent the wheel..

    in any case, thanks very much for your reply!

      There are several packages based on HTML::Parser, such as HTML::LinkExtor, that shouldn't require you to invent too many wheels. I would take a look at that.

      I would avoid at all costs trying to extract links with a regular expression. That's just a path to problems.
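
A rough sketch of the HTML::LinkExtor route, assuming the module is installed from CPAN (it ships with HTML::Parser). The sample chat line and the @urls variable are hypothetical, and this only collects href values from a tags, which is a guess at what the spambot filter needs.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical chat line; not taken from the original thread.
my $html = 'cheap stuff <a href="http://example.com/buy">here</a> '
         . 'and <A HREF="http://example.org/">there</A>';

my @urls;

# The callback receives the lowercased tag name followed by the link
# attributes of that tag as key/value pairs.
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @urls, "$attr{href}" if $tag eq 'a' && defined $attr{href};
});

$parser->parse($html);
$parser->eof;    # signal end of input so any buffered text is processed

print "$_\n" for @urls;
```

Leading text, mixed-case tags and extra attributes are handled by the parser, so there is no need to anchor on the start of the string.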

      --Chris

