(jcwren) Re: Text::Balanced woes..

It appears that Text::Balanced's extract_tagged does not cope with leading non-whitespace characters ahead of the balanced tag pair.

The example below works as advertised. However, put ANY leading character or word in front of the opening <B>, and it ceases working. This doesn't seem terribly useful, unless you know you're parsing complete HTML.

Note that in list context, a successful parse returns six items. See the docs for which element is which.

#!/usr/bin/perl -w

use Text::Balanced qw (extract_tagged);
use strict;

my $text = "  <B>for</B> some trailing text";
my @a = extract_tagged ($text);

print scalar (@a), "\n";
print "$_\n" for (@a);

exit 0;
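
For what it's worth, the Text::Balanced docs describe an optional fourth argument to extract_tagged: a prefix pattern to skip before the opening tag, which defaults to optional whitespace. The sketch below is not part of the original test. It assumes that passing undef for the two tag arguments keeps the default tag patterns, and it uses '[^<]*' as an illustrative prefix that skips everything up to the first '<'. The labels in @names are my own shorthand for the six return values, not names taken from the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Shorthand labels (my own) for the six list-context return values, in order.
my @names = ('extracted', 'remainder', 'prefix', 'open tag', 'contents', 'close tag');

my $text = "leading words <B>for</B> some trailing text";

# undef, undef : keep the default opening and closing tag patterns.
# '[^<]*'      : assumed prefix pattern, skips everything up to the first '<'.
my @fields = extract_tagged($text, undef, undef, '[^<]*');

# Print each returned element next to its label; failed matches leave undefs.
for my $i (0 .. $#names) {
    my $value = defined $fields[$i] ? $fields[$i] : '(undef)';
    print "$names[$i]: [$value]\n";
}
```

This only helps when the leading text contains no '<' of its own, so it is a workaround for simple strings rather than a way to parse arbitrary HTML.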

--Chris



Re: (jcwren) Re: Text::Balanced woes.. by u914 (Pilgrim) on May 27, 2002 at 03:05 UTC
I see... thank you very much, Chris. I never would have thought to try a test excluding everything before the tag. You're right that Text::Balanced isn't very useful like this; I can't imagine that the author meant it to be this way. You can see what I'm trying to do (well, actually I'll be parsing a href links out in an effort to combat chatroom spambots). Is there another method you'd suggest? It was looking at the docs that convinced me that Text::Balanced was the right thing for me: I'd be using the 5th (#4) element. I haven't found another lib that'll supply just the stripped URL inside the tag yet, and while Perl seems super-cool for text handling (I'm a duffer), I'd rather not reinvent the wheel. In any case, thanks very much for your reply!
(jcwren) Re: Text::Balanced woes.. by jcwren (Prior) on May 27, 2002 at 03:12 UTC
There are several packages based on HTML::Parser, such as HTML::LinkExtor, that shouldn't require you to invent too many wheels. I would take a look at that one. I would avoid at all costs trying to extract links with a regular expression. That's just a path to problems.

--Chris
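
A minimal sketch of the HTML::LinkExtor route, assuming the module is installed (it ships with the HTML::Parser distribution). The sample chat line and the @urls variable are made up for illustration. With no callback passed to new, the parser collects links internally and returns them from links, each as an array reference holding the tag name followed by attribute/value pairs.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical chat message containing one link, with text before and after it.
my $html = 'Visit <a href="http://example.com/spam">this site</a> now!';

# No callback given, so the parser stores the links it finds.
my $parser = HTML::LinkExtor->new;
$parser->parse($html);
$parser->eof;

# Each entry is [ $tag, $attr_name => $url, ... ]; keep only <a href="...">.
my @urls;
for my $link ($parser->links) {
    my ($tag, %attr) = @$link;
    push @urls, $attr{href} if $tag eq 'a' && defined $attr{href};
}

print "$_\n" for @urls;
```

Leading and trailing text around the tag does not matter here, since the parser scans the whole string instead of anchoring at the start.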

