Re: Split tags and words nicely
by wfsp (Abbot) on Dec 28, 2006 at 13:12 UTC
|
You could consider a parser.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
my $str = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};
my $p = HTML::TokeParser::Simple->new(\$str)
    or die "can't parse str: $!";
my @array;
while (my $t = $p->get_token) {
    push @array, $t->as_is;
}
print "$_\n" for @array;
output:
---------- Capture Output ----------
> "C:\Perl\bin\perl.exe" _new.pl
<tag ref=1>
Start
<tag ref=2>
and more
</tag>
and end
</tag>
> Terminated with exit code 0.
Re: Split tags and words nicely
by jettero (Monsignor) on Dec 28, 2006 at 13:10 UTC
|
Regexes for nested tags are particularly hard to do well. You need a special pattern-matching gizmo (i.e., not a regexp/DFA) that counts depth. I forget its name, but you can fake it with a (?{ $counter++ }) code block to keep track of which tag is closing what.
Your best bet is to choose HTML::TreeBuilder, which I adore, or XML::XPath, which merlyn seems to really like. If you go the TreeBuilder route, check out "HTML::Tree(Builder) in 6 minutes," which covers the use of the look_down() function. I had never heard of that function until I read that post, since in my opinion it isn't documented well.
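The depth-counting idea can also be sketched without the (?{ }) construct, using a plain tokenizing loop and an explicit stack. This is only an illustration under assumptions: it swaps the regex-embedded counter for a stack, it expects tags shaped exactly like the sample (`<tag ref=N>` and `</tag>`), and `pair_tags` is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the string token by token and keep a stack of
# open tags, so that each closing tag can be paired with the tag it closes.
sub pair_tags {
    my ($str) = @_;
    my (@stack, @report);
    while ($str =~ m{(<tag\s+ref=(\w+)>)|(</tag>)|([^<]+)}g) {
        if (defined $1) {
            push @stack, $2;    # remember the ref of the tag just opened
            push @report, "depth " . scalar(@stack) . ": open ref=$2";
        }
        elsif (defined $3) {
            my $ref = pop @stack;
            push @report, defined $ref
                ? "depth " . (@stack + 1) . ": close ref=$ref"
                : "stray close tag";
        }
        else {
            push @report, "depth " . scalar(@stack) . ": text '$4'";
        }
    }
    # Anything still on the stack was never closed.
    push @report, "unclosed ref=$_" for reverse @stack;
    return @report;
}

print "$_\n"
    for pair_tags(q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>});
```

The stack is what a plain regex lacks: the depth is just the stack size, and the popped entry says which open tag a given close belongs to. Markup that does not fit the assumed tag shape is not handled here, which is one more argument for the parser modules.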
Re: Split tags and words nicely
by osunderdog (Deacon) on Dec 28, 2006 at 13:43 UTC
|
use strict;
use HTML::Parser;
use Data::Dumper;
my $input = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};
print "Input: [$input]\n";
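my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&startTokenHandler, "self,tokens" ],
    end_h       => [ \&endTokenHandler,   "self,tokens" ],
    text_h      => [ \&textHandler,       "self,dtext"  ],
);
$p->parse($input);
$p->eof;    # signal end of input so that any buffered text is flushed

# Start tag: tokens are [tagname, attr, value, ...].
# The format assumes a single numeric attribute, as in the sample input.
sub startTokenHandler
{
    my $self = shift;
    my $token = shift;
    printf("<%s %s=%d>\n", @$token);
}

# End tag: tokens are [tagname].
sub endTokenHandler
{
    my $self = shift;
    my $token = shift;
    printf("</%s>\n", $token->[0]);
}

# Text between tags, with entities decoded (dtext).
sub textHandler
{
    my $self = shift;
    my $text = shift;
    print "$text\n";
}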
sample output:
$perl sample2.pl
Input: [<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>]
<tag ref=1>
Start
<tag ref=2>
and more
</tag>
and end
</tag>
Re: Split tags and words nicely
by themage (Friar) on Dec 28, 2006 at 13:18 UTC
|
Hi bwgoudey,
I think you may be looking for this:
$a=q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};
@l=split qr{(</?tag[^>]*>)}, $a;
print join "\n", @l;
The main trick is to use () inside the regex used in split to capture the delimiters.
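One side effect of the capturing split is that it returns an empty field wherever the string starts with a tag or two tags sit next to each other. A minimal variant that runs under strict and drops those empty fields might look like this (the names `$input` and `@pieces` are just for illustration):

```perl
use strict;
use warnings;

my $input = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};

# The capturing parentheses make split return the delimiters (the tags) too.
# grep { length } discards the empty fields that appear before a leading tag
# or between two adjacent tags.
my @pieces = grep { length } split m{(</?tag[^>]*>)}, $input;

print "$_\n" for @pieces;
```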
|
Along the same lines, here's something that is completely regex and no join or split is needed. Probably could be obfuscated even more :)
$a = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};
my @b;
$a =~ s/(<\/?tag[^>]*>)(\w*)/push @b, ($1,$2)/eg;
map {print $_,"\n"} @b;
Prints the following:
<tag ref=1>
Start
<tag ref=2>
and
</tag>
and
</tag>
Cheers!
s;;5776?12321=10609$d=9409:12100$xx;;s;(\d*);push @_,$1;eg;map{print chr(sqrt($_))."\n"} @_;
|
You have a slight glitch in that you are losing any text after the space, e.g. "and more" comes out as "and". Fix:
$a =~ s/(<\/?tag[^>]*>)([\w ]*)/push @b, ($1,$2)/eg;
Also it is probably a good idea to avoid $a and $b for variable names because of their special status with regard to sort.
Cheers, JohnGG
|
Re: Split tags and words nicely
by johngg (Canon) on Dec 28, 2006 at 14:45 UTC
|
You can split on the boundary where tags either start or end by using look-behind and -ahead assertions. That is, look for where a tag stops and text starts or where text stops and a tag starts. This script runs with the -l flag to save having to print newlines explicitly.
#!/usr/local/bin/perl -l
#
use strict;
use warnings;
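my $rxSplit = qr
    {(?x)
        (?<=[^<])   # previous character is not '<' ...
        (?=[<])     # ... and a tag starts here
        |
        (?<=[>])    # a tag has just ended ...
        (?=[^<])    # ... and text starts here
    };
my $html =
    q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};
my @elems = split m{$rxSplit}, $html;
print for @elems;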
And the output.
<tag ref=1>
Start
<tag ref=2>
and more
</tag>
and end
</tag>
I hope this is of use. Cheers, JohnGG
|
q{<tag ref=1><tag ref=1a>Start<tag ref=2>and </tag><tag "ref=3">more</tag>and end};
Note the unbalanced opens (4) and closes (2).
Leaving all else alone, output becomes:
<tag ref=1>
<tag ref=1a>
Start
<tag ref=2>
and
</tag>
<tag "ref=3">
more
</tag>
and end
... which offers no ready hint or markup or warning that the tags were mis-nested.
This is part of the reason that so many monks will advise against trying to parse the likes of .html or .xml with regexen and advocate the use of some of the modules mentioned above.
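If you stay with the split approach anyway, a warning about unbalanced tags has to be bolted on by hand. The following is a rough sketch of such a check, not a substitute for a parser: `tag_balance` is a hypothetical helper, and it only knows about the `<tag ...>` and `</tag>` shapes used in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (opens, closes, still open, stray closes).
sub tag_balance {
    my ($str) = @_;
    my ($opens, $closes, $depth, $stray) = (0, 0, 0, 0);
    for my $piece (split m{(</?tag[^>]*>)}, $str) {
        if ($piece =~ m{^</tag}) {
            $closes++;
            if ($depth > 0) { $depth-- } else { $stray++ }
        }
        elsif ($piece =~ m{^<tag}) {
            $opens++;
            $depth++;
        }
    }
    return ($opens, $closes, $depth, $stray);
}

my $html = q{<tag ref=1><tag ref=1a>Start<tag ref=2>and </tag><tag "ref=3">more</tag>and end};
my ($opens, $closes, $unclosed, $stray) = tag_balance($html);
warn "unbalanced: $opens opens, $closes closes, $unclosed left open, $stray stray closes\n"
    if $unclosed or $stray;
```

Even this much bookkeeping says nothing about which tags are wrongly nested, only that the counts do not add up.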
|
I agree completely with ww and reciprocate the ++. I am sure that a proper parser is by far the best approach for all but the very simplest and best-behaved markup data. Unfortunately, I have done virtually nothing with HTML or XML, as they haven't come my way in my current job. Because of that I have never used a parser and can't post concrete examples. I must rectify this. Cheers, JohnGG
Re: Split tags and words nicely
by Anonymous Monk on Dec 28, 2006 at 21:09 UTC
|
A super-simple (and fast) way, depending on what you're doing with this array, would be:
@parts = split /[<>]/, $data;
Then, when iterating through @parts, just keep in mind that an odd index (index % 2 == 1) means that part was inside angle brackets. (For the data you gave, your array would start with an empty string.)
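A small sketch of that iteration, assuming the sample string from the original question is in `$data` and that no stray `<` or `>` appears inside text or attribute values (the `@labelled` array is only there for illustration):

```perl
use strict;
use warnings;

my $data  = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};
my @parts = split /[<>]/, $data;

my @labelled;
for my $i (0 .. $#parts) {
    # Skip the leading empty string and the empty gaps between adjacent tags.
    next unless length $parts[$i];
    # Odd indexes were inside angle brackets, even indexes were plain text.
    my $kind = $i % 2 ? 'tag' : 'text';
    push @labelled, "$kind: $parts[$i]";
}
print "$_\n" for @labelled;
```

Note that the angle brackets themselves are gone after the split, so they have to be put back if the tags are needed verbatim.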
Re: Split tags and words nicely
by spatterson (Pilgrim) on Jan 03, 2007 at 10:17 UTC
|
Re: Split tags and words nicely
by tphyahoo (Vicar) on Jan 02, 2007 at 10:24 UTC
|
This looks like html, but maybe it's not.
If it's html, the other suggestions are good.
Otherwise, if you need to do something "regex-like" but need more power than regexes can give you, the next step is to fire up Parse::RecDescent.
This should also become easier when Perl 6 goes into production. There, you get all the power of Parse::RecDescent bundled into the same syntactic sugar that Perl programmers are used to with =~ for regexes.