Split tags and words nicely

bwgoudey has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Split tags and words nicely by wfsp (Abbot) on Dec 28, 2006 at 13:12 UTC
You could consider a parser. `#!/usr/bin/perl use strict; use warnings; use HTML::TokeParser::Simple; my $str = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>}; my $p = HTML::TokeParser::Simple->new(\$str) or die "can't parse str: $!"; my @array; while (my $t = $p->get_token){ push @array, $t->as_is; } print "$_\n" for @array;` Output: `---------- Capture Output ---------- > "C:\Perl\bin\perl.exe" _new.pl <tag ref=1> Start <tag ref=2> and more </tag> and end </tag> > Terminated with exit code 0.`
Re: Split tags and words nicely by jettero (Monsignor) on Dec 28, 2006 at 13:10 UTC
Those regexes are particularly hard to do well. You need a special pattern-matching gizmo (i.e., not a regexp/DFA) that counts depth (I forget the name), which you can fake using a `(?{ $counter++ })` method to keep track of which tag is closing what. Your best bet is to choose HTML::TreeBuilder, which I adore, or XML::XPath, which merlyn seems to really like. If you choose to go the TreeBuilder route, check out "HTML::Tree(Builder) in 6 minutes," which covers the use of the `look_down()` function. I had never heard of that function until I read that post, since it isn't well documented in my opinion. -Paul
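For what it's worth, Perl 5.10 and later (which postdate this thread) added recursive patterns that can do the depth counting inside the regex engine itself. A minimal sketch, assuming Perl 5.10 or later and the single `tag` element from the question; the names `$balanced` and `is_balanced` are made up for illustration, and this only checks nesting, it does not split anything:

```perl
use strict;
use warnings;
use 5.010;    # recursive patterns and possessive quantifiers need 5.10+

# One balanced element: an opening tag, then any mix of plain text and
# nested balanced elements, then a closing tag.
my $balanced = qr{
    \A
    (?<node>
        <tag [^>]* >      # opening tag, attributes allowed
        (?: [^<]++        # a run of text, matched possessively
          | (?&node)      # or a nested element, matched recursively
        )*
        </tag>            # the close that pairs with this open
    )
    \z
}x;

# Hypothetical helper: true when the whole string is one balanced element.
sub is_balanced {
    my ($markup) = @_;
    return $markup =~ $balanced ? 1 : 0;
}

my $good = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};
my $bad  = q{<tag ref=1>Start<tag ref=2>and more</tag>and end};

say is_balanced($good) ? "balanced" : "not balanced";
say is_balanced($bad)  ? "balanced" : "not balanced";
```

This is only a sketch for the toy input; for real HTML or XML the parser modules recommended in this thread remain the safer choice.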
Re: Split tags and words nicely by osunderdog (Deacon) on Dec 28, 2006 at 13:43 UTC
Here's an example using HTML::Parser: `use strict; use HTML::Parser; use Data::Dumper; my $input = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>}; print "Input: [$input]\n"; my $p = HTML::Parser->new(api_version=>3, start_h=>[ \&startTokenHandler, "self,tokens" ], end_h=>[ \&endTokenHandler, "self,tokens" ], text_h =>[ \&textHandler, "self,dtext" ], ); $p->parse($input); sub startTokenHandler { my $self = shift; my $token = shift; printf("<%s %s=%d>\n", @$token); } sub endTokenHandler { my $self = shift; my $token = shift; printf("</%s>\n", $token->[0]); } sub textHandler { my $self = shift; my $text = shift; print "$text\n"; }` Sample output: `$ perl sample2.pl Input: [<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>] <tag ref=1> Start <tag ref=2> and more </tag> and end </tag>` Hazah! I'm Employed!
Re: Split tags and words nicely by themage (Friar) on Dec 28, 2006 at 13:18 UTC
Hi bwgoudey, I think you may be looking for this: `$a=q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>}; @l=split qr{(</?tag[^>]*>)}, $a; print join "\n", @l;` The main trick is to use `()` inside the regex used in split to capture the delimiters. TheMage Magick Source Talking Web
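One detail worth knowing about the capturing `split`: because the input starts with a tag, the returned list begins with an empty string, and two adjacent tags would leave an empty string between them. A minimal sketch of filtering those out with `grep`; the variable names `$markup` and `@tokens` are just illustrative choices:

```perl
use strict;
use warnings;

my $markup = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};

# The parentheses make split return the delimiters (the tags) as well as
# the text between them; grep drops the empty strings that appear before
# a leading tag or between two adjacent tags.
my @tokens = grep { length } split m{(</?tag[^>]*>)}, $markup;

print "$_\n" for @tokens;
```

For the sample input this leaves seven elements: four tags and three pieces of text.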
Re^2: Split tags and words nicely by logie17 (Friar) on Dec 28, 2006 at 21:04 UTC
Along the same lines, here's something that is pure regex, with no join or split needed. Probably could be obfuscated even more :) `$a = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>}; my @b; $a =~ s/(<\/?tag[^>]*>)(\w*)/push @b, ($1,$2)/eg; map {print $_,"\n"} @b;` Prints the following: `<tag ref=1> Start <tag ref=2> and </tag> and </tag>` Cheers! s;;5776?12321=10609$d=9409:12100$xx;;s;(\d*);push @_,$1;eg;map{print chr(sqrt($_))."\n"} @_;
Re^3: Split tags and words nicely by johngg (Canon) on Dec 29, 2006 at 10:37 UTC
You have a slight glitch in that you are losing any text after the space, e.g. "and more" comes out as "and". Fix: `$a =~ s/(<\/?tag[^>]*>)([\w ]*)/push @b, ($1,$2)/eg;` Also, it is probably a good idea to avoid `$a` and `$b` as variable names because of their special status with regard to `sort`. Cheers, JohnGG
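Putting both suggestions together, here is a minimal sketch that uses lexical names other than `$a` and `$b` and a global match loop instead of `s///e`, so the input string is left unmodified. The names `$markup` and `@tokens` are illustrative only, and text containing anything outside `[\w ]` (punctuation, for instance) would still be lost:

```perl
use strict;
use warnings;

my $markup = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};
my @tokens;

# Each iteration matches one tag plus the run of word characters and
# spaces that follows it; the text part may be empty (e.g. after the
# final closing tag), in which case it is skipped.
while ( $markup =~ m{(</?tag[^>]*>)([\w ]*)}g ) {
    push @tokens, $1;
    push @tokens, $2 if length $2;
}

print "$_\n" for @tokens;
```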
Re^4: Split tags and words nicely by logie17 (Friar) on Dec 31, 2006 at 02:18 UTC
Re: Split tags and words nicely by johngg (Canon) on Dec 28, 2006 at 14:45 UTC
You can `split` on the boundary where tags either start or end by using look-behind and look-ahead assertions. That is, look for where a tag stops and text starts or where text stops and a tag starts. This script runs with the `-l` flag to save having to `print` newlines explicitly. `#!/usr/local/bin/perl -l # use strict; use warnings; my $rxSplit = qr {(?x) (?<=[^<]) (?=[<]) | (?<=[>]) (?=[^<]) }; my $html = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>}; my @elems = split m{$rxSplit}, $html; print for @elems;` And the output: `<tag ref=1> Start <tag ref=2> and more </tag> and end </tag>` I hope this is of use. Cheers, JohnGG
Re^2: Split tags and words nicely by ww (Archbishop) on Dec 28, 2006 at 18:56 UTC
I do indeed admire johngg's regex approach (and have ++ed it), but at the same time, I hesitate to walk away without pointing out that it has NO capacity to flag mis-nesting (mis-nesting by .html or .xml standards, that is), and I suspect that at some point bwgoudey's input data may have an anomaly or two. Suppose the $html in johngg's Re: Split tags and words nicely were changed to: `q{<tag ref=1><tag ref=1a>Start<tag ref=2>and </tag><tag "ref=3">more</tag>and end};` Note the unbalanced opens (4) and closes (2). Leaving all else alone, the output becomes: `<tag ref=1> <tag ref=1a> Start <tag ref=2> and </tag> <tag "ref=3"> more </tag> and end` ... which offers no ready hint or markup or warning that the tags were mis-nested. This is part of the reason that so many monks will advise against trying to parse the likes of .html or .xml with regexen and advocate the use of some of the modules mentioned above.
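To illustrate the point, a nesting check can be bolted onto the regex split with a simple stack. This is a minimal sketch, not a substitute for a real parser: `nesting_problems` is a hypothetical helper, and because the sample data uses a single element name it only pairs opens with closes rather than comparing tag names:

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of problems found in the nesting of
# <tag ...> / </tag> pairs; an empty list means opens and closes pair up.
sub nesting_problems {
    my ($markup) = @_;
    my ( @open, @problems );

    # Same lookaround split as in johngg's post above.
    for my $token ( split m{(?<=[^<])(?=[<])|(?<=[>])(?=[^<])}, $markup ) {
        if ( $token =~ m{\A</} ) {
            # A close must have a pending open to pair with.
            if (@open) { pop @open }
            else       { push @problems, "unexpected close: $token" }
        }
        elsif ( $token =~ m{\A<} ) {
            push @open, $token;
        }
    }

    # Anything still on the stack was opened but never closed.
    push @problems, "never closed: $_" for @open;
    return @problems;
}

my $html = q{<tag ref=1><tag ref=1a>Start<tag ref=2>and </tag><tag "ref=3">more</tag>and end};
print "$_\n" for nesting_problems($html);
```

For the mis-nested sample this reports the two outer tags as never closed, which matches the count of four opens against two closes.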
Re^3: Split tags and words nicely by johngg (Canon) on Dec 28, 2006 at 19:57 UTC
I agree completely with ww and reciprocate the ++. I am sure that a proper parser is by far the best approach for all but the very simplest and well behaved markup data. Unfortunately, I have done virtually nothing with HTML or XML as they haven't come my way in my current job. Because of that I can't post concrete examples of parser use, never having used one. I must rectify this. Cheers, JohnGG
Re: Split tags and words nicely by Anonymous Monk on Dec 28, 2006 at 21:09 UTC
A super-simple (and fast) way, depending on what you're doing with this array, would be: `@parts = split /[<>]/, $data;` Then when iterating through `@parts`, just keep in mind that `(index % 2 == 1)` means that part was inside angle brackets. (Your array would start with an empty string for the data you gave.)
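A minimal sketch of that iteration, rebuilding the tags from the odd-indexed parts. It assumes no stray `<` or `>` appears in the text or in attribute values, and the names `@tokens` and `$i` are illustrative only:

```perl
use strict;
use warnings;

my $data  = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};
my @parts = split /[<>]/, $data;

my @tokens;
for my $i ( 0 .. $#parts ) {
    if ( $i % 2 ) {
        # Odd index: this part sat between "<" and ">", so it is a tag.
        push @tokens, "<$parts[$i]>";
    }
    elsif ( length $parts[$i] ) {
        # Even index: text between tags; skip the empty strings.
        push @tokens, $parts[$i];
    }
}

print "$_\n" for @tokens;
```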
Re: Split tags and words nicely by spatterson (Pilgrim) on Jan 03, 2007 at 10:17 UTC
This looks close enough to XML that some of the XML parsing modules, such as XML::Simple, should split it down. just another cpan module author
Re: Split tags and words nicely by tphyahoo (Vicar) on Jan 02, 2007 at 10:24 UTC
This looks like HTML, but maybe it's not. If it's HTML, the other suggestions are good. Otherwise, if you need to do something "regex like" but need more power than regexes can give you, the next step is to fire up Parse::RecDescent. This should also become easier when Perl 6 goes into production. There, you get all the powers of Parse::RecDescent bundled into the same syntactic sugar perlers are used to with =~ for regexes.

