PerlMonks  

Split tags and words nicely

by bwgoudey (Sexton)
on Dec 28, 2006 at 12:59 UTC ( [id://592032]=perlquestion )

bwgoudey has asked for the wisdom of the Perl Monks concerning the following question:

I'm curious about finding a nice way to solve this problem. I have some data which is in a similar form to this:
<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>
I would like to split this into an array:
1. <tag ref=1>
2. Start section
3. <tag ref=2>
4. and here is more
5. </tag>
6. and here is the end
7. </tag>
I can't think of any regular expressions to use in the split function and doing it character by character seems un-perl-ish. Any ideas?

Replies are listed 'Best First'.
Re: Split tags and words nicely
by wfsp (Abbot) on Dec 28, 2006 at 13:12 UTC
    You could consider a parser.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $str = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};
    my $p = HTML::TokeParser::Simple->new(\$str)
        or die "can't parse str: $!";

    my @array;
    while (my $t = $p->get_token){
        push @array, $t->as_is;
    }
    print "$_\n" for @array;
    output:
    ---------- Capture Output ----------
    > "C:\Perl\bin\perl.exe" _new.pl
    <tag ref=1>
    Start
    <tag ref=2>
    and more
    </tag>
    and end
    </tag>
    > Terminated with exit code 0.
Re: Split tags and words nicely
by jettero (Monsignor) on Dec 28, 2006 at 13:10 UTC
    Regexes like that are particularly hard to do well. You need a special pattern-matching gizmo that counts depth (i.e., not a plain regexp/DFA). I forget its name, but you can fake it with a (?{ $counter++ }) code block to keep track of which tag is closing what.

    Your best bet is to choose HTML::TreeBuilder — which I adore — or XML::XPath, which merlyn seems to really like. If you choose to go the treebuilder route, check out "HTML::Tree(Builder) in 6 minutes," which covers the use of the look_down() function. I had never heard of that until I read that post, since the function isn't documented well in my opinion.

    -Paul
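For readers on newer Perls, here is a minimal sketch of the "counts depth" idea using a recursive subpattern instead of a (?{ ... }) counter. It assumes Perl 5.10 or later (which postdates this thread) and a single tag name, `tag`, as in the question; the names `$balanced`, `$good` and `$bad` are made up for the illustration. It only answers "is this one balanced element?" and does not split anything.

```perl
use strict;
use warnings;
use 5.010;    # recursive subpatterns such as (?1) need Perl 5.10 or later

# Assumption: only <tag ...> ... </tag> elements occur, as in the question.
my $balanced = qr{
    (                             # group 1: one complete element
        <tag\b[^>]*>              # opening tag
        (?: [^<]++ | (?1) )*      # text, or a nested element via recursion
        </tag>                    # matching close
    )
}x;

my $good = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};
my $bad  = q{<tag ref=1>Start<tag ref=2>and more</tag>and end};

for my $str ($good, $bad) {
    my $verdict = $str =~ /\A$balanced\z/ ? 'balanced' : 'NOT balanced';
    print "$verdict: $str\n";
}
```

The possessive `[^<]++` is there to keep a failing match from backtracking through every way of carving up the text.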

Re: Split tags and words nicely
by osunderdog (Deacon) on Dec 28, 2006 at 13:43 UTC

    Here's an example using HTML::Parser

    use strict;
    use HTML::Parser;
    use Data::Dumper;

    my $input = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};
    print "Input: [$input]\n";

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ \&startTokenHandler, "self,tokens" ],
        end_h   => [ \&endTokenHandler,   "self,tokens" ],
        text_h  => [ \&textHandler,       "self,dtext"  ],
    );
    $p->parse($input);
    $p->eof;    # signal end of input so any buffered text is flushed

    # For a start tag, tokens is [tagname, attrname, attrvalue, ...].
    # The format below assumes exactly one numeric attribute, as in the sample.
    sub startTokenHandler {
        my $self  = shift;
        my $token = shift;
        printf("<%s %s=%d>\n", @$token);
    }

    sub endTokenHandler {
        my $self  = shift;
        my $token = shift;
        printf("</%s>\n", $token->[0]);
    }

    sub textHandler {
        my $self = shift;
        my $text = shift;
        print "$text\n";
    }

    sample output:

    $ perl sample2.pl
    Input: [<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>]
    <tag ref=1>
    Start
    <tag ref=2>
    and more
    </tag>
    and end
    </tag>

    Hazah! I'm Employed!

Re: Split tags and words nicely
by themage (Friar) on Dec 28, 2006 at 13:18 UTC
    Hi bwgoudey,

    I think you may be looking for this:
    $a=q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};
    @l=split qr{(</?tag[^>]*>)}, $a;
    print join "\n", @l;
    The main trick is to use () inside the regex used in split to capture the delimiters.
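A minimal sketch of the same trick under `use strict`, with the empty leading field filtered out. Because the data begins with a tag, split returns an empty string before the first delimiter; the `grep { length }` step is an addition, not part of the post above, and `$data` and `@parts` are names chosen for the illustration.

```perl
use strict;
use warnings;

my $data = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};

# The parentheses in the pattern make split return the delimiters (the tags)
# along with the text between them. grep drops the empty field that split
# produces in front of the leading tag.
my @parts = grep { length } split m{(</?tag[^>]*>)}, $data;

print "$_\n" for @parts;
```

Note that `grep { length }` would also drop the empty text between two adjacent tags, which is usually what you want for a token list.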

      Along the same lines, here's something that is completely regex and no join or split is needed. Probably could be obfuscated even more :)
      $a = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};
      my @b;
      $a =~ s/(<\/?tag[^>]*>)(\w*)/push @b, ($1,$2)/eg;
      map {print $_,"\n"} @b;

      Prints the following:
      <tag ref=1>
      Start
      <tag ref=2>
      and
      </tag>
      and
      </tag>


      Cheers!
      s;;5776?12321=10609$d=9409:12100$xx;;s;(\d*);push @_,$1;eg;map{print chr(sqrt($_))."\n"} @_;
        You have a slight glitch in that you are losing any text after the space, e.g. "and more" comes out as "and". Fix:

        $a =~ s/(<\/?tag[^>]*>)([\w ]*)/push @b, ($1,$2)/eg;

        Also it is probably a good idea to avoid $a and $b for variable names because of their special status with regard to sort.

        Cheers,

        JohnGG
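Putting both suggestions together, here is a minimal sketch that avoids `$a` and `$b` and uses a plain match loop rather than a substitution with side effects. The names `$markup` and `@pieces` are arbitrary, and `[^<]*` is a swapped-in character class (anything up to the next tag) rather than the `[\w ]*` from the fix above, so punctuation in the text survives too. It assumes the data starts with a tag; any text before the first tag would be skipped.

```perl
use strict;
use warnings;

my $markup = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};
my @pieces;

# Each iteration consumes one tag plus the text that follows it, if any.
while ($markup =~ m{(</?tag[^>]*>)([^<]*)}g) {
    push @pieces, $1;
    push @pieces, $2 if length $2;    # skip empty text between adjacent tags
}

print "$_\n" for @pieces;
```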

Re: Split tags and words nicely
by johngg (Canon) on Dec 28, 2006 at 14:45 UTC
    You can split on the boundary where tags either start or end by using look-behind and -ahead assertions. That is, look for where a tag stops and text starts or where text stops and a tag starts. This script runs with the -l flag to save having to print newlines explicitly.

    #!/usr/local/bin/perl -l
    #
    use strict;
    use warnings;

    my $rxSplit = qr
        {(?x)
            (?<=[^<]) (?=[<])
            |
            (?<=[>]) (?=[^<])
        };

    my $html = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};
    my @elems = split m{$rxSplit}, $html;
    print for @elems;

    And the output.

    <tag ref=1>
    Start
    <tag ref=2>
    and more
    </tag>
    and end
    </tag>

    I hope this is of use.

    Cheers,

    JohnGG

      I do indeed admire johngg's regex approach (and have ++ed it), but I hesitate to walk away without pointing out that it has NO capacity to flag mis-nesting (mis-nesting by .html or .xml standards, that is), and I suspect that at some point bwgoudey's input data may have an anomaly or two.

      Suppose the $html in johngg's Re: Split tags and words nicely were changed to:

      q{<tag ref=1><tag ref=1a>Start<tag ref=2>and </tag><tag "ref=3">more</tag>and end};
      Note unbalanced opens (4) and closes (2)

      Leaving all else alone, output becomes:

      <tag ref=1>
      <tag ref=1a>
      Start
      <tag ref=2>
      and
      </tag>
      <tag "ref=3">
      more
      </tag>
      and end

      ... which offers no ready hint or markup or warning that the tags were mis-nested.

      This is part of the reason that so many monks will advise against trying to parse the likes of .html or .xml with regexen and advocate the use of some of the modules mentioned above.
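To make the point concrete, here is a minimal sketch of the kind of bookkeeping a plain split leaves out. It is not a substitute for a real parser: it assumes a single tag name, `tag`, so a depth counter is enough, and `nesting_problems` is a hypothetical helper invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of human-readable nesting problems,
# or an empty list if every open has a matching close.
sub nesting_problems {
    my ($markup) = @_;
    my $depth = 0;
    my @problems;

    # Capturing split keeps the tags; grep drops empty fields.
    for my $token (grep { length } split m{(</?tag[^>]*>)}, $markup) {
        if ($token =~ m{^</tag}) {
            if   ($depth == 0) { push @problems, "close without open: $token" }
            else               { $depth-- }
        }
        elsif ($token =~ m{^<tag}) {
            $depth++;
        }
    }
    push @problems, "$depth unclosed tag(s)" if $depth > 0;
    return @problems;
}

my @bad = nesting_problems(
    q{<tag ref=1><tag ref=1a>Start<tag ref=2>and </tag><tag "ref=3">more</tag>and end}
);
my @ok = nesting_problems(
    q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>}
);

print "mis-nested sample: $_\n" for @bad;
print "original sample is clean\n" unless @ok;
```

With several different tag names you would need a stack of open tag names instead of a counter, at which point one of the parser modules mentioned above is the saner choice.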

        I agree completely with ww and reciprocate the ++. I am sure that a proper parser is by far the best approach for all but the very simplest and well behaved markup data. Unfortunately, I have done virtually nothing with HTML or XML as they haven't come my way in my current job. Because of that I can't post concrete examples of parser use, never having used one. I must rectify this.

        Cheers,

        JohnGG

Re: Split tags and words nicely
by Anonymous Monk on Dec 28, 2006 at 21:09 UTC

    A super-simple (and fast) way, depending on what you're doing with this array, would be:
    @parts = split /[<>]/, $data;

    Then when iterating through @parts, just keep in mind that (index % 2 == 1) means that part was inside angle brackets. (Your array would start with an empty string for the data you gave)
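A minimal sketch of that iteration, rebuilding the seven-element list from the question. It assumes the text never contains a literal `<` or `>` and that every bracket is paired; otherwise the odd/even bookkeeping goes wrong. The names `@tokens` and `$i` are arbitrary.

```perl
use strict;
use warnings;

my $data  = q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};
my @parts = split /[<>]/, $data;

my @tokens;
for my $i (0 .. $#parts) {
    if ($i % 2 == 1) {
        # Odd index: this part sat between angle brackets, so restore them.
        push @tokens, "<$parts[$i]>";
    }
    elsif (length $parts[$i]) {
        # Even index: plain text. Skip empty strings such as the leading one.
        push @tokens, $parts[$i];
    }
}

print "$_\n" for @tokens;
```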

Re: Split tags and words nicely
by spatterson (Pilgrim) on Jan 03, 2007 at 10:17 UTC
    This looks close enough to XML that one of the XML parsing modules, such as XML::Simple, should be able to break it down.

    just another cpan module author
Re: Split tags and words nicely
by tphyahoo (Vicar) on Jan 02, 2007 at 10:24 UTC
    This looks like html, but maybe it's not.

    If it's html, the other suggestions are good.

    Otherwise, if you need to do something "regex like" but need more power than regexes can give you, the next step is to fire up Parse::RecDescent.

    This should also become easier when perl6 goes production. There, you get all the powers of Parse::RecDescent bundled into the same syntactic sugar perlers are used to with =~ for regexes.

Node Type: perlquestion [id://592032]
Approved by pKai
Front-paged by andyford