http://qs321.pair.com?node_id=199449

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to parse this information in a more sane way; it seems that there should be a better way.

#!/usr/bin/perl -w
use strict;

my $var = "xxx:12345 yyy:54321 zzz:13245";
my @items = split("\:",$var);
@items = split(" ",$items[1]);
print "$items[0]\n";

xxx, yyy, zzz are basically completely random.

What I am looking for is a way to grab the value of xxx without having to use multiple splits.

Thanks in advance.

Replies are listed 'Best First'.
Re: split question
by Anonymous Monk on Sep 20, 2002 at 12:58 UTC
    Nevermind, I found my answer in:

    $var =~ m/.+\:(.+?)\s.+\:(.+?)\s.+\:(.+?)/;
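
A quick check of what that pattern captures on the sample string may be useful. This is only a minimal sketch and not part of the original post; the anchored pattern at the end (and the name `$first`) is a hypothetical alternative for grabbing just the first value. Because the final `(.+?)` has nothing after it that forces it to grow, it is satisfied by a single character.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $var = "xxx:12345 yyy:54321 zzz:13245";

# The pattern from the post above, unchanged.
my @caps = $var =~ m/.+\:(.+?)\s.+\:(.+?)\s.+\:(.+?)/;

# The trailing non-greedy group stops after one character,
# so the third capture is "1" rather than "13245".
print join(",", @caps), "\n";

# Hypothetical alternative: anchor at the start and take everything
# between the first colon and the next whitespace.
my ($first) = $var =~ /^[^:]+:(\S+)/;
print "$first\n";
```

If all three values are wanted in full, the last group needs something to stop against, for example `(\S+)` in place of `(.+?)`.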
      Greed is bad. More specifically, greedy quantifiers for some regexes are bad; they can slow things down.

      I compared your regex (called greedy) with a version using the +? non-greedy quantifier, against both your string (where both will give correct results) and a much longer string, where the non-greedy version will match the first 3 codes and the greedy version will match the first code and the last 2.

      $ perl testGreed.pl
      Benchmark: running greedyLong, greedyShort, notGreedyLong, notGreedyShort, each for at least 3 CPU seconds...
          greedyLong:  3 wallclock secs ( 3.00 usr +  0.00 sys =  3.00 CPU) @ 42264.23/s (n=127004)
         greedyShort:  4 wallclock secs ( 3.20 usr +  0.01 sys =  3.21 CPU) @ 88692.45/s (n=284348)
       notGreedyLong:  4 wallclock secs ( 3.13 usr +  0.01 sys =  3.14 CPU) @ 46018.76/s (n=144729)
      notGreedyShort:  3 wallclock secs ( 2.99 usr +  0.01 sys =  3.00 CPU) @ 101593.68/s (n=305289)
                         Rate greedyLong notGreedyLong greedyShort notGreedyShort
      greedyLong      42264/s         --           -8%        -52%           -58%
      notGreedyLong   46019/s         9%            --        -48%           -55%
      greedyShort     88692/s       110%           93%          --           -13%
      notGreedyShort 101594/s       140%          121%         15%             --
      As you can see, the non-greedy version runs faster (about 9% on the long string and 15% on the short one), since it doesn't wind up trying as many alternatives (a.k.a. backtracking).
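
To make the greedy/non-greedy distinction concrete, here is a minimal sketch using the sample string from the question. The two patterns are simplified illustrations chosen for this example, not the ones that were benchmarked.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $var = "xxx:12345 yyy:54321 zzz:13245";

# Greedy: .+ first runs to the end of the string, then backs up
# one character at a time until a colon can follow it (the last one).
my ($greedy) = $var =~ /^(.+):/;

# Non-greedy: .+? grows one character at a time and stops at the
# first place a colon can follow it.
my ($lazy) = $var =~ /^(.+?):/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture ends just before the last colon, while the non-greedy one ends just before the first; the longer the string, the further the greedy version has to back up.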

      Here's the comparison code:

      Those results were with 5.6.1 on Cygwin; your results may vary.
      --
      Mike

      One way to do it with split:

      #!/usr/bin/perl -w
      use strict;

      my $var = "xxx:12345 yyy:54321 zzz:13245";
      my @items = split /:\S+\s*/, $var;
      print "@items\n";
      __END__
      xxx yyy zzz
      -sauoq
      "My two cents aren't worth a dime.";
      
Re: split question
by helgi (Hermit) on Sep 20, 2002 at 13:36 UTC
    Here's one way to accomplish this, relying on the fact that assigning a list (here, the one returned by split) to a hash breaks it down into key/value pairs:

    #!/usr/bin/perl -w
    use strict;

    my $var = "xxx:12345 yyy:54321 zzz:13245";
    my %value;
    %value = split /[:\ ]/,$var;
    for (keys %value) {
        print "$_\t:$value{$_}\n";
    }

    This results in the following output:

    yyy	:54321
    xxx	:12345
    zzz	:13245
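
Note that the keys come out in no particular order, since Perl hashes are unordered. As a small sketch building on the same split (the sorted loop and the direct lookup are additions for illustration), a single value can be fetched by key and the listing made stable by sorting:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $var = "xxx:12345 yyy:54321 zzz:13245";

# Same idea as above: the flat list from split becomes key/value pairs.
my %value = split /[: ]/, $var;

# Direct lookup of one value by its key.
print "$value{xxx}\n";

# Sorting the keys gives the same order on every run.
for my $key (sort keys %value) {
    print "$key\t:$value{$key}\n";
}
```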
    Regards,
    Helgi Briem
Re: split question
by George_Sherston (Vicar) on Sep 20, 2002 at 13:04 UTC
    My first instinct would be to look through CPAN and find a module that parses how you want. Do you have control over the un-parsed form of the data? Then someone else has almost certainly gone through all the thought processes you need to do.

    For the particular case you are dealing with, and if you are sure that the form of the input data will always be "space, non-space characters, colon, digits" then you could use a regex thus:
    use Data::Dumper; # to let us print out the results at the end
    my $var = "xxx:12345 yyy:54321 zzz:13245";
    my %hash = $var =~ /(\S+):(\d+)/g;
    # a pattern match in array context returns a list of the matches
    # and hash is a list in which odd-numbered items are keys and even-
    # numbered items are values
    print Dumper(\%hash);
    The output is:
    $VAR1 = {
              'yyy' => '54321',
              'xxx' => '12345',
              'zzz' => '13245'
            };
    The disadvantage of this is that it depends on regular input: it won't tell you if the input is malformed, but will just spit out rubbish. Better to get a module for general use.
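
If a module is more than the job needs, one way to at least notice bad input is to check each token before storing it. The following is a rough sketch under the assumption that the data is whitespace-separated `key:digits` tokens; `parse_pairs` is a hypothetical helper name, not something from the posts above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a key/value list, or dies on the
# first token that is not of the form key:digits.
sub parse_pairs {
    my ($str) = @_;
    my %pairs;
    for my $token (split ' ', $str) {
        # The list assignment yields 0 in scalar context when the
        # match fails, which triggers the die.
        my ($key, $value) = $token =~ /\A([^:]+):(\d+)\z/
            or die "malformed token '$token'\n";
        $pairs{$key} = $value;
    }
    return %pairs;
}

my %good = parse_pairs("xxx:12345 yyy:54321 zzz:13245");
print "$good{xxx}\n";

# A token with its colon missing is reported instead of silently skipped.
my $ok = eval { parse_pairs("xxx:12345 yyy54321"); 1 };
print "rejected: $@" unless $ok;
```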

    § George Sherston
Re: split question
by BrowserUk (Patriarch) on Sep 20, 2002 at 19:26 UTC

    If as both your words and code imply, you are only after xxx and xxx always has length 3 then:

    print substr($var,0,3);

    is much simpler and faster than any regex.

    If the length of xxx can vary then

    print substr($var, 0, index($var, ':'));

    is almost as simple and still more efficient than a regex. (And easier to get right first time :)
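
The same index/substr approach can also pull out the number after the first colon, which is what the code in the question prints. This is a minimal sketch; the variable names and the guards for a missing colon or space are assumptions added for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $var = "xxx:12345 yyy:54321 zzz:13245";

# Position of the first colon; index returns -1 if there is none.
my $colon = index($var, ':');
die "no colon found\n" if $colon < 0;

# Position of the first space after that colon, or the end of the
# string when the first pair is also the last.
my $space = index($var, ' ', $colon);
$space = length $var if $space < 0;

my $key   = substr($var, 0, $colon);
my $value = substr($var, $colon + 1, $space - $colon - 1);

print "$key => $value\n";
```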

    If you actually want to get xxx, yyy, and zzz and they are always length 3, then

    my $p=0; print substr( $var, ($p=index($var, ':', $p+1 ))-3,3),$/ while $p > -1

    won't quite do the trick (see below for the reason, and below that for the correction). Though it is a little harder to get right first time.

    I originally wrote that the use of a regex only really comes into its own when the number and length of the entities to be matched can both vary; it turns out that a regex is a lot more efficient for even the simple cases than I thought!


    Cor! Like yer ring! ... HALO dammit! ... 'Ave it yer way! Hal-lo, Mister la-de-da. ... Like yer ring!
      a little harder to get right first time
      hehe, I guess so. Your while condition doesn't fail until *after* we've printed out the substr using a value of -1 for $p. Therefore you get a phantom match of '324' given the sample input.
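
The off-by-one can be reproduced in a few lines. This sketch collects the results into arrays rather than printing them, and the second loop is just one possible fix (testing the result of index before using it); both are written for illustration and still assume keys of length 3.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $var = "xxx:12345 yyy:54321 zzz:13245";

# Same shape as the one-liner: the condition is checked before each
# pass, so the pass in which index returns -1 still runs substr,
# which then counts 4 characters back from the end of the string.
my @buggy;
my $p = 0;
push @buggy, substr($var, ($p = index($var, ':', $p + 1)) - 3, 3) while $p > -1;
print "@buggy\n";

# One possible fix: test the index result before using it.
my @fixed;
$p = -1;
while (($p = index($var, ':', $p + 1)) > -1) {
    push @fixed, substr($var, $p - 3, 3);
}
print "@fixed\n";
```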

      You might also be surprised at how this benchmarks against a well crafted regex. The regex engine has some clever optimizations under the hood.

      This benchmark surprised me as well... I tossed in a sexeger solution that I thought would perform well, since we are looking for stuff in front of a known character. Anyway, it didn't perform as well as either of the other solutions, but the regex did win the race:

      Benchmark: running regexpShort, sexegeShort, substrShort, each for at least 3 CPU seconds...
      regexpShort:  4 wallclock secs ( 3.28 usr +  0.00 sys =  3.28 CPU) @ 46935.67/s (n=153949)
      sexegeShort:  5 wallclock secs ( 3.04 usr +  0.00 sys =  3.04 CPU) @ 27424.67/s (n=83371)
      substrShort:  4 wallclock secs ( 3.05 usr +  0.00 sys =  3.05 CPU) @ 31047.21/s (n=94694)
                     Rate sexegeShort substrShort regexpShort
      sexegeShort 27425/s          --        -12%        -42%
      substrShort 31047/s         13%          --        -34%
      regexpShort 46936/s         71%         51%          --
      Benchmark: running regexpLong, sexegeLong, substrLong, each for at least 3 CPU seconds...
      regexpLong:  3 wallclock secs ( 3.20 usr +  0.00 sys =  3.20 CPU) @ 590.31/s (n=1889)
      sexegeLong:  4 wallclock secs ( 3.38 usr +  0.00 sys =  3.38 CPU) @ 310.36/s (n=1049)
      substrLong:  5 wallclock secs ( 3.09 usr +  0.00 sys =  3.09 CPU) @ 462.14/s (n=1428)
                  Rate sexegeLong substrLong regexpLong
      sexegeLong 310/s         --       -33%       -47%
      substrLong 462/s        49%         --       -22%
      regexpLong 590/s        90%        28%         --
      And here is the Benchmark code
      #!/usr/bin/perl -w
      use strict;
      use Benchmark qw(cmpthese);

      my $varshort = "abc:12345 def:54321 ghi:13245";
      my $varlong  = "$varshort " x 120;

      # subs
      sub regex {
          my $str = shift;
          my @arr = ($str =~ /(.{3}):/g);
      }

      sub substring {
          my $str = shift;
          my @arr;
          my $p = 0;
          push(@arr,substr( $str, ($p=index($str, ':', $p+1 ))-3,3)) while $p > -1;
          pop(@arr);
          return @arr;
      }

      sub sexeger {
          my $str = reverse shift;
          my @arr = reverse map {$_ = reverse $_} ($str =~ /:(.{3})/g);
      }

      sub regexpShort { regex($varshort) }
      sub regexpLong  { regex($varlong) }
      sub sexegeShort { sexeger($varshort) }
      sub sexegeLong  { sexeger($varlong) }
      sub substrShort { substring($varshort) }
      sub substrLong  { substring($varlong) }

      # unit tests
      my $rs = "@{[regexpShort()]}";
      my $rl = "@{[regexpLong()]}";
      my $ss = "@{[sexegeShort()]}";
      my $sl = "@{[sexegeLong()]}";
      my $bs = "@{[substrShort()]}";
      my $bl = "@{[substrLong()]}";
      die unless $rs eq $ss;
      die unless $rs eq $bs;
      die unless $rl eq $sl;
      die unless $rl eq $bl;

      # benchmark
      cmpthese(-3, {
          regexpShort => \&regexpShort,
          substrShort => \&substrShort,
          sexegeShort => \&sexegeShort,
      } );

      cmpthese(-3, {
          regexpLong => \&regexpLong,
          substrLong => \&substrLong,
          sexegeLong => \&sexegeLong,
      } );
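
For anyone who hasn't met the sexeger trick used in the listing above: the idea is to reverse the string so that "three characters before a colon" becomes "three characters after a colon", match, and then reverse everything back. Here is a step-by-step sketch of the same logic with intermediate variables; the variable names are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "abc:12345 def:54321 ghi:13245";

# Step 1: reverse the whole string (scalar context reverses characters).
my $rev = reverse $str;            # "54231:ihg 12345:fed 54321:cba"

# Step 2: the keys now follow their colons, so match forwards.
my @rev_keys = $rev =~ /:(.{3})/g; # ("ihg", "fed", "cba")

# Step 3: un-reverse each key, then restore the original order.
my @keys = reverse map { scalar reverse $_ } @rev_keys;

print "@keys\n";
```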

      -Blake

        That'll teach me to try and one-line my original solution.:(

        For what it's worth, I didn't say that the last one would be more efficient, but I did say it would work ;(.

        The original was

        #! perl -sw
        use strict;

        my $var = "xxx:12345 yyy:54321 zzz:13245";
        my $p=0;
        do {
            ($p=index($var, ':', $p+1 )) > -1 and print substr( $var, $p-3,3),$/;
        } while ($p > -1);

        but I didn't like the double test against -1, so I tried to get rid of it. Don't know how I missed that it printed the extra one. A case of seeing what I wanted to see I guess.

        I'm not that surprised that doing the looping inside the regex engine is more efficient than at user level. I'm guessing that it makes a single pass looking for fixed anchors like the : when the /g option is used. I am surprised how much more efficient it is.
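
Whatever the engine does internally (the guess above is not checked here), the two ways of driving /g can be compared side by side. This is a small sketch for illustration: in list context the match returns every capture in one call, while in scalar context each call resumes from where the previous match ended, which pos reports.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $var = "xxx:12345 yyy:54321 zzz:13245";

# List context: one call, the engine does all the looping.
my @all = $var =~ /(.{3}):/g;
print "@all\n";

# Scalar context: the loop lives in user code, one match per pass.
# pos($var) is the offset just past the end of the latest match.
my @seen;
while ($var =~ /(.{3}):/g) {
    push @seen, [ $1, pos($var) ];
}
print "$_->[0] ended at $_->[1]\n" for @seen;
```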

        Nice benchmark BTW. Something I need to get better at.


        Cor! Like yer ring! ... HALO dammit! ... 'Ave it yer way! Hal-lo, Mister la-de-da. ... Like yer ring!