http://qs321.pair.com?node_id=527080

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all!
I have a string:
$seq = "IIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOMMMMMMMMMMMMMIIIIIMMMMMMMMMOOOOO"
     . "OOOOOOOOOMMMMMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOMMMMMMMMMMMIIIIIIM"
     . "MMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMMMMMMMOOOOOOOOOOOOOOO"
     . "OOOOOOOOOOOOMMMMMMMIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMM"
     . "MMMMMOOOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOO"
     . "OMMMMMMMMI";

and I want to find all groups of MMMMMM. I don't just want every position in the string that holds an 'M' (pos 6, pos 7, pos 8, pos 9, etc.); I want something like:
1st group : pos 7-15
2nd group : pos 23-34
3rd group : pos 45-55
etc
How can this be done?

Re: 'grouping' substrings?
by japhy (Canon) on Feb 01, 2006 at 15:59 UTC
    I'd make use of Perl's @- and @+ arrays, which are populated by regex matches:
    my $seq = "...";
    my @groups;
    push @groups, [$-[0], $+[0]] while $seq =~ /M+/g;
    print "$_->[0] to $_->[1]\n" for @groups;
    This gives me different values than you've shown, but I believe it's correct.

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
      Yes, don't mind what I wrote, it was just an example... I will try your code ASAP. I will also check the index function. Thanks to both of you!
        Sorry to bother you again, but it doesn't seem to work. For example, the first group gives 5-16, while it should be 5-15; the second gives 32-45, while it should be 32-44 from what I can calculate... Is my maths poor??? Also, I can't understand what [$-[0], $+[0]] means... Any tips? Sorry, I'm just beginning Perl...
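
On the question about `[$-[0], $+[0]]`: perlvar documents `$-[0]` as the 0-based offset where the last successful match starts, and `$+[0]` as the offset just past its end. The end is therefore exclusive, which would account for 5-16 versus the expected 5-15. A minimal sketch on a short made-up string, subtracting 1 to get an inclusive end; the sub name `m_runs` is hypothetical and not from the thread:

```perl
use strict;
use warnings;

# Return [start, end] pairs (0-based, inclusive) for every run of 'M'.
# m_runs is an illustrative name, not part of the original replies.
sub m_runs {
    my ($seq) = @_;
    my @groups;
    while ($seq =~ /M+/g) {
        # $-[0]: offset of the first matched character.
        # $+[0]: offset one past the last matched character.
        push @groups, [ $-[0], $+[0] - 1 ];
    }
    return @groups;
}

# Five I's, eleven M's, four O's: the M run covers offsets 5..15.
print "$_->[0] to $_->[1]\n" for m_runs('IIIII' . 'M' x 11 . 'OOOO');
# prints "5 to 15"
```

Adding 1 to both numbers would give 1-based positions instead.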
Re: 'grouping' substrings?
by kwaping (Priest) on Feb 01, 2006 at 15:54 UTC
    You might like Perl's index function.

      Using index would look something like the following:

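      sub using_index {
          our $seq;
          *seq = \$_[0];    # alias $seq to the argument instead of copying the string

          my @groups;
          my $pos   = -1;   # position of the last 'M' seen
          my $start = -1;   # start of the current run, or -1 if none yet

          for (;;) {
              my $new_pos = index($seq, 'M', $pos + 1);
              if ($new_pos < 0) {
                  # No more 'M's: close the final run, if there was one.
                  push(@groups, [ $start, $pos ]) if $start >= 0;
                  last;
              }

              if ($start < 0) {
                  $start = $new_pos;                 # first run begins
              }
              elsif ($new_pos - $pos > 1) {
                  push(@groups, [ $start, $pos ]);   # gap found: close the run
                  $start = $new_pos;
              }

              $pos = $new_pos;
          }

          return @groups;
      }

      It would be simpler if there were a function that returned the position of the next character which isn't 'M'.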
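
One way to approximate such a function, offered only as a sketch: set `pos()` on a copy of the string and let a `/[^M]/g` match in scalar context find the next non-'M' character. The names `next_non_m` and `runs_by_index` are hypothetical, and because the helper relies on a regex and copies the string on each call, this is not a pure `index` solution and is unlikely to be fast.

```perl
use strict;
use warnings;

# Hypothetical helper: offset of the first non-'M' character at or after
# $from, or length($seq) if the rest of the string is all 'M'.
sub next_non_m {
    my ($seq, $from) = @_;
    pos($seq) = $from;                       # start the /g search here
    return $seq =~ /[^M]/g ? $-[0] : length($seq);
}

# Collect [start, end] pairs (0-based, inclusive) for each run of 'M'.
sub runs_by_index {
    my ($seq) = @_;
    my @groups;
    my $start = index($seq, 'M');
    while ($start >= 0) {
        my $end = next_non_m($seq, $start);  # one past the end of the run
        push @groups, [ $start, $end - 1 ];
        $start = index($seq, 'M', $end);
    }
    return @groups;
}

print "$_->[0] to $_->[1]\n" for runs_by_index("IIMMMOOMI");
# prints "2 to 4" then "7 to 7"
```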

      As you can guess, it's much slower than the regexp approach. The regexp approach is 170% faster than (i.e. 2.7 times the speed of) the index method on the input you provided.

      Benchmark code:

      Benchmark results:

Re: 'grouping' substrings?
by murugu (Curate) on Feb 01, 2006 at 16:29 UTC

    I have come up with this. I don't know whether using $& is efficient or not.

    use strict;
    use warnings;

    my $seq = "IIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOMMMMMMMMMMMMMIIIIIMMMMMMMMMOOOOO"
            . "OOOOOOOOOMMMMMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOMMMMMMMMMMMIIIIIIM"
            . "MMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMMMMMMMOOOOOOOOOOOOOOO"
            . "OOOOOOOOOOOOMMMMMMMIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMM"
            . "MMMMMOOOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOO"
            . "OMMMMMMMMI";

    while ($seq =~ /(M+)/g) {
        my $l = pos($seq);
        print $l - length($&) + 1, " to ", $l, $/;
    }

    Regards,
    Murugesan Kandasamy
    use perl for(;;);

      I don't know whether using $& is efficient or not.

      Using $& is not efficient, and usually to be avoided. See the entry in perlvar for details.

      Update: perhaps I was a bit imprecise. I'd say something that "imposes a considerable performance penalty on all regular expression matches" is inefficient, but I guess it depends on what type of inefficiency we're talking about.

        That's not exactly true. $& is only inefficient if you have another regexp in your program which doesn't capture.

        However, its use is discouraged, since captures can perform the same task without the "effect at a distance" of $&.

        In this case, just replace $& with $1, and you're set.
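
A minimal sketch of that substitution, assuming the same loop as in murugu's reply: the pattern already captures with `(M+)`, so `$1` holds the same text as `$&`. The sub name `m_groups_1based` and the short test string are made up for illustration.

```perl
use strict;
use warnings;

# Return [first, last] pairs using 1-based positions, as in murugu's output.
# m_groups_1based is a hypothetical name, not from the thread.
sub m_groups_1based {
    my ($seq) = @_;
    my @groups;
    while ($seq =~ /(M+)/g) {
        # pos() is the 0-based offset just past the match, which equals
        # the 1-based position of the last matched character.
        my $end = pos($seq);
        push @groups, [ $end - length($1) + 1, $end ];
    }
    return @groups;
}

print join(" to ", @$_), "\n" for m_groups_1based("IIMMMOOMI");
# prints "3 to 5" then "8 to 8"
```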

Re: 'grouping' substrings?
by Cristoforo (Curate) on Feb 02, 2006 at 02:13 UTC
    Having the luxury of time to consider it ;-) , here was my approach using index.
    my @pos;
    my $start = index($str, 'M');
    while ($start != -1) {
        my $pos;
        my $i = 0;
        1 while ($pos = index($str, 'M', $start + $i)) == $start + $i++;
        push @pos, [$start, $start + $i-2];
        $start = $pos;
    }
    Update - fix three lines

        my $i = 1;
        $i++ while ($pos = index($str, 'M', $start + $i)) == $start + $i;
        push @pos, [$start, $start + $i-1];
Re: 'grouping' substrings?
by ysth (Canon) on Feb 02, 2006 at 00:30 UTC
    Others seem to have interpreted this as "find all groups of one or more M's".

    On the off chance that you actually meant 6 or more M's, try this modification of japhy's solution:

    my $seq = "...";
    my @groups;
    push @groups, [$-[0], $+[0]-1] while $seq =~ /M{6,}/g;
    print "$_->[0] to $_->[1]\n" for @groups;
    (where the displayed positions are 0-based.)
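
To illustrate the difference the `{6,}` quantifier makes, here is a small sketch on a made-up string containing one run of three M's and one run of seven. The sub name `long_m_runs` is hypothetical; the pattern is the one from the reply above.

```perl
use strict;
use warnings;

# Return [start, end] pairs (0-based, inclusive) for runs of six or more 'M'.
# long_m_runs is an illustrative name, not from the thread.
sub long_m_runs {
    my ($seq) = @_;
    my @groups;
    push @groups, [ $-[0], $+[0] - 1 ] while $seq =~ /M{6,}/g;
    return @groups;
}

# Offsets: II (0-1), MMM (2-4), OO (5-6), MMMMMMM (7-13), I (14).
my $demo = 'II' . 'M' x 3 . 'OO' . 'M' x 7 . 'I';
print "$_->[0] to $_->[1]\n" for long_m_runs($demo);
# prints "7 to 13"; the run of three M's is skipped
```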
Re: 'grouping' substrings?
by Skeeve (Parson) on Feb 02, 2006 at 12:56 UTC
    TIMTOWTDI:
    $seq = "IIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOMMMMMMMMMMMMMIIIIIMMMMMMMMMOOOOO"
         . "OOOOOOOOOMMMMMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOMMMMMMMMMMMIIIIIIM"
         . "MMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMMMMMMMOOOOOOOOOOOOOOO"
         . "OOOOOOOOOOOOMMMMMMMIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMM"
         . "MMMMMOOOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOO"
         . "OMMMMMMMMI";
    $i = 0;
    $_ = $seq;
    s/([^M]*)(M*)/{
        my $j = $i + length($1);
        $i = $j + length($2);
        $j == $i ? "" : "pos $j-$i\n"
    }/ge;
    print $seq, "\n", $_, "\n";

    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e