'grouping' substrings?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all!
I have a string:

$seq="IIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOMMMMMMMMMMMMMIIIIIMMMMMMMMMOOOOO"
    ."OOOOOOOOOMMMMMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOMMMMMMMMMMMIIIIIIM"
    ."MMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMMMMMMMOOOOOOOOOOOOOOO"
    ."OOOOOOOOOOOOMMMMMMMIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMM"
    ."MMMMMOOOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOO"
    ."OMMMMMMMMI";

and I want to find all groups of MMMMMM. I don't just want every position in the string that has an 'M' (pos 6, pos 7, pos 8, pos 9, etc.); I want something like:
1st group : pos 7-15
2nd group : pos 23-34
3rd group : pos 45-55
etc
How can this be done?


Replies are listed 'Best First'.
Re: 'grouping' substrings? by japhy (Canon) on Feb 01, 2006 at 15:59 UTC
I'd make use of Perl's `@-` and `@+` arrays produced by regexes: `my $seq = "..."; my @groups; push @groups, [$-[0], $+[0]] while $seq =~ /M+/g; print "$_->[0] to $_->[1]\n" for @groups;` This gives me different values than you've shown, but I believe it's correct.
Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
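To make the off-by-one question concrete, here is a minimal sketch of what `@-` and `@+` hold after a match. `$-[0]` is the 0-based offset where the match starts, and `$+[0]` is the 0-based offset just past its last character, so the end value is exclusive. The sub name `m_groups` and the short sample string are made up for illustration; the conversion to 1-based, inclusive positions is an assumption about what the question wants.

```perl
use strict;
use warnings;

# Return [start, end] pairs (1-based, inclusive) for every run of 'M'.
sub m_groups {
    my ($seq) = @_;
    my @groups;
    while ($seq =~ /M+/g) {
        # $-[0]: 0-based offset of the first matched character.
        # $+[0]: 0-based offset one past the last matched character.
        # 1-based inclusive start = $-[0] + 1
        # 1-based inclusive end   = ($+[0] - 1) + 1 = $+[0]
        push @groups, [ $-[0] + 1, $+[0] ];
    }
    return @groups;
}

# Hypothetical sample input: M runs at 0-based offsets 5..8 and 12..13.
my @found = m_groups('IIIIIMMMMOOOMM');
for my $n (0 .. $#found) {
    printf "group %d: pos %d-%d\n", $n + 1, @{ $found[$n] };
}
# prints:
# group 1: pos 6-9
# group 2: pos 13-14
```

For 0-based, inclusive positions instead, push `[ $-[0], $+[0] - 1 ]`.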
Re^2: 'grouping' substrings? by Anonymous Monk on Feb 01, 2006 at 16:15 UTC
Yes, don't mind what I wrote, it was just an example... will try your code ASAP. Will also check index function.. thanx to both of you!
Re^3: 'grouping' substrings? by Anonymous Monk on Feb 01, 2006 at 16:21 UTC
Sorry to bother you again, but it doesn't seem to work. For example, the first group gives 5-16, while it should be 5-15, the second 32-45, while it should be 32-44 from what I can calculate... Are my maths poor??? Also, I can't understand what [$-[0], $+[0]] means... Any tips? Sorry, I'm just beginning Perl...
Re^4: 'grouping' substrings? by kwaping (Priest) on Feb 01, 2006 at 16:25 UTC
Re^5: 'grouping' substrings? by ikegami (Patriarch) on Feb 01, 2006 at 17:02 UTC
Re^4: 'grouping' substrings? by ikegami (Patriarch) on Feb 01, 2006 at 17:04 UTC
Re: 'grouping' substrings? by kwaping (Priest) on Feb 01, 2006 at 15:54 UTC
You might like Perl's index function.
Re^2: 'grouping' substrings? by ikegami (Patriarch) on Feb 01, 2006 at 16:50 UTC
Using index would look something like the following: `sub using_index { our $seq; *seq = \$_[0]; my @groups; my $pos = -1; my $start = -1; for (;;) { my $new_pos = index($seq, 'M', $pos+1); if ($new_pos < 0) { if ($start >= 0) { push(@groups, [ $start, $pos ]); } last; } if ($start < 0) { $start = $new_pos; } elsif ($new_pos - $pos > 1) { push(@groups, [ $start, $pos ]); $start = $new_pos; } $pos = $new_pos; } return @groups; }` It would be simpler if there were a function that returned the position of the next character that isn't 'M'. As you can guess, it's much slower than the regexp approach. The regexp approach is 170% faster than (i.e. 2.7 times the speed of) the index method on the input you provided.
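The wish above for a function that finds the next character that isn't 'M' can be sketched with a small helper. `next_non_m` and `groups_with_helper` are hypothetical names, not Perl builtins, and this is not the benchmarked code from the post; it is only meant to show how the loop shrinks once such a helper exists. Positions are 0-based and inclusive.

```perl
use strict;
use warnings;

# Hypothetical helper: offset of the first character at or after $from
# that is not 'M', or length($str) if the run reaches the end.
sub next_non_m {
    my ($str, $from) = @_;
    my $len = length $str;
    $from++ while $from < $len && substr($str, $from, 1) eq 'M';
    return $from;
}

# Return [start, end] pairs (0-based, inclusive) for each run of 'M'.
sub groups_with_helper {
    my ($str) = @_;
    my @groups;
    my $start = index($str, 'M');
    while ($start >= 0) {
        my $end = next_non_m($str, $start);    # one past the run
        push @groups, [ $start, $end - 1 ];
        $start = index($str, 'M', $end);       # -1 when no more runs
    }
    return @groups;
}

print "$_->[0] to $_->[1]\n" for groups_with_helper('IIMMMOOMMI');
# prints:
# 2 to 4
# 7 to 8
```

Scanning one character at a time with `substr` is unlikely to be faster than the regexp approach; the sketch only aims at readability.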
Re: 'grouping' substrings? by murugu (Curate) on Feb 01, 2006 at 16:29 UTC
I have come up with this. I don't know whether using $& is efficient or not. `use strict; use warnings; my $seq="IIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOMMMMMMMMMMMMMIIIIIMMMMMMMMMOOOOO" . "OOOOOOOOOMMMMMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOMMMMMMMMMMMIIIIIIM" . "MMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMMMMMMMOOOOOOOOOOOOOOO" . "OOOOOOOOOOOOMMMMMMMIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMM" . "MMMMMOOOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOO" . "OMMMMMMMMI"; while ($seq=~/(M+)/g) { my $l = pos($seq); print $l-length($&)+1," to ",$l,$/; }`
Regards, Murugesan Kandasamy
use perl for(;;);
Re^2: 'grouping' substrings? by revdiablo (Prior) on Feb 01, 2006 at 16:42 UTC
"I don't know whether using $& is efficient or not."
Using `$&` is not efficient, and is usually to be avoided. See the entry in perlvar for details. Update: perhaps I was a bit imprecise. I'd say something that "imposes a considerable performance penalty on all regular expression matches" is inefficient, but I guess it depends on what type of inefficiency we're talking about.
Re^3: 'grouping' substrings? by ikegami (Patriarch) on Feb 01, 2006 at 16:52 UTC
That's not exactly true. `$&` is only inefficient if you have another regexp in your program which doesn't capture. However, its use is discouraged, since captures can perform the same task without the "effect at a distance" of `$&`. In this case, just replace `$&` with `$1`, and you're set.
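As a sketch of that substitution, here is the same loop with the capture `$1` in place of `$&`, wrapped in a sub so it can be reused. The sub name `m_runs_1based` and the sample string are made up for illustration; the positions it returns are 1-based and inclusive, matching the arithmetic in the loop above.

```perl
use strict;
use warnings;

# Return [start, end] pairs (1-based, inclusive) for each run of 'M',
# using the capture group $1 rather than $&.
sub m_runs_1based {
    my ($seq) = @_;
    my @runs;
    while ($seq =~ /(M+)/g) {
        my $end = pos($seq);    # 0-based offset just past the match
        push @runs, [ $end - length($1) + 1, $end ];
    }
    return @runs;
}

print "$_->[0] to $_->[1]\n" for m_runs_1based('IIMMMOOMMI');
# prints:
# 3 to 5
# 8 to 9
```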
Re: 'grouping' substrings? by Cristoforo (Curate) on Feb 02, 2006 at 02:13 UTC
Having the luxury of time to consider it ;-) , here was my approach using index. `my @pos; my $start = index($str, 'M'); while ($start != -1) { my $pos; my $i = 0; 1 while ($pos = index($str, 'M', $start + $i)) == $start + $i++; push @pos, [$start, $start + $i-2]; $start = $pos; }`
Update: fixed three lines: `my $i = 1; $i++ while ($pos = index($str, 'M', $start + $i)) == $start + $i; push @pos, [$start, $start + $i-1];`
Re: 'grouping' substrings? by ysth (Canon) on Feb 02, 2006 at 00:30 UTC
Others seem to have interpreted this as "find all groups of one or more M's". On the off chance that you actually meant 6 or more M's, try this modification of japhy's solution: `my $seq = "..."; my @groups; push @groups, [$-[0], $+[0]-1] while $seq =~ /M{6,}/g; print "$_->[0] to $_->[1]\n" for @groups;` (where the displayed positions are 0-based.)
Re: 'grouping' substrings? by Skeeve (Parson) on Feb 02, 2006 at 12:56 UTC
TIMTOWTDI:
$seq="IIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOMMMMMMMMMMMMMIIIIIMMMMMMMMMOOOOO"
    ."OOOOOOOOOMMMMMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOMMMMMMMMMMMIIIIIIM"
    ."MMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMMMMMMMOOOOOOOOOOOOOOO"
    ."OOOOOOOOOOOOMMMMMMMIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMM"
    ."MMMMMOOOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOO"
    ."OMMMMMMMMI";
$i=0;
$_= $seq;
s/([^M]*)(M*)/{ my $j=$i+length($1); $i=$j+length($2); $j==$i ? "" : "pos $j-$i\n" }/ge;
print $seq, "\n", $_, "\n";
`s$$([},&%#}/&/]+}%&{});#$&&s&&$^X.($'^"%]=\&(\|?{%` `+`.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
