Regex related question

Hena has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I need to do a script which trims ends of strings if there is too many same letters at the end of the string. If there is 3 or more same letters in the end, I would like to shorten the string to have only two same letters. My initial version of doing this is below.

#!/usr/bin/perl

use warnings;
use strict;

while (<DATA>) {
  chomp;
  print "$_ -> ";
  m/(\w)\Z/;
  my $last = $1;
  if (s/($last{3,})\Z/$last$last/) {
    my $len = length($1)-2;
    print "$len -> $_\n";
  } else { print "\n" }
}

__DATA__
ACTGCTAGGGGGGG
TCAGCTAGCNA
ACTGSCGACAAAA
GTCTGAGTTATTT
[download]

And the result of it is

ACTGCTAGGGGGGG -> 5 -> ACTGCTAGG
TCAGCTAGCNA -> 
ACTGSCGACAAAA -> 2 -> ACTGSCGACAA
GTCTGAGTTATTT -> 1 -> GTCTGAGTTATT
[download]

The prints are there for sake of understanding the situation. However I do need the length of the trimmed string. Though I can of course just use length() after trimming to get it.

However I'm wondering if there is a better, or should I say faster, way of doing this. My inputs are several GB long and I would like to avoid a regex which needs to be recompiled at every time it is used (which it now does as there is a variable in the substitution). I thought of using the (??) construct but since I would need to "go back" after finding out the last character I couldn't come up with a workable solution.

Comment on Regex related question Select or Download Code

Replies are listed 'Best First'.
Re: Regex related question by Perlbotics (Archbishop) on Aug 08, 2011 at 07:09 UTC
I have not benchmarked this, but it is probably faster than your current version. HTH `use warnings; use strict; while (<DATA>) { chomp; print "$_ -> "; my $len = length; if ( s/(\w)\1{2,}\Z/$1$1/ ) { $len -= length; print "$len -> $_"; } print "\n"; } __DATA__ ACTGCTAGGGGGGG TCAGCTAGCNA ACTGSCGACAAAA GTCTGAGTTATTT` [download] Update: davido encouraged me to present my alternative version. I converted the original array based variation into a string based one, hoping (not measured) for better speed. So, in the sense of TIMTOWTDI: `use strict; use warnings; my $in = shift \|\| 'GATATTTTTTT'; $_ = $in; my $last = substr $_, -1; if ( length > 2 and substr($_, -2, 1) eq $last ) { chop while substr($_, -1) eq $last; $_ .= $last . $last; } print "in : $in\n"; print "out: $_\n";` [download]	[reply] [d/l] [select]
Re^2: Regex related question by happy.barney (Friar) on Aug 08, 2011 at 08:55 UTC
small improvement, avg 20% faster regex`s/(\w)\1*(?=\1\1\Z)//`	[reply] [d/l]
Re^2: Regex related question by Hena (Friar) on Aug 08, 2011 at 07:24 UTC
Elegant and simple as it should be. Thanks :).	[reply]
Re: Regex related question by davido (Cardinal) on Aug 08, 2011 at 07:09 UTC
This single regexp might be what you're after. It uses backreferences to make sure that you match the same thing three or more times, and then keeps only the first two occurrences. All this has to happen at the end of the string. `s/(\w){2}\1+\Z/$1$1/;` Update: Perlbotics solution above is it. ;) Bah, hate it when that happens! To get a good explanation, run the following one-liner: `perl -MYAPE::Regex::Explain -E 'say YAPE::Regex::Explain->new(qr/( \w){2}\1+\Z/)->explain();'` [download] Update2: So since I botched it, and the best s/// construct was already posted, I figured I may as well have a little fun with my walk of shame. :) The following is a substr approach that is more in keeping with how life was before every programming language developed Perl-envy and incorporated its own version of regular expressions (no, Perl didn't invent them, but was a big part of popularizing them). Have a look and enjoy knowing that you live in a Perlish world instead. `use strict; use warnings; use v5.12; my $string = "abcdefggggggggggg"; my $position = length( $string ) - 1; my $find = substr $string, $position, 1; $position-- while substr( $string, $position, 1 ) eq $find; substr( $string, $position + 3, length( $string ) - ( $position + 3 ), + '') if length( $string ) - $position > 3; say $string;` [download] Dave	[reply] [d/l] [select]
Re^2: Regex related question by Marshall (Canon) on Aug 08, 2011 at 08:22 UTC
I think there needs to be a condition so that the last substr is only run if needed. I came up with a similar coding.. If speed is of interest, then I would benchmark these substr approach vs the regex. I've found that sometimes the s/// can be slow, but the regex engine evolves all the time so benchmarking would be the only way to really know for the Perl that is being used. #!/usr/bin/perl -w use strict; my @strings = qw ( ACTGCTAGGGGGGG TCAGCTAGCNA ACTGSCGACAAAA GTCTGAGTTATTT); foreach my $str (@strings) { my $last_char = substr ($str,-1,1); my $cur_index = -1; while (substr ($str, --$cur_index,1) eq $last_char){} print "old: $str \n"; substr ($str,$cur_index+1,-$cur_index-3,"") if ($cur_index < 3); print "new: $str\n"; } __END__ old: ACTGCTAGGGGGGG new: ACTGCTAGG old: TCAGCTAGCNA new: TCAGCTAGCNA old: ACTGSCGACAAAA new: ACTGSCGACAA old: GTCTGAGTTATTT new: GTCTGAGTTATT [download]	[reply] [d/l]
Re^3: Regex related question by davido (Cardinal) on Aug 08, 2011 at 08:28 UTC
I usually would say that the minor speed difference shouldn't matter. But all I know about genome mapping is that it's computationally intensive, so checking it out is probably a good idea. Dave	[reply]
Re^2: Regex related question by Hena (Friar) on Aug 08, 2011 at 07:26 UTC
This doesn't quite work. It breaks if there is duplication in end. See for example input: GCTGTGTGTGT. Thanks for trying though :).	[reply]
Re^3: Regex related question by davido (Cardinal) on Aug 08, 2011 at 07:32 UTC
:) That darn quantifier position is a pain in the rear. ;) The best way to fix it would be to use Perlbotics solution above. I posted a "just for fun" version as an update. Dave	[reply]
Re: Regex related question by davido (Cardinal) on Aug 09, 2011 at 08:53 UTC
Several people have commented in followups as well as messages to me that the various approaches this thread accumulated ought to be benchmarked. I already commented that I would probably be disinclined to care about how fast the various approaches are, given there's nothing particularly exciting going on (we're not dealing with traversing data structures, sorting, and so on). But given the context of what looks like DNA-munging, I do know that such tasks have a tendency to be brute-force expensive. Of course without the context of seeing our little snippets of code within a greater application we have no way of knowing how computationally important our code suggestions will be. Profiling is the only way to know if any of this is worthwhile or not. With that in mind I decided to have a little fun with Benchmark. And lately I've been trying to also take on opportunities to put Test::More through its paces too. So here is the wall of code that is a benchmark of all of the alternatives posted to this thread: Read more... (4 kB) On my system that produces the following mass of output (along with the benchmark results): Read more... (4 kB)	[reply] [d/l]
Re: Regex related question by JavaFan (Canon) on Aug 08, 2011 at 08:36 UTC
`s/(.)\1\K(\1+)$//` (Untested - requires 5.10). The required length would be calculated as: `2 + length($2)`.	[reply] [d/l] [select]


more useful options
	PerlMonks