Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Regex related question

by Hena (Friar)
on Aug 08, 2011 at 06:39 UTC ( [id://919151]=perlquestion: print w/replies, xml ) Need Help??

Hena has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I need to do a script which trims ends of strings if there is too many same letters at the end of the string. If there is 3 or more same letters in the end, I would like to shorten the string to have only two same letters. My initial version of doing this is below.
#!/usr/bin/perl use warnings; use strict; while (<DATA>) { chomp; print "$_ -> "; m/(\w)\Z/; my $last = $1; if (s/($last{3,})\Z/$last$last/) { my $len = length($1)-2; print "$len -> $_\n"; } else { print "\n" } } __DATA__ ACTGCTAGGGGGGG TCAGCTAGCNA ACTGSCGACAAAA GTCTGAGTTATTT
And the result of it is
ACTGCTAGGGGGGG -> 5 -> ACTGCTAGG TCAGCTAGCNA -> ACTGSCGACAAAA -> 2 -> ACTGSCGACAA GTCTGAGTTATTT -> 1 -> GTCTGAGTTATT
The prints are there for sake of understanding the situation. However I do need the length of the trimmed string. Though I can of course just use length() after trimming to get it.

However I'm wondering if there is a better, or should I say faster, way of doing this. My inputs are several GB long and I would like to avoid a regex which needs to be recompiled at every time it is used (which it now does as there is a variable in the substitution). I thought of using the (??) construct but since I would need to "go back" after finding out the last character I couldn't come up with a workable solution.

Replies are listed 'Best First'.
Re: Regex related question
by Perlbotics (Archbishop) on Aug 08, 2011 at 07:09 UTC

    I have not benchmarked this, but it is probably faster than your current version. HTH

    use warnings; use strict; while (<DATA>) { chomp; print "$_ -> "; my $len = length; if ( s/(\w)\1{2,}\Z/$1$1/ ) { $len -= length; print "$len -> $_"; } print "\n"; } __DATA__ ACTGCTAGGGGGGG TCAGCTAGCNA ACTGSCGACAAAA GTCTGAGTTATTT

    Update: davido encouraged me to present my alternative version. I converted the original array based variation into a string based one, hoping (not measured) for better speed. So, in the sense of TIMTOWTDI:

    use strict; use warnings; my $in = shift || 'GATATTTTTTT'; $_ = $in; my $last = substr $_, -1; if ( length > 2 and substr($_, -2, 1) eq $last ) { chop while substr($_, -1) eq $last; $_ .= $last . $last; } print "in : $in\n"; print "out: $_\n";

      small improvement, avg 20% faster regexs/(\w)\1*(?=\1\1\Z)//
      Elegant and simple as it should be. Thanks :).
Re: Regex related question
by davido (Cardinal) on Aug 08, 2011 at 07:09 UTC

    This single regexp might be what you're after. It uses backreferences to make sure that you match the same thing three or more times, and then keeps only the first two occurrences. All this has to happen at the end of the string.

    s/(\w){2}\1+\Z/$1$1/;

    Update: Perlbotics solution above is it. ;) Bah, hate it when that happens!

    To get a good explanation, run the following one-liner:

    perl -MYAPE::Regex::Explain -E 'say YAPE::Regex::Explain->new(qr/( \w){2}\1+\Z/)->explain();'

    Update2:

    So since I botched it, and the best s/// construct was already posted, I figured I may as well have a little fun with my walk of shame. :)

    The following is a substr approach that is more in keeping with how life was before every programming language developed Perl-envy and incorporated its own version of regular expressions (no, Perl didn't invent them, but was a big part of popularizing them). Have a look and enjoy knowing that you live in a Perlish world instead.

    use strict; use warnings; use v5.12; my $string = "abcdefggggggggggg"; my $position = length( $string ) - 1; my $find = substr $string, $position, 1; $position-- while substr( $string, $position, 1 ) eq $find; substr( $string, $position + 3, length( $string ) - ( $position + 3 ), + '') if length( $string ) - $position > 3; say $string;

    Dave

      I think there needs to be a condition so that the last substr is only run if needed. I came up with a similar coding.. If speed is of interest, then I would benchmark these substr approach vs the regex. I've found that sometimes the s/// can be slow, but the regex engine evolves all the time so benchmarking would be the only way to really know for the Perl that is being used.
      #!/usr/bin/perl -w use strict; my @strings = qw ( ACTGCTAGGGGGGG TCAGCTAGCNA ACTGSCGACAAAA GTCTGAGTTATTT); foreach my $str (@strings) { my $last_char = substr ($str,-1,1); my $cur_index = -1; while (substr ($str, --$cur_index,1) eq $last_char){} print "old: $str \n"; substr ($str,$cur_index+1,-$cur_index-3,"") if ($cur_index < 3); print "new: $str\n"; } __END__ old: ACTGCTAGGGGGGG new: ACTGCTAGG old: TCAGCTAGCNA new: TCAGCTAGCNA old: ACTGSCGACAAAA new: ACTGSCGACAA old: GTCTGAGTTATTT new: GTCTGAGTTATT

        I usually would say that the minor speed difference shouldn't matter. But all I know about genome mapping is that it's computationally intensive, so checking it out is probably a good idea.


        Dave

      This doesn't quite work. It breaks if there is duplication in end. See for example input: GCTGTGTGTGT.

      Thanks for trying though :).

        :) That darn quantifier position is a pain in the rear. ;) The best way to fix it would be to use Perlbotics solution above.

        I posted a "just for fun" version as an update.


        Dave

Re: Regex related question
by davido (Cardinal) on Aug 09, 2011 at 08:53 UTC

    Several people have commented in followups as well as messages to me that the various approaches this thread accumulated ought to be benchmarked. I already commented that I would probably be disinclined to care about how fast the various approaches are, given there's nothing particularly exciting going on (we're not dealing with traversing data structures, sorting, and so on). But given the context of what looks like DNA-munging, I do know that such tasks have a tendency to be brute-force expensive.

    Of course without the context of seeing our little snippets of code within a greater application we have no way of knowing how computationally important our code suggestions will be. Profiling is the only way to know if any of this is worthwhile or not. With that in mind I decided to have a little fun with Benchmark. And lately I've been trying to also take on opportunities to put Test::More through its paces too. So here is the wall of code that is a benchmark of all of the alternatives posted to this thread:

    On my system that produces the following mass of output (along with the benchmark results):

Re: Regex related question
by JavaFan (Canon) on Aug 08, 2011 at 08:36 UTC
    s/(.)\1\K(\1+)$// (Untested - requires 5.10). The required length would be calculated as: 2 + length($2).

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://919151]
Approved by Perlbotics
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (7)
As of 2024-04-19 10:40 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found