help with regex

rnaeye has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: help with regex
by Athanasius (Archbishop) on Mar 20, 2019 at 03:20 UTC

To capture 10 or more consecutive G characters, you don’t need (G)\1{9,}, just use (G{10,}), which is simpler and easier to read.
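
A minimal sketch of the difference, assuming a made-up sample string (not the OP's data): with the backreference form, $1 holds only a single G and the run is the overall match, whereas the quantifier form captures the whole run.

```perl
use strict;
use warnings;

# Made-up sample: a short run of 3 Gs, then a run of 12 Gs.
my $seq = 'ACGGGTT' . ('G' x 12) . 'AC';

# Backreference form: $1 holds a single G; the run is the overall match.
my ($backref_capture, $backref_length);
if ($seq =~ /(G)\1{9,}/) {
    $backref_capture = $1;
    $backref_length  = $+[0] - $-[0];   # length of the whole match
}

# Quantifier form: the capture itself holds the entire run.
my ($run) = $seq =~ /(G{10,})/;

print "backreference: \$1 = '$backref_capture', match length = $backref_length\n";
print "quantifier:    \$1 has length ", length($run), "\n";
```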

Note that when you have a regex of the form / .* (G{10,}) /x, the first .* is greedy and will match as much of the G-sequence as it can, so the second capture will contain only the 10 Gs it needs to satisfy the match. If you want all the Gs (15 for the sample data given), you need to make the first match non-greedy: / .*? G{10,} /x.
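
To see the greedy and non-greedy behaviour side by side, here is a small sketch on a made-up string (a short prefix followed by a run of 15 Gs; the variable names are only for illustration):

```perl
use strict;
use warnings;

# Made-up sample: a short prefix, a run of 15 Gs, then two more characters.
my $seq = 'ACTT' . ('G' x 15) . 'AG';

# Greedy: .* swallows everything, then backtracks only as far as needed,
# leaving the minimum of 10 Gs for the capture.
my ($greedy) = $seq =~ / .* (G{10,}) /x;

# Non-greedy: .*? stops at the first position where G{10,} can match,
# so the capture takes the whole run.
my ($lazy) = $seq =~ / .*? (G{10,}) /x;

print "greedy prefix captures ", length($greedy), " Gs\n";
print "lazy prefix captures   ", length($lazy),   " Gs\n";
```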

Your requirements are not clear (to me). Please provide the exact output you desire for the given input data (and additional lines of input together with the desired output for each). In the meantime, I’m guessing you want to find a 10-character ACTG sequence immediately following the specific sequence ACTCCAGTCACGCCAATATCTCGTAT and followed (but not necessarily immediately) by a 10+ sequence of G characters:

use 5.18.2;

while (my $line = <DATA>)
{
    say;    # prints $_, which is not set here, so this emits a blank line

    if ($line =~ m/ (ACTCCAGTCACGCCAATATCTCGTAT) ([ACTG]{10}) .*? (G{10,}) /x)
    {
        say for $1, $2, $3;    # Can use @{^CAPTURE} in Perl 5.25.7 and later
    }
}

__DATA__
GGCTTTCCGTTGTTGCTGGGTGTGGGGGGCGGGCGAGATTGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGTGGGGGGGAGGGGGGGCGGGGGGGGGGGGGGGAGGGGGGGAG

Output:

13:18 >perl 1986_SoPW.pl

ACTCCAGTCACGCCAATATCTCGTAT
GCCGTCTTCT
GGGGGGGGGGGGGGG

13:18 >

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

Re^2: help with regex

by jwkrahn (Abbot) on Mar 20, 2019 at 04:20 UTC

If you want all the Gs (15 for the sample data given), you need to make the first match non-greedy: / .*? G{10,} /x.

Or just / G{10,} /x would be simpler and do the same thing. (Unless the string contains newlines!)
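
A quick check on the sample line from this thread, as a sketch only: both patterns should capture the same run of Gs, and only the starting point of the overall match differs. The variable names are made up for illustration.

```perl
use strict;
use warnings;

# The sample line from the thread, split into pieces for readability.
my $seq = 'GGCTTTCCGTTGTTGCTGGGTGTGGGGGGCGGGCGAGATTGGAAGAGCACACGTCTGAACTCCAGTCACG'
        . 'CCAATATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGTGGGGGGGAGGGGGGGCGGGGGGGGGGGGG'
        . 'GGAGGGGGGGAG';

# With the non-greedy prefix, the overall match starts at the beginning
# of the string and the capture holds the first run of 10+ Gs.
my ($with_prefix) = $seq =~ / .*? (G{10,}) /x;
my $prefix_start  = $-[0];

# Without the prefix, the overall match starts at the run itself.
my ($plain)     = $seq =~ / (G{10,}) /x;
my $plain_start = $-[0];

print "captured lengths: ", length($with_prefix), " and ", length($plain), "\n";
print "overall match starts at offset $prefix_start vs $plain_start\n";
```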

Re: help with regex
by hdb (Monsignor) on Mar 20, 2019 at 08:16 UTC

split can also capture. So if you want all the pieces, both the G-strings and the other bits, it could work like this:

use strict;
use warnings;

my $line = "GGCTTTCCGTTGTTGCTGGGTGTGGGGGGCGGGCGAGATTGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGTGGGGGGGAGGGGGGGCGGGGGGGGGGGGGGGAGGGGGGGAG";
my @pieces = split /(G{10,})/, $line;
print "$_\n" for @pieces;
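
To make the shape of the result easier to see, here is a small sketch on a made-up string with two long G runs. Because the pattern has capturing parentheses, split returns the separators (the G runs) in between the other pieces, so they land at the odd indices.

```perl
use strict;
use warnings;

# Made-up sample with two runs of 10 or more Gs.
my $seq = 'ACT' . ('G' x 11) . 'TTGGA' . ('G' x 10) . 'C';

# The captured separators are returned along with the fields around them.
my @pieces = split /(G{10,})/, $seq;

# Label each piece: odd indices hold the captured G runs.
for my $i (0 .. $#pieces) {
    my $kind = $i % 2 ? 'G run' : 'other';
    printf "%-5s %s\n", $kind, $pieces[$i];
}
```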

Re: help with regex
by Marshall (Canon) on Mar 20, 2019 at 05:26 UTC

As far as I can tell there is only one string that is between a G(...) and 10+ more G's.
Update: shortened the very long string so it displays better.
Your question is not clear. Your code is not right.

Given the string: "GGCTTTCCGTTGTTGCTGGGTGTGGGGGGCGGGCGAGATTGGAAGAGCACACGTCTGAACTCCAGTCACG".
"CCAATATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGTGGGGGGGAGGGGGGGCGGGGGGGGGGGGGGGAGGGGGGGAG", please explain in English what you want to accomplish. I have no idea what this means:

"ACTCCAGTCACGCCAATATCTCGTAT" "[ACTG]{0,10}" " .+" "(G)\1{9,} ".+ "

Update:
is this what you want?

#!/usr/bin/perl
use strict;
use warnings;

while (my $line=<DATA>)
{ 
    chomp $line;
    my @array = $line =~ /(ACTCCAGTCACGCCAATATCTCGTAT)(.+?)(?:G{10,})/g;
    
    print join ("\n", @array),"\n";
    #prints "ACTCCAGTCACGCCAATATCTCGTAT"
    #       "GCCGTCTTCTGCTTGAAAAAAGGGGTGGGGGGGAGGGGGGGC" -> stuff before 10 or more G's

    
}

__DATA__
GGCTTTCCGTTGTTGCTGGGTGTGGGGGGCGGGCGAGATTGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGTGGGGGGGAGGGGGGGCGGGGGGGGGGGGGGGAGGGGGGGAG

Re: help with regex
by rnaeye (Friar) on Mar 21, 2019 at 00:49 UTC

Thank you everyone for trying to help. The closest thing to what I wanted was / .*? G{10,} /x. I somehow solved the problem based on the suggestions. If I reposted the question in more detail, I am sure I would get an exact solution, but that would take a lot of writing. The answers and questions were educational and helpful for me.

Re: help with regex
by Marshall (Canon) on Mar 20, 2019 at 05:23 UTC

As far as I can tell there is only one string that is between a G(...) and 10+ more G's. Your question is not clear. Your code is not right. Given the string: "GGCTTTCCGTTGTTGCTGGGTGTGGGGGGCGGGCGAGATTGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGTGGGGGGGAGGGGGGGCGGGGGGGGGGGGGGGAGGGGGGGAG", please explain in English what you want to accomplish.
