PerlMonks  

help with regex

by rnaeye (Friar)
on Mar 20, 2019 at 02:07 UTC

rnaeye has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to catch matches in a DNA sequence. I can capture repeating G (10 or more) in DNA as below.

    use 5.18.2;
    my $line;
    while (<DATA>){
        $line = $_;
        if ($line =~ m/(G)\1{9,}/) {
            say "$&"
        }
    }
    __DATA__
    GGCTTTCCGTTGTTGCTGGGTGTGGGGGGCGGGCGAGATTGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGTGGGGGGGAGGGGGGGCGGGGGGGGGGGGGGGAGGGGGGGAG

What I additionally want to capture is shown below. Besides the run of 10 or more Gs, I also want to capture the strings to the left of (G)\1{9,}. Note that I use " " only to mark what I want to capture; the quotes are not part of the DNA string. I could not capture the other parts of the string in conjunction with (G)\1{9,}. I need to print what I capture.

 "ACTCCAGTCACGCCAATATCTCGTAT"  "[ACTG]{0,10}"  " .+"  "(G)\1{9,}   ".+ "

Thanks.

Replies are listed 'Best First'.
Re: help with regex
by Athanasius (Archbishop) on Mar 20, 2019 at 03:20 UTC

    Hello rnaeye,

    To capture 10 or more consecutive G characters, you don’t need (G)\1{9,}, just use (G{10,}), which is simpler and easier to read.
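    As a quick sanity check of that equivalence, here is a minimal sketch that runs both forms of the pattern over a few made-up inputs. The helper names match_backref and match_quant and the sample strings are illustrative only and do not come from the original post.

```perl
use strict;
use warnings;

# Return the text matched by each form of the pattern (undef if no match).
# match_backref uses the backreference form, match_quant the plain quantifier.
sub match_backref { my ($s) = @_; return $s =~ /((G)\2{9,})/ ? $1 : undef }
sub match_quant   { my ($s) = @_; return $s =~ /(G{10,})/    ? $1 : undef }

# Hypothetical inputs: 9 Gs (too short), exactly 10 Gs, 13 Gs inside other bases.
for my $s ('G' x 9, 'G' x 10, 'AC' . ('G' x 13) . 'T') {
    my $old  = match_backref($s);
    my $new  = match_quant($s);
    my $same = (defined $old ? $old : '') eq (defined $new ? $new : '');
    print "$s: ", ($same ? 'same result' : 'different result'), "\n";
}
```

    Both helpers return the same run (or nothing) for these inputs, so the shorter G{10,} form loses nothing here.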

    Note that when you have a regex of the form / .* (G{10,}) /x, the leading .* is greedy and will match as much of the G-sequence as it can, so the capture group will contain only the 10 Gs it needs to satisfy the match. If you want all the Gs (15 for the sample data given), you need to make the leading match non-greedy: / .*? G{10,} /x.
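    The difference between the greedy and non-greedy prefix can be seen on a short string. This is only a sketch: the sequence below is made up for illustration (12 Gs between a few other bases), not taken from the sample data.

```perl
use strict;
use warnings;

# Hypothetical short sequence: a run of 12 Gs surrounded by other bases.
my $seq = 'ACTA' . ('G' x 12) . 'TC';

# Greedy prefix: .* first swallows the whole string, then backtracks only
# far enough to leave the 10 Gs the capture group needs.
my ($greedy) = $seq =~ / .* (G{10,}) /x;

# Non-greedy prefix: .*? stops as early as possible, so the capture group
# receives the entire run.
my ($lazy) = $seq =~ / .*? (G{10,}) /x;

print 'greedy prefix captured ', length($greedy), " Gs\n";   # 10
print 'lazy prefix captured ',   length($lazy),   " Gs\n";   # 12
```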

    Your requirements are not clear (to me). Please provide the exact output you desire for the given input data (and additional lines of input together with the desired output for each). In the meantime, I’m guessing you want to find a 10-character ACTG sequence immediately following the specific sequence ACTCCAGTCACGCCAATATCTCGTAT and followed (but not necessarily immediately) by a 10+ sequence of G characters:

        use 5.18.2;

        while (my $line = <DATA>) {
            say;
            if ($line =~ m/ (ACTCCAGTCACGCCAATATCTCGTAT) ([ACTG]{10}) .*? (G{10,}) /x) {
                say for $1, $2, $3;    # Can use @{^CAPTURE} in Perl 5.25.7 and later
            }
        }

        __DATA__
        GGCTTTCCGTTGTTGCTGGGTGTGGGGGGCGGGCGAGATTGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGTGGGGGGGAGGGGGGGCGGGGGGGGGGGGGGGAGGGGGGGAG

    Output:

        13:18 >perl 1986_SoPW.pl
        ACTCCAGTCACGCCAATATCTCGTAT
        GCCGTCTTCT
        GGGGGGGGGGGGGGG
        13:18 >

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      If you want all the Gs (15 for the sample data given), you need to make the first match non-greedy: / .*? G{10,} /x.

      Or just / G{10,} /x would be simpler and do the same thing. (Unless the string contains newlines!)

Re: help with regex
by hdb (Monsignor) on Mar 20, 2019 at 08:16 UTC

    split can also capture. So if you want all the pieces, the G-strings and the other bits, it could work like this

        use strict;
        use warnings;

        my $line = "GGCTTTCCGTTGTTGCTGGGTGTGGGGGGCGGGCGAGATTGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGTGGGGGGGAGGGGGGGCGGGGGGGGGGGGGGGAGGGGGGGAG";
        my @pieces = split /(G{10,})/, $line;
        print "$_\n" for @pieces;
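
    To make it easier to see which pieces are the separators and which are the text in between, here is a small sketch of the same split on a made-up sequence (not the sample data), labelling each piece. The labels 'G-run' and 'other' are just illustrative names.

```perl
use strict;
use warnings;

# Hypothetical short sequence containing two runs of 10 or more Gs.
my $seq = 'ACT' . ('G' x 10) . 'TTC' . ('G' x 12) . 'A';

# Because the pattern is wrapped in a capturing group, split returns the
# separators (the G-runs) along with the text between them.
my @pieces = split /(G{10,})/, $seq;

# Label each piece so the alternation between text and G-runs is visible.
for my $piece (@pieces) {
    my $kind = $piece =~ /\AG{10,}\z/ ? 'G-run' : 'other';
    printf "%-6s %s\n", $kind, $piece;
}
```

    Without the parentheses in the pattern, split would discard the G-runs and return only the pieces between them.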
Re: help with regex
by Marshall (Canon) on Mar 20, 2019 at 05:26 UTC
    Please post your desired output from your example.
    Better yet is to show 2-3 examples and the desired output for each of them.

    As far as I can tell there is only one string that is between a G(...) and 10+ more G's.
    Update: shortened the very long string so it displays better.
    Your question is not clear. Your code is not right.

    Given the string: "GGCTTTCCGTTGTTGCTGGGTGTGGGGGGCGGGCGAGATTGGAAGAGCACACGTCTGAACTCCAGTCACG" . "CCAATATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGTGGGGGGGAGGGGGGGCGGGGGGGGGGGGGGGAGGGGGGGAG", please explain in English what you want to accomplish. I have no idea what this means:

    "ACTCCAGTCACGCCAATATCTCGTAT"  "[ACTG]{0,10}"  " .+"  "(G)\1{9,}   ".+ "
    or how that relates to 10 G's or more in a row.

    Update:
    is this what you want?

        #!/usr/bin/perl
        use strict;
        use warnings;

        while (my $line=<DATA>) {
            chomp $line;
            my @array = $line =~ /(ACTCCAGTCACGCCAATATCTCGTAT)(.+?)(?:G{10,})/g;
            print join ("\n", @array),"\n";
            #prints "ACTCCAGTCACGCCAATATCTCGTAT"
            #       "GCCGTCTTCTGCTTGAAAAAAGGGGTGGGGGGGAGGGGGGGC" -> stuff before 10 or more G's
        }
        __DATA__
        GGCTTTCCGTTGTTGCTGGGTGTGGGGGGCGGGCGAGATTGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGTGGGGGGGAGGGGGGGCGGGGGGGGGGGGGGGAGGGGGGGAG
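
    If the G-run itself should also be printed, one possible variation is to capture it as well and give each part a name. This is a sketch only: the adapter sequence comes from the question, but the surrounding test string and the capture names adapter, between, and g_run are made up for illustration. Named captures need Perl 5.10 or later.

```perl
use strict;
use warnings;

# Adapter sequence taken from the question; the rest of $seq is made up.
my $adapter = 'ACTCCAGTCACGCCAATATCTCGTAT';
my $seq     = 'TTGA' . $adapter . 'GCCGTCTTCTGCTT' . ('G' x 11) . 'AG';

my %hit;
if ($seq =~ / (?<adapter> \Q$adapter\E ) (?<between> .+? ) (?<g_run> G{10,} ) /x) {
    %hit = %+;    # copy the named captures before another match overwrites them
}

# Print whichever named parts were found, in a fixed order.
print "$_: $hit{$_}\n" for grep { exists $hit{$_} } qw(adapter between g_run);
```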
Re: help with regex
by rnaeye (Friar) on Mar 21, 2019 at 00:49 UTC

    Thank you everyone for trying to help. The closest thing to what I wanted was / .*? G{10,} /x. I managed to solve the problem based on the suggestions. If I reposted the question in more detail, I am sure I would get an exact solution, but that would take a lot of writing. The answers and questions were educational and helpful for me.

