http://qs321.pair.com?node_id=812874

Hena has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I'm trying to clean up a string that contains extra information I don't need. The string comes from a Samtools pileup; it represents a single nucleotide position as read from multiple sources (the base reads column). My problem data follows (note: newlines were added to keep the line from getting too long, they aren't really there :) )

.$.$G$.$.$C$.$.$.$G$.$,$,$,$.$.$.G,G,G...G,G...GG,.G.G...G.G....$.G..G +,........G....,GGGG.,,,..G...,,.G.G..G..G..G .G.GG..GG.G..,,G,,CG,G,GG..GG.$G.GGGG,,..GG...G.,G.GG.,G,G.$...,,.GGGG +GG.GCG..G,,G,.G..G,,,G,.GGGG.,..G...,,,,G,,G..GGGG A.,,,,,.+1GG.G.,,G,...G..GG,.G....+1G..GG..G,,G,,G.G,,.,,,.,,.CG.,,,,. +,..G.,,,.,.,,GGGGGG,,.....G..GGGGG.,.G,,GG.G..GG,, ,....,.,..,G.,.,,,.,,,,G,,,.,.,..,.,,,...GG,.,G.,G......,,,..,,....... +.,..,.,,.,...,,..,.,C,..,,,.,,,,,....,,..,,,.,.... .,.,,.,...,,.,,,-1a.,,,,,.,,,,,,..,..........,,,,,.,...,,.,,^],^].^],^ +],^].^],^F,^],^],^],^],


Now I want to count the bases, so I need to remove all the extra information from the string. I have a regex like the following: s/\+([0-9]+)[ACGTNacgtn]{\1}//g; which unfortunately doesn't do anything. I'd like to know why? I would assume that quantifier cannot be a '\x' variable, but don't really know.

I can do the deed with
while (m/\+([0-9]+)[ACGTNacgtn]/g) { print "diff+: $1\n"; my $m = $1; s/\+[0-9]+[ACGTNacgtn]{$m}// }
But that's not quite so nice. It would be cool to be able to do it with one regex.

Help most appreciated,

Edit: Fixed the capture group in place where it should have been :).

Re: Regex fun
by JavaFan (Canon) on Dec 15, 2009 at 14:26 UTC
    You cannot have a backreference as a quantifier. (Not that you have a capture group to reference to). If you want to do something like that, you may want to try:
    s/\+([0-9]+)(??{ "[ACGTNacgtn]{$1}" })//g;
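A minimal runnable sketch of the (??{ ... }) form above. The sample string is made up for illustration (it is not the pileup line from the question), and $reads and $cleaned are hypothetical names:

```perl
use strict;
use warnings;

# Hypothetical sample, not the pileup string from the question.
my $reads = '.,+2AGG.,+1cT.+10ACGTACGTACGG';

# (??{ ... }) runs after the digits have been captured, so $1 is already
# set when the character class and its quantifier are built.
(my $cleaned = $reads) =~ s/\+([0-9]+)(??{ "[ACGTNacgtn]{$1}" })//g;

print "$cleaned\n";   # prints ".,G.,T.GG"
```

Each "+N" marker is removed together with exactly N following bases; any bases after that are kept.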
      So it was as I feared.

      The code looks interesting. I suppose it's either this or what I posted earlier. I wonder if using (??{...}) hurts readability, though mine isn't much better I suppose.
Re: Regex fun
by moritz (Cardinal) on Dec 15, 2009 at 14:28 UTC
    which unfortunately doesn't do anything. I'd like to know why?
    Three things are wrong: First of all there's no digit in your input, so [0-9] can't match. Secondly you don't capture the result from [0-9]+, so matching against \1 is bound to fail. Thirdly you can't use a capture as a quantifier, as you already guessed.

    Update: Just found that there are actually some 1s in the input. If all sequences are short, you can just generate them:

    my $re = join '|', map { qr/\+${_}[ACGTNacgtn]{$_}/ } 1..20; s/$re//g;

    Should still be fairly efficient, especially with perl-5.10, which optimizes regex alternations with constant prefixes.

    Second update: Added a missing \+ JadeNB++ for noticing.
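A small sketch of the generated-alternation idea, assuming no insertion is longer than 20 bases (the bound is arbitrary) and using a made-up sample string. The pattern is assembled by plain string concatenation so that "$_" followed by "[" cannot be misread as an element of @_; $max_len, $insertion_re, $reads and $cleaned are illustrative names:

```perl
use strict;
use warnings;

# Assumption: insertions are never longer than this.
my $max_len = 20;

# One alternative per length: \+1[...]{1} | \+2[...]{2} | ...
my $alternatives = join '|',
    map { '\+' . $_ . '[ACGTNacgtn]{' . $_ . '}' } 1 .. $max_len;
my $insertion_re = qr/$alternatives/;

# Hypothetical sample, not the pileup string from the question.
my $reads = 'G.+1AG,+3acgT.,+12ACGTACGTACGTN';
(my $cleaned = $reads) =~ s/$insertion_re//g;

print "$cleaned\n";   # prints "G.G,T.,N"
```

The order of the alternatives does not matter here: the "+1" branch cannot match "+12..." because the character after "+1" would have to be a base, not a digit.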

    Perl 6 - links to (nearly) everything that is Perl 6.
Re: Regex fun
by AnomalousMonk (Archbishop) on Dec 15, 2009 at 15:43 UTC

    Here's an approach that avoids the scary  (?{ code }) and  (??{ code }) regex constructs, but whether it's more readable than your  while loop is another question.

    >perl -wMstrict -le
    "my $s = '_+0GAA__+1GAA__+2GGAA__+3GGGAA_';
     print qq{'$s'};
     $s =~ s{ ( \+ (\d+) [ACGT]+ ) }
            { (my $r = $1) =~ s{ \+ \d+ [ACGT]{$2} }{}xms; $r }xmsge;
     print qq{'$s'};
    "
    '_+0GAA__+1GAA__+2GGAA__+3GGGAA_'
    '_GAA__AA__AA__AA_'

    Note that the  '+0ACGT' sequence is handled in a way that seems consistent with your original regex, but that I don't know to be correct.

    Update: Here's a version using substr that might be a bit more readable.

    >perl -wMstrict -le
    "my $s = '_+0GAA__+1GAA__+2GGAA__+3GGGAA_';
     print qq{'$s'};
     $s =~ s{ ( \+ (\d+) [ACGT]+ ) }
            { substr $1, 1 + length($2) + $2 }xmsge;
     print qq{'$s'};
    "
    '_+0GAA__+1GAA__+2GGAA__+3GGGAA_'
    '_GAA__AA__AA__AA_'
Re: Regex fun
by JadeNB (Chaplain) on Dec 15, 2009 at 17:58 UTC
    I can do the deed with
    while (m/\+([0-9]+)[ACGTNacgtn]/g) { print "diff+: $1\n"; my $m = $1; s/\+[0-9]+[ACGTNacgtn]{$m}// }
    But that's not quite so nice.

    I understand that you meant “It's not so nice because I'd like a single regex”, but it's also not so nice because you're doing some work (matching the number and a single base) twice. You might prefer something like

    s/\G\+$1[$bases]{$1}// while /(?=\+([0-9]+))/g;
    which is at least 1 line, if 2 regexes. :-)

    Here's a fairly naughty single-regex approach:

    1 while s/(?<=\+)([0-9]+)[$bases]/$1 - 1/eg;

    UPDATE: Changed the patterns to use look-around. Your version and my first will both loop forever on a mal-formed string like +2G, whereas the second one will just reduce it to +1 and terminate.
    UPDATE: As Hena points out, I forgot a base case in my induction! The following fixes it (at least if no +0 strings are allowed), but loses a lot of the fun:

    1 while s/\+([0-9]+)[aAgGcC]/$1 > 1 ? '+' . ($1 - 1) : ''/eg;

      I would like single regex as, as was pointed out, the double regex is wasteful since the matching has to be done twice. Also I think it would be better from readability point of view :).

      However, your example 1 while s/(?<=\+)([0-9]+)[$bases]/$1 - 1/eg; doesn't quite work. I'm trying to remove all the numbers and their associated bases. Running yours left strange things in it (+-1 and +0 below).
      ..G..C...G.,,,...G,G,G...G,G...GG,.G.G...G.G.....G..G,........G....,GG +GG.,,,..G...,,.G.G..G..G..G.G.GG..GG.G..,,G, ,CG,G,GG..GG.G.GGGG,,..GG...G.,G.GG.,G,G....,,.GGGGGG.GCG..G,,G,.G..G, +,,G,.GGGG.,..G...,,,,G,,G..GGGGA.,,,,,.+-1.G.,,G,. ..G..GG,.G....+0..GG..G,,G,,G.G,,.,,,.,,.CG.,,,,.,..G.,,,.,.,,GGGGGG,, +.....G..GGGGG.,.G,,GG.G..GG,,,....,.,..,G.,.,,,.,, ,,G,,,.,.,..,.,,,...GG,.,G.,G......,,,..,,........,..,.,,.,...,,..,.,C +,..,,,.,,,,,....,,..,,,.,.....,.,,.,...,,.,,,-1a.,,,,,.,,,,,,..,..... +.....,,,,,.,...,,.,,,.,,.,,,,,,


      Edit: I think adding /c in the match pattern should fix the problem with my while and possibly the (??{...}) version as well?
        I would like single regex as, as was pointed out, the double regex is wasteful since the matching has to be done twice.

        The 2-regex version that I proposed avoids a lot of the double matching (it converts 2 number searches into 1 number search and then a hunt for a literal string). However, only benchmarking (which I'm too lazy to do) will show whether it's actually faster.

        If s/// set pos (and behaved like m// in a while loop), then one could avoid any doubled effort at all:

        s/\G\+$1[$bases]{$1}// while s/\+([0-9]+)//g;
        (UPDATE: but note that this is fanciful, non-working code). In fact, that's what I had originally, until I tested it and discovered that it didn't work.

        Running yours left strange things in it (+-1 and +0 below).

        Oops, sorry! My original regex wasn't smart enough to stop grabbing bases once the count indicated that there were no more needed. I've fixed it, but, since the only point was its brevity (it's frightfully inefficient), it's not much fun any more.

        I think adding /c in the match pattern should fix the problem with my while and possibly the (??{...}) version as well?

        I'm not sure what you mean. I am pretty sure (but can't find the documentation) that /c only has an effect on the semantics of failed matches, and those aren't our problem here. Notice that (??{ }) doesn't have a problem to be fixed—the application you have in mind is essentially exactly for what that escape was designed (I assume), and it doesn't require any extra trickery.

Re: Regex fun
by ikegami (Patriarch) on Dec 15, 2009 at 20:39 UTC

    I would assume that quantifier cannot be a '\x' variable, but don't really know.

    It's simpler than that: The quantifier cannot be variable. I presume it could be, but no one's done the work.

      It's simpler than that: The quantifier cannot be variable.
      This must mean something other than what it seems (to me) to mean:
      $ perl -E 'my $quant = 2; "ab" =~ /.{$quant}/ and say "Matched"' Matched
      shows that the ‘length’ of the quantifier can be given by a variable.

        I didn't say you couldn't build regexps dynamically. I said the quantifier can't be variable. Perl regular expressions don't even have variables, so you couldn't possibly have shown one being used.

        You can't have a quantifier until you have a regexp, and you don't have a regexp until you interpolate anything that needs to be interpolated.

        What's the quantifier in the following?

        $ perl -E 'my $x = "2"; "ab" =~ /.{$x}/ and say "Matched"' Matched

        Are you saying it's different in this one?

        $ perl -E 'my $x = "{2}"; "ab" =~ /.$x/ and say "Matched"' Matched

        What about this one?

        $ perl -E 'my $x = "2}"; "ab" =~ /.{$x/ and say "Matched"' Matched

        In all cases, the quantifier is {2}. No variable is involved. Sure, the regexp is produced from a variable, but that has nothing to do with the quantifier.

        Update: Clarified.
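A short sketch of that distinction, using hypothetical names ($count, $re): the variable is interpolated once, when the pattern is built, so the compiled regexp contains a fixed quantifier and later changes to the variable are not seen.

```perl
use strict;
use warnings;

my $count = 2;
my $re    = qr/.{$count}/;   # $count is interpolated here, once

$count = 5;                  # changing the variable later does not touch $re

my $matches_two  = 'ab'    =~ /\A$re\z/ ? 1 : 0;
my $matches_five = 'abcde' =~ /\A$re\z/ ? 1 : 0;

print "matches 'ab': $matches_two\n";       # prints "matches 'ab': 1"
print "matches 'abcde': $matches_five\n";   # prints "matches 'abcde': 0"
```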

Re: Regex fun
by JadeNB (Chaplain) on Dec 15, 2009 at 19:01 UTC
    I would assume that quantifier cannot be a '\x' variable, but don't really know.

    I think it's important to note that \1 is not a variable * (which is why you can't use it outside of a regex); the variable that contains the contents of the first capture group is $1, but that doesn't take on its new value *** until the capture has completed **.

    I think that the reason that ($rx){\1} isn't allowed is that the regex engine wants to compile the regex before running it. Since the contents of \1, hence the number of times that $rx is supposed to be captured, aren't known until run-time, this interferes with the compilation. For example, /\+32767.{32767}/ is rejected at compile time, but a '+32767' =~ /\+([0-9]*).{\1}/ construct would circumvent this restriction. (“Why, then,” you ask, “is something like /(.)\1/, which suffers from the same compilation problem, OK?” I dunno. :-) )

    * Not a Perl variable, anyway. See Re^3: Regex fun, and probably Re^2: Regex fun as well.
    ** Except that (?{ print $1 }) works correctly, which is somewhat miraculous to me and very very helpful for debugging regexes.
    UPDATE: *** Still false (see Re^6: Regex fun for where realisation finally dawns). It takes on its new value as soon as the capture completes (which explains the miracle referenced above); it's just that the interpolation in the text of the regex has already happened, so that the quantifier doesn't ‘see’ the new value.
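A small sketch of the point in the last footnote, assuming a reasonably recent Perl (5.18 or later, where code blocks see enclosing lexicals reliably). @seen and $matched are hypothetical names. The code block runs in the middle of the match and already sees the captured count, even though a {$1} written in the pattern text would have been interpolated before matching started:

```perl
use strict;
use warnings;

my @seen;   # values of $1 observed from inside the pattern

# The count is captured first; the (?{ ... }) block then runs mid-match,
# before the two bases have been consumed.
my $matched = '+2AC' =~ /\+([0-9]+)(?{ push @seen, $1 })[ACGT]{2}/;

print "matched: ", ($matched ? 1 : 0), ", \$1 seen mid-match: @seen\n";
# prints "matched: 1, $1 seen mid-match: 2"
```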

      I think it's important to note that \1 is not a variable (which is why you can't use it outside of a regex);
      But you can, sometimes, use it in the replacement part.
      I think it's important to note that \1 is not a variable (which is why you can't use it outside of a regex); the variable that contains the contents of the first capture group is $1, but that's empty until the capture has completed.
      But in /([0-9]+){$1}/, the first capture is completed before the quantifier. So, that's not the reason.
      For example, /\+32767.{32767}/ is rejected at compile time
      Yes, but that's considered a bug. It's a restriction that should have been removed after the regexp engine was no longer recursive.
      “Why, then,” you ask, “is something like /(.)\1/, which suffers from the same compilation problem, OK?”
      That's not the same problem. {...} is one of the mini-languages inside regular expressions. Compare it with [...]. [\1] doesn't refer back to something else either.

      But one can defer a subpattern. The syntax is (??{ }). This is what the OP wants, and this is what the OP ought to use.

        But you can, sometimes, use it in the replacement part.
        Sure, but you're not supposed to: Warning on \1 Instead of $1.
        But in /([0-9]+){$1}/, the first capture is completed before the quantifier. So, that's not the reason.
        Sorry, I don't understand—not the reason for what?
        It's a restriction that should have been removed after the regexp engine was no longer recursive.
        Sorry, I don't understand this, either. Do you mean ‘re-entrant’? (UPDATE: Nope, just my internals-ignorance revealed. Thanks, ikegami!)
Re: Regex fun
by Hena (Friar) on Dec 15, 2009 at 14:48 UTC
    Thanks for all the help. Now I know that a captured value cannot be used as a quantifier :).
Re: Regex fun
by Anonymous Monk on Dec 15, 2009 at 14:33 UTC
    Something like this seems to work for printing the next \d+ characters after a numerical match:

    /(\d+)(?{$' =~ m!(.{$1})!; print "$1\n"})/

    Is this really the road you want to go down?