http://qs321.pair.com?node_id=812913


in reply to Regex fun

I can do the deed with
while (m/\+([0-9]+)[ACGTNacgtn]/g) { print "diff+: $1\n"; my $m = $1; s/\+[0-9]+[ACGTNacgtn]{$m}// }
But that's not quite so nice.

I understand that you meant “It's not so nice because I'd like a single regex”, but it's also not so nice because you're doing some work (matching the number and a single base) twice. You might prefer something like

s/\G\+$1[$bases]{$1}// while /(?=\+([0-9]+))/g;
which is at least 1 line, if 2 regexes. :-)

Here's a fairly naughty single-regex approach:

1 while s/(?<=\+)([0-9]+)[$bases]/$1 - 1/eg;

UPDATE: Changed the patterns to use look-around. Your version and my first will both loop forever on a malformed string like +2G, whereas the second one will just reduce it to +1 and terminate.
UPDATE: As Hena points out, I forgot a base case in my induction! The following fixes it (at least if no +0 strings are allowed), but loses a lot of the fun:

1 while s/\+([0-9]+)[aAgGcC]/$1 > 1 ? '+' . ($1 - 1) : ''/eg;
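For anyone who wants to try the countdown idea, here is a minimal sketch wrapped in a sub. The name strip_insertions and the sample strings are made up for illustration, and the character class is taken from the pattern in the original question rather than from $bases. The subtraction is parenthesized because `.` and binary `-` share the same precedence in Perl and associate left to right.

```perl
use strict;
use warnings;

# Hypothetical helper: peel one base off each +N run per pass until no
# "+N<base>" remains. The character class is an assumption copied from
# the question's pattern.
sub strip_insertions {
    my ($s) = @_;
    # Replace "+N<base>" with "+(N-1)", or with nothing once N reaches 1.
    1 while $s =~ s/\+([0-9]+)[ACGTNacgtn]/$1 > 1 ? '+' . ($1 - 1) : ''/eg;
    return $s;
}

print strip_insertions('..+2AC,.+1t,G'), "\n";   # prints "..,.,G"
print strip_insertions('+2G'), "\n";             # malformed input: prints "+1"
```

Each pass rescans the whole string, so this stays a curiosity rather than something to run on large pileups.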

Re^2: Regex fun
by Hena (Friar) on Dec 16, 2009 at 07:51 UTC
    I would like a single regex because, as was pointed out, the double regex is wasteful: the matching has to be done twice. I also think it would be better from a readability point of view :).

    However, your example 1 while s/(?<=\+)([0-9]+)[$bases]/$1 - 1/eg; doesn't quite work. I'm trying to remove all the numbers and their associated bases. Running yours left strange things in it (+-1 and +0 below).
    ..G..C...G.,,,...G,G,G...G,G...GG,.G.G...G.G.....G..G,........G....,GG +GG.,,,..G...,,.G.G..G..G..G.G.GG..GG.G..,,G, ,CG,G,GG..GG.G.GGGG,,..GG...G.,G.GG.,G,G....,,.GGGGGG.GCG..G,,G,.G..G, +,,G,.GGGG.,..G...,,,,G,,G..GGGGA.,,,,,.+-1.G.,,G,. ..G..GG,.G....+0..GG..G,,G,,G.G,,.,,,.,,.CG.,,,,.,..G.,,,.,.,,GGGGGG,, +.....G..GGGGG.,.G,,GG.G..GG,,,....,.,..,G.,.,,,.,, ,,G,,,.,.,..,.,,,...GG,.,G.,G......,,,..,,........,..,.,,.,...,,..,.,C +,..,,,.,,,,,....,,..,,,.,.....,.,,.,...,,.,,,-1a.,,,,,.,,,,,,..,..... +.....,,,,,.,...,,.,,,.,,.,,,,,,


    Edit: I think adding /c in the match pattern should fix the problem with my while and possibly the (??{...}) version as well?
      I would like a single regex because, as was pointed out, the double regex is wasteful: the matching has to be done twice.

      The 2-regex version that I proposed avoids a lot of the double matching (it converts 2 number searches into 1 number search and then a hunt for a literal string). However, only benchmarking (which I'm too lazy to do) will show whether it's actually faster.
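A rough way to check this is the core Benchmark module. The sketch below is not the exact code from this thread: both subs are simplified loops that restart from the beginning of the string (no \G or pos tricks), the names rematch_number and literal_number are invented, and the test line is synthetic. The only difference between them is whether the second regex searches for the number again or looks for it as a literal string.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Variant 1: the substitution matches the number a second time with [0-9]+.
sub rematch_number {
    my ($s) = @_;
    while ($s =~ /\+([0-9]+)[ACGTNacgtn]/) {
        my $n = $1;
        # "or last" stops the loop on malformed input such as "+2G"
        $s =~ s/\+[0-9]+[ACGTNacgtn]{$n}// or last;
    }
    return $s;
}

# Variant 2: the substitution looks for the captured number as a literal.
sub literal_number {
    my ($s) = @_;
    while ($s =~ /\+([0-9]+)/) {
        my $n = $1;
        $s =~ s/\+${n}[ACGTNacgtn]{$n}// or last;
    }
    return $s;
}

# Synthetic input (an assumption): 200 copies of a short pileup fragment.
my $line = '.,+2AC,.G+1t..' x 200;

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    rematch_number => sub { rematch_number($line) },
    literal_number => sub { literal_number($line) },
});
```

The numbers will depend on the Perl build and on how the real data is distributed, so treat any result from the synthetic line as a hint only.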

      If s/// set pos (and behaved like m// in a while loop), then one could avoid any doubled effort at all:

      s/\G\+$1[$bases]{$1}// while s/\+([0-9]+)//g;
      (UPDATE: but note that this is fanciful, non-working code). In fact, that's what I had originally, until I tested it and discovered that it didn't work.

      Running yours left strange things in it (+-1 and +0 below).

      Oops, sorry! My original regex wasn't smart enough to stop grabbing bases once the count indicated that there were no more needed. I've fixed it, but, since the only point was its brevity (it's frightfully inefficient), it's not much fun any more.

      I think adding /c in the match pattern should fix the problem with my while and possibly the (??{...}) version as well?

      I'm not sure what you mean. I am pretty sure (but can't find the documentation) that /c only affects the semantics of failed matches, and those aren't our problem here. Notice that (??{ }) doesn't have a problem to be fixed: the application you have in mind is essentially exactly what that escape was designed for (I assume), and it doesn't require any extra trickery.
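To make the (??{ }) suggestion concrete, here is a minimal sketch. The sub name strip_indels and the sample strings are invented, and accepting both + and - is an assumption about the data. The postponed subexpression builds the quantified character class from the count captured in $1 while the match is in progress, so a single pass removes each count together with exactly that many bases.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every "+N" or "-N" together with the N bases
# that follow it, in one substitution.
sub strip_indels {
    my ($s) = @_;
    # (??{ ... }) returns a pattern string that is compiled and matched at
    # this point; here it becomes e.g. "[ACGTNacgtn]{2}" when $1 is 2.
    $s =~ s/[+-]([0-9]+)(??{ '[ACGTNacgtn]{' . $1 . '}' })//g;
    return $s;
}

print strip_indels('.+2ACG,-1t.'), "\n";   # prints ".G,."
print strip_indels('+2G'), "\n";           # too few bases, left alone: "+2G"
```

Because there is no loop, a malformed string such as +2G simply fails to match and is left untouched.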

        It still didn't work as needed: it didn't remove the bases after the +N. However, it helped me a lot, and I managed to get a version that does work as I want (the prints are just there to confirm that what happens is what should happen). I also added the || last to prevent the infinite loop you mentioned.

        while (m/[+-]([0-9]+)/g) { printf "pos ($1): %d -> ",pos($_)-1; pos($_) -= length($1)+1; printf "pos ($1): %d\n",pos($_)-1; s/\G[+-]$1[ACGTNacgtn]{$1}// || last; }
        So thanks for helping me out on this :).

        On the whole, though, I think that (??{...}) would be the best choice, as repositioning pos() is probably not a good idea in general. Note that I included the negative as well as the positive match here, as that removes the need for the next regex I have :).