PerlMonks  

Shift regex search back 4 characters...

by tj999 (Novice)
on Mar 04, 2017 at 03:54 UTC

tj999 has asked for the wisdom of the Perl Monks concerning the following question:

I have a file containing a long string of three digit numbers separated by commas. Like 123,123,234,345,456,654,543....

For my purposes I need to check each three-digit number for a match, then extract the three-digit number that follows if a match is found.

My code looks like this: (in this example I am looking for the number 222 then getting the three digit number that follows after a comma)

while (<INFILE>) {
    while (/222,(\d\d\d),/g) {
        print OUTFILE "\nAFTER 222-$1";
    }
}

It works as expected, but after finding a match it appears that Perl begins searching again AFTER the second three-digit number. So if the string was 222,222,123 it would return the first result as 222 following 222, but then it starts searching again at the third number, 123. I want it to also capture the second match, where 123 follows the second 222. What I am hoping to do is have the search move back 4 characters after finding a match. Hopefully this explanation makes sense?

Thanks in advance to all who provide suggestions or advice. It is much appreciated. TJ.

Replies are listed 'Best First'.
Re: Shift regex search back 4 characters...
by Athanasius (Archbishop) on Mar 04, 2017 at 06:54 UTC

    Hello tj999, and welcome to the Monastery!

    Just to elaborate on tybalt89’s ++answer: the key here is the use of a positive lookahead assertion. Like other lookaround assertions, this is zero-width, so it consumes no characters and has no effect on where the regex engine starts looking for the next match during a global search (i.e., when the regex has a /g modifier, whether it is iterated in scalar context, as in your while loop, or evaluated in list context). Lookahead assertions are documented in the lookaround assertions section of perlre.

    BTW, note that tybalt89’s solution omits the final comma from the regex. With the comma included, your regex will not match the 999 in a string such as 123,222,456,222,222,111,222,999.
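The effect of the trailing comma can be checked directly. The following is only a sketch: `followers` is a hypothetical helper name, not something from this thread, and both patterns use the lookahead form so that the comma is the only difference between them.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return every capture that
# the pattern $re produces when applied globally to $str.
sub followers {
    my ($str, $re) = @_;
    return $str =~ /$re/g;    # list context with /g returns all captures
}

my $str = '123,222,456,222,222,111,222,999';

# The lookahead requires a comma after the captured number.
my @with_comma = followers($str, qr/222,(?=(\d\d\d),)/);

# The lookahead does not require a trailing comma.
my @without_comma = followers($str, qr/222,(?=(\d\d\d))/);

print "with comma:    @with_comma\n";       # 456 222 111
print "without comma: @without_comma\n";    # 456 222 111 999
```

Only the second pattern reports the final 999, because nothing follows it in the string.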

    Update 1: An illustration may make things clearer. Say your search string is "222,222,111", and the regex is /222,(\d\d\d)/g. The regex engine begins its search at the first character:

    222,222,111
    ^
    =======      <-- 1st match: 222,222   Capture: 222

    and finds a match. Then the search for the next match begins at the character immediately following the end of the previous match:

    222,222,111
           ^

    Not finding a match here, it moves forward one character:

    222,222,111
            ^

    and finds no match; and so on, one character at a time, to the end of the string.

    But if the regex has a lookahead assertion, /222,(?=(\d\d\d))/g, the search for a second match again begins one character beyond the end of the previous match, but this time the lookahead assertion itself is not counted as a part of that match, so the regex engine starts looking here:

    222,222,111
        ^
        =======  <-- 2nd match: 222,111   Capture: 111

    and finds the second match. Note that the lookahead assertion has effectively[1] shifted the regex search back by 3 characters, not 4 as implied by the title of this thread: a small point, but perhaps useful in helping to clarify what is going on.

    Updates 2 & 3: Re-wrote Update 1 to fix various errors.

    [1] Update 4: As AnomalousMonk notes, it would be more accurate “to say that a zero-width assertion does not move the search position at all, even if it captures.”
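One way to watch this happen is to print pos() after each match. This is a minimal sketch; `trace_matches` is a hypothetical helper introduced here for illustration, and it records each capture together with the offset at which the next search will start.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): apply $re globally to $str and
# record "capture@offset" for each match, where offset is pos() after the
# match, i.e. the place where the next search begins.
sub trace_matches {
    my ($str, $re) = @_;
    my @trace;
    while ($str =~ /$re/g) {
        push @trace, sprintf('%s@%d', $1, pos($str));
    }
    return @trace;
}

my $str = '222,222,111';

my @consuming = trace_matches($str, qr/222,(\d\d\d)/);
my @lookahead = trace_matches($str, qr/222,(?=(\d\d\d))/);

print "consuming: @consuming\n";    # 222@7
print "lookahead: @lookahead\n";    # 222@4 111@8
```

With the consuming pattern the search resumes at offset 7 and finds nothing more. With the lookahead it resumes at offset 4, three characters earlier, and the second match is found.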

    Hope that helps,

    Athanasius <°(((>< contra mundum
    Iustus alius egestas vitae, eros Piratica,

Re: Shift regex search back 4 characters...
by tybalt89 (Monsignor) on Mar 04, 2017 at 04:07 UTC
    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1183627
    use strict;
    use warnings;

    $_ = '222,345,222,678,222,222,543,111';
    while ( /222,(?=(\d\d\d))/g ) {
        print "AFTER 222-$1\n";
    }
      Thank you for your reply!!!
Re: Shift regex search back 4 characters...
by kcott (Archbishop) on Mar 04, 2017 at 09:41 UTC

    G'day tj999,

    Welcome to the Monastery.

    Before seeing other responses, my first thought was a lookbehind assertion:

    /(?<=222,)(\d{3})/g

    Having seen other solutions, I think this is clearer than embedding a capture in an assertion.

    Here are some tests using your example string and the two others used in earlier replies.

    $ perl -E 'say for "222,222,123" =~ /(?<=222,)(\d{3})/g'
    222
    123
    $ perl -E 'say for "222,345,222,678,222,222,543,111" =~ /(?<=222,)(\d{3})/g'
    345
    678
    222
    543
    $ perl -E 'say for "123,222,456,222,222,111,222,999" =~ /(?<=222,)(\d{3})/g'
    456
    222
    111
    999

    — Ken

      I prefer your lookbehind solution, but I expect tybalt89 will prefer the lookahead because it solves the problem the way he expected ("move back 4 characters after finding a match")
      Bill

        I doubt that tybalt89 expected any sort of backward movement because I think he or she understands the point illustrated by Athanasius here that a zero-width assertion simply does not move the match start-position at all even if it should happen to capture something. Just for grins, here's a solution that actually does move the match position backwards, although I don't think it will be much to your taste:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "$_ = '222,345,222,678,222,223,224,222,543';
         ;;
         1 while
             m{ 22[234] , (*MARK:BACK3) (\d\d\d)
                (?{ print qq{'$^N'} })
                (*SKIP:BACK3) (*FAIL)
              }xmsg;
        "
        '345'
        '678'
        '223'
        '224'
        '222'
        '543'
        Please see Special Backtracking Control Verbs in perlre from Perl version 5.10 onward.

        BTW: One reason tybalt89's first thought was for a look-ahead may have been that Perl's regex engine does not support variable-width look-behind, so using a look-behind may simply be a prelude to a re-write that switches to using a look-ahead when a need for some variability is discovered.
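As a sketch of that kind of rewrite, suppose the prefix becomes "any number beginning with 22", of any length. This requirement is an assumption made up for illustration, and `after_22_prefix` is a hypothetical helper name. The prefix is now variable-width, so it cannot simply be placed in a look-behind, but the lookahead form needs no change in structure:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the number following
# each number that begins with "22", whatever that number's length.
sub after_22_prefix {
    my ($str) = @_;
    # \b22\d*,   variable-width prefix, consumed normally
    # (?=(\d+))  following number, captured without being consumed,
    #            so it can itself serve as the next prefix
    return $str =~ /\b22\d*,(?=(\d+))/g;
}

print join(' ', after_22_prefix('22,345,2222,678,122,999')), "\n";    # 345 678
print join(' ', after_22_prefix('22,222,5')), "\n";                   # 222 5
```

The second call shows that overlapping matches still work: 222 is reported as a follower and then used as a prefix for 5.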


        Give a man a fish:  <%-{-{-{-<

Re: Shift regex search back 4 characters...
by tybalt89 (Monsignor) on Mar 04, 2017 at 22:39 UTC

    In the spirit of TIMTOWTDI, here is one way to actually move the search point back.

    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1183627
    use strict;
    use warnings;

    $_ = '222,345,222,678,222,222,543,111';
    while ( /222,(\d\d\d)/g ) {
        print "AFTER 222-$1\n";
        pos($_) -= 4;    # actually move back 4 characters
    }

      A variation that doesn't require foreknowledge of the backward step length:

      c:\@Work\Perl\monks>perl -wMstrict -le
      "$_ = '222,345,222,678,222,223,224,543,111,222';
       ;;
       while ( /22[2345],(\d\d\d)/g ) {
           print qq{after 22n: '$1'};
           pos = $-[1];
       }
      "
      after 22n: '345'
      after 22n: '678'
      after 22n: '223'
      after 22n: '224'
      after 22n: '543'


      Give a man a fish:  <%-{-{-{-<

Re: Shift regex search back 4 characters...
by ablanke (Monsignor) on Mar 08, 2017 at 10:54 UTC
    Hi,
    Maybe I'm simplifying your requirements too much, or it's a matter of performance, but you could solve this by splitting the numbers.
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $numbers = '222,345,222,678,222,222,543,111,222';
    my @numbers = split(',', $numbers);
    my $i = 1;
    for my $number (@numbers) {
        if ( '222' eq $number && $i < scalar(@numbers) ) {
            print "AFTER $number-$numbers[$i]\n";
        }
        $i++;
    }
