tj999 has asked for the wisdom of the Perl Monks concerning the following question:
I have a file containing a long string of three digit numbers separated by commas. Like 123,123,234,345,456,654,543....
For my purposes I need to check each three-digit number for a match, then extract the three-digit number that follows if a match is found.
My code looks like this: (in this example I am looking for the number 222 then getting the three digit number that follows after a comma)
while(<INFILE>) {
while (/222,(\d\d\d),/g)
{
print OUTFILE "\nAFTER 222-$1";
}
}
It works as expected, but after finding a match it appears that Perl begins searching again AFTER the second three-digit number. So if the string was 222,222,123 it would return the first result as 222 following 222, but then it starts searching again at the third number, 123. I want it to also capture the second match, where 123 follows the second 222. What I am hoping to do is have the search move back 4 characters after finding a match. Hopefully this explanation makes sense?
Thanks in advance to all who provide suggestions or advice. It is much appreciated. TJ.
Re: Shift regex search back 4 characters...
by Athanasius (Archbishop) on Mar 04, 2017 at 06:54 UTC
Hello tj999, and welcome to the Monastery!
Just to elaborate on tybalt89’s ++answer: the key here is the use of a positive lookahead assertion. Like other lookaround assertions, this is zero-width, so it is ignored when the regex engine is working out where to start looking for the next match during a global search (i.e., when the regex has a /g modifier, whether it is stepped through in scalar context, as in your while loop, or evaluated in list context). Here are some references on lookahead assertions:
BTW, note that tybalt89’s solution omits the final comma from the regex. With the comma included, your regex will not match the 999 in a string such as 123,222,456,222,222,111,222,999.
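A minimal sketch of that difference, using the string above. Both patterns use the lookahead so that only the trailing comma differs; the array names @with_comma and @without_comma are just illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = '123,222,456,222,222,111,222,999';

# Trailing comma required after the captured number:
# the final 999 has no comma after it, so it is missed.
my @with_comma;
push @with_comma, $1 while $s =~ /222,(?=(\d\d\d),)/g;

# No trailing comma required: the final 999 is found too.
my @without_comma;
push @without_comma, $1 while $s =~ /222,(?=(\d\d\d))/g;

print "with comma:    @with_comma\n";      # with comma:    456 222 111
print "without comma: @without_comma\n";   # without comma: 456 222 111 999
```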
Update 1: An illustration may make things clearer. Say your search string is "222,222,111", and the regex is /222,(\d\d\d)/g. The regex engine begins its search at the first character:
222,222,111
^
======= <-- 1st match: 222,222 Capture: 222
and finds a match. Then the search for the next match begins at the character immediately following the end of the previous match:
222,222,111
       ^
Not finding a match here, it moves forward one character:
222,222,111
        ^
and finds no match; and so on, one character at a time, to the end of the string.
But if the regex has a lookahead assertion, /222,(?=(\d\d\d))/g, the lookahead itself is not counted as a part of the first match, which therefore consumes only "222,". The search for a second match again begins immediately after the end of the previous match, so this time the regex engine starts looking here:
222,222,111
    ^
    =======  <-- 2nd match: 222,111 Capture: 111
and finds the second match. Note that the lookahead assertion has, in effect,[1] shifted the regex search back by 3 characters, not 4 as implied by the title of this thread: a small point, but perhaps useful in helping to clarify what is going on.
Updates 2 & 3: Re-wrote Update 1 to fix various errors.
[1] Update 4: As AnomalousMonk notes, it would be more accurate “to say that a zero-width assertion does not move the search position at all, even if it captures.”
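One way to watch this happen is to record pos() after each match. The following is only a sketch using the search string from Update 1; the helper name trace_matches is hypothetical and not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns "capture@pos" for each successful match,
# where pos is the offset at which the next search will begin.
sub trace_matches {
    my ($string, $regex) = @_;
    my @trace;
    while ($string =~ /$regex/g) {
        push @trace, "$1\@" . pos($string);
    }
    return @trace;
}

my $s = '222,222,111';

# The capture is consumed, so the next search starts at offset 7.
my @consuming = trace_matches($s, qr/222,(\d\d\d)/);

# The capture sits inside a lookahead, so the next search starts at offset 4.
my @lookahead = trace_matches($s, qr/222,(?=(\d\d\d))/);

print "consuming: @consuming\n";   # consuming: 222@7
print "lookahead: @lookahead\n";   # lookahead: 222@4 111@8
```

The difference between offsets 7 and 4 after the first match is the 3-character "shift" described above.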
Hope that helps,
Re: Shift regex search back 4 characters...
by tybalt89 (Monsignor) on Mar 04, 2017 at 04:07 UTC
#!/usr/bin/perl
# http://perlmonks.org/?node_id=1183627
use strict;
use warnings;
$_ = '222,345,222,678,222,222,543,111';
while( /222,(?=(\d\d\d))/g )
{
print "AFTER 222-$1\n";
}
Thank you for your reply!!!
Re: Shift regex search back 4 characters...
by kcott (Archbishop) on Mar 04, 2017 at 09:41 UTC
G'day tj999,
Welcome to the Monastery.
Before seeing other responses, my first thought was a lookbehind assertion:
/(?<=222,)(\d{3})/g
Having seen other solutions, I think this is clearer than embedding a capture in an assertion.
Here are some tests using your example string and the two others used in earlier replies.
$ perl -E 'say for "222,222,123" =~ /(?<=222,)(\d{3})/g'
222
123
$ perl -E 'say for "222,345,222,678,222,222,543,111" =~ /(?<=222,)(\d{3})/g'
345
678
222
543
$ perl -E 'say for "123,222,456,222,222,111,222,999" =~ /(?<=222,)(\d{3})/g'
456
222
111
999
I prefer your lookbehind solution, but I expect tybalt89 will prefer the lookahead because it solves the problem the way he expected ("move back 4 characters after finding a match")
I doubt that tybalt89 expected any sort of backward movement because I think he or she understands the point illustrated by Athanasius here that a zero-width assertion simply does not move the match start-position at all even if it should happen to capture something. Just for grins, here's a solution that actually does move the match position backwards, although I don't think it will be much to your taste:
c:\@Work\Perl\monks>perl -wMstrict -le
"$_ = '222,345,222,678,222,223,224,222,543';
;;
1 while m{
22[234] , (*MARK:BACK3)
(\d\d\d) (?{ print qq{'$^N'} })
(*SKIP:BACK3) (*FAIL)
}xmsg;
"
'345'
'678'
'223'
'224'
'222'
'543'
Please see Special Backtracking Control Verbs in perlre from Perl version 5.10 onward.
BTW: One reason tybalt89's first thought was for a look-ahead may have been that Perl's regex engine does not support variable-width look-behind, so using a look-behind may simply be a prelude to a re-write that switches to using a look-ahead when a need for some variability is discovered.
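As a rough sketch of the kind of rewrite meant here: suppose the "key" number could be either 22 or any 22n, so the prefix is three or four characters wide. The data and that requirement are assumptions made up for illustration. Older perls reject a variable-width look-behind at compile time, whereas a look-ahead version handles the variable prefix without trouble:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical data: the key may be the two-digit 22 or a three-digit 22n.
my $s = '22,345,222,678,22,222,543';

my @after;
# The variable-width prefix is consumed normally; only the number
# that follows is placed inside the zero-width look-ahead.
push @after, $1 while $s =~ /\b22\d?,(?=(\d+))/g;

print "@after\n";   # 345 678 222 543
```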
Give a man a fish: <%-{-{-{-<
Re: Shift regex search back 4 characters...
by tybalt89 (Monsignor) on Mar 04, 2017 at 22:39 UTC
In the spirit of TIMTOWTDI, here is one way to actually move the search point back.
#!/usr/bin/perl
# http://perlmonks.org/?node_id=1183627
use strict;
use warnings;
$_ = '222,345,222,678,222,222,543,111';
while( /222,(\d\d\d)/g )
{
print "AFTER 222-$1\n";
pos($_) -= 4; # actually move back 4 characters
}
c:\@Work\Perl\monks>perl -wMstrict -le
"$_ = '222,345,222,678,222,223,224,543,111,222';
;;
while ( /22[2345],(\d\d\d)/g ) {
print qq{after 22n: '$1'};
pos = $-[1];
}
"
after 22n: '345'
after 22n: '678'
after 22n: '223'
after 22n: '224'
after 22n: '543'
Give a man a fish: <%-{-{-{-<
Re: Shift regex search back 4 characters...
by ablanke (Monsignor) on Mar 08, 2017 at 10:54 UTC
Hi,
Maybe I'm oversimplifying your requirements, or it may be a matter of performance, but you could solve this by splitting the numbers.
#!/usr/bin/perl
use strict;
use warnings;
my $numbers = '222,345,222,678,222,222,543,111,222';
my @numbers = split(',', $numbers);
my $i = 1;
for my $number (@numbers) {
if ( '222' eq $number && $i < scalar(@numbers) ) {
print "AFTER $number-$numbers[$i]\n";
}
$i++;
}