This will fail in some cases.
Geographically, Guinea is thousands of miles from here. (This fails immediately because of the comma; if the comma were removed, it would still fail.)
If what you want is to match Guinea but not New Guinea or Equatorial Guinea, then what you probably really want is a negative lookbehind assertion that specifically rules out being preceded by "New " or "Equatorial ". Similarly, a negative lookahead assertion at the end can rule out "guinea pig" and "Guinea-Bissau".
If what you want is to match Guinea but not New Guinea or Equatorial Guinea, then what you probably really want is a negative lookbehind assertion that specifically rules out being preceded by "New " or "Equatorial "
One caveat: you can't use alternation inside the look-behind assertion here, because the alternatives have different lengths and variable-length look-behind isn't supported (Perl 5.30 and later accept it only as an experimental feature). Instead, you must list each alternative as its own look-behind assertion. You can, of course, use alternation in the look-ahead assertion.
use strict;
use warnings;
my $pattern = qr{
    (?<!New\s)
    (?<!Equatorial\s)
    Guinea
    (?![\s-](?:Bissau|pig))
}ix;
while (my $text = <DATA>) {
    my $match = $text =~ m/$pattern/ ? 1 : 0;
    print "$match $text";
}
# This prints...
# 0 Papua New Guinea
# 1 I live in Guinea.
# 1 i live in guinea, but i don't have a shift key.
# 0 Guinea-Bissau
# 0 Guinea Bissau
# 0 Equatorial Guinea
# 0 I love guinea pigs!
__DATA__
Papua New Guinea
I live in Guinea.
i live in guinea, but i don't have a shift key.
Guinea-Bissau
Guinea Bissau
Equatorial Guinea
I love guinea pigs!
You are right. However, even this code may fail if somebody misspells the country names.
It is more a linguistic problem than a pattern-recognition one and, as such, seems extraordinarily difficult to tackle in a foolproof way (which would require an AI, syntactic and contextual analysis, etc.).
However, as you mentioned, using negative look-ahead and negative look-behind assertions should allow him to avoid the most common other words.
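To make the misspelling concern concrete, here is a minimal sketch that feeds a few made-up inputs to the same pattern from the earlier reply. The sample strings are my own assumptions, not taken from the original question; they only illustrate how a typo or a stray space can slip past fixed-width assertions.

```perl
use strict;
use warnings;

# Same pattern as in the earlier reply, repeated so this sketch runs on its own.
my $pattern = qr{
    (?<!New\s)
    (?<!Equatorial\s)
    Guinea
    (?![\s-](?:Bissau|pig))
}ix;

# Hypothetical inputs (assumptions for illustration only).
my @samples = (
    'Equitorial Guinea',    # misspelled prefix: the look-behind does not fire
    'New  Guinea',          # two spaces: the fixed-width look-behind misses it
    'Guinea-Bisau',         # misspelled suffix: the look-ahead does not fire
    'I live in Guinnea.',   # misspelled target: nothing matches at all
);

for my $text (@samples) {
    my $match = $text =~ $pattern ? 1 : 0;
    print "$match $text\n";
}
# This prints...
# 1 Equitorial Guinea
# 1 New  Guinea
# 1 Guinea-Bisau
# 0 I live in Guinnea.
```

The first three are false positives and the last is a false negative, so the assertions only help with the spellings you thought to list in advance.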