... multiple names I am attempting to match. ... Just added if statements in the while loop.
If-statement patches are probably OK for one-off or infrequent runs with a small, stable list of city names. For larger lists of cities or more frequent runs, I think I would go with a database.
It's also possible to use a regex/hash approach:
c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
"my @cities = ('windsor riverside', ' new york ', 'philadelphia',);
;;
my $rx_city = build_city_regex(@cities);
print $rx_city;
;;
my %city_digits;
;;
RECORD:
for my $record (
'CA006139520,\"WINDSOR RIVERSIDE, ON CA \",2018-01-02,10',
qq{CA006139520,\" NEW YORK , ON CA \",2018-01-02,987\n},
'CA006139520,\"NEWYORK, ON CA \",2018-01-02,9999',
'CA006139520,\"NEW YORK, ON CA \",2018-01-02,10210',
qq{CA006139520,\"PHILADELPHIA, ON CA \",2018-01-02,76\n},
) {
next RECORD unless
my ($city, $digits) = $record =~ m{ ($rx_city) .* \b (\d+) \Z }xms;
push @{ $city_digits{ canonicalize_city($city) } }, $digits
}
dd \%city_digits;
;;
sub build_city_regex {
my ($regex) =
map qr{ \b (?: $_) \b }xms,
join ' | ',
map { (my $c = $_) =~ s{ \s+ }'\s+'xmsg; $c; }
reverse sort
map canonicalize_city($_),
@_
;
return $regex;
}
;;
sub canonicalize_city {
my ($city_name) = @_;
;;
die qq{bad city: '$city_name'}
if $city_name =~ m{ [^[:alpha:] -] }xms;
$city_name =~ s{ \A \s+ | \s+ \z }''xmsg;
$city_name =~ s{ \s+ }' 'xmsg;
$city_name = uc $city_name;
;;
return $city_name;
}
"
(?msx-i: \b (?: WINDSOR\s+RIVERSIDE | PHILADELPHIA | NEW\s+YORK) \b )
{
"NEW YORK" => [987, 10210],
PHILADELPHIA => [76],
"WINDSOR RIVERSIDE" => [10],
}
Something like this will work even with large lists (thousands!) of city names. However, as I said, if the list is large enough and the runs are frequent enough, it's probably better to use a database.
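One detail that matters more as the list grows is the reverse sort in build_city_regex(). Presumably it is there so that, of two names sharing a prefix, the longer one is tried first in the alternation. Here is a minimal standalone sketch of that effect. The names canon() and alternation() and the 'new york city' entry are made up for illustration and are not part of the code above:

```perl
use strict;
use warnings;

# Hypothetical list: 'new york city' shares a prefix with 'new york'.
my @cities = ('new york', 'new york city', 'philadelphia');

# Same idea as canonicalize_city() above: trim, collapse whitespace, uppercase.
sub canon {
    my ($name) = @_;
    $name =~ s{ \A \s+ | \s+ \z }{}xmsg;
    $name =~ s{ \s+ }{ }xmsg;
    return uc $name;
}

# Build an alternation regex from the names, in exactly the order given.
sub alternation {
    my @alts = map { (my $c = $_) =~ s{ \s+ }{\\s+}xmsg; $c } @_;
    my $alt  = join ' | ', @alts;
    return qr{ \b (?: $alt ) \b }xms;
}

my @canon = map { canon($_) } @cities;

my $rx_sorted   = alternation(reverse sort @canon);  # longer name tried first
my $rx_unsorted = alternation(sort @canon);          # shorter name tried first

my $record = 'CA006139520,"NEW  YORK CITY, ON CA ",2018-01-02,42';

my ($hit_sorted)   = $record =~ m{ ($rx_sorted) }xms;
my ($hit_unsorted) = $record =~ m{ ($rx_unsorted) }xms;

print canon($hit_sorted),   "\n";   # prints NEW YORK CITY
print canon($hit_unsorted), "\n";   # prints NEW YORK
```

With the plain sort, NEW\s+YORK comes first and its trailing \b is satisfied by the space before CITY, so the record would be filed under the wrong city.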
Give a man a fish: <%-{-{-{-<