... multiple names I am attempting to match. ... Just added if statements in the while loop.
If-statement patches are probably OK for one-off or infrequent runs against a small, stable list of city names. For larger lists or more frequent runs, I think I would go with a database.
It's also possible to use a regex/hash approach:
c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
"my @cities = ('windsor riverside', ' new york ', 'philadelphia',);
;;
my $rx_city = build_city_regex(@cities);
print $rx_city;
;;
my %city_digits;
;;
RECORD:
for my $record (
'CA006139520,\"WINDSOR RIVERSIDE, ON CA \",2018-01-02,10',
qq{CA006139520,\" NEW YORK , ON CA \",2018-01-02,987\n},
'CA006139520,\"NEWYORK, ON CA \",2018-01-02,9999',
'CA006139520,\"NEW YORK, ON CA \",2018-01-02,10210',
qq{CA006139520,\"PHILADELPHIA, ON CA \",2018-01-02,76\n},
) {
next RECORD unless
my ($city, $digits) = $record =~ m{ ($rx_city) .* \b (\d+) \Z }xms;
push @{ $city_digits{ canonicalize_city($city) } }, $digits
}
dd \%city_digits;
;;
sub build_city_regex {
my ($regex) =
map qr{ \b (?: $_) \b }xms,
join ' | ',
map { (my $c = $_) =~ s{ \s+ }'\s+'xmsg; $c; }
reverse sort
map canonicalize_city($_),
@_
;
return $regex;
}
;;
sub canonicalize_city {
my ($city_name) = @_;
;;
die qq{bad city: '$city_name'}
if $city_name =~ m{ [^[:alpha:] -] }xms;
$city_name =~ s{ \A \s+ | \s+ \z }''xmsg;
$city_name =~ s{ \s+ }' 'xmsg;
$city_name = uc $city_name;
;;
return $city_name;
}
"
(?msx-i: \b (?: WINDSOR\s+RIVERSIDE | PHILADELPHIA | NEW\s+YORK) \b )
{
"NEW YORK" => [987, 10210],
PHILADELPHIA => [76],
"WINDSOR RIVERSIDE" => [10],
}
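One detail worth spelling out is the reverse sort in build_city_regex. As far as I can tell, its effect is that when one city name is a prefix of another, the longer name comes first in the alternation. Perl tries alternatives left to right, so the shorter name would otherwise match first. Here is a minimal standalone sketch of that effect. The helpers alternation_regex and first_match, the name 'NEW YORK MILLS', and the record are all made up for illustration and are not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: build one alternation from already-canonical
# (upper-case, single-spaced) names, in exactly the order given.
sub alternation_regex {
    my @names = @_;
    my $alt = join '|',
        map { join '\s+', map { quotemeta($_) } split ' ', $_ }
        @names;
    return qr{ \b (?: $alt ) \b }x;
}

# Hypothetical helper: return the first city the regex finds, or undef.
sub first_match {
    my ($rx, $text) = @_;
    return $text =~ m{ ($rx) }x ? $1 : undef;
}

# Made-up test data: one name is a prefix of the other.
my @names  = ('NEW YORK', 'NEW YORK MILLS');
my $record = 'ID0001,"NEW YORK MILLS, NY US",2018-01-02,42';

my $rx_plain   = alternation_regex(sort @names);          # shorter name first
my $rx_longest = alternation_regex(reverse sort @names);  # longer name first

print 'sort:         ', first_match($rx_plain,   $record), "\n";
print 'reverse sort: ', first_match($rx_longest, $record), "\n";
```

With the plain sort the capture stops at NEW YORK, because that alternative is tried first and the trailing \b is satisfied by the space after it. With the reverse sort the capture is NEW YORK MILLS.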
Something like this will work even with large lists (thousands!) of city names. However, as I said, once the list is big enough and the runs frequent enough, it's probably better to use a database.
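For what it's worth, here is a rough sketch of what the database route might look like. It assumes DBI with DBD::SQLite is installed and uses an in-memory database so that it is self-contained. The table name city_digits, its columns, and the helper digits_for_city are my own invention for illustration. The rows are the city/digits pairs from the output above.

```perl
use strict;
use warnings;
use DBI;

# Assumption: DBD::SQLite is available. ':memory:' avoids touching disk.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 });

# Hypothetical schema: one row per (canonical city name, digits) pair.
$dbh->do('CREATE TABLE city_digits (city TEXT NOT NULL, digits INTEGER NOT NULL)');

my $insert = $dbh->prepare('INSERT INTO city_digits (city, digits) VALUES (?, ?)');
$insert->execute(@$_) for
    ['WINDSOR RIVERSIDE', 10],
    ['NEW YORK',          987],
    ['NEW YORK',          10210],
    ['PHILADELPHIA',      76];

# Hypothetical helper: all digits recorded for a city, in insertion order.
sub digits_for_city {
    my ($dbh, $city) = @_;
    my $rows = $dbh->selectcol_arrayref(
        'SELECT digits FROM city_digits WHERE city = ? ORDER BY rowid',
        undef, $city);
    return @$rows;
}

print join(',', digits_for_city($dbh, 'NEW YORK')), "\n";
```

The city names would still need to go through something like canonicalize_city before insertion and lookup, so that ' new york ' and 'NEW YORK' end up under the same key.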
Give a man a fish: <%-{-{-{-<