Re: Perl Module for identifying country name
by frozenwithjoy (Priest) on Aug 03, 2012 at 15:36 UTC
It seems like the biggest hurdle is getting a list of countries. To overcome this, you could use Locale::Country: @country_names = all_country_names();
Then you could put the countries in one hash and the words from the file in another and look for overlapping keys, or let List::Compare find the intersection for you:
$lc = List::Compare->new( \@country_names, \@words_in_file );
@countries_in_file = $lc->get_intersection;
Edit: Now that I think about it a little more, it might be better to use your array of country names to grep through your file contents (after replacing newlines with spaces) to avoid issues with multi-word country names.
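A minimal sketch of the grep idea from the edit above, not a definitive implementation. The short hard-coded list and the sample text are assumptions for illustration; in practice the list could come from all_country_names(). The \Q...\E escaping and the \b word boundaries are additions that the post does not mention.

```perl
use strict;
use warnings;

# Stand-in list (assumption); could be replaced by all_country_names().
my @country_names = ('Canada', 'Germany', 'United States', 'Sri Lanka');

# Hypothetical file contents with a multi-word name split across a line break.
my $text = "Shipments left Canada for Sri\nLanka and the United States.";

# Collapse every run of whitespace, including newlines, to a single space.
(my $flat = $text) =~ s/\s+/ /g;

# Keep each country whose full name occurs as a whole phrase in the text.
my @countries_in_file = grep { $flat =~ /\b\Q$_\E\b/ } @country_names;

print "$_\n" for @countries_in_file;
```

Because the match runs against the flattened text rather than a list of single words, "Sri Lanka" is found even though it straddles a newline in the input.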
Re: Perl Module for identifying country name
by TomDLux (Vicar) on Aug 03, 2012 at 16:09 UTC
You can search for any group of strings you wish. The problem is deciding what the possible values are. Will it be the English name or the German one: Germany or Deutschland? Will it be the current name or an older one: Myanmar or Burma? Sri Lanka or Ceylon? Mumbai or Bombay?
If you have a file with one value per line, you can use "grep -f countries datafile" to examine datafile for all the countries in the countries file. The perl equivalent is simple:
- read in the set of countries into an array
- form into a regular expression which will capture the found string:
my $re_text = join '|', map {($_)} @countries;
my $re = qr/$re_text/;
- and then test each input line against the re:
while ( my $line = <$fh> ) {
    chomp $line;
    # List assignment captures the matched name itself (undef if no match)
    my ($found) = $line =~ /($re)/;
    # Profit!
}
As Occam said: Entia non sunt multiplicanda praeter necessitatem.
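The steps above could be put together roughly as follows. This is only a sketch: the in-memory country list and sample lines are hypothetical stand-ins for the two files, and quotemeta, the longest-first sort, and the \b anchors are additions not present in the post.

```perl
use strict;
use warnings;

# Hypothetical country list; in practice read one name per line from a file.
my @countries = ('Niger', 'Nigeria', 'United Kingdom', 'Canada');

# Longest names first so the full name is tried before a shorter prefix;
# quotemeta escapes any regex metacharacters inside a name.
my $re_text = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @countries;
my $re = qr/\b($re_text)\b/;

# Hypothetical input lines standing in for the data file.
my @lines = (
    'Flights from Nigeria to Canada',
    'No match here',
    'United Kingdom only',
);

my @found;
for my $line (@lines) {
    # A global match in list context returns every captured name on the line.
    push @found, $line =~ /$re/g;
}

print join(', ', @found), "\n";
```

A single capturing group around the whole alternation is enough to recover the matched name, so the individual alternatives do not need their own parentheses.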
What use is the map {($_)} in your regex-making code? You will create a great many "captures" which are unnecessary.
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
My blog: Imperial Deltronics
Re: Perl Module for identifying country name
by ww (Archbishop) on Aug 03, 2012 at 17:11 UTC
... and do you want to know about instances of "los Estados Unidos" ou "les Etats Unis;" about "Bundesrepublik Deutschland" (auf Deutsch) oder "Alemanha" (língua portuguesa) or Chine, perhaps in one of the several written forms of Chinese?
In other words, what you didn't tell us is "Is the text file guaranteed to be Angličané jazyk? Is Czech or some other language possible?"
Re: Perl Module for identifying country name
by talexb (Chancellor) on Aug 03, 2012 at 14:59 UTC
It's a little difficult to comprehend what you're asking for, but my guess is that you could achieve your goal by using grep on the file for the country name that you're looking for. Does that help?
Alex / talexb / Toronto
"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
Actually, what I want is just to find which country names appear in a text file.
For grep, I think I would need to specify whether I am looking for United States or Germany, right? That way I could miss the country name Canada if it is in the file.
You can use a regular expression to find all (English) country names.
(?-xism:(?:S(?:a(?:int (?:(?:Vincent and the Grenadine|Kitts and Nevi)s|Lucia)|o Tome and Principe|(?:udi Arabi|mo)a|n Marino)|o(?:uth (?:(?:Afric|Kore)a|Sudan)|lomon Islands|malia)|(?:(?:lov(?:ak|en)|yr)i|ri Lank)a|w(?:(?:itzer|azi)land|eden)|e(?:ychelles|negal|rbia)|i(?:erra Leon|ngapor)e|u(?:riname|dan)|pain)|B(?:o(?:(?:snia and Herzegovi|tswa)n|livi)a|a(?:h(?:amas|rain)|ngladesh|rbados)|u(?:r(?:kina Faso|undi|ma)|lgaria)|e(?:l(?:arus|gium|ize)|nin)|r(?:azil|unei)|hutan)|M(?:a(?:l(?:a(?:ysia|wi)|dives|ta|i)|urit(?:ania|ius)|c(?:edonia|au)|rshall Islands|dagascar)|o(?:n(?:(?:tenegr|ac)o|golia)|zambique|ldova|rocco)|icronesia|exico)|C(?:o(?:(?:sta Ric|lombi)a|te d'Ivoire|moros)|a(?:m(?:bodia|eroon)|pe Verde|nada)|(?:entral African|zech) Republic|h(?:i(?:le|na)|ad)|(?:roati|ub)a|yprus)|T(?:u(?:rk(?:menistan|ey)|nisia|valu)|a(?:(?:jikist|iw)an|nzania)|rinidad and Tobago|o(?:nga|go)|imor-Leste|hailand)|A(?:(?:n(?:tigua and Barbud|dorr|gol)|(?:l(?:ban|ger)|ustr(?:al)?)i|r(?:gentin|meni))a|(?:fghanist|zerbaij)an)|P(?:a(?:l(?:estinian Territories|au)|(?:pua New Guine|nam)a|kistan|raguay)|o(?:rtugal|land)|hilippines|eru)|N(?:e(?:therland(?:s Antille)?s|w Zealand|pal)|i(?:ger(?:ia)?|caragua)|or(?:th Korea|way)|a(?:mibia|uru))|G(?:u(?:inea(?:-Bissau)?|(?:atemal|yan)a)|e(?:orgia|rmany)|re(?:nada|ece)|a(?:mbia|bon)|hana)|E(?:(?:(?:quatorial Guin|ritr)e|(?:thiop|ston)i)a|(?:(?:l Salv|cu)ad|ast Tim)or|gypt)|L(?:i(?:(?:b(?:eri|y)|thuani)a|echtenstein)|e(?:banon|sotho)|a(?:tvia|os)|uxembourg)|U(?:nited (?:States of America|Arab Emirates|Kingdom)|zbekistan|kraine|ruguay|ganda)|D(?:e(?:mocratic Republic of the Congo|nmark)|ominica(?:n Republic)?|jibouti)|I(?:r(?:a[nq]|eland)|nd(?:ones)?ia|celand|srael|taly)|K(?:(?:azakh|yrgyz)stan|iribati|osovo|uwait|enya)|R(?:(?:(?:oman|uss)i|wand)a|epublic of the Congo)|H(?:o(?:n(?:g Kong|duras)|ly See)|ungary|aiti)|V(?:enezuela|anuatu|ietnam)|J(?:a(?:maica|pan)|ordan)|F(?:i(?:nland|ji)|rance)|Z(?:imbabwe|ambia)|(?:Yeme|Oma)n|Qatar))
BTW, you will not find "United States" with this regex since the official name is "United States of America".
CountZero
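One way to cope with short or older names such as "United States" or "Burma" is an alias table that maps each variant to a canonical name. The sketch below is only an illustration: the %canonical hash and its entries are hypothetical and would need to be filled in by hand for whatever variants the text is expected to contain.

```perl
use strict;
use warnings;

# Hypothetical alias table: variant spelling => canonical name.
my %canonical = (
    'United States'            => 'United States of America',
    'United States of America' => 'United States of America',
    'Burma'                    => 'Myanmar',
    'Myanmar'                  => 'Myanmar',
    'Canada'                   => 'Canada',
);

# Longest alternatives first, so the full official name is preferred
# over a shorter variant that is a prefix of it.
my $alt = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %canonical;
my $re = qr/\b($alt)\b/;

# Hypothetical sample text.
my $text = 'Exports from the United States to Burma and Canada rose.';

# Map every match to its canonical name and drop duplicates, keeping order.
my %seen;
my @found = grep { !$seen{$_}++ } map { $canonical{$_} } $text =~ /$re/g;

print "$_\n" for @found;
```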