Perl Module for identifying country name

maheshkumar has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Perl Module for identifying country name by frozenwithjoy (Priest) on Aug 03, 2012 at 15:36 UTC
It seems like the biggest hurdle is getting a list of countries. To overcome this, you could use Locale::Country: `@country_names = all_country_names();` Then you could ~~do something like put the countries in one hash and the words from the file in another hash and look for overlapping keys.~~ use List::Compare: `$lc = List::Compare->new( \@country_names, \@words_in_file ); @countries_in_file = $lc->get_intersection;` Edit: Now that I think about it a little more, it might be better to use your array of country names to grep through your file contents (after replacing newlines with spaces) to avoid issues with multi-word country names.
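A minimal sketch of the grep idea from the edit above. The country list and the text are hard-coded, hypothetical stand-ins so the snippet runs on its own; in practice the list would come from Locale::Country's `all_country_names()` and the text from the file.

```perl
use strict;
use warnings;

# Stand-in list (assumption); all_country_names() would normally supply it.
my @country_names = ('Canada', 'Germany', 'Sri Lanka', 'United Kingdom');

# Hypothetical file contents, with a multi-word name split across two lines.
my $text = "Shipments left Canada for Sri\nLanka and Germany last week.";

# Collapse every run of whitespace (including newlines) into one space.
(my $flat = $text) =~ s/\s+/ /g;

# Keep each country whose full name appears as a whole phrase in the text.
my @countries_in_file = grep { $flat =~ /\b\Q$_\E\b/ } @country_names;

print join(', ', @countries_in_file), "\n";
```

Because the match runs against the flattened text rather than a list of single words, "Sri Lanka" is found even though it was split across a line break.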
Re^2: Perl Module for identifying country name by maheshkumar (Sexton) on Aug 04, 2012 at 23:54 UTC
I already used Locale::Country to put all the country names in an array, and I am getting the countries that appear in a file name :)
Re: Perl Module for identifying country name by TomDLux (Vicar) on Aug 03, 2012 at 16:09 UTC
You can search for any group of strings you wish to. The problem is: what are the possible values? Will it be the English name or the German one: Germany or Deutschland? Will it be the current name or an older one: Myanmar or Burma? Sri Lanka or Ceylon? Mumbai or Bombay? If you have a file with one value per line, you can use "grep -f countries datafile" to examine datafile for all the countries in the countries file. The Perl equivalent is simple: read the set of countries into an array, then form it into a regular expression which will capture the found string: `my $re_text = join '|', map {($_)} @countries; my $re = qr/$re_text/;` and then test each input line against the regex: `while ( my $line = <$fh>) { chomp $line; my $found = ($line =~ /$re/); # Profit! }` As Occam said: Entia non sunt multiplicanda praeter necessitatem.
Re^2: Perl Module for identifying country name by CountZero (Bishop) on Aug 03, 2012 at 17:20 UTC
What use is the `map {($_)}` in your regex-making code? You will create a great many "captures" which are unnecessary. CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics
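One way to avoid a capture per name, as a sketch only: join the escaped names into a single alternation and wrap the whole thing in one capturing group. The `@countries` and `@lines` values below are made-up sample data, and sorting longest-first is my own addition so that "Guinea-Bissau" is tried before "Guinea".

```perl
use strict;
use warnings;

# Hypothetical sample data; a real list would be read from a file.
my @countries = ('Guinea', 'Guinea-Bissau', 'Sri Lanka', 'Germany');

# Longest names first so a longer name wins over a shorter one it contains;
# quotemeta keeps characters such as "-" or "'" literal.
my $re_text = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @countries;

# One capturing group around the whole alternation, not one per name.
my $re = qr/\b($re_text)\b/;

my @lines = (
    'Ferry from Guinea-Bissau to Guinea',
    'No match here',
    'Tea from Sri Lanka and cars from Germany',
);

my @found;
for my $line (@lines) {
    # In list context, /g returns every captured name on the line.
    push @found, $line =~ /$re/g;
}

print join(', ', @found), "\n";
```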
Re: Perl Module for identifying country name by ww (Archbishop) on Aug 03, 2012 at 17:11 UTC
... and do you want to know about instances of "los Estados Unidos" ou "les Etats Unis;" about "Bundesrepublik Deutschland" (auf Deutsch) oder "Alemania" (língua portuguesa) or Chine, perhaps in one of the several written forms of Chinese? In other words, what you didn't tell us is "Is the text file guaranteed to be Angličané jazyk? Is Czech or some other language possible?"
Re: Perl Module for identifying country name by talexb (Chancellor) on Aug 03, 2012 at 14:59 UTC
It's a little difficult to comprehend what you're asking for, but my guess is that you could achieve your goal by using `grep` on the file for the country name that you're looking for. Does that help? Alex / talexb / Toronto "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
Re^2: Perl Module for identifying country name by maheshkumar (Sexton) on Aug 03, 2012 at 15:07 UTC
Actually, what I want is just to find which country names appear in a text file. For grep, I think I will need to specify whether it is United States or Germany, right? That way I could miss the country name Canada if it is in the file.
Re^3: Perl Module for identifying country name by CountZero (Bishop) on Aug 03, 2012 at 15:54 UTC
You can use a regular expression to find all (English) country names.

`(?-xism:(?:S(?:a(?:int (?:(?:Vincent and the Grenadine|Kitts and Nevi)s|Lucia)|o Tome and Principe|(?:udi Arabi|mo)a|n Marino)|o(?:uth (?:(?:Afric|Kore)a|Sudan)|lomon Islands|malia)|(?:(?:lov(?:ak|en)|yr)i|ri Lank)a|w(?:(?:itzer|azi)land|eden)|e(?:ychelles|negal|rbia)|i(?:erra Leon|ngapor)e|u(?:riname|dan)|pain)|B(?:o(?:(?:snia and Herzegovi|tswa)n|livi)a|a(?:h(?:amas|rain)|ngladesh|rbados)|u(?:r(?:kina Faso|undi|ma)|lgaria)|e(?:l(?:arus|gium|ize)|nin)|r(?:azil|unei)|hutan)|M(?:a(?:l(?:a(?:ysia|wi)|dives|ta|i)|urit(?:ania|ius)|c(?:edonia|au)|rshall Islands|dagascar)|o(?:n(?:(?:tenegr|ac)o|golia)|zambique|ldova|rocco)|icronesia|exico)|C(?:o(?:(?:sta Ric|lombi)a|te d'Ivoire|moros)|a(?:m(?:bodia|eroon)|pe Verde|nada)|(?:entral African|zech) Republic|h(?:i(?:le|na)|ad)|(?:roati|ub)a|yprus)|T(?:u(?:rk(?:menistan|ey)|nisia|valu)|a(?:(?:jikist|iw)an|nzania)|rinidad and Tobago|o(?:nga|go)|imor-Leste|hailand)|A(?:(?:n(?:tigua and Barbud|dorr|gol)|(?:l(?:ban|ger)|ustr(?:al)?)i|r(?:gentin|meni))a|(?:fghanist|zerbaij)an)|P(?:a(?:l(?:estinian Territories|au)|(?:pua New Guine|nam)a|kistan|raguay)|o(?:rtugal|land)|hilippines|eru)|N(?:e(?:therland(?:s Antille)?s|w Zealand|pal)|i(?:ger(?:ia)?|caragua)|or(?:th Korea|way)|a(?:mibia|uru))|G(?:u(?:inea(?:-Bissau)?|(?:atemal|yan)a)|e(?:orgia|rmany)|re(?:nada|ece)|a(?:mbia|bon)|hana)|E(?:(?:(?:quatorial Guin|ritr)e|(?:thiop|ston)i)a|(?:(?:l Salv|cu)ad|ast Tim)or|gypt)|L(?:i(?:(?:b(?:eri|y)|thuani)a|echtenstein)|e(?:banon|sotho)|a(?:tvia|os)|uxembourg)|U(?:nited (?:States of America|Arab Emirates|Kingdom)|zbekistan|kraine|ruguay|ganda)|D(?:e(?:mocratic Republic of the Congo|nmark)|ominica(?:n Republic)?|jibouti)|I(?:r(?:a[nq]|eland)|nd(?:ones)?ia|celand|srael|taly)|K(?:(?:azakh|yrgyz)stan|iribati|osovo|uwait|enya)|R(?:(?:(?:oman|uss)i|wand)a|epublic of the Congo)|H(?:o(?:n(?:g Kong|duras)|ly See)|ungary|aiti)|V(?:enezuela|anuatu|ietnam)|J(?:a(?:maica|pan)|ordan)|F(?:i(?:nland|ji)|rance)|Z(?:imbabwe|ambia)|(?:Yeme|Oma)n|Qatar))`

BTW, you will not find "United States" with this regex since the official name is "United States of America". CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics
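If variants such as "United States", or the older and foreign-language names raised elsewhere in this thread, need to be caught, one possible approach is a lookup table from each variant to a canonical name. This is only a sketch: the `%canonical` table and the sample text are hypothetical, and a real table would have to be compiled by hand or from another source.

```perl
use strict;
use warnings;

# Hypothetical alias table (assumption): variant spelling => canonical name.
my %canonical = (
    'United States'            => 'United States of America',
    'United States of America' => 'United States of America',
    'Burma'                    => 'Myanmar',
    'Myanmar'                  => 'Myanmar',
    'Deutschland'              => 'Germany',
    'Germany'                  => 'Germany',
);

# Longest variants first so the full official name wins when present.
my $alt = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %canonical;
my $re = qr/\b($alt)\b/;

# Made-up sample text.
my $text = 'Offices in the United States, Deutschland and Burma.';

# Record the canonical name for every variant found in the text.
my %seen;
$seen{ $canonical{$1} }++ while $text =~ /$re/g;

print join(', ', sort keys %seen), "\n";
```

The hash keys double as the search list, so adding a new spelling only requires one more entry in the table.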

