Adding to Your Mother's excellent advice above: you'll love the predefined Unicode character classes for the various scripts. Here's a minor enhancement to the script provided above (again, using "pre" tags to avoid mangling the non-ASCII characters):
#!/usr/bin/perl
use utf8;
use strictures;
use HTML::Entities "encode_entities_numeric";
binmode STDOUT, ":encoding(UTF-8)";
# OR use Encode, print encode_utf8(...)
while (<DATA>)
{
    chomp;
    next unless /\w/;    # skip lines with no word characters
    my $script_label = "";
    # \p{...} takes an interpolated script name, so one loop covers all three
    for my $script ( qw/Arabic Greek Hebrew/ ) {
        $script_label .= " has $script" if ( /\p{$script}/ );
    }
    print $_, $/;
    print " -> ", length, " characters long; $script_label", $/;
    print " -> ", encode_entities_numeric($_), $/;
}
__DATA__
antennæ
עברית
Ελληνικά
العَرَبِية
The output I got from that was:
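antennæ
 -> 7 characters long; 
 -> antenn&#xE6;
עברית
 -> 5 characters long;  has Hebrew
 -> &#x5E2;&#x5D1;&#x5E8;&#x5D9;&#x5EA;
Ελληνικά
 -> 8 characters long;  has Greek
 -> &#x395;&#x3BB;&#x3BB;&#x3B7;&#x3BD;&#x3B9;&#x3BA;&#x3AC;
العَرَبِية
 -> 10 characters long;  has Arabic
 -> &#x627;&#x644;&#x639;&#x64E;&#x631;&#x64E;&#x628;&#x650;&#x64A;&#x629;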
To put that another way, you can match and store strings of characters in particular, language-specific scripts with something like this:
# Assuming $_ contains the input:
my @hebrew_parts = /\p{Hebrew}+/g;
my @arabic_parts = /\p{Arabic}+/g;
my @greek_parts = /\p{Greek}+/g;
Similarly for Han, Cyrillic, Ethiopic, Thai, Devanagari, etc. (As shown above, you have the option of parameterizing the script label as a loop variable.)
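As a rough sketch of that loop-variable approach, the per-script matches can be collected into a hash keyed by script name. The helper name runs_by_script and the sample string are made up for illustration; they aren't part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the string literal below contains non-ASCII characters

binmode STDOUT, ":encoding(UTF-8)";

# Hypothetical helper: returns a hash ref mapping each requested script
# name to the list of runs of that script found in $text.
sub runs_by_script {
    my ( $text, @scripts ) = @_;
    my %runs;
    for my $script (@scripts) {
        # The script name is interpolated into the \p{...} property
        my @found = $text =~ /\p{$script}+/g;
        $runs{$script} = \@found if @found;
    }
    return \%runs;
}

# Sample input mixing Latin, Hebrew and Greek (assumed for the demo)
my $mixed = "abc עברית def Ελληνικά ghi";
my $runs  = runs_by_script( $mixed, qw/Hebrew Greek Cyrillic/ );

for my $script ( sort keys %$runs ) {
    print "$script: ", join( " | ", @{ $runs->{$script} } ), "\n";
}
```

Scripts with no match (Cyrillic, for this input) get no key at all, so keys %$runs doubles as the list of scripts actually present in the text.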