While this is a confusing topic, at the root it's not too hard as long as you know the input encoding and adjust for the output. UTF-8 probably covers all the characters you want. You will have to do quite a bit of reading to understand what you're doing here, though. This is the nuclear option for explanations: tchrist on UTF-8 in Perl.

Normally the only parts you really have to understand are: decode your input (from UTF-8, or whatever encoding it arrives in) into Perl characters, do your business in Perl, and encode your output to UTF-8 (or, in your case, ASCII HTML entities).

And another basic caveat. If you expect to be able to see UTF-8 in a display layer like a terminal, that layer has to be aware of the encoding you want to use. Unicode is the standard, the catch-all character set that the encodings map to; it is not itself an encoding. You will be dealing with *an* encoding at input and *an* encoding at output, which may or may not be the same: Latin-1, CP-1252, UTF-8, UTF-16, Big5, etc.
This little snippet might get you started. I had to use <pre/> tags because PM's <code/> tags don't like wide characters. :P
use utf8;
use strictures;
use HTML::Entities "encode_entities_numeric";
binmode STDOUT, ":encoding(UTF-8)";
# OR use Encode, print encode_utf8(...)
while (<DATA>)
{
    chomp;
    next unless /\w/;
    print $_, $/;
    print " -> ", length, " characters long", $/;
    print " -> ", encode_entities_numeric($_), $/;
}
__DATA__
antennæ
עברית
Ελληνικά
العَرَبِية
antennæ
-> 7 characters long
-> antennæ
עברית
-> 5 characters long
-> עברית
Ελληνικά
-> 8 characters long
-> Ελληνικά
العَرَبِية
-> 11 characters long
-> العَرَبِية‎
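The "# OR use Encode" comment in the snippet hints at doing the decode/encode steps by hand instead of through binmode. Here is a minimal sketch of that route, assuming the input arrives as raw UTF-8 bytes; the variable names and the hard-coded sample bytes are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Entities qw(encode_entities_numeric);

# Hypothetical input: the UTF-8 bytes for "antennae" with the ae ligature,
# as they might arrive from a file or socket opened without an encoding layer.
my $bytes = "antenn\xC3\xA6";

# 1. Decode: bytes in a known encoding become a Perl character string.
my $chars = decode('UTF-8', $bytes);

# 2. Do your business in Perl: length, regexes, etc. now count characters.
my $length = length $chars;

# 3. Encode on the way out: either back to UTF-8 bytes ...
my $utf8_out = encode('UTF-8', $chars);

# ... or to pure-ASCII numeric HTML entities.
my $html_out = encode_entities_numeric($chars);

print "$length characters (", length($bytes), " bytes on input)\n";
print "$html_out\n";
```

The raw input is 8 bytes but 7 characters once decoded, which is why the decode step has to happen before length or \w can give sensible answers.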
Further reading: Encode, utf8, perlunitut. Branch out from those as desired.
Wouldn't HTML::Entities fit the bill already, without the recognition of the particular alphabets?
Adding to Your Mother's excellent advice above, you'll love the predefined Unicode character classes for the various scripts. Here's a minor enhancement to the script provided above (again, using "pre" tags to avoid the mangling of non-ASCII characters):
#!/usr/bin/perl
use utf8;
use strictures;
use HTML::Entities "encode_entities_numeric";
binmode STDOUT, ":encoding(UTF-8)";
# OR use Encode, print encode_utf8(...)
while (<DATA>)
{
    chomp;
    next unless /\w/;
    my $script_label = "";
    for my $script ( qw/Arabic Greek Hebrew/ ) {
        $script_label .= " has $script" if ( /\p{$script}/ );
    }
    print $_, $/;
    print " -> ", length, " characters long; $script_label", $/;
    print " -> ", encode_entities_numeric($_), $/;
}
__DATA__
antennæ
עברית
Ελληνικά
العَرَبِية
The output I got from that was:
antennæ
-> 7 characters long;
-> antennæ
עברית
-> 5 characters long; has Hebrew
-> עברית
Ελληνικά
-> 8 characters long; has Greek
-> Ελληνικά
العَرَبِية
-> 10 characters long; has Arabic
-> العَرَبِية
To put that another way, you can match and store strings of characters in particular, language-specific scripts with something like this:
# Assuming $_ contains the input:
my @hebrew_parts = /\p{Hebrew}+/g;
my @arabic_parts = /\p{Arabic}+/g;
my @greek_parts = /\p{Greek}+/g;
Similarly for Han, Cyrillic, Ethiopic, Thai, Devanagari, etc. (As shown above, you have the option of parameterizing the script label as a loop variable.)
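To make the parameterized version concrete, here is a rough sketch that collects the runs for each script into a hash. The helper name runs_by_script and the sample string are made up for illustration, and the script names are assumed to be ones your perl's Unicode tables recognize.

```perl
use strict;
use warnings;
use utf8;    # the string literal below is UTF-8 source text

# Hypothetical helper: for each requested script, collect every run of
# consecutive characters in that script. Scripts with no match are omitted.
sub runs_by_script {
    my ($text, @scripts) = @_;
    my %runs;
    for my $script (@scripts) {
        # $script is interpolated before the regex is compiled,
        # so this becomes e.g. /(\p{Greek}+)/g
        my @found = $text =~ /(\p{$script}+)/g;
        $runs{$script} = \@found if @found;
    }
    return \%runs;
}

my $mixed = "Ελληνικά and עברית and Кириллица";
my $runs  = runs_by_script($mixed, qw(Greek Hebrew Cyrillic Arabic));

binmode STDOUT, ":encoding(UTF-8)";
for my $script (sort keys %$runs) {
    print "$script: @{ $runs->{$script} }\n";
}
```

Arabic gets no entry here because the sample string has no Arabic characters, so a caller can test for a script simply by checking whether its key exists.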