
While this is a confusing topic, at the root it's not too hard as long as you know the input encoding and adjust for the output. UTF-8 probably covers all the characters you want. You will have to do quite a bit of reading to understand what you're doing here, though. This is the nuclear option for explanations: tchrist on UTF-8 in Perl. Normally the only parts you really have to understand are: decode your input (from UTF-8, or whatever encoding it arrives in) into Perl characters, do your business in Perl, and encode your output to UTF-8 (or in your case, ASCII HTML entities).

And another basic caveat: if you expect to be able to see UTF-8 in a display layer like a terminal, the layer has to be aware of the encoding you want to use. Unicode is the catch-all character set; an encoding is one particular way of writing characters out as bytes. You will be dealing with *an* encoding at input and *an* encoding at output that may or may not be the same: Latin-1, CP-1252, UTF-8, UTF-16, Big5, etc.
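To make the decode / work / encode steps concrete, here is a minimal sketch using the core Encode module. The byte string is a made-up stand-in for input that really does arrive as UTF-8; if yours arrives in something else, the encoding name passed to decode would change.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the raw UTF-8 bytes for "antennæ", as they
# would arrive from a read with no encoding layer applied.
my $bytes    = "antenn\xC3\xA6";
my $byte_len = length $bytes;    # counts bytes

# 1. Decode: bytes in, Perl character string out.
my $chars    = decode("UTF-8", $bytes);
my $char_len = length $chars;    # counts characters

# 2. Do your business on characters, not bytes.
my $upper = uc $chars;

# 3. Encode: characters back to bytes for output.
my $out = encode("UTF-8", $upper);

print "$byte_len bytes, $char_len characters\n";
print $out, "\n";
```

The two lengths differ (8 bytes, 7 characters) because æ takes two bytes in UTF-8, which is the usual symptom of having skipped the decode step.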

This little snippet might get you started. I had to use <pre/> tags because PM's <code/> tags don't like wide characters. :P

use utf8;
use strictures;
use HTML::Entities "encode_entities_numeric";

binmode STDOUT, ":encoding(UTF-8)";
# OR use Encode, print encode_utf8(...)

while (<DATA>)
{
    chomp;
    next unless /\w/;
    print $_, $/;
    print "  -> ",  length, " characters long", $/;
    print "  -> ", encode_entities_numeric($_), $/;
}

__DATA__
antennæ
עברית
Ελληνικά
العَرَبِية‎

antennæ
  -> 7 characters long
  -> antenn&#xE6;
עברית
  -> 5 characters long
  -> &#x5E2;&#x5D1;&#x5E8;&#x5D9;&#x5EA;
Ελληνικά
  -> 8 characters long
  -> &#x395;&#x3BB;&#x3BB;&#x3B7;&#x3BD;&#x3B9;&#x3BA;&#x3AC;
العَرَبِية‎
  -> 11 characters long
  -> &#x627;&#x644;&#x639;&#x64E;&#x631;&#x64E;&#x628;&#x650;&#x64A;&#x629;&#x200E;
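The commented-out alternative in the snippet (encoding by hand instead of setting binmode on STDOUT) might look something like the sketch below. The sample word is my own choice, written with an escape so the script does not depend on how the file is saved; HTML::Entities comes from CPAN, as in the snippet above.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use HTML::Entities qw(encode_entities_numeric);

# Hypothetical sample: "antennæ", with the æ given as an escape.
my $word = "antenn\x{E6}";

# No binmode on STDOUT here. Each string is turned into bytes
# (or plain ASCII) by hand before it is printed.
my $utf8_bytes = encode_utf8($word);
my $entities   = encode_entities_numeric($word);

print $utf8_bytes, "\n";    # UTF-8 bytes
print length($word), " characters, ",
      length($utf8_bytes), " bytes\n";
print $entities, "\n";      # ASCII only
```

The entity form is pure ASCII, so it survives whatever encoding the output layer assumes, which is handy when you can't control that layer.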

Further reading: Encode, utf8, perlunitut. Branch out from those as desired.

In reply to Re: Unicode words match and catch by Your Mother
in thread Unicode words match and catch by kepler

	PerlMonks