Unicode words match and catch

kepler has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Unicode words match and catch by Your Mother (Archbishop) on Apr 14, 2016 at 15:17 UTC
While this is a confusing topic, at the root it's not too hard as long as you know the input encoding and adjust for the output. UTF-8 probably covers all the characters you want. You will have to do quite a bit of reading to understand what you're doing here, though. This is the nuclear option for explanations: tchrist on UTF-8 in Perl.

Normally the only parts you really have to understand are: decode your input from UTF-8, do your business in Perl, encode your output to UTF-8 (or, in your case, ASCII HTML entities).

And another basic caveat: if you expect to be able to see UTF-8 in a display layer like a terminal, the layer has to be aware of the encoding you want to use. Unicode is the catch-all standard; the encodings are a separate matter. You will be dealing with an encoding at input and an encoding at output that may or may not be the same: Latin-1, CP-1252, UTF-8, UTF-16, Big5, etc.

This little snippet might get you started. I had to use `<pre/>` tags because PM's `<code/>` tags don't like wide characters. :P

    use utf8;
    use strictures;
    use HTML::Entities "encode_entities_numeric";

    binmode STDOUT, ":encoding(UTF-8)";
    # OR use Encode, print encode_utf8(...)

    while (<DATA>) {
        chomp;
        next unless /\w/;
        print $_, $/;
        print " -> ", length, " characters long", $/;
        print " -> ", encode_entities_numeric($_), $/;
    }

    __DATA__
    antennæ
    עברית
    Ελληνικά
    العَرَبِية‎

Output:

    antennæ
     -> 7 characters long
     -> antennæ
    עברית
     -> 5 characters long
     -> עברית
    Ελληνικά
     -> 8 characters long
     -> Ελληνικά
    العَرَبِية‎
     -> 11 characters long
     -> العَرَبِية‎

Further reading: Encode, utf8, perlunitut. Branch out from those as desired.
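The decode / process / encode cycle described above can be sketched with the core Encode module alone. This is a minimal illustration, not the poster's code: the sample byte string and the variable names are made up for the example, and it assumes the input really is UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes as they might arrive from a file or socket: "antennae" with the
# ae ligature (U+00E6), written as escaped UTF-8 bytes (an assumed sample).
my $bytes = "antenn\xC3\xA6";

# 1. Decode: bytes -> Perl character string.
my $text = decode('UTF-8', $bytes);

# 2. Do your business in Perl, on characters rather than bytes.
my $byte_count = length $bytes;   # 8: the ligature takes two bytes
my $char_count = length $text;    # 7: the ligature is one character
my $shouted    = uc $text;        # case mapping follows Unicode rules

# 3. Encode: characters -> bytes for output.
my $out = encode('UTF-8', $shouted);

printf "%d bytes, %d characters\n", $byte_count, $char_count;
print $out, "\n";
```

Setting `binmode STDOUT, ":encoding(UTF-8)"` as in the snippet above does step 3 implicitly on every print; doing it by hand with `encode` is the alternative the comment in that snippet alludes to.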
Re: Unicode words match and catch by Corion (Patriarch) on Apr 14, 2016 at 14:24 UTC
Wouldn't HTML::Entities fit the bill already, without the recognition of the particular alphabets?
Re: Unicode words match and catch by graff (Chancellor) on Apr 15, 2016 at 02:52 UTC
Adding to Your Mother's excellent advice above, you'll love the predefined Unicode character classes for the various scripts. Here's a minor enhancement to the script provided above (again, using "pre" tags to avoid the mangling of non-ASCII characters):

    #!/usr/bin/perl
    use utf8;
    use strictures;
    use HTML::Entities "encode_entities_numeric";

    binmode STDOUT, ":encoding(UTF-8)";
    # OR use Encode, print encode_utf8(...)

    while (<DATA>) {
        chomp;
        next unless /\w/;
        my $script_label = "";
        for my $script ( qw/Arabic Greek Hebrew/ ) {
            $script_label .= " has $script" if ( /\p{$script}/ );
        }
        print $_, $/;
        print " -> ", length, " characters long; $script_label", $/;
        print " -> ", encode_entities_numeric($_), $/;
    }

    __DATA__
    antennæ
    עברית
    Ελληνικά
    العَرَبِية

The output I got from that was:

    antennæ
     -> 7 characters long;
     -> antennæ
    עברית
     -> 5 characters long; has Hebrew
     -> עברית
    Ελληνικά
     -> 8 characters long; has Greek
     -> Ελληνικά
    العَرَبِية
     -> 10 characters long; has Arabic
     -> العَرَبِية

To put that another way, you can match and store strings of characters in particular, language-specific scripts with something like this:

    # Assuming $_ contains the input:
    my @hebrew_parts = /\p{Hebrew}+/g;
    my @arabic_parts = /\p{Arabic}+/g;
    my @greek_parts  = /\p{Greek}+/g;

Similarly for Han, Cyrillic, Ethiopic, Thai, Devanagari, etc. (As shown above, you have the option of parameterizing the script label as a loop variable.)
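Putting the two ideas together (capture the runs per script, then turn them into entities) might look like the sketch below. The helper names `parts_by_script` and `to_entities` are hypothetical, and `to_entities` is only a simplified stand-in for `encode_entities_numeric` that escapes non-ASCII characters, so the example runs without HTML::Entities installed. The sample line deliberately sticks to Hebrew and Greek: combining marks such as the Arabic vowel signs may or may not count as part of `\p{Arabic}` depending on the Perl version, so test that case on your own perl.

```perl
use strict;
use warnings;
use utf8;                              # this source contains literal UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: for each script name given, collect every run of
# characters belonging to that script. Returns a hash ref keyed by script;
# scripts with no match are left out.
sub parts_by_script {
    my ($text, @scripts) = @_;
    my %parts;
    for my $script (@scripts) {
        my @found = $text =~ /\p{$script}+/g;
        $parts{$script} = \@found if @found;
    }
    return \%parts;
}

# Simplified stand-in for encode_entities_numeric: replace each non-ASCII
# character with a hexadecimal numeric entity.
sub to_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf '&#x%X;', ord $1/ge;
    return $text;
}

my $line  = 'antennæ עברית Ελληνικά';
my $parts = parts_by_script($line, qw(Arabic Greek Hebrew));

for my $script (sort keys %$parts) {
    for my $run (@{ $parts->{$script} }) {
        printf "%s (%d chars): %s\n", $script, length $run, to_entities($run);
    }
}
```

Looping over script names keeps the list easy to extend (Han, Cyrillic, and so on); the trade-off is one pass over the string per script, which is unlikely to matter for short inputs.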

