PerlMonks  

Unicode words match and catch

by kepler (Scribe)
on Apr 14, 2016 at 14:20 UTC (#1160403)

kepler has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to write a routine that finds, in a string, all the words written in Unicode Hebrew, Greek, or Arabic characters. I want to put those matches in an array. Then, item by item, I want to build a new array holding the HTML hexadecimal entities for each word. I must admit I'm a bit lost here. Also, the string may contain regular words in Latin/English. Can someone please advise? Kind regards, Kepler

Replies are listed 'Best First'.
Re: Unicode words match and catch
by Your Mother (Archbishop) on Apr 14, 2016 at 15:17 UTC

    While this is a confusing topic, at the root it's not too hard as long as you know the input encoding and adjust for the output. UTF-8 probably covers all the characters you want. You will have to do quite a bit of reading to understand what you're doing here, though. This is the nuclear option for explanations: tchrist on UTF-8 in Perl. Normally the only parts you really have to understand are: decode your input (from UTF-8 bytes into Perl characters), do your business in Perl, and encode your output (to UTF-8 or, in your case, ASCII HTML entities). And another basic caveat: if you expect to be able to see UTF-8 in a display layer like a terminal, the layer has to be aware of the encoding you want to use. Unicode is the catch-all character set, not an encoding in itself. You will be dealing with *an* encoding at input and *an* encoding at output that may or may not be the same: Latin-1, CP-1252, UTF-8, UTF-16, Big5, etc.
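    The decode / work / encode cycle described above can be sketched with the core Encode module. This is only a minimal illustration, assuming the input arrives as UTF-8 bytes; the variable names are made up for the example, and the byte string is the Hebrew word "Ivrit" spelled out as raw UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the input is UTF-8 encoded bytes (here, five Hebrew letters,
# two bytes each).
my $bytes = "\xD7\xA2\xD7\x91\xD7\xA8\xD7\x99\xD7\xAA";

# 1. Decode: bytes in a known encoding become Perl characters.
my $chars = decode('UTF-8', $bytes);

# 2. Do your business in Perl: length() now counts characters, not bytes.
my $n_bytes = length $bytes;    # 10
my $n_chars = length $chars;    # 5

# 3. Encode: characters become bytes again in the chosen output encoding.
my $out = encode('UTF-8', $chars);

printf "%d bytes in, %d characters, round trip %s\n",
    $n_bytes, $n_chars, ($out eq $bytes ? "ok" : "differs");
```

    If the input were in another encoding (Latin-1, CP-1252, ...), only the name passed to decode would change; the middle step stays the same.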

    This little snippet might get you started. I had to use <pre/> tags because PM's <code/> tags don't like wide characters. :P

    use utf8;
    use strictures;
    use HTML::Entities "encode_entities_numeric";
    
    binmode STDOUT, ":encoding(UTF-8)";
    # OR use Encode, print encode_utf8(...)
    
    while (<DATA>)
    {
        chomp;
        next unless /\w/;
        print $_, $/;
        print "  -> ",  length, " characters long", $/;
        print "  -> ", encode_entities_numeric($_), $/;
    }
    
    __DATA__
    antennæ
    עברית
    Ελληνικά
    العَرَبِية‎
    
    Output:
    antennæ
      -> 7 characters long
      -> antenn&#xE6;
    עברית
      -> 5 characters long
      -> &#x5E2;&#x5D1;&#x5E8;&#x5D9;&#x5EA;
    Ελληνικά
      -> 8 characters long
      -> &#x395;&#x3BB;&#x3BB;&#x3B7;&#x3BD;&#x3B9;&#x3BA;&#x3AC;
    العَرَبِية‎
      -> 11 characters long
      -> &#x627;&#x644;&#x639;&#x64E;&#x631;&#x64E;&#x628;&#x650;&#x64A;&#x629;&#x200E;
    

    Further reading: Encode, utf8, perlunitut. Branch out from those as desired.

Re: Unicode words match and catch
by Corion (Pope) on Apr 14, 2016 at 14:24 UTC

    Wouldn't HTML::Entities fit the bill already, without the recognition of the particular alphabets?

Re: Unicode words match and catch
by graff (Chancellor) on Apr 15, 2016 at 02:52 UTC
    Adding to Your Mother's excellent advice above, you'll love the predefined Unicode character classes for the various scripts. Here's a minor enhancement to the script provided above (again, using "pre" tags to avoid the mangling of non-ASCII characters):
    #!/usr/bin/perl
    
    use utf8;
    use strictures;
    use HTML::Entities "encode_entities_numeric";
    
    binmode STDOUT, ":encoding(UTF-8)";
    # OR use Encode, print encode_utf8(...)
    
    while (<DATA>)
    {
        chomp;
        next unless /\w/;
        my $script_label = "";
        for my $script ( qw/Arabic Greek Hebrew/ ) {
            $script_label .= " has $script" if ( /\p{$script}/ );
        }
        print $_, $/;
        print "  -> ",  length, " characters long; $script_label", $/;
        print "  -> ", encode_entities_numeric($_), $/;
    }
    
    __DATA__
    antennæ
    עברית
    Ελληνικά
    العَرَبِية
    
    The output I got from that was:
    antennæ
      -> 7 characters long; 
      -> antenn&#xE6;
    עברית
      -> 5 characters long;  has Hebrew
      -> &#x5E2;&#x5D1;&#x5E8;&#x5D9;&#x5EA;
    Ελληνικά
      -> 8 characters long;  has Greek
      -> &#x395;&#x3BB;&#x3BB;&#x3B7;&#x3BD;&#x3B9;&#x3BA;&#x3AC;
    العَرَبِية
      -> 10 characters long;  has Arabic
      -> &#x627;&#x644;&#x639;&#x64E;&#x631;&#x64E;&#x628;&#x650;&#x64A;&#x629;
    
    To put that another way, you can match and store strings of characters in particular, language-specific scripts with something like this:
    # Assuming $_ contains the input:
    my @hebrew_parts = /\p{Hebrew}+/g;
    my @arabic_parts = /\p{Arabic}+/g;
    my @greek_parts  = /\p{Greek}+/g;
    Similarly for Han, Cyrillic, Ethiopic, Thai, Devanagari, etc. (As shown above, you have the option of parameterizing the script label as a loop variable.)
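    Putting the pieces together for the original question, here is a rough sketch that collects the Hebrew, Greek and Arabic words of a mixed string into one array and then builds a second array of hexadecimal entities. The helper names script_words and to_hex_entities are invented for this example. It spells the property as \p{Script_Extensions=...} on the assumption that this keeps Arabic vowel marks attached to their word; as far as I know, on Perls older than 5.26 the bare \p{Arabic} form means the stricter Script property, which leaves those marks out. The entities are built with sprintf so the sketch needs only core Perl; encode_entities_numeric should give the same result for these characters.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of characters that belongs to
# one of the named scripts, in order of appearance.
sub script_words {
    my ($text, @scripts) = @_;
    my $alt = join '|', map { sprintf '\p{Script_Extensions=%s}', $_ } @scripts;
    my $re  = qr/(?:$alt)+/;
    return $text =~ /$re/g;
}

# Hypothetical helper: turn each character of a word into &#xHEX; form.
sub to_hex_entities {
    my ($word) = @_;
    return join '', map { sprintf('&#x%X;', ord($_)) } split(//, $word);
}

# Assumption: $text already holds decoded characters, not raw bytes.
my $text = "The word \x{5E2}\x{5D1}\x{5E8}\x{5D9}\x{5EA} and "
         . "\x{395}\x{3BB}\x{3BB}\x{3B7}\x{3BD}\x{3B9}\x{3BA}\x{3AC} sit next to "
         . "\x{627}\x{644}\x{639}\x{631}\x{628}\x{64A}\x{629}.";

my @words    = script_words($text, qw(Hebrew Greek Arabic));
my @entities = map { to_hex_entities($_) } @words;

# Entities are plain ASCII, so they print safely on any terminal.
print "$_\n" for @entities;
```

    The Latin words and punctuation are skipped because they belong to none of the three scripts; passing a different list of script names to script_words covers Han, Cyrillic and the rest.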
