You have written some Perl scripts already, and when somebody asks you how to reverse a string, you'll answer: "That's easy, just call reverse in scalar context".

And of course, you're right - if you're only considering ASCII chars.

But suppose you have a UTF-8 environment:

    #!/usr/bin/perl
    use strict;
    use warnings;
    print scalar reverse "\noäu";

The output consists of a "u", two garbage bytes, an "o", and a newline.

The reason is that "ä", like many other chars, is represented by several bytes in UTF-8, here as 0xC3 0xA4. Without decoding, reverse works on bytes, so it produces 0xA4 0xC3. That is not legal UTF-8, so the output contains two bytes of garbage.
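To make the byte swap visible, here is a minimal sketch that encodes the character explicitly with the core Encode module and dumps the bytes before and after reversing. The `hexdump` helper is a name made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render each byte of a string as 0xNN.
sub hexdump {
    return join ' ', map { sprintf('0x%02X', ord $_) } split //, $_[0];
}

# "\x{E4}" is LATIN SMALL LETTER A WITH DIAERESIS; encode it to UTF-8 bytes.
my $bytes = encode('UTF-8', "\x{E4}");

print hexdump($bytes), "\n";                  # 0xC3 0xA4
print hexdump(scalar reverse $bytes), "\n";   # 0xA4 0xC3
```

The second line of output starts with a continuation byte (0xA4), which can never begin a UTF-8 sequence.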

You can solve this problem by decoding the text strings (read perluniintro and perlunicode for more information):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    binmode STDOUT, ':utf8';
    print scalar reverse "\noäu";
    __END__
    uäo

The use utf8; takes care that every string literal in the script is treated as a text string, so reverse (and other functions like uc) will work on codepoint level.
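use utf8 only covers literals in the source file. For data that comes from outside the script, one way to get a text string is to decode it explicitly. The following is a sketch with the core Encode module, assuming the input is known to be UTF-8; the `$bytes` value is a stand-in for whatever you read from a file or socket.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the raw UTF-8 bytes of "o", "a with diaeresis", "u".
my $bytes = "o\xC3\xA4u";

# Decode bytes into a text string of 3 characters (codepoints).
my $text = decode('UTF-8', $bytes);

# reverse now works on codepoints instead of bytes.
my $reversed = scalar reverse $text;

# Encode again before writing the result out as bytes.
print encode('UTF-8', $reversed), "\n";
```

The multi-byte sequence 0xC3 0xA4 stays intact because it is a single character by the time reverse sees it.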

While this example worked, it could just as well fail.

The reason is that there are multiple ways to encode some characters.

Consider the letter "Ä", which has the Unicode name LATIN CAPITAL LETTER A WITH DIAERESIS. You could also write that as two codepoints: LATIN CAPITAL LETTER A, COMBINING DIAERESIS. That is a base character, in this case "A", and a combining character, here the COMBINING DIAERESIS.

Converting one representation into the other is called "Unicode normalization".
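As a small illustration, the core module Unicode::Normalize can convert between the two representations. This sketch uses NFD (canonical decomposition) and NFC (canonical composition) and writes the characters as escapes so it does not depend on the encoding of the source file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{C4}";          # LATIN CAPITAL LETTER A WITH DIAERESIS
my $decomposed = NFD($composed);    # "A" followed by U+0308 COMBINING DIAERESIS

# One codepoint versus two codepoints for the same visible letter.
printf "%d %d\n", length($composed), length($decomposed);   # 1 2

# Composing the decomposed form gives back the single codepoint.
print NFC($decomposed) eq $composed ? "same\n" : "different\n";   # same
```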

Unfortunately, reverse doesn't work on the decomposed form:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    use Unicode::Normalize;
    use charnames ':full';
    my $str = 'Ä';
    sub mydump {
        print map { '\N['. charnames::viacode(ord $_) . ']' } split m//, $_[0];
        print "\n";
    }
    mydump $str;
    mydump NFKD($str);
    mydump scalar reverse NFKD($str);
    binmode STDOUT, ':utf8';
    my $tmp = "\nÄO";
    print scalar reverse NFKD($tmp);
    __END__
    \N[LATIN CAPITAL LETTER A WITH DIAERESIS]
    \N[LATIN CAPITAL LETTER A]\N[COMBINING DIAERESIS]
    \N[COMBINING DIAERESIS]\N[LATIN CAPITAL LETTER A]
    ÖA

You can see that reversing a string moves the combining character(s) to the front, so they are applied to the wrong base character; "ÄO" reversed becomes "ÖA".

(You wouldn't normalize with NFKD here under normal circumstances; in this example it is done to demonstrate the problem that can arise from such strings.)

It seems the problem could easily be solved by not doing the normalization in the first place, and indeed that works in this example. But there are Unicode graphemes that can't be expressed with a single codepoint, and if one of your users enters such a grapheme, your application won't work correctly.
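One such grapheme, picked here purely for illustration, is "q" with a diaeresis: Unicode has no precomposed codepoint for it, so even NFC leaves it as two codepoints. A sketch of what a plain reverse does to it:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# "q" + COMBINING DIAERESIS, followed by "x". NFC cannot compose the first
# two codepoints, so the string keeps three codepoints for two graphemes.
my $str = NFC("q\x{308}x");
print length($str), "\n";   # 3

# A codepoint-level reverse moves the diaeresis onto the "x".
my $rev = scalar reverse $str;
print join(' ', map { sprintf('U+%04X', ord $_) } split //, $rev), "\n";
# U+0078 U+0308 U+0071
```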

So we need a "real" solution. Since perl doesn't work with graphemes, we need a CPAN module that does:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    use Unicode::Normalize;
    use charnames ':full';
    use String::Multibyte;
    my $str = NFKD "ÄO";
    sub mydump {
        print map { '\N['. charnames::viacode(ord $_) . ']' } split m//, $_[0];
        print "\n";
    }
    my $u = String::Multibyte->new('Grapheme');
    mydump $str;
    mydump $u->strrev($str);
    binmode STDOUT, ':utf8';
    print $u->strrev($str), "\n";
    __END__
    \N[LATIN CAPITAL LETTER A]\N[COMBINING DIAERESIS]\N[LATIN CAPITAL LETTER O]
    \N[LATIN CAPITAL LETTER O]\N[LATIN CAPITAL LETTER A]\N[COMBINING DIAERESIS]
    OÄ

The String::Multibyte::Grapheme module helps you with reversing the string without tearing the graphemes apart.

(It can also count the number of graphemes, generate substrings with grapheme semantics and more; see String::Multibyte.)
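If installing a module is not an option, Perl's regex engine also offers \X, which matches one extended grapheme cluster. The following is a sketch of that alternative; it is not part of the original post, and `reverse_graphemes` is a name made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: split into grapheme clusters with \X, reverse the
# list of clusters, and join them back into one string.
sub reverse_graphemes {
    return join '', reverse $_[0] =~ /\X/g;
}

# "A with diaeresis" + "O", decomposed: A, U+0308, O.
my $str = NFD("\x{C4}O");

# The combining mark stays attached to its base character: O, A, U+0308.
my $rev = reverse_graphemes($str);

# Counting graphemes works the same way: 2 graphemes, 3 codepoints.
my $count = () = $str =~ /\X/g;

binmode STDOUT, ':utf8';
print "$rev\n";
print "$count graphemes, ", length($str), " codepoints\n";
```

This keeps each base character together with its combining marks, which is the same property the String::Multibyte example relies on.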

In reply to How to reverse a (Unicode) string by moritz
