
Re^2: incorrect length of strings with diphthongs

by cavac (Vicar)
on Aug 30, 2022 at 15:02 UTC


in reply to Re: incorrect length of strings with diphthongs
in thread incorrect length of strings with diphthongs

I may be mistaken, but aren't there (at least) two ways to encode an umlaut in Unicode? You could either use the dedicated character Ü or combine the letter U with the diacritic character ¨.

So the word "Hütte" could be 6 letters (Unicode symbols) long, depending on the exact encoding and how length() is implemented? Not sure, just looking at Wikipedia: https://en.wikipedia.org/wiki/Combining_character


Replies are listed 'Best First'.
Re^3: incorrect length of strings with diphthongs
by choroba (Cardinal) on Aug 30, 2022 at 15:42 UTC
    That's true:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use Unicode::Normalize qw{ normalize };

    my $char = "\N{LATIN SMALL LETTER U WITH DIAERESIS}";
    binmode *STDOUT, ':encoding(UTF-8)';
    print normalize($_, $char), ' ' for qw( D C );

    Running the output through xxd:

    00000000: 75cc 8820 c3bc 20 u.. ..
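For anyone without xxd at hand, here is a minimal sketch that lists the code points of each normalization form directly. The helper name `codepoints` is my own invention, not part of Unicode::Normalize; `NFD` and `NFC` are the module's shortcuts for `normalize('D', ...)` and `normalize('C', ...)`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw{ NFD NFC };

# Hypothetical helper: format each code point of a string as U+XXXX.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord $_) } split //, $string;
}

my $char = "\N{LATIN SMALL LETTER U WITH DIAERESIS}";
print 'NFD: ', codepoints(NFD($char)), "\n";   # base letter u plus COMBINING DIAERESIS
print 'NFC: ', codepoints(NFC($char)), "\n";   # the single precomposed character
```

The bytes in the xxd dump above are the UTF-8 encodings of these code points: 75 cc 88 for the decomposed pair and c3 bc for the precomposed letter.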

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re^3: incorrect length of strings with diphthongs
by LanX (Sage) on Aug 30, 2022 at 17:34 UTC
    Yes, I'd say it's similar with the "ethnic" modifiers of face emojis.

    But my expectation is that those modifiers don't count as characters and have length 0, i.e. "Hütte" should have length 5 in both incarnations.

    > how length() is implemented?

    I may be wrong tho...

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      #!/usr/bin/perl
      use strict;
      use feature qw{ say };
      use warnings;
      use Unicode::Normalize qw{ normalize };
      use Unicode::GCString;

      my $char = "\N{LATIN SMALL LETTER U WITH DIAERESIS}";
      binmode *STDOUT, ':encoding(UTF-8)';
      for (qw( D C )) {
          my $n = normalize($_, $char);
          my $gcs = 'Unicode::GCString'->new($n);
          say join ' ', length($n), $n =~ s/(\X)/$1/g, $1, $gcs->chars, $gcs->columns, $gcs->length;
      }
      2 1 ü 2 1 1
      1 1 ü 1 1 1
      

      Update: Added the output.
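If installing Unicode::GCString is not an option, the `\X` trick from the code above can be turned into a small counter using core modules only. This is just a sketch, and `grapheme_count` is a made-up name; it counts `\X` matches (extended grapheme clusters) rather than code points.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw{ NFD NFC };

# Hypothetical helper: count extended grapheme clusters by counting \X matches.
sub grapheme_count {
    my ($string) = @_;
    my $count = () = $string =~ /\X/g;
    return $count;
}

my $word = "H\N{LATIN SMALL LETTER U WITH DIAERESIS}tte";
for my $form (NFD($word), NFC($word)) {
    printf "code points: %d, grapheme clusters: %d\n",
        length($form), grapheme_count($form);
}
```

For the decomposed form, length() reports 6 code points, while the cluster count is 5 for both forms.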

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        Interesting, looks like code.

        I might even be able to install those modules and try to understand the output you didn't provide (yet)!

        ;-P

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

      > I may be wrong tho...

      I certainly am...

      #!/usr/bin/perl
      use v5.12;
      use strict;
      use utf8;
      use Devel::Peek;

      my $trema = "\N{COMBINING DIAERESIS}";
      binmode *STDOUT, ':encoding(UTF-8)';
      my $huette = "Hu${trema}tte";
      Dump $huette;
      say "$huette\'s length: ". length($huette);

      SV = PV(0x25f4a58) at 0x25266b8
        REFCNT = 1
        FLAGS = (POK,pPOK,UTF8)
        PV = 0x28da368 "Hu\314\210tte"\0 [UTF8 "Hu\x{308}tte"]
        CUR = 7
        LEN = 10
      Hütte's length: 6

      That's what it looks like without code tags:

      Hütte's length: 6
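If the goal is for length() to report 5 here, one option is to compose the string to NFC before measuring it. A minimal sketch, with `composed_length` as a hypothetical helper name; note that this only helps where Unicode has a precomposed character for the combination.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw{ NFC };

# Hypothetical helper: length in code points after composing to NFC.
sub composed_length {
    my ($string) = @_;
    return length(NFC($string));
}

my $huette = "Hu\N{COMBINING DIAERESIS}tte";
print 'raw: ', length($huette), ', after NFC: ', composed_length($huette), "\n";
```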

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

Re^3: incorrect length of strings with diphthongs
by LanX (Sage) on Aug 30, 2022 at 21:15 UTC
    It's a matter of debate whether u + ¨ is an umlaut; it really depends on the definition of umlaut.

    Interestingly it's possible to combine ü + ¨ to pile up tremas

    Hü̈tte Hü̈̈tte

    I see huts with smoking chimneys ;-)
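A rough sketch of what piling up does to the counts, assuming the tremas are stacked as COMBINING DIAERESIS (U+0308) on a plain u; `stacked_huette` is a made-up helper. Each extra trema adds one code point, but the whole stack still matches as a single `\X` cluster.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: "Hutte" with $n combining diaereses stacked on the u.
sub stacked_huette {
    my ($n) = @_;
    return 'Hu' . ("\N{COMBINING DIAERESIS}" x $n) . 'tte';
}

binmode *STDOUT, ':encoding(UTF-8)';
for my $n (1 .. 3) {
    my $word     = stacked_huette($n);
    my $clusters = () = $word =~ /\X/g;   # count extended grapheme clusters
    printf "%s: length %d, clusters %d\n", $word, length($word), $clusters;
}
```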

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      There's also ű in Hungarian (called "double acute" in Unicode).

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        Using two bars instead of dots is acceptable in German; there are fonts which render the umlauts ä, ö, ü this way. (see also this )

        That's because of the weird form of Kurrent's small e being written above the vowels.

        Apparently this is also connected to the history of the Czech letter Ů ů with an o superscript.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery
