
incorrect length of strings with diphthongs

by tos (Deacon)
on Aug 25, 2022 at 13:07 UTC

tos has asked for the wisdom of the Perl Monks concerning the following question:

Hi, for formatting output I would like to get the correct character count of a string even if it contains diphthongs. For the following one-liner I would expect the output 65 and not 66. What am I doing wrong?
# perl -we 'use Unicode::GCString;use Unicode::Normalize;$t="Hütte";print length("$t");$g=Unicode::GCString->new("$t");print $g->columns'
66
Gruß Thomas

Is simplicity best or simply the easiest Martin L. Gore

Re: incorrect length of strings with diphthongs
by jeffenstein (Hermit) on Aug 25, 2022 at 14:04 UTC

    You need to add 'use utf8;'

    perl -we 'use Unicode::GCString;use Unicode::Normalize;$t="Hütte";print length("$t");$g=Unicode::GCString->new("$t");print $g->columns'
    66

    perl -we 'use utf8;use Unicode::GCString;use Unicode::Normalize;$t="Hütte";print length("$t");$g=Unicode::GCString->new("$t");print $g->columns'
    55

    IIRC, the interpreter always uses ISO-8859-1 for the script unless you add 'use utf8;'
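    The effect can be reproduced without depending on how the script file is saved. The following is only a minimal sketch using the core Encode module; the variable names are made up for illustration. It writes the UTF-8 bytes of "Hütte" with escapes and compares the length before and after decoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 bytes of "Hütte", written with escapes so the result does not
# depend on the encoding of this source file.
my $bytes = "H\xC3\xBCtte";

# Without decoding, Perl sees six single-byte characters.
my $byte_len = length($bytes);

# Decoding turns the two bytes C3 BC into the single character U+00FC.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length($chars);

print "bytes: $byte_len, characters: $char_len\n";
```

    This prints "bytes: 6, characters: 5", which matches the 6 seen without 'use utf8;' and the 5 seen with it.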

Re: incorrect length of strings with diphthongs
by kcott (Archbishop) on Aug 25, 2022 at 14:28 UTC

    G'day tos,

    Adding some additional code to print values and improve output, and a -C (see "perlrun: -C") so I don't see garbled output, I get:

    $ perl -C -wE 'use Unicode::GCString;use Unicode::Normalize;$t="Hütte"; say "\$t[$t]"; print length("$t"), "\n";$g=Unicode::GCString->new("$t"); say "\$g[$g]"; print $g->columns, "\n"; say $g->chars;'
    $t[Hütte]
    6
    $g[Hütte]
    6
    6
    

    If I then tell Perl that the source code is written in UTF-8 (use utf8;):

    $ perl -C -wE 'use utf8; use Unicode::GCString;use Unicode::Normalize;$t="Hütte"; say "\$t[$t]"; print length("$t"), "\n";$g=Unicode::GCString->new("$t"); say "\$g[$g]"; print $g->columns, "\n"; say $g->chars;'
    $t[Hütte]
    5
    $g[Hütte]
    5
    5
    

    Both of those outcomes seem reasonable to me. Does that help you at all? If not, please explain why you were expecting a 6 then a 5.

    — Ken

Re: incorrect length of strings with diphthongs
by LanX (Saint) on Aug 25, 2022 at 14:02 UTC
    I'm not sure what your goal is...

    "Hütte" has 5 unicode characters but no diphthongs:

    use v5.12;
    use warnings;
    use utf8;              # treat source-code as utf8 including string-literals

    my $t = "Hütte";
    say length($t);        # 5

    FWIW:
    "ü" is not a diphthong but an umlaut. Neither is the transcription "ue", because it is still pronounced as one vowel, not two.

    Compare "au" (e.g. "Braun"), which is a diphthong (di = two).

    Umlauts are not alien to the English language; there are just no formalized characters for them.

    Compare the switch from "foot" to "feet", or "mouse" to "mice".

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

    Update

    Restructuré

      "Diphthong" may be a thinko for "diacritic."

      And the Unicode Consortium calls those two dots a "diaeresis." A German umlaut looks the same, but has the function, more or less, of appending an "e" to the marked vowel. A diaeresis, in the languages I know that use it (English and Spanish), is placed over the second vowel to indicate that it is not participating in a diphthong but is pronounced separately. They seem to be much less used these days in English, but in times past you wrote "coöperate" to indicate that the word was "co-op-er-ate", not "coop-er-ate."

      Sorry for the grammar pedantry, but on the chance the OP was not a native English speaker I thought I would try to clarify the terminology, even though we all know what you meant.

        > A German umlaut looks the same, but has the function, more or less, of appending an "e" to the marked vowel.

        Less or more, there are three things called Umlaut:

        1. the two points, aka diaeresis or trema (Anglo-Saxons)
        2. the vowels ä,ö,ü (Germans)
        3. the phonetic phenomenon (linguists)

        see Umlaut_(disambiguation)

        Umlauts in German were originally denoted by a small e written above the vowel, and that small e later degraded to two points². But that doesn't mean appending an e in the sense of a diphthong.

        The Proto-Germanic words for "foot/feet" (DE: Fuß/Füße) were something like "fōts/fōtiz", without sound alteration of the first vowel.

        At some point people were too lazy and assimilated the back-vowel "u" to the following "i", i.e. the mouth and lips still formed "oo" while pronouncing "ee".

        Cheers Rolf

        drivel ...

        Interestingly, the tendency to pronounce the "ü" that way seems to depend on the European region: the French and Dutch pronunciation of "u" is pretty much like the German "ü", while eastern European varieties of German, especially Yiddish, lose it again ("Fis" or "Fisle"° for "feet").

        °) where the "le" is a diminutive characteristic for South-German dialects, compare "Müesli" from Switzerland, interestingly with a "üe" diphthong which doesn't exist in Standard German. Germans will say "Müsli" which in turn means "little Mouse" in Swiss-German xD

        ²) or two vertical bars. The "e" in Kurrent looks similar to '11', no idea why.

        > They seem to be much less used these days in English, but in times past you wrote "coöperate"

        I am a native speaker of English and have to agree that they are not seen as much as previously. I put this down to very poor support for any sort of accents in word-processing software aimed at the English-speaking market up until maybe 10 years ago. However, I must also say that I don't recall ever seeing a diaeresis in coöperate, although plenty of times I have seen it with a hyphen to obtain the same effect, ie. "co-operate".

        Some words still look strange to me when I see them unadorned such as: Noël, naïve, Zoë, etc. Perhaps that too will fade with time.


        🦛

        What hippo said. I write résumé, naïve, piñon, façade, antennæ, and such because I’m a typographer by past trade and have always used Macs where it’s easy to find such things. I don’t remember ever seeing “coöperate,” even in historical text.

        "Diphthong" may be a thinko for "diacritic.", that was my unforgivable fault. ;-)

        But because of it I learned the neat new word "thinko". :-)



      I may be mistaken, but aren't there (at least) two ways to encode an Umlaut in Unicode? You could either use the dedicated character Ü or combine the letter U with the diacritic character ¨

      So the word "Hütte" could be 6 letters (unicode symbols) long, depending on the exact encoding and how length() is implemented? Not sure, just looking at Wikipedia: https://en.wikipedia.org/wiki/Combining_character

      PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP
        That's true:
        #!/usr/bin/perl
        use warnings;
        use strict;

        use Unicode::Normalize qw{ normalize };

        my $char = "\N{LATIN SMALL LETTER U WITH DIAERESIS}";
        binmode *STDOUT, ':encoding(UTF-8)';
        print normalize($_, $char), ' ' for qw( D C );

        Running the output through xxd:

        00000000: 75cc 8820 c3bc 20 u.. ..

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        Yes, I'd say it's similar with the "ethnic" modifiers of face emojis.

        But my expectation is that those modifiers don't count as character and have length 0, i.e. "Hütte" should have length 5 in both incarnations.

        > how length() is implemented?

        I may be wrong tho...
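        The expectation above can be checked directly. This is a minimal sketch, assuming only the core Unicode::Normalize module; the helper name graphemes() is made up for illustration. As far as I know, length() counts code points, so a combining diaeresis does count there, while \X in a regex matches one extended grapheme cluster, i.e. one user-perceived character.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# "Hütte" built from the precomposed character U+00FC.
my $nfc = NFC("H\x{FC}tte");
# The same word with "u" followed by U+0308 COMBINING DIAERESIS.
my $nfd = NFD($nfc);

# Hypothetical helper: count extended grapheme clusters with \X.
sub graphemes {
    my ($s) = @_;
    my $n = () = $s =~ /\X/g;
    return $n;
}

printf "NFC: length %d, graphemes %d\n", length($nfc), graphemes($nfc);
printf "NFD: length %d, graphemes %d\n", length($nfd), graphemes($nfd);
```

        This prints length 5 and 5 graphemes for the NFC form, but length 6 and 5 graphemes for the NFD form, so the two incarnations only agree when counting grapheme clusters.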

        Cheers Rolf

        It's a matter of debate whether u + ¨ is an umlaut; that really depends on the definition of umlaut.

        Interestingly it's possible to combine ü + ¨ to pile up tremas

        Hü̈tte Hü̈̈tte

        I see huts with smoking chimneys ;-)

        Cheers Rolf

Re: incorrect length of strings with diphthongs
by vr (Curate) on Aug 25, 2022 at 22:59 UTC

    Accidentally, you'll get 65 for uppercase input, but not for the reason you seem to expect (which seems to be "a plain letter of any case followed by a zero-width combining diaeresis"). You didn't provide Perl with such input, nor should you, as other answers have already pointed out, and you rarely can, but see further. It just so happens that, among the 6 octets of UTF-8 encoded input, the second of the pair representing uppercase "Ü" ("\xC3\x9C") belongs to the 0x80..0x9F range, which Unicode::GCString considers to have zero width. A few other UTF-8 encoded extended Latin characters would also produce this "false positive" "correct" result. But not lowercase "ü"; whether that is "alas" or "luckily" depends on your viewpoint.

    To get the "plain letter followed by combining diaeresis" input (correct but unnecessary) and the two unequal numbers you expected, your input could have contained "u\x{0308}", or the NFD of "\N{U+00fc}", or the NFD of 'ü' under use utf8;, or the NFD of "\N{LATIN SMALL LETTER U WITH DIAERESIS}", etc. For the reverse cure, assuming it is required at all on top of correct decoding, I'd expect Unicode::Normalize::NFC to be of use, and Unicode::GCString to be unnecessary for simple plain or extended Latin, but YMMV.
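    The byte-range observation can be illustrated with core modules alone. This is a rough sketch that only inspects the UTF-8 octets; it does not call Unicode::GCString, so the zero-width behaviour itself is taken from the explanation above, and c1_bytes() is a made-up helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 octets of a character that fall
# into the 0x80..0x9F range mentioned above.
sub c1_bytes {
    my ($char) = @_;
    return grep { $_ >= 0x80 && $_ <= 0x9F } unpack 'C*', encode('UTF-8', $char);
}

my @upper = c1_bytes("\x{DC}");   # U+00DC, uppercase U with diaeresis
my @lower = c1_bytes("\x{FC}");   # U+00FC, lowercase u with diaeresis

print "upper: ", join(' ', map { sprintf '%02X', $_ } @upper), "\n";
print "lower: ", (@lower ? 'some' : 'none'), "\n";
```

    For "Ü" (octets C3 9C) this reports 9C, while for "ü" (octets C3 BC) it reports none, which is consistent with only the uppercase input losing a column.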

Re: incorrect length of strings with diphthongs
by tos (Deacon) on Aug 26, 2022 at 08:56 UTC

    Thanks to Rolf, jeffenstein, Ken and vr for the profound explanations.

    A further reason for always trying to duck out of Unicode whenever I can. It's interesting that even in French the term "Umlaut" is known, if one can believe dict.cc.

    Learning never stops :-)


