PerlMonks  

Re: incorrect length of strings with diphthongs

by LanX (Saint)
on Aug 25, 2022 at 14:02 UTC ( [id://11146412] )


in reply to incorrect length of strings with diphthongs

I'm not sure what your goal is...

"Hütte" has 5 Unicode characters but no diphthongs:

use v5.12;
use warnings;
use utf8;    # treat source code as UTF-8, including string literals

my $t = "Hütte";
say length($t);    # 5
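
For contrast, here is a minimal sketch of what length() reports when the same word is held as raw UTF-8 bytes instead of decoded characters. It assumes the surprising length in the original question came from counting bytes, which may not be what actually happened there. Encode is a core module, and the "ü" is written as an escape so the example does not depend on how the file is saved.

```perl
use v5.12;
use warnings;
use Encode qw(encode decode);

my $chars = "H\x{FC}tte";                 # "Hütte" as a character string
my $bytes = encode('UTF-8', $chars);      # the same text as raw UTF-8 bytes

say length($chars);                       # 5
say length($bytes);                       # 6, because "ü" takes two bytes in UTF-8
say length(decode('UTF-8', $bytes));      # 5 again after decoding
```

So if input arrives undecoded (from a file, STDIN or @ARGV), decoding it first is usually what brings length() back in line with the number of characters.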

FWIW:
"ü" is not a diphthong but an umlaut. Neither is the transcription "ue", because it's still pronounced as one vowel, not two.

Compare "au" (e.g. "Braun"), which is a diphthong (di = two).

Umlauts are not alien to the English language; there are just no formalized characters for them.

Compare the switch from "foot" to "feet", or "mouse" to "mice"

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

Update

Restructuré

Re^2: incorrect length of strings with diphthongs
by Anonymous Monk on Aug 26, 2022 at 16:59 UTC

    "Diphthong" may be a thinko for "diacritic."

    And the Unicode Consortium calls those two dots a "diaeresis." A German umlaut looks the same, but has the function, more or less, of appending an "e" to the marked vowel. A diaeresis, in the languages I know that use it (English and Spanish), is placed over the second vowel to indicate that it is not participating in a diphthong but is pronounced separately. They seem to be much less used these days in English, but in times past you wrote "coöperate" to indicate that the word was "co-op-er-ate", not "coop-er-ate."

    Sorry for the grammar pedantry, but on the chance the OP was not a native English speaker I thought I would try to clarify the terminology, even though we all know what you meant.

      > A German umlaut looks the same, but has the function, more or less, of appending an "e" to the marked vowel.

      Less or more, there are three things called Umlaut

      1. the two points, aka diaeresis or trema (Anglo-Saxons)
      2. the vowels ä, ö, ü (Germans)
      3. the phonetic phenomenon (linguists)

      see Umlaut_(disambiguation)

      Umlauts in German were originally denoted by a superscript e written above and the small e degraded to two points². But that doesn't mean appending an e in the sense of a diphthong.

      The Proto-Germanic words for "foot/feet" (DE: Fuß/Füße) were something like "fōts/fōtiz", without sound alteration of the first vowel.

      At some point people were too lazy and assimilated the back vowel "u" to the following "i", i.e. the mouth and lips still formed "oo" while pronouncing "ee".

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      drivel ...

      Interestingly, the tendency to pronounce the "ü" that way seems to depend on the European region: the French and Dutch pronunciation of "u" is pretty much like the German "ü", while eastern European varieties of German, especially Yiddish, lose it again ("Fis" or "Fisle"° for "feet").

      °) where the "le" is a diminutive characteristic for South-German dialects, compare "Müesli" from Switzerland, interestingly with a "üe" diphthong which doesn't exist in Standard German. Germans will say "Müsli" which in turn means "little Mouse" in Swiss-German xD

      ²) or two vertical bars. The "e" in Kurrent looks similar to '11', no idea why.

      They seem to be much less used these days in English, but in times past you wrote "coöperate"

      I am a native speaker of English and have to agree that they are not seen as much as previously. I put this down to very poor support for any sort of accents in word-processing software aimed at the English-speaking market up until maybe 10 years ago. However, I must also say that I don't recall ever seeing a diaeresis in coöperate, although plenty of times I have seen it with a hyphen to obtain the same effect, i.e. "co-operate".

      Some words still look strange to me when I see them unadorned such as: Noël, naïve, Zoë, etc. Perhaps that too will fade with time.


      🦛

      What hippo said. I write résumé, naïve, piñon, façade, antennæ, and such because I’m a typographer by past trade and have always used Macs where it’s easy to find such things. I don’t remember ever seeing “coöperate,” even in historical text.

      "Diphthong" may be a thinko for "diacritic.", that was my unforgivable fault. ;-)

      But at least I learned the neat new word "thinko". :-)


      Is simplicity best or simply the easiest Martin L. Gore
Re^2: incorrect length of strings with diphthongs
by cavac (Parson) on Aug 30, 2022 at 15:02 UTC

    I may be mistaken, but aren't there (at least) two ways to encode an Umlaut in Unicode? You could either use the dedicated character Ü or combine the letter U with the diacritic character ¨

    So the word "Hütte" could be 6 letters (unicode symbols) long, depending on the exact encoding and how length() is implemented? Not sure, just looking at Wikipedia: https://en.wikipedia.org/wiki/Combining_character

    PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP
      That's true:
      #!/usr/bin/perl
      use warnings;
      use strict;
      use Unicode::Normalize qw{ normalize };

      my $char = "\N{LATIN SMALL LETTER U WITH DIAERESIS}";
      binmode *STDOUT, ':encoding(UTF-8)';
      print normalize($_, $char), ' ' for qw( D C );

      Running the output through xxd:

      00000000: 75cc 8820 c3bc 20 u.. ..
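
Reading the dump: 75 is "u", cc 88 is the UTF-8 encoding of U+0308 (COMBINING DIAERESIS), c3 bc is the UTF-8 encoding of U+00FC (the precomposed "ü"), and each 20 is a space. The same thing can be seen without xxd by listing the code points of each form. This is only a sketch; the helper code_points() is a made-up name for illustration, not part of Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper: render every code point of a string as U+XXXX
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $string;
}

my $char = "\x{FC}";                      # LATIN SMALL LETTER U WITH DIAERESIS

print code_points(NFD($char)), "\n";      # U+0075 U+0308
print code_points(NFC($char)), "\n";      # U+00FC
```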

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      Yes, I'd say it's similar with the "ethnic" modifiers of face emojis.

      But my expectation is that those modifiers don't count as characters and have length 0, i.e. "Hütte" should have length 5 in both incarnations.

      > how length() is implemented?

      I may be wrong tho...

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        #!/usr/bin/perl
        use strict;
        use feature qw{ say };
        use warnings;
        use Unicode::Normalize qw{ normalize };
        use Unicode::GCString;

        my $char = "\N{LATIN SMALL LETTER U WITH DIAERESIS}";
        binmode *STDOUT, ':encoding(UTF-8)';
        for (qw( D C )) {
            my $n = normalize($_, $char);
            my $gcs = 'Unicode::GCString'->new($n);
            say join ' ',
                length($n), $n =~ s/(\X)/$1/g, $1,
                $gcs->chars, $gcs->columns, $gcs->length;
        }
        2 1 ü 2 1 1
        1 1 ü 1 1 1
        

        Update: Added the output.

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        > I may be wrong tho...

        I certainly am...

        #!/usr/bin/perl
        use v5.12;
        use strict;
        use utf8;
        use Devel::Peek;

        my $trema = "\N{COMBINING DIAERESIS}";
        binmode *STDOUT, ':encoding(UTF-8)';
        my $huette = "Hu${trema}tte";
        Dump $huette;
        say "$huette\'s length: ". length($huette);

        SV = PV(0x25f4a58) at 0x25266b8
          REFCNT = 1
          FLAGS = (POK,pPOK,UTF8)
          PV = 0x28da368 "Hu\314\210tte"\0 [UTF8 "Hu\x{308}tte"]
          CUR = 7
          LEN = 10
        Hütte's length: 6
        That's what it looks like without code tags:

        Hütte's length: 6
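
If the goal is the number of letters a reader sees rather than the number of code points, two common approaches are counting extended grapheme clusters with \X or normalizing to NFC before calling length(). A minimal sketch, with the combining mark written as an escape so the result does not depend on the file's encoding:

```perl
use v5.12;
use warnings;
use Unicode::Normalize qw(NFC);

my $decomposed = "Hu\x{308}tte";          # u followed by COMBINING DIAERESIS

# length() counts code points, so the combining mark is counted on its own
say length($decomposed);                  # 6

# \X matches one extended grapheme cluster (a base character plus its marks)
my $graphemes = () = $decomposed =~ /\X/g;
say $graphemes;                           # 5

# NFC composes u + U+0308 into the single code point U+00FC
say length(NFC($decomposed));             # 5
```

Note that NFC only helps where Unicode defines a precomposed character; counting with \X works either way.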

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

      It's a matter of debate whether u + ¨ is an umlaut; that really depends on the definition of umlaut.

      Interestingly it's possible to combine ü + ¨ to pile up tremas

      Hü̈tte Hü̈̈tte

      I see huts with smoking chimneys ;-)
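
A small sketch of what piling up does to the counts, assuming the stacked version is built from the precomposed ü plus two extra combining marks: every added trema is one more code point for length(), while the whole pile still counts as a single grapheme cluster.

```perl
use v5.12;
use warnings;

# Precomposed ü (U+00FC) followed by two extra COMBINING DIAERESIS marks (U+0308)
my $stacked = "H\x{FC}" . ("\x{308}" x 2) . "tte";

say length($stacked);                     # 7 code points

my $clusters = () = $stacked =~ /\X/g;
say $clusters;                            # 5 grapheme clusters
```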

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        There's also ű in Hungarian (called "double acute" in Unicode).

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
