incorrect length of strings with diphthongs

tos has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: incorrect length of strings with diphthongs by jeffenstein (Hermit) on Aug 25, 2022 at 14:04 UTC
You need to add `use utf8;`:

`perl -we 'use Unicode::GCString;use Unicode::Normalize;$t="Hütte";print length("$t");$g=Unicode::GCString->new("$t");print $g->columns'`
`66`

`perl -we 'use utf8;use Unicode::GCString;use Unicode::Normalize;$t="Hütte";print length("$t");$g=Unicode::GCString->new("$t");print $g->columns'`
`55`

IIRC, the interpreter always uses ISO-8859-1 for the script unless you add `use utf8;`.
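A minimal sketch of why the count drops from 6 to 5, assuming the source file was saved as UTF-8. The octets are written out explicitly so the result does not depend on the file's own encoding, and the variable names are only illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of "Hütte", spelled out as raw octets so the
# example does not depend on how this file itself is saved.
my $octets = "H\xC3\xBCtte";

# Undecoded, every octet counts as one character.
print length($octets), "\n";    # 6

# Decoding turns the two octets C3 BC into the single character U+00FC.
my $chars = decode('UTF-8', $octets);
print length($chars), "\n";     # 5
```

Roughly speaking, `use utf8;` makes Perl do the same decoding for string literals in the source.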
Re: incorrect length of strings with diphthongs by kcott (Archbishop) on Aug 25, 2022 at 14:28 UTC
G'day tos,

Adding some additional code to print values and improve output, and a `-C` (see "perlrun: -C") so I don't see garbled output, I get:

`$ perl -C -wE 'use Unicode::GCString;use Unicode::Normalize;$t="Hütte"; say "\$t[$t]"; print length("$t"), "\n";$g=Unicode::GCString->new("$t"); say "\$g[$g]"; print $g->columns, "\n"; say $g->chars;'`
$t[HÃ¼tte]
6
$g[HÃ¼tte]
6
6

If I then tell Perl that the source code is written in UTF-8 (`use utf8;`):

`$ perl -C -wE 'use utf8; use Unicode::GCString;use Unicode::Normalize;$t="Hütte"; say "\$t[$t]"; print length("$t"), "\n";$g=Unicode::GCString->new("$t"); say "\$g[$g]"; print $g->columns, "\n"; say $g->chars;'`
$t[Hütte]
5
$g[Hütte]
5
5

Both of those outcomes seem reasonable to me. Does that help you at all? If not, please explain why you were expecting a `6` then a `5`.

-- Ken
Re: incorrect length of strings with diphthongs by LanX (Saint) on Aug 25, 2022 at 14:02 UTC
I'm not sure what your goal is... "Hütte" has 5 Unicode characters but no diphthongs:

`use v5.12;`
`use warnings;`
`use utf8; # treat source code as UTF-8, including string literals`
`my $t = "Hütte";`
`say length($t); # 5`

FWIW: "ü" is not a diphthong but an umlaut. The transcription "ue" isn't one either, because it's still pronounced as one vowel, not two. Compare "au" (e.g. "Braun"), which is a diphthong (di = two).

Umlauts are not alien to the English language; there are just no formalized characters for them. Compare the switch from "foot" to "feet", or "mouse" to "mice".

Cheers Rolf
(addicted to the Perl Programming Language :) Wikisyntax for the Monastery)

Update: Restructured
Re^2: incorrect length of strings with diphthongs by Anonymous Monk on Aug 26, 2022 at 16:59 UTC
"Diphthong" may be a thinko for "diacritic." And the Unicode Consortium calls those two dots a "diaeresis." A German umlaut looks the same, but has the function, more or less, of appending an "e" to the marked vowel. A diaeresis, in the languages I know that use it (English and Spanish), is placed over the second vowel to indicate that it is not participating in a diphthong but is pronounced separately. They seem to be much less used these days in English, but in times past you wrote "coöperate" to indicate that the word was "co-op-er-ate", not "coop-er-ate."

Sorry for the grammar pedantry, but on the chance the OP was not a native English speaker I thought I would try to clarify the terminology, even though we all know what you meant.
Re^3: incorrect length of strings with diphthongs by LanX (Saint) on Aug 27, 2022 at 20:37 UTC
> A German umlaut looks the same, but has the function, more or less, of appending an "e" to the marked vowel.

Less or more. There are three things called Umlaut: the two dots, aka diaeresis or trema (Anglo-Saxons); the vowels ä, ö, ü (Germans); and the phonetic phenomenon (linguists). See Umlaut_(disambiguation).

Umlauts in German were originally denoted by a superscript e written above the vowel, and the small e degraded to two dots². But that doesn't mean appending an e in the sense of a diphthong.

The Proto-Germanic words for "foot/feet" (DE: Fuß/Füße) were something like "fōts/fōtiz", without sound alteration of the first vowel. At some point people were too lazy and assimilated the back vowel "u" to the following "i", i.e. the mouth and lips still formed "oo" while pronouncing "ee".

Cheers Rolf
(addicted to the Perl Programming Language :) Wikisyntax for the Monastery)

drivel ...

Interestingly, the tendency to pronounce the "ü" that way seems to depend on European region: the French and Dutch pronunciation of "u" is pretty much like the German "ü", while eastern European varieties of German, especially Yiddish, lose it again ("Fis" or "Fisle"° for "feet").

°) where the "le" is a diminutive characteristic of South German dialects. Compare "Müesli" from Switzerland, interestingly with a "üe" diphthong which doesn't exist in Standard German. Germans will say "Müsli", which in turn means "little mouse" in Swiss German xD

²) or two vertical bars. The "e" in Kurrent looks similar to '11', no idea why.
Re^3: incorrect length of strings with diphthongs by hippo (Bishop) on Aug 26, 2022 at 21:48 UTC
> They seem to be much less used these days in English, but in times past you wrote "coöperate"

I am a native speaker of English and have to agree that they are not seen as much as previously. I put this down to very poor support for any sort of accents in word-processing software aimed at the English-speaking market up until maybe 10 years ago. However, I must also say that I don't recall ever seeing a diaeresis in coöperate, although plenty of times I have seen it with a hyphen to obtain the same effect, i.e. "co-operate".

Some words still look strange to me when I see them unadorned, such as: Noël, naïve, Zoë, etc. Perhaps that too will fade with time.

🦛
Re^3: incorrect length of strings with diphthongs by Your Mother (Archbishop) on Aug 27, 2022 at 00:17 UTC
What hippo said. I write résumé, naïve, piñon, façade, antennæ, and such because I’m a typographer by past trade and have always used Macs where it’s easy to find such things. I don’t remember ever seeing “coöperate,” even in historical text.
Re^3: incorrect length of strings with diphthongs by tos (Deacon) on Aug 29, 2022 at 10:25 UTC
> "Diphthong" may be a thinko for "diacritic."

That was my unforgivable fault. ;-) But because of it I learned the neat new word "thinko". :-)

Is simplicity best or simply the easiest
Martin L. Gore
Re^2: incorrect length of strings with diphthongs by cavac (Parson) on Aug 30, 2022 at 15:02 UTC
I may be mistaken, but aren't there (at least) two ways to encode an umlaut in Unicode? You could either use the dedicated character Ü or combine the letter U with the diacritic character ¨

So the word "Hütte" could be 6 letters (Unicode symbols) long, depending on the exact encoding and how length() is implemented?

Not sure, just looking at Wikipedia: https://en.wikipedia.org/wiki/Combining_character

PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP
Re^3: incorrect length of strings with diphthongs by choroba (Cardinal) on Aug 30, 2022 at 15:42 UTC
That's true:

`#!/usr/bin/perl`
`use warnings;`
`use strict;`
`use Unicode::Normalize qw{ normalize };`
`my $char = "\N{LATIN SMALL LETTER U WITH DIAERESIS}";`
`binmode STDOUT, ':encoding(UTF-8)';`
`print normalize($_, $char), ' ' for qw( D C );`

Running the output through xxd:

`00000000: 75cc 8820 c3bc 20 u.. ..`

`map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`
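To tie this back to the length question, here is a small sketch (not part of the original reply, variable names are illustrative) showing how core `length` reports the two normalization forms of the same word:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "H\N{U+00FC}tte";     # "ü" as the single code point U+00FC
my $decomposed  = NFD($precomposed);    # "u" followed by U+0308

print length($precomposed), "\n";       # 5
print length($decomposed), "\n";        # 6

# Normalizing back to NFC recombines the pair into one code point.
print length(NFC($decomposed)), "\n";   # 5
```

So the answer to "could it be 6?" is yes, if the string is in decomposed form when `length` sees it.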
Re^3: incorrect length of strings with diphthongs by LanX (Saint) on Aug 30, 2022 at 17:34 UTC
Yes, I'd say it's similar to the "ethnic" modifiers of face emojis. But my expectation is that those modifiers don't count as characters and have length 0, i.e. "Hütte" should have length 5 in both incarnations.

> how length() is implemented?

I may be wrong tho...

Cheers Rolf
(addicted to the Perl Programming Language :) Wikisyntax for the Monastery)
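For reference, a quick sketch of the difference between code points and what a reader perceives as characters. Core `length` counts code points, so a combining mark is counted; matching `\X` (an extended grapheme cluster) is one way to get the count expected above. The variable names are illustrative only:

```perl
use strict;
use warnings;

my $decomposed = "Hu\x{0308}tte";   # "u" plus combining diaeresis

# length() counts code points, so the combining mark is counted too.
print length($decomposed), "\n";    # 6

# \X matches one extended grapheme cluster (a base character together
# with its combining marks), so counting matches gives the visible length.
my $graphemes = () = $decomposed =~ /\X/g;
print $graphemes, "\n";             # 5
```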
Re^4: incorrect length of strings with diphthongs by choroba (Cardinal) on Aug 30, 2022 at 17:48 UTC
Re^5: incorrect length of strings with diphthongs by LanX (Saint) on Aug 30, 2022 at 20:06 UTC
Re^4: incorrect length of strings with diphthongs by LanX (Saint) on Aug 30, 2022 at 20:04 UTC
Re^3: incorrect length of strings with diphthongs by LanX (Saint) on Aug 30, 2022 at 21:15 UTC
It's a matter of debate whether `u + ¨` is an umlaut; that really depends on the definition of umlaut.

Interestingly, it's possible to combine `ü + ¨` to pile up tremas:

Hü̈tte Hü̈̈tte

I see huts with smoking chimneys ;-)

Cheers Rolf
(addicted to the Perl Programming Language :) Wikisyntax for the Monastery)
Re^4: incorrect length of strings with diphthongs by choroba (Cardinal) on Aug 30, 2022 at 21:22 UTC
Re^5: incorrect length of strings with diphthongs by LanX (Saint) on Aug 30, 2022 at 22:13 UTC
Re: incorrect length of strings with diphthongs by vr (Curate) on Aug 25, 2022 at 22:59 UTC
By accident, you'll get 65 for uppercase input, but not for the reason you seem to expect (which seems to be "a plain letter of any case followed by a zero-width combining diaeresis"). You didn't provide Perl with such input, nor should you, as other answers have already pointed out, and you rarely can, but see further. It just so happens that, among the 6 octets of UTF-8 encoded input, the latter of the pair representing uppercase "Ü" (`"\xC3\x9C"`) belongs to the `0x80..0x9F` range, which `Unicode::GCString` considers to have zero width. A few other UTF-8 encoded extended Latin characters would also demonstrate a "false positive" "correct" result. But not lowercase "ü": whether that is "alas" or "luckily" depends on your viewpoint.

For the correct but unnecessary "plain letter followed by combining diaeresis" input and the expected output of 2 unequal numbers, your input could have contained `"u\x{0308}"`, or `NFD "\N{U+00fc}"`, or `NFD 'ü'` under `use utf8;`, or `NFD "\N{LATIN SMALL LETTER U WITH DIAERESIS}"`, etc. For the reverse cure, assuming it's required at all on top of correct decoding, I'd expect `Unicode::Normalize::NFC` to be of use and `Unicode::GCString` to be unnecessary for simple plain or extended Latin, but YMMV.
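The octet arithmetic above can be checked with a short sketch. It only inspects the UTF-8 octets and makes no claim about the internals of `Unicode::GCString`; the helper `second_octet_in_c1_range` is a hypothetical name introduced here for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if the second UTF-8 octet of $char falls
# in 0x80..0x9F, the range described above.
sub second_octet_in_c1_range {
    my ($char) = @_;
    my @octets = unpack 'C*', encode('UTF-8', $char);
    return @octets > 1 && $octets[1] >= 0x80 && $octets[1] <= 0x9F;
}

for my $char ("\N{U+00DC}", "\N{U+00FC}") {    # "Ü", then "ü"
    my $hex = join ' ', map { sprintf '%02X', $_ }
        unpack 'C*', encode('UTF-8', $char);
    printf "U+%04X -> %s: %s\n", ord($char), $hex,
        second_octet_in_c1_range($char) ? 'in range' : 'not in range';
}
```

This prints `C3 9C` as in range for "Ü" and `C3 BC` as not in range for "ü", which matches the octets quoted above.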
Re: incorrect length of strings with diphthongs by tos (Deacon) on Aug 26, 2022 at 08:56 UTC
Thanks to Rolf, jeffenstein, Ken and vr for the profound explanations. A further reason for always trying to duck out of Unicode whenever I can.

It's interesting that even in French the term "Umlaut" is known, if one can believe dict.cc. Learning never stops :-)

Is simplicity best or simply the easiest
Martin L. Gore

