Re^3: Curious about Perl's strengths in 2018
by raiph (Deacon) on May 20, 2018 at 19:48 UTC
Updated in 2020: Switched the language name to Raku. My perspective is that Raku is in the Perl family, and when I wrote this comment in 2018 it used the family name. But it's now Raku.
To keep my commentary as short as it can be while still doing the topic justice, I've sharply narrowed discussion to: characters in a Unicode string; Raku; and Python 3. See my notes at the end for discussion of this narrowing.
What's a character?
For a while last century, "character", in the context of computing, came close to being synonymous with an ASCII byte.
But that was always a string implementation detail, one resting on an assumption that breaks in the general case. A character is not an ASCII byte unless you stick to a very limited view of text, one that ignores most of the world's writing, including even English text once it contains arbitrary Unicode characters (e.g. tweets that may look English but are allowed to contain any character).
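A quick stdlib Python 3 illustration of the byte/character mismatch, using an accented letter as the example:

```python
# One user-perceived character, 'é' (U+00E9), occupies two bytes in UTF-8.
s = '\u00e9'                    # 'é'
print(len(s.encode('utf-8')))   # 2 bytes
print(len(s))                   # 1 codepoint
```

Any byte-oriented length or indexing operation on the UTF-8 encoding will therefore miscount even this simple string.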
For a while this century, "character", in the context of contemporary mainstream programming languages and developer awareness, has come close to being synonymous with a Unicode codepoint.
Unfortunately, assuming a codepoint is a character in the ordinary sense is again a broken assumption in the general case. Even if you're dealing with Unicode text, a character does not correspond to a Unicode codepoint unless you again stick to a sharply limited view of text and characters, one that excludes arbitrary Unicode text.
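The codepoint/character mismatch is just as easy to demonstrate in Python 3. Here a single on-screen character is built from several codepoints:

```python
# The "woman astronaut" emoji is one user-perceived character made of
# three codepoints: WOMAN (U+1F469) + ZERO WIDTH JOINER + ROCKET (U+1F680).
s = '\U0001F469\u200D\U0001F680'
print(len(s))  # 3: Python counts codepoints, not user-perceived characters
```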
"What a user thinks of as a character"
So, just what is a "character" given some Unicode string?
If we're talking about Unicode, it's helpful to consider Unicode's precisely chosen vocabulary for describing text, and in particular, characters.
Unicode's definition of "what a user thinks of as a character" translates, in digital terms, to it being a sequence of codepoints selected according to rules (algorithms) and data primarily defined by Unicode.
So a character might be just one codepoint -- or it might be many.
Text processing can't properly distinguish the characters in a text string unless it iterates through the string, calculating the start and end of each individual character according to the relevant general Unicode rules and data (plus any locale- and/or application-specific overrides).
This latter reality -- an individual character can consist of multiple codepoints -- is why a character=codepoint assumption is a 21st century mistake analogous to the 20th century mistake of assuming character=byte.
The codepoint=character assumption allows for fast indexing -- but it's increasingly often wrong, leading to broken code and corrupt data.
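To give a feel for what that iteration involves, here is a toy sketch in stdlib Python. It is deliberately simplified and non-conforming: it only folds combining marks into the preceding base character, whereas real UAX #29 segmentation has many more rules (Hangul jamo, emoji ZWJ sequences, regional indicators, and so on). The function name is mine, not from any library:

```python
import unicodedata

def approx_graphemes(s):
    """Very rough character segmentation: fold combining marks
    (categories Mn/Mc/Me) into the preceding base character.
    NOT a conforming UAX #29 implementation; illustrative only."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch) in ('Mn', 'Mc', 'Me'):
            clusters[-1] += ch   # extend the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

# Two codepoints, one cluster:
print(approx_graphemes('\u0937\u093F'))  # ['षि']
print(len(approx_graphemes('\u0937\u093F')))  # 1
```

Even this crude version shows why grapheme-aware indexing is O(n) unless the string representation itself tracks cluster boundaries, which is what Raku's NFG does.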
What Perl, Raku, and Python (2 + 3) think of as a character
Armed with the knowledge that Unicode uses the word "grapheme" to denote "what a user thinks of as a character" rather than a byte or a codepoint, one can begin to get some sense of the level of support for this level of character handling in any given programming language by searching within its resources for "grapheme".
Google searches for "grapheme+<prog-lang-web-home>" and "grapheme+<prog-lang-goes-here>" can give a sense of that level of support; the commentary that follows reflects the state of things when I did these searches in 2018.
An example that's No F💩💩king Good
Given the discussion thus far, it should come as no surprise that the built-in string type, functions, and standard libraries of both Python 2 and Python 3 yield the wrong results for string length, character indexing, and substrings (and for any functionality that relies on those results) if A) what you're interested in is character=grapheme processing rather than character=codepoint processing and B) a string contains a grapheme that isn't a single codepoint.
One fun way to see this in action is to view Patrick Michaud's lightning talk about text processing that's No F💩💩king Good. If you don't have 5 minutes, the following link takes you right to the point where Patrick spends 30 seconds trying Python 3. Of the three simple tests in the talk, Python 3 gets two "wrong".
Part of the fun has emerged since I first wrote this perlmonks comment. It turns out that this example may itself be No F**king Good, in a manner not at all intended by Jonathan Worthington, who wrote the presentation, or by me when I originally included it here. Prompted by a reader who challenged several aspects of this post, including this one, my brief investigation thus far suggests that the specific example of a D with double dots is actually a "degenerate case" -- one that "never occurs in practice", or at least one that will generally only occur in artificial/accidental scenarios such as the test in the video.
(It looks like it may have been naively taken from the "Basic Examples" table in Unicode annex #15, on the mistaken assumption that it's not degenerate, when instead (perhaps) it's in the table as an example in which normalization to a single character is not appropriate because that character doesn't appear in practice. If you, dear reader, can confirm or deny its degenerate nature, please comment.)
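One aspect of the degeneracy claim is easy to check with stdlib Python: Unicode defines no precomposed D-with-diaeresis codepoint, so NFC normalization leaves the two-codepoint sequence alone:

```python
import unicodedata

s = 'D\u0308'  # LATIN CAPITAL LETTER D + COMBINING DIAERESIS, renders as "D̈"
print(len(s))  # 2 codepoints

# NFC composition leaves it unchanged: there is no precomposed "D̈" in Unicode.
print(len(unicodedata.normalize('NFC', s)))  # still 2
```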
A better example?
Does the above No F💩💩king Good example mean the thrust of this post -- about character=grapheme vs character=codepoint -- is essentially invalid? No. While D with double dots may be an especially poorly chosen example, the problem does occur for a large number of non-degenerate characters.
Consider the reported length of the string "षि". This string contains text written in Devanagari, one of the world's most used scripts. When you try selecting the text using your web browser, how many characters appear to be inside the quotes? For me it's one.
The Python code print(len(unicodedata.normalize('NFC',u'षि'))) (after import unicodedata), when run in Python 2 and Python 3, returns 2. The Raku code say 'षि'.chars returns 1. The Raku code is simpler and, much more importantly, correct.
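Spelled out with stdlib Python, here is the codepoint-level view of that same string:

```python
import unicodedata

s = '\u0937\u093F'  # "षि": DEVANAGARI LETTER SSA + DEVANAGARI VOWEL SIGN I
print(len(s))  # 2: Python counts codepoints
print(len(unicodedata.normalize('NFC', s)))  # 2: no precomposed form exists
print([unicodedata.name(c) for c in s])
```

Unlike the D-with-double-dots case, this combination occurs routinely in real Devanagari text, so the miscount is not a degenerate curiosity.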
Or is it correct? To further complicate matters, the number of graphemes in a string sometimes depends on the particular text, the human looking at it (this is not a joke), and the application context! For further insight into this read a reddit exchange.
Raku has not yet addressed tailored grapheme clusters in its core. So, while it's much easier to use than Perl and Python 3 for many cases, it's still got work to do.
Does it matter that Raku's character and substring accessing time is O(1)?
If you watch the whole of Patrick's (5 minute) talk you'll see he covers the point that Raku has "O(1) substring, index, etc.".
But for most things, other langs are faster than Raku -- a lot faster. So does O(1) indexing matter?
Imo it does. It took years to get the architecture of Raku and the Rakudo compiler right, but the initial decade of design work is now in the past. NFG, along with all the other innovative and/or difficult main elements in Raku and Rakudo, are in place and getting steadily better and faster.
If character processing in general matters, then presumably O(1) character indexing, substring processing, and regexing matters. And if so, the Raku and nqp languages, and the Rakudo / NQP / MoarVM compiler stack, are all in a great place given that they're the first (and I believe only) programming languages and compiler stack in the world with O(1) grapheme-level performance for these operations.
(As far as I know, the indexing, substring, and regexing performance of Swift and Elixir -- the only other languages I'm aware of that have adopted "what a user thinks of as a character" as their standard string type's character abstraction -- is still O(n) or worse.)
What about third-party add-ons for this functionality in Python?
The primary source of guidance, reference implementations, and locale specific data related to Unicode, including annex #29, is ICU (code in C/C++ and Java) and CLDR (locale specific data related to text segmentation, including of characters). Many languages rely on bindings/wrappers of these resources for much of their Unicode support.
In the Python case the PyICU project is a binding/wrapper with a long history that credibly (to me, just an onlooker) claims production status.
I'm unsure about the status of other projects. The pure-Python uniseg repository includes a PR, and a reply to that PR, from this year, but the code itself hasn't been updated since 2015; Unicode has since substantially updated annex #29 in ways that require conforming implementations to change. Another simpler but newer library is grapheme, as introduced in this blog post. In some ways this is the most promising library I found. That said, it's currently marked as Alpha status.
Note that neither PyICU nor uniseg nor grapheme provides anything remotely like the ergonomic simplicity and deep integration that the Raku language provides for character=grapheme indexing, substring handling, regexing, etc.
Furthermore, ICU -- and thus any module that builds directly on its code, which I believe is true of PyICU, uniseg, and grapheme -- does not provide O(1) grapheme-based indexing, substring, and regexing performance. (cf the grapheme library's comment that "Execution times may improve in later releases, but calculating graphemes is and will continue to be notably slower than just counting unicode code points".)
Perhaps my overall point has gotten lost as I've tried to provide substantive detail.
The bottom line is that Perl has long been a leader in text processing capabilities and in that regard, as in many others, it's in great shape, including and perhaps especially in how it compares with Python.
Sorry it took me so long to spot your reply and write this comment. (And because of that I'm not going to simultaneously start another sub-thread about another topic as I originally said I would if you replied. Let's see if you spot this reply and then maybe we can wrap this sub-thread first and only start another if we're both interested in doing so.)
To keep my commentary as short as it can be while still doing the topic justice, I sharply narrowed discussion above to characters in a Unicode string; Perl 6; and Python 3: