Re^2: Unicode vulgar fraction composition

by tobyink (Canon)
on Sep 26, 2020 at 09:55 UTC


in reply to Re: Unicode vulgar fraction composition
in thread Unicode vulgar fraction composition

One way of thinking about it, in a simplified ASCII world, would be if you lowercased words to do a case comparison:

    chomp( my $name = lc <$fh> );
    if ( $name eq 'bob jones' ) {
        die 'rejecting annoying person';
    }
    # Now I want to restore $name to its original mixture of upper and lower case
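To make the analogy concrete, here is a minimal sketch of the usual way around the problem: compare against a lowercased copy and never overwrite the original. The helper name is_rejected and the sample name are assumptions for illustration, and Perl 5.16 or later is assumed.

```perl
use v5.16;      # assumed minimum Perl; also enables strict
use warnings;

# Hypothetical helper: compare on a lowercased copy, leave the caller's string alone.
sub is_rejected {
    my ($name) = @_;
    return lc($name) eq 'bob jones';
}

my $name = 'Bob JONES';    # stands in for a line read from <$fh>
say is_rejected($name) ? 'rejected' : 'accepted';
say $name;                 # the original casing is still available here

# Why lc is lossy: distinct inputs collapse to the same output,
# so nothing in the output says which input it came from.
say lc('Bob Jones') eq lc('BOB JONES') ? 'same after lc' : 'different after lc';
```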

Re^3: Unicode vulgar fraction composition
by ikegami (Patriarch) on Sep 28, 2020 at 02:16 UTC

    Good analogy (though you really want fc instead of lc to perform a case-insensitive comparison).

      For ASCII, fc does the same thing as lc though. And I specified ASCII for that reason.

        I know. That doesn't change anything.
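For readers following the lc versus fc exchange, a small sketch of where the two diverge. It assumes Perl 5.16 or later (for fc); the helper same_ci and the sample strings are made up for illustration.

```perl
use v5.16;      # enables the fc and unicode_strings features
use warnings;

# Hypothetical helper: case-insensitive equality via full case folding.
sub same_ci {
    my ($x, $y) = @_;
    return fc($x) eq fc($y);
}

my $strasse = "Stra\N{LATIN SMALL LETTER SHARP S}e";

# On pure ASCII input the two functions give the same result.
say lc('Bob Jones') eq fc('Bob Jones') ? 'ASCII: lc and fc agree' : 'ASCII: lc and fc differ';

# Outside ASCII they can differ: fc folds SHARP S to "ss", lc leaves it alone.
say lc($strasse) eq lc('STRASSE') ? 'lc: equal' : 'lc: not equal';
say same_ci($strasse, 'STRASSE')  ? 'fc: equal' : 'fc: not equal';
```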

Re^3: Unicode vulgar fraction composition
by raygun (Scribe) on Oct 05, 2020 at 07:57 UTC

    Sure, I think it's intuitive why lc('Boaty McBoat') is conceptually a "lossy" transformation (in terms of being able to restore the original string).

    But NFKC("\N{VULGAR FRACTION THREE EIGHTHS}") is conceptually "lossless": there is only one Unicode character the resultant string "3\N{FRACTION SLASH}8" could be "composed" into.

    As I wrote, I get now why NFKC is conceptually lossy in general. But—unlike with lc—some specific decompositions are exceptions.
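A short sketch of the decomposition being discussed, using the core Unicode::Normalize module. Perl 5.16 or later is assumed so that \N{...} names work without loading charnames explicitly; the variable names are illustrative.

```perl
use v5.16;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

my $vulgar     = "\N{VULGAR FRACTION THREE EIGHTHS}";
my $decomposed = NFKC($vulgar);

# List the code points of the normalized string.
my $code_points = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $decomposed;
say $code_points;    # U+0033 U+2044 U+0038

# The mapping is a compatibility decomposition, so NFC does not put it back together.
say NFC($decomposed) eq $vulgar ? 'recomposed' : 'not recomposed';
```

So even where a decomposition looks reversible on paper, no normalization form performs the reverse step; any recomposition has to be written by hand.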

      Consider:
      • 123\N{FRACTION SLASH}8
      • 12\N{VULGAR FRACTION THREE EIGHTHS}
      I would read the former as "one hundred twenty-three eighths", but the latter as "twelve (plus) three eighths", so it's not completely a one-to-one relationship.

        Yes, my understanding is that's how Unicode would have you interpret each of those.

        So the problem then becomes that running NFKC on the latter produces the former: a nonequivalent string, therefore erroneous output. The correctly decomposed form of "12\N{VULGAR FRACTION THREE EIGHTHS}" would be, I presume, "12\N{ZERO WIDTH NON-JOINER}3\N{FRACTION SLASH}8". (Whether this is a bug or merely a "gotcha" in NFKC I suppose is a matter of interpretation.)

        But point taken that context matters when composing vulgar fractions.
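The collision described above can be checked directly. This is a minimal sketch assuming Perl 5.16 or later and the core Unicode::Normalize module; the variable names are illustrative.

```perl
use v5.16;
use warnings;
use Unicode::Normalize qw(NFKC);

my $mixed    = "12\N{VULGAR FRACTION THREE EIGHTHS}";   # twelve and three eighths
my $improper = "123\N{FRACTION SLASH}8";                # one hundred twenty-three eighths

# NFKC expands the vulgar fraction in place, so the leading "12" runs into the "3".
say NFKC($mixed) eq $improper ? 'NFKC makes them identical' : 'NFKC keeps them distinct';

# A string with a ZERO WIDTH NON-JOINER between "12" and "3" is a different
# string, and NFKC does not remove the separator.
my $separated = "12\N{ZERO WIDTH NON-JOINER}3\N{FRACTION SLASH}8";
say NFKC($separated) eq $improper ? 'separator lost' : 'separator kept';
```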

      There's no way to know that 3/8 means three-eighths. For example, it could mean March 8th. As such, there are two possible compositions for 3/8: VULGAR FRACTION THREE EIGHTHS and 3/8.

        Absolutely true if (as you wrote) a U+002F SOLIDUS appears between the 3 and the 8. This is why I've been limiting my scope to the case where a U+2044 FRACTION SLASH appears between them, i.e., the specific sequence that NFKC or NFKD decomposes a Unicode vulgar fraction into.
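One way the scoping above might be turned into code is sketched below. This is an assumption-laden illustration, not something Unicode or any module in this thread provides: the sub compose_three_eighths is hypothetical, it handles only this one fraction, and the "not adjacent to another digit" rule is a heuristic taken from the context point made earlier in the thread. Perl 5.16 or later is assumed.

```perl
use v5.16;
use warnings;

# Hypothetical helper: compose "3", FRACTION SLASH, "8" into the single
# vulgar fraction character, but only when the sequence is not adjacent to
# another digit. A plain U+002F SOLIDUS is never touched.
sub compose_three_eighths {
    my ($text) = @_;
    $text =~ s/(?<!\d)3\N{FRACTION SLASH}8(?!\d)/\N{VULGAR FRACTION THREE EIGHTHS}/g;
    return $text;
}

my $target = "\N{VULGAR FRACTION THREE EIGHTHS}";

say compose_three_eighths("3\N{FRACTION SLASH}8") eq $target ? 'composed' : 'left alone';
say compose_three_eighths("123\N{FRACTION SLASH}8") =~ /\Q$target\E/ ? 'composed' : 'left alone';
say compose_three_eighths('3/8') eq '3/8' ? 'solidus left alone' : 'solidus changed';
```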
