http://qs321.pair.com?node_id=1220390


in reply to Strange Unicode normalization question

but I have no evidence of NonspacingMark ever being in the normalized string.

There are three in the example you gave:

use strict;
use warnings qw( all );
use feature qw( say );

use utf8;
use open ':std', ':encoding(UTF-8)';

use charnames          qw( );
use Unicode::Normalize qw( NFKD );

my $html = "Lubomír,Bartoňová";
my $decomposedHtml = NFKD( $html );
say charnames::viacode(ord($_))
   for $decomposedHtml =~ /(\p{NonspacingMark})/g;

Output:

COMBINING ACUTE ACCENT
COMBINING CARON
COMBINING ACUTE ACCENT

The code you posted is a hack to find an ASCII "equivalent" to the input.

Re^2: Strange Unicode normalization question
by mje (Curate) on Aug 16, 2018 at 18:10 UTC

    Thanks again. This code was a bit of a mess, and your comments and the others' have helped me see what was going wrong. I apologise for not providing better information, but there was a lot of code for something which should have been quite simple. This is what the original code did:

    1. Opened data file with encoding(UTF-8)
    2. Read a line of comma separated strings from it and split them on the comma
    3. Put the split fields into a hash with keys describing the data
    4. Passed the hash to a hand-written function that tried to produce an application/x-www-form-urlencoded string, but this function was broken and instead just stuck an '&' between each key=value pair, so it wasn't form encoded at all
    5. Passed the resulting string into NFKD and did the substitution as I described earlier
    6. Passed the resulting string into encode to encode as UTF-8
    7. Passed the resulting string into a LWP POST

    So it was horribly broken because it did not form encode properly, and the NFKD was a workaround he discovered, which I suspect only works because the API does normalization itself (which would not surprise me). I replaced the hand-written (incorrect) form encoding with build_urlencoded from WWW::Form::UrlEncoded and, as you both state, the NFKD is a no-op, as is the substitution, and it works. The confusion arose because, when it originally didn't work (without the NFKD), API support apparently told him to turn diacritics into normal characters. The actual code was a lot more complicated than this, and the more I looked at it the more problems I found, so I've spent most of the day rewriting it.
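    As an illustration of what correct form encoding involves, here is a minimal sketch using only core modules. The helper names form_escape and form_encode and the field names forename and surname are hypothetical, not taken from the original code or the API. In real code a tested module such as WWW::Form::UrlEncoded, as used above, is the better choice.

```perl
use strict;
use warnings;
use utf8;

use Encode qw( encode );

# Hypothetical helper: percent-encode one key or value, starting from
# a decoded (character) string.
sub form_escape {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);    # characters -> UTF-8 bytes
    # Escape every byte except unreserved ones; space is handled below.
    $bytes =~ s/([^A-Za-z0-9*\-._ ])/sprintf('%%%02X', ord($1))/ge;
    $bytes =~ tr/ /+/;                     # form encoding writes space as '+'
    return $bytes;
}

# Hypothetical helper: join escaped key=value pairs with '&'.
# Keys are sorted only to make the output stable.
sub form_encode {
    my (%fields) = @_;
    return join '&',
        map { form_escape($_) . '=' . form_escape($fields{$_}) }
        sort keys %fields;
}

my $body = form_encode( forename => 'Lubomír', surname => 'Bartoňová' );
print "$body\n";    # forename=Lubom%C3%ADr&surname=Barto%C5%88ov%C3%A1
```

    The result is pure ASCII, so no separate encode step is needed before handing it to the POST, and the accented letters survive intact as %XX escapes of their UTF-8 bytes.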

    Thanks again for your insights.

Re^2: Strange Unicode normalization question
by Veltro (Hermit) on Aug 16, 2018 at 11:43 UTC

    Your answer makes sense, however the OP says $html is the url-encoded strings which I interpret as:

    ...
    my $html = "Lubom%C3%ADr%2CBarto%C5%88ov%C3%A1";
    my $decomposedHtml = NFKD( $html );
    ...

    Which doesn't make sense to me...

      They also said "Lubomír,Bartoňová" is passed through NFKD, which is the part I addressed.

      The OP wrote a lot, but said very little that can be used. I didn't think that asking for a clearer explanation would be useful, so I provided a starting point.

      I do agree that it makes no sense to pass URL-encoded or HTML-encoded text to NFKD. Escapes could prevent it from functioning correctly.
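      A small sketch of that last point, using the URL-encoded string quoted above. Once the text is escaped it is pure ASCII, so NFKD has nothing to decompose and no nonspacing marks can appear:

```perl
use strict;
use warnings;

use Unicode::Normalize qw( NFKD );

# The URL-encoded form of the name from this thread: pure ASCII.
my $escaped = 'Lubom%C3%ADr%2CBarto%C5%88ov%C3%A1';

# The accented letters are hidden inside %XX escapes, so NFKD
# returns the string unchanged.
my $normalized = NFKD($escaped);

# Count the nonspacing marks in the normalized string.
my $marks = () = $normalized =~ /\p{NonspacingMark}/g;

print $normalized eq $escaped ? "unchanged\n" : "changed\n";   # unchanged
print "nonspacing marks: $marks\n";                            # nonspacing marks: 0
```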

Re^2: Strange Unicode normalization question
by mje (Curate) on Aug 16, 2018 at 17:35 UTC

    Thank you. I had not understood what was happening and I think I do now. The NFKD separates the í into two characters and the substitution removes the second one (the NonspacingMark), leaving an i (similarly for the other two). So "Lubomír,Bartoňová" becomes "Lubomir,Bartonova". Unless this is done, there is no match for the combination of strings which include this name (BTW, the name is fictitious; I should have mentioned that). I am at a loss as to why we need to do this for this API, which is from a UK government organisation, but there is no documentation saying this must be done.
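    The two steps described here can be sketched as follows. This assumes the substitution is s/\p{NonspacingMark}//g, which the thread implies but does not show; the helper name strip_marks is hypothetical.

```perl
use strict;
use warnings;
use utf8;
use open ':std', ':encoding(UTF-8)';

use Unicode::Normalize qw( NFKD );

# Hypothetical helper: decompose, then drop the combining marks.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFKD($text);             # í -> i + COMBINING ACUTE ACCENT
    $decomposed =~ s/\p{NonspacingMark}//g;   # remove the accents, keep base letters
    return $decomposed;
}

print strip_marks('Lubomír,Bartoňová'), "\n";   # Lubomir,Bartonova
print strip_marks('Łódź'), "\n";                # Łodz
```

    The second call shows why this is only an approximation: Ł has no decomposition, so it passes through unchanged while ó and ź lose their accents.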

      I have no clue what the most stable/robust answer is here but I thought this belonged in the footnotes of the thread: Text::Unidecode. I have used it for normalizing search indexes such that a user typing Francois finds François; maybe similar to your use case.