Re: Strange Unicode normalization question

in reply to Strange Unicode normalization question

but I have no evidence of NonspacingMark ever being in the normalized string.

There are three in the example you gave:

use strict;
use warnings qw( all );
use feature qw( say );

use utf8;
use open ':std', ':encoding(UTF-8)';

use charnames          qw( );
use Unicode::Normalize qw( NFKD );

my $html = "Lubomír,Bartoňová";
my $decomposedHtml = NFKD( $html );
say charnames::viacode(ord($_))
   for $decomposedHtml =~ /(\p{NonspacingMark})/g;

Output:

COMBINING ACUTE ACCENT
COMBINING CARON
COMBINING ACUTE ACCENT

The code you posted is a hack to find an ASCII "equivalent" to the input.

Re^2: Strange Unicode normalization question by mje (Curate) on Aug 16, 2018 at 18:10 UTC
Thanks again. This code was a bit of a mess, and your comments and the others' have helped me see what was going wrong. I apologise for not providing better information, but there was a lot of code for something which should have been quite simple. This is what the original code did:

1. Opened the data file with encoding(UTF-8).
2. Read a line of comma-separated strings from it and split them on the comma.
3. Put the split fields into a hash with keys describing the data.
4. Passed the hash to a hand-written function that tried to produce an application/x-www-form-urlencoded string. This function was broken: it just stuck an '&' between each key=value pair, so the result wasn't form encoded at all.
5. Passed the resulting string into NFKD and did the substitution as I described earlier.
6. Passed the resulting string into encode to encode it as UTF-8.
7. Passed the resulting string into an LWP POST.

So it was horribly broken because it did not form encode properly, and NFKD was a workaround he discovered which I suspect only works because the API does normalization itself (which would not surprise me). I replaced the hand-written (incorrect) form encoding with build_urlencoded from WWW::Form::UrlEncoded and, as you both state, the NFKD is a no-op, as is the substitution, and it works. This was confused because it appears that when it didn't work originally (without the NFKD), he was told by the API support to turn diacritics into normal characters. The actual code was a lot more complicated than this, and the more I looked at it the more problems I found, so I've spent most of the day rewriting it. Thanks again for your insights.
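
For readers wondering what the missing form-encoding step amounts to, here is a minimal core-only sketch. The helper names form_escape and form_encode and the field names forename and surname are hypothetical and do not come from the original code; in practice a tested module such as WWW::Form::UrlEncoded (as used above) is the better choice.

```perl
use strict;
use warnings;
use utf8;

use Encode qw( encode_utf8 );

# Hypothetical helper: percent-encode a single key or value.
sub form_escape {
    my ($text) = @_;
    my $bytes = encode_utf8($text);    # characters -> UTF-8 bytes
    # Escape every byte that is not an unreserved form character or a space.
    $bytes =~ s/([^A-Za-z0-9*\-._ ])/sprintf('%%%02X', ord($1))/ge;
    $bytes =~ tr/ /+/;                 # form encoding writes a space as '+'
    return $bytes;
}

# Hypothetical helper: join the escaped key=value pairs with '&'.
sub form_encode {
    my (%fields) = @_;
    return join '&',
        map { form_escape($_) . '=' . form_escape( $fields{$_} ) }
        sort keys %fields;
}

# Field names are made up for illustration.
print form_encode( forename => "Lubomír", surname => "Bartoňová" ), "\n";
# forename=Lubom%C3%ADr&surname=Barto%C5%88ov%C3%A1
```

Because each value is escaped before joining, a comma, '&' or '=' inside the data can no longer corrupt the request body, and the accented characters reach the server intact as percent-encoded UTF-8.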
Re^2: Strange Unicode normalization question by Veltro (Hermit) on Aug 16, 2018 at 11:43 UTC
Your answer makes sense; however, the OP says $html is the url-encoded strings, which I interpret as:

...
my $html = "Lubom%C3%ADr%2CBarto%C5%88ov%C3%A1" ;
my $decomposedHtml = NFKD( $html );
...

Which doesn't make sense to me...
Re^3: Strange Unicode normalization question by ikegami (Patriarch) on Aug 16, 2018 at 14:43 UTC
They also said "Lubomír,Bartoňová" is passed through NFKD, which is the part I addressed. The OP wrote a lot, but said very little that can be used. I didn't think that asking for a clearer explanation would be useful, so I provided a starting point. I do agree that it makes no sense to pass URL-encoded or HTML-encoded text to `NFKD`. Escapes could prevent it from functioning correctly.
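
A quick way to see the problem with escaped input, as a sketch using the percent-encoded string from Veltro's post: the string is pure ASCII, so NFKD has nothing to decompose and no NonspacingMark can ever match.

```perl
use strict;
use warnings;

use Unicode::Normalize qw( NFKD );

# The percent-encoded form contains only ASCII characters.
my $escaped    = 'Lubom%C3%ADr%2CBarto%C5%88ov%C3%A1';
my $decomposed = NFKD($escaped);

# Collect any combining marks that NFKD might have produced.
my @marks = $decomposed =~ /(\p{NonspacingMark})/g;

print "unchanged\n" if $decomposed eq $escaped;
print scalar(@marks), " nonspacing marks found\n";
```

This prints "unchanged" and "0 nonspacing marks found". Normalization has to happen on the decoded text, before any escaping is applied.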
Re^2: Strange Unicode normalization question by mje (Curate) on Aug 16, 2018 at 17:35 UTC
Thank you. I had not understood what was happening, and I think I do now. The NFKD is separating the í into 2 characters and the substitution is removing the 2nd one (the NonspacingMark), leaving an i (similarly for the other 2). So "Lubomír,Bartoňová" becomes "Lubomir,Bartonova". Unless this is done, there is no match for the combination of strings which include this name (BTW, the name is fictitious - I should have mentioned that). I am at a loss as to why we need to do this via this API, which is from a UK government organisation, and there is no documentation saying this must be done.
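
The decompose-then-strip behaviour described above can be reproduced with a short sketch. The helper name strip_marks is hypothetical, and the substitution is assumed to be a plain s/\p{NonspacingMark}//g, which may differ from the original code.

```perl
use strict;
use warnings;
use utf8;

use Unicode::Normalize qw( NFKD );

# Hypothetical helper, for illustration only.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFKD($text);              # í becomes i + U+0301
    $decomposed =~ s/\p{NonspacingMark}//g;    # drop the combining marks
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_marks("Lubomír,Bartoňová"), "\n";   # Lubomir,Bartonova
```

This only helps for letters that decompose into a base letter plus combining marks; characters such as ø or ß have no such decomposition and pass through unchanged.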
Re^3: Strange Unicode normalization question by Your Mother (Archbishop) on Aug 16, 2018 at 17:46 UTC
I have no clue what the most stable/robust answer is here but I thought this belonged in the footnotes of the thread: Text::Unidecode. I have used it for normalizing search indexes such that a user typing Francois finds François; maybe similar to your use case.

In Section Seekers of Perl Wisdom