PerlMonks
Strange Unicode normalization question

by mje (Curate)
on Aug 15, 2018 at 17:37 UTC ( [id://1220385] )

mje has asked for the wisdom of the Perl Monks concerning the following question:

We are using an API (which I can't tell you much about, unfortunately) provided by another party which uses POST over HTTPS. While reviewing code by an ex-coworker I discovered a mysterious call to NFKD, which I now realise is in Unicode::Normalize. I could not explain why it was there and tried taking it out, but that actually breaks things, and I'm hoping someone here might have some insights. The API involves POSTing a number of strings to an HTTPS URL, and the response contains one of 3 statuses (2 mean a match for the supplied data was found and 1 means a match was not found). The suppliers of the API provide some test data which is supposed to be UTF-8 encoded, and I have confirmed that in that I can a) find UTF-8 continuation bytes where there are accents/diacritics etc. and b) open the file with ':encoding(UTF-8)' and read it without errors.

The test code opens the test data file with ':encoding(UTF-8)', reads a line of strings, POSTs them to the URL and gets the response. It then checks the response matches the expected response. When run with the url-encoded POST data simply encoded as UTF-8 and a 'Content-Type' => 'application/x-www-form-urlencoded; charset=UTF-8' header, some of the test data fails. When the data is url-encoded and passed through NFKD, all of the tests pass. a) all of the failing tests contain strings which are non-ASCII, and b) it is obvious they are not matching because the status returned is a non-match when a match is expected. An example is Lubomír,Bartoňová. After passing through NFKD, the accent over the i displays much larger (the í has been decomposed into an i plus a combining accent).
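To illustrate why un-normalized data can fail a byte-wise match, here is a minimal sketch (using the name from the example above) showing that the composed (NFC) and decomposed (NFD) forms of the same string encode to different UTF-8 byte sequences:

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw( NFC NFD );
use Encode qw( encode );

my $name = "Lubomír";    # í here is the precomposed character U+00ED

my $nfc_bytes = encode( 'UTF-8', NFC($name) );   # U+00ED -> 2 bytes
my $nfd_bytes = encode( 'UTF-8', NFD($name) );   # "i" + U+0301 -> 3 bytes

# Different byte sequences, so a byte-wise comparison on the server
# side would treat the two forms as different strings.
printf "NFC: %d bytes, NFD: %d bytes, equal: %s\n",
    length($nfc_bytes), length($nfd_bytes),
    $nfc_bytes eq $nfd_bytes ? "yes" : "no";
```

If the API normalizes its own dataset one way and compares bytes, only input in that same form will match.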

The actual code is even stranger, as it does this to the url-encoded strings ($html holds the url-encoded data):

my $decomposedHtml = NFKD( $html );
$decomposedHtml =~ s/\p{NonspacingMark}//g;

but I have no evidence of NonspacingMark ever being in the normalized string.

It seems unlikely the API provider supplied test data which does not match their dataset, so that leaves me wondering a) what might be going wrong and b) how the hell my ex-colleague discovered this - it feels like a bodge.

I would greatly appreciate any possible insights from monks here.

Replies are listed 'Best First'.
Re: Strange Unicode normalization question
by ikegami (Patriarch) on Aug 15, 2018 at 23:01 UTC

    but I have no evidence of NonspacingMark ever being in the normalized string.

    There are three in the example you gave:

    use strict;
    use warnings qw( all );
    use feature qw( say );
    
    use utf8;
    use open ':std', ':encoding(UTF-8)';
    
    use charnames          qw( );
    use Unicode::Normalize qw( NFKD );
    
    my $html = "Lubomír,Bartoňová";
    my $decomposedHtml = NFKD( $html );
    say charnames::viacode(ord($_))
       for $decomposedHtml =~ /(\p{NonspacingMark})/g;
    

    Output:

    COMBINING ACUTE ACCENT
    COMBINING CARON
    COMBINING ACUTE ACCENT

    The code you posted is a hack to find an ASCII "equivalent" to the input.

      Thanks again. This code was a bit of a mess, and your comments and the others have helped me see what was going wrong. I apologise for not providing better information, but there was a lot of code for something which should have been quite simple. This is what the original code did:

      1. Opened data file with encoding(UTF-8)
      2. Read a line of comma separated strings from it and split them on the comma
      3. Put the split fields into a hash with keys describing the data
      4. Passed the hash to a hand-written function that tried to produce an x-www-form-urlencoded string, but this function was broken and instead just stuck an '&' between each key=value, so it wasn't form-encoded at all
      5. Passed the resulting string into NFKD and did the substitution as I described earlier
      6. Passed the resulting string to Encode::encode to encode it as UTF-8
      7. Passed the resulting string into a LWP POST

      So it was horribly broken because it did not form-encode properly, and the NFKD was a workaround he discovered which I suspect only works because the API does normalization itself (which would not surprise me). I replaced the hand-written (incorrect) form encoding with WWW::Form::UrlEncoded's build_urlencoded, and as you both state the NFKD is then a no-op, as is the substitution, and it works. The confusion arose because, when it didn't work originally (without the NFKD), he was told by the API support to turn diacritics into normal characters. The actual code was a lot more complicated than this, and the more I looked at it the more problems I found, so I've spent most of the day rewriting it.
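For reference, the difference between the broken hand-rolled encoding and proper form encoding can be sketched like this. The field names are invented for illustration, and core HTTP::Tiny's www_form_urlencode stands in for WWW::Form::UrlEncoded's build_urlencoded, which behaves equivalently:

```perl
use strict;
use warnings;
use utf8;
use Encode qw( encode );
use HTTP::Tiny;

# Hypothetical field names; values are encoded to UTF-8 bytes first.
my %fields = (
    forename => encode( 'UTF-8', "Lubomír" ),
    surname  => encode( 'UTF-8', "Bartoňová" ),
);

# The broken hand-rolled version: joins pairs but never percent-escapes,
# so the raw UTF-8 bytes go over the wire unescaped.
my $broken = join '&', map { "$_=$fields{$_}" } sort keys %fields;

# Proper application/x-www-form-urlencoded encoding.
my $correct = HTTP::Tiny->new->www_form_urlencode( \%fields );

print "$broken\n";
print "$correct\n";
```

With proper percent-escaping the body is pure ASCII, which is why the NFKD call stops doing anything.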

      Thanks again for your insights.

      Your answer makes sense; however, the OP says $html is the url-encoded strings, which I interpret as:

      ...
      my $html = "Lubom%C3%ADr%2CBarto%C5%88ov%C3%A1";
      my $decomposedHtml = NFKD( $html );
      ...

      Which doesn't make sense to me...

        They also said "Lubomír,Bartoňová" is passed through NFKD, which is the part I addressed.

        The OP wrote a lot, but said very little that can be used. I didn't think that asking for a clearer explanation would be useful, so I provided a starting point.

        I do agree that it makes no sense to pass URL-encoded or HTML-encoded text to NFKD. Escapes could prevent it from functioning correctly.
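That point is easy to verify: once the data has been percent-escaped it is pure ASCII, and NFKD on an all-ASCII string returns it unchanged, e.g.:

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFKD );

# The percent-escaped form of the example name: all ASCII,
# so there is nothing for NFKD to decompose.
my $escaped = "Lubom%C3%ADr%2CBarto%C5%88ov%C3%A1";
print NFKD($escaped) eq $escaped ? "NFKD is a no-op\n" : "changed\n";
```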

      Thank you. I had not understood what was happening, and I think I do now. The NFKD separates the í into 2 characters, and the substitution removes the 2nd one (the NonspacingMark), leaving an i (and similarly for the other 2). So "Lubomír,Bartoňová" becomes "Lubomir,Bartonova". Unless this is done there is no match for the combination of strings which include this name (BTW, the name is fictitious - I should have mentioned that). I am at a loss as to why we need to do this for this API, which is from a UK government organisation, but there is no documentation saying this must be done.
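The transformation described above, in isolation:

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw( NFKD );

my $name  = "Lubomír,Bartoňová";
my $ascii = NFKD($name);             # í -> i + U+0301, ň -> n + U+030C, á -> a + U+0301
$ascii =~ s/\p{NonspacingMark}//g;   # drop the combining accents
print "$ascii\n";                    # prints "Lubomir,Bartonova"
```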

        I have no clue what the most stable/robust answer is here, but I thought this belonged in the footnotes of the thread: Text::Unidecode. I have used it for normalizing search indexes so that a user typing Francois finds François; maybe similar to your use case.
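For completeness, a minimal sketch of that approach (Text::Unidecode is a CPAN module, not core; unlike the mark-stripping above, it transliterates arbitrary Unicode text to a plain-ASCII approximation):

```perl
use strict;
use warnings;
use utf8;
use Text::Unidecode;   # CPAN module

print unidecode("François"), "\n";   # prints "Francois"
```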
