
We are using an API (which unfortunately I can't tell you much about) provided by another party, which uses POST over HTTPS. On reviewing code written by an ex-coworker I discovered a mysterious call to NFKD, which I now realise comes from Unicode::Normalize. I could not explain why it was there and tried taking it out, but that actually breaks things, and I'm hoping someone here might have some insights. The API involves POSTing a number of strings to an HTTPS URL, and the response contains one of 3 statuses (2 mean a match for the supplied data was found and 1 means a match was not found). The suppliers of the API provide some test data which is supposed to be UTF-8 encoded, and I have confirmed that in that I can a) find UTF-8 continuation bytes where there are accents/diacritics etc. and b) open the file with ':encoding(UTF-8)' and read it without errors.

The test code opens the test data file with ':encoding(UTF-8)', reads a line of strings, POSTs them to the URL and gets the response. It then checks that the response matches the expected response. When run with the url-encoded POST data simply encoded as UTF-8 with a "Content-Type" => 'application/x-www-form-urlencoded; charset=UTF-8' header, some of the test data fails. When the data is url-encoded and passed through NFKD, all of the tests pass. a) All of the failing tests contain non-ASCII strings, and b) it is obvious they are not matching because the status returned is a non-match when a match is expected. An example is Lubomír,Bartoňová. After passing through NFKD, the accent over the i is displayed much larger.
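One way to check what NFKD actually changes in such a name is to dump the code points before and after. The sketch below is only an illustration: the helper name codepoints() is made up, and the name is written with \x{...} escapes so the source file's own encoding cannot affect the result. NFKD splits the precomposed í (U+00ED) into a plain i followed by U+0301 COMBINING ACUTE ACCENT, which would be consistent with the accent being drawn differently afterwards.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: render a character string as a list of code points.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

# "Lubomír" with a precomposed i-acute (U+00ED).
my $name = "Lubom\x{ED}r";

print "before: ", codepoints($name), "\n";
print "after:  ", codepoints(NFKD($name)), "\n";
# before: U+004C U+0075 U+0062 U+006F U+006D U+00ED U+0072
# after:  U+004C U+0075 U+0062 U+006F U+006D U+0069 U+0301 U+0072
```

Running the same dump on a line read from the supplier's test file would show whether their data is precomposed or already decomposed.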

The actual code is even stranger, as it does this to the url-encoded strings ($html holds the url-encoded strings):

      my $decomposedHtml = NFKD( $html );
      $decomposedHtml =~ s/\p{NonspacingMark}//g;

but I have no evidence of NonspacingMark ever being in the normalized string.
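A minimal sketch of what those two lines do may help narrow this down. It assumes the input is a decoded character string; strip_marks() is a made-up name, and the percent-encoded literal is hand-written for illustration rather than taken from the real request. On decoded text the pair removes the accents entirely, whereas on a string that is already percent-encoded (pure ASCII) NFKD has nothing to decompose.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper wrapping the two lines from the original code.
sub strip_marks {
    my ($str) = @_;
    my $decomposed = NFKD($str);               # e.g. U+00ED -> "i" + U+0301
    $decomposed =~ s/\p{NonspacingMark}//g;    # drop combining marks (category Mn)
    return $decomposed;
}

# Decoded character strings, written with escapes to avoid source-encoding issues.
print strip_marks("Lubom\x{ED}r"), "\n";          # Lubomir
print strip_marks("Barto\x{148}ov\x{E1}"), "\n";  # Bartonova

# An already percent-encoded string is plain ASCII, so it comes back unchanged.
my $encoded = "Lubom%C3%ADr";
print strip_marks($encoded) eq $encoded ? "unchanged\n" : "changed\n";  # unchanged
```

If the tests only pass with these lines in place, that would suggest $html still contains non-ASCII characters at that point, and that the remote side may be matching on unaccented names. Both of those are guesses, not something the sketch can confirm.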

It seems unlikely the API provider supplied test data which does not match their dataset so that leaves me wondering a) what might be going wrong and b) how the hell did my ex-colleague discover this - it feels like a bodge.

I would greatly appreciate any possible insights from monks here.

In reply to Strange Unicode normalization question by mje
