Re: Encoding TermExtract
by roboticus (Chancellor)
on Jul 21, 2020 at 19:11 UTC
I was bored, so I installed the library and ran your example.
The input file appears to be a properly encoded UTF-8 file. (I say "appears" because I'm no expert in Unicode, and only looked over the first handful of characters in the file.)
The output file, though, definitely looks weird. It, too, appears to be properly encoded UTF-8, but the text looks like mojibake: the byte sequences are valid, yet none of the characters come from the CJK ranges that dominate the input file.
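One common way to get output that is valid UTF-8 but contains no CJK characters is double encoding: undecoded bytes are treated as individual Latin-1 code points and then encoded again. Whether that is what happens in this script is an assumption; the sample string ("中文") and the variable names below are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "中文" as raw UTF-8 bytes, i.e. what you hold after reading without a decoding layer.
my $bytes = "\xE4\xB8\xAD\xE6\x96\x87";

# Encoding an undecoded byte string treats every byte as its own code point
# (U+00E4, U+00B8, ...), which is also what an :encoding(UTF-8) output layer does.
my $double = encode('UTF-8', $bytes);

# Decode the result to see which characters a UTF-8 viewer would display.
my @code_points = map { ord } split //, decode('UTF-8', $double);

printf "%d bytes in, %d bytes out\n", length($bytes), length($double);
printf "code points: %s\n", join ' ', map { sprintf 'U+%04X', $_ } @code_points;
```

Every resulting code point is below U+0100, so the output is well-formed UTF-8 but nowhere near the CJK blocks.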
I tried removing the ":encoding(UTF-8)" bit from your open statement and, as expected, the garbage is different. In this case the encoding is now broken, but it looks "closer", since the data that shows up looks like fragments of UTF-8 encodings with prefixes in the CJK character set:
Above, the first two bytes of the UTF-8 sequences look like they're from the CJK character set, but the third byte is out of the expected range. It looks like one of the steps in the bowels of the TermExtract software is truncating the characters due to encoding/decoding errors (likely) and/or mixing byte lengths and character lengths (possibly?).
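To illustrate how mixing byte lengths and character lengths produces that kind of fragment, here is a minimal sketch. It assumes a cutter that takes two bytes per character (as a double-byte GB encoding would); that width, the sample string, and the variable names are assumptions for the example, not code taken from TermExtract.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# "中文" as raw UTF-8 bytes: each character occupies three bytes.
my $bytes = "\xE4\xB8\xAD\xE6\x96\x87";

# On undecoded data substr() counts bytes, so a width of 2 keeps the lead byte
# and one continuation byte and drops the third byte of the first character.
my $fragment = substr($bytes, 0, 2);

# A strict decode rejects the truncated sequence.
my $ok = eval { decode('UTF-8', $fragment, FB_CROAK | LEAVE_SRC); 1 };
print $ok ? "fragment is valid UTF-8\n" : "fragment is a truncated UTF-8 sequence\n";

# After decoding, substr() counts characters, so a width of 1 is one whole character.
my $chars = decode('UTF-8', $bytes);
my $first = substr($chars, 0, 1);
printf "first character: U+%04X\n", ord $first;
```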
I did a *quick* review of the TermExtract code and found no encode/decode calls anywhere. There were several functions that open a file and read the data, but without any encoding/decoding. The most disturbing thing I found is a pair of functions (cut_GB in ChainesPlainTextGB, and cut_utf8 in ChainesPlainTextUC) that operate on the data as raw bytes.
I'm not going to dig any deeper than this, but it may be possible to modify the module to properly handle encoding/decoding. If I were going to try, I'd examine all the locations where cut_utf8(), cut_GB() and substr() are called (there aren't all that many). If the encoding were handled correctly, you wouldn't need the hackery in cut_GB() and cut_utf8() to determine character lengths, nor would you (presumably) need to use anything other than 1 for the size in substr(). There may be other functions you'd need to review, but I can't think of any at the moment.
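As a rough sketch of what "handled correctly" could look like: decode once on input, work on characters, encode once on output. The in-memory filehandles stand in for the real input and output files, and the sample data and variable names are hypothetical; none of this is taken from TermExtract.

```perl
use strict;
use warnings;

# Stand-in for the input file: "中文" plus a newline, as raw UTF-8 bytes.
my $raw = "\xE4\xB8\xAD\xE6\x96\x87\n";

# Decode on the way in, so everything downstream sees characters, not bytes.
open my $in, '<:encoding(UTF-8)', \$raw or die "open input: $!";
my $line = <$in>;
close $in;
chomp $line;

# With decoded text, a size of 1 in substr() always means one whole character.
my @chars = map { substr($line, $_, 1) } 0 .. length($line) - 1;

# Encode on the way out, exactly once.
my $out_buf = '';
open my $out, '>:encoding(UTF-8)', \$out_buf or die "open output: $!";
print {$out} join('', @chars), "\n";
close $out or die "close output: $!";

printf "%d characters; round trip %s\n",
    scalar(@chars), ($out_buf eq $raw ? "ok" : "broken");
```

The point of the design is that the byte/character boundary lives only at the two open() calls, so no function in between needs to know how many bytes a character takes.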
When your only tool is a hammer, all problems look like your thumb.