PerlMonks  

Re: Malformed UTF-8 character

by pryrt (Abbot)
on Nov 29, 2022 at 21:25 UTC ( [id://11148451]=note )


in reply to Malformed UTF-8 character

When I go to ?abspart=1;part=1;displaytype=displaycode;node_id=11148406, Save As, and open the file in Notepad++, Notepad++ detects the encoding as "ANSI" (which on my system is Windows-1252). When I run that file, I get the "Malformed UTF-8 character" error: the dash is stored as the single byte 0x96, but the use utf8 line tells the interpreter to read the source as UTF-8, and a lone 0x96 byte is not valid UTF-8. If I instead copy the contents manually from the browser, paste them into a new file in Notepad++ (which defaults to UTF-8 for me), save, and run, it runs just fine. Alternatively, if I comment out use utf8 in the downloaded version, it also works.

The problem is that the perlmonks website serves the pages as Content-Type: text/plain; charset=ISO-8859-1 (even though, technically, the en dash is at codepoint 0x96 in Windows-1252 but not in ISO-8859-1, where 0x96 is a control character), so any bytes that get saved use that encoding. Saying use utf8 tells perl to interpret the bytes of the source code as UTF-8, so it tries to interpret the ISO-8859-1 or Windows-1252 bytes as UTF-8 and fails on bytes above 127 that do not form a valid UTF-8 sequence.
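
The mismatch can be reproduced from inside perl. The following is only a minimal sketch, assuming the core Encode module; the sample bytes mirror the affected line of the saved file, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The affected line as raw bytes: "8", space, 0x96, space, "Bar"
my $bytes = "8 \x96 Bar";

# Strict UTF-8 decoding rejects the lone 0x96 byte (a continuation byte
# with no lead byte). FB_CROAK makes decode() die instead of substituting.
my $copy = $bytes;    # decode() may modify its input when a CHECK is given
my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
print "valid UTF-8? ", ($ok ? "yes" : "no"), "\n";    # prints "no"

# Decoding the same bytes as Windows-1252 yields U+2013 (EN DASH).
my $text = decode('cp1252', $bytes);
printf "U+%04X\n", ord(substr($text, 2, 1));          # prints "U+2013"
```

This only checks one hard-coded string; it says nothing about what else might be in a downloaded file.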

Replies are listed 'Best First'.
Re^2: Malformed UTF-8 character
by BillKSmith (Monsignor) on Dec 01, 2022 at 19:09 UTC
    You have put me on the right track. I have found gvim commands to tell it that input is in CP1252 and output should be in utf-8. This converts the byte 96 to e28093 (U+2013 EN DASH). The resulting file runs in perl and pastes back into perlmonks correctly. The character still does not display correctly in gvim or the windows command prompt. The best solution probably is to download notepad++, but it seems like overkill to learn another editor to solve such a rare problem.
    Bill
      > The best solution probably is to download notepad++, but it seems like overkill to learn another editor to solve such a rare problem.

      As much as it pains me to say it (given my Notepad++ fandom), it does seem like overkill. But iconv.exe comes with my Strawberry Perl... and if it does with yours, then it can handle the translation. (Or gnuwin32's iconv.) I believe one of the following two would properly translate the CP1252 encoding of the en dash into UTF-8.

      iconv -f ISO-8859-1 -t utf-8 savedfile > outfile.pl
      iconv -f CP1252 -t utf-8 savedfile > outfile.pl

      (Of course, the other fix is to not use utf8; after you download the script; perl will default to your native Windows encoding {if I understand things correctly}, so that should work -- at least, it did for me from that same downloaded source code.)

        Thanks for the reminder to check Strawberry (and perl) utilities occasionally.

        Note: The character in question (\x96) is one of the differences between ISO-8859-1 and CP1252. See the difference in the bytes starting at offset 0x12 in the dumps below. The savedfile is from the original post Regex: matching any Number then a hyphen.

        C:\Users\Bill\forums\monks>xxd savedfile
        00000000: 3132 3334 202d 2046 6f6f 0d0a 3536 3737  1234 - Foo..5677
        00000010: 3820 9620 4261 720d 0a39 3939 392e 2042  8 . Bar..9999. B
        00000020: 617a 0d0a                                az..

        C:\Users\Bill\forums\monks>iconv -f ISO-8859-1 -t utf-8 savedfile > outfile.txt

        C:\Users\Bill\forums\monks>xxd outfile.txt
        00000000: 3132 3334 202d 2046 6f6f 0d0a 3536 3737  1234 - Foo..5677
        00000010: 3820 c296 2042 6172 0d0a 3939 3939 2e20  8 .. Bar..9999.
        00000020: 4261 7a0d 0a                             Baz..

        C:\Users\Bill\forums\monks>iconv -f CP1252 -t utf-8 savedfile > outfile.txt

        C:\Users\Bill\forums\monks>xxd outfile.txt
        00000000: 3132 3334 202d 2046 6f6f 0d0a 3536 3737  1234 - Foo..5677
        00000010: 3820 e280 9320 4261 720d 0a39 3939 392e  8 ... Bar..9999.
        00000020: 2042 617a 0d0a                            Baz..
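
        The same comparison can be made without iconv. This is a rough sketch, assuming the core Encode module; to_utf8_hex is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode $bytes under the assumed source encoding,
# re-encode as UTF-8, and return the result as a hex string.
sub to_utf8_hex {
    my ($from, $bytes) = @_;
    return unpack('H*', encode('UTF-8', decode($from, $bytes)));
}

# The disputed byte from savedfile, under both candidate encodings.
# iso-8859-1 gives c296 (U+0096, a control character);
# cp1252 gives e28093 (U+2013 EN DASH).
for my $from ('iso-8859-1', 'cp1252') {
    printf "%-10s -> %s\n", $from, to_utf8_hex($from, "\x96");
}
```

        The two results should match the bytes at offset 0x12 in the second and third xxd dumps.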

        Life was so much easier fifty years ago. Oh, there really were two keypunch codes.

        Bill
      If you use utf8 only for strings and not for variable names, you could convert your special characters to \N{...} notation, such as "\N{EN DASH}"

      Of course, it's your decision whether "Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk" is more readable than something like "Björk" or not ;-)

      N.B.: For the \N escape to work in Perl older than 5.16, you need an explicit use charnames;
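
      A small sketch of how that might look for the en dash in this thread. The sample string and variable names are assumptions for illustration, and the regex is only a guess at the kind of match the original script wanted.

```perl
use strict;
use warnings;
use charnames ':full';    # only needed before Perl 5.16; harmless later

# The source stays pure ASCII, so it survives a Latin-1 or CP1252 "Save As"
# and does not need "use utf8" at all.
my $dash = "\N{EN DASH}";
my $line = "56778 $dash Bar";

# Match a number followed by either an ASCII hyphen or an en dash.
if ($line =~ /^(\d+)\s*[-\N{EN DASH}]/) {
    print "number: $1\n";           # prints "number: 56778"
}
printf "U+%04X\n", ord($dash);      # prints "U+2013"
```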
        soonix wrote the following, in reply to BillKSmith,
        > if you use utf8 only for strings.... it's your decision

        Actually, 1nickt wrote the code that BillKSmith was trying to run. It was not a decision on BillKSmith's part at all; he just downloaded code from perlmonks, expecting code from a longstanding monk to run without edits. Because perlmonks sends ISO-8859-1 rather than UTF-8, code downloaded from it is saved as an ISO-8859-1 file. And because that code contains use utf8;, the perl executable gives the "Malformed UTF-8 character" message: the file encoding and the pragma do not match.

        The best would be if perlmonks would serve posts and downloads as UTF-8, or at least give us an option for it to do so. The next best is for the monk who downloads the code to convert the file (whether by iconv, a perl oneliner¤, or a text editor that can change a file's encoding) before running it. The option that requires the most effort so far would be for the monk who downloads code from perlmonks to search through every piece of it that has use utf8;, check whether the code actually relies on the pragma, and then either comment the pragma out if it's not needed (as I hinted at earlier) or change every non-ASCII character in a quoted string from the actual character to a named character.

        ¤: oneliner = perl -pi -MEncode=encode,decode -e "$_ = encode('utf-8', decode('iso-8859-1', $_));" save-as.pl
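
        If the oneliner's quoting is awkward in a given shell, the same idea can be written as a short script. This is only a sketch using PerlIO encoding layers: reencode_file is a hypothetical name, and passing cp1252 as the source encoding is an assumption based on the xxd output earlier in the thread, where CP1252 is what turned 0x96 into an en dash.

```perl
use strict;
use warnings;

# Hypothetical helper: read $in as $from-encoded text, write $out as UTF-8.
# :raw keeps line endings (such as CRLF) exactly as they are in the file.
sub reencode_file {
    my ($in, $out, $from) = @_;
    open my $ifh, "<:raw:encoding($from)", $in  or die "open $in: $!";
    open my $ofh, ">:raw:encoding(UTF-8)", $out or die "open $out: $!";
    print {$ofh} $_ while <$ifh>;
    close $ofh or die "close $out: $!";
    close $ifh;
    return;
}

# Example call: perl reencode.pl save-as.pl fixed.pl cp1252
reencode_file(@ARGV) if @ARGV == 3;
```

        Unlike the -pi oneliner, this writes to a separate output file, so the downloaded original is left untouched.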

        I use the \N{} notation frequently. At the time that I opened this thread, I did not know what Unicode character the \x96 was meant to represent.
        Bill
