perltidy and UTF-8 BOM

morelenmir has asked for the wisdom of the Perl Monks concerning the following question:

Re: perltidy and UTF-8 BOM
by haukex (Archbishop) on May 21, 2018 at 09:37 UTC

It seems that Perl::Tidy only supports UTF-8 and the default ("none") encodings, and it does not recognize the BOM - when I run it with the -utf8 switch on a UTF-8 file with a BOM, I get "unexpected character decimal 65279 () in script".

While the UTF-8 BOM might be useful to your text editor, it isn't really useful to Perl: it is completely ignored, and you still need an explicit use utf8; (see the caveats at the beginning of perlunicode). Note this is different for UTF-16, where the BOM will cause Perl to automatically use that encoding. In any case, you might want to consider whether you need the BOM at all: many text editors default to UTF-8 anyway, and if you're worried that someone might open your UTF-8 encoded Perl source with an incorrect encoding, remember that there's still the use utf8; at the top of the file. In fact, I sometimes write "use utf8; # Euro Symbol: €" so that I have an immediate visual clue as to whether the text editor used the right encoding (that's why I like that symbol, plus it's easy to type on my German keyboard :-) ).

Other than filing a bug in the issue tracker and/or writing a patch, you could work around the issue by writing a small wrapper for perltidy that strips the BOM first and adds it back in again after:

perl -wM5.012 -CDS -pe '$.==1 && s/\A\x{FEFF}//' utf8_bom.pl \
    | perltidy -utf8 \
    | perl -wM5.012 -CDS -pe 'INIT{print "\x{FEFF}"}' \
    > utf8_bom_tidy.pl
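
The pipeline above always writes a BOM to the output. A variation that only puts the BOM back when the input actually had one might look like the following sketch. The function name filter_keeping_bom and its calling convention are invented for this example, and it works on raw bytes rather than on decoded characters.

```shell
# Hypothetical helper (not part of perltidy): run a filter command on a file
# with any UTF-8 BOM removed, and re-emit the BOM only if the file had one.
# Usage: filter_keeping_bom FILE COMMAND [ARGS...]
# Note: plain POSIX sh has no local variables, so `file` and `first3` are global.
filter_keeping_bom() {
    file=$1; shift
    # Read the first three bytes and print them as hex.
    first3=$(perl -e 'binmode STDIN; read(STDIN, $head, 3); print unpack("H*", $head)' < "$file")
    if [ "$first3" = "efbbbf" ]; then
        printf '\357\273\277'    # write the BOM (EF BB BF) back first
        # Strip the BOM from line 1, then hand the rest to the filter.
        perl -pe 's/\A\xEF\xBB\xBF// if $. == 1' "$file" | "$@"
    else
        # No BOM: pass the file through the filter unchanged.
        "$@" < "$file"
    fi
}

# Intended use, assuming perltidy is installed and reads from standard input
# as in the pipeline above:
#   filter_keeping_bom utf8_bom.pl perltidy -utf8 > utf8_bom_tidy.pl
```

Passing the filter as arguments keeps the BOM handling separate from the tidying step, so the same wrapper can be tried out with a harmless command such as cat first.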

Re^2: perltidy and UTF-8 BOM

by AnomalousMonk (Archbishop) on May 21, 2018 at 13:57 UTC

... s/\A\x{FEFF}// ... print "\x{FEFF}" ...

Again, a total UTF-8 n00b (or maybe nob) here, but isn't the UTF-8 BOM "\xEF\xBB\xBF" (239 187 191) (see this and the perltidy complaint cited in the OP)? (And \x{FEFF} works out to decimal 65279.)

Give a man a fish: <%-{-{-{-<

Re^3: perltidy and UTF-8 BOM

by haukex (Archbishop) on May 21, 2018 at 14:17 UTC

The Byte order mark is the Unicode character U+FEFF, and depending on the encoding it is encoded as different bytes - see "Byte order marks by encoding" on the same Wikipedia page. Because I've changed STDIN and STDOUT to be UTF-8 with the command-line switch -CS, I can use the Unicode representation and don't need to look at the bytes (although I could do that too, but I figured since everything is UTF-8 already anyway...).

$ perl -wMstrict -CSD -e 'print "\x{FEFF}"' | hexdump -C
00000000  ef bb bf                                          |...|
$ perl -wMstrict -e 'binmode STDOUT, ":raw:encoding(UTF-8)"; print "\x{FEFF}"' | hexdump -C
00000000  ef bb bf                                          |...|
$ perl -wMstrict -e 'binmode STDOUT, ":raw:encoding(UTF-16-LE)"; print "\x{FEFF}"' | hexdump -C
00000000  ff fe                                             |..|
$ perl -wMstrict -e 'binmode STDOUT, ":raw:encoding(UTF-16-BE)"; print "\x{FEFF}"' | hexdump -C
00000000  fe ff                                             |..|
$ perl -wMstrict -e 'binmode STDOUT, ":raw:encoding(UTF-32-LE)"; print "\x{FEFF}"' | hexdump -C
00000000  ff fe 00 00                                       |....|
$ perl -wMstrict -e 'binmode STDOUT, ":raw:encoding(UTF-32-BE)"; print "\x{FEFF}"' | hexdump -C
00000000  00 00 fe ff                                       |....|

Re: perltidy and UTF-8 BOM
by AnomalousMonk (Archbishop) on May 21, 2018 at 04:42 UTC

Total UTF-8 n00b here, but I thought the "byte order" idea of this encoding was "one byte after another from the beginning of the text stream to the end". (Of course, each character in this encoding can be one to four bytes, but the order of the bytes in a character is invariant.) Indeed, this source sez WRT UTF-8 byte order:

The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8 ... [emphasis added]

Give a man a fish: <%-{-{-{-<

Re^2: perltidy and UTF-8 BOM

by afoken (Chancellor) on May 22, 2018 at 20:00 UTC

The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8

what's the reason [...] using a BOM for [...] UTF-8 [...]?

UTF-8 encoded text should - in theory - not need a BOM, that's correct. But there are only very few cases (see below) in which a BOM causes trouble. So many editors (and other text-processing tools) automatically switch to UTF-8 encoding when they find a BOM encoded as UTF-8 (0xEF, 0xBB, 0xBF) at file offset 0. This is often completely analogous to finding a BOM encoded in UTF-16 BE, UTF-16 LE, UTF-32 BE, UTF-32 LE. Without a BOM, they usually guess. UTF-16 and UTF-32 can often be guessed by the amount and position of 0x00 bytes. UTF-8 can also be guessed, but it is harder and can be mixed up with some legacy encoding.

So, prefixing UTF-8 encoded text with a BOM makes life easier for most tools, that's all.

The Unix #! mechanism is broken by a leading BOM, simply because the kernel expects the first two bytes of the file to be 0x23, 0x21. The BOM takes up two to four bytes, depending on the encoding (three in UTF-8), and is often invisible in editors. The kernel sees an invalid magic number and so does not treat the file as a script, while the user believes that the file starts with #!. (Adding support for scripts with a BOM should be quite easy, by simply treating 0xEF 0xBB 0xBF 0x23 0x21 at the start of a file like 0x23 0x21 at the start of a file.)

https://validator.w3.org/ warns if input starts with a BOM, claiming that old editors and old browsers have problems with the BOM.

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Re^3: perltidy and UTF-8 BOM

by ikegami (Patriarch) on May 22, 2018 at 20:15 UTC

UTF-8 can also be guessed, but it is harder and can be mixed up with some legacy encoding.

Not really. The problem is the amount of lookahead needed. With a BOM, one can be sure after reading just a few bytes.

Re^3: perltidy and UTF-8 BOM

by 1nickt (Canon) on May 23, 2018 at 12:18 UTC

Also, while JSON texts are required not to have a leading BOM, the spec allows consumers to ignore one rather than treat it as an error. However, of Perl's JSON libraries, only Cpanel::JSON::XS handles the case without throwing an exception.

The way forward always starts with a minimal test.

Re^2: perltidy and UTF-8 BOM

by morelenmir (Beadle) on May 21, 2018 at 13:02 UTC

A good question!!! I have found in the past that when transferring the same UTF-8 encoded text file between different editors, the non-ASCII characters would display corrupted unless the file had a BOM. I encountered this especially between either 'EditPad Pro' or the newer versions of 'Notepad' and 'Programmer's File Editor'. The latter is now a very old but in its day extremely handy text editor which I used for the majority of the 2000s. I also had issues with non-BOM Unicode text and the free version of 'Take Command Console', which I use exclusively instead of the native console in Windows. This is a generally excellent 'DOS' replacement but it does not support UTF-8, so again without a BOM I found weird things happened, and the last time I spoke to the chap who writes TCC he was pretty militant about only offering UTF-16 output from his console commands. So as a blanket fix I applied a BOM to all Unicode files, whether UTF-8 or UTF-16, and never considered it again. These days I use EPP for all my editing so I could probably live without it, but there would be a lot of files to re-edit and remove the BOM from! Even then I'd still run into issues with TCC, however, as I also launch the Perl runtime and debugger through it. At the end of the day, as you say, UTF-8 shouldn't need a BOM, but I have found (other than with perltidy!!!) that employing one helps more than it hinders.

I will try that idea of stripping and then reapplying the BOM.

As an aside: I am afraid I do not know how to quote and reply to individual posts in this forum system so I am having to do so en masse. My apologies if this appears somewhat confusing because of it.

"Aure Entuluva!" - Hurin Thalion at the Nirnaeth Arnoediad.

Re^3: perltidy and UTF-8 BOM

by haukex (Archbishop) on May 21, 2018 at 13:19 UTC

I am afraid I do not know how to quote and reply to individual posts in this forum system so I am having to do so en masse.

In the thread view, every post has a [reply] link in the bottom right corner of the post.

In the individual view, there's the Comment on link underneath the post. (Update: Tux pointed out that this link could also be above the post, thanks!)

vvv here vvv

or here --->

Re^3: perltidy and UTF-8 BOM

by AnomalousMonk (Archbishop) on May 21, 2018 at 13:55 UTC

... how to quote ...

I'm not sure this is what you're referring to, but there is a <blockquote> ... quoted text ... </blockquote> tag that this site supports; see Markup in the Monastery and Writeup Formatting Tips. (Many monks, including myself, embed italics tags within the blockquote tags to further distinguish the quote. The end result is then <blockquote><i> ... </i></blockquote>.)

Give a man a fish: <%-{-{-{-<

Re^2: perltidy and UTF-8 BOM

by ikegami (Patriarch) on May 22, 2018 at 19:30 UTC

A BOM is frequently used to identify UTF-8 files even though the concept of byte order doesn't exist in UTF-8. Remember, the BOM is really just U+FEFF ZERO WIDTH NO-BREAK SPACE, a completely invisible character.
