UTF-8 text files with Byte Order Mark

muba has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: UTF-8 text files with Byte Order Mark by almut (Canon) on Feb 13, 2007 at 17:50 UTC
Actually, I would be a little surprised to find a BOM in combination with UTF-8 (as the encoding is just a sequence of bytes). Normally, you'd find BOMs with the "ucs-2" encodings, as used by Windows in many places. With those, we have a 16-bit value per char, and thus the internal byte ordering matters.

Anyway, what you could try is something like this (not sure if this is the most elegant way, but it should work... Update: it isn't :) - apparently there's `File::BOM`):

    sub openfile_unicode {
        my $filename = shift;
        open my $fh, "<:raw", $filename or die "Cannot open $filename: $!\n";
        my $bom;
        read $fh, $bom, 2;
        if ($bom eq "\xff\xfe" || $bom eq "\xfe\xff") {   # BOM present?
            # if so, determine if little- or big-endian
            my $encoding = "ucs-2" . ($bom eq "\xff\xfe" ? "le" : "be");
            binmode $fh, ":encoding($encoding)";
        } else {   # otherwise assume UTF-8
            # reopen file
            close $fh;
            $fh = undef;
            open $fh, "<:encoding(utf8)", $filename or die "Cannot open $filename: $!\n";
        }
        return $fh;
    }

    my $fh = openfile_unicode("somefile");
    while (my $line = <$fh>) {
        # ...
    }
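A minimal sketch of a variant of the opener above that also checks for a three-byte UTF-8 BOM (EF BB BF), using only core Perl. The name `open_with_bom_sniffing` is hypothetical, the use of `UTF-16LE`/`UTF-16BE` instead of the "ucs-2" names is an assumption, and UTF-32 BOMs are deliberately not handled.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): open a text file for reading and
# pick the decoding layer from a leading BOM, if there is one.
sub open_with_bom_sniffing {
    my ($filename) = @_;

    open my $fh, '<:raw', $filename
        or die "Cannot open $filename: $!\n";

    # Read up to three bytes: enough for the UTF-8 BOM (EF BB BF) and for the
    # two-byte UTF-16 BOMs (FF FE and FE FF).
    my $head = '';
    defined(read($fh, $head, 3))
        or die "Cannot read $filename: $!\n";

    # Assumption: a file without a BOM is treated as UTF-8.
    my ($encoding, $bom_length) = ('UTF-8', 0);
    if    ($head =~ /^\xEF\xBB\xBF/) { ($encoding, $bom_length) = ('UTF-8',    3) }
    elsif ($head =~ /^\xFF\xFE/)     { ($encoding, $bom_length) = ('UTF-16LE', 2) }
    elsif ($head =~ /^\xFE\xFF/)     { ($encoding, $bom_length) = ('UTF-16BE', 2) }

    # Rewind to just past the BOM (or to the start), then decode from there.
    seek($fh, $bom_length, 0)
        or die "Cannot seek in $filename: $!\n";
    binmode $fh, ":encoding($encoding)";

    return $fh;
}
```

Because the handle is positioned after the BOM before the decoding layer is pushed, the caller never sees U+FEFF at the start of the first line. Whether that is desirable depends on whether the program ever needs to write the file back out unchanged.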
Re^2: UTF-8 text files with Byte Order Mark by Joost (Canon) on Feb 13, 2007 at 17:53 UTC
"Actually, I would be a little surprised to find a BOM in combination with UTF-8"

Yeah, you don't usually see utf-8 files with a BOM. Nevertheless, it's perfectly valid.

Update: note that the utf-8 BOM consists of three bytes: "EF BB BF"

"What should it profit a man, if he should win a flame war, yet lose his cool?"
Re^3: UTF-8 text files with Byte Order Mark by muba (Priest) on Feb 13, 2007 at 20:03 UTC
The test file seems to match that three-byte BOM indeed. I'm happy to know you don't usually see utf-8 files with a BOM, but as pointed out below, some programs still store it, such as Notepad. One of my users seems to have a utf-8 file with a BOM too.
Re^2: UTF-8 text files with Byte Order Mark by ikegami (Patriarch) on Feb 13, 2007 at 18:08 UTC
File::BOM does the same thing (and does it better?)
Re^2: UTF-8 text files with Byte Order Mark by ikegami (Patriarch) on Feb 13, 2007 at 18:05 UTC
"Actually, I would be a little surprised to find a BOM in combination with UTF-8 (as the encoding is just a sequence of bytes)."

`notepad` adds a BOM when you save as UTF-8.
Re^2: UTF-8 text files with Byte Order Mark by Anonymous Monk on Mar 18, 2010 at 06:37 UTC
Many text editors use a BOM to distinguish ASCII or a local encoding from UTF
Re: UTF-8 text files with Byte Order Mark by ikegami (Patriarch) on Feb 13, 2007 at 17:55 UTC
"so I kinda assume that Perl will handle with this kind of stuff for me."

Having Perl remove the BOM automatically would be bad. `print while <$fh>;` would no longer print out a file exactly, for example. It wouldn't be possible to print out a file exactly by other means either.

"However, if file contains that BOM, my program does not understand the first line in the file"

Patient: "Doctor, it hurts when I do this." Doctor: "So don't do it!"

If your program doesn't accept BOMs, don't feed it any. BOMs are not required.

Alternatively, you could change your spec and your program to accept it.

`while (<$fh>) { s/\x{FEFF}//g; ... }`
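A hedged sketch of the "change your program to accept it" route, restricted to the first line so that a U+FEFF appearing later in the text is left alone. The helper name `read_lines_without_bom` is hypothetical, and the handle is assumed to have been opened with a UTF-8 decoding layer so that the BOM arrives as the single character U+FEFF.

```perl
use strict;
use warnings;

# Hypothetical helper: read every line from an already-decoded handle and
# remove a leading U+FEFF from the first line only.
sub read_lines_without_bom {
    my ($fh) = @_;
    my @lines;
    while (my $line = <$fh>) {
        # $. is the current line number of the handle last read from.
        $line =~ s/^\x{FEFF}// if $. == 1;
        push @lines, $line;
    }
    return @lines;
}

# Usage: an in-memory "file" holding the bytes EF BB BF and two lines.
my $bytes = "\xEF\xBB\xBFfirst\nsecond\n";
open my $fh, '<:encoding(UTF-8)', \$bytes
    or die "Cannot open in-memory file: $!\n";
my @lines = read_lines_without_bom($fh);
close $fh;

chomp(my $first = $lines[0]);
print scalar(@lines), " lines, first is '$first'\n";
```

Anchoring the substitution with `^` and limiting it to line 1 keeps the rest of the input byte-for-byte intact, which addresses part of the concern about no longer being able to reproduce a file exactly.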
Re^2: UTF-8 text files with Byte Order Mark by muba (Priest) on Feb 13, 2007 at 20:05 UTC
Patient: "Doctor, it hurts when I do this." Doctor: "So don't do it!"

Easy to say, of course, but what if the program one of my users uses stores that BOM anyway? Besides, as pointed out, a BOM in a utf-8 file is valid, so I feel I should support it. Look, if the user was toying around with malformed files I'd be more than happy to tell him to get that fixed :D but apparently he's doing what he rightfully thinks is right.
Re^3: UTF-8 text files with Byte Order Mark by ikegami (Patriarch) on Feb 13, 2007 at 20:36 UTC
"a BOM in a utf-8 file is *valid*"

"`!`" in an ASCII file is also valid. But if you place a "`!`" at the start of your Perl program, it probably will not compile. It is a malformed file, not from a Unicode perspective, but from your parser's perspective.

I provided two alternatives (removing the BOM and File::BOM) that will work with your broken tools (i.e. tools that add undesirable characters to the files you edit). I'd go with them since allowing the BOM is surely a good thing.
Re^4: UTF-8 text files with Byte Order Mark by muba (Priest) on Feb 13, 2007 at 20:43 UTC
Re^2: UTF-8 text files with Byte Order Mark by Anonymous Monk on Jul 24, 2019 at 20:56 UTC
"If your program doesn't accept BOMs, don't feed it any. BOMs are not required."

This is a mindbogglingly stupid statement that ignores or even stands on its head the Robustness principle. Anyone who writes something so inane and so dangerous should be barred for life from software development.
Re: UTF-8 text files with Byte Order Mark by Joost (Canon) on Feb 13, 2007 at 18:01 UTC
A BOM is part of the text and it's a (sort of) valid character, "ZERO WIDTH NO-BREAK SPACE". Your best bet is just to strip it off, since its use (aside from providing a BOM) isn't recommended anyway:

    while (my $line = <>) {
        $line =~ s/^\x{FEFF}//;   # strip BOM
        # rest
    }

"What should it profit a man, if he should win a flame war, yet lose his cool?"
Re^2: UTF-8 text files with Byte Order Mark by muba (Priest) on Feb 13, 2007 at 20:21 UTC
Yeah, this works, except that the BOM is indeed a three-byte thing, as said above. So the code, which seems to work, now looks like this:

    while (my $line = <$rulesFH>) {
        if ($. == 1) {
            # Remove Byte Order Mark if it's there
            use Encode;
            my $octets = encode("utf8", $line);
            $octets =~ s/^\x{ef}\x{bb}\x{bf}//;
            $line = decode("utf8", $octets);
        }
        # rest...
    }
Re^3: UTF-8 text files with Byte Order Mark by ikegami (Patriarch) on Feb 13, 2007 at 20:48 UTC
    my $octets = encode("utf8", $line);
    $octets =~ s/^\x{ef}\x{bb}\x{bf}//;
    $line = decode("utf8", $octets);

is the same thing as

    my $BOM = decode("utf8", "\x{ef}\x{bb}\x{bf}");
    $line =~ s/^$BOM//;

is the same thing as

    my $BOM = chr(0xFEFF);
    $line =~ s/^$BOM//;

is the same thing as

    $line =~ s/^\x{FEFF}//;

which is what I gave you. Much simpler!
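The chain of equivalences can be checked with a short script. This is only a sketch: the sample line is made up, and it assumes the line has already been decoded (for example by a UTF-8 input layer), so the BOM is present as the single character U+FEFF.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A decoded line that starts with U+FEFF (sample text is made up).
my $line = "\x{FEFF}rule one\n";

# The UTF-8 encoding of U+FEFF is exactly the three bytes EF BB BF.
my $bom_is_three_bytes = encode('UTF-8', chr(0xFEFF)) eq "\xEF\xBB\xBF";

# Variant 1: round-trip through octets and strip the three BOM bytes.
my $octets = encode('UTF-8', $line);
$octets =~ s/^\xEF\xBB\xBF//;
my $via_octets = decode('UTF-8', $octets);

# Variant 2: strip the decoded character directly.
(my $via_chars = $line) =~ s/^\x{FEFF}//;

print $bom_is_three_bytes ? "BOM encodes to EF BB BF\n" : "unexpected BOM encoding\n";
print $via_octets eq $via_chars ? "same result\n" : "different results\n";
```

Both variants leave the same string, so the encode/decode round trip adds work without changing the outcome.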
Re^4: UTF-8 text files with Byte Order Mark by muba (Priest) on Feb 13, 2007 at 21:01 UTC
Re^4: UTF-8 text files with Byte Order Mark by Anonymous Monk on Sep 30, 2011 at 18:30 UTC
Re^5: UTF-8 text files with Byte Order Mark by ikegami (Patriarch) on Oct 01, 2011 at 21:53 UTC
Some notes below your chosen depth have not been shown here
Re: UTF-8 text files with Byte Order Mark by Anonymous Monk on Dec 08, 2022 at 08:45 UTC
I agree that this should be done automatically if the UTF-8 IO layer is specified. The fact that UTF-8 files with a BOM are rare makes this more important. I'm willing to bet that there are many Perl scripts out there that read UTF-8 files and that will break the first time they encounter a file with a BOM.
Re: UTF-8 text files with Byte Order Mark by freonpsandoz (Beadle) on Sep 19, 2016 at 03:57 UTC
"If your program doesn't accept BOMs, don't feed it any. BOMs are not required."

BOMs are required in some types of UTF-8 files. Try loading a UTF-8 cue sheet or m3u8 playlist without a BOM into Foobar2000 sometime...

