PerlMonks  

UTF-8 text files with Byte Order Mark

by muba (Priest)
on Feb 13, 2007 at 16:50 UTC ( [id://599720]=perlquestion )

muba has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a program that reads two UTF-8 files. One contains a lexicon, the other provides a number of sound change rules. In the end, the program is to apply those rules to the original lexicon and output the sound-changed words.

So far so good. Except that things fail if the BOM (byte order mark) is present in the ruleset file.

I open the file as open my $lexFH, "<:encoding(UTF-8)", $clarg{l} or die "Couldn't open lexicon file $clarg{l}: $!"; so I kinda assume that Perl will handle this kind of stuff for me.

However, if the file contains that BOM, my program does not understand the first line in the file. OK, so I understand the complete details of why my program has trouble with the line, and in the end it just boils down to the simple fact that it doesn't expect that BOM.

And neither did I. I had hoped that Perl would understand it as part of the utf-8 encoding.

By the way, I read my lines as while (my $line = <$lexFH>) {.

So. The actual question I'm trying to ask is this: how do I make Perl understand the BOM in a way that my program never sees it?
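To make the symptom concrete, here is a minimal sketch of what the :encoding(UTF-8) layer hands back for a BOM-prefixed file. The file contents ("a > b") are made up for illustration, and an in-memory filehandle stands in for the real rules file.

```perl
use strict;
use warnings;

# Bytes of a hypothetical one-line rules file saved with a UTF-8 BOM
# (the three bytes EF BB BF) in front of the text.
my $bytes = "\xEF\xBB\xBFa > b\n";

# Open the bytes as an in-memory file through the same layer as in the question.
open my $fh, '<:encoding(UTF-8)', \$bytes
    or die "Cannot open in-memory file: $!";
my $first_line = <$fh>;
close $fh;

# The layer decodes the three BOM bytes into the single character U+FEFF
# and leaves it at the front of the line.
my $first_char = sprintf 'U+%04X', ord $first_line;
print "first character: $first_char\n";
print "line length: ", length($first_line), "\n";
```

So the decoding itself goes fine; the first line simply carries one extra invisible character that the parser has to be told about.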

Replies are listed 'Best First'.
Re: UTF-8 text files with Byte Order Mark
by almut (Canon) on Feb 13, 2007 at 17:50 UTC

    Actually, I would be a little surprised to find a BOM in combination with UTF-8 (as the encoding is just a sequence of bytes). Normally, you'd find BOMs with the "ucs-2" encodings, as used by Windows in many places. With those, we have a 16-bit value per char, and thus the internal byte ordering matters.

    Anyway, what you could try is something like this (not sure if this is the most elegant way, but it should work...   Update: it isn't :) - apparently there's File::BOM)

    sub openfile_unicode {
        my $filename = shift;
        open my $fh, "<:raw", $filename
            or die "Cannot open $filename: $!\n";
        my $bom;
        read $fh, $bom, 2;
        if ($bom eq "\xff\xfe" || $bom eq "\xfe\xff") {  # BOM present?
            # if so, determine if little- or big-endian
            my $encoding = "ucs-2" . ($bom eq "\xff\xfe" ? "le" : "be");
            binmode $fh, ":encoding($encoding)";
        }
        else {  # otherwise assume UTF-8
            # reopen file
            close $fh;
            $fh = undef;
            open $fh, "<:encoding(utf8)", $filename
                or die "Cannot open $filename: $!\n";
        }
        return $fh;
    }

    my $fh = openfile_unicode("somefile");
    while (my $line = <$fh>) {
        # ...
    }

        The test file seems to match that three-byte BOM indeed.

        I'm happy to know you don't usually see UTF-8 files with a BOM, but as pointed out below, some programs, such as Notepad, still store it. One of my users seems to have a UTF-8 file with a BOM too.
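For the three-byte UTF-8 case specifically, the same sniffing idea can be sketched with core Perl only. The helper name open_utf8_skip_bom is hypothetical, and the sketch assumes the file is UTF-8 whether or not a BOM is present; it does not try to detect UCS-2/UTF-16.

```perl
use strict;
use warnings;

# Hypothetical helper: open a file as UTF-8 and skip a leading BOM if present.
sub open_utf8_skip_bom {
    my ($filename) = @_;
    open my $fh, '<:raw', $filename
        or die "Cannot open $filename: $!\n";

    # The UTF-8 encoded BOM is the three bytes EF BB BF.
    my $head = '';
    my $got  = read $fh, $head, 3;
    die "Cannot read $filename: $!\n" unless defined $got;

    # No BOM: rewind so the caller sees the file from its first byte.
    unless ($head eq "\xEF\xBB\xBF") {
        seek $fh, 0, 0 or die "Cannot rewind $filename: $!\n";
    }

    # Decode everything from the current position onward as UTF-8.
    binmode $fh, ':encoding(UTF-8)';
    return $fh;
}
```

The caller then reads lines as usual, for example while (my $line = <$fh>) { ... }, and never sees U+FEFF at the start of the first line.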

      File::BOM does the same thing (and does it better?)

      Actually, I would be a little surprised to find a BOM in combination with UTF-8 (as the encoding is just a sequence of bytes).

      Notepad adds a BOM when you save as UTF-8.

      Many text editors use a BOM to distinguish ASCII or a local encoding from UTF.
Re: UTF-8 text files with Byte Order Mark
by ikegami (Patriarch) on Feb 13, 2007 at 17:55 UTC

    so I kinda assume that Perl will handle this kind of stuff for me.

    Having Perl remove the BOM automatically would be bad. print while <$fh>; would no longer print out a file exactly, for example. It wouldn't be possible to print out a file exactly by other means either.
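A small sketch of that round trip, using in-memory filehandles and made-up file contents: because the layer passes U+FEFF through instead of swallowing it, the copy comes out byte-for-byte identical.

```perl
use strict;
use warnings;

# Hypothetical file contents: a UTF-8 BOM followed by two lines.
my $original = "\xEF\xBB\xBFfirst\nsecond\n";
my $copy     = '';

open my $in,  '<:encoding(UTF-8)', \$original or die "Cannot open input: $!";
open my $out, '>:encoding(UTF-8)', \$copy     or die "Cannot open output: $!";

# The copy loop from the post, written with explicit handles.
print {$out} $_ while <$in>;

close $in;
close $out or die "Cannot close output: $!";

print $copy eq $original ? "byte-for-byte copy\n" : "copy differs\n";
```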

    However, if the file contains that BOM, my program does not understand the first line in the file

    Patient: "Doctor, it hurts when I do this."
    Doctor: "So don't do it!"

    If your program doesn't accept BOMs, don't feed it any. BOMs are not required.

    Alternatively, you could change your spec and your program to accept it.

    while (<$fh>) { s/\x{FEFF}//g; ... }
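Note that the substitution above removes U+FEFF wherever it occurs on a line. If only a true BOM (the very first character of the stream) should go, a variant could look like the sketch below. The helper name read_lines_without_bom is made up for illustration, and the handle is assumed to be already opened with a decoding layer.

```perl
use strict;
use warnings;

# Hypothetical helper: return all lines of a decoded handle, minus a leading BOM.
sub read_lines_without_bom {
    my ($fh) = @_;
    my @lines;
    while (my $line = <$fh>) {
        # After decoding, a BOM is the single character U+FEFF; it only
        # counts as a BOM at the very start of the stream, so strip it
        # from the first line only.
        $line =~ s/^\x{FEFF}// if !@lines;
        push @lines, $line;
    }
    return @lines;
}
```

Usage would be along the lines of my @rules = read_lines_without_bom($rulesFH); any U+FEFF appearing later in the file is left alone.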
      Patient: "Doctor, it hurts when I do this."
      Doctor: "So don't do it!"

      Easy to say, of course, but what if the program one of my users uses stores that BOM anyway? Besides, as pointed out, a BOM in a utf-8 file *is* valid, so I feel I should support it. Look, if the user was toying around with malformed files I'd be more than happy to tell him to get that fixed :D but apparently he's doing what he rightly thinks is right.

        a BOM in a utf-8 file *is* valid

        "!" in an ASCII file is also valid. But if you place a "!" at the start of your Perl program, it probably will not compile. It is a malformed file, not from a Unicode perspective, but from your parser's perspective.

        I provided two alternatives (removing the BOM and File::BOM) that will work with your broken tools (i.e. tools that add undesirable characters to the files you edit). I'd go with them since allowing the BOM is surely a good thing.

      "If your program doesn't accept BOMs, don't feed it any. BOMs are not required. "

      This is a mindbogglingly stupid statement that ignores or even stands on its head the Robustness principle.
      Anyone who writes something so inane and so dangerous should be barred for life from software development.

Re: UTF-8 text files with Byte Order Mark
by Joost (Canon) on Feb 13, 2007 at 18:01 UTC

      Yeah, this works, except that the BOM is indeed a three-byte thing, as said above. So the code, which seems to work, now looks like this:

      while (my $line = <$rulesFH>) {
          if ($. == 1) {
              # Remove Byte Order Mark if it's there
              use Encode;
              my $octets = encode("utf8", $line);
              $octets =~ s/^\x{ef}\x{bb}\x{bf}//;
              $line = decode("utf8", $octets);
          }
          # rest...
      }
        my $octets = encode("utf8", $line);
        $octets =~ s/^\x{ef}\x{bb}\x{bf}//;
        $line = decode("utf8", $octets);

        is the same thing as

        my $BOM = decode("utf8", "\x{ef}\x{bb}\x{bf}");
        $line =~ s/^$BOM//;

        is the same thing as

        my $BOM = chr(0xFEFF);
        $line =~ s/^$BOM//;

        is the same thing as

        $line =~ s/^\x{FEFF}//;

        which is what I gave you. Much simpler!
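A quick way to convince yourself of the equivalence is to run both versions on the same decoded line and compare. This is only a sketch; the sample rule text is invented.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A decoded line that starts with a BOM (sample content is hypothetical).
my $line = "\x{FEFF}p > b / _a\n";

# Byte-level: encode, strip the three BOM octets, decode again.
my $octets = encode('UTF-8', $line);
$octets =~ s/^\xEF\xBB\xBF//;
my $via_bytes = decode('UTF-8', $octets);

# Character-level: strip the single decoded character U+FEFF.
(my $via_chars = $line) =~ s/^\x{FEFF}//;

print $via_bytes eq $via_chars ? "same result\n" : "different result\n";
```

The character-level form avoids the encode/decode round trip entirely, which is why it is the simpler choice once the handle already has a decoding layer.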

Re: UTF-8 text files with Byte Order Mark
by Anonymous Monk on Dec 08, 2022 at 08:45 UTC

    I agree that this should be done automatically if the UTF-8 IO layer is specified. The fact that UTF-8 files with a BOM are rare makes this more important. I'm willing to bet that there are many Perl scripts out there that read UTF-8 files and that will break the first time they encounter a file with a BOM.

Re: UTF-8 text files with Byte Order Mark
by freonpsandoz (Beadle) on Sep 19, 2016 at 03:57 UTC

    If your program doesn't accept BOMs, don't feed it any. BOMs are not required.

    BOMs are required in some types of UTF-8 files. Try loading a UTF-8 cue sheet or m3u8 playlist without a BOM into Foobar2000 sometime...
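For the opposite direction, where a consumer insists on a BOM, a minimal sketch of writing one from Perl follows. The helper name write_utf8_with_bom is hypothetical; the only trick is printing U+FEFF first and letting the encoding layer turn it into EF BB BF.

```perl
use strict;
use warnings;

# Hypothetical helper: write lines to $path as UTF-8 with a leading BOM.
sub write_utf8_with_bom {
    my ($path, @lines) = @_;
    open my $fh, '>:encoding(UTF-8)', $path
        or die "Cannot open $path for writing: $!\n";
    # U+FEFF at the very start is encoded by the layer as the bytes EF BB BF.
    print {$fh} "\x{FEFF}", @lines;
    close $fh or die "Cannot close $path: $!\n";
    return;
}
```

A call such as write_utf8_with_bom("playlist.m3u8", "#EXTM3U\n") would then produce a file that begins with the BOM; the file name and content here are just examples.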

Node Type: perlquestion [id://599720]
Approved by Joost
Front-paged by GrandFather