http://qs321.pair.com?node_id=198189


in reply to Handling Mac, Unix, Win/DOS newlines at readtime...

Good Monks!

I think that using \r or \n in this script should be considered harmful. If you really want to be portable (which might not be the case), you should exclusively use \015 and \012 to match DOS/Unix/Mac line endings.

Imagine this script being run on some box using EBCDIC (I know this is most probably a hypothetical assumption but, oh well, I just want to demonstrate something here...)

Read perlebcdic. It has an example saying:

$is_ebcdic_37   = "\n" eq chr(37);
$is_ebcdic_1047 = "\n" eq chr(21);

Uh-oh... That means that if you split on \n under a perl on an EBCDIC system, you'd actually be splitting on byte 37 or byte 21, which in ASCII data are '%' and NAK respectively.
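To check which code points your own perl assigns to \n and \r, a minimal sketch along these lines could be used. The helper name newline_ordinals is made up for this illustration. The point is only that \012 and \015 always name the same values, while \n and \r are whatever the platform decides.

```perl
use strict;
use warnings;

# Hypothetical helper: report the code points this perl uses for "\n" and "\r".
sub newline_ordinals {
    return (ord("\n"), ord("\r"));
}

my ($nl, $cr) = newline_ordinals();
printf "\\n is chr(%d), \\r is chr(%d)\n", $nl, $cr;

# Octal escapes name a fixed value, independent of the platform's newline.
printf "\\012 is chr(%d), \\015 is chr(%d)\n", ord("\012"), ord("\015");
```

On an ASCII-based perl the first line should report 10 and 13; going by the perlebcdic example above, an EBCDIC perl would report something else for \n.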

You really want to use HTML::Parser (or even XML::Parser) to parse your input. At least do something like this (untested):

@lines = split /\012\015?|\015\012?/, $file;
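As a rough check of that pattern, here is a small sketch that runs it over a string with mixed line endings. The helper name split_lines and the sample data are assumptions made for this illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: split text on LF, CR, CRLF (or LFCR), using octal
# escapes so the pattern matches the same values on every platform.
sub split_lines {
    my ($text) = @_;
    return split /\012\015?|\015\012?/, $text;
}

# Unix, DOS and old Mac line endings mixed in one string.
my $mixed = "unix\012dos\015\012mac\015last";
my @lines = split_lines($mixed);
print join('|', @lines), "\n";    # prints unix|dos|mac|last
```

Two caveats: the pattern treats \012\015 as a single terminator, so a Unix line end directly followed by an empty Mac line collapses into one break, and split drops trailing empty fields, so empty lines at the very end of the input are lost.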

Whichever approach you choose, input normalization is not really a trivial problem...

So long,
Flexx

Re: Re: Handling Mac, Unix, Win/DOS newlines at readtime...
by blm (Hermit) on Sep 16, 2002 at 11:33 UTC

    Beat me to it, but I am not as revered as Aristotle. The only reason I knew is that I got caught out badly reading DOS text files on Linux. Everyone who uses Perl on different platforms, be they Mac, *nix and/or Windows/DOS based, should read perlport for the lowdown on when \n is \012 or \015 or \015\012.

    To quote some words of wisdom:

    In most operating systems, lines in files are terminated with newlines. Just what is used as a newline may vary from OS to OS. Unix traditionally uses \012, one kind of Windows I/O uses \015\012, and Mac OS uses \015.

    Perl uses \n to represent the "logical" newline, where what is logical may depend on the platform in use. In MacPerl, \n always means \015. In DOSish perls, \n usually means \012, but when accessing a file in "text" mode, STDIO translates it to (or from) \015\012.

    Amen ;-)

Re^2: Handling Mac, Unix, Win/DOS newlines at readtime...
by Aristotle (Chancellor) on Sep 16, 2002 at 09:43 UTC
    Beat me to it. That's the only sane proposal here. ++

    Makeshifts last the longest.

      Wow! Reading that from you makes me proud (and I mean this 100% honestly!).
Re: Re: Handling Mac, Unix, Win/DOS newlines at readtime...
by John M. Dlugosz (Monsignor) on Sep 25, 2002 at 19:32 UTC
    I think the problem is only when the data being processed doesn't use the same encoding as the source file.

      Why? Do you mean the Perl script's 'source' encoding? That doesn't really matter here.

      Perl's idea of what \r and \n mean differs from system to system. A Mac would output \015 when it sees \n, while a Unix box would generate \012 (IIRC; it might also be the other way round... ;).

      So if a script tries to handle DOS text files using \r\n to split, chomp, substitute, etc., it will work on a system that has a compatible native encoding (ASCII). If, however, you run the same script on an incompatible system (for example an EBCDIC platform like IBM's AS400, or even something more common like a pre-OS X Mac), it will do really funny things with your input, as it translates \r\n to some other ordinals, regardless of the script's source encoding (which will most likely be the platform's "native" encoding).

      The same is true, of course, when you want to operate on files coming from AS400 machines on your Unix box (which is more likely, I guess). If you try to split these files on (Unix's) \n, you'll hit anything but a line end...

      To sum up: DOS files want to be split on \015\012, regardless of what line terminators the splitting system uses...
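One way to act on that summary when reading line by line (rather than splitting a slurped string) is to set the input record separator explicitly. This is only a sketch under assumptions: the helper name read_dos_lines is invented here, and an in-memory filehandle stands in for a real DOS file.

```perl
use strict;
use warnings;

# Hypothetical helper: read CRLF-terminated records from a filehandle,
# independent of what "\n" means on the current platform.
sub read_dos_lines {
    my ($fh) = @_;
    binmode $fh;              # disable any text-mode newline translation
    local $/ = "\015\012";    # record separator is a literal CR LF pair
    my @lines = <$fh>;
    chomp @lines;             # chomp strips the current value of $/
    return @lines;
}

# Demonstration with an in-memory filehandle in place of a real file.
my $data = "first\015\012second\015\012third\015\012";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my @lines = read_dos_lines($fh);
close $fh;
print scalar(@lines), " lines: @lines\n";    # prints 3 lines: first second third
```

Because $/ is localized inside the helper, the rest of the script keeps its usual record separator.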

      So long,
      Flexx