in reply to Quick and portable way to determine line-ending string?

Most (all?) platforms use one or both of the 0x0a and 0x0d characters as the newline, so something like:
sub seperator {
    my $cr = chr(0x0d);
    my $lf = chr(0x0a);
    if ($_[0] =~ /([$cr$lf]+)/o) {
        return $1;
    }
    # Unable to find a separator; handle the error here.
}
This assumes that the file will not contain a 0x0a or 0x0d unless it is used as part of the newline, which should be true of a text file.

Update: If the first occurrence of the newline is several newlines in a row (for example "\n\n\nThree lines before me\n"), then all of them will match and the code will not return the correct line separator. Best to just check for $cr$lf, $lf$cr, $cr, and $lf individually.
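
The individual check suggested in the update could be sketched roughly as follows. The name detect_separator is hypothetical, and the sketch still assumes the file uses one line ending consistently:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first line separator found in $text,
# or undef (an empty list in list context) if none is present.
sub detect_separator {
    my ($text) = @_;
    my $cr = chr(0x0d);
    my $lf = chr(0x0a);
    # The two-character forms are listed first so that a CR LF pair
    # is not reported as a lone CR.
    if ($text =~ /($cr$lf|$lf$cr|$cr|$lf)/) {
        return $1;
    }
    return;    # no separator found; the caller decides how to handle it
}
```

Because the pattern matches only one separator, a file that starts with several blank lines no longer yields a run of newlines.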

Re: Re: Quick and portable way to determine line-ending string?
by bikeNomad (Priest) on Aug 09, 2001 at 02:59 UTC
    Sadly, one can't assume that. For instance, I have seen a number of cases where a supposedly-text file from a Unix system has been edited on a MS-DOSish system and hence contains extra "\x0d" characters.

    Then, of course, there's the question of what to do on EBCDIC systems, where the line endings are likely to be something entirely different.

      On an EBCDIC system, the line endings are probably "\r" and "\n", of course. And there is no point in using "\x0a" and "\x0d" in the previous code. The only use for "\x0a" and "\x0d" is when you might run under MacOS and are using something like a network protocol that requires "\r\n". MacOS made the mistake of changing the definition of "\r" and "\n" rather than translating them. All other systems that use non-Unix line endings _translate_ to/from "\n".

      If it weren't for MacOS, "\r" and "\n" would always be the right choice. The move toward "\x0a" and "\x0d" has been motivated by trying to be portable with MacOS and has caused great confusion. Since very few Perl programmers actually work on the even weirder systems like those that use EBCDIC, the folly of this has not been widely noted ( is one of the few places that I've seen start to notice this).

              - tye (but my friends call me "Tye")
      If the file contains 0x0d characters, then they are supposed to be part of the line separator and will be caught by the code I wrote. While many Unix tools will see the 0x0d as just another character, if you do in fact have a file which has 0x0d 0x0a pairs, you probably want 0x0d 0x0a to be your line separator. If you had a mix of 0x0d 0x0a and plain 0x0a line separators, the code won't work, but so long as the file is consistent it should be fine.
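
      One way to find out whether a file is consistent before trusting a detected separator is to tally each style of line ending. This is only a sketch: count_line_endings is a hypothetical name, and it looks only at the 0x0d/0x0a forms discussed in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: counts how often each style of line ending occurs.
sub count_line_endings {
    my ($text) = @_;
    my %count = (crlf => 0, cr => 0, lf => 0);
    # CR LF is tried first so a pair is not counted as a CR plus an LF.
    while ($text =~ /(\x0d\x0a|\x0d|\x0a)/g) {
        my $key = $1 eq "\x0d\x0a" ? 'crlf'
                : $1 eq "\x0d"     ? 'cr'
                :                    'lf';
        $count{$key}++;
    }
    return %count;
}

# A Unix file partly edited on an MS-DOSish system (made-up sample data).
my %seen   = count_line_endings("one\x0d\x0atwo\x0athree\x0d\x0a");
my @styles = grep { $seen{$_} > 0 } sort keys %seen;
print "mixed line endings: @styles\n" if @styles > 1;
```

      If more than one style shows up, the single-separator approach above does not apply and the file needs more lenient handling.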
      And Unicode files that use the new linebreak/parabreak characters! Say it's a UTF-8 encoded file... no 0x0A in sight!

      Like Perl itself, you need to be lenient about reading linebreaks. But you need to know the proper form for writing them.
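
      A rough sketch of "lenient on input, strict on output", assuming Perl 5.10 or later, where the regex escape \R matches CR LF, a lone CR or LF, and the Unicode line and paragraph separators. The names split_lines_leniently and $out_sep are hypothetical, and the text is assumed to be already decoded into characters.

```perl
use strict;
use warnings;
use 5.010;    # \R is available from Perl 5.10 onward

# Hypothetical helper: splits text into lines, accepting any common line ending.
sub split_lines_leniently {
    my ($text) = @_;
    return split /\R/, $text;
}

# Made-up input that mixes three different endings.
my @lines = split_lines_leniently("one\x0d\x0atwo\x0athree\x0dfour");

# Write the lines back with one explicitly chosen ending.
my $out_sep = "\x0a";    # assumption: the target format wants a bare LF
print join($out_sep, @lines), $out_sep;
```

      Note that split drops trailing empty fields, so blank lines at the very end of the input are not preserved by this sketch.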