line ending troubles

Dirk80 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I want to get rid of my line ending troubles. Until now I had files with these line endings: 0D 0A (Windows), 0A (Unix), 0D (early Mac). The last file I saw had mixed Mac and Unix line endings, and that was the reason my code was not working.

I am now looking for a solution that reads a file line by line regardless of which of the three line ending types it contains. I also want the line ending chomped, so that only the data of each line is stored in an array.

I tried to solve this with the following code:

#!/usr/bin/perl
use strict;
use warnings;

use Perl6::Slurp;

# generate test file "le.txt"
my $win_line = "Windows\r\n";
my $unix_line = "Unix\n";
my $mac_line = "Mac\r";

open(my $fh, ">", "le.txt");
binmode($fh);

print $fh $win_line;
print $fh $unix_line;
print $fh $mac_line;

close($fh);


# read file with slurp

my @lines = slurp("le.txt", {chomp => 1, irs => qr/(\r\n)|(\n)|(\r)/});
for my $line (@lines)
{
    print  $line . "\n";
}

But the read code is not working and I don't know why. Any hints are welcome. Here is some other code I wrote to read the file with the different line endings. This code works.

# read file
my $file_content;
{
    local $/ = undef; # no input data record separator --> file content will be stored in one scalar
    open(my $fh, "<", "le.txt");
    binmode($fh);

    $file_content = <$fh>;
    close($fh);
}

(my $space_file_content = $file_content) =~ s/(\r\n)|(\n)|(\r)/ /g;
for my $line (split(/ /, $space_file_content))
{
    print $line . "\n";
}

Although the second solution works, I would be very interested in how you would solve this problem and why my slurp code is not working.

Thank you very much for your help.

Dirk


Replies are listed 'Best First'.
Re: line ending troubles by kennethk (Abbot) on Dec 21, 2009 at 20:31 UTC
The issue is with your regular expression. I'm not feeling particularly like delving into the Perl6::Slurp source right now, but changing your regex from capturing to non-capturing yields what I assume to be your intended output:

#!/usr/bin/perl
use strict;
use warnings;

use Perl6::Slurp;

# generate test file "le.txt"
my $win_line = "Windows\r\n";
my $unix_line = "Unix\n";
my $mac_line = "Mac\r";

open(my $fh, ">", "le.txt") or die "Failed file open: $!";
binmode($fh);

print $fh $win_line;
print $fh $unix_line;
print $fh $mac_line;

close($fh);

# read file with slurp
#my @lines = slurp("le.txt", {chomp => 1, irs => qr/(\r\n)|(\n)|(\r)/});
my @lines = slurp("le.txt", {chomp => 1, irs => qr/\r\n|\n|\r/});
for my $line (@lines)
{
    print $line . "\n";
}

I assume this means that the module makes use of the capture buffers internally, which your captures overwrite. Also notice I added an `or die` clause to your open statement, since that's usually a Good Thing(TM).

For your pure regular expression solution, is there a reason you didn't just split on a regular expression, a la:

for my $line (split /\r\n?|\n/, $file_content)
{
    print $line . "\n";
}

Update: I looked at my inbox, and decided I did feel like delving. Your issue is that capturing parentheses mean 'include my delimiter in the result set' (see split, or run the code `split /(k)/, 'onektwokthree'`). Since Perl6::Slurp uses split to process the results, it inserts your delimiters into the result stream. In fact, because Perl6::Slurp already uses delimiter capturing in the result (I don't see why, but see line 106), you end up with a real mess in the resulting split. The module then drops every other element of the array, which drops some of your results. The initial split results are:

@line = ("Windows", "\n", "", "\n", "", "Unix", "\n", "", "\n", "", "Mac", "\r", "", "\r", "");

~~The module was written by Damian Conway, who is much smarter than I am. Anyone know why he'd use parens in the split and then manually drop alternating terms?~~ He used the delimiter capture to control the chomp behavior.
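To make the effect of capturing parentheses in split concrete, here is a minimal sketch that uses plain split only, not Perl6::Slurp. The sample string and the helper name `show` are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The example from the reply: a capture group makes split return the delimiters too.
my @demo = split /(k)/, 'onektwokthree';    # ('one', 'k', 'two', 'k', 'three')

# Sample data with all three line-ending styles (assumed for illustration).
my $content = "Windows\r\nUnix\nMac\r";

# Capturing group: every matched line ending shows up as its own list element.
my @with_delims = split /(\r\n?|\n)/, $content;

# No capture group: only the text between the line endings is returned.
my @lines = split /\r\n?|\n/, $content;

# Hypothetical helper that makes control characters visible when printing.
sub show {
    my @items = @_;
    for (@items) { s/\r/\\r/g; s/\n/\\n/g; }
    return join ', ', map { "'$_'" } @items;
}

print "demo:          ", show(@demo), "\n";
print "capturing:     ", show(@with_delims), "\n";
print "non-capturing: ", show(@lines), "\n";
```

With a single capture group the delimiters alternate with the data; each additional group adds one more element per delimiter, which is why nested groups produce the kind of mess described above.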
Re^2: line ending troubles by Dirk80 (Pilgrim) on Dec 22, 2009 at 22:08 UTC
Thank you very much for your excellent answer. My mistake was that I did not know that parentheses in a split pattern have this effect.

But now another question about alternation in regexps. In my tests I have seen that the order of the alternatives is important. Is it really always true that the first alternative is tried first, then the second, the third, and so on?

And one more question about slurp. I've seen that when I run perl on Windows the crlf layer is active by default. Of course I can pop this layer with binmode or :raw. But if I don't do that, will slurp use the crlf layer if no layer is specified?

Greetings
Dirk
Re^3: line ending troubles by kennethk (Abbot) on Dec 22, 2009 at 23:01 UTC
Regarding matching behaviors with alternation, see Matching this or that in perlretut. Short answer, yes.

`slurp` as implemented in Perl6::Slurp v0.03 (what I'm using for reference) calls a 3-argument open with mode '<' if no layer information is passed. This means it will behave like a normal file open on your OS, which as you've observed includes the crlf layer by default under Windows. See Defaults and how to override them in PerlIO. If you haven't reviewed it yet, you should read about Newlines in perlport.
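The short answer about alternation order can be checked with a small sketch. The sample string is an assumption; the point is only that, at a given position, the first alternative that matches wins, so `\r\n` has to be listed before `\r`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "a\r\nb";

# "\r" listed first: at the "\r\n" pair the first matching alternative is "\r",
# so "\r" and "\n" are consumed as two separate delimiters.
my @short_first = split /\r|\n|\r\n/, $text;

# "\r\n" listed first: the pair is consumed as a single delimiter.
my @long_first = split /\r\n|\r|\n/, $text;

printf "short alternative first: %d fields\n", scalar @short_first;   # 3 fields: 'a', '', 'b'
printf "long alternative first:  %d fields\n", scalar @long_first;    # 2 fields: 'a', 'b'
```

The pattern `\r\n?|\n` used earlier in the thread avoids the ordering question, because the optional `\n` is part of the same alternative.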
Re^4: line ending troubles by Dirk80 (Pilgrim) on Dec 25, 2009 at 00:31 UTC
Re^5: line ending troubles by kennethk (Abbot) on Dec 28, 2009 at 22:18 UTC
Re: line ending troubles by almut (Canon) on Dec 21, 2009 at 22:24 UTC
Yet another way would be to write a custom PerlIO layer (similar in spirit to `:crlf`, but for reading only), e.g. using PerlIO::via. In its most simple form it could look something like:

package PerlIO::via::AnyCRLF;  # save as PerlIO/via/AnyCRLF.pm

sub PUSHED {
    my ($class) = @_;
    my $dummy;
    return bless \$dummy, $class;
}

sub FILL {
    my ($self, $fh) = @_;
    my $len = read $fh, my $buf, 4096;
    if (defined $buf) {
        $buf =~ s/\r\n/\n/g;
        $buf =~ s/\r/\n/g;
    }
    return $len > 0 ? $buf : undef;
}

1;

Sample usage:

#!/usr/bin/perl

use PerlIO::via::AnyCRLF;

open my $f, "<:via(AnyCRLF)", "le.txt" or die $!;

print while <$f>;

Handling the corner case (when `\r\n` gets split such that `\r` is in one buffer read, and `\n` in the next) is left as an exercise for the reader ;)

A quick fix could be to delegate the `\r\n` to `\n` translation to the regular `:crlf` layer (i.e. `"<:crlf:via(AnyCRLF)"`), and only do the `\r` to `\n` translation in this layer...
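One possible way to approach the corner case mentioned above is to hold back a trailing `\r` until the next chunk shows whether a `\n` follows. The sketch below shows that idea as a standalone closure, not wired into PerlIO::via. The name `make_newline_normalizer` and the convention of passing undef at end of input are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns a closure that converts "\r\n" and lone "\r" to "\n" across chunk
# boundaries. Call it with each chunk, then once with undef at end of input.
sub make_newline_normalizer {
    my $held_cr = 0;    # true if the previous chunk ended in "\r"
    return sub {
        my ($chunk) = @_;
        if (!defined $chunk) {              # end of input: flush a held "\r"
            my $tail = $held_cr ? "\n" : '';
            $held_cr = 0;
            return $tail;
        }
        $chunk = "\r" . $chunk if $held_cr;         # put the held "\r" back in front
        $held_cr = ($chunk =~ s/\r\z//) ? 1 : 0;    # hold back a trailing "\r"
        $chunk =~ s/\r\n?/\n/g;                     # CRLF and lone CR become LF
        return $chunk;
    };
}

# A "\r\n" pair split across two chunks, followed by the end-of-input marker.
my $normalize = make_newline_normalizer();
my $result = join '', map { $normalize->($_) } "Windows\r", "\nUnix\nMac\r", undef;
print $result;
```

Inside a layer, the held-back flag would presumably live in the object created by PUSHED, and FILL would have to flush it when read reports end of file; that part is not shown here.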
Re^2: line ending troubles by Dirk80 (Pilgrim) on Dec 22, 2009 at 22:23 UTC
Thank you for the hint about the IO layers. Very interesting. Because I had never used object-oriented programming in Perl and knew nothing about layers, I had to do some reading first to understand it. Now I understand your code completely and have tried it in my environment. And it works.

Then I tried to implement your suggestion to avoid the corner case by using the crlf layer. If I understand you right the solution is as follows:

package PerlIO::via::AnyCRLF;  # save as PerlIO/via/AnyCRLF.pm

sub PUSHED {
    my ($class) = @_;
    my $dummy;
    return bless \$dummy, $class;
}

sub FILL {
    my ($self, $fh) = @_;
    my $len = read $fh, my $buf, 4096;
    if (defined $buf) {
        $buf =~ s/\r/\n/g;
    }
    return $len > 0 ? $buf : undef;
}

1;

#!/usr/bin/perl

use strict;
use warnings;

use PerlIO::via::AnyCRLF;

open my $f, "<:crlf:via(AnyCRLF)", "le.txt" or die $!;

print while <$f>;

Greetings,
Dirk
Re^3: line ending troubles by almut (Canon) on Dec 22, 2009 at 22:55 UTC
"If I understand you right the solution is as follows: ..."

Exactly.

Maybe it's worth pointing out that when you have multiple layers, the order in which they are being applied (which does matter here) is from left to right when reading, and from right to left when writing (which you aren't doing in this case, but good to know anyway :)

                 ----- reading ---->
external side    ":crlf:via(AnyCRLF)"
(file)
                 <---- writing -----
Re^4: line ending troubles by Dirk80 (Pilgrim) on Dec 26, 2009 at 16:22 UTC
Re^5: line ending troubles by almut (Canon) on Dec 27, 2009 at 23:10 UTC
Re: line ending troubles by planetscape (Chancellor) on Dec 22, 2009 at 00:22 UTC
Use dos2unix, flip, or "flip: Newline conversion between Unix, Macintosh and MS-DOS ASCII files" to convert. Or see Super Search: newline regex.

HTH,
planetscape
Re: line ending troubles by bobf (Monsignor) on Dec 21, 2009 at 23:51 UTC
I ran into a similar problem a while ago. You may find the solutions presented in Newlines: reading files that were created on other platforms to be helpful.

The two main options that I considered were:

1. Use a regex to match the newline character(s) in the file. I think this would require slurping the whole file and then doing something like `if( $file =~ m/\015$/ )` (which assumes the file will end with a newline) or `if( $file =~ m/\015(?!\012)/ )` (which doesn't), setting $/ according to what matched, and re-reading the file line-by-line. If the file contains different types of newlines this will not work.

2. Preprocess the input file to convert all newline characters to the current system's newline character.

$file =~ s[(\015)?\012(?!\015)][\n]g;
$file =~ s[(\012)?\015(?!\012)][\n]g;

I used the latter option. It worked beautifully.
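Putting the preprocessing idea together with the original question, here is a minimal sketch that reads a file without any newline translation, normalizes the line endings, and returns the chomped lines. It uses a single substitution, `s/\r\n?/\n/g`, which is a simpler stand-in and not the two substitutions quoted above. The function name `read_lines_any_eol` and the temporary test file are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a whole file without newline translation and return its lines with
# every line ending (CRLF, LF, or lone CR) removed.
sub read_lines_any_eol {
    my ($path) = @_;
    open(my $in, '<:raw', $path) or die "Cannot open $path: $!";
    my $content = do { local $/; <$in> };   # slurp the file into one scalar
    close($in);
    return () unless defined $content;
    $content =~ s/\r\n?/\n/g;               # CRLF and lone CR become LF
    return split /\n/, $content;
}

# Build a small test file containing all three line endings.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);
print {$out} "Windows\r\n", "Unix\n", "Mac\r";
close($out);

print "$_\n" for read_lines_any_eol($path);
```

Note that split drops trailing empty fields, so blank lines at the very end of the file are not returned; blank lines in the middle are kept. Unlike splitting on a space, this also keeps lines that contain spaces intact.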

