PerlMonks  

Unix and Windows CRLF vs LF

by SavannahLion (Pilgrim)
on May 13, 2009 at 05:50 UTC ( [id://763674] : perlquestion )

SavannahLion has asked for the wisdom of the Perl Monks concerning the following question:

My SearchFU needs help. :( I remember dealing with exactly this sort of thing many moons ago, but I can't remember how I eventually resolved this. My gut tells me that I changed how \n was defined and how Perl eats files. I'm chewing through files line by line using

while (<FILE>) {
    #Random code
}
Works fine on my sample set, until I realize the working data actually consists of text files from a variety of flavors of Windows and Unix. Oops! Now I can't eat the files line by line. I could slurp the files whole, do a bit of substitution for \r & \n and deal with it that way. But that's an ugly solution and I can't gamble that the files are consistently small enough to gobble up like that. Can someone point me in the right direction? I've been mucking with the Camel book for a while and my eyes are starting to water.

Edit: Fixed incorrect code example

----
Thanks for your patience.

Replies are listed 'Best First'.
Re: Unix and Windows CRLF vs LF
by cdarke (Prior) on May 13, 2009 at 11:28 UTC
    Presumably you have no way of knowing if the input records are terminated with \r\n or just \n without reading the records. Therefore I wouldn't alter $/ but instead:
    while (<FILE>) {
        s/\r?\n$//;    # If there is no \r, so what?
        # Random code
    }
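A minimal sketch of the substitution above, run against mixed input. The sample data and the in-memory filehandle are assumptions made only to keep the example self-contained; a real script would read from the actual files.

```perl
use strict;
use warnings;

# Hypothetical sample data: one LF line, one CRLF line, one unterminated line.
my $data = "unix line\nwindows line\r\nlast line";

# An in-memory filehandle stands in for the real file.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @lines;
while ( my $line = <$fh> ) {
    $line =~ s/\r?\n$//;    # strips CRLF or LF; a line with neither is left alone
    push @lines, $line;
}
close $fh;

print "[$_]\n" for @lines;
```

$/ is never touched here, so records are still split on \n and the optional \r is simply cleaned off afterwards.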
Re: Unix and Windows CRLF vs LF
by Burak (Chaplain) on May 13, 2009 at 06:02 UTC
    Using foreach on the FH is not efficient. Use while instead. And perl will automatically convert the line ending AFAIK. Try to use chomp:
    while ( my $line = <FILE> ) {
        chomp $line; # remove new line
        #Random code
    }
      to add more value to Burak's post
      chomp $line; # remove new line

      from chomp
      This safer version of "chop" removes any trailing string that corresponds to the current value of $/ (also known as $INPUT_RECORD_SEPARATOR in the English module).
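A small sketch of what that documentation implies for CRLF data: with the default $/ of "\n", chomp removes only the LF and leaves the CR behind. The sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Default $/ is "\n", so only the trailing LF is removed.
my $crlf_line = "windows line\r\n";
chomp $crlf_line;
my $left_over_cr = ( $crlf_line =~ /\r\z/ ) ? 1 : 0;    # the CR is still there

# With $/ set to CRLF, chomp removes both characters...
my $both_removed = "windows line\r\n";
my $unix_line    = "unix line\n";
my $removed_from_unix;
{
    local $/ = "\r\n";
    chomp $both_removed;
    # ...but then a plain LF line no longer matches $/ and is left untouched.
    $removed_from_unix = chomp $unix_line;    # chomp returns the number of characters removed
}

print "CR left over: $left_over_cr\n";
print "Characters removed from LF-only line: $removed_from_unix\n";
```

This is the case where the raw bytes reach the program unchanged, for example a CRLF file read on a Unix system.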


      Vivek
      -- In accordance with the prarabdha of each, the One whose function it is to ordain makes each to act. What will not happen will never happen, whatever effort one may put forth. And what will happen will not fail to happen, however much one may seek to prevent it. This is certain. The part of wisdom therefore is to stay quiet.
      Ooops!! I meant to write While, rather than for each. I don't think for each in my code even works properly as it is written. I will update and make note.

      On a side note, why is For Each less efficient than While? Aren't they the same in this context?

      ----
      Thanks for your patience.

        foreach will make a list out of the file's contents and then iterate over it, so it reads the whole file into memory. while has a small footprint since it goes through the contents line by line, and it terminates as soon as the condition evaluates to false; with foreach you go through everything in the list.
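One way to observe the difference described above, as a rough sketch: check eof on the handle during the first pass of each loop. The three-line sample and the in-memory filehandle are assumptions for illustration.

```perl
use strict;
use warnings;

my $data = "line 1\nline 2\nline 3\n";

# foreach evaluates <$fh> in list context: all lines are read before
# the loop body runs for the first time.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $eof_in_foreach;
foreach my $line (<$fh>) {
    $eof_in_foreach = eof($fh) ? 1 : 0;    # handle is already exhausted
    last;
}
close $fh;

# while evaluates <$fh> in scalar context: one line per iteration.
open $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $eof_in_while;
while ( my $line = <$fh> ) {
    $eof_in_while = eof($fh) ? 1 : 0;      # two lines are still unread
    last;
}
close $fh;

print "foreach at EOF on first pass: $eof_in_foreach\n";
print "while at EOF on first pass:   $eof_in_while\n";
```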
Re: Unix and Windows CRLF vs LF
by planetscape (Chancellor) on May 13, 2009 at 15:56 UTC
Re: Unix and Windows CRLF vs LF
by ikegami (Patriarch) on May 13, 2009 at 16:17 UTC
    On a Windows system (without binmode),
    while (<$fh>) {
        # If the file contained lines ending with CRLF, $_ ends with LF
        # If the file contained lines ending with LF, $_ ends with LF
        chomp;  # Removes LF
    }
    On other systems,
    while (<$fh>) {
        # If the file contained lines ending with CRLF, $_ ends with CRLF
        # If the file contained lines ending with LF, $_ ends with LF
        s/\r?\n\z//;  # Removes CRLF or LF
    }

    And since the latter works on Windows as well, you can just use it everywhere.

    The only systems where this doesn't work are old Macs.

Re: Unix and Windows CRLF vs LF
by bobf (Monsignor) on May 15, 2009 at 03:25 UTC

    I asked a similar question in Newlines: reading files that were created on other platforms. In summary, I started with these options:

    1. Use $^O, but if I understand it correctly that will just tell me about the system the program is running on, which (as exemplified here) is not necessarily the same as the system that created the file.
    2. Use a regex to match the newline character(s) in the file. I think this would require slurping the whole file and then doing something like if( $file =~ m/\015$/ ) (which assumes the file will end with a newline) or if( $file =~ m/\015(?!\012)/ ) (which doesn't), setting $/ according to what matched, and re-reading the file line-by-line.
    3. Preprocess the input file to convert all newline characters to the current system's newline character. I experimented a little, and I think this will work:
      $file =~ s[(\015)?\012(?!\015)][\n]g;
      $file =~ s[(\012)?\015(?!\012)][\n]g;
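Option 2 does not necessarily require slurping. A rough sketch, assuming the handle is seekable and that the first few kilobytes are representative of the whole file: peek at the start, guess the terminator, rewind, and set $/ accordingly. The helper name detect_terminator and the 4096-byte peek size are hypothetical choices, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the record terminator from the start of the file.
# Caveat: a CRLF pair split exactly at the peek boundary would be misread as CR.
sub detect_terminator {
    my ($fh) = @_;
    my $buf = '';
    read $fh, $buf, 4096;          # peek at the first chunk
    seek $fh, 0, 0;                # rewind so the caller reads from the top
    return "\015\012" if $buf =~ /\015\012/;   # Windows
    return "\015"     if $buf =~ /\015/;       # old Mac
    return "\012";                             # Unix (also the fallback)
}

# Sample data (an assumption for illustration) read via an in-memory filehandle.
my $data = "alpha\r\nbeta\r\ngamma\r\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
binmode $fh;                       # make sure no layer translates line endings

my @lines;
{
    local $/ = detect_terminator($fh);
    while ( my $line = <$fh> ) {
        chomp $line;               # removes whatever $/ currently is
        push @lines, $line;
    }
}
close $fh;

print "[$_]\n" for @lines;
```

This only helps when each file uses a single convention throughout; a file with mixed terminators still needs the substitution approach.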

    I ended up implementing the preprocessing solution, but I would probably use binmode if I were to do it today.
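For reference, a sketch of the preprocessing idea using a single alternation instead of the two substitutions quoted above. This is a swapped-in, simpler pattern, not the code from the original post, and normalize_newlines is a hypothetical name. It assumes the text was read without any line-ending translation and fits in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: convert CRLF, lone CR, and lone LF to "\n".
# CRLF is listed first so the pair is replaced as one unit.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\015\012|\015|\012/\n/g;
    return $text;
}

my $mixed      = "windows\r\nold mac\runix\n";
my $normalized = normalize_newlines($mixed);
my @lines      = split /\n/, $normalized;

print "[$_]\n" for @lines;
```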

Re: Unix and Windows CRLF vs LF
by SavannahLion (Pilgrim) on Jun 05, 2009 at 06:40 UTC
    Thank you for your help. Ultimately, I found a hint at http://perldoc.perl.org/perlfaq6.html describing a method on file streaming. The specific code example I ended up studying (took me about three hours of reading to figure out what (?s) did. Another hour to figure out why $_ needed to be localized.) is:
    local $_ = "";
    while( sysread FH, $_, 8192, length ) {
        while( s/^((?s).*?)your_pattern/ ) {
            my $record = $1;
            # do stuff here.
        }
    }
    Yes, I know about the bug in this example. It's exactly how it appears in the perldocs. It took me about half an hour of debugging to figure it out. Just about drove me nuts.
    In any case, many thanks for the pointers, tips and hints. I needed some starting points and that is exactly what I got.
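A hedged sketch of how that loop might look once it runs, assuming the bug referred to is the missing replacement part of the s/// (the snippet as quoted has only two delimiters). Here your_pattern is replaced with \015?\012 to split on CRLF or LF; the helper name read_records, the temporary file, and the tiny chunk size are all assumptions for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read fixed-size chunks and peel off complete records,
# so the whole file never has to be in memory at once.
sub read_records {
    my ($fh, $chunk_size) = @_;
    my @records;
    local $_ = "";    # localized so the caller's $_ is not clobbered
    while ( sysread $fh, $_, $chunk_size, length ) {
        # The empty replacement (//) removes each matched record from the
        # buffer, which is what lets the inner loop make progress.
        while ( s/^((?s).*?)\015?\012// ) {
            push @records, $1;
        }
    }
    push @records, $_ if length;    # trailing data without a terminator
    return @records;
}

# Write a small mixed-terminator file to exercise the helper.
my ($out, $path) = tempfile( UNLINK => 1 );
binmode $out;
print {$out} "one\r\ntwo\nthree\r\nfour";
close $out or die "Cannot close $path: $!";

open my $in, '<:raw', $path or die "Cannot open $path: $!";
my @records = read_records($in, 4);    # deliberately small chunks
close $in;

print "[$_]\n" for @records;
```

Because a record is only removed once its LF has arrived, a CRLF pair that straddles two chunks is still handled as one terminator.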

      Another hour to figure out why $_ needed to be localized

      Because you change its value. Clobbering your caller's variables isn't nice.

      Yes, I know the bug in this example. It's exactly how it is on the perldocs

      perlbug