#!/usr/bin/perl

use strict;
use warnings;

die "Usage: $0 filename [filename ...]\n" unless @ARGV and -f $ARGV[0];

for my $file ( @ARGV ) {
    my ( $cr, $lf, $crlf ) = ( 0 ) x 3;

    # Three-argument open with a lexical handle, so that odd file names
    # (e.g. ones starting with ">" or ending with "|") are opened literally.
    my $in;
    unless ( open $in, '<', $file ) {
        warn "can't open $file: $!\n";
        next;
    }
    binmode $in;    # count raw bytes, with no CRLF translation

    # Position 0 of the buffer always holds the last byte of the previous
    # chunk, so a CRLF pair split across two reads is still seen as a pair.
    $_ = " ";
    while ( read $in, $_, 65536, 1 ) {
        $lf   += tr/\x0a/\x0a/;
        $cr   += tr/\x0d/\x0d/;
        $crlf += s/\x0d\x0a/xx/g;
        $_ = chop;
        $cr-- if ( $_ eq "\x0d" );    # a final CR or LF will get counted
        $lf-- if ( $_ eq "\x0a" );    # again on the next iteration
    }
    $cr++ if ( $_ eq "\x0d" );    # restore the count for the file's last byte
    $lf++ if ( $_ eq "\x0a" );
    close $in;

    print "$file: $cr CR, $lf LF, $crlf CRLF\n";
}

=head1 NAME

chk-crlf

=head1 SYNOPSIS

chk-crlf filename [filename ...]

=head1 DESCRIPTION

This program reads through one or more files named on the command
line, and for each one, it prints to STDOUT a one-line report showing
the total quantities of carriage-return (CR) and line-feed (LF) bytes,
along with the number of byte pairs that are CRLF sequences, like this:

    unix-file1.txt: 0 CR, 80 LF, 0 CRLF
    dos-file1.txt: 80 CR, 80 LF, 80 CRLF
    binary-file.gz: 31 CR, 28 LF, 2 CRLF

This is handy for checking whether a file's line termination is what
you expect it to be.

=head2 Valid outcomes

If the three quantities are all equal, you have a valid MS-DOS "text
mode" (or internet format) file: every CR and LF in the data is part
of a CRLF pair.

If the number of CR bytes (and CRLF pairs) is zero, you have a valid
"unix style" text file.

If there are slightly different quantities of CR and LF, and very few
CRLF pairs, you are probably looking at non-text data (e.g. audio,
image, or some form of compressed data). This in itself is not a
problem, if the file is supposed to have non-text content.

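As a rough illustration, the two exact rules above can be turned into
a small check on the three counts from one report. The sub name
classify_valid and the verdict strings are made up for this sketch and
are not part of chk-crlf. The "probably non-text" rule is a judgment
call rather than an exact test, so the sketch simply reports "other"
for anything that matches neither exact rule.

    use strict;
    use warnings;

    # Hypothetical helper, not part of chk-crlf: name the two exact
    # "valid" patterns and leave everything else to human judgment.
    sub classify_valid {
        my ( $cr, $lf, $crlf ) = @_;

        # No CR bytes at all (so no CRLF pairs either).  A file with
        # no CR and no LF bytes also lands here.
        return "unix text" if $cr == 0;

        # Every CR and every LF is part of a CRLF pair.
        return "dos text" if $cr == $lf and $lf == $crlf;

        return "other";
    }

    print classify_valid(  0, 80,  0 ), "\n";    # unix-file1.txt
    print classify_valid( 80, 80, 80 ), "\n";    # dos-file1.txt
    print classify_valid( 31, 28,  2 ), "\n";    # binary-file.gz

Fed the three sample reports from the DESCRIPTION, this prints
"unix text", "dos text" and "other", in that order.
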
=head2 Not-so-valid outcomes

If there are more LFs than CRs, but all the CRs are involved in CRLF
pairs (CR < LF, CR == CRLF), you probably have a "hybrid" text file: a
unix system created some of the lines, and incorporated lines from
some MS-DOS-like source without normalizing the line termination.
This might not be a problem, but you may want to make the line
termination consistent to avoid problems for some kinds of processing.

If there are more CRs than LFs (e.g. roughly twice as many), but all
the LFs are involved in CRLF pairs (CR > LF, LF == CRLF), you might be
looking at a file that is supposed to have non-text content, but has
gone through a 'unix2dos' text-mode conversion, whereby all LF bytes
(or all that were not originally preceded by CR) have been replaced by
CRLF byte pairs. (Or you might be looking at a DOS-like text file
that happens to have extra CR characters embedded in some of the
lines.)

Usually, any sort of non-text file that has been through a unix2dos
text-mode conversion is hopelessly corrupted and unusable: there may
be no way of undoing the alteration, because it may be impossible to
know which LF characters were preceded by CR in the original
(uncorrupted) version of the data, as opposed to having a CR inserted
by the conversion. If you can, try to find a prior version of the
file that has not been affected by the conversion.

=cut
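
=head2 Making a hybrid file consistent

For the hybrid case described above (CR < LF, CR == CRLF), one
possible way to make the line termination consistent is to turn every
CRLF pair into a bare LF. The sketch below is only an illustration on
an in-memory string: the sub name crlf_to_lf and the sample data are
invented here and are not part of chk-crlf. It should only be applied
to data known to be text; as noted above, this kind of rewrite can
ruin non-text content.

    use strict;
    use warnings;

    # Hypothetical helper, not part of chk-crlf: replace each CRLF
    # pair with a single LF.  Lone CR bytes are left untouched.
    sub crlf_to_lf {
        my ( $data ) = @_;
        $data =~ s/\x0d\x0a/\x0a/g;
        return $data;
    }

    # Made-up sample: one unix-style line and one DOS-style line.
    my $hybrid = "unix line\x0a" . "dos line\x0d\x0a";
    my $fixed  = crlf_to_lf( $hybrid );

    # tr with an empty replacement list counts without modifying.
    my $cr = ( $fixed =~ tr/\x0d// );
    my $lf = ( $fixed =~ tr/\x0a// );
    print "$cr CR, $lf LF\n";

For the sample string this prints "0 CR, 2 LF", which is the "unix
style" pattern. When working on a real file, read and write it with
binmode set, as the script itself does, so that no other layer
translates the line endings.

=cut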