http://qs321.pair.com?node_id=1118339


in reply to How to determine type of line endings in a text file from within a script

Do you have some sort of criterion for deciding whether or not a given file is a "text file"?

A general solution that I've used is to scan each file for all occurrences of "\x0a" and "\x0d" (LF and CR), and report their statistics in a useful way. I happen to have written a script some years ago to do just this, so I'll post it here:

#!/usr/bin/perl
use strict;
use warnings;

die "Usage: $0 filename [filename ...]\n" unless @ARGV and -f $ARGV[0];

for my $file ( @ARGV ) {
    my ( $cr, $lf, $crlf ) = ( 0 ) x 3;
    unless ( open I, $file ) {
        warn "can't open $file: $!\n";
        next;
    }
    binmode I;  ## (added as an update)
    $_ = " ";
    while ( read I, $_, 65536, 1 ) {
        $lf += tr/\x0a/\x0a/;
        $cr += tr/\x0d/\x0d/;
        $crlf += s/\x0d\x0a/xx/g ;
        $_ = chop;
        $cr-- if ( $_ eq "\x0d" );  # a final CR or LF will get counted
        $lf-- if ( $_ eq "\x0a" );  # again on the next iteration
    }
    $cr++ if ( $_ eq "\x0d" );
    $lf++ if ( $_ eq "\x0a" );
    print "$file: $cr CR, $lf LF, $crlf CRLF\n";
}

=head1 NAME

chk-crlf

=head1 SYNOPSIS

chk-crlf filename [filename ...]

=head1 DESCRIPTION

This program will read through one or more files named on the command
line, and for each one, it prints to STDOUT a one-line report showing
the total quantities of carriage-return (CR) and line-feed (LF) bytes,
along with the number of byte pairs that are CRLF sequences, like this:

  unix-file1.txt: 0 CR, 80 LF, 0 CRLF
  dos-file1.txt: 80 CR, 80 LF, 80 CRLF
  binary-file.gz: 31 CR, 28 LF, 2 CRLF

This is handy for confirming any expectations you may have about the
nature of the file's content regarding line-termination characters.

=head2 Valid outcomes

If the three quantities are all equal, you have a valid MS-DOS "text
mode" (or internet format) file: every CR and LF in the data is part
of a CRLF pair.

If the number of CR bytes (and CRLF pairs) is zero, you have a valid
"unix style" text file.

If there are slightly different quantities of CR and LF, and very few
CRLF pairs, you are probably looking at non-text data (e.g. audio,
image, or some form of compressed data). This in itself is not a
problem, if the file is supposed to have non-text content.

=head2 Not-so-valid outcomes

If there are more LF's than CR's, but all the CR's are involved in
CRLF pairs (CR < LF, CR == CRLF), you probably have a "hybrid" text
file: a unix system created some of the lines, and incorporated lines
from some MS-DOS-like source without normalizing the line termination.
This might not be a problem, but you may want to make the line
termination consistent to avoid problems for some kinds of processing.

If there are more CR's than LF's (e.g. roughly twice as many), but all
the LF's are involved in CRLF pairs (CR > LF, LF == CRLF), you might
be looking at a file that is supposed to have non-text content, but
has gone through a 'unix2dos' text-mode conversion, whereby all LF
bytes (or all that were not originally preceded by CR) have been
replaced by CRLF byte pairs. (Or you might be looking at a DOS-like
text file that happens to have extra CR characters embedded in some of
the lines.)

Usually, any sort of non-text file that has been through a unix2dos
text mode conversion is hopelessly corrupted and unusable -- there may
be no way of undoing the alteration, because it may be impossible to
know which LF characters were preceded by CR in the original
(uncorrupted) version of the data (as opposed to having a CR inserted
by the conversion). If you can, try to find a prior version of the
file that has not been affected by the conversion.

=cut

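If you'd rather have a script decide than read the numbers yourself, the rules in the pod can be turned into a small lookup. This is only a rough sketch: the sub name classify_counts and the label strings are invented here for illustration and are not part of chk-crlf, and anything the pod doesn't describe simply falls through to "unclassified".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of chk-crlf): map the three counts
# printed by the script to a rough label, following the pod's rules.
sub classify_counts {
    my ( $cr, $lf, $crlf ) = @_;

    return 'no line terminators' if $cr == 0 and $lf == 0;

    # Valid outcomes
    return 'unix' if $cr == 0;                        # LF only
    return 'dos'  if $cr == $lf and $lf == $crlf;     # every CR and LF is in a CRLF pair

    # Not-so-valid outcomes
    return 'hybrid'   if $cr < $lf and $cr == $crlf;  # bare LF lines mixed with CRLF lines
    return 'extra CR' if $cr > $lf and $lf > 0 and $lf == $crlf;

    # Everything else, including likely non-text data
    return 'unclassified';
}

# The three sample reports from the pod:
print classify_counts(  0, 80,  0 ), "\n";   # unix
print classify_counts( 80, 80, 80 ), "\n";   # dos
print classify_counts( 31, 28,  2 ), "\n";   # unclassified
```

The order of the tests matters: the "dos" check has to come before the "hybrid" and "extra CR" checks, which only apply when the CR and LF counts differ.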
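UPDATE: I added binmode I; -- the original script had been written for use on unix/linux, but I presume the binmode call would be needed if running under MS Windows (which I don't use), because Perl's default text-mode reading there translates CRLF to LF, which would hide the CR bytes from the counts.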

ANOTHER UPDATE: The pod above doesn't mention this (and maybe there's a diminishing need to mention it), but there's one other distinctive outcome that could show up: CR=LF, but CRLF=0. This is what you'd get from a "text" file that contains regular CRLF line terminations, but is encoded as UTF-16 (whether big- or little-endian): each character takes two bytes, so a null byte always sits between the 0x0d and the 0x0a and the two are never adjacent. There's also still some chance of seeing CR>0 and LF=0 (old-style Macintosh line terminations).
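
To see the UTF-16 effect without hunting for a sample file, here is a small sketch that encodes the same CRLF-terminated text three ways and counts the bytes in memory. The sub name count_eol and the sample string are made up for this illustration; the counting is done on a whole string at once rather than in 64K chunks as in the script above, so it is only meant for small test data.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: count CR bytes, LF bytes and adjacent CRLF
# byte pairs in a byte string that is already in memory.
sub count_eol {
    my ($bytes) = @_;
    my $cr   = ( $bytes =~ tr/\x0d// );          # tr with no replacement just counts
    my $lf   = ( $bytes =~ tr/\x0a// );
    my $crlf = () = $bytes =~ /\x0d\x0a/g;       # number of matches
    return ( $cr, $lf, $crlf );
}

# Made-up sample text with DOS-style line endings.
my $text = "first line\r\nsecond line\r\n";

for my $enc ( 'UTF-8', 'UTF-16LE', 'UTF-16BE' ) {
    my $copy = $text;                            # encode a copy, leave $text alone
    my ( $cr, $lf, $crlf ) = count_eol( encode( $enc, $copy ) );
    print "$enc: $cr CR, $lf LF, $crlf CRLF\n";
}
# UTF-8: 2 CR, 2 LF, 2 CRLF
# UTF-16LE: 2 CR, 2 LF, 0 CRLF
# UTF-16BE: 2 CR, 2 LF, 0 CRLF
```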