ashok has asked for the wisdom of the Perl Monks concerning the following question:
Hi,
I have three different files: 'File1.c' is written in C, 'File2.cpp' is written in C++, and 'File3.dat' is a data file (say, ASCII). On Unix, when I try to test the file type without opening the file, using the command
file File1.c
The answer is
File1.c: ascii text
It is the same for all three files.
Then I wrote a Perl script that checks whether any C-style comments (/* */) or C++-style comments (//) exist in the file.
Is there a better way (any Perl function) to check, without opening the file, whether it is C, C++, or a plain text file?
Thanks
Ashok
Re: file type
by chipmunk (Parson) on Feb 09, 2001 at 03:45 UTC
|
Update: For monks who are unfamiliar with the Unix file command, it is a utility that examines the contents of a file and reports the file type. It works on both binary and text files, using magic numbers and/or string tables to make its determination. Here's an example:
% file /usr/bin/perl
/usr/bin/perl: ELF 32-bit LSB executable, Intel 80386, version 1, dynamically linked (uses shared libs), stripped
I tried running file on a few of my C source files, and got back 'C program text'. So file can figure out when an ASCII file consists of C code.
Looking at the manpage for file (on RedHat), I see:
If an argument appears to be an ASCII file, file attempts
to guess its language. The language tests look for
particular strings (cf names.h) that can appear anywhere in
the first few blocks of a file. For example, the keyword
.br indicates that the file is most likely a troff(1)
input file, just as the keyword struct indicates a C
program. These tests are less reliable than the previous two
groups, so they are performed last. The language test
routines also test for some miscellany (such as tar(1)
archives) and determine whether an unknown file should be
labelled as `ascii text' or `data'.
Two observations: file guesses at the contents of an ASCII file by examining only the first few blocks of the file, and the files you tested this on happen not to contain the strings that file is looking for.
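For anyone who wants to use file from inside a Perl script, here is a minimal sketch. It assumes a file utility is on the PATH and prints one "name: description" line per argument; the helper names parse_file_output and file_type are made up for illustration and do not come from any module. A file name that itself contains ": " would confuse the parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one line of file(1) output ("name: description") into its parts.
# Hypothetical helper; returns an empty list if the line has no separator.
sub parse_file_output {
    my ($line) = @_;
    chomp $line;
    my ($name, $desc) = $line =~ /^(.*?):\s+(.*)$/ or return;
    return ($name, $desc);
}

# Run the external file utility on one path and return its description.
# The list form of open avoids passing the path through a shell.
sub file_type {
    my ($path) = @_;
    open(my $fh, '-|', 'file', $path) or return;
    my $line = <$fh>;
    close $fh;
    return unless defined $line;
    my (undef, $desc) = parse_file_output($line);
    return $desc;
}

for my $path (@ARGV) {
    my $desc = file_type($path);
    print "$path: ", (defined $desc ? $desc : 'unknown'), "\n";
}
```

The result is only as good as file's own guess, so a source file with none of the strings it looks for will still come back as plain ASCII text.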
I'm not sure what the best solution is. The file extensions will certainly be useful, as others have suggested.
|
Yes,
I came across new things. As you suggested, I was unable to find any better way. File::MMagic is not available on my Unix system, so I am opening each file, searching for either C or C++ style comments, and deciding from those which language it belongs to. I am unable to go by file extension, since I have come across various extensions such as .c, .C, .cpp, .CPP, .cxx, .CXX, .pc, .PC, etc.
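The comment-scanning approach described here could look roughly like the sketch below. It is a heuristic under stated assumptions, not a reliable classifier: C99 and later also allow // comments, comment markers inside string literals are not excluded, and only the first 64 KB is read (an arbitrary limit). The function names guess_from_comments and guess_file are hypothetical.

```perl
use strict;
use warnings;

# Guess a language from the comment styles found in a chunk of text.
# "//" directly after a colon is ignored so that URLs such as
# http://example.com do not count as C++ comments.
sub guess_from_comments {
    my ($text) = @_;
    return 'C++'  if $text =~ m{(?<!:)//};
    return 'C'    if $text =~ m{/\*.*?\*/}s;
    return 'text';
}

# Read the start of a file and classify it by its comments.
sub guess_file {
    my ($path, $max_bytes) = @_;
    $max_bytes = 64 * 1024 unless defined $max_bytes;   # assumed limit
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my $text = '';
    read($fh, $text, $max_bytes);
    close $fh;
    return guess_from_comments($text);
}

print "$_: ", guess_file($_), "\n" for @ARGV;
```

Because // is tested first, a C++ file that also contains /* */ comments is still reported as C++.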
Thanks
Ashok
|
As best as I can tell, Perl's built-in grep function filters a list; it does not look inside an unopened file for you.
But if that is what you want to do, why not use the external grep utility and the backtick operator? It is even possible that grep stops scanning a file early once it knows the answer (with -l, for example, the first match is enough).
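A minimal sketch of that idea, assuming a grep that accepts the -l and -e options is on the PATH. It uses the list form of a pipe open in place of backticks, so file names are never interpreted by a shell; files_matching is a hypothetical name.

```perl
use strict;
use warnings;

# Return the subset of @files that contain a match for $pattern,
# using the external grep utility. -l prints only the names of
# matching files rather than the matching lines.
sub files_matching {
    my ($pattern, @files) = @_;
    return () unless @files;   # with no files, grep would read stdin
    open(my $fh, '-|', 'grep', '-l', '-e', $pattern, '--', @files)
        or die "Cannot run grep: $!";
    chomp(my @hits = <$fh>);
    close $fh;   # grep exits non-zero when nothing matches; ignored here
    return @hits;
}

# List the files named on the command line that contain "//".
print "$_\n" for files_matching('//', @ARGV);
```

The file is still opened and read, of course; it is just grep doing the reading instead of the Perl script.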
Even if grep doesn't exist on a Windows system you can always use the GNU port of grep which is available here.
Re: file type
by arturo (Vicar) on Feb 09, 2001 at 03:21 UTC
|
The problem is that they *ARE* all ASCII text files. The filesystem doesn't know the difference between C code, C++, Perl, Pascal, Fortran, or Java source for that matter.
Why not take the 'extensions' (what follows the last .) as good (at least preliminary) guides to the content types? You'd probably catch most of them that way. The problem is that recognizing which programming language a file is source code for is a tremendously difficult task; you'll probably have to rely on a human to figure it out reliably.
Philosophy can be made out of anything. Or less -- Jerry A. Fodor
Re: file type
by eg (Friar) on Feb 09, 2001 at 03:22 UTC
|
You can use File::MMagic, but be forewarned, I didn't find it very fast.
Update: ah, good point, arturo. I hadn't fully understood the substance of the question. Categorizing by filename extension seems like the best solution.
|
That's an interesting approach, but I'd guess it's designed to figure out MIME types. I dunno if different kinds of source code are going to have distinct MIME types (but maybe the module could be hacked so as to recognize them ...). I also note that it seems to work its magic based on the extension, primarily, which is what I suggested above.
Philosophy can be made out of anything. Or less -- Jerry A. Fodor
Re: file type
by ichimunki (Priest) on Feb 09, 2001 at 03:44 UTC
|
my %extensions = ( 'c'   => 'C Source',
                   'cpp' => 'C++ Source',
                   'pl'  => 'Perl Script',
                   'dat' => 'Data (Text?) File',
                   'csv' => 'Comma Separated Values',
                   'txt' => 'Text'
);
is what I would use. If you can't trust people to use the extensions appropriately, there is no other Unix way to differentiate (that I know of). Files are either binary or ASCII (Update: AgentM has a good point below, and I don't want to belabor the issue). End of story. If you are intrepid, I suppose you could check the C source for things like #include <stdio.h> just for kicks, but I wouldn't rely on that any more than I would rely on #!/usr/bin/perl to indicate a Perl script.
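A lookup against such a table might be sketched as follows. The function name type_from_name and the 'Unknown' fallback are assumptions for illustration. The extension is lowercased before the lookup, which is a simplification: on Unix, .C (capital) is often used for C++, and that distinction is lost here.

```perl
use strict;
use warnings;

my %extensions = (
    'c'   => 'C Source',
    'cpp' => 'C++ Source',
    'pl'  => 'Perl Script',
    'dat' => 'Data (Text?) File',
    'csv' => 'Comma Separated Values',
    'txt' => 'Text',
);

# Map a file name to a description using whatever follows the last dot.
# A dot inside a directory name (dir.d/notes) is not treated as an extension.
sub type_from_name {
    my ($filename) = @_;
    my ($ext) = $filename =~ /\.([^.\/]+)$/ or return 'Unknown';
    my $key = lc $ext;
    return exists($extensions{$key}) ? $extensions{$key} : 'Unknown';
}

print "$_: ", type_from_name($_), "\n" for @ARGV;
```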
|
Actually, that's not entirely correct. Under UNIX, a file is a file is a file. There is no differentiation between binary and ASCII; that's why the standard UNIX streams have no need for binmode. A simple stream of bytes is what you get when you read from a stream. DOS-based systems are the ones that require binmode, and they differentiate between ASCII and binary even in how certain characters (line endings, for example) are represented. Pagers like to make a good guess and warn you if you try to page an actual binary file, but they do this by reading a bit of the file and checking for non-ASCII bytes. In fact, under UNIX, it is entirely impossible to tell from the filesystem what a stream-type file contains (devices and ttys are easy to check for).
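The kind of guess a pager makes can be sketched in a few lines of Perl. This is an illustration only: the 4096-byte sample size and the 30% threshold are assumptions, and the names looks_binary and file_looks_binary are hypothetical. Perl's own -T and -B file tests apply a similar heuristic to the first block of a file.

```perl
use strict;
use warnings;

# Return 1 if a sample of bytes looks binary: it contains a NUL byte, or
# more than 30% of its bytes fall outside printable ASCII and tab/LF/CR.
sub looks_binary {
    my ($sample) = @_;
    return 0 if $sample eq '';
    return 1 if $sample =~ /\0/;
    my $odd = () = $sample =~ /[^\x09\x0A\x0D\x20-\x7E]/g;
    return ($odd / length($sample) > 0.30) ? 1 : 0;
}

# Apply the check to the first 4096 bytes of a file.
sub file_looks_binary {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $sample = '';
    read($fh, $sample, 4096);
    close $fh;
    return looks_binary($sample);
}

print "$_: ", (file_looks_binary($_) ? 'binary' : 'text'), "\n" for @ARGV;
```

Note that this still reads the file; nothing in the filesystem records the answer.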
AgentM Systems nor Nasca Enterprises nor Bone::Easy nor Macperl is responsible for the comments made by AgentM. Remember, you can build any logical system with NOR.
|
And another point, which I came across while reading Learning Perl, is that Perl itself does not differentiate between binary and text files. I can read a binary file into a string and manipulate the string value, truncate it, etc. just as I can read a text file into a string.
The authors say that this feature of Perl is a result of Perl using a full byte (256 possible values of the 8 bits) to store each character of a string. ASCII characters only take up 7 bits apiece, so the ASCII character set cannot easily represent a binary file. It is hard to say why Larry chose to represent characters as 8 bits apiece, thus allowing strings to contain a binary file.
I suspect it had more to do with a need to represent Asian languages than it did with any desire to store binary files as strings.
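The point about strings holding arbitrary bytes is easy to try out. The short sketch below builds a string containing every byte value from 0 to 255 and treats it like any other string; the variable names are just for illustration.

```perl
use strict;
use warnings;

# Build a string holding every possible byte value, 0 through 255.
my $bytes = join '', map { chr($_) } 0 .. 255;

# Ordinary string operations work on it, NUL and high-bit bytes included.
my $head = substr($bytes, 0, 4);
printf "length: %d\n", length $bytes;
printf "first four bytes: %s\n",
    join ' ', map { sprintf '%02x', ord($_) } split //, $head;
```

This prints a length of 256 and the first four bytes as 00 01 02 03.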