PerlMonks  

file type

by ashok (Sexton)
on Feb 09, 2001 at 03:14 UTC ( [id://57286]=perlquestion )

ashok has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have three different files: 'File1.c' is written in C, 'File2.cpp' is written in C++, and 'File3.dat' is a data file (say, ASCII). On Unix, when I try to test the file type without opening the file, using the command
file File1.c
the answer is
File1.c: ascii text
It is the same for all three files. So I wrote a Perl script that checks whether any C-style comments (/* */) or C++ comments (//) exist. Is there a better way (any Perl function) to check whether a file is C, C++, or a plain text file without opening it?
Thanks, Ashok

Replies are listed 'Best First'.
Re: file type
by chipmunk (Parson) on Feb 09, 2001 at 03:45 UTC
    Update: For monks who are unfamiliar with the Unix file command, it is a utility that examines the contents of a file and reports the file type. It works on both binary and text files, using magic numbers and/or string tables to make its determination. Here's an example:
    % file /usr/bin/perl
    /usr/bin/perl: ELF 32-bit LSB executable, Intel 80386, version 1, dynamically linked (uses shared libs), stripped

    I tried running file on a few of my C source files, and got back 'C program text'. So file can figure out when an ASCII file consists of C code.

    Looking at the manpage for file (on RedHat), I see:

    If an argument appears to be an ASCII file, file attempts to guess its language. The language tests look for particular strings (cf names.h) that can appear anywhere in the first few blocks of a file. For example, the keyword .br indicates that the file is most likely a troff(1) input file, just as the keyword struct indicates a C program. These tests are less reliable than the previous two groups, so they are performed last. The language test routines also test for some miscellany (such as tar(1) archives) and determine whether an unknown file should be labelled as `ascii text' or `data'.
    Two observations: file guesses at the contents of an ASCII file by examining only the first few blocks of the file, and the files you tested this on happen not to contain the strings that file is looking for.

    I'm not sure what the best solution is. The file extensions will certainly be useful, as others have suggested.
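One way to use file from a Perl script is to run it on each path and classify the description it prints. The following is only a sketch under some assumptions: the sub names `describe_file` and `classify_description` are made up for illustration, the `-b` (brief) option is assumed to be supported by the local file command, and the patterns are guesses at typical wording ('C program text' is what file printed in the post above; other versions word it differently).

```perl
use strict;
use warnings;

# Run file(1) on one path and return its description.
# List-form open avoids passing the path through a shell.
# Assumes a file command that accepts -b (omit the filename from the output).
sub describe_file {
    my ($path) = @_;
    open my $fh, '-|', 'file', '-b', '--', $path
        or die "Cannot run file(1): $!";
    my $desc = <$fh>;
    close $fh;
    $desc = '' unless defined $desc;
    chomp $desc;
    return $desc;
}

# Map a description string to a coarse category.
# The patterns are illustrative guesses, not an exhaustive list.
sub classify_description {
    my ($desc) = @_;
    return 'C++'  if $desc =~ /\bC\+\+ (?:program|source)\b/i;
    return 'C'    if $desc =~ /\bC (?:program|source)\b/i;
    return 'text' if $desc =~ /\btext\b/i;
    return 'unknown';
}
```

A call such as `classify_description(describe_file('File1.c'))` would still return 'text' for the files in the question, since file itself only reports 'ascii text' for them.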

      Yes, I learned some new things. As you suggested, I was unable to find any better way. File::MMagic is not available on my Unix system, so I am opening each file, searching for either C or C++ style comments, and deciding which language it belongs to. I am unable to go by file extension, since I have come across various extensions such as .c, .C, .cpp, .CPP, .cxx, .CXX, .pc, .PC, etc. Thanks, Ashok
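The comment-scanning approach described in this reply might look roughly like the sketch below. The sub names, the 8192-byte sample size, and the extra keyword checks are all assumptions added for illustration; the rules are easy to fool (for example, C99 also permits // comments), so treat the result as a guess.

```perl
use strict;
use warnings;

# Guess a language from a chunk of source text.
# The rules are illustrative assumptions, not a reliable classifier.
sub guess_from_text {
    my ($text) = @_;

    # A // comment at the start of a line, or a C++-only keyword.
    return 'C++' if $text =~ m{^\s*//}m
                 || $text =~ /\b(?:class|namespace|template)\b/;

    # A /* ... */ block comment, or an #include directive.
    return 'C'   if $text =~ m{/\*.*?\*/}s
                 || $text =~ /^\s*#\s*include\b/m;

    return 'text';
}

# Read only the first part of a file and guess from that sample.
sub guess_from_file {
    my ($path, $bytes) = @_;
    $bytes ||= 8192;    # assumed sample size
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $chunk = '';
    my $n = read $fh, $chunk, $bytes;
    die "Cannot read $path: $!" unless defined $n;
    close $fh;
    return guess_from_text($chunk);
}
```

Reading a fixed-size sample keeps the cost per file small, in the same spirit as file looking only at the first few blocks.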
        As best as I can tell, Perl's built-in grep function does not look inside an unopened file for you.

        But if that is what you want to do, why not use the external grep command via the backtick operator? It is even possible that grep may avoid scanning a whole file if it knows that a pattern has failed halfway through.

        Even if grep doesn't exist on a Windows system, you can always use the GNU port of grep.

Re: file type
by arturo (Vicar) on Feb 09, 2001 at 03:21 UTC

    The problem is that they *ARE* all ASCII text files. The filesystem doesn't know the difference between C code, C++, Perl, Pascal, Fortran, or Java source for that matter.

    Why not take the 'extensions' (what follows the last .) as good (at least preliminary) guides to the content types? You'd probably catch most of them that way. The problem is that recognizing which programming language a file is source code for is a tremendously difficult task; you'll probably have to rely on a human to figure it out reliably.

    Philosophy can be made out of anything. Or less -- Jerry A. Fodor

Re: file type
by eg (Friar) on Feb 09, 2001 at 03:22 UTC

    You can use File::MMagic, but be forewarned, I didn't find it very fast.

    Update: ah, good point Arturo. I hadn't fully comprehended the question's substance. Categorizing by filename extension seems like the best solution.

      That's an interesting approach, but I'd guess it's designed to figure out MIME types. I dunno if different kinds of source code are going to have distinct MIME types (but maybe the module could be hacked so as to recognize them ...). I also note that it seems to work its magic based on the extension, primarily, which is what I suggested above.

      Philosophy can be made out of anything. Or less -- Jerry A. Fodor

Re: file type
by ichimunki (Priest) on Feb 09, 2001 at 03:44 UTC
    File::Basename and a hash like
    my %extensions = (
        'c'   => 'C Source',
        'cpp' => 'C++ Source',
        'pl'  => 'Perl Script',
        'dat' => 'Data (Text?) File',
        'csv' => 'Comma Separated Values',
        'txt' => 'Text',
    );
    are what I would use. If you can't trust people to use the extensions appropriately, there is no other Unix way to differentiate (that I know of). Files are either binary or ascii (Update: AgentM has a good point below, and I don't want to belabor the issue). End of story. If you are intrepid, I suppose you could check the C source for things like #include <stdio.h> just for kicks, but I wouldn't rely on that any more than I would rely on #!/usr/bin/perl to indicate a Perl script.
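A minimal sketch of the File::Basename lookup suggested above. The sub name `type_from_extension`, the added 'cxx' entry, and the decision to ignore case are assumptions made for illustration; ignoring case is a simplification, because some compilers treat a capital .C as C++.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Extension table from the post above, plus 'cxx' as an illustrative addition.
my %extensions = (
    'c'   => 'C Source',
    'cpp' => 'C++ Source',
    'cxx' => 'C++ Source',
    'pl'  => 'Perl Script',
    'dat' => 'Data (Text?) File',
    'csv' => 'Comma Separated Values',
    'txt' => 'Text',
);

# Return a description based on whatever follows the last dot in the name.
sub type_from_extension {
    my ($path) = @_;
    # fileparse returns (basename, directory, suffix); the pattern matches
    # the final ".something" of the file name.
    my (undef, undef, $suffix) = fileparse($path, qr/\.[^.]*/);
    $suffix =~ s/^\.//;
    # Assumption: extensions are compared case-insensitively.
    return $extensions{ lc $suffix } // 'Unknown';
}

print type_from_extension('File2.cpp'), "\n";    # prints "C++ Source"
```

Names without a dot, or with an extension missing from the table, fall through to 'Unknown', which is where a content-based check could take over.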
      Actually, that's not entirely correct. Under UNIX, a file is a file is a file. There is no differentiation between binary and ASCII; that's why the standard UNIX streams have no need for binmode. A simple stream of bytes is what you get when you read from a stream. DOS-based systems are the ones that require binmode, and they differentiate between ASCII and binary even in the characters that are used in similar representations. Pagers like to make a good guess and warn you if you try to page an actual binary file, but they do this by reading a bit of the file and checking for non-ASCII bytes. In fact, under UNIX, it is entirely impossible to tell what a stream-type file is (devices and ttys are easy to check for).
      AgentM Systems nor Nasca Enterprises nor Bone::Easy nor Macperl is responsible for the comments made by AgentM. Remember, you can build any logical system with NOR.
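The pager-style guess described above (read a bit of the file and look for non-ASCII bytes) can be sketched as follows. The sub name `looks_binary` and the 30% threshold are assumptions chosen for illustration. Perl's own -T and -B file tests make a similar heuristic guess from the first block of a file.

```perl
use strict;
use warnings;

# Guess whether a chunk of bytes is binary data.
# Hypothetical helper; the 30% threshold is an assumed cut-off.
sub looks_binary {
    my ($chunk) = @_;
    return 0 if length($chunk) == 0;    # empty input: treat as text
    return 1 if $chunk =~ /\0/;         # a NUL byte is a strong binary hint

    # Count bytes outside printable ASCII and common whitespace.
    my $odd = () = $chunk =~ /[^\x20-\x7E\t\n\r\f]/g;
    return $odd / length($chunk) > 0.30 ? 1 : 0;
}
```

In a script, `-T $path` and `-B $path` give a comparable answer without any hand-written code, though they are documented as guesses too.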
        And another point, which I came across while reading Learning Perl, is that Perl itself does not differentiate between binary and text files. I can read a binary file into a string and manipulate the string value, truncate it, etc. just as I can read a text file into a string.

        The authors say that this feature of Perl is a result of Perl using a full byte (8 bits, giving 256 possible values) to store each character of a string. ASCII characters only take up 7 bits apiece, so the ASCII character set is incapable of easily representing a binary file. It is hard to say why Larry chose to represent characters as 8 bits apiece, thus allowing strings to contain a binary file.

        I suspect it had more to do with a need to represent Asian languages than it did with any desire to store binary files as strings.
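The point in this reply, that a Perl string can hold any byte values and be manipulated like text, can be checked with a short sketch. The sub names `all_byte_values` and `slurp_binary` are made up for illustration.

```perl
use strict;
use warnings;

# Build a string containing every possible byte value, NUL included.
sub all_byte_values {
    return pack 'C*', 0 .. 255;
}

# Read a whole file into one string, byte for byte.
sub slurp_binary {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;    # harmless on Unix, needed on DOS-like systems
    local $/;       # slurp mode: read everything in one go
    my $data = <$fh>;
    close $fh;
    return defined $data ? $data : '';
}

my $bytes = all_byte_values();
# Ordinary string functions work on binary data unchanged.
printf "length=%d head=%s\n", length $bytes, unpack('H*', substr $bytes, 0, 4);
# prints "length=256 head=00010203"
```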
