PerlMonks  

How to recognize Word and XLS files

by Anonymous Monk
on Mar 06, 2012 at 04:44 UTC ( [id://958016]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a list of files without any extensions, and I'd like to pick out the Word and XLS files based on their content. I've tried File::MMagic, which uses magic numbers, but it can't seem to distinguish Word from XLS files; both are reported as msword. File::LibMagic is a little complicated to compile on my machine. Is there an easy way to recognize these two types? I'd really appreciate it.

Replies are listed 'Best First'.
Re: How to recognize Word and XLS files
by CountZero (Bishop) on Mar 06, 2012 at 07:10 UTC
    Office Binary File Formats describes the file formats of Word, Excel and PowerPoint. You can always open the file to be tested in binary mode and check whether its structure matches one of these (or other MS) formats.

    The key is in the "File Information Block".

    For Word files, there is an identifier structure in the first 32 bytes of the File Information Block (the FibBase structure). If its first two bytes, read as a little-endian 16-bit integer, are 0xA5EC, then the file is a Word file.

    Now the only problem is to know where this File Information Block starts. It may be at the very beginning of the file, but I am not sure ...
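
    As far as I understand the format specification, the File Information Block sits at the start of the "WordDocument" stream inside the OLE container, not at the start of the file itself. Assuming you have already extracted that stream's bytes by some other means, a minimal sketch of the 0xA5EC check could look like this. The helper name has_word_fib_ident is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): takes the raw bytes of the
# WordDocument stream, assumed to have been extracted already, and
# reports whether they start with the Word FIB identifier.
sub has_word_fib_ident {
    my ($stream_bytes) = @_;

    # Need at least two bytes to hold the 16-bit identifier.
    return 0 if !defined $stream_bytes || length($stream_bytes) < 2;

    # 'v' unpacks an unsigned 16-bit little-endian integer.
    my $ident = unpack 'v', $stream_bytes;

    return $ident == 0xA5EC ? 1 : 0;
}

# In the stream the identifier appears as the byte pair EC A5.
print has_word_fib_ident("\xEC\xA5\xC1\x00") ? "word\n" : "not word\n";
print has_word_fib_ident("\x09\x08\x10\x00") ? "word\n" : "not word\n";
```

    Because the value is stored little-endian, comparing the raw bytes against "\xA5\xEC" would not match; unpacking with 'v' avoids that mistake.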

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: How to recognize Word and XLS files
by jmcnamara (Monsignor) on Mar 06, 2012 at 09:46 UTC

    Here is a small program that uses OLE::Storage_Lite to distinguish Microsoft doc and xls files.

    #!/usr/bin/perl

    use strict;
    use warnings;
    use OLE::Storage_Lite;

    my @files = (
        'test.xls',
        'test.doc',
        'test.ppt',
        'test.txt',
    );

    for my $filename ( @files ) {
        printf( "%-20s = %s\n", $filename, check_ole_filetype( $filename ) );
    }

    sub check_ole_filetype {

        my $filename = shift;

        # Check that the file exists.
        return 'not_found' if !-e $filename;

        # Create an OLE::Storage_Lite object to read the file.
        my $ole = OLE::Storage_Lite->new( $filename );
        my $pps = $ole->getPpsTree();

        # If getPpsTree() failed then this isn't an OLE file.
        return 'not_ole_file' if !$pps;

        # Loop through the PPS children below the root.
        for my $child_pps ( @{ $pps->{Child} } ) {

            my $pps_name = OLE::Storage_Lite::Ucs2Asc( $child_pps->{Name} );

            # Match an Excel xls file.
            if ( $pps_name eq 'Workbook' || $pps_name eq 'Book' ) {
                return 'xls';
            }

            # Match a Word document.
            if ( $pps_name eq 'WordDocument') {
                return 'doc';
            }
        }

        return 'unknown_ole_file';
    }

    __END__

    Output:

    $ perl ole_check.pl
    test.xls             = xls
    test.doc             = doc
    test.ppt             = unknown_ole_file
    test.txt             = not_ole_file

    You will probably have to harden it a little for your needs. For example, it is possible that some older Word files might have a different $pps_name. A little testing should show whether that is the case. Also, this won't find Office 2007+ style docx or xlsx files.

    --
    John.

Re: How to recognize Word and XLS files
by Marshall (Canon) on Mar 06, 2012 at 07:00 UTC
    I found this link: file sigs.
    starting with: D0 CF 11 E0 A1 B1 1A E1
    DOC, DOT, PPS, PPT, XLA, XLS, WIZ Microsoft Office applications (Word, Powerpoint, Excel, Wizard) (See also Word, Powerpoint, and Excel "subheaders" at byte offset 512)
    So evidently the "magic" at the beginning just says that this is an MS Office document. There is an additional field at byte offset 512 that gives the sub-type. There appear to be some freeware apps at that link that will figure this out. If that fails, I guess you could do some hacking to figure out what the difference is at byte 512 yourself.
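
    A minimal sketch of checking just that 8-byte signature, using only core Perl. The helper name is_ole_container is made up for illustration, and a match only tells you the file is an OLE container, not which Office application wrote it.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the file starts with the 8-byte
# signature D0 CF 11 E0 A1 B1 1A E1 quoted above.
sub is_ole_container {
    my ($filename) = @_;

    my $magic = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";

    # Read the first 8 bytes in binary mode.
    open my $fh, '<:raw', $filename or return 0;
    my $got = read $fh, my $header, 8;
    close $fh;

    # Files shorter than 8 bytes cannot carry the signature.
    return 0 if !defined $got || $got != 8;

    return $header eq $magic ? 1 : 0;
}

# Report on every file named on the command line.
for my $filename (@ARGV) {
    printf "%-20s = %s\n", $filename,
        is_ole_container($filename) ? 'ole' : 'not_ole';
}
```

    This is only useful as a cheap first filter before looking deeper into the file for the sub-type.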

    Update: I did some hacking around on some of my files. I suspect that different versions of MS Office have different formats - I'm not finding the signatures that some folks claim should be there in my Excel 2000 file - but maybe my brain isn't counting byte offsets right this late evening! However, I did notice that
    57006F00 72006B00 62006F00 6F006B00 W.o.r.k.b.o.o.k
    appears in my .XLS file. So maybe, worst case, there is some ad hoc string you can search for that will give you the answer you need?
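
    A rough sketch of that ad hoc search, using only core modules. It looks for the UTF-16LE spelling of "Workbook" (the byte pattern shown above) and, by assumption, of "WordDocument" as the Word counterpart. The helper name guess_by_stream_name is made up for illustration, and this is a heuristic only: a document with an embedded object of the other type could contain both strings.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: guess the type from names found in the raw bytes.
sub guess_by_stream_name {
    my ($bytes) = @_;

    # The names are stored as UTF-16LE, e.g. "W\0o\0r\0k\0b\0o\0o\0k\0".
    my $workbook = encode( 'UTF-16LE', 'Workbook' );
    my $word     = encode( 'UTF-16LE', 'WordDocument' );

    # Assumption: Word is tested first; files containing both names
    # are ambiguous and will be reported as 'doc'.
    return 'doc' if index( $bytes, $word ) >= 0;
    return 'xls' if index( $bytes, $workbook ) >= 0;
    return 'unknown';
}

# Slurp each file named on the command line in binary mode.
for my $filename (@ARGV) {
    open my $fh, '<:raw', $filename
        or do { warn "$filename: $!\n"; next };
    my $bytes = do { local $/; <$fh> };
    close $fh;

    printf "%-20s = %s\n", $filename, guess_by_stream_name($bytes);
}
```

    Reading whole files into memory is fine for typical Office documents; for very large files you would want to scan in chunks instead.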

Re: How to recognize Word and XLS files
by jmcnamara (Monsignor) on Mar 06, 2012 at 11:32 UTC

    A better and more complete way than the OLE version above is to use Image::ExifTool's ImageInfo. This also identifies the new Office file formats, with or without extensions:

    #!/usr/bin/perl

    use strict;
    use warnings;
    use Image::ExifTool 'ImageInfo';

    my @files = (
        'test.xls',
        'test.xlsx',
        'test.doc',
        'test.docx',
        'test.ppt',
        'test.pptx',
    );

    for my $filename ( @files ) {
        my $info = ImageInfo( $filename );
        printf( "%-20s = %s\n", $filename, $info->{FileType} );
    }

    __END__

    Output:

    $ perl exif_check.pl
    test.xls             = XLS
    test.xlsx            = XLSX
    test.doc             = DOC
    test.docx            = DOCX
    test.ppt             = PPT
    test.pptx            = PPTX

    It might seem a little odd using Image::ExifTool for this but it has a large number of recognised formats.

    --
    John.

Re: How to recognize Word and XLS files
by Anonymous Monk on Mar 06, 2012 at 06:26 UTC

    Is there an easy way to recognize these two types?

    Yeah, use file
