Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Hi, I have a list of files without any extension, and I'd like to recognize the subset of Word and XLS files from the file content. I've tried File::MMagic, which uses magic numbers, but it can't seem to distinguish Word from XLS files: both are returned as msword. File::LibMagic is a little complicated to compile on my machine. Is there an easy way to recognize these two types? I'd really appreciate it.
Re: How to recognize Word and XLS files
by CountZero (Bishop) on Mar 06, 2012 at 07:10 UTC
Office Binary File Formats describes the file formats of Word, Excel and PowerPoint. You can always open the file to be tested in binary mode and check whether its structure matches one of these (or other MS) formats. The key is in the "File Information Block". For Word files, there is an ident structure (the FibBase structure) in the first 32 bytes of that block. If its first two bytes are 0xA5EC, then the file is a Word file. Now the only problem is to know where this File Information Block starts. It may be at the very beginning of the file, but I am not sure ...
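As a rough illustration of the identifier check described above, here is a minimal sketch. It assumes you already have a buffer that begins exactly where the File Information Block begins, which, as noted, may not be the start of the file. The name looks_like_word_fib is made up for this example. The identifier 0xA5EC is a 16-bit value stored little-endian, so the raw bytes on disk are EC A5.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $bytes starts with the Word FIB identifier.
# Assumption: $bytes begins exactly where the File Information Block begins.
sub looks_like_word_fib {
    my ($bytes) = @_;
    return 0 if !defined $bytes || length($bytes) < 2;

    # 'v' unpacks an unsigned 16-bit little-endian integer.
    my $ident = unpack 'v', $bytes;
    return $ident == 0xA5EC ? 1 : 0;
}

# On disk the identifier 0xA5EC appears as the byte pair EC A5.
print looks_like_word_fib("\xEC\xA5\xC1\x00") ? "word\n" : "not word\n";

# The generic OLE signature bytes do not match the Word identifier.
print looks_like_word_fib("\xD0\xCF\x11\xE0") ? "word\n" : "not word\n";
```

Finding the right offset to feed into this check is the part the sketch leaves open.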
CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics
Re: How to recognize Word and XLS files
by jmcnamara (Monsignor) on Mar 06, 2012 at 09:46 UTC
Here is a small program that uses OLE::Storage_Lite to distinguish Microsoft doc and xls files.
#!/usr/bin/perl

use strict;
use warnings;
use OLE::Storage_Lite;

my @files = (
    'test.xls',
    'test.doc',
    'test.ppt',
    'test.txt',
);

for my $filename ( @files ) {
    printf( "%-20s = %s\n", $filename, check_ole_filetype( $filename ) );
}

sub check_ole_filetype {

    my $filename = shift;

    # Check that the file exists.
    return 'not_found' if !-e $filename;

    # Create an OLE::Storage_Lite object to read the file.
    my $ole = OLE::Storage_Lite->new( $filename );
    my $pps = $ole->getPpsTree();

    # If getPpsTree() failed then this isn't an OLE file.
    return 'not_ole_file' if !$pps;

    # Loop through the PPS children below the root.
    for my $child_pps ( @{ $pps->{Child} } ) {

        my $pps_name = OLE::Storage_Lite::Ucs2Asc( $child_pps->{Name} );

        # Match an Excel xls file.
        if ( $pps_name eq 'Workbook' || $pps_name eq 'Book' ) {
            return 'xls';
        }

        # Match a Word document.
        if ( $pps_name eq 'WordDocument' ) {
            return 'doc';
        }
    }

    return 'unknown_ole_file';
}

__END__
Output:
$ perl ole_check.pl
test.xls = xls
test.doc = doc
test.ppt = unknown_ole_file
test.txt = not_ole_file
You will probably have to harden it a little for your needs. For example, it is possible that some older Word files have a different $pps_name. A little testing should show whether that is the case. Also, this won't find Office 2007+ style docx or xlsx files.
--
John.
|
$ mimetype junk.doc
junk.doc: application/msword
$ mimetype junk.xls
junk.xls: application/vnd.ms-excel
$ mimetype junk.xlsx
junk.xlsx: application/zip
$ mimetype junk.docx
junk.docx: application/zip
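Since docx and xlsx both come back as plain zip here, one way to tell them apart is to look at the member names inside the archive. The following is only a sketch of that idea using core Perl. The function name guess_ooxml_type is made up, and instead of parsing the zip properly it scans the raw bytes for well-known member names, which zip stores uncompressed in its headers. That is a heuristic assumption and could misfire on unusual files.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the Office 2007+ type of a zip-based file.
# Heuristic only: scans the raw bytes for well-known member names rather
# than parsing the archive structure.
sub guess_ooxml_type {
    my ($filename) = @_;

    open my $fh, '<:raw', $filename or return 'unreadable';
    my $data = do { local $/; <$fh> };
    close $fh;

    # Zip local file headers start with "PK\x03\x04".
    return 'not_zip'
        if !defined $data || substr( $data, 0, 4 ) ne "PK\x03\x04";

    return 'docx' if index( $data, 'word/document.xml' )    >= 0;
    return 'xlsx' if index( $data, 'xl/workbook.xml' )      >= 0;
    return 'pptx' if index( $data, 'ppt/presentation.xml' ) >= 0;

    return 'unknown_zip';
}

# Usage: perl ooxml_check.pl file1 file2 ...
for my $filename (@ARGV) {
    printf "%-20s = %s\n", $filename, guess_ooxml_type($filename);
}
```

This reads the whole file into memory, which is fine for typical documents but worth changing for very large ones.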
Re: How to recognize Word and XLS files
by Marshall (Canon) on Mar 06, 2012 at 07:00 UTC
I found this link: file sigs.
starting with: D0 CF 11 E0 A1 B1 1A E1
DOC, DOT, PPS, PPT, XLA, XLS, WIZ Microsoft Office applications (Word, Powerpoint, Excel, Wizard)
(See also Word, Powerpoint, and Excel "subheaders" at byte offset 512)
So evidently the "magic" at the beginning just says that this is an MS Office document. There is an additional field at byte offset 512 that gives the sub-type. There appear to be some freeware apps at that link that will figure this out. If that fails, I guess you could do some hacking to figure out what the difference is at byte 512 yourself.
Update: I did some hacking around on some of my files. I suspect that different versions of MS Office have different formats - I'm not finding the signatures that some folks claim should be there in my Excel 2000 file - but maybe my brain isn't counting byte offsets right this late evening! However, I did notice that
57006F00 72006B00 62006F00 6F006B00 W.o.r.k.b.o.o.k
appears in my .XLS file. So maybe, worst case, there is some ad hoc string you can search for that will give you the answer you need?
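For what it's worth, that ad hoc search could be sketched like this with core Perl only. It checks the 8-byte signature quoted above and then looks for the stream names as UTF-16LE byte strings (each ASCII character followed by a zero byte, as in the hex dump). The names guess_ole_type and utf16le are made up for this example, and searching for "WordDocument" is an assumption borrowed from the stream name used elsewhere in this thread. The older Excel stream name "Book" is left out because such a short string seems likely to produce false matches.

```perl
use strict;
use warnings;

# The 8-byte signature shared by the old binary Office files.
my $OLE_SIGNATURE = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";

# Encode an ASCII name as UTF-16LE bytes, e.g. "Workbook" -> "W\0o\0r\0k\0...".
sub utf16le {
    my ($name) = @_;
    return pack 'v*', map { ord($_) } split //, $name;
}

# Hypothetical helper: ad hoc guess based on stream names found in the file.
sub guess_ole_type {
    my ($filename) = @_;

    open my $fh, '<:raw', $filename or return 'unreadable';
    my $data = do { local $/; <$fh> };
    close $fh;

    return 'not_ole_file'
        if !defined $data || substr( $data, 0, 8 ) ne $OLE_SIGNATURE;

    # A file with embedded objects may contain both names; testing Word
    # first is an arbitrary choice, not a rule from any specification.
    return 'doc' if index( $data, utf16le('WordDocument') ) >= 0;
    return 'xls' if index( $data, utf16le('Workbook') )     >= 0;

    return 'unknown_ole_file';
}

# Usage: perl ole_guess.pl file1 file2 ...
for my $filename (@ARGV) {
    printf "%-20s = %s\n", $filename, guess_ole_type($filename);
}
```

Unlike a real OLE parser, this does not verify that the matched string is actually a directory entry, so treat the result as a hint.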
Re: How to recognize Word and XLS files
by jmcnamara (Monsignor) on Mar 06, 2012 at 11:32 UTC
A better, and more complete way than the OLE version above, is using Image::ExifTool's ImageInfo. This also identifies the new Office file formats with or without extensions:
#!/usr/bin/perl
use strict;
use warnings;
use Image::ExifTool 'ImageInfo';
my @files = (
'test.xls', 'test.xlsx',
'test.doc', 'test.docx',
'test.ppt', 'test.pptx',
);
for my $filename ( @files ) {
my $info = ImageInfo( $filename );
printf( "%-20s = %s\n", $filename, $info->{FileType} );
}
__END__
Output:
$ perl exif_check.pl
test.xls = XLS
test.xlsx = XLSX
test.doc = DOC
test.docx = DOCX
test.ppt = PPT
test.pptx = PPTX
It might seem a little odd using Image::ExifTool for this but it has a large number of recognised formats.
--
John.
Re: How to recognize Word and XLS files
by Anonymous Monk on Mar 06, 2012 at 06:26 UTC