Here is a small program that uses OLE::Storage_Lite to distinguish between Microsoft doc and xls files.
#!/usr/bin/perl
use strict;
use warnings;
use OLE::Storage_Lite;
my @files = (
    'test.xls',
    'test.doc',
    'test.ppt',
    'test.txt',
);
for my $filename ( @files ) {
    printf( "%-20s = %s\n", $filename, check_ole_filetype( $filename ) );
}
sub check_ole_filetype {
    my $filename = shift;
    # Check that the file exists.
    return 'not_found' if !-e $filename;
    # Create an OLE::Storage_Lite object to read the file.
    my $ole = OLE::Storage_Lite->new( $filename );
    my $pps = $ole->getPpsTree();
    # If getPpsTree() failed then this isn't an OLE file.
    return 'not_ole_file' if !$pps;
    # Loop through the PPS children below the root.
    for my $child_pps ( @{ $pps->{Child} } ) {
        my $pps_name = OLE::Storage_Lite::Ucs2Asc( $child_pps->{Name} );
        # Match an Excel xls file.
        if ( $pps_name eq 'Workbook' || $pps_name eq 'Book' ) {
            return 'xls';
        }
        # Match a Word document.
        if ( $pps_name eq 'WordDocument' ) {
            return 'doc';
        }
    }
    return 'unknown_ole_file';
}
__END__
Output:
$ perl ole_check.pl
test.xls             = xls
test.doc             = doc
test.ppt             = unknown_ole_file
test.txt             = not_ole_file
You will probably have to harden it a little for your needs. For example, it is possible that some older Word files have a different $pps_name. A little testing should show whether that is the case. Also, this won't find Office 2007+ style docx or xlsx files, since those are not OLE files.
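As one possible hardening step, here is a rough sketch that sniffs the first few bytes of a file before handing it to OLE::Storage_Lite. The sub name sniff_container and its return strings are made up for this example and are not part of the module. It relies on two well-known signatures: OLE compound files start with the bytes D0 CF 11 E0 A1 B1 1A E1, and ZIP archives (which is what docx and xlsx files are) start with "PK\x03\x04". Note that a 'zip' result only means the file might be an Office 2007+ document; any ordinary zip file will match too.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of OLE::Storage_Lite): classify a file
# by its leading signature bytes.
sub sniff_container {
    my $filename = shift;

    # Read up to the first 8 bytes in binary mode.
    open my $fh, '<:raw', $filename or return 'unreadable';
    my $got = read( $fh, my $header, 8 );
    close $fh;
    return 'unreadable' if !defined $got;

    # OLE compound document signature (old style doc, xls, ppt).
    return 'ole' if $header eq "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";

    # ZIP local file header. This could be a docx or xlsx file,
    # but it could just as well be any other zip archive.
    return 'zip' if substr( $header, 0, 4 ) eq "PK\x03\x04";

    return 'other';
}
```

You could call this first and only pass files that come back as 'ole' to check_ole_filetype(). Telling docx from xlsx would need a look inside the zip archive, which this sketch doesn't attempt.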
--
John.