
Re: (Get text of Word Document)going through a Win32 MSWORD doc

by buzzcutbuddha (Chaplain)
on Dec 20, 2001 at 22:28 UTC

in reply to going through a Win32 MSWORD doc

If by indexing you mean grabbing each word in a Word document and building an index of those words so you can find them later, the following will give you all of the words in a document. I'll let you focus on the indexing part. :)

#!/usr/bin/perl

# general use directives
use strict;
use warnings;

# project specific use directives
# this comes with the standard ActiveState
# distribution. You can also look for
# a newer version with PPM
use Win32::OLE;

# get the document
# use the full path
# (GetObject returns undef on failure instead of dying,
# so check the return value rather than $@)
my $wd = Win32::OLE->GetObject('C:/pathto/document/foo.doc')
    or die "Unable to load document: ", Win32::OLE->LastError(), "\n";

# all of the Word document data members I'm using
# are explained in the MSDN documentation of the
# external interfaces of a Word Document.
# if you have MSDN, search for "Word OLE".

# get the number of paragraphs
my $paraCount = $wd->{Paragraphs}->Count;

# collect the words; Word collections are indexed from 1
my @words;
for my $foo (1 .. $paraCount) {
    # split ' ' splits on runs of whitespace and drops empty leading fields
    push @words, split ' ', $wd->Paragraphs($foo)->{Range}{Text};
}

# clean up at the end
undef $wd;

That's how you get the words of a Word document out and into an array. You may prefer a different data structure, but again, I'll leave that up to you! I hope this helps.
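
If it helps as a starting point for the indexing part, here is a minimal sketch of one possible data structure: a hash that maps each word to the paragraph numbers it appears in. Everything here is my own assumption rather than anything Word or Win32::OLE provides: build_index is a hypothetical helper, the sample strings stand in for the Text of each paragraph, and the cleanup only handles ASCII letters and digits.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Win32::OLE or Word):
# takes a reference to an array of paragraph strings and returns
# a hash reference mapping each lowercased word to the list of
# paragraph numbers (counted from 1) in which it appears.
sub build_index {
    my ($paragraphs) = @_;
    my %index;
    my $n = 0;
    for my $text (@$paragraphs) {
        $n++;
        my %seen;    # words already recorded for this paragraph
        for my $word (split ' ', $text) {
            $word = lc $word;
            # strip leading and trailing punctuation (ASCII-only assumption)
            $word =~ s/^[^a-z0-9]+|[^a-z0-9]+$//g;
            next if $word eq '' or $seen{$word}++;
            push @{ $index{$word} }, $n;
        }
    }
    return \%index;
}

# Sample data standing in for the text of each Word paragraph.
# Word ends paragraph text with a carriage return, hence the \r.
my @paragraphs = (
    "The quick brown fox.\r",
    "The lazy dog, the quick cat.\r",
);

my $index = build_index(\@paragraphs);
for my $word (sort keys %$index) {
    print "$word: @{ $index->{$word} }\n";
}
```

To use it with the script above, you would push each paragraph's Text onto an array instead of splitting it right away, then pass a reference to that array to build_index. Keeping paragraph numbers rather than just a word list lets you jump back to where a word occurs.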
