Laurent_R has asked for the wisdom of the Perl Monks concerning the following question:

Dear fellow monks,

I need to extract some textual information from a large number of MS Word files, mostly in the .doc format (i.e. Office 2003 or earlier, if I remember correctly); there may also be some .docx files, but let's concentrate on the .doc format for now.

This is quite new to me: I have never had to do this before, and I do not usually work on Windows or with Microsoft Office document formats, except for a few cases with Excel documents. So I am not sure how to handle this requirement.

I also found that there seem to be surprisingly few CPAN solutions for extracting data from MS Word documents (especially in the pre-2003 format).

The Text::Extract::Word module may do what I need, but it depends on the OLE::Storage_Lite module (and perhaps some others). My understanding is that the OLE::... modules work only on Windows boxes (presumably because they use Windows and/or MS Office libraries). Is this assumption correct? If so, it would be a problem for us, because ultimately the whole, much larger process has to run on a Linux server.

Does any monk here have experience with this kind of need? To summarize: how to extract textual data from MS Word files (.doc format, Office 2003 or earlier) on a Linux box, where OLE modules would presumably not run? Any suggestions? Any other module that I overlooked? Thank you very much for any help.

Re: Extracting text from MS Word files on a Linux box
by haukex (Archbishop) on Jun 21, 2018 at 11:04 UTC

    The following works for me with LibreOffice 5.1:

    use IPC::System::Simple qw/capturex/;
    my $text = capturex('libreoffice', '--convert-to', 'txt:Text (encoded):UTF8',
        $filename, '--cat', '--headless');
    utf8::decode($text);
    $text =~ s/\A\x{FEFF}//;  # remove BOM

      Thank you very much, haukex, for your suggestion. I'll try it, but I suspect it may well work with recent .docx files (whose format is very similar to the OpenOffice format) and probably not with the old proprietary binary format used by MS Office 2003 and earlier. I'll give it a try anyway.

        I tested with an older format .doc file (not .docx), and AFAIK LibreOffice supports both the older and newer formats.

Re: Extracting text from MS Word files on a Linux box
by aitap (Curate) on Jun 21, 2018 at 13:31 UTC

    If text is all you need (no formatting), you may have success with piping from Antiword (and docx2txt for later versions of the format).

    LibreOffice Writer used to support the older .DOC format better than the newer .DOCX; the situation may have changed since, but in the general case you should assume that you are going to lose some formatting information.

      Yes, I do not need any formatting, just plain text, and Antiword (which I had never heard of before) seems to produce exactly what I need. The result is actually very clean (surprisingly clean).

      Thank you very much, aitap, I think it is likely we will go for a solution using that.
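
For reference, piping from Antiword in Perl could look roughly like the sketch below. It assumes antiword is installed and on the PATH, and that its UTF-8.txt character mapping file is available for the -m option. The names capture_stdout and antiword_text are made up for this illustration. The list form of open is used so that file names containing spaces or shell metacharacters never pass through a shell.

```perl
use strict;
use warnings;

# Run an external command (list form, so no shell is involved)
# and return everything it writes to STDOUT, decoded as UTF-8.
sub capture_stdout {
    my @cmd = @_;
    open(my $pipe, '-|', @cmd)
        or die "Cannot run $cmd[0]: $!\n";
    binmode($pipe, ':encoding(UTF-8)');
    my $text = do { local $/; <$pipe> };   # slurp all output
    close($pipe)
        or die "$cmd[0] failed (wait status $?)\n";
    return $text;
}

# Hypothetical wrapper: plain text of one old-format .doc file.
# Assumption: "-m UTF-8.txt" makes antiword emit UTF-8.
sub antiword_text {
    my ($file) = @_;
    return capture_stdout('antiword', '-m', 'UTF-8.txt', $file);
}
```

A call such as `my $text = antiword_text('report.doc');` (the file name is only a placeholder) would then return the document text as a Perl string, or die if antiword is missing or rejects the file.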

Re: Extracting text from MS Word files on a Linux box
by hippo (Bishop) on Jun 21, 2018 at 11:09 UTC

    Have you tried strings? Always used to do the trick before the MS format changed.

      docx is just a bunch of zipped XML files and some misc files. strings will fail due to ZIP, but once unpacked, strings will happily dig through the XML files.
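
The unpacking step can be sketched in Perl with the core module IO::Uncompress::Unzip. This is only a rough illustration, assuming the main text lives in the archive member word/document.xml and that a regex-based tag strip is good enough for plain text; a real XML parser would be more robust. The sub names docx_xml_to_text and docx_text are invented for this example.

```perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Turn the XML of word/document.xml into plain text (crude sketch):
# paragraph ends become newlines, all other tags are dropped.
sub docx_xml_to_text {
    my ($xml) = @_;
    $xml =~ s{</w:p>}{\n}g;          # end of paragraph -> newline
    $xml =~ s{<w:tab\s*/>}{\t}g;     # tab element -> tab character
    $xml =~ s{<[^>]+>}{}g;           # remove every remaining tag
    my %entity = (lt => '<', gt => '>', quot => '"', apos => "'", amp => '&');
    $xml =~ s/&(lt|gt|quot|apos|amp);/$entity{$1}/g;   # predefined entities
    return $xml;
}

# Read word/document.xml straight out of the .docx (a ZIP archive).
sub docx_text {
    my ($file) = @_;
    my $xml;
    unzip($file => \$xml, Name => 'word/document.xml')
        or die "Cannot read $file: $UnzipError\n";
    utf8::decode($xml);              # the member is UTF-8 encoded
    return docx_xml_to_text($xml);
}
```

Headers, footers and footnotes are stored in other members of the archive, so this sketch covers the body text only.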


      I just did not think of it. That's a very good idea, I'll try it. I don't know how it works under the hood, but I know that the Linux grep command is able to find strings in an MS Word file, so, if strings works similarly, the Linux strings command might be all I need.

      Thanks hippo.
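
As for what happens under the hood: in essence, strings scans the raw bytes of a file and prints every run of printable characters of at least a minimum length (4 by default). A minimal Perl sketch of that idea follows. It handles single-byte ASCII runs only, so text that Word stored as 16-bit characters would not be found by it. The sub names ascii_strings and file_strings are made up for this illustration.

```perl
use strict;
use warnings;

# Return every run of at least $min printable ASCII characters
# (space through tilde, plus tab) found in a byte string.
sub ascii_strings {
    my ($bytes, $min) = @_;
    $min = 4 unless defined $min;
    return $bytes =~ /([\x20-\x7E\t]{$min,})/g;
}

# Apply the same scan to a whole file, read as raw bytes.
sub file_strings {
    my ($path, $min) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!\n";
    my $bytes = do { local $/; <$fh> };   # slurp the file
    close($fh);
    return ascii_strings($bytes, $min);
}
```

Something like `print "$_\n" for file_strings('report.doc');` (placeholder file name) would then list the candidate text fragments, along with whatever metadata and font names happen to look like text.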