Extracting text from MS Word files on a Linux box

Laurent_R has asked for the wisdom of the Perl Monks concerning the following question:

Dear fellow monks,

I need to extract some textual information from a large number of MS Word files, mostly in the .doc format (i.e. Office 2003 or earlier, if I remember correctly); there may also be some .docx files, but let's concentrate on the .doc format for now.

This is quite new to me: I have never had to do this before, and I do not usually work on the Windows platform or with Microsoft Office document formats, except for a few cases with Excel documents. So I am not sure how to handle this requirement.

I also found that there seem to be surprisingly few CPAN solutions for extracting data from MS Word documents (especially in the old .doc format used by Office 2003 and earlier).

The Text::Extract::Word module may do what I need, but it depends on the OLE::Storage_Lite module (and perhaps some others). My understanding is that the OLE::... modules only work on Windows boxes (because they presumably use Windows and/or MS Office libraries). Is this a correct assumption? If so, it would be a problem for us because, ultimately, the whole (much larger) process has to run on a Linux server.

Does any monk here have experience with this kind of requirement? To summarize: how to extract textual data from MS Word files (the Office 2003 or earlier .doc format) on a Linux box, where OLE modules would presumably not run. Any suggestions? Any other module that I overlooked? Thank you very much for any help.


Replies are listed 'Best First'.
Re: Extracting text from MS Word files on a Linux box by haukex (Archbishop) on Jun 21, 2018 at 11:04 UTC
The following works for me with LibreOffice 5.1: `use IPC::System::Simple qw/capturex/; my $text = capturex('libreoffice', '--convert-to', 'txt:Text (encoded):UTF8', $filename, '--cat', '--headless'); utf8::decode($text); $text=~s/\A\x{FEFF}//; # remove BOM`
Re^2: Extracting text from MS Word files on a Linux box by Laurent_R (Canon) on Jun 21, 2018 at 11:50 UTC
Thank you very much, haukex, for your suggestion. I'll try it, but I suspect that this might very well work with recent `.docx` files (which have a format very similar to the OpenOffice format), but probably not with the old proprietary binary format associated with MS Office 2003 and earlier. I'll give it a try anyway.
Re^3: Extracting text from MS Word files on a Linux box by haukex (Archbishop) on Jun 21, 2018 at 11:57 UTC
I tested with an older format `.doc` file (not `.docx`), and AFAIK LibreOffice supports both the older and newer formats.
Re: Extracting text from MS Word files on a Linux box by aitap (Curate) on Jun 21, 2018 at 13:31 UTC
If text is all you need (no formatting), you may have success with piping from Antiword (and docx2txt for later versions of the format). LibreOffice Writer used to support the older .DOC format better than the newer .DOCX; the situation may have changed since, but in the general case you should assume that you are going to lose some formatting information.
Re^2: Extracting text from MS Word files on a Linux box by Laurent_R (Canon) on Jun 21, 2018 at 15:44 UTC
Yes, I do not need any formatting, just plain text, and Antiword (which I had never heard of before) seems to produce exactly what I need. The result is actually very clean (surprisingly clean). Thank you very much, aitap, I think it is likely we will go for a solution using that.
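For anyone landing here later, a minimal sketch of the piping approach described above, assuming antiword is installed and on the PATH. The helper name `extract_text` is made up for illustration, and the `-m UTF-8.txt` and `-w 0` options are an assumption about the output wanted (UTF-8, one paragraph per line); check them against your antiword man page.

```perl
use strict;
use warnings;

# Run an external converter on $file and return its standard output.
# The list form of the pipe open bypasses the shell, so file names
# containing spaces or quotes are passed through untouched.
sub extract_text {
    my ($file, @cmd) = @_;
    open(my $fh, '-|', @cmd, $file)
        or die "Cannot run $cmd[0]: $!";
    binmode($fh, ':encoding(UTF-8)');
    local $/;                          # slurp mode
    my $text = <$fh>;
    $text = '' unless defined $text;   # converter printed nothing
    close($fh)
        or die "$cmd[0] failed on $file (exit status $?)";
    return $text;
}

# Assumed options: -m UTF-8.txt selects UTF-8 output,
# -w 0 keeps each paragraph on a single line.
my @antiword = ('antiword', '-m', 'UTF-8.txt', '-w', '0');

# Hypothetical usage: perl doc2txt.pl one.doc two.doc
binmode(STDOUT, ':encoding(UTF-8)');
print extract_text($_, @antiword) for @ARGV;
```

Because the command is passed in as a list, the same helper could in principle wrap docx2txt or any other converter that writes text to standard output.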
Re: Extracting text from MS Word files on a Linux box by hippo (Bishop) on Jun 21, 2018 at 11:09 UTC
Have you tried strings? Always used to do the trick before the MS format changed.
Re^2: Extracting text from MS Word files on a Linux box by afoken (Chancellor) on Jun 21, 2018 at 20:18 UTC
"Have you tried strings? Always used to do the trick before the MS format changed."
`docx` is just a bunch of zipped XML files and some misc files. strings will fail due to ZIP, but once unpacked, strings will happily dig through the XML files.
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
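A rough sketch of the "unpack, then dig through the XML" idea, using only the core module IO::Uncompress::Unzip. It assumes the main text lives in the archive member `word/document.xml`; the function names are made up for illustration, and the tag stripping is deliberately crude (it ignores tabs, line breaks, headers, footnotes and so on), so treat it as a starting point rather than a real parser.

```perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw($UnzipError);

# Reduce WordprocessingML to plain text: paragraph end tags become
# newlines, every other tag is dropped, and the five predefined XML
# entities are decoded in a single pass.
sub xml_to_text {
    my ($xml) = @_;
    $xml =~ s{</w:p>}{\n}g;
    $xml =~ s{<[^>]+>}{}g;
    my %entity = (lt => '<', gt => '>', amp => '&', quot => '"', apos => "'");
    $xml =~ s/&(lt|gt|amp|quot|apos);/$entity{$1}/g;
    return $xml;
}

# Read the main document part out of a .docx file and convert it.
sub docx_to_text {
    my ($file) = @_;
    my $z = IO::Uncompress::Unzip->new($file, Name => 'word/document.xml')
        or die "Cannot read $file: $UnzipError";
    my $xml = do { local $/; <$z> };
    $xml = '' unless defined $xml;
    utf8::decode($xml);                # the XML part is UTF-8 encoded
    return xml_to_text($xml);
}

# Hypothetical usage: perl docx2text.pl report.docx
binmode(STDOUT, ':encoding(UTF-8)');
print docx_to_text($_) for @ARGV;
```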
Re^2: Extracting text from MS Word files on a Linux box by Laurent_R (Canon) on Jun 21, 2018 at 11:55 UTC
I just did not think about it. That's a very good idea, I'll try it. I don't know how it works under the hood, but I know that the Linux `grep` command is able to find strings in an MS Word file, so, if it works similarly, the Linux `strings` command might be all I need. Thanks hippo.
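As for what happens under the hood: roughly speaking, `strings` scans the raw bytes and prints every sufficiently long run of printable characters. A minimal Perl sketch of that idea follows; the function name and the default minimum length of 4 are choices made for this illustration. It only looks for single-byte ASCII text, so text stored as 16-bit characters would be missed (GNU strings has `-e l` for little-endian 16-bit text).

```perl
use strict;
use warnings;

# Return every run of at least $min printable ASCII characters
# (space through tilde, plus tab) found in a byte string.
sub printable_runs {
    my ($bytes, $min) = @_;
    $min = 4 unless defined $min;
    return $bytes =~ /([\x20-\x7E\t]{$min,})/g;
}

# Hypothetical usage: perl runs.pl file.doc
for my $file (@ARGV) {
    open(my $fh, '<:raw', $file) or die "Cannot open $file: $!";
    local $/;                          # slurp the whole file
    my $data = <$fh>;
    $data = '' unless defined $data;   # empty file
    close($fh);
    print "$_\n" for printable_runs($data);
}
```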

