
Re: Extracting text from MS Word files on a Linux box

by aitap (Curate)
on Jun 21, 2018 at 13:31 UTC (#1217117)


in reply to Extracting text from MS Word files on a Linux box

If plain text is all you need (no formatting), you may have success piping the output of Antiword for older .DOC files, and of docx2txt for the newer .DOCX format.
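As a rough illustration of the piping approach, here is a minimal Python sketch that picks an extractor by file extension and captures its standard output. The helper names (`extractor_command`, `extract_text`) are hypothetical, and the command lines are assumptions: `antiword FILE` printing text to stdout, and an installed `docx2txt` command that accepts `-` as the output file to mean stdout. Check both against the versions on your box before relying on them.

```python
import subprocess
from pathlib import Path

# Assumed command lines; verify them against the tools installed locally.
EXTRACTORS = {
    ".doc": lambda path: ["antiword", path],        # text goes to stdout
    ".docx": lambda path: ["docx2txt", path, "-"],  # "-" is assumed to mean stdout
}


def extractor_command(path):
    """Return the argument list for the tool that handles this file type."""
    suffix = Path(path).suffix.lower()
    try:
        build = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"no extractor known for {suffix!r} files") from None
    return build(str(path))


def extract_text(path):
    """Run the extractor and return whatever it wrote to stdout as text."""
    result = subprocess.run(
        extractor_command(path),
        capture_output=True,
        check=True,  # raise CalledProcessError if the tool fails
    )
    # The output encoding depends on the tool's settings and the locale,
    # so undecodable bytes are replaced rather than raising an error.
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the arguments as a list avoids going through a shell, so file names with spaces or quotes need no escaping.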

LibreOffice Writer used to support the older .DOC format better than the newer .DOCX; the situation may have changed since, but in the general case you should assume that you are going to lose some formatting information.

Re^2: Extracting text from MS Word files on a Linux box
by Laurent_R (Canon) on Jun 21, 2018 at 15:44 UTC
    Yes, I do not need any formatting, just plain text, and Antiword (which I had never heard of before) seems to produce exactly what I need. The result is actually very clean (surprisingly clean).

    Thank you very much, aitap, I think it is likely we will go for a solution using that.
