How Can I get the Word Document Page Number in Perl Using Linux?

by prabuvos (Initiate)
on Apr 16, 2013 at 13:24 UTC ( [id://1028914] : perlquestion )

prabuvos has asked for the wisdom of the Perl Monks concerning the following question:

Hello guys! I need to retrieve a Word document's content and its page numbers in Perl on Linux.

Replies are listed 'Best First'.
Re: How Can I get the Word Document Page Number in Perl Using Linux?
by dasgar (Priest) on Apr 16, 2013 at 15:52 UTC

    If I were on a Windows box that had Microsoft Word installed, I would use Win32::OLE to control Word to retrieve that information. However, that combo isn't available on Linux.

    One alternative would be to use OpenOffice or LibreOffice. Although I have never done so myself, I believe they offer some kind of API that you could drive from Perl, much as one uses OLE to control Microsoft Office on Windows.
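    A minimal sketch of that route, not a tested recipe: it assumes LibreOffice's `soffice` binary and poppler's `pdfinfo` tool are installed and on the PATH (neither is named in this thread). It converts the document to PDF in headless mode and reads the page count from the PDF. The helper names are hypothetical, and LibreOffice may paginate differently from Word, so the count is only an approximation of what Word would show.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: the soffice command line for a headless PDF conversion.
sub convert_command {
    my ($doc, $outdir) = @_;
    return ('soffice', '--headless', '--convert-to', 'pdf',
            '--outdir', $outdir, $doc);
}

# Hypothetical helper: pull the number out of pdfinfo's "Pages:" line.
sub parse_page_count {
    my ($pdfinfo_output) = @_;
    return $pdfinfo_output =~ /^Pages:\s+(\d+)/m ? $1 : undef;
}

# Convert the document, then ask pdfinfo how many pages the PDF has.
sub count_pages {
    my ($doc) = @_;
    my $outdir = tempdir(CLEANUP => 1);
    system(convert_command($doc, $outdir)) == 0
        or die "soffice failed for $doc\n";
    my ($base) = fileparse($doc, qr/\.[^.]*/);
    my $pdf = File::Spec->catfile($outdir, "$base.pdf");
    open my $fh, '-|', 'pdfinfo', $pdf or die "cannot run pdfinfo: $!\n";
    my $info = do { local $/; <$fh> };
    close $fh;
    return parse_page_count($info // '');
}

# Run only when a file name is given on the command line.
if (@ARGV) {
    my $pages = count_pages($ARGV[0]);
    print defined $pages ? "$pages pages\n" : "page count not found\n";
}
```

    Saved as, say, `count_pages.pl` (a made-up name), it would be run as `perl count_pages.pl test1.doc`.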

Re: How Can I get the Word Document Page Number in Perl Using Linux?
by hdb (Monsignor) on Apr 17, 2013 at 06:37 UTC

    More seriously, your problem has no solution that does not involve running your document through Word. I found this reference, page 16:

    It is worth mentioning that in Word, “pages” do not exist in the document file. Like a professional typesetter, Word makes up its pages on the fly when it displays or prints a document. Word uses measurements from the installed fonts and the installed printer driver to do this. It is almost impossible to get two machines so exactly similar that a document will paginate with exactly the same page breaks on each. Sometimes people complain that when they open the document on a different machine, some of the page numbers in the TOC or Index are “wrong”. They’re not: when the document is opened on the other machine, minute variations in set-up that do not show over a ten page memo will cause variations in the position of page breaks in a 1,000-page manual. If you remember to update the TOC and Index before you print, the problem corrects itself.

    So the page numbers do not exist in the document, and therefore you cannot retrieve them to split the text into pages. Only Word can do that for you.
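    What the file can hold are the manual breaks an author inserted. The sketch below illustrates that limit: it splits extracted text on the form-feed character (0x0C), which Word's binary format uses for manual page breaks and section breaks. It assumes the `:raw` output of Text::Extract::Word (the module used elsewhere in this thread) leaves that character in place, which I have not verified. The helper names are hypothetical, and the resulting chunks are not Word's rendered pages.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split text on form feeds (0x0C), the character Word's
# binary format uses for manual page breaks and section breaks.
sub split_on_hard_breaks {
    my ($text) = @_;
    return split /\x0C/, $text, -1;    # -1 keeps trailing empty chunks
}

# Pair each chunk with a running number. These are chunk numbers,
# not the page numbers Word would compute when laying out the document.
sub number_chunks {
    my ($text) = @_;
    my @chunks = split_on_hard_breaks($text);
    return map { [ $_ + 1, $chunks[$_] ] } 0 .. $#chunks;
}

# Run only when a file name is given on the command line.
if (@ARGV) {
    require Text::Extract::Word;       # loaded only when actually needed
    binmode STDOUT, ':encoding(UTF-8)';
    my $raw = Text::Extract::Word->new($ARGV[0])->get_text(':raw');
    printf "--- chunk %d ---\n%s\n", @$_ for number_chunks($raw);
}
```

    A document with no manual breaks comes back as a single chunk however many pages it prints on, which is the point made above.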

    PS.: Apologies for my little joke on the page numbers above...

      (Update: sorry, wrong parent, please reap)
Re: How Can I get the Word Document Page Number in Perl Using Linux?
by thezip (Vicar) on Apr 16, 2013 at 15:40 UTC

    I Want Pony!

    But seriously, what have you tried already?

    What can be asserted without proof can be dismissed without proof. - Christopher Hitchens, 1949-2011
      Never ask for a pony when you can ask for a unicorn! :-)
      A Monk aims to give answers to those who have none, and to learn from those who know more.

        This gives the text but no page numbers.

        use strict;
        use warnings;
        use Text::Extract::Word;

        my $file = Text::Extract::Word->new("test1.doc");
        my $raw  = $file->get_text(':raw');
        print $raw;

        Here is something that gives you the page numbers for a small document, say 5 pages:

        perl -e 'print "$_\n" for (1..5)'