http://qs321.pair.com?node_id=1164304


in reply to How to Extract PDF tables using Perl

Hi, the solution I've seen is to use:
$doc->getPageContent($pagenum);
instead of:
$doc->getPageText($pagenum);

But even though the solution sounds simple, there is still work for you to do.

You will have to parse the return value of getPageContent.

Here is a possible example of page content:

9.9213 0 Td Content Tj

The two numbers before the Td tell you the position of the content (strictly speaking, an x/y offset from the start of the current text line).
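To make the parsing step a bit more concrete, here is a minimal sketch that pulls positioned text out of a content string. It is only an illustration under some assumptions: extract_fragments is a hypothetical helper (not part of any module), it only understands BT, Td and Tj, it expects the string operand in parentheses as in "(Content) Tj", and it does not decode escape sequences. Since Td offsets are relative, the sketch adds them up and resets the position at each BT.

```perl
use strict;
use warnings;

# Hypothetical helper: collect { x, y, text } fragments from a raw
# page content string. Handles only BT, Td and Tj (a simplification).
sub extract_fragments {
    my ($content) = @_;
    my ($x, $y) = (0, 0);
    my @fragments;

    # A PDF number: optional sign, digits, optional fraction.
    my $num = qr/[-+]?(?:\d+\.?\d*|\.\d+)/;

    while ($content =~ m{
            \b(BT)\b                               # start of a text object
          | (?<![\w.+-])($num)\s+($num)\s+Td\b     # tx ty Td
          | \(((?:[^()\\]|\\.)*)\)\s*Tj\b          # (string) Tj
        }gsx) {
        if (defined $1) {
            # BT resets the text position.
            ($x, $y) = (0, 0);
        }
        elsif (defined $2) {
            # Td moves relative to the start of the current line.
            $x += $2;
            $y += $3;
        }
        else {
            # Tj shows a string at the current position (escapes left as-is).
            push @fragments, { x => $x, y => $y, text => $4 };
        }
    }
    return @fragments;
}

# Made-up sample content, not taken from a real PDF.
my $content = 'BT /F1 12 Tf 9.9213 0 Td (Content) Tj 100 0 Td (More) Tj ET';
for my $f (extract_fragments($content)) {
    printf "%.4f %.4f %s\n", $f->{x}, $f->{y}, $f->{text};
}
```

Content produced by other tools may use operators this sketch ignores (Tm, TJ, T* and so on), so treat it as a starting point only.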

UPDATE: This gives you a HashRef of your Page:
$doc->getPageContentTree($pagenum)

Re^2: How to Extract PDF tables using Perl
by LanX (Saint) on May 27, 2016 at 13:22 UTC
    Td means table data, like in HTML, and "Tj" is the cell's text???

    The PDF you are parsing seems to have preserved semantic information; I suppose this approach depends on the way it was generated.

    I doubt this is generally true. (?)

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      Yes, that's right.

      It always depends on the way the PDF was generated. (Some PDF tools even position every single character.)

      Maybe the getPageContentTree method helps to build a more general solution.

      The example is based on the solution I've seen.
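      Because the layout depends so much on the generator, one way to get from positioned fragments to table rows is to group them by their y position. This is only a rough sketch, not a general solution: group_into_rows is a hypothetical helper, the fragment shape (hash refs with absolute x, y and text) and the tolerance of 2 units are assumptions, and the sample data is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: turn positioned fragments into rows of cell texts.
# Fragments whose y values lie within $tolerance of the row's first
# fragment are treated as one row.
sub group_into_rows {
    my ($fragments, $tolerance) = @_;
    $tolerance //= 2;    # assumed tolerance in PDF units

    my @rows;
    # Top to bottom: in default PDF user space, y grows upwards.
    for my $frag (sort { $b->{y} <=> $a->{y} } @$fragments) {
        if (@rows && abs($rows[-1]{y} - $frag->{y}) <= $tolerance) {
            push @{ $rows[-1]{cells} }, $frag;
        }
        else {
            push @rows, { y => $frag->{y}, cells => [$frag] };
        }
    }

    # Within each row, order the cells left to right and keep the text.
    my @table;
    for my $row (@rows) {
        my @cells = sort { $a->{x} <=> $b->{x} } @{ $row->{cells} };
        push @table, [ map { $_->{text} } @cells ];
    }
    return @table;
}

# Made-up fragments for illustration.
my @fragments = (
    { x => 200, y => 700,   text => 'Price' },
    { x => 50,  y => 700.5, text => 'Item'  },
    { x => 50,  y => 680,   text => 'Apple' },
    { x => 200, y => 680,   text => '1.20'  },
);
print join("\t", @$_), "\n" for group_into_rows(\@fragments);
```

      For PDFs that position every single character, the cells in a row would additionally need to be merged by x distance, which this sketch does not attempt.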

        Thanks, that's interesting ... I'll give it a try next time I need to parse PDF. :)

        Cheers Rolf