Re: How to Extract PDF tables using Perl

Hi, the solution I've seen is to use:

$doc->getPageContent($pagenum);

instead of:

$doc->getPageText($pagenum);

Even though the solution sounds simple, there is still work for you to do.

You will have to parse the return value of getPageContent.

Here is a possible example of page content:

9.9213 0 Td
Content Tj

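The two numbers before Td tell you the position of the content: they are the x and y offsets by which the text position is moved, relative to the start of the current line.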
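
To make the parsing step more concrete, here is a minimal sketch of how the positions could be pulled out of such a string. It assumes the content stream has already been fetched as plain text, that strings are written in the usual parenthesized form, e.g. (Content) Tj, and that only BT, Td and Tj need handling. The helper name extract_text_positions is made up for this example; real streams also use operators such as Tm, TD, T* and TJ, which are ignored here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any library): scan a raw content stream
# and return one [x, y, text] triple for every "(string) Tj" it finds.
sub extract_text_positions {
    my ($content) = @_;
    my ($x, $y) = (0, 0);
    my @found;
    my $num = qr/[-+]?(?:\d+\.?\d*|\.\d+)/;

    while ($content =~ m{
            \b(BT)\b                                   # start of a text object
          | ($num) \s+ ($num) \s+ Td\b                 # relative move of the text position
          | \( ( (?: \\. | [^\\()] )* ) \) \s* Tj\b    # literal string shown with Tj
        }gxs) {
        if (defined $1) {
            ($x, $y) = (0, 0);       # every text object starts at the origin
        }
        elsif (defined $2) {
            $x += $2;                # Td offsets are relative, so accumulate them
            $y += $3;
        }
        else {
            my $text = $4;
            $text =~ s/\\([\\()])/$1/g;   # undo escaped parentheses and backslashes
            push @found, [$x, $y, $text];
        }
    }
    return @found;
}

# Usage with a stream shaped like the example above
my $stream = "BT\n9.9213 0 Td\n(Content) Tj\n0 -12 Td\n(Next line) Tj\nET\n";
printf "%-12s x=%s y=%s\n", $_->[2], $_->[0], $_->[1]
    for extract_text_positions($stream);
```

Because the offsets are accumulated, the second string ends up one line below the first even though its own Td only says "0 -12".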

UPDATE: This gives you a hash reference describing your page:

$doc->getPageContentTree($pagenum)


Re^2: How to Extract PDF tables using Perl by LanX (Saint) on May 27, 2016 at 13:22 UTC
Td means table data like in HTML, and "Tj" is the cell's text??? The PDF you are parsing seems to have preserved semantic information; I suppose this approach depends on the way it was generated. I doubt this is generally true. (?)
Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!
Re^3: How to Extract PDF tables using Perl by ablanke (Monsignor) on May 27, 2016 at 13:38 UTC
Yes, that's right. It always depends on the way the PDF was generated (some PDF tools even position every single character). Maybe the `getPageContentTree` method helps to build a more general solution. The example is based on the solution I've seen.
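
Once text fragments with positions are available, however they were obtained, one possible way to get back to table rows is to group fragments that share roughly the same y value and sort each group by x. The sketch below assumes a list of [x, y, text] array refs as input; the name group_into_rows and the tolerance of 2 units are made up for this example. It does not merge single characters back into words, so PDFs that position every character would need an extra step.

```perl
use strict;
use warnings;

# Hypothetical helper: turn [x, y, text] fragments into rows of cell texts.
# A fragment joins the current row if its y is within $tolerance of the
# y of the first fragment that opened that row.
sub group_into_rows {
    my ($fragments, $tolerance) = @_;
    $tolerance //= 2;
    my @rows;

    # PDF y coordinates grow upwards, so sort descending to read top to bottom.
    for my $frag (sort { $b->[1] <=> $a->[1] } @$fragments) {
        if (@rows && abs($rows[-1]{y} - $frag->[1]) <= $tolerance) {
            push @{ $rows[-1]{cells} }, $frag;
        }
        else {
            push @rows, { y => $frag->[1], cells => [$frag] };
        }
    }

    # Within each row, order the cells left to right and keep only the text.
    my @table;
    for my $row (@rows) {
        my @cells = sort { $a->[0] <=> $b->[0] } @{ $row->{cells} };
        push @table, [ map { $_->[2] } @cells ];
    }
    return @table;
}

# Usage: four fragments forming a 2x2 table, given in arbitrary order
my @fragments = ([100, 700, 'B1'], [10, 700.5, 'A1'], [10, 680, 'A2'], [100, 680, 'B2']);
print join(' | ', @$_), "\n" for group_into_rows(\@fragments);
```

How well this works depends on the tolerance and on the generating tool; a table whose cells wrap over several lines would be split into several rows.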
Re^4: How to Extract PDF tables using Perl by LanX (Saint) on May 27, 2016 at 15:51 UTC
Thanks, that's interesting ... I'll give it a try next time I need to parse PDF. :)
Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!