Re: How to Extract PDF tables using Perl

by ablanke (Monsignor)
on May 27, 2016 at 12:58 UTC


in reply to How to Extract PDF tables using Perl

Hi, the solution I've seen is to use:
$doc->getPageContent($pagenum);
instead of:
$doc->getPageText($pagenum);
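To see the difference between the two calls, here is a minimal sketch that fetches both views of one page. It assumes $doc is a CAM::PDF object (the module these methods belong to); compare_page_views is only a hypothetical helper name, and the file name and page number come from the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: return both views of a page for comparison.
# $doc is assumed to be a CAM::PDF object.
sub compare_page_views {
    my ($doc, $pagenum) = @_;
    return {
        text    => $doc->getPageText($pagenum),     # plain text, layout lost
        content => $doc->getPageContent($pagenum),  # raw stream, operators kept
    };
}

# Only runs when a PDF path is given: perl script.pl file.pdf [page]
if (@ARGV) {
    require CAM::PDF;
    my ($file, $pagenum) = @ARGV;
    $pagenum ||= 1;
    my $doc = CAM::PDF->new($file) or die "Cannot open $file\n";
    my $views = compare_page_views($doc, $pagenum);
    print "--- getPageText ---\n",    $views->{text}    // '', "\n";
    print "--- getPageContent ---\n", $views->{content} // '', "\n";
}
```

The second dump is the one that still carries the positioning operators, which is what you need for tables.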

Even though the solution sounds simple, there is still work for you to do.

You will have to parse the return value of getPageContent.

Here is a possible example of page content:

9.9213 0 Td Content Tj

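The two numbers before the Td operator tell you the position of the content (strictly speaking, an x/y offset from the start of the current line), and Tj shows the text.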
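As a rough sketch of that parsing step, the snippet below pulls positioned text out of a simplified content stream and groups it into rows. Several things here are assumptions: the helper names extract_positioned_text and group_into_rows are made up, the sample stream is invented, strings are assumed to be written as literals in parentheses, and only the BT, Td and Tj operators are handled. Real streams can also use Tm, TD, T*, TJ arrays, hex strings, nested parentheses and font-specific encodings, none of which are covered.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Hypothetical helper: collect { x, y, text } fragments from a simplified
# content stream. Td offsets are accumulated; BT resets the position.
sub extract_positioned_text {
    my ($content) = @_;
    my $num = qr/[-+]?(?:\d+\.?\d*|\.\d+)/;
    my ($x, $y) = (0, 0);
    my @fragments;

    while ($content =~ m{
          (?<![\w.]) ($num) \s+ ($num) \s+ Td \b     # "tx ty Td"
        | \( ( (?: [^()\\] | \\. )* ) \) \s* Tj \b   # "(string) Tj"
        | \b (BT) \b                                 # start of a text object
    }gxs) {
        if (defined $1) {
            $x += $1;
            $y += $2;
        }
        elsif (defined $3) {
            (my $text = $3) =~ s/\\([()\\])/$1/g;    # undo \( \) \\ escapes
            push @fragments, { x => $x, y => $y, text => $text };
        }
        else {
            ($x, $y) = (0, 0);
        }
    }
    return @fragments;
}

# Hypothetical helper: fragments whose y rounds to the same value form a row;
# rows run top to bottom (descending y), cells left to right (ascending x).
sub group_into_rows {
    my (@fragments) = @_;
    my %rows;
    push @{ $rows{ floor($_->{y} + 0.5) } }, $_ for @fragments;
    return map  { [ map { $_->{text} } sort { $a->{x} <=> $b->{x} } @{ $rows{$_} } ] }
           sort { $b <=> $a } keys %rows;
}

# Invented sample stream: two rows with two cells each
my $stream = 'BT 9.9213 0 Td (Item) Tj 100 0 Td (Price) Tj '
           . '-100 -14 Td (Apple) Tj 100 0 Td (1.20) Tj ET';

print join(' | ', @$_), "\n" for group_into_rows(extract_positioned_text($stream));
```

Rounding y to the nearest unit is a crude way to decide what counts as the same row; a PDF that positions every single character will need a tolerance and some merging of neighbouring fragments instead.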

UPDATE: This gives you a hash ref of your page:
$doc->getPageContentTree($pagenum)

Replies are listed 'Best First'.
Re^2: How to Extract PDF tables using Perl
by LanX (Saint) on May 27, 2016 at 13:22 UTC
    Td means table data like in HTML, and "Tj" is the cell's text???

    The PDF you are parsing seems to have preserved semantic information, I suppose this approach depends on the way it was generated.

    I doubt this is generally true. (?)

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      Yes, that's right.

      It always depends on how the PDF was generated (some PDF tools even position every single character).

      Maybe the getPageContentTree method helps to build a more general solution.

      The example is based on the solution I've seen.

        Thanks, that's interesting ... I'll give it a try next time I need to parse PDF. :)

        Cheers Rolf
