Re: Extracting information from a PDF file

by Perlbotics (Archbishop)
on Aug 20, 2008 at 21:59 UTC


in reply to [Updated] Extracting information from a PDF file

I am not aware of a CPAN module that offers something like an extract_table(page => 42, row => 1, column => 3); method. Creating one wouldn't be easy, since the PDF operators are more like plotter commands drawing on a sheet of paper; there is no markup like a <TABLE> in HTML that defines an embedded object.

Are your PDF files generated automatically, that is to say, in a repeatable fashion? I once managed to extract table-based information from a series of automatically generated PDF files after converting them into PostScript using pdftops (not pdf2ps) and applying some heuristics. Quite a game of chance... but maybe it works for you too?

Same approach: CAM::PDF comes with a tool, rewritepdf.pl, which can decompress the internal object streams (-d switch). Analysing the decompressed PDF file might give some hints. A typical table ENTRY might be embedded like this:

40 0 Td          <-- x, y offset   (Td: move text position)
(ENTRY)Tj        <-- ENTRY         (Tj: show text)

The Wikipedia entry for PDF links to "Portable Document Format: An Introduction for Programmers", which offers a lightweight introduction and a table of common PDF operators.
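
To make the Td/Tj pattern concrete, here is a minimal Perl sketch (not taken from CAM::PDF or any other module) that scans an already decompressed content stream for those two operators and groups the strings into rows by their y coordinate. The sample stream, the sub names text_positions and rows_by_y, and the exact-match row grouping are all assumptions for illustration. Since Td offsets are relative to the start of the current line, the sketch sums them up inside each BT ... ET block. A real file will likely also use Tm, TD, TJ, hex strings or custom font encodings, none of which are handled here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample of a decompressed content stream (not from a real file).
my $stream = <<'END';
BT
/F1 10 Tf
72 700 Td
(Name)Tj
100 0 Td
(Price)Tj
-100 -14 Td
(Apple)Tj
100 0 Td
(1.20)Tj
ET
END

# Return a list of [x, y, text] triples.
# Assumption: only BT, "tx ty Td" and "(string)Tj" are recognised.
sub text_positions {
    my ($content) = @_;
    my ( $x, $y ) = ( 0, 0 );
    my @found;
    while (
        $content =~ m{
            \bBT\b
          | (-?\d*\.?\d+) \s+ (-?\d*\.?\d+) \s+ Td\b
          | \( ( (?: \\. | [^\\()] )* ) \) \s* Tj\b
        }gx
      )
    {
        if ( defined $3 ) {
            # Literal string shown at the current position (escapes left as-is).
            push @found, [ $x, $y, $3 ];
        }
        elsif ( defined $1 ) {
            # Td moves relative to the start of the current line.
            $x += $1;
            $y += $2;
        }
        else {
            # BT starts a new text object: reset the position.
            ( $x, $y ) = ( 0, 0 );
        }
    }
    return @found;
}

# Group triples that share the same y into rows, top of page first,
# left to right within a row. Each row is an array ref of strings.
sub rows_by_y {
    my @items = @_;
    my %row;
    push @{ $row{ $_->[1] } }, $_ for @items;
    my @rows;
    for my $key ( sort { $b <=> $a } keys %row ) {
        my @cells = sort { $a->[0] <=> $b->[0] } @{ $row{$key} };
        push @rows, [ map { $_->[2] } @cells ];
    }
    return @rows;
}

my @rows = rows_by_y( text_positions($stream) );
print join( "\t", @$_ ), "\n" for @rows;
```

For the made-up sample this prints two tab-separated rows: Name and Price, then Apple and 1.20. On real files the y values of one table row rarely match exactly, so some tolerance when grouping would probably be needed. That is where the "game of chance" starts.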

Update: argl, it's rewritepdf.pl

Re^2: Extracting information from a PDF file
by Lawliet (Curate) on Aug 20, 2008 at 22:08 UTC
    "Are your PDF files generated automatically?"

    Nope, it is just one ill-made file.

    I'll look into rewritepdf.pl, as well as the article on Wikipedia.

    Update: Hmm, I get the same sort of output I get when trying to print the page's content. Example:

    Is it encoded unconventionally? Not sure what to do now.

    I'm so adjective, I verb nouns!

    chomp; # nom nom nom

      Hm, ... worst-case scenario: Your table is an embedded image. But I cannot judge that from your update. It might be a logo and the data is still somewhere...?

      Other options: OCR document / contact author.

        I converted it to an HTML document. Much easier to parse :P

