http://qs321.pair.com?node_id=705597

Lawliet has asked for the wisdom of the Perl Monks concerning the following question:

To put it bluntly, I need to extract data from a PDF file.

More specifically, inside this two-page PDF file lies a multi-row table with two or three columns (the count varies). Despite the oddly formatted table (you would have to see the document to understand what I mean, I guess), I believe I can parse it given the right module. The only one I see that may help is CAM::PDF. Do you know of anything more helpful for parsing PDF tables? Should I convert it to a separate file format and go from there?

Update: Decided to just convert it to an HTML document (thanks, Popcorn Dave), but thanks to all who helped. I am still willing to listen to any further suggestions if you have them, though.

I'm so adjective, I verb nouns!

chomp; # nom nom nom

Re: Extracting information from a PDF file
by Perlbotics (Archbishop) on Aug 20, 2008 at 21:59 UTC

    I am not aware of a CPAN module that offers something like an extract_table(page => 42, row => 1, column => 3) method. Creating one wouldn't be easy, since the PDF operators are more like plotter commands drawing on a sheet of paper: there is no markup like a <TABLE> in HTML that defines the table as an embedded object.

    Are your PDF files generated automatically, that is to say in a repeatable fashion? I once managed to extract table-based information from a series of automatically generated PDF files by converting them to PostScript with pdftops (not pdf2ps) and applying some heuristics. Quite a game of chance... but maybe it works for you too?

    A similar approach: CAM::PDF comes with a tool, rewritepdf.pl, which can decompress the internal object streams (-d switch). Analysing the decompressed PDF file might give some hints. A typical table ENTRY might be embedded like this:

    40 0 Td          <-- x, y offset   (Td: move text position)
    (ENTRY)Tj        <-- ENTRY         (Tj: show text)
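
    The Td/Tj pattern above can be turned into a rough heuristic. What follows is only a minimal sketch, assuming the decompressed content stream positions text with plain Td and Tj operators and nothing else (no Tm, TD, T*, TJ or font-specific encodings). The helper names extract_text_positions and group_rows are made up for illustration and are not part of CAM::PDF; the sample stream is invented as well.

```perl
use strict;
use warnings;

# Hypothetical helper: scan a decompressed content stream and record each
# string shown with Tj together with the position accumulated from the
# preceding Td offsets. BT starts a new text object and resets the position.
sub extract_text_positions {
    my ($stream) = @_;
    my ($x, $y) = (0, 0);
    my @entries;
    while ($stream =~ /(-?\d*\.?\d+)\s+(-?\d*\.?\d+)\s+Td\b|\(((?:[^()\\]|\\.)*)\)\s*Tj\b|\b(BT)\b/g) {
        if (defined $4) {                 # BT: reset the text position
            ($x, $y) = (0, 0);
        }
        elsif (defined $3) {              # (string) Tj: show text at current position
            push @entries, { x => $x, y => $y, text => $3 };
        }
        else {                            # tx ty Td: offset is relative to the current line
            $x += $1;
            $y += $2;
        }
    }
    return @entries;
}

# Hypothetical helper: treat entries sharing a y coordinate as one table row,
# order rows top to bottom (larger y first) and cells left to right.
sub group_rows {
    my @entries = @_;
    my %rows;
    push @{ $rows{ $_->{y} } }, $_ for @entries;
    return map {
        [ map { $_->{text} } sort { $a->{x} <=> $b->{x} } @{ $rows{$_} } ]
    } sort { $b <=> $a } keys %rows;
}

# Invented sample stream with a two-column, two-row layout.
my $stream = <<'END';
BT
/F1 10 Tf
40 700 Td
(Name) Tj
100 0 Td
(Qty) Tj
-100 -14 Td
(Widget) Tj
100 0 Td
(3) Tj
ET
END

print join("\t", @$_), "\n" for group_rows(extract_text_positions($stream));
```

    Grouping by exact y value is fragile: real files often place cells of one row at slightly different heights, so some tolerance would probably be needed. The input could be the file written by rewritepdf.pl -d, or, if I read the CAM::PDF documentation correctly, the string returned by its getPageContent method.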
    
    The Wikipedia entry for PDF provides a link to "Portable Document Format: An Introduction for Programmers" which provides a lightweight introduction and a table with common PDF operators.

    Update: argl, it's rewritepdf.pl

      "Are your PDF files generated automatically?"

      Nope, it is just one ill-made file.

      I'll look into rewritepdf.pl, as well as the article on Wikipedia.

      Update: Hmm, I get the same sort of output I get when trying to print the page's content. Example:

      Is it encoded unconventionally? Not sure what to do now.

        Hm, ... worst-case scenario: Your table is an embedded image. But I cannot judge that from your update. It might be a logo and the data is still somewhere...?

        Other options: OCR document / contact author.
Re: Extracting information from a PDF file
by Popcorn Dave (Abbot) on Aug 20, 2008 at 20:31 UTC
    There is a non-Perl way to do it, depending on what you're after and how many files you have. Adobe's website offers PDF conversion as a free online service, or at least it used to, so if you're having problems you might check that out as well.


    Revolution. Today, 3 O'Clock. Meet behind the monkey bars.

    I would love to change the world, but they won't give me the source code

      IIRC, Gmail will parse PDF attachments out for display as HTML too... I think I'm remembering right. It was a few months ago that I was playing with it and Adobe's service was either super slow or down. Obviously can't speak to the parse quality.

      Just one file. I need to upload the data I extract to a database. I'll try and see what I can find on their website.

      Update: Do you mean they can extract information or convert the file? :\

        Been a while since I needed it, but as I remember you give them a link to your file and then they send you back the text via e-mail. Hopefully it will do what you want.

