[Updated] Extracting information from a PDF file

Lawliet has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Extracting information from a PDF file by Perlbotics (Archbishop) on Aug 20, 2008 at 21:59 UTC
I am not aware of a CPAN module that offers something like an `extract_table(page => 42, row => 1, column => 3);` method. Creating one wouldn't be easy, since the PDF operators are more like plotter commands drawing on a sheet of paper; there is no markup like a `<TABLE>` in HTML that defines an embedded object.

Are your PDF files generated automatically, that is to say in a repeatable fashion? I once managed to extract table-based information from a series of automatically generated PDF files after converting them into PostScript using pdftops (not pdf2ps) and applying some heuristics. Quite a game of chance... but maybe it works for you too?

Same approach: CAM::PDF comes with a tool, rewritepdf.pl, which can decompress the internal object streams (-d switch). Analysing the decompressed PDF file might give some hints. A typical table ENTRY might be embedded like this:

    40 0 Td    <-- x, y position (Td: go to text position)
    (ENTRY)Tj  <-- ENTRY (Tj: show text)

The Wikipedia entry for PDF provides a link to "Portable Document Format: An Introduction for Programmers", which gives a lightweight introduction and a table of common PDF operators.

Update: argl, it's rewritepdf.pl
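To illustrate the kind of heuristic described above, here is a minimal sketch in Python that scans an already decompressed content stream for `Td` and `Tj` operators and groups the shown strings into rows by their y coordinate. It rests on several assumptions that a real file may not meet: only `BT`, `Td` and `Tj` are handled (no `Tm`, `TJ`, `'` or `T*`), strings are simple literals whose escapes are left as-is, and the font uses an encoding in which the bytes are readable text. The function names, the sample stream and the `y_tolerance` parameter are hypothetical. Note that `Td` offsets are relative to the start of the current line, so the sketch accumulates them.

```python
import re

# One pattern, three alternatives:
#   BT                 begin a text object (resets the text position)
#   tx ty Td           move by (tx, ty) relative to the start of the current line
#   (string) Tj        show a literal string at the current position
TOKEN = re.compile(
    r"(?P<bt>\bBT\b)"
    r"|(?<![\w.])(?P<tx>-?\d*\.?\d+)\s+(?P<ty>-?\d*\.?\d+)\s+Td\b"
    r"|\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj\b"
)

def extract_text_positions(stream):
    """Return a list of (x, y, text) for every Tj found in the stream."""
    x = y = 0.0
    found = []
    for m in TOKEN.finditer(stream):
        if m.group("bt"):
            x = y = 0.0                      # new text object: start at the origin
        elif m.group("tx") is not None:
            x += float(m.group("tx"))        # Td offsets accumulate
            y += float(m.group("ty"))
        else:
            found.append((x, y, m.group("text")))  # escapes are not decoded
    return found

def rows_from_positions(items, y_tolerance=2.0):
    """Group items with (nearly) equal y into rows, top to bottom, left to right."""
    rows = []  # each entry: (y of the row, [(x, text), ...])
    for x, y, text in sorted(items, key=lambda t: (-t[1], t[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]

# Hypothetical two-row, two-column table as it might appear after decompression.
sample = """
BT
/F1 10 Tf
72 700 Td
(Name) Tj
100 0 Td
(Qty) Tj
-100 -14 Td
(Widget) Tj
100 0 Td
(3) Tj
ET
"""

table = rows_from_positions(extract_text_positions(sample))
print(table)  # [['Name', 'Qty'], ['Widget', '3']]
```

If the strings that come out look like gibberish, the font probably uses a custom encoding or the text is drawn as an image, and this approach will not help without further decoding.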
Re^2: Extracting information from a PDF file by Lawliet (Curate) on Aug 20, 2008 at 22:08 UTC
"Are your PDF files generated automatically?" Nope, it is just one ill-made file. I'll look into rewritepdf.pl, as well as the article on Wikipedia.

Update: Hmm, I get the same sort of output I get when trying to print the page's content. Example: (10 kB of output, not shown) Is it encoded unconventionally? Not sure what to do now.

I'm so adjective, I verb nouns! chomp; # nom nom nom
Re^3: Extracting information from a PDF file by Perlbotics (Archbishop) on Aug 20, 2008 at 22:46 UTC
Hm, ... worst-case scenario: Your table is an embedded image. But I cannot judge that from your update. It might be a logo and the data is still somewhere...? Other options: OCR document / contact author.
Re^4: Extracting information from a PDF file by Lawliet (Curate) on Aug 20, 2008 at 22:54 UTC
Re: Extracting information from a PDF file by Popcorn Dave (Abbot) on Aug 20, 2008 at 20:31 UTC
There is a non-Perl way to do it, depending on what you're after and how many files you have. Adobe's website offers PDF-to-text conversion as a free service, or at least it used to, so if you're having problems you might check that out as well.

Revolution. Today, 3 O'Clock. Meet behind the monkey bars. I would love to change the world, but they won't give me the source code
Re^2: Extracting information from a PDF file by Your Mother (Archbishop) on Aug 20, 2008 at 22:30 UTC
IIRC, Gmail will parse PDF attachments out for display as HTML too... I think I'm remembering right. It was a few months ago that I was playing with it and Adobe's service was either super slow or down. Obviously can't speak to the parse quality.
Re^2: Extracting information from a PDF file by Lawliet (Curate) on Aug 20, 2008 at 20:35 UTC
Just one file. I need to upload the data I extract to a database. I'll try and see what I can find on their website.

Update: Do you mean they can extract information or convert the file? :\

I'm so adjective, I verb nouns! chomp; # nom nom nom
Re^3: Extracting information from a PDF file by Popcorn Dave (Abbot) on Aug 20, 2008 at 22:26 UTC
Been a while since I needed it, but as I remember you give them a link to your file and then they send you back the text via e-mail. Hopefully it will do what you want.

Revolution. Today, 3 O'Clock. Meet behind the monkey bars. I would love to change the world, but they won't give me the source code
Re^4: Extracting information from a PDF file by Lawliet (Curate) on Aug 20, 2008 at 22:29 UTC

