mitd,
In the past I have used CAM::PDF to deal with PDF files. Which tool fits best depends on exactly what information you want to parse or extract, so feel free to post detailed examples of what you are trying to achieve. Also check out the documentation and examples that ship with CAM::PDF.
Hope this helps,
Martin
It's not part of CPAN, but on the recommendation of several monks I have had a lot of success with pdftotext, part of the open source Xpdf project. It extracts the text from a PDF into a plain ASCII file.
Your mileage may vary, of course, depending on what you want or need to do, but if you are doing text extraction, I heartily recommend pdftotext.
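For anyone who wants to drive pdftotext from a script, here is a minimal sketch in Python (not Perl, purely for illustration). It assumes the pdftotext binary from Xpdf is on your PATH. The helper names `pdftotext_command` and `extract_text` are made up for this example, and the default output encoding is an assumption that you may need to change for your build.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Build the argument list for a pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical page layout
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text(pdf_path, txt_path, keep_layout=False, encoding="latin-1"):
    """Run pdftotext and return the extracted text.

    The encoding default is an assumption; adjust it to whatever
    your pdftotext build actually writes.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
    with open(txt_path, encoding=encoding, errors="replace") as handle:
        return handle.read()
```

Calling `extract_text("report.pdf", "report.txt")` would write the text file and hand back its contents; both file names are placeholders.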
--
tbone1, YAPS (Yet Another Perl Schlub)
And remember, if he succeeds, so what.
- Chick McGee
Also check out PDF::Reuse. Its source code is quite obscure and binary-stream oriented, but it does what it says. It lets you extract and insert text, images, barcodes, single pages, ...
Its interface is procedural (many functions and no main object) rather than object-oriented.
In the end I found it didn't suit my needs, and I decided to contribute to PDF::ReportWriter, which does other things.
I have used another non-module approach: http://pdftohtml.sourceforge.net . It translates PDF to XML or HTML. The XML isn't valid, but it is not difficult to fix. This code is also based on xpdf.
I like this approach because it gives me a bunch of text box strings with their bounding box coordinates, which I then sort by location. This is important for me because the documents that I parse tend to have an irregular 'document order.'
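The sort-by-location step can be sketched roughly as follows, in Python for illustration. The sample XML is hand-written and only assumes that each text box arrives as a `<text>` element with `top` and `left` attributes inside a `<page>` element; check this against your own pdftohtml output, and repair the XML first if it does not parse. The function name `boxes_in_reading_order` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed layout; the boxes are deliberately
# out of reading order.
SAMPLE_XML = """<pdf2xml>
<page number="1" height="792" width="612">
<text top="100" left="300" width="80" height="12" font="0">right column</text>
<text top="40" left="50" width="120" height="14" font="0">Title</text>
<text top="100" left="50" width="90" height="12" font="0">left column</text>
</page>
</pdf2xml>"""


def boxes_in_reading_order(xml_text):
    """Return the text box strings sorted by page, then top, then left."""
    root = ET.fromstring(xml_text)
    boxes = []
    for page in root.iter("page"):
        page_no = int(page.get("number", "0"))
        for node in page.iter("text"):
            top = int(node.get("top"))
            left = int(node.get("left"))
            # itertext() also picks up text nested in inline tags such as <b>
            boxes.append((page_no, top, left, "".join(node.itertext())))
    # Sort on the position fields only, never on the string itself
    boxes.sort(key=lambda box: box[:3])
    return [box[3] for box in boxes]


print(boxes_in_reading_order(SAMPLE_XML))
```

A plain top-then-left sort like this reads straight across the page, so multi-column documents would need the boxes grouped by column first.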
I have also found PDF tips and tricks on the mostly commercial http://www.pdfzone.com site.
It should work perfectly the first time! - toma