Reading PDF files

Helter has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Reading PDF files by Lachesis (Friar) on Jul 21, 2003 at 14:19 UTC
You could try converting to text or html first and then parsing. Have a look at Online Conversion Tools
Re: Re: Reading PDF files by Helter (Chaplain) on Jul 21, 2003 at 15:45 UTC
This should be pretty easy to work into a script: get the name of the URL, tack it onto the end, and I can automate the process. Thanks!
Re: Reading PDF files by traveler (Parson) on Jul 21, 2003 at 14:40 UTC
PDF::API2 has a `stringify` method to extract the text from a PDF. It is a very easy-to-use module. --traveler
Re: Re: Reading PDF files by Helter (Chaplain) on Jul 21, 2003 at 15:14 UTC
Giving this a whirl here at work, it seems that the PDFs on the site I'm trying to work with are malformed. I can open PDF files from other sites, but not the one I need to. `Malformed PDF file PDF::API2::IOString=GLOB(0x2252cc) at C:/Perl/site/lib/PDF/API2/PDF/FileAPI.pm line 84.` Acrobat must be less picky than this module, as I can view them just fine in Reader. It also seems the `stringify` function does not parse out the text; it looks like I get the same output I would get from a plain `open()` call. Thanks for the suggestion.
Re: Re: Re: Reading PDF files by Anonymous Monk on Jul 22, 2003 at 20:57 UTC
I have had similar problems with various PDF->text utilities; they work for some PDFs, but not all. PDF::API2 is still in constant development; the sourceforge page (or somewhere near it) usually has much more recent versions than you would get from CPAN. Building the latest version might get around the error you're getting. I use 0.3d67, which probably isn't even the most recent any more.
Re: Reading PDF files by snadra (Scribe) on Jul 21, 2003 at 14:19 UTC
Hello, This is only an assumption, since I am at work and cannot check whether it works... But you may want to try this: get the CPAN module HTML::HTMLDoc::PDF. It has a method called to_string which seems to do what you want. `print $pdf->to_string();` I have no clue how it handles images that are inside the PDF. snadra
Re: Re: Reading PDF files by Helter (Chaplain) on Jul 21, 2003 at 15:03 UTC
I don't think this does what you think it does. From the documentation: `This Module is the result of a HTML::HTMLDoc PDF-generation.` So I think if you have generated the PDF object (from HTML) and want to write it to a file, this is the function you would use. I already have a PDF file, and want to get text, not PDF-formatted output. Thanks for the effort!
Re: Reading PDF files by jmanning2k (Pilgrim) on Jul 21, 2003 at 15:06 UTC
As an alternative to parsing the PDF file, if you just need some values from the document, and don't need to modify the original PDF, try running 'pdf2ascii' on the file, and then parsing the resulting plain text. Most pdf files have the actual text, not an image of the text as suggested above. (I'm not saying it isn't possible, just unlikely.)
Re: Re: Reading PDF files by Willard B. Trophy (Hermit) on Jul 21, 2003 at 15:35 UTC
Yet another (but similar) way to do it: use pdftohtml's XML output mode, and parse that. This has the advantage that it stores position information for the text, and it writes the strings out in the order they were rendered on the page. This can be quite helpful. pdftohtml uses the internals of xpdf to do the work. xpdf comes with the pdftotext tool, which might do all you need. If none of the above works -- and some PDFs do very weird things with font encoding -- if you install the DjVuLibre application, and run your PDF through the Any2DjVu converter, it will do real OCR, the text of which you can extract with the djvused tool. All this is moot, of course, if the terms of use of the original file forbid anything other than reading the document on the screen. Many financial institutions use PDF for its "read only" (for the casual user) nature. -- bowling trophy thieves, die!
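For the pdftohtml XML route, here is a rough sketch of pulling out the strings together with their positions. It assumes the XML consists of `<text top="..." left="..." ...>string</text>` elements, possibly with inline `<b>`/`<i>` markup inside; check that against what your pdftohtml version actually writes. `parse_text_elements` is a made-up name, and a regex stands in for a real XML parser only to keep the example free of non-core modules.

```perl
use strict;
use warnings;

# Extract position and string from each <text ...>...</text> element.
# The element layout is an assumption about pdftohtml's XML output.
sub parse_text_elements {
    my ($xml) = @_;
    my @items;
    while ($xml =~ m{<text\b([^>]*)>(.*?)</text>}gs) {
        my ($attrs, $body) = ($1, $2);

        # Turn top="10" left="20" ... into a hash of attribute => value.
        my %attr = $attrs =~ /(\w+)="([^"]*)"/g;

        # Drop inline markup such as <b> or <i>, then decode the basic
        # entities (ampersand last, so nothing is decoded twice).
        $body =~ s/<[^>]+>//g;
        $body =~ s/&lt;/</g;
        $body =~ s/&gt;/>/g;
        $body =~ s/&quot;/"/g;
        $body =~ s/&amp;/&/g;

        push @items, { top => $attr{top}, left => $attr{left}, text => $body };
    }
    return @items;
}
```

The returned list keeps the order in which the elements appear in the file; sorting the hashes on `top` and then `left` is one way to approximate reading order if the rendering order turns out not to match it.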
Re: Re: Reading PDF files by Helter (Chaplain) on Jul 21, 2003 at 15:26 UTC
On my Linux box I have pdf2ps and ps2ascii, but both the PDF files under inspection and the other sample I have break when I go through the motions: the first during "open", the second in the ps2ascii conversion. I can't find this program for a Windows machine; is it just a part of the gs distribution? Thanks.
Re: Re: Re: Reading PDF files by jmanning2k (Pilgrim) on Jul 21, 2003 at 15:48 UTC
Yes, it comes with ghostscript. It's also available with ghostscript/gview for win32. You can probably find a Perl replacement for ps2ascii if you don't want to install that too. I'm not sure why it would break on your PDFs, other than the PDF having some embedded security or binary parts.
Re: Reading PDF files by allolex (Curate) on Jul 21, 2003 at 15:21 UTC
Hi, I did something like this by first converting the PDF to text using the tool pdftotext, which gets decent output. There is also pdftohtml, which does HTML. That might be easier to parse. I'm not sure what info pdftohtml saves that pdftotext strips, but I assume there's a difference. BTW, both tools are available for *nix and Windows. Cheerio, -- Allolex
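A minimal sketch of driving pdftotext from Perl, assuming the xpdf `pdftotext` binary is on the PATH. The sub names `pdftotext_command` and `pdf_to_text` are invented for illustration. The only pdftotext features used are the `-layout` option and `-` as the output name, which sends the text to standard output.

```perl
use strict;
use warnings;

# Build the argument list for pdftotext. Passing '-' as the output
# file asks pdftotext to write the text to standard output.
sub pdftotext_command {
    my ($pdf_path, %opt) = @_;
    my @cmd = ('pdftotext');
    push @cmd, '-layout' if $opt{layout};   # keep the physical page layout
    push @cmd, $pdf_path, '-';
    return @cmd;
}

# Run pdftotext and return everything it prints as one string.
# Dies if the program cannot be started or exits with an error.
sub pdf_to_text {
    my ($pdf_path, %opt) = @_;
    my @cmd = pdftotext_command($pdf_path, %opt);

    # List-form pipe open: no shell is involved, so file names with
    # spaces or quotes need no escaping.
    open(my $fh, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    local $/;                # slurp mode
    my $text = <$fh>;
    close($fh) or die "$cmd[0] failed on $pdf_path (exit status $?)";
    return $text;
}
```

With that in place, `my $text = pdf_to_text('rates.pdf', layout => 1);` would hand back the text for ordinary regex or `split` work (`rates.pdf` being a placeholder file name). Whether `-layout` helps or hurts depends on the document; tables usually survive better with it.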
Re: Re: Reading PDF files by Helter (Chaplain) on Jul 21, 2003 at 15:42 UTC
This looks to output the same information (and pretty much the same format) as the web based tools above, except in text format. Here's a link to a *nix version: http://www.foolabs.com/xpdf/download.html
Re: Reading PDF files by Popcorn Dave (Abbot) on Jul 21, 2003 at 16:04 UTC
I actually did something similar with California sales tax tables. What I did was to use Adobe's online conversion tool via LWP to grab a text version of what I wanted. However I discovered that their tool indeed does convert things to text, but not always as contiguous text. Cities like San Luis Obispo came through as: San Luis Obispo. So that's something to be aware of using the Adobe converter. I haven't tried the other techniques so I can't say how things would be converted. HTH! Update: The link to Adobe's online conversion tool. There is no emoticon for what I'm feeling now.
Re: Reading PDF files by RollyGuy (Chaplain) on Jul 21, 2003 at 14:02 UTC
Disclaimer: I don't know the answer. However, I would like to add a word of caution. PDFs can store information as images as well, so if you are trying to parse a PDF of images of text, it will be very difficult and quite a different problem than parsing. Enjoy.
Re: Re: Reading PDF files by Helter (Chaplain) on Jul 21, 2003 at 14:18 UTC
When I open the PDF using Acrobat, I can use the text selection tool to grab some text, so unless they have some FAST OCR software running in there I don't believe it's an image, but a very good warning.
Re: Re: Re: Reading PDF files by waswas-fng (Curate) on Jul 21, 2003 at 15:42 UTC
Depending on the app that created the PDF and the settings/fonts used, you may end up with a PDF that is a bunch of font character bitmaps in sequence or blocked, with no underlying text information. Most Adobe applications will embed textual versions in the PDF so the text selection tool can be used to grab plain text from segments in the PDF. I think the point is that unless you can be certain how the PDFs are generated and you are comfortable with them, parsing data from them is going to be a large pain in the butt. -Waswas
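One rough way to tell the two cases apart before writing a parser is to run the file through a text extractor and see whether anything word-like comes back. This is only a heuristic sketch: `has_extractable_text` is a made-up helper, and the default threshold of 20 letters or digits is an arbitrary assumption, not an established rule.

```perl
use strict;
use warnings;

# Return 1 if the extracted text contains at least $min_chars letters
# or digits, 0 otherwise. A PDF made only of page images usually gives
# a text extractor little beyond whitespace and form feeds.
sub has_extractable_text {
    my ($text, $min_chars) = @_;
    $min_chars = 20 unless defined $min_chars;
    return 0 unless defined $text;

    # Count matches by forcing the global match into list context.
    my $count = () = $text =~ /[[:alnum:]]/g;
    return $count >= $min_chars ? 1 : 0;
}
```

A PDF with odd font encodings can still pass this check while producing gibberish, so it only rules out the image-only case; eyeballing a sample of the extracted text is still worthwhile.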
Re: Reading PDF files by The Mad Hatter (Priest) on Jul 21, 2003 at 16:11 UTC
I've never used it, but I've heard PDFLib is great (not sure if the Perl bindings are on CPAN). If you haven't looked at it already, you might want to check it out (particularly the alternative language bindings...Perl as an "alternative language", they must not get out much... ; ).

