PerlMonks  

Reading PDF files

by Helter (Chaplain)
on Jul 21, 2003 at 13:44 UTC [id://276272]

Helter has asked for the wisdom of the Perl Monks concerning the following question:

My father-in-law came to me yesterday and asked me (out of the blue and quite unexpectedly) to write a program for him that would take some information from a web page, do a few very simple calculations (a few divisions), and show him the result.

I said, "Sure that's easy, give me a day or two!".
Silly me.
The data he is looking at is stored as a PDF, and although I have installed the PDF::Parse package I can't get it to work. I have seen others say that it is buggy as well.

I've searched with Super Search, and all I can find is people creating PDF files.

I think if I could find an example of someone using the PDF::Parse package I could get it working, but Perl complains that certain functions used on the cpan documentation page don't exist (cut'n'paste). I would wager I'm doing something wrong, but I'm not sure of the intended usage.

Has anyone parsed PDF files, and if so, what method did you use? I would prefer a Perl-only solution, as I'm writing the code on a Linux box and he will be running it on a WinXP machine.

Also, the less I have to install the better. My father-in-law is a bright guy, but his computer usage is limited to browsing a few websites.

Thanks!

Replies are listed 'Best First'.
Re: Reading PDF files
by Lachesis (Friar) on Jul 21, 2003 at 14:19 UTC
      This should be pretty easy to work into a script: take the URL, tack it onto the end, and I can automate the process.

      Thanks!
Re: Reading PDF files
by traveler (Parson) on Jul 21, 2003 at 14:40 UTC
    PDF::API2 has a stringify method to extract the text from a pdf. It is a very easy to use module.

    --traveler

      Giving this a whirl here at work, it seems that the pdfs on the site I'm trying to work with are malformed. I can open pdf files from other sites, but not the one I need to.

      Malformed PDF file PDF::API2::IOString=GLOB(0x2252cc) at C:/Perl/site/lib/PDF/API2/PDF/FileAPI.pm line 84.
      Acrobat must be less picky than this module as I can view them just fine in reader.

      It also seems the stringify function does not parse out the text. It looks like I get the same output I would get from a plain open() call.

      Thanks for the suggestion.
        i have had similar problems with various PDF->text utilities; they work for some PDFs, but not all.

        PDF::API2 is still in constant development; usually there are much more recent versions available at the sourceforge page or near there, than you would get from CPAN. building the latest version might get around the error you're getting. i use 0.3d67, which probably isn't even the most recent any more.

Re: Reading PDF files
by snadra (Scribe) on Jul 21, 2003 at 14:19 UTC
    Hello,

    This is only an assumption, since I am at work and cannot check whether it works...
    But you may want to try this:
    Get the CPAN module HTML::HTMLDoc::PDF
    It has a method called to_string which seems to do what you want.
    print $pdf->to_string();
    I have no clue how it handles images that are inside the PDF.

    snadra
      I don't think this does what you think it does. From the documentation:
      This Module is the result of a HTML::HTMLDoc PDF-generation.
      So I think if you have generated the pdf object (from HTML) and want to write it to a file, this is the function you would use.

      I already have a PDF file, and want to get text, not PDF-formatted output.
      Thanks for the effort!

Re: Reading PDF files
by jmanning2k (Pilgrim) on Jul 21, 2003 at 15:06 UTC
    As an alternative to parsing the PDF file, if you just need some values from the document, and don't need to modify the original PDF, try running 'pdf2ascii' on the file, and then parsing the resulting plain text.

    Most pdf files have the actual text, not an image of the text as suggested above. (I'm not saying it isn't possible, just unlikely.)
      Yet another (but similar) way to do it: use pdftohtml's XML output mode, and parse that. This has the advantage that it stores position information for the text, and it writes the strings out in the order they were rendered on the page. This can be quite helpful.

      pdftohtml uses the internals of xpdf to do the work. xpdf comes with the pdftotext tool, which might do all you need.
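
      A rough sketch of the "parse the XML" idea, for what it is worth. It assumes the -xml output holds one <text top="..." left="...">string</text> element per text fragment; check that against what your pdftohtml build actually writes. The function name rows_from_pdftohtml_xml is made up for this example, and the regex matching is a stand-in for a real XML parser (entities such as &amp; are left undecoded).

```perl
use strict;
use warnings;

# Group <text> fragments into rows by their "top" coordinate and order each
# row left to right. ASSUMPTION: fragments look like
#   <text top="10" left="50" ...>string</text>
sub rows_from_pdftohtml_xml {
    my ($xml) = @_;
    my %fragments_at;    # top coordinate => list of [left, string]

    while ($xml =~ m{<text\b([^>]*)>(.*?)</text>}gs) {
        my ($attr_text, $content) = ($1, $2);
        my %attr = $attr_text =~ /(\w+)="([^"]*)"/g;
        next unless defined $attr{top} && defined $attr{left};

        $content =~ s/<[^>]+>//g;    # drop inline markup such as <b>...</b>
        push @{ $fragments_at{ $attr{top} } }, [ $attr{left}, $content ];
    }

    my @rows;
    for my $top (sort { $a <=> $b } keys %fragments_at) {
        my @ordered = sort { $a->[0] <=> $b->[0] } @{ $fragments_at{$top} };
        push @rows, join ' ', map { $_->[1] } @ordered;
    }
    return @rows;
}
```

      Fragments that share a "top" value come back as one line, which is usually what you want for a table row. Real files may need a small tolerance on "top", since text on the same visual line is not always placed at exactly the same coordinate.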

      If none of the above works -- and some PDFs do very weird things with font encoding -- if you install the DjVuLibre application, and run your PDF through the Any2DjVu converter, it will do real OCR, the text of which you can extract with the djvused tool.

      All this is moot, of course, if the terms of use of the original file forbid anything other than reading the document on the screen. Many financial institutions use PDF for its "read only" (for the casual user) nature.

      --
      bowling trophy thieves, die!

      On my Linux box I have pdf2ps and ps2ascii, but running them on the PDF files in question and on the other sample I have breaks both times. The first breaks during "open" and the second in the ps2ascii conversion.

      I can't find this program for a Windows machine. Is it just part of the gs distribution?

      Thanks.
        Yes, it comes with ghostscript. It's also available with ghostscript/gview for win32.

        You can probably find a perl replacement for ps2ascii if you don't want to install that too.

        I'm not sure why it would break on your PDFs, other than the PDF having some embedded security or binary parts.
Re: Reading PDF files
by allolex (Curate) on Jul 21, 2003 at 15:21 UTC

    Hi, I did something like this by first converting the PDF to text using the tool pdftotext, which gets decent output. There is also pdftohtml, which does HTML. That might be easier to parse. I'm not sure what info pdftohtml saves that pdftotext strips, but I assume there's a difference.

    BTW, both tools are available for *nix and Windows.
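
    A minimal sketch of that approach, assuming pdftotext is on the PATH. The helper names pdf_to_text and extract_numbers are made up for this example, and the number pattern is only a starting point for whatever the real document contains.

```perl
use strict;
use warnings;

# Pull every integer or decimal number out of a chunk of plain text.
sub extract_numbers {
    my ($text) = @_;
    return $text =~ /(-?\d+(?:\.\d+)?)/g;
}

# Run pdftotext and return its output as one string.
# The trailing "-" asks pdftotext to write to STDOUT instead of a file.
# NOTE: list-form pipe open may not be available on older Windows builds
# of Perl; backticks with careful quoting are the usual fallback there.
sub pdf_to_text {
    my ($pdf_path) = @_;
    open my $fh, '-|', 'pdftotext', '-layout', $pdf_path, '-'
        or die "Cannot run pdftotext: $!";
    local $/;                          # slurp the whole output at once
    my $text = <$fh>;
    $text = '' unless defined $text;
    close $fh or die "pdftotext failed on $pdf_path (exit status $?)";
    return $text;
}

# Usage: perl script.pl file.pdf
if (@ARGV) {
    print "$_\n" for extract_numbers(pdf_to_text($ARGV[0]));
}
```

    From there the divisions are ordinary Perl arithmetic on the extracted values.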

    Cheerio,

    --
    Allolex

Re: Reading PDF files
by Popcorn Dave (Abbot) on Jul 21, 2003 at 16:04 UTC
    I actually did something similar with California sales tax tables.

    What I did was to use Adobe's online conversion tool via LWP to grab a text version of what I wanted. However I discovered that their tool indeed does convert things to text, but not always as contiguous text. Cities like San Luis Obispo came through as:

    San Luis
    Obispo

    So that's something to be aware of using the Adobe converter. I haven't tried the other techniques so I can't say how things would be converted.
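
    One way to stitch such lines back together, as a sketch only. It assumes each record ends in a number (optionally followed by a percent sign) and that any line without one is the start of a wrapped record. The function name rejoin_records and the sample rates in the usage note are invented for illustration.

```perl
use strict;
use warnings;

# Merge consecutive lines until one ends in a number (optionally followed
# by "%"), and treat that line as the end of a record.
# ASSUMPTION: that is what a complete record looks like in the converted text.
sub rejoin_records {
    my (@lines) = @_;
    my (@records, @pending);

    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;       # trim surrounding whitespace
        next unless length $line;      # skip blank lines
        push @pending, $line;
        if ($line =~ /\d%?$/) {
            push @records, join ' ', @pending;
            @pending = ();
        }
    }
    push @records, join ' ', @pending if @pending;   # keep an unterminated tail
    return @records;
}
```

    Called as rejoin_records("San Luis", "Obispo 7.25%", "Fresno 7.875%"), it hands back two records, with "San Luis Obispo 7.25%" as the first.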

    HTH!

    Update: The link to Adobe's online conversion tool

    There is no emoticon for what I'm feeling now.

Re: Reading PDF files
by RollyGuy (Chaplain) on Jul 21, 2003 at 14:02 UTC
    Disclaimer: I don't know the answer
    However, I would like to add a word of caution. PDFs can store information as images as well, so if you are trying to parse a PDF made up of images of text, it will be very difficult and quite a different problem from parsing.
    Enjoy.
      When I open the PDF using Acrobat, I can use the text selection tool to grab some text, so unless they have some FAST OCR software running in there I don't believe it's an image. A very good warning, though.
        Depending on the app that created the PDF and the settings/fonts used, you may end up with a PDF that is a bunch of font character bitmaps, in sequence or in blocks, with no underlying text information. Most Adobe applications will embed textual versions in the PDF so the text selection tool can be used to grab plain text from segments of the PDF. I think the point is that unless you can be certain how the PDFs are generated, and you are comfortable with them, parsing data from them is going to be a large pain in the butt.

        -Waswas
Re: Reading PDF files
by The Mad Hatter (Priest) on Jul 21, 2003 at 16:11 UTC
    I've never used it, but I've heard PDFLib is great (not sure if the Perl bindings are on CPAN). If you haven't looked at it already, you might want to check it out (particularly the alternative language bindings...Perl as an "alternative language", they must not get out much... ; ).
