PerlMonks  

Re: Reading PDF files

by jmanning2k (Pilgrim)
on Jul 21, 2003 at 15:06 UTC ( [id://276310]=note )


in reply to Reading PDF files

If you just need some values from the document and don't need to modify the original PDF, an alternative to parsing the PDF itself is to run 'pdf2ascii' on the file and then parse the resulting plain text.
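As a rough sketch of the "convert, then parse the text" idea, the Python below assumes the converter has already produced plain text and pulls labelled values out of it with a regular expression. The sample text, its `Label: value` layout, and the name `extract_fields` are illustrative assumptions; real converter output depends on the PDF and the tool, and the same approach carries over directly to Perl.

```python
import re

# Hypothetical converter output; real text depends on the PDF and the tool used.
SAMPLE_TEXT = """\
Account Statement
Account Number: 12345-678
Statement Date: 2003-07-21
Closing Balance:   1,204.50
"""

# A label made of letters and spaces, a colon, then the value.
FIELD_RE = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?)\s*:\s*(.+?)\s*$")

def extract_fields(text):
    """Return a dict of 'Label: value' pairs found one per line in text."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            label, value = match.groups()
            fields[label] = value
    return fields

if __name__ == "__main__":
    for label, value in extract_fields(SAMPLE_TEXT).items():
        print(f"{label} = {value}")
```

Lines without a colon, such as the title line in the sample, are simply skipped, so stray layout text from the conversion does not need to be cleaned up first.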

Most PDF files contain the actual text, not an image of the text as suggested above. (I'm not saying an image-only PDF is impossible, just unlikely.)

Replies are listed 'Best First'.
Re: Re: Reading PDF files
by Willard B. Trophy (Hermit) on Jul 21, 2003 at 15:35 UTC
    Yet another (but similar) way to do it: use pdftohtml's XML output mode, and parse that. This has the advantage that it stores position information for the text, and it writes the strings out in the order they were rendered on the page. This can be quite helpful.
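To illustrate how the position information can be used, here is a minimal Python sketch that reads XML in the general shape of pdftohtml's XML mode and groups strings sharing the same vertical position into rows. The sample document is hand-written, and the element and attribute names (`page`, `text`, `top`, `left`) are assumptions about the output format that may differ between versions; `positioned_text` and `rows_by_top` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed shape of pdftohtml's XML output;
# check it against what your version of the tool actually writes.
SAMPLE_XML = """\
<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
    <text top="100" left="72" width="120" height="14" font="0">Invoice <b>No.</b></text>
    <text top="100" left="300" width="60" height="14" font="0">A-1001</text>
    <text top="130" left="72" width="120" height="14" font="0">Total</text>
    <text top="130" left="300" width="60" height="14" font="0">42.00</text>
  </page>
</pdf2xml>
"""

def positioned_text(xml_source):
    """Yield (page, top, left, text) for each <text> element, in document order."""
    root = ET.fromstring(xml_source)
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for node in page.iter("text"):
            # itertext() also picks up text inside nested <b>/<i> markup.
            content = "".join(node.itertext()).strip()
            yield page_no, int(node.get("top")), int(node.get("left")), content

def rows_by_top(items):
    """Group strings that share a page and 'top' coordinate into rows."""
    rows = {}
    for page_no, top, left, content in items:
        rows.setdefault((page_no, top), []).append((left, content))
    # Within a row, order the cells left to right.
    return {key: [text for _, text in sorted(cells)] for key, cells in rows.items()}

if __name__ == "__main__":
    for key, cells in rows_by_top(positioned_text(SAMPLE_XML)).items():
        print(key, cells)
```

Grouping on an exact `top` value is the simplest possible choice; text that is vertically offset by a pixel or two would need a tolerance instead.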

    pdftohtml uses the internals of xpdf to do the work. xpdf comes with the pdftotext tool, which might do all you need.

    If none of the above works (and some PDFs do very weird things with font encoding), install the DjVuLibre application and run your PDF through the Any2DjVu converter. It will do real OCR, and you can then extract the recognized text with the djvused tool.

    All this is moot, of course, if the terms of use of the original file forbid anything other than reading the document on the screen. Many financial institutions use PDF for its "read only" (for the casual user) nature.

    --
    bowling trophy thieves, die!

Re: Re: Reading PDF files
by Helter (Chaplain) on Jul 21, 2003 at 15:26 UTC
    On my Linux box I have pdf2ps and ps2ascii, but going through the motions breaks with both the PDF file under inspection and the other sample I have: the first fails during "open", and the second fails in the ps2ascii conversion.

    I can't find this program for a Windows machine. Is it just part of the gs distribution?

    Thanks.
      Yes, it comes with ghostscript. It's also available with ghostscript/gview for win32.

      You can probably find a Perl replacement for ps2ascii if you don't want to install that too.

      I'm not sure why it would break on your PDFs, unless they have some embedded security or binary parts.
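One way to narrow down where the conversion breaks is to run each step separately and capture its error output. The Python sketch below does that for the pdf2ps and ps2ascii steps mentioned in this thread. The helper names `run_stages` and `pdf_to_text_stages`, the output file names, and the `input output` argument order for both tools are assumptions to check against your installed versions.

```python
import subprocess
import sys

def run_stages(stages):
    """Run (name, argv) stages in order; stop at the first failure.

    Returns (failed_stage_name, error_text), or (None, "") if all succeeded.
    """
    for name, argv in stages:
        try:
            result = subprocess.run(argv, capture_output=True, text=True)
        except FileNotFoundError:
            return name, f"command not found: {argv[0]}"
        if result.returncode != 0:
            return name, result.stderr.strip()
    return None, ""

def pdf_to_text_stages(pdf_path, ps_path, txt_path):
    """Build the two-step pipeline: PDF to PostScript, then PostScript to text."""
    return [
        ("pdf2ps", ["pdf2ps", pdf_path, ps_path]),        # assumed argument order
        ("ps2ascii", ["ps2ascii", ps_path, txt_path]),    # assumed argument order
    ]

if __name__ == "__main__" and len(sys.argv) == 2:
    failed, message = run_stages(
        pdf_to_text_stages(sys.argv[1], "out.ps", "out.txt"))
    print("all stages succeeded" if failed is None else f"{failed} failed: {message}")
```

Knowing which stage fails, and with what message, makes it easier to tell a protected or damaged PDF apart from a problem in the PostScript-to-text step.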
