PerlMonks  

Re: Converting PDF file to text

by LanX (Saint)
on May 11, 2017 at 18:34 UTC ( [id://1190091] )


in reply to Converting PDF file to text

> Any other ideas?

See update of Parsing PDFs by text position? and linked threads

> nothing had worked

What exactly does this mean?

If pdftohtml -xml doesn't produce readable text, your only remaining chance is OCR, because the PDF might embed its own font with the glyphs in scrambled order, or even contain only an image of the text.

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!

Replies are listed 'Best First'.
Re^2: Converting PDF file to text
by LanX (Saint) on May 11, 2017 at 19:08 UTC
    > your only remaining chance is OCR,

    though you are probably able to decipher the order with Vigenère cipher code breaking techniques.

    Not sure how embedded fonts are handled in PDF.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

Re^2: Converting PDF file to text
by cerian (Novice) on May 12, 2017 at 16:26 UTC
    "Nothing had worked" meant that the resulting text files were filled with non-ascii gibberish and bore no resemblance to the pdf file.

    In fact, pdftohtml works just fine. Trouble is, it's an executable. A condition I did not mention in the original post was that this needs to be done by a script within a website's CGI directory. The server is configured not to allow the running of executables in cgi-bin. I do not have admin rights on the server and cannot change this.

    So, more specifically, I am looking for a Perl-based solution to this problem.

      pdftotext is probably the best PDF-to-text converter. So don't put the executable in cgi-bin; write a script that makes a system call. Please don't tell me that you can't make any system calls from your CGI script?
        > pdftotext is probably the best pdf to text converter.

        I disagree :)

        > ...write a script that makes a system call. Please don't tell me that you can't make any system calls from your cgi script?

        I agree. :)

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!
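
A minimal sketch of the "script that makes a system call" idea, assuming pdftotext is installed somewhere outside cgi-bin and the server lets CGI scripts spawn processes at all. The helper names `pdftotext_command` and `run_capture` and both paths in the commented call are made up for illustration. The list form of the pipe open bypasses the shell, so a user-supplied file name is never interpreted as shell syntax.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the argument list for pdftotext.
# "-layout" keeps the physical layout; "-" as the output file means stdout.
sub pdftotext_command {
    my ($exe, $pdf) = @_;
    return ($exe, '-layout', $pdf, '-');
}

# Run a command without a shell (list-form pipe open) and return its stdout.
sub run_capture {
    my @cmd = @_;
    open(my $fh, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    local $/;                      # slurp mode: read everything at once
    my $out = <$fh>;
    close($fh) or die "$cmd[0] failed with status $?";
    return $out;
}

# Example call; both paths are hypothetical:
# my $text = run_capture(pdftotext_command('/usr/bin/pdftotext', '/path/to/file.pdf'));
```

If the open or close dies, the message at least tells you whether the server refuses to run the program or the program itself failed.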

      I once took a look into the source of pdftohtml and porting it to Perl shouldn't be too difficult ...

      BUT

      ... it's based on a call to ghostscript which does the hard part.

      And I doubt it can be done otherwise; I can't imagine anyone reimplementing PostScript in Perl.

      So if

      >  is configured not to allow the running of executables in cgi-bin. 

      Then you should start looking for a new server.

      I doubt it's possible to find an open solution not based on ghostscript.

      (Unless you find a web service doing the hard part for you.)

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

      update

      Well, can you run executables outside cgi-bin? And is ghostscript installed?
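
One way to answer both questions from inside a CGI script is to search the PATH the web server gives you. This is only a sketch: `find_executable` is a made-up helper, and the program names in the commented loop (`gs` for ghostscript, `pdftotext`, `pdftohtml`) are the usual ones but may differ on your host.

```perl
use strict;
use warnings;
use File::Spec;

# Return the full path of the first executable called $name found in PATH,
# or nothing if there is none.
sub find_executable {
    my ($name) = @_;
    for my $dir (File::Spec->path) {
        my $candidate = File::Spec->catfile($dir, $name);
        return $candidate if -f $candidate && -x _;
    }
    return;
}

# Example report; the program names are assumptions about a typical install:
# for my $prog (qw(gs pdftotext pdftohtml)) {
#     my $path = find_executable($prog);
#     print "$prog: ", (defined $path ? $path : 'not found'), "\n";
# }
```

Finding the file only shows it is present and marked executable; whether the server policy lets the CGI user actually run it still has to be tried.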
