http://qs321.pair.com?node_id=1190089

cerian has asked for the wisdom of the Perl Monks concerning the following question:

I am attempting to convert the contents of several PDF files into plain text. I have read through a number of threads on here, and on other sites, that attempt to do this. So far, nothing has worked. While I may get the occasional word amongst the gibberish, none of them come close to doing the job. Here is a glimpse of the things I have already tried:

Attempt 1:

    use Text::FromAny;

    my $pdf_file = "foo.pdf";
    my $obj  = Text::FromAny->new(file => $pdf_file);
    my $text = $obj->text;
    print $text;

Attempt 2:

    use CAM::PDF;

    my $pdf_file = "foo.pdf";
    my $obj = CAM::PDF->new($pdf_file) || die "$CAM::PDF::errstr\n";
    my $txt = $obj->getPageText(1);
    CAM::PDF->asciify(\$txt);    # same results without this statement
    print $txt;

Attempt 3:

    use CAM::PDF;
    use CAM::PDF::PageText;

    my $pdf_file = "foo.pdf";
    my $obj  = CAM::PDF->new($pdf_file) || die "$CAM::PDF::errstr\n";
    my $tree = $obj->getPageContentTree(1);
    my $txt  = CAM::PDF::PageText->render($tree);
    CAM::PDF->asciify(\$txt);    # same results without this statement
    print $txt;

Attempt 4: Use the getpdftext.pl source at https://metacpan.org/pod/distribution/CAM-PDF/bin/getpdftext.pl

Any other ideas?

Replies are listed 'Best First'.
Re: Converting PDF file to text
by LanX (Saint) on May 11, 2017 at 18:34 UTC
    > Any other ideas?

    See the update of "Parsing PDFs by text position?" and the linked threads.

    > nothing had worked

    What does this exactly mean?

    If pdftohtml -xml doesn't produce readable text, your only remaining chance is OCR, because the PDF might embed its own font with the glyphs in a scrambled order, or might contain nothing but an image of the text.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!
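
A minimal sketch of how the output of pdftohtml -xml could be turned back into plain text lines. It assumes the usual pdf2xml layout, where every page element carries a number attribute and every text element carries top and left attributes. The sample document and the name text_lines_from_pdf2xml are made up for illustration, and a real XML parser such as XML::LibXML would be more robust than the regular expressions used here to avoid CPAN dependencies.

```perl
use strict;
use warnings;

my %entity = (lt => '<', gt => '>', quot => '"', apos => "'", amp => '&');

# Return the text of all <text> elements, sorted by page, then top, then left.
sub text_lines_from_pdf2xml {
    my ($xml) = @_;
    my @items;
    my $page = 0;
    while ($xml =~ m{<page\b[^>]*\bnumber="(\d+)"|<text\b([^>]*)>(.*?)</text>}gs) {
        if (defined $1) { $page = $1; next; }      # entering a new page
        my ($attrs, $body) = ($2, $3);
        my ($top)  = $attrs =~ /\btop="(-?\d+)"/;
        my ($left) = $attrs =~ /\bleft="(-?\d+)"/;
        $body =~ s/<[^>]+>//g;                      # drop inline markup such as <b> or <i>
        $body =~ s/&(lt|gt|quot|apos|amp);/$entity{$1}/g;
        push @items, { page => $page, top => $top // 0, left => $left // 0, text => $body };
    }
    return map { $_->{text} }
           sort { $a->{page} <=> $b->{page}
               || $a->{top}  <=> $b->{top}
               || $a->{left} <=> $b->{left} } @items;
}

# Hypothetical sample, shaped like pdftohtml -xml output.
my $sample = <<'XML';
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="140" left="108" width="90" height="15" font="0">second line</text>
<text top="120" left="108" width="120" height="15" font="0"><b>First</b> &amp; foremost</text>
</page>
</pdf2xml>
XML

print "$_\n" for text_lines_from_pdf2xml($sample);
```

Sorting by position matters because the text elements are not guaranteed to appear in reading order.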

      > your only remaining chance is OCR,

      though you are probably able to decipher the order with Vigenère cipher code breaking techniques.

      Not sure how embedded fonts are handled in PDF.

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

      "Nothing had worked" meant that the resulting text files were filled with non-ascii gibberish and bore no resemblance to the pdf file.

      In fact, pdftohtml works just fine. The trouble is that it's an executable. A condition I did not mention in the original post is that this needs to be done by a script within a website's CGI directory. The server is configured not to allow the running of executables in cgi-bin. I do not have admin rights on the server and cannot change this.

      So, more specifically, I am looking for a Perl-based solution to this problem.

        pdftotext is probably the best PDF-to-text converter. So don't put the executable in cgi-bin; write a script that makes a system call to it. Please don't tell me that you can't make any system calls from your CGI script?
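
A minimal sketch of such a system call, assuming pdftotext is installed somewhere outside cgi-bin and that the server allows the script to start it. The function names are made up for illustration. The options used are the tool's documented -layout and -enc switches, with a final "-" to write the text to standard output.

```perl
use strict;
use warnings;

# Build the argument list for pdftotext. "-layout" keeps the physical layout
# and the trailing "-" sends the text to standard output instead of a file.
sub pdftotext_command {
    my ($pdf_file, $binary) = @_;
    $binary //= 'pdftotext';    # assumption: found via PATH unless a full path is passed
    return ($binary, '-layout', '-enc', 'UTF-8', $pdf_file, '-');
}

# Run pdftotext and return everything it prints. The list form of open()
# bypasses the shell, so odd characters in the file name are not interpreted.
sub pdf_to_text {
    my ($pdf_file, $binary) = @_;
    open(my $pipe, '-|', pdftotext_command($pdf_file, $binary))
        or die "Cannot start pdftotext: $!\n";
    binmode($pipe, ':encoding(UTF-8)');
    my $text = do { local $/; <$pipe> };
    close($pipe) or die "pdftotext failed with status $?\n";
    return $text;
}

# Only convert when a file name is given on the command line.
if (@ARGV) {
    binmode(STDOUT, ':encoding(UTF-8)');
    print pdf_to_text($ARGV[0]);
}
```

In a CGI script the second argument would be the full path to wherever the binary actually lives, for example pdf_to_text($file, '/home/user/bin/pdftotext') with a hypothetical path.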
        I once took a look at the source of pdftohtml, and porting it to Perl shouldn't be too difficult ...

        BUT

        ... it's based on a call to Ghostscript, which does the hard part.

        And I doubt it can be done otherwise; I can't imagine anyone reimplementing PostScript in Perl.

        So if

        >  is configured not to allow the running of executables in cgi-bin. 

        Then you should start looking for a new server.

        I doubt it's possible to find an open solution not based on ghostscript.

        (Unless you find a web service that does the hard part for you.)

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        update

        Well, can you run executables outside cgi-bin? And is Ghostscript installed?

Re: Converting PDF file to text
by Corion (Patriarch) on May 11, 2017 at 19:16 UTC
      Invoking java on the tika-app-1.14.jar from within Perl works fine on my laptop, but not on the web site this thing is for. There it generates the following trace:
      Exception in thread "main" java.lang.ClassFormatError: org.apache.tika.cli.TikaCLI (unrecognized class file version)
         at java.lang.VMClassLoader.defineClass(libgcj.so.10)
         at java.lang.ClassLoader.defineClass(libgcj.so.10)
         at java.security.SecureClassLoader.defineClass(libgcj.so.10)
         at java.net.URLClassLoader.findClass(libgcj.so.10)
         at java.lang.ClassLoader.loadClass(libgcj.so.10)
         at java.lang.ClassLoader.loadClass(libgcj.so.10)
         at gnu.java.lang.MainThread.run(libgcj.so.10)
      
      I've been attempting to work with the Apache::Tika, Apache::Tika::Async, and Apache::Tika::Server modules, but without success. The only documentation I have found for them so far is the brief synopsis section on CPAN. Sadly, those synopses contain errors. Do you have any additional documentation or working code snippets? I have yet to make one of these critters work. Examples of what I've tried are below:

      Attempt 1: Apache::Tika
          use Apache::Tika;

          my $tika = Apache::Tika->new();
          open my $fh, '<:raw', 'x.pdf';
          my $pdf = do { local $/; <$fh> };
          close $fh;
          my $text = $tika->tika($pdf);
          print "$text\n";
      The value in $text is:
      Can't connect to localhost:9998 (Connection refused)
      
      LWP::Protocol::http::Socket: connect: Connection refused at /Library/Perl/5.18/LWP/Protocol/http.pm line 46.
      
      
      Attempt 2: Apache::Tika::Async
          use Apache::Tika::Async;

          my $tika= Apache::Tika::Server->new;
          my $fn= shift;
          use Data::Dumper;
          print Dumper $tika->get_meta($fn);
          print Dumper $tika->get_text($fn);

      That *is* the CPAN synopsis. It doesn't work. It dies on line 2 with: Can't locate object method "new" via package "Apache::Tika::Server". Additionally, there is no new() method within the Apache::Tika::Async class.

      Attempt 3: Apache::Tika::Server
          use Data::Dumper;
          use Apache::Tika::Server;

          my $tika= Apache::Tika::Server->new();
          # $tika->launch();
          my $fn = "x.pdf";
          print Dumper $tika->get_text($fn);

      This gets me the following: Got HTTP error code 595 on the call to $tika->get_text. If I uncomment the call to $tika->launch, I get: Use of uninitialized value in join or string at /Library/Perl/5.18/Apache/Tika/Server.pm line 81.

      So far I can find no other information on these libraries online. If anyone out there has some documentation or working examples on how to use them, I would love to see them.
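
The "Connection refused" message for localhost:9998 in Attempt 1 suggests that nothing was listening on that port, which is where a Tika server would be expected to run. A minimal sketch for checking this before calling the module, using only the core IO::Socket::INET module; the function name tcp_port_open is made up for illustration.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Return true if something accepts TCP connections on $host:$port.
sub tcp_port_open {
    my ($host, $port) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => 2,
    );
    return 0 unless $sock;
    close $sock;
    return 1;
}

my ($host, $port) = ('localhost', 9998);    # address taken from the error message
print tcp_port_open($host, $port)
    ? "Something is listening on $host:$port\n"
    : "Nothing is listening on $host:$port; the Tika server has to be started first\n";
```

If the check fails, no amount of changes to the client code will help until a server process is actually running there.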

        Most likely the version of Java on your web server does not work with the version of Java the Tika JAR file requires. I can't help you there.
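
The libgcj frames in the trace suggest that "java" on the server is the GNU gcj runtime, and "unrecognized class file version" means the class files were compiled for a newer Java than that runtime understands. A minimal sketch for checking which release a class file was built for, based on the standard class file header (magic number, minor version, major version); the function names are made up for illustration.

```perl
use strict;
use warnings;

# Decode the 8-byte header of a Java .class file:
# 4 bytes magic (0xCAFEBABE), 2 bytes minor version, 2 bytes major version.
sub class_file_version {
    my ($header) = @_;
    die "Header too short\n" if length($header) < 8;
    my ($magic, $minor, $major) = unpack('N n n', $header);
    die "Not a Java class file\n" unless $magic == 0xCAFEBABE;
    return ($major, $minor);
}

# Major version 49 corresponds to Java 5, 50 to Java 6, and so on.
sub required_java_release {
    my ($major) = @_;
    return $major >= 49 ? $major - 44 : undef;
}

# Only inspect a file when one is given on the command line.
if (@ARGV) {
    open(my $fh, '<:raw', $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";
    my $header = '';
    read($fh, $header, 8);
    close $fh;
    my ($major) = class_file_version($header);
    my $release = required_java_release($major);
    printf "class file major version %d (%s)\n", $major,
        defined $release ? "needs Java $release or newer" : "older than Java 5";
}
```

A class file can be pulled out of the JAR with something like unzip -p tika-app-1.14.jar org/apache/tika/cli/TikaCLI.class > TikaCLI.class and then passed to the script.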

        I'm sorry that the synopsis of Apache::Tika::Async is broken - it should look like the following, but it seems I never released that fix onto CPAN:

            use Apache::Tika::Async;
            use Data::Dumper;

            my $tika = Apache::Tika::Async->new;
            my $fn   = shift;

            my $info = $tika->get_all( $fn );
            print Dumper $info->meta($fn);
            print $info->content($fn);             # <html><body>...
            print $info->meta->{"meta:language"};  # en

        But all of this is in vain if the Tika executable won't start.

        Update: I've now published the Git repository of the module, which contains some fixes I should also release soonish.

Re: Converting PDF file to text
by vr (Curate) on May 11, 2017 at 19:11 UTC

    CAM::PDF is very naive about text extraction (that is, it fit the tasks it was designed to solve at the time). It handles single-byte encodings only, and of those just Latin-1, and it ignores the "ToUnicode" tables completely. Don't even try this or any other pure-Perl module for serious extraction. Last time I checked, the MuPDF tool (and Ghostscript's txtwrite device, whose output matches it, though unfortunately not in every version) produced nice XML output with correctly encoded characters (if that is possible at all for the given PDF), along with position, style attributes, etc. The output of these tools can then be parsed with XML::Simple or similar, i.e. with Perl.

    Edit: I have a patch for CAM::PDF, but perhaps you could first provide a typical PDF (or two, if they come from diverse sources) for testing. Text encoding was not a problem, but as far as I can see there are also rather naive choices in the layout heuristics. If those cause problems, the patch is not worth the effort and it is better to use dedicated tools.
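
To illustrate why ignoring the "ToUnicode" tables leads to gibberish: in many PDFs the bytes in a text string are font-specific codes, and only the font's ToUnicode CMap says which characters they stand for. A minimal sketch, assuming a CMap that uses only beginbfchar sections and fixed-width codes. The sample CMap and the function names are made up for illustration, bfrange sections and surrogate pairs are not handled, and this is not how CAM::PDF or the patch mentioned above works.

```perl
use strict;
use warnings;

# Parse the "beginbfchar ... endbfchar" sections of a ToUnicode CMap into a
# hash that maps source codes (uppercase hex) to Unicode strings.
sub parse_bfchar {
    my ($cmap) = @_;
    my %map;
    while ($cmap =~ /beginbfchar(.*?)endbfchar/gs) {
        my $section = $1;
        while ($section =~ /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g) {
            my ($code, $dest) = (uc $1, $2);
            # The destination is UTF-16BE; each group of 4 hex digits is one unit.
            $map{$code} = join '', map { chr(hex($_)) } $dest =~ /(.{4})/g;
        }
    }
    return \%map;
}

# Decode the hex digits of a string operand, assuming every code is $width bytes.
# Codes missing from the table become U+FFFD.
sub decode_hex_string {
    my ($hex, $map, $width) = @_;
    my $digits = 2 * $width;
    return join '', map { $map->{uc $_} // "\x{FFFD}" } $hex =~ /(.{$digits})/g;
}

# Hypothetical CMap fragment: codes 1 to 4 stand for P, e, r, l.
my $cmap = <<'CMAP';
4 beginbfchar
<0001> <0050>
<0002> <0065>
<0003> <0072>
<0004> <006C>
endbfchar
CMAP

my $map = parse_bfchar($cmap);
print decode_hex_string('0001000200030004', $map, 2), "\n";
```

Without the table, the same string is just the bytes 00 01 00 02 00 03 00 04, which is the kind of output the original question describes.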