Converting PDF file to text

cerian has asked for the wisdom of the Perl Monks concerning the following question:

I am attempting to convert the contents of several PDF files into plain text. I have read through a number of threads on here, and on other sites, that attempt to do this. So far, nothing has worked. While I may get the occasional word amongst the gibberish, none of them come close to doing the job. Here is a glimpse of the things I have already tried:

Attempt 1:

use Text::FromAny;

my $pdf_file = "foo.pdf";
my $obj      = Text::FromAny->new(file => $pdf_file);
my $text     = $obj->text;    
print $text;

Attempt 2:

use CAM::PDF;

my $pdf_file = "foo.pdf";
my $obj      = CAM::PDF->new($pdf_file) || die "$CAM::PDF::errstr\n";
my $txt      = $obj->getPageText(1);
CAM::PDF->asciify(\$txt);       # same results without this statement.
print $txt;

Attempt 3:

use CAM::PDF;
use CAM::PDF::PageText;

my $pdf_file = "foo.pdf";
my $obj      = CAM::PDF->new($pdf_file) || die "$CAM::PDF::errstr\n";
my $tree     = $obj->getPageContentTree(1);
my $txt      = CAM::PDF::PageText->render($tree);
CAM::PDF->asciify(\$txt);       # same results without this statement.
print $txt;

Attempt 4: Use the getpdftext.pl source at https://metacpan.org/pod/distribution/CAM-PDF/bin/getpdftext.pl

Any other ideas?


Replies are listed 'Best First'.
Re: Converting PDF file to text by LanX (Saint) on May 11, 2017 at 18:34 UTC

> Any other ideas?

See the update of "Parsing PDFs by text position?" and the linked threads.

> nothing had worked

What exactly does this mean?

If `pdftohtml -xml` doesn't produce readable text, your only remaining chance is OCR, because the PDF might embed its own font in random order, or even contain only an image showing the text.

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!
Re^2: Converting PDF file to text by LanX (Saint) on May 11, 2017 at 19:08 UTC

> your only remaining chance is OCR,

though you can probably decipher the order with Vigenère cipher code-breaking techniques.

Not sure how embedded fonts are handled in PDF.

Cheers Rolf
Re^2: Converting PDF file to text by cerian (Novice) on May 12, 2017 at 16:26 UTC

"Nothing had worked" meant that the resulting text files were filled with non-ASCII gibberish and bore no resemblance to the PDF file.

In fact, pdftohtml works just fine. The trouble is that it's an executable. A condition I did not mention in the original post is that this needs to be done by a script within a website's CGI directory. The server is configured not to allow the running of executables in cgi-bin. I do not have admin rights on the server and cannot change this.

So, more specifically, I am looking for a Perl-based solution to this problem.
Re^3: Converting PDF file to text by runrig (Abbot) on May 12, 2017 at 21:54 UTC
pdftotext is probably the best pdf to text converter. So don't put the executable in cgi-bin...write a script that makes a system call. Please don't tell me that you can't make any system calls from your cgi script?
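
A minimal sketch of the system-call approach suggested above, assuming a `pdftotext` binary (from poppler or xpdf) is installed somewhere on the server's PATH and that the CGI script is allowed to spawn processes. The helper names `run_capture` and `pdf_to_text` are made up for illustration; neither is part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run an external command and return everything it writes to STDOUT.
# The list form of open() bypasses the shell, so file names containing
# spaces or shell metacharacters are passed through untouched.
sub run_capture {
    my @cmd = @_;
    open my $fh, '-|', @cmd
        or die "Cannot run $cmd[0]: $!\n";
    binmode $fh;
    my $out = do { local $/; <$fh> };
    close $fh
        or die "$cmd[0] failed (wait status $?)\n";
    return $out;
}

# Hypothetical helper: assumes pdftotext is on PATH.
# A trailing '-' tells pdftotext to write the text to STDOUT.
sub pdf_to_text {
    my ($pdf_file) = @_;
    return run_capture('pdftotext', '-layout', $pdf_file, '-');
}

# Only convert when a file name is given on the command line.
print pdf_to_text($ARGV[0]) if @ARGV;
```

Because the binary is found through PATH, it can live anywhere outside cgi-bin; whether the host permits that is exactly the open question in this thread.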
Re^4: Converting PDF file to text by LanX (Saint) on May 13, 2017 at 11:59 UTC
Re^3: Converting PDF file to text by LanX (Saint) on May 12, 2017 at 17:08 UTC
I once took a look at the source of pdftohtml, and porting it to Perl shouldn't be too difficult ... BUT ... it's based on a call to Ghostscript, which does the hard part.

And I doubt it can be done otherwise; I can't imagine anyone reimplementing PostScript in Perl.

So if

> is configured not to allow the running of executables in cgi-bin.

then you should start looking for a new server. I doubt it's possible to find an open solution not based on Ghostscript (unless you find a web service doing the hard part for you).

Cheers Rolf

Update: Well, can you run executables outside cgi-bin? And is Ghostscript installed?
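
One way to answer the two questions in the update is to let a small script look for the tools on PATH. This is only a sketch: `find_executable` is a hypothetical helper, and `gs` and `mutool` are assumed to be the usual binary names for Ghostscript and MuPDF, which may differ on a given host.

```perl
use strict;
use warnings;
use File::Spec;

# Return the full path of the first executable called $name found in
# the directories listed in PATH, or undef if there is none.
sub find_executable {
    my ($name) = @_;
    for my $dir (File::Spec->path) {
        my $candidate = File::Spec->catfile($dir, $name);
        return $candidate if -f $candidate && -x _;
    }
    return;
}

# Tool names mentioned in this thread (binary names are assumptions).
for my $tool (qw(pdftotext pdftohtml gs mutool)) {
    my $path = find_executable($tool);
    printf "%-10s %s\n", $tool, defined $path ? $path : 'not found';
}
```

Run from the CGI script itself, this reports what the web server's own environment can see, which may differ from an interactive shell.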
Re: Converting PDF file to text by Corion (Patriarch) on May 11, 2017 at 19:16 UTC
I had good success with using Apache Tika for text extraction from PDFs (also, Apache::Tika::Async).
Re^2: Converting PDF file to text by cerian (Novice) on May 12, 2017 at 17:13 UTC
Invoking java on the tika-app-1.14.jar from within Perl works fine on my laptop, but not on the web site this thing is for. It generates the following trace:

Exception in thread "main" java.lang.ClassFormatError: org.apache.tika.cli.TikaCLI (unrecognized class file version)
   at java.lang.VMClassLoader.defineClass(libgcj.so.10)
   at java.lang.ClassLoader.defineClass(libgcj.so.10)
   at java.security.SecureClassLoader.defineClass(libgcj.so.10)
   at java.net.URLClassLoader.findClass(libgcj.so.10)
   at java.lang.ClassLoader.loadClass(libgcj.so.10)
   at java.lang.ClassLoader.loadClass(libgcj.so.10)
   at gnu.java.lang.MainThread.run(libgcj.so.10)

I've been attempting to work with the Apache::Tika, Apache::Tika::Async, and Apache::Tika::Server modules, but without success. The only documentation I have found for them so far is the brief synopsis section on CPAN. Sadly, those contain errors. Do you have any additional documentation or working code snippets? I have yet to make one of these critters work. Examples of what I've tried are below.

Attempt 1: Apache::Tika

use Apache::Tika;

my $tika = Apache::Tika->new();
open my $fh, '<:raw', 'x.pdf' or die "Cannot open x.pdf: $!";
my $pdf = do { local $/; <$fh> };
close $fh;
my $text = $tika->tika($pdf);
print "$text\n";

The value in $text is:

Can't connect to localhost:9998 (Connection refused)
LWP::Protocol::http::Socket: connect: Connection refused at /Library/Perl/5.18/LWP/Protocol/http.pm line 46.

Attempt 2: Apache::Tika::Async

use Apache::Tika::Async;

my $tika= Apache::Tika::Server->new;
my $fn= shift;
use Data::Dumper;
print Dumper $tika->get_meta($fn);
print Dumper $tika->get_text($fn);

That is the CPAN synopsis. It doesn't work. It dies on line 2 with:

Can't locate object method "new" via package "Apache::Tika::Server"

Additionally, there is no new() method within the Apache::Tika::Async class.

Attempt 3: Apache::Tika::Server

use Data::Dumper;
use Apache::Tika::Server;

my $tika= Apache::Tika::Server->new();
# $tika->launch();
my $fn = "x.pdf";
print Dumper $tika->get_text($fn);

This gets me the following on the call to $tika->get_text:

Got HTTP error code 595

If I uncomment the call to $tika->launch, I get:

Use of uninitialized value in join or string at /Library/Perl/5.18/Apache/Tika/Server.pm line 81.

So far I can find no other information on these libraries online. If anyone out there has some documentation or working examples on how to use them, I would love to see it.
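
The "Connection refused" message in Attempt 1 means that nothing was accepting connections on localhost:9998 when the request was made. The status 595 in Attempt 3 is not a real HTTP code; AnyEvent::HTTP uses it for connection failures, so if that is the HTTP client in play here (an assumption), it points the same way. A minimal sketch for checking the port before calling any of the modules, where `port_is_open` is a hypothetical helper:

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Return true if something accepts TCP connections on $host:$port.
sub port_is_open {
    my ($host, $port) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => 2,
    );
    return 0 unless $sock;
    close $sock;
    return 1;
}

# 9998 is the port named in the error message above.
print port_is_open('localhost', 9998)
    ? "A server is listening on localhost:9998\n"
    : "Nothing is listening on localhost:9998; a Tika server has to be started first\n";
```

If the port is closed, the client modules have nothing to talk to, whatever their synopsis says.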
Re^3: Converting PDF file to text by Corion (Patriarch) on May 13, 2017 at 06:35 UTC
Most likely the version of Java on your web server does not work with the version of Java the Tika JAR file requires. I can't help you there.

I'm sorry that the synopsis of Apache::Tika::Async is broken. It should look like the following, but it seems I never released that fix onto CPAN:

use Apache::Tika::Async;
use Data::Dumper;

my $tika= Apache::Tika::Async->new;
my $fn= shift;

my $info = $tika->get_all( $fn );
print Dumper $info->meta($fn);
print $info->content($fn);
# <html><body>...
print $info->meta->{"meta:language"};
# en

But all of this is in vain if the Tika executable won't start.

Update: I've now published the Git repository of the module, which contains some fixes I should also release soonish.
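
The `libgcj.so.10` frames in the trace suggest the server's `java` is GCJ rather than a current JVM, which would fit the version-mismatch explanation. To see which Java release a class was compiled for, read the header of a `.class` file: a 4-byte magic number `0xCAFEBABE`, then a 2-byte minor and a 2-byte major version. A sketch, assuming the class has already been extracted from the JAR; `class_major_version` is a hypothetical helper:

```perl
use strict;
use warnings;

# Parse the first 8 bytes of a Java .class file and return the major
# version number, or undef if the magic number is wrong.
sub class_major_version {
    my ($bytes) = @_;
    return unless defined $bytes && length($bytes) >= 8;
    my ($magic, $minor, $major) = unpack 'N n n', $bytes;
    return unless $magic == 0xCAFEBABE;
    return $major;
}

# Usage: pass the path of an extracted .class file as the first argument.
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
    read $fh, my $header, 8;
    close $fh;
    my $major = class_major_version($header);
    die "$ARGV[0] is not a Java class file\n" unless defined $major;
    # Major versions 49 and up map to Java (major - 44):
    # 51 is Java 7, 52 is Java 8.
    printf "class file major version %d (Java %d)\n", $major, $major - 44;
}
```

A runtime that predates the reported version will refuse the class, which is what "unrecognized class file version" indicates.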
Re: Converting PDF file to text by vr (Curate) on May 11, 2017 at 19:11 UTC
`CAM::PDF` is very naive about text extraction (which suited the tasks it was designed to solve at the time). It handles single-byte encodings only, in fact just `Latin1`, and ignores the "ToUnicode" tables completely. Don't even try this or any other pure-Perl module for serious extraction.

Last time I checked, the `muPDF` tool (and, unfortunately depending on the version, Ghostscript's `txtwrite` device) produces nice `xml` output with correctly encoded characters (if that is possible at all for the given PDF), along with position, style attributes, etc. These tools' output can then be parsed with `XML::Simple` or similar, i.e. with Perl.

Edit: I have a patch for `CAM::PDF`, but maybe you could first provide a typical PDF (or two, if from diverse sources) for tests. Text encoding was not a problem, but as I see it there are also rather naive choices in the layout heuristics. If it runs into problems there, it's not worth the effort and better to use dedicated tools.
