Re: Extracting text from PDF. No really

by mirod (Canon)
on Mar 28, 2008 at 13:16 UTC (#676974)


in reply to Extracting text from PDF. No really

I have had some success with pdftohtml in the past.

It wasn't easy though. The tool has 2 major modes, html and xml. I can't remember exactly what the problem was with the html mode, but I ended up not using it at all. I used the xml mode, with a LOT of post-processing (in Perl).

For starters the XML was not well-formed (i, b, u and a tags were not properly nested), so I had to disentangle them. Then what you get is a bunch of strings with their position on the page. From there I had to order them, merge them to create lines (sub/superscripts needed to be handled of course), and then create paragraphs... fun!
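
To give an idea of the ordering and line-merging step, here is a minimal sketch. It is in Python rather than the Perl I actually used, and it is not the original code. The `Fragment` class, the `group_into_lines` function and the `tolerance` value are all made up for illustration; it assumes each string has already been pulled out of the XML together with its top and left coordinates.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One positioned string (hypothetical structure, not the tool's own)."""
    top: float
    left: float
    text: str


def group_into_lines(fragments, tolerance=3.0):
    """Order fragments top to bottom, then merge those at a similar height.

    `tolerance` is an assumed fudge factor: fragments whose `top` is within
    this distance of the first fragment of a line are treated as part of
    that line, which lets slightly raised or lowered sub/superscripts join it.
    """
    lines = []  # each entry is [reference_top, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: (f.top, f.left)):
        if lines and abs(frag.top - lines[-1][0]) <= tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.top, [frag]])
    # Within each line, read left to right and join with single spaces.
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.left))
        for _, frags in lines
    ]


fragments = [
    Fragment(top=120, left=50, text="second line"),
    Fragment(top=100, left=200, text="world"),
    Fragment(top=98, left=180, text="2"),  # sits a little higher, like a superscript
    Fragment(top=100, left=50, text="Hello"),
]
print(group_into_lines(fragments))
```

A fixed tolerance is the crudest possible choice; real documents probably need it scaled to the font height, and paragraph detection would be a further pass over the gaps between lines.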

That was with version 0.36, the one that seems to come with most Linux distributions (it was released in 2002). SourceForge has some more recent ("experimental") versions. I tried 0.40a, which produced wildly different output, at least in xml mode, and gave up. The trouble with version 0.36 is that it struggles with some recent PDF files (version 1.6).

Overall it was quite painful, but in the end I managed to extract some information from the files.

Oddly enough I am currently using pdftotext for another project, and it seems to be doing quite well, even though of course the output is simpler than what pdftohtml produces. I haven't noticed it dropping letters so far.
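
For what it's worth, calling pdftotext from a script can look roughly like the sketch below (Python again). The helper names `pdftotext_command` and `extract_text` are made up for illustration. The `-layout` and `-enc` options and `-` for standard output are the ones I know from the pdftotext man page, but check them against the version you have installed.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, layout=True):
    """Build the pdftotext command line (hypothetical helper).

    "-" as the output file asks pdftotext to write to standard output.
    """
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to keep the physical layout of the page
    cmd += ["-enc", "UTF-8", pdf_path, "-"]
    return cmd


def extract_text(pdf_path, layout=True):
    """Run pdftotext on one file and return its output as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would be something like `text = extract_text("report.pdf")`, where `report.pdf` is a placeholder file name.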
