PerlMonks
Re^3: CAM::PDF extract text and their coordinates from pdf.. by snoopy (Curate)
on Jan 10, 2013 at 05:58 UTC ( #1012592=note )
Hi Umesh,
Yes, that's the same point that I got to. In practice, you end up with a lot of text fragments. Putting these back together into words and lines is a fair bit of work and can involve some heuristics. Rather than continuing to develop the above, I personally went with pstotext from the Ghostscript suite; it has a `-bboxes` option to output text positions and does attempt to assemble words and lines. Despite its name, it will work on PDF files. Another program I looked at was pdfminer. One of these, or something similar, might work. It's just a matter of how good a job they do.

- David
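To give a feel for the kind of heuristics involved, here is a minimal sketch of the reassembly step in plain Perl. It is not what pstotext or pdfminer actually do, and it does not use CAM::PDF: the fragment records (`x`, `y`, `w`, `text`), the `assemble_lines` name and the `y_tol`/`gap_tol` thresholds are all made up for illustration. It assumes you have already extracted each fragment's left edge, baseline and width in the same units.

```perl
use strict;
use warnings;

# Hypothetical helper: group fragments into lines by baseline (y), then
# join fragments on each line into words based on the horizontal gap.
sub assemble_lines {
    my ($fragments, %opt) = @_;
    my $y_tol   = $opt{y_tol}   // 2;    # max baseline difference within a line
    my $gap_tol = $opt{gap_tol} // 1.5;  # max gap between pieces of one word

    my @lines;
    # PDF y coordinates grow upward, so descending y reads top to bottom.
    for my $frag (sort { $b->{y} <=> $a->{y} } @$fragments) {
        if (@lines && abs($lines[-1]{y} - $frag->{y}) <= $y_tol) {
            push @{ $lines[-1]{frags} }, $frag;
        }
        else {
            push @lines, { y => $frag->{y}, frags => [$frag] };
        }
    }

    my @text;
    for my $line (@lines) {
        my ($out, $prev_end) = ('', undef);
        for my $frag (sort { $a->{x} <=> $b->{x} } @{ $line->{frags} }) {
            # A gap wider than the tolerance is treated as a word break.
            $out .= ' ' if defined $prev_end && $frag->{x} - $prev_end > $gap_tol;
            $out .= $frag->{text};
            $prev_end = $frag->{x} + $frag->{w};
        }
        push @text, $out;
    }
    return @text;
}

# Made-up sample data: "Hello" arrives split in two, "world" sits on a
# slightly different baseline, and a second line follows below.
my @fragments = (
    { x => 72,  y => 700,   w => 12, text => 'He'    },
    { x => 84,  y => 700,   w => 15, text => 'llo'   },
    { x => 104, y => 700.5, w => 30, text => 'world' },
    { x => 72,  y => 686,   w => 24, text => 'next'  },
    { x => 101, y => 686,   w => 22, text => 'line'  },
);
print "$_\n" for assemble_lines(\@fragments);
```

With the sample data this prints `Hello world` and `next line`. Real documents are messier (rotated text, multiple columns, kerning, varying font sizes), so fixed tolerances like these would need tuning, most likely relative to the font size.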