Yes, that's the same point that I got to.
In practice, you end up with a lot of text fragments that need to be reassembled into words and lines. That's a fair bit of work and can involve some heuristics, such as deciding how close two fragments must be to count as the same word or the same line.
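To give a rough idea of the sort of heuristics involved, here is a minimal sketch in Python. It is not what CAM::PDF or any of the tools mentioned here actually do. The `Fragment` record, the function names, and the `y_tol` and `gap_tol` thresholds are all made up for illustration, and it assumes you already have each fragment's left edge, baseline and width in PDF units.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical record for one extracted piece of text."""
    x: float      # left edge of the fragment
    y: float      # baseline; PDF y coordinates grow upward
    width: float  # horizontal extent of the fragment
    text: str


def group_into_lines(fragments, y_tol=2.0):
    """Group fragments whose baselines are within y_tol of the line's first fragment."""
    lines = []
    # Visit fragments top to bottom (descending y), then left to right.
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if lines and abs(lines[-1][0].y - frag.y) <= y_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, order fragments left to right.
    return [sorted(line, key=lambda f: f.x) for line in lines]


def join_line(line, gap_tol=1.0):
    """Concatenate a line's fragments, adding a space where the gap exceeds gap_tol."""
    parts = []
    prev_end = None
    for frag in line:
        if prev_end is not None and frag.x - prev_end > gap_tol:
            parts.append(" ")
        parts.append(frag.text)
        prev_end = frag.x + frag.width
    return "".join(parts)


def reassemble(fragments, y_tol=2.0, gap_tol=1.0):
    """Return the text of each reconstructed line, top to bottom."""
    return [join_line(line, gap_tol) for line in group_into_lines(fragments, y_tol)]
```

Fixed thresholds like these are the weak point. In real documents the right tolerances depend on font size, and rotated text, multiple columns and super/subscripts all need extra handling.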
Rather than continuing to develop the above, I personally went with pstotext from the Ghostscript suite. It has a `-bboxes` option to output text positions, and it does attempt to assemble words and lines. Despite its name, it will work on PDF files.
Another program I looked at was pdfminer.
One of these, or something similar, might work. It's just a matter of how good a job they do.
- David
Re^4: CAM::PDF extract text and their coordinates from pdf..
by umesh_epub (Novice) on Jan 10, 2013 at 13:04 UTC
by snoopy (Curate) on Jan 10, 2013 at 23:19 UTC