Read highlighted text from PDF

IB2017 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Read highlighted text from PDF by vr (Curate) on Sep 28, 2018 at 11:16 UTC
If "highlighted text" is done, as it should be, with a "colored overlay", then you can extract the highlights' positions like this:

    use strict;
    use warnings;
    use feature 'say';
    use CAM::PDF;

    my $pdf  = CAM::PDF-> new( $ARGV[ 0 ]) or die;
    my $page = $pdf-> getPage( 1 );
    my $anns = $pdf-> getValue( $page-> { Annots } or die );

    for ( @$anns ) {
        my $ann = $pdf-> getValue( $_ );
        next unless $pdf-> getValue( $ann-> { Subtype }) eq 'Highlight';
        say $ann;
        say "\t$_" for map $pdf-> getValue( $_ ),
            @{ $pdf-> getValue( $ann-> { QuadPoints })}
    }

    __END__
    HASH(0xd79f0c)
            237.641
            651.308
            271.059
            651.308
            237.641
            641.602
            271.059
            641.602
            61.4118
            637.963
            92.1406
            637.963
            61.4118
            628.257
            92.1406
            628.257
    HASH(0xe8f43c)
            288.529
            611.271
            320.753
            611.271
            288.529
            601.566
            320.753
            601.566

Large PDF. They still haven't fixed the wrong order of points in that picture, so take care.

Also extract XML with the bounding box coordinates of each character. Doing it with pure Perl is possible, but involves too much low-level work. I prefer mutool (mudraw, if older versions are packaged for your OS) and its "stext" output; GS might also do. Adjust for (0,0) being the upper left page corner.

Walk over the character nodes, start extracting text when a BB is inside any quad, and keep going until you leave that quad. Continue till page end. It's really easy.
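The walk described above might be sketched roughly as follows. This is only an illustration under assumptions, not vr's actual code: it assumes the character boxes have already been parsed out of the "stext" XML into hashes of the form { c => 'x', bbox => [ x0, y0, x1, y1 ] } with (0,0) at the upper left, and that the page height (for example from the MediaBox) is known. The sub names and the $page_height parameter are made up for this example. It also tests the centre of each character box rather than the whole box, which is a slightly more tolerant variant of "BB is inside the quad".

```perl
use strict;
use warnings;

# Turn the 8 numbers of one quad into [ xmin, ymin, xmax, ymax ].
# Sorting makes the result independent of the order of the points.
sub quad_to_rect {
    my @q = @_;
    my @x = sort { $a <=> $b } @q[ 0, 2, 4, 6 ];
    my @y = sort { $a <=> $b } @q[ 1, 3, 5, 7 ];
    return [ $x[0], $y[0], $x[-1], $y[-1] ];
}

# Split a flat QuadPoints list into rectangles, 8 numbers per quad.
sub quads_to_rects {
    my @pts = @_;
    my @rects;
    push @rects, quad_to_rect( splice @pts, 0, 8 ) while @pts >= 8;
    return @rects;
}

# Convert a box measured from the upper left corner into PDF
# coordinates, where (0,0) is the lower left corner of the page.
sub flip_box {
    my ( $box, $page_height ) = @_;
    my ( $x0, $y0, $x1, $y1 ) = @$box;
    return [ $x0, $page_height - $y1, $x1, $page_height - $y0 ];
}

# True if the centre of the box lies inside the rectangle.
sub centre_in_rect {
    my ( $box, $rect ) = @_;
    my $cx = ( $box->[0] + $box->[2] ) / 2;
    my $cy = ( $box->[1] + $box->[3] ) / 2;
    return $cx >= $rect->[0] && $cx <= $rect->[2]
        && $cy >= $rect->[1] && $cy <= $rect->[3];
}

# Walk the characters in document order and collect each run of
# consecutive characters that fall inside any highlight rectangle.
sub highlighted_runs {
    my ( $chars, $rects, $page_height ) = @_;
    my ( @runs, $current );
    for my $ch (@$chars) {
        my $box = flip_box( $ch->{bbox}, $page_height );
        if ( grep { centre_in_rect( $box, $_ ) } @$rects ) {
            $current .= $ch->{c};
        }
        elsif ( defined $current ) {
            push @runs, $current;    # we just left a highlighted stretch
            undef $current;
        }
    }
    push @runs, $current if defined $current;
    return @runs;
}
```

To use it, one would pass the numbers printed for each annotation's QuadPoints to quads_to_rects, and then call highlighted_runs( \@chars, \@rects, $page_height ) with the characters of the same page.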
Re: Read highlighted text from PDF by LanX (Saint) on Sep 28, 2018 at 00:35 UTC
> Any ideas?

I suppose there are multiple ways to get a "highlighted text" effect in PDF. If it's achieved by changing the font number, you may want to try `pdftohtml -xml` and check whether it shows in the output. Otherwise no idea ... good luck if it's done with a colored overlay.

Cheers Rolf
(addicted to the Perl Programming Language :)
Perl is like chess, only without the dice
Re: Read highlighted text from PDF by bliako (Monsignor) on Sep 28, 2018 at 09:30 UTC
I had good results (for a couple of pages only) with the OCR approach to extracting text from PDF. I was impressed that it also worked relatively well for equations, extracting them as LaTeX. I used a demo copy of a commercial program called InftyReader (run on Linux via Wine); the demo allows only 5 pages of text per day, but you may want to test it on your own documents. I only had 2 pages to do, and it was a very high quality PDF document produced by LaTeX whose source we had lost.

For setting up your own OCR engine there is Tesseract, and there are Perl modules (e.g. Image::OCR::Tesseract) to interact with it. Or you may prefer to interface to it with OpenCV (C++), which will also give you access to its vast library of image processing algorithms for de-noising etc. I have not done this myself on a large scale, only to play around, and that was a few years back. I remember it was "difficult" to set up. It would be interesting to see if that works for you.

The important thing with Tesseract is that it allows for training and learning on sampled text. So, if your text volume is large enough to justify the investment and is relatively constant in fonts and layout, you may be lucky and create something which works with beyond 90% success.

Update: in the case of color-highlighted text, OCR should work very well, because you can do image pre-processing and separate text by color, or even by font and its attributes (bold or italic). This means that combining the OCR approach with the source-code-reversal approach we usually try with `pdfto*` will give you extra power.

bw, bliako
Re: Read highlighted text from PDF by ablanke (Monsignor) on Sep 28, 2018 at 09:40 UTC
Hi IB2017, I would suggest CAM::PDF. The functions getPageContent or getPageContentTree could be useful for you. In order to understand the output, please consult the PDF Reference from Adobe.