If the "highlighted text" is done, as it should be, with a "colored overlay" (a Highlight annotation), then you can extract the highlights' positions like this:
use strict;
use warnings;
use feature 'say';
use CAM::PDF;

my $pdf  = CAM::PDF-> new( $ARGV[ 0 ]) or die;
my $page = $pdf-> getPage( 1 );
# die if the page has no annotations at all
my $anns = $pdf-> getValue( $page-> { Annots } or die );

for ( @$anns ) {
    my $ann = $pdf-> getValue( $_ );
    next unless $pdf-> getValue( $ann-> { Subtype }) eq 'Highlight';
    say $ann;
    # QuadPoints is a flat array: 8 numbers (4 x,y points) per quad
    say "\t$_" for map $pdf-> getValue( $_ ),
        @{ $pdf-> getValue( $ann-> { QuadPoints })}
}
__END__
HASH(0xd79f0c)
	237.641
	651.308
	271.059
	651.308
	237.641
	641.602
	271.059
	641.602
	61.4118
	637.963
	92.1406
	637.963
	61.4118
	628.257
	92.1406
	628.257
HASH(0xe8f43c)
	288.529
	611.271
	320.753
	611.271
	288.529
	601.566
	320.753
	601.566
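The first annotation above prints 16 numbers, i.e. two quads of four (x, y) points each, which is what a highlight spanning two lines looks like. A minimal sketch of grouping such a flat list into rectangles follows. `quads_to_rects` is a hypothetical helper name, not part of CAM::PDF, and the sketch assumes the quads are axis-aligned (unrotated text):

```perl
use strict;
use warnings;
use List::Util qw( min max );

# Split a flat QuadPoints list (8 numbers per quad) into bounding
# rectangles [ x_min, y_min, x_max, y_max ], still in PDF coordinates
# (origin at the lower left page corner).
sub quads_to_rects {
    my @n = @_;
    die "QuadPoints length must be a multiple of 8\n" if @n % 8;
    my @rects;
    while ( my @q = splice @n, 0, 8 ) {
        my @x = @q[ 0, 2, 4, 6 ];
        my @y = @q[ 1, 3, 5, 7 ];
        push @rects, [ min( @x ), min( @y ), max( @x ), max( @y )];
    }
    return @rects;
}

# The 16 numbers of the first annotation in the output above
my @rects = quads_to_rects(
    237.641, 651.308, 271.059, 651.308, 237.641, 641.602, 271.059, 641.602,
    61.4118, 637.963, 92.1406, 637.963, 61.4118, 628.257, 92.1406, 628.257,
);
print join( ' ', @$_ ), "\n" for @rects;
```

Taking min/max over all four points makes the result independent of the order in which the producer wrote the corners.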
Large PDF. They still haven't fixed the wrong order of points in that picture, so take care: in the output above each quad is listed as upper left, upper right, lower left, lower right. Next, extract XML with the bounding box coordinates of each character. Doing it in pure Perl is possible, but involves too much low-level work. I prefer mutool (mudraw, if an older version is packaged for your OS) and its "stext" output; Ghostscript might also do. Adjust for (0,0) being the upper left page corner in that output, whereas QuadPoints use PDF coordinates with the origin at the lower left. Walk over the character nodes: start extracting text when a character's bounding box is inside any quad, and stop when you leave that quad. Continue till the end of the page. It's really easy.
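The walk over character nodes can be sketched in pure Perl once the character boxes have been parsed out of the XML. Everything here is an assumption for illustration: the character records are hand-made stand-ins for parsed "stext" nodes (element and attribute names differ between MuPDF versions, so parsing is left out), the page height of 792 is just US Letter, and `flip_rect`, `inside` and `highlighted_text` are hypothetical helper names. The sketch tests the centre of each character box rather than the whole box, since glyph boxes often stick out of the highlight a little.

```perl
use strict;
use warnings;

# Flip a PDF rectangle [ x0, y0, x1, y1 ] (origin lower left) into
# top-left-origin coordinates, given the page height.
sub flip_rect {
    my ( $r, $page_h ) = @_;
    return [ $r->[0], $page_h - $r->[3], $r->[2], $page_h - $r->[1] ];
}

# True if the centre of character box $c lies inside rectangle $r.
sub inside {
    my ( $c, $r ) = @_;
    my $cx = ( $c->[0] + $c->[2] ) / 2;
    my $cy = ( $c->[1] + $c->[3] ) / 2;
    return $cx >= $r->[0] && $cx <= $r->[2]
        && $cy >= $r->[1] && $cy <= $r->[3];
}

# Walk characters in document order; start a new run of text whenever
# we enter a quad, and close it when we leave that quad.
sub highlighted_text {
    my ( $chars, $rects ) = @_;
    my ( @runs, $current );
    for my $ch ( @$chars ) {
        my ( $hit ) = grep { inside( $ch->{ bbox }, $rects->[$_] ) }
            0 .. $#$rects;
        if ( defined $hit ) {
            push @runs, '' if !defined $current || $current != $hit;
            $runs[-1] .= $ch->{ c };
        }
        $current = $hit;
    }
    return @runs;
}

# Made-up data: one highlight rectangle in PDF coordinates and five
# characters with top-left-origin boxes; page height 792 is assumed.
my @rects = map { flip_rect( $_, 792 ) } [ 100, 700, 200, 712 ];
my @chars = (
    { c => 'x', bbox => [  90, 81,  98, 91 ] },
    { c => 'f', bbox => [ 102, 81, 108, 91 ] },
    { c => 'o', bbox => [ 108, 81, 114, 91 ] },
    { c => 'o', bbox => [ 114, 81, 120, 91 ] },
    { c => 'y', bbox => [ 205, 81, 211, 91 ] },
);
print "$_\n" for highlighted_text( \@chars, \@rects );   # prints "foo"
```

With real data, take the page height from the page's MediaBox instead of hard-coding it.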