Read highlighted text from PDF

by IB2017 (Pilgrim)
on Sep 27, 2018 at 20:01 UTC ( [id://1223184]=perlquestion )

IB2017 has asked for the wisdom of the Perl Monks concerning the following question:

Hello

I have a (huge) bunch of PDF files with highlighted textual parts in them. I need to extract these parts programmatically. I am already able to extract text from PDFs, but without any distinction between highlighted and non-highlighted text. I was wondering if somebody knows of a module or approach that could help me do this. There is software out there that can do it (for example Zotero), so it should be technically possible. However, I haven't found any module that helps with this, nor any information on the Web pointing to a solution. Any ideas?

Replies are listed 'Best First'.
Re: Read highlighted text from PDF
by vr (Curate) on Sep 28, 2018 at 11:16 UTC

    If the "highlighted text" is done, as it should be, with a "colored overlay", then you can extract the highlights' positions like this:

    use strict;
    use warnings;
    use feature 'say';
    use CAM::PDF;

    my $pdf  = CAM::PDF-> new( $ARGV[ 0 ]) or die;
    my $page = $pdf-> getPage( 1 );
    my $anns = $pdf-> getValue( $page-> { Annots } or die );

    for ( @$anns ) {
        my $ann = $pdf-> getValue( $_ );
        next unless $pdf-> getValue( $ann-> { Subtype }) eq 'Highlight';
        say $ann;
        say "\t$_" for map $pdf-> getValue( $_ ),
                           @{ $pdf-> getValue( $ann-> { QuadPoints })}
    }

    __END__

    HASH(0xd79f0c)
        237.641
        651.308
        271.059
        651.308
        237.641
        641.602
        271.059
        641.602
        61.4118
        637.963
        92.1406
        637.963
        61.4118
        628.257
        92.1406
        628.257
    HASH(0xe8f43c)
        288.529
        611.271
        320.753
        611.271
        288.529
        601.566
        320.753
        601.566

    See the PDF reference (a large PDF) for what QuadPoints means. They still haven't fixed the wrong order of points in the picture there, so take care. You also need to extract XML with the bounding-box coordinates of each character. Doing that in pure Perl is possible, but involves too much low-level work. I prefer mutool (mudraw, if an older version is packaged for your OS) and its "stext" output; Ghostscript might also do. Adjust for (0,0) being the upper-left page corner. Then walk over the character nodes: start extracting text when a character's bounding box is inside any quad, and keep going until you leave that quad. Continue until the end of the page. It's really easy.
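The walk described above can be sketched in a few lines of Perl. This is only an illustration under stated assumptions: the page height, the single quad and the character boxes are made-up sample data (in practice they would come from the QuadPoints arrays and from the parsed "stext" XML), the sub names are hypothetical, and a character counts as highlighted when the centre of its bounding box falls inside the quad's enclosing rectangle. A QuadPoints array holding several quads (8 numbers each) would first be split into groups of 8.

```perl
use strict;
use warnings;

# Assumed sample data, not taken from a real file.
my $page_height = 792;    # page height in points (assumption: US Letter)

# One quad as x1 y1 x2 y2 x3 y3 x4 y4 in PDF space (origin at bottom left).
my @quads = ( [ 100, 700, 150, 700, 100, 688, 150, 688 ] );

# Characters as [ glyph, x0, y0, x1, y1 ] with the origin at the top left,
# each 10 points wide, laid out on one line.
my @chars;
my $x = 60;
for my $glyph ( split //, 'The quick fox' ) {
    push @chars, [ $glyph, $x, 93, $x + 10, 103 ];
    $x += 10;
}

# Turn a quad into an axis-aligned rectangle [ x0, y0, x1, y1 ] in
# top-left-origin space by flipping the y axis. Using min/max makes the
# result independent of the order in which the four points are stored.
sub quad_to_rect {
    my ( $quad, $height ) = @_;
    my @xs = @{$quad}[ 0, 2, 4, 6 ];
    my @ys = map { $height - $_ } @{$quad}[ 1, 3, 5, 7 ];
    my ( $x0, $x1 ) = ( sort { $a <=> $b } @xs )[ 0, -1 ];
    my ( $y0, $y1 ) = ( sort { $a <=> $b } @ys )[ 0, -1 ];
    return [ $x0, $y0, $x1, $y1 ];
}

# Return one string per quad, made of the characters whose box centre
# lies inside that quad's rectangle, in reading order.
sub highlighted_text {
    my ( $chars, $quads, $height ) = @_;
    my @found;
    for my $rect ( map { quad_to_rect( $_, $height ) } @$quads ) {
        my $text = '';
        for my $ch (@$chars) {
            my ( $glyph, $x0, $y0, $x1, $y1 ) = @$ch;
            my ( $cx, $cy ) = ( ( $x0 + $x1 ) / 2, ( $y0 + $y1 ) / 2 );
            $text .= $glyph
              if $cx >= $rect->[0] && $cx <= $rect->[2]
              && $cy >= $rect->[1] && $cy <= $rect->[3];
        }
        push @found, $text;
    }
    return @found;
}

print "$_\n" for highlighted_text( \@chars, \@quads, $page_height );
```

With the sample data this prints the word "quick". Taking the min and max of the quad's coordinates sidesteps the point-order question, at the cost of being inexact for rotated highlights.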

Re: Read highlighted text from PDF
by LanX (Saint) on Sep 28, 2018 at 00:35 UTC
    > Any ideas?

    I suppose there are multiple ways to achieve a "highlighted text" effect in PDF.

    If it's achieved by changing the font number, you may want to try pdftohtml -xml and check whether it shows in the output.

    Otherwise no idea ... good luck if it's done with a colored overlay.
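To check whether the font number distinguishes the marked text, one could group the text of the pdftohtml -xml output by its font attribute. The sketch below is a rough illustration: the XML fragment is a made-up sample in the shape that output usually has (fontspec and text elements), and the regex parsing is a shortcut for a self-contained example. A real script would more likely use an XML parser such as XML::LibXML.

```perl
use strict;
use warnings;

# Made-up sample in the general shape of "pdftohtml -xml" output.
my $xml = <<'XML';
<page number="1" height="792" width="612">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<fontspec id="1" size="12" family="Times" color="#ff0000"/>
<text top="100" left="50" width="80" height="14" font="0">plain words</text>
<text top="100" left="130" width="60" height="14" font="1">marked words</text>
<text top="120" left="50" width="80" height="14" font="0">more plain</text>
</page>
XML

my ( %color_of, %text_by_font );

# Remember the colour declared for each font id.
while ( $xml =~ /<fontspec\s+id="(\d+)"[^>]*\bcolor="([^"]+)"/g ) {
    $color_of{$1} = $2;
}

# Collect the text runs per font id.
while ( $xml =~ /<text\b[^>]*\bfont="(\d+)"[^>]*>(.*?)<\/text>/g ) {
    push @{ $text_by_font{$1} }, $2;
}

for my $id ( sort { $a <=> $b } keys %text_by_font ) {
    printf "font %s (%s): %s\n", $id, $color_of{$id} // 'unknown',
      join ' ', @{ $text_by_font{$id} };
}
```

If the highlighted passages consistently end up under one font id, that id can serve as the filter. If all text shares the same fonts, the highlight is probably an annotation or overlay and this approach will not help.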

    Cheers Rolf
    (addicted to the Perl Programming Language :)

Re: Read highlighted text from PDF
by bliako (Monsignor) on Sep 28, 2018 at 09:30 UTC

    I had good results (for a couple of pages only) with the OCR approach to extracting text from PDF. I was impressed that it also worked relatively well for equations, extracting them as LaTeX. I used a demo copy of a commercial program called InftyReader (run on Linux via Wine), which allows only 5 pages of text per day, but you may want to try it and see how far you get. I only had 2 pages to do, and it was a very high-quality PDF document produced by LaTeX whose source we had lost.

    For setting up your own OCR engine there is Tesseract, and there are Perl modules (e.g. Image::OCR::Tesseract) to interact with it. Or you may prefer to interface to it with OpenCV (C++), which will also give you access to its vast library of image-processing algorithms for de-noising etc.

    I have not done it myself on a large scale, only to play with it, and that was a few years back. I remember it was "difficult" to set up. It would be interesting to see if it works for you.

    The important thing with Tesseract is that it allows for training and learning on sampled text. So, if your text volume is large enough to justify the investment and is relatively constant in fonts and layout, you may be lucky and create something that works with more than 90% success.

    Update: in the case of color-highlighted text, OCR should work very well, because you can pre-process the image and separate text with respect to color, or even with respect to font and its attributes (bold or italic). This means that combining the OCR approach with the source-code-reversal approach we usually try with pdfto* will give you extra power.
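The color pre-processing step could look roughly like the following pure-Perl sketch. Everything here is an assumption for illustration: the image is a tiny hand-written grid of RGB triples rather than a rendered page, the sub names are hypothetical, and the thresholds for "highlighter yellow" would need tuning per document. The idea is to find the bounding box of the highlighted pixels so that only that region is cropped and passed to the OCR engine.

```perl
use strict;
use warnings;

# Treat a pixel as highlighter yellow when red and green are high and blue
# is clearly lower. The thresholds are assumptions, not measured values.
sub is_highlight {
    my ( $red, $green, $blue ) = @_;
    return $red > 200 && $green > 200 && $blue < 160;
}

# Return ( x0, y0, x1, y1 ) of the highlighted pixels, or an empty list if
# there are none. $image is an array of rows, each row an array of
# [ r, g, b ] triples.
sub highlight_bbox {
    my ($image) = @_;
    my ( $x0, $y0, $x1, $y1 );
    for my $y ( 0 .. $#$image ) {
        for my $x ( 0 .. $#{ $image->[$y] } ) {
            next unless is_highlight( @{ $image->[$y][$x] } );
            $x0 = $x if !defined $x0 || $x < $x0;
            $x1 = $x if !defined $x1 || $x > $x1;
            $y0 = $y if !defined $y0 || $y < $y0;
            $y1 = $y if !defined $y1 || $y > $y1;
        }
    }
    return defined $x0 ? ( $x0, $y0, $x1, $y1 ) : ();
}

# Made-up 5x3 image: white page, one yellow stripe with a dark "text" pixel.
my ( $W, $Y, $K ) = ( [ 255, 255, 255 ], [ 255, 255, 100 ], [ 0, 0, 0 ] );
my @image = (
    [ $W, $W, $W, $W, $W ],
    [ $W, $Y, $K, $Y, $W ],
    [ $W, $W, $W, $W, $W ],
);

my @bbox = highlight_bbox( \@image );
print "highlight bbox: @bbox\n";
```

Note that the dark text pixel inside the stripe is not itself yellow, which is why the sketch returns a bounding box instead of a per-pixel mask: the crop keeps the glyphs along with their background.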

    bw, bliako

Re: Read highlighted text from PDF
by ablanke (Monsignor) on Sep 28, 2018 at 09:40 UTC
    Hi IB2017,

    I would suggest CAM::PDF.

    The functions getPageContent or getPageContentTree could be useful for you.

    In order to understand the output, please consult the PDF Reference from Adobe.
