PerlMonks  

Re: Regular Expression to Parse Data from a PDF

by vr (Curate)
on Feb 27, 2020 at 12:13 UTC


in reply to Regular Expression to Parse Data from a PDF

(OT, not really Perl)

That approach won't work in general. Text extraction from PDF always involves some level of heuristics, especially with tables and/or formatting. CAM::PDF is very naive about extraction and is good only for simple checks on a limited subset of plain English text. You may wish to take a look at the CAM::PDF::getPageContent output:

BT /Times 10 Tf
1 0 0 1 50.172 549.238 Tm 0 G [ (2/27/2020)] TJ
1 0 0 1 93.232 549.238 Tm 0 G [ (3:07)] TJ
1 0 0 1 113.512 549.238 Tm 0 G [ (AM)] TJ
1 0 0 1 428.004 549.238 Tm 0 G [ -24977 (Quotations)] TJ
1 0 0 1 724.166 549.238 Tm 0 G [ (Due)] TJ
1 0 0 1 743.326 549.238 Tm 0 G [ (By:)] TJ
1 0 0 1 760.276 549.238 Tm 0 G [ (01/22/2020)] TJ
/TimesB 14 Tf
1 0 0 1 50.172 533.238 Tm 0 G [ -17016 (ABSTRA) 55 (CT)] TJ
1 0 0 1 367.356 533.238 Tm 0 G [ (OF)] TJ
1 0 0 1 390.302 533.238 Tm 0 G [ (UNSTRAPPED)] TJ
1 0 0 1 487.91 533.238 Tm 0 G [ (\(A) 130 (W) 120 (ARDED\))] TJ
/Times 9 Tf
1 0 0 1 50.172 522.238 Tm 0 G [ (Jack) 10 (et)] TJ
1 0 0 1 100.549 522.238 Tm 0 G [ (A) 92 (wd)] TJ
1 0 0 1 150.926 522.238 Tm 0 G [ (Contractor)] TJ
1 0 0 1 150.926 511.238 Tm 0 G [ (Code)] TJ
1 0 0 1 201.303 522.238 Tm 0 G [ (Name)] TJ
%... etc.

In (very) simple English: what's inside parentheses is the text content to show; what's in between (you guessed it) are positioning and formatting commands. We are lucky that, in this trivial case, the text has a single-byte plain-ASCII encoding, so we can actually read it from the source. If you scroll down, you'll see there are no space characters inside the parens. That's why, if we select and copy in Firefox and paste into a text editor, we get an ugly glued-together mess. So Firefox is even more naive about text extraction than our CAM::PDF. The spaces only appear to be present on the page because of how the words are positioned.
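To make the "parens vs. everything else" split concrete, here is a minimal sketch that pulls the position and the text out of each Tm ... TJ pair. It assumes the stream looks exactly like the dump above (one Tm before each TJ, plain ASCII strings, no escaped backslash right before a closing paren); the name text_chunks is mine, not part of CAM::PDF. Real-world content streams need a proper tokenizer.

```perl
use strict;
use warnings;

# A few operators copied verbatim from the getPageContent dump.
my $content = <<'END';
BT /TimesB 14 Tf
1 0 0 1 50.172 533.238 Tm 0 G [ -17016 (ABSTRA) 55 (CT)] TJ
1 0 0 1 367.356 533.238 Tm 0 G [ (OF)] TJ
1 0 0 1 390.302 533.238 Tm 0 G [ (UNSTRAPPED)] TJ
1 0 0 1 487.91 533.238 Tm 0 G [ (\(A) 130 (W) 120 (ARDED\))] TJ
END

# Hypothetical helper: one { x, y, text } hash per "x y Tm ... [ ... ] TJ".
sub text_chunks {
    my ($stream) = @_;
    my @chunks;
    while ($stream =~ /(-?[\d.]+)\s+(-?[\d.]+)\s+Tm\b.*?\[(.*?)\]\s*TJ\b/gs) {
        my ($x, $y, $array) = ($1, $2, $3);
        my $text = '';
        # Strings are the parts in parens; the numbers between them are kerning.
        while ($array =~ /\((.*?)(?<!\\)\)/gs) {
            my $piece = $1;
            $piece =~ s/\\([()\\])/$1/g;    # undo \( \) \\ escapes
            $text .= $piece;
        }
        push @chunks, { x => $x, y => $y, text => $text };
    }
    return @chunks;
}

printf "%8.3f %8.3f  %s\n", @$_{qw(x y text)} for text_chunks($content);
```

Note that none of the four extracted chunks contains a space; the only hint that they are separate words is the x position in front of each one.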

(Of course, it's not always so for all PDFs out there. Some use spaces. Some use kerning. Some use a separate text object (bracketed by a BT/ET pair, like the whole page in your file) for each and every character. The thing to remember: PDF is always machine-generated stuff on a long and familiar TIMTOWTDI leash, intended to be consumed by machines. Better not to worry or ask too many "why?"s.)

CAM::PDF has spaces in its extracted text, even, as you noticed, where they should not be. It decided to play it safe but simple. Usually (not always...) text is split between text-showing operators (TJ and friends) into chunks no smaller than a word. So, if we want to join chunks on extraction and are too lazy to analyze horizontal offsets, let's insert a space. (Actually, Adobe Reader is smart enough to add spaces where appropriate for this file.)
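The "too lazy to analyze offsets" strategy can be sketched in a few lines: group chunks by baseline, sort left to right, join with a single space. This is my illustration of the idea, not CAM::PDF's actual code; the chunks are the already-decoded strings and Tm positions from the dump above, and naive_lines is a made-up name.

```perl
use strict;
use warnings;

# Decoded chunks with their Tm positions, taken from the dump above.
my @chunks = (
    { x => 50.172,  y => 549.238, text => '2/27/2020' },
    { x => 93.232,  y => 549.238, text => '3:07' },
    { x => 113.512, y => 549.238, text => 'AM' },
    { x => 428.004, y => 549.238, text => 'Quotations' },
    { x => 724.166, y => 549.238, text => 'Due' },
    { x => 743.326, y => 549.238, text => 'By:' },
    { x => 760.276, y => 549.238, text => '01/22/2020' },
    { x => 50.172,  y => 533.238, text => 'ABSTRACT' },
    { x => 367.356, y => 533.238, text => 'OF' },
    { x => 390.302, y => 533.238, text => 'UNSTRAPPED' },
    { x => 487.91,  y => 533.238, text => '(AWARDED)' },
);

# Hypothetical helper: one output line per distinct y, chunks joined by ' '.
sub naive_lines {
    my (@chunks) = @_;
    my %by_y;
    push @{ $by_y{ $_->{y} } }, $_ for @chunks;
    my @lines;
    # PDF y grows upward, so the largest y is the top line.
    for my $y (sort { $b <=> $a } keys %by_y) {
        my @row = sort { $a->{x} <=> $b->{x} } @{ $by_y{$y} };
        push @lines, join ' ', map { $_->{text} } @row;
    }
    return @lines;
}

print "$_\n" for naive_lines(@chunks);
```

This works here because every chunk happens to be a whole word. If a generator splits a word across two text-showing operators, the same code puts a space in the middle of it, and the large gap before "Quotations" is flattened to a single space, so column information is lost.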

===

OK, I'd try (and I did, in the past) investigating the XML produced by Ghostscript. See here. Mode "0" is low level; mode "1" tries heuristics to combine text chunks, but fails for your file on quick and casual inspection (see further). (Note: I've seen the GS "txtwrite" device have issues/regressions in some releases, YMMV.)
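For reference, a sketch of how the Ghostscript call could be assembled from Perl. The switches are the standard ones for the txtwrite device (-dTextFormat selects the mode); the executable name "gs", the file names and the helper name txtwrite_command are assumptions (on Windows the binary is typically gswin64c).

```perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for a txtwrite run.
# $mode is the -dTextFormat value ("0" or "1" as discussed above).
sub txtwrite_command {
    my ($pdf, $out, $mode) = @_;
    return (
        'gs', '-q', '-dNOPAUSE', '-dBATCH',
        '-sDEVICE=txtwrite',
        "-dTextFormat=$mode",
        "-sOutputFile=$out",
        $pdf,
    );
}

my @cmd = txtwrite_command('input.pdf', 'input-mode0.xml', 0);
print join(' ', @cmd), "\n";

# To actually run it (requires Ghostscript on the PATH):
# system(@cmd) == 0 or die "gs failed: $?";
```

Passing the command as a list to system avoids shell quoting problems with file names that contain spaces.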

Mode "0", apart from the top "page" level, has "char" leaf nodes, each with a decoded character and calculated position (and also font/size), and intermediate, but actually atomic, "spans" (the "things in parens"). It's up to you, the programmer, to decide whether 2 adjacent spans are a single word, or 2 words to be separated with a space, or (with tabular data) belong to different cells.
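One possible way to make that decision is to compare the horizontal gap between two spans with the font size. The sketch below is only an illustration: the thresholds, the field names (x0, x1, size) and the coordinates in the example are all made up, and would need tuning against real mode "0" output.

```perl
use strict;
use warnings;

# Hypothetical thresholds, as fractions of the font size.
my $WORD_GAP = 0.15;    # smaller gap: same word (e.g. a kerning split)
my $CELL_GAP = 2.0;     # larger gap: a different table cell

# Classify the gap between a span and its right-hand neighbour on one line.
sub join_kind {
    my ($left, $right) = @_;
    my $gap  = $right->{x0} - $left->{x1};
    my $size = $left->{size};
    return 'same word' if $gap < $WORD_GAP * $size;
    return 'new cell'  if $gap > $CELL_GAP * $size;
    return 'space';
}

# Illustrative (invented) spans at font size 9.
my $a = { x0 => 50.0, x1 => 70.0, size => 9 };
for my $x0 (70.4, 73.0, 100.0) {
    printf "gap %5.1f: %s\n", $x0 - $a->{x1}, join_kind($a, { x0 => $x0 });
}
```

Scaling by font size keeps the same thresholds usable for the 14 pt heading and the 9 pt table text; a fixed gap in points would not.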

Mode "1" tries to consolidate spans, adding spaces, but is not very good at it (see words glued together):

<block>
  <line>
    <span bbox="288 62 568 62" font="Times-Bold" size="14.0000">
      <char bbox="288 62 299 62" c="A"/>
      <char bbox="299 62 308 62" c="B"/>
      <char bbox="308 62 316 62" c="S"/>
      <char bbox="316 62 325 62" c="T"/>
      <char bbox="325 62 335 62" c="R"/>
      <char bbox="335 62 345 62" c="A"/>
      <char bbox="345 62 355 62" c="C"/>
      <char bbox="355 62 365 62" c="T"/>
      <char bbox="365 62 376 62" c="O"/>
      <char bbox="376 62 384 62" c="F"/>
      <char bbox="384 62 394 62" c="U"/>
      <char bbox="394 62 404 62" c="N"/>
      <char bbox="404 62 412 62" c="S"/>
      <char bbox="412 62 421 62" c="T"/>
      <char bbox="421 62 432 62" c="R"/>
      <char bbox="432 62 442 62" c="A"/>
      <char bbox="442 62 450 62" c="P"/>
      <char bbox="450 62 459 62" c="P"/>
      <char bbox="459 62 468 62" c="E"/>
      <char bbox="468 62 478 62" c="D"/>
      <char bbox="478 62 483 62" c="("/>
      <char bbox="483 62 493 62" c="A"/>
      <char bbox="493 62 507 62" c="W"/>
      <char bbox="507 62 517 62" c="A"/>
      <char bbox="517 62 527 62" c="R"/>
      <char bbox="527 62 537 62" c="D"/>
      <char bbox="537 62 547 62" c="E"/>
      <char bbox="547 62 557 62" c="D"/>
      <char bbox="557 62 561 62" c=")"/>
    </span>
  </line>
</block>

and also introduces "lines" and "blocks". Again, not too bright (halves of 2 cells in header row end up in one "block"):

<block>
  <line>
    <span bbox="415 73 498 73" font="Times-Roman" size="9.0000">
      <char bbox="415 73 422 73" c="D"/>
      <char bbox="422 73 424 73" c="i"/>
      <char bbox="424 73 428 73" c="s"/>
      <char bbox="428 73 432 73" c="c"/>
      <char bbox="432 73 436 73" c="o"/>
      <char bbox="436 73 441 73" c="u"/>
      <char bbox="441 73 445 73" c="n"/>
      <char bbox="445 73 448 73" c="t"/>
      <char bbox="448 73 450 73" c=" "/>
      <char bbox="450 73 458 73" c="%"/>
      <char bbox="458 73 466 73" c=" "/>
      <char bbox="466 73 472 73" c="D"/>
      <char bbox="472 73 475 73" c="i"/>
      <char bbox="475 73 478 73" c="s"/>
      <char bbox="478 73 482 73" c="c"/>
      <char bbox="482 73 487 73" c="o"/>
      <char bbox="487 73 491 73" c="u"/>
      <char bbox="491 73 496 73" c="n"/>
      <char bbox="496 73 498 73" c="t"/>
    </span>
  </line>
</block>
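To inspect such output without reading it char by char, a small sketch that collapses each span back into its text and bounding box. It uses plain regexes and assumes the span/char layout shown above, with only the five standard named XML entities in the c attribute; span_texts is a made-up name, and for anything serious a real XML parser is the better choice.

```perl
use strict;
use warnings;

my %ENTITY = (amp => '&', lt => '<', gt => '>', quot => '"', apos => "'");

# Hypothetical helper: one { bbox => [x0, y0, x1, y1], text } per <span>.
sub span_texts {
    my ($xml) = @_;
    my @spans;
    while ($xml =~ m{<span\b([^>]*)>(.*?)</span>}gs) {
        my ($attrs, $body) = ($1, $2);
        my ($bbox) = $attrs =~ /\bbbox="([^"]*)"/;
        # Concatenate the c="..." attribute of every <char/> in the span.
        my $text = join '', $body =~ /<char\b[^>]*?\bc="([^"]*)"/g;
        $text =~ s/&(amp|lt|gt|quot|apos);/$ENTITY{$1}/g;
        push @spans, { bbox => [ split ' ', $bbox // '' ], text => $text };
    }
    return @spans;
}

# The header-row span quoted above, several chars per line for brevity.
my $xml = <<'END';
<block> <line> <span bbox="415 73 498 73" font="Times-Roman" size="9.0000">
<char bbox="415 73 422 73" c="D"/> <char bbox="422 73 424 73" c="i"/>
<char bbox="424 73 428 73" c="s"/> <char bbox="428 73 432 73" c="c"/>
<char bbox="432 73 436 73" c="o"/> <char bbox="436 73 441 73" c="u"/>
<char bbox="441 73 445 73" c="n"/> <char bbox="445 73 448 73" c="t"/>
<char bbox="448 73 450 73" c=" "/> <char bbox="450 73 458 73" c="%"/>
<char bbox="458 73 466 73" c=" "/> <char bbox="466 73 472 73" c="D"/>
<char bbox="472 73 475 73" c="i"/> <char bbox="475 73 478 73" c="s"/>
<char bbox="478 73 482 73" c="c"/> <char bbox="482 73 487 73" c="o"/>
<char bbox="487 73 491 73" c="u"/> <char bbox="491 73 496 73" c="n"/>
<char bbox="496 73 498 73" c="t"/> </span> </line> </block>
END

printf "[%s] '%s'\n", join(' ', @{ $_->{bbox} }), $_->{text} for span_texts($xml);
```

Run on this span, it shows "Discount %" and the following "Discount" merged into one piece of text, which is exactly the mode "1" problem described above.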

I'd not use mode "1", but mode "0". Find the spans containing your "jacket" string. Their vertical offsets are the table row boundaries. Judging from your 2 files, columns have constant offsets. From here you should have an idea of how to find the content of individual cells.
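A rough sketch of that row/column binning, demonstrated on the header row from the getPageContent dump because those are the only real coordinates quoted in this thread. Assumptions: PDF-style coordinates (y grows upward; Ghostscript's XML has y growing downward, so the comparisons would flip), column offsets taken from the header cells, and the literal word "Jacket" standing in for whatever pattern your jacket numbers match. The name table_cells and the 0.5 pt tolerance are mine.

```perl
use strict;
use warnings;

# Decoded chunks and Tm positions copied from the dump.
my @spans = (
    { x => 50.172,  y => 533.238, text => 'ABSTRACT' },
    { x => 50.172,  y => 522.238, text => 'Jacket' },
    { x => 100.549, y => 522.238, text => 'Awd' },
    { x => 150.926, y => 522.238, text => 'Contractor' },
    { x => 150.926, y => 511.238, text => 'Code' },
    { x => 201.303, y => 522.238, text => 'Name' },
);
my @col_x = (50.172, 100.549, 150.926, 201.303);    # constant column offsets

# Hypothetical helper: returns one array ref of cell strings per row.
sub table_cells {
    my ($spans, $anchor_re, $col_x) = @_;
    # Row boundaries: y of every span matching the anchor pattern, top first.
    my @row_y = sort { $b <=> $a }
                map  { $_->{y} }
                grep { $_->{text} =~ $anchor_re } @$spans;
    my @rows;
    for my $s (sort { $b->{y} <=> $a->{y} or $a->{x} <=> $b->{x} } @$spans) {
        # Row: the nearest anchor at or above the span.
        my ($row) = grep { $row_y[$_] >= $s->{y} } reverse 0 .. $#row_y;
        # Column: the right-most offset at or left of the span.
        my ($col) = grep { $col_x->[$_] <= $s->{x} + 0.5 } reverse 0 .. $#$col_x;
        next unless defined $row and defined $col;    # outside the table
        push @{ $rows[$row][$col] }, $s->{text};
    }
    my @table;
    for my $r (@rows) {
        push @table, [ map { join ' ', @{ $_ // [] } } @{ $r // [] } ];
    }
    return @table;
}

print join(' | ', @$_), "\n" for table_cells(\@spans, qr/^Jacket$/, \@col_x);
```

The two-line cell ("Contractor" above "Code") ends up in one column because everything below an anchor and above the next one is assigned to the same row, and "ABSTRACT" is dropped because it sits above the first anchor.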

Re^2: Regular Expression to Parse Data from a PDF
by kevyt (Scribe) on Feb 27, 2020 at 15:46 UTC
    Thanks, I used the line yesterday and it printed similar output but I could not determine how to find the correct data.
    my $str = $doc->getPageContent($p, $opts{verbose});
    1 0 0 1 516.16 280.238 Tm 0 G [ (T) 35 (imoth) 5 (y)] TJ
    1 0 0 1 549.055 280.238 Tm 0 G [ (T) 74 (.)] TJ
    1 0 0 1 516.16 269.238 Tm 0 G [ (Cole)] TJ
    1 0 0 1 566.537 280.238 Tm 0 G [ (02/06/2020)] TJ
    1 0 0 1 616.914 280.238 Tm 0 G [ (\(615\))] TJ
    1 0 0 1 638.658 280.238 Tm 0 G [ (713-0205)] TJ
    1 0 0 1 692.48 280.238 Tm 0 G [ (These)] TJ
    1 0 0 1 716.222 280.238 Tm 0 G [ (are)] TJ
    1 0 0 1 729.461 280.238 Tm 0 G [ (for)] TJ
    1 0 0 1 742.205 280.238 Tm 0 G [ (a)] TJ
    1 0 0 1 748.451 280.238 Tm 0 G [ (total)] TJ
    1 0 0 1 766.703 280.238 Tm 0 G [ (of)] TJ
    1 0 0 1 776.45 280.238 Tm 0 G [ (25,000)] TJ
    1 0 0 1 692.48 269.238 Tm 0 G [ (total)] TJ
    1 0 0 1 710.732 269.238 Tm 0 G [ (lan) 15 (yards)] TJ
    1 0 0 1 743.339 269.238 Tm 0 G [ (made)] TJ
    1 0 0 1 765.083 269.238 Tm 0 G [ (o) 15 (v) 15 (erseas)] TJ
    1 0 0 1 692.48 258.238 Tm 0 G [ (and)] TJ
    1 0 0 1 707.726 258.238 Tm 0 G
