PerlMonks
Re: Regular Expression to Parse Data from a PDF
by vr (Curate) on Feb 27, 2020 at 12:13 UTC ( [id://11113483]=note )
(OT, not really Perl) That approach won't work in general. Text extraction from PDF always involves some level of heuristics, especially with tables and/or formatting. CAM::PDF is very naive about extraction and is good only for simple checks on a limited subset of plain English text. You may wish to take a look at CAM::PDF::getPageContent output:
In (very) simple English: what's inside parentheses is the text content to show, and what's in between is (you guessed it) positioning and formatting commands. We are lucky that, in this trivial case, the text has a single-byte plain-ASCII encoding, so we can actually read it from the source. If you scroll down, you'll see there are no space characters inside the parens. That's why, if we select and copy in Firefox and paste into a text editor, we get an ugly glued-together mess. So FF is even more naive about text extraction than our CAM::PDF. The spaces only appear to be present because of how the words are positioned.

(Of course, it's not always so for all PDFs out there. Some use spaces. Some use kerning. Some use a separate text object (bracketed by a BT/ET pair, like the whole page in your file) for each and every character. The thing to remember: PDF is always machine-generated stuff on a long and familiar TIMTOWTDI leash, intended to be consumed by machines. Better not to worry or ask too many "why?"s.)

CAM::PDF does put spaces in its extracted text, even, as you noticed, where they should not be. It decided to play it safe, but simple. Usually (not always...) text is split between text-showing operators (TJ and friends) into chunks no smaller than a word. So, if we want to join chunks on extraction and are too lazy to analyze horizontal offsets, we just insert a space. (Actually, Adobe Reader is smart enough to add spaces where appropriate for this file.)

===

OK, I'd try (and I did, in the past) investigating the XML produced by Ghostscript. See here. Mode "0" is low level; mode "1" tries heuristics to combine text chunks, but fails for your file on quick and casual inspection, see further. (Note: I've seen the GS "txtwrite" device have issues/regressions in some releases, YMMV.) Mode "0", apart from the top "page" level, has "char" leaf nodes, with the decoded character and calculated position (and also font/size), and intermediate, but actually atomic, "spans" (the "things in parens").
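To make the three joining strategies concrete, here is a minimal sketch (in Python, purely for illustration; it ports directly to Perl). It assumes each chunk has already been reduced to its text plus left and right edges. The `Span` class, the sample coordinates and the `min_gap` threshold are all hypothetical, not taken from your file or from any tool's actual output.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str   # decoded text of one chunk (one "thing in parens")
    x0: float   # left edge, in points (assumed already computed)
    x1: float   # right edge, in points (assumed already computed)


def join_glued(spans):
    """Concatenate chunks as-is; words run together if the file stores no spaces."""
    return "".join(s.text for s in spans)


def join_spaced(spans):
    """Always put a space between chunks; simple, but splits words that span two chunks."""
    return " ".join(s.text for s in spans)


def join_by_gap(spans, min_gap=1.5):
    """Insert a space only where the horizontal gap between chunks is at least min_gap."""
    out = []
    prev = None
    for s in sorted(spans, key=lambda s: s.x0):
        if prev is not None and s.x0 - prev.x1 >= min_gap:
            out.append(" ")
        out.append(s.text)
        prev = s
    return "".join(out)


# Hypothetical line: "Blue jacket", with "jacket" split into two chunks.
line = [Span("Blue", 10.0, 30.0), Span("jac", 34.0, 48.0), Span("ket", 48.2, 62.0)]

print(join_glued(line))   # glued-together mess
print(join_spaced(line))  # spaces even where they should not be
print(join_by_gap(line))  # space only across the wide gap
```

A sensible `min_gap` depends on font size, so in practice it would probably be scaled by the span's size rather than fixed.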
It's up to you, the programmer, to decide whether 2 adjacent spans are a single word, 2 words to be separated by a space, or (with tabular data) belong to different cells. Mode "1" tries to consolidate spans by adding spaces, but is not very good at it (see words glued together):
and also introduces "lines" and "blocks". Again, not too bright (halves of 2 cells in the header row end up in one "block"):
I'd use mode "0" rather than mode "1". Find the spans containing your "jacket" string. Their vertical offsets are the table row boundaries. Judging from your 2 files, the columns have constant offsets. From here you should have an idea of how to find the content of individual cells.
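The row/column idea could be sketched roughly as follows (again in Python, for illustration only). It assumes the spans have already been pulled out of the mode "0" output into plain `(text, x, y)` tuples, with `y` growing upward as in PDF user space. The function name `table_cells`, the sample data, the column offsets and the tolerance `tol` are all made up for the example.

```python
from bisect import bisect_right
from collections import defaultdict


def table_cells(spans, marker, col_starts, tol=1.0):
    """Group (text, x, y) spans into cells keyed by (row, col).

    A row starts at the y of each span containing `marker` and runs down
    to the next such span. `col_starts` are the left offsets of the
    columns, in ascending order.
    """
    # Row boundaries, top row first (largest y first).
    row_tops = sorted({y for text, x, y in spans if marker in text}, reverse=True)
    cells = defaultdict(list)
    for text, x, y in spans:
        rows = [i for i, top in enumerate(row_tops) if y <= top + tol]
        if not rows:
            continue  # above the first marker row, e.g. the header
        col = bisect_right(col_starts, x) - 1
        if col < 0:
            continue  # left of the first column
        # Sort key: top-to-bottom, then left-to-right within the cell.
        cells[(rows[-1], col)].append((-y, x, text))
    return {key: " ".join(t for _, _, t in sorted(parts))
            for key, parts in cells.items()}


# Hypothetical spans: a header line, then two rows, the first with a wrapped cell.
spans = [
    ("Item", 50, 720), ("Price", 300, 720),
    ("Red jacket", 50, 700), ("19.99", 300, 700),
    ("size M", 50, 688),
    ("Blue jacket", 50, 660), ("24.50", 300, 660),
]
cells = table_cells(spans, "jacket", col_starts=[40, 290])
for key in sorted(cells):
    print(key, cells[key])
```

Within a cell, the chunks are simply joined with a space here; a gap-based join would slot in at that point if words turn out to be split across spans.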