Regular Expression to Parse Data from a PDF

kevyt has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regular Expression to Parse Data from a PDF by kcott (Archbishop) on Feb 27, 2020 at 11:50 UTC
G'day Kevin,

Firstly, I'm not a user of CAM::PDF; in fact, I didn't even have it installed. I suspect the `getPageText()` method is not the best choice for this: as you noted, you can't split lines easily and dollar amounts have an embedded space. I can't advise of a better choice; perhaps another monk can.

I would strongly recommend that you do not write lengthy regexes the way you did in the last example in your code: they are incredibly difficult to read, even more difficult to maintain, and extremely error-prone. See my code below for a much better way to do this. Also, take a look at Regexp::Debugger: I find it very helpful and, in fact, used it to check some of the fiddlier parts of the regex in the code below.

I've ignored the PDF download part of the code. You didn't ask about that: I'm assuming you've got that working satisfactorily. I just downloaded the two PDFs you referenced and accessed them from a local disk.

Some notes on how I've dealt with lack of information:

You said you wanted to "capture all of the columns except comments". You said nothing about header information or the various totals, so I've simply ignored them.

You also said nothing about output: I've captured the columnar data; I'll leave you to decide what you want to do with it.

The two example PDFs you linked only had one page each, so looping through all pages seems somewhat superfluous; however, I've left that `for` loop almost exactly as you wrote it. It's also unclear whether you want to capture data by page, document, or some other grouping: again, I'll leave you to decide.

Here's the code:

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use constant {
    AMOUNT => 3,
    ADDL_RATE_PER => 4,
    DISCOUNT_PRICE => 6,
};

use CAM::PDF;
use Data::Dumper;

my $jacket_id = $ARGV[0];
my $pdf_file = "pm_11113472_$jacket_id.pdf";
my $pdf = CAM::PDF::->new($pdf_file) or die $CAM::PDF::errstr;

my $re = qr{(?x:
    \A
    \s*?
    ((?:A|))                    # Awd
    \s+
    (\d+-\d+)                   # Contractor Code
    \s+
    ([^\$]+?)                   # Name
    \s+
    (\$\s[0-9,.]+)              # Amount
    \s+
    (\$\s[0-9,.]+\s[A-Z])       # Add'l Rate/PER
    \s+
    ([0-9.]+\s+\d+)             # Discount % Days
    \s+
    (\$\s[0-9,.]+)              # Discount Price
    \s+
    ([\D]+?)                    # Bidders Name
    \s+
    (\S+)                       # Date Received
    \s+
    (\(\d+\)\s\d+-\d+)          # Phone Number
)};

for my $page_num (1 .. $pdf->numPages) {
    my $text = $pdf->getPageText($page_num);
    my @lines;
    my $wanted_line = 0;

    for my $line (split /$jacket_id/, $text) {
        next unless $wanted_line++;

        my @fields = $line =~ $re;
        $fields[AMOUNT] =~ y/ //d;
        $fields[ADDL_RATE_PER] =~ s/ //;
        $fields[DISCOUNT_PRICE] =~ y/ //d;

        push @lines, [ $jacket_id, @fields ];
    }

    print Dumper(\@lines);
}
```

Here's the first part of the output using your first example PDF:

```
$ ./pm_11113472_pdf_parse.pl 746810
$VAR1 = [
          [
            '746810',
            'A',
            '140-89226',
            'UNION HOERMANN PRESS',
            '$844.00',
            '$15.00 C',
            '1 20',
            '$835.56',
            'Randy Sigman',
            '01/22/2020',
            '(563) 582-3631'
          ],
          [
            '746810',
            '',
            '190-38407',
            'GRAPHIC VISIONS',
            '$869.00',
            '$140.00 M',
            '0.5 20',
            '$864.66',
            'Howard Roskosky',
            '01/22/2020',
            '(301) 987-5586'
          ],
```

-- Ken
Re^2: Regular Expression to Parse Data from a PDF by kevyt (Scribe) on Feb 27, 2020 at 16:35 UTC
This is very cool. Thanks! The next step will be to grab the Title, Quantity and a few fields from links like this one: https://contractorconnection.gpo.gov/RequestOpenJobs/770893

Thanks very much!
Re^2: Regular Expression to Parse Data from a PDF by kevyt (Scribe) on Feb 28, 2020 at 02:25 UTC
Ken,

Thanks very much for your help. It's working great, but I forgot about one issue: they might add a "R-1" or "R-2" to the far left column if there is a revision. I have not used Perl much since 2006 and I rarely used regex. I also tried to get some of the comments, but that won't be important going forward.

Example with R-1: https://contractorconnection.gpo.gov/abstract/777292
Example without R-1: https://contractorconnection.gpo.gov/abstract/777293

I also need to install CAM::PDF so I can run it on Linux.

```perl
#!/usr/bin/perl -w
# use warnings;
# use strict;
use CAM::PDF;
use LWP::Simple;
use Data::Dumper;

use constant {
    AMOUNT => 0,
    ADDL_RATE_PER => 0,
    DISCOUNT_PRICE => 0,
};

#### These will be used to load different database tables #####
$companies = 'c:\Users\Kevin\Documents\dev\data_files\gpo_companies.csv';
$bids = 'c:\Users\Kevin\Documents\dev\data_files\gpo_bids.csv';
$awards = 'c:\Users\Kevin\Documents\dev\data_files\gpo_awards.csv';
$solicit = 'c:\Users\Kevin\Documents\dev\data_files\gpo_solicitations.csv';
$log_file = 'c:\Users\Kevin\Documents\dev\data_files\gpo_log.csv';

#### This file will be imported into excel (temp. solution so I won't have to create the db tables now)
$all_file = 'c:\Users\Kevin\Documents\dev\data_files\gpo_abstract_data.csv';

open (COMPANY, ">> $companies") or die ("Can't open the output file $!");
open (BID, ">> $bids") or die ("Can't open the output file $!");
open (AWARD, ">> $awards") or die ("Can't open the output file $!");
open (SOLICIT, ">> $solicit") or die ("Can't open the output file $!");
open (LOG, ">> $log_file") or die ("Can't open the output file $!");
open (OUT, ">> $all_file") or die ("Can't open the output file $!");

print OUT "Jacket_ID,Award,Contractor_Code,Company_Name,Amount,Addl_Rate,Addl_Rate_Per,Discount_Percent,Discount_Days,Discount_Price,Bidders_Name,Date_Received,Phone_Number\n";

my @ns_headers = (
    'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset' => 'iso-8859-1,*,utf-8',
    'Accept-Language' => 'en-US',
);

my $jacket_id = 777390; # Get the most recent data first

while ($jacket_id > 700000){
    sleep (2);
    $jacket_id --;
    my $ua = LWP::UserAgent->new;
    $ua->timeout(5);

    # Is the site available?
    print $jacket_id . "\n";
    my $response = $ua->get('https://contractorconnection.gpo.gov/abstract/'. $jacket_id , @ns_headers);
    if ( $response =~ /Abstract Unavailable/){
        print LOG $jacket_id . ",Unavailable\n";
        next;
    }

    my $pdf = CAM::PDF->new($response->content) || print LOG $jacket_id . ",ERROR,\n". next;

    my $re = qr{(?x:
        \A
        \s?
        ((?:A|))                    # Awd - 0
        \s+
        (\d+-\d+)                   # Contractor Code - 1
        \s+
        ([^\$]+?)                   # Name - 2
        \s+
        (\$\s[0-9,.]+)              # Amount - 3
        \s+
        ### (\$\s[0-9,.]+\s[A-Z])   # Add'l Rate/PER - 4
        (\$\s[0-9,.]+)              # Add'l Rate - 4
        \s+
        ([^\$]+?)                   # Add'l Rate's Per - 5
        \s+
        ### ([0-9.]+\s+\d+)         # Discount % Days - 6
        ([0-9.]+)                   # Discount % - 6
        \s+
        (\d+)                       # Discount Days - 7
        \s+
        (\$\s[0-9,.]+)              # Discount Price - 8
        \s+
        ([\D]+?)                    # Bidders Name - 9
        \s+
        (\S+)                       # Date Received - 10
        \s+
        (\(\d+\)\s\d+-\d+)          # Phone Number - 11
    )};

    for my $page_num (1 .. $pdf->numPages) {
        my $text = $pdf->getPageText($page_num);
        my @lines;
        my $wanted_line = 0;

        for my $line (split /$jacket_id/, $text) {
            # print $line;
            next unless $wanted_line++;
            my @fields = $line =~ $re;
            $fields[AMOUNT] =~ y/ //d;
            $fields[ADDL_RATE_PER] =~ s/ //;
            $fields[DISCOUNT_PRICE] =~ y/ //d;
            $fields[3] =~ s/\s+//g; # Remove the space between the $ and digit
            $fields[4] =~ s/\s+//g; # Remove the space between the $ and digit
            $fields[8] =~ s/\s+//g; # Remove the space between the $ and digit
            foreach (@fields){
                $_ =~ s/\,//;
            }
            push @lines, [ $jacket_id, @fields ];

            #             Contractor Code   Company Name      Bidders Name      Phone Number
            print COMPANY $fields[1] . ",". $fields[2] . ",". $fields[9] . ",". $fields[11] . "\n";

            #             Title Quantity Contact Winning_Contractor
            print SOLICIT $jacket_id . ",,,,". $fields[1] . "\n";

            if($fields[0] =~ /A/){
                #           Contractor Code   Date Received
                print AWARD $jacket_id . ",". $fields[1] . ",". $fields[10] . "\n";
            }

            # Contractor Code, Amount, Add'l Rate, Add'l Rate's Per, Discount Days, Discount %, Discount Price
            print BID $jacket_id . ",". $fields[1] . ",". $fields[3] . ",". $fields[4] . ",". $fields[5] . ",". $fields[7] . ",". $fields[6] . ",". $fields[8] . "\n";

            print OUT $jacket_id . ",". $fields[0] . ",". $fields[1] . ",". $fields[2] . ",". $fields[3] . ",". $fields[4] . ",". $fields[5] . ",".
                $fields[6] . ",". $fields[7] . ",". $fields[8] . ",". $fields[9] . ",". $fields[10] . ",". $fields[11] . "\n";

            # foreach my $field (@fields){
            #     print $field . ",";
            # }
            # print "\n";
        }
        # print Dumper(\@lines);
    }
} # End while ()
```
Re^3: Regular Expression to Parse Data from a PDF by kcott (Archbishop) on Feb 28, 2020 at 06:28 UTC
'They might add a "R-1" or "R-2" to the far left column if there is a revision.'

You just need to extend the regex to handle that. Here's an example:

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use Data::Dumper;

my $jid = '777';
my $text = 'header 777 111 777 A 222 777R-1 333 777R-2 A 444';

my $re = qr{(?x:
    \A
    (R-\d+|)
    \s?
    (A|)
    \s
    (\d+)
)};

my @lines;
my $wanted_line = 0;

for my $line (split /$jid/, $text) {
    next unless $wanted_line++;
    my @fields = $line =~ $re;
    push @lines, [ $jid . shift(@fields), @fields ];
}

print Dumper(\@lines);
```

Output:

```
$VAR1 = [
          [
            '777',
            '',
            '111'
          ],
          [
            '777',
            'A',
            '222'
          ],
          [
            '777R-1',
            '',
            '333'
          ],
          [
            '777R-2',
            'A',
            '444'
          ]
        ];
```

`print ... $fields[1] . ",". $fields[3] . ",". $fields[4] . ",". ...`

Here's an example to show a better way to handle that:

```
$ perl -e 'my @x = qw{a b c d e f}; print join ",", @x[0,3,4]'
a,d,e
```

On an unrelated note, there are problems with your open statements. Use of package variables can lead to all sorts of bugs that are hard to track down. Your six error messages are identical: how will you know which file generates "Can't open the output file ..."? Look to using lexical filehandles and the 3-argument form of open. Consider the autodie pragma: you'll do less work and get better error reporting.

-- Ken
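Ken's two suggestions (a slice with `join`, and lexical filehandles opened with 3-argument `open` under `autodie`) can be combined in one short sketch. This is only an illustration under stated assumptions: the file name `gpo_bids_example.csv`, the temporary directory, and the sample row (split into twelve fields the way kevyt's regex numbers them) are hypothetical stand-ins, not part of anyone's posted code.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;    # a failed open/close dies with the file name and the reason
use File::Temp qw(tempdir);

# Hypothetical output location; a real script would use its own paths.
my $dir       = tempdir(CLEANUP => 1);
my $bids_file = "$dir/gpo_bids_example.csv";

# Sample row, indices 0..11 in the order of the regex captures (assumed values).
my @fields = ('A', '140-89226', 'UNION HOERMANN PRESS', '$844.00', '$15.00',
              'C', '1', '20', '$835.56', 'Randy Sigman', '01/22/2020',
              '(563) 582-3631');

# An array slice plus join replaces the long chain of '. "," .' concatenations.
my $bid_row = join ',', 746810, @fields[1, 3, 4, 5, 7, 6, 8];

# Lexical filehandle, 3-argument open; no "or die" is needed under autodie.
open my $bid_fh, '>>', $bids_file;
print {$bid_fh} "$bid_row\n";
close $bid_fh;

print "$bid_row\n";
```

With `autodie`, each failure message names the file involved, which addresses the "six identical error messages" problem without writing six different `die` strings.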
Re^4: Regular Expression to Parse Data from a PDF by kevyt (Scribe) on Feb 28, 2020 at 06:52 UTC
Re: Regular Expression to Parse Data from a PDF by vr (Curate) on Feb 27, 2020 at 12:13 UTC
(OT, not really Perl)

That approach won't work, in general. Text extraction from PDF always involves some level of heuristics, especially with tables and/or formatting. CAM::PDF is very naive about extraction and is good for simple checks only, for a limited subset of plain English. You may wish to take a look at `CAM::PDF::getPageContent` output:

```
BT
/Times 10 Tf
1 0 0 1 50.172 549.238 Tm 0 G [ (2/27/2020)] TJ
1 0 0 1 93.232 549.238 Tm 0 G [ (3:07)] TJ
1 0 0 1 113.512 549.238 Tm 0 G [ (AM)] TJ
1 0 0 1 428.004 549.238 Tm 0 G [ -24977 (Quotations)] TJ
1 0 0 1 724.166 549.238 Tm 0 G [ (Due)] TJ
1 0 0 1 743.326 549.238 Tm 0 G [ (By:)] TJ
1 0 0 1 760.276 549.238 Tm 0 G [ (01/22/2020)] TJ
/TimesB 14 Tf
1 0 0 1 50.172 533.238 Tm 0 G [ -17016 (ABSTRA) 55 (CT)] TJ
1 0 0 1 367.356 533.238 Tm 0 G [ (OF)] TJ
1 0 0 1 390.302 533.238 Tm 0 G [ (UNSTRAPPED)] TJ
1 0 0 1 487.91 533.238 Tm 0 G [ (\(A) 130 (W) 120 (ARDED\))] TJ
/Times 9 Tf
1 0 0 1 50.172 522.238 Tm 0 G [ (Jack) 10 (et)] TJ
1 0 0 1 100.549 522.238 Tm 0 G [ (A) 92 (wd)] TJ
1 0 0 1 150.926 522.238 Tm 0 G [ (Contractor)] TJ
1 0 0 1 150.926 511.238 Tm 0 G [ (Code)] TJ
1 0 0 1 201.303 522.238 Tm 0 G [ (Name)] TJ
%... etc.
```

In (very) simple English, what's inside parentheses is text content to show; what's in between (you guessed it) are positioning and formatting commands. And we are lucky that, in this trivial case, the text has a single-byte plain-ASCII encoding, so we can actually read it from source. If you scroll down, there are no space characters in parens. That's why, if we try to select and copy in Firefox and paste into a text editor, we'd get an ugly glued-together mess. So FF is even more naive about text extraction than our CAM::PDF. The spaces appear to be present because of the positioning of words.

(Of course it's not always so, for all PDFs out there. Some use spaces. Some use kerning. Some use a single text object (bracketed between a BT/ET pair, as the whole page is in your file) for each and every character. Thing to remember: PDF is always machine-generated stuff on a long and familiar TIMTOWTDI leash, and intended to be consumed by machines. Better not to worry nor ask too many "why?")

CAM::PDF has spaces in its extracted text, even, as you noticed, where they should not be. It decided to play safe, but simple. Usually (not always...) text is split between text-showing operators (TJ and friends) into chunks not less than a word. So, if we want to join chunks on extraction and are too lazy to analyze horizontal offsets, let's insert a space. (Actually, Adobe Reader is smart enough to add spaces where appropriate, for this file.)

===

OK, I'd try (and I did, in the past) to investigate the XML produced by Ghostscript. See here. Mode "0" is low level; mode "1" tries heuristics to combine text chunks, but fails for your file, on quick and casual inspection, see further. (Note, I've seen the GS "txtwrite" device have issues/regressions in some releases, YMMV.)

Mode "0", apart from the top "page" level, has "char" leaf nodes, with decoded character and calculated position (and also font/size), and intermediate, but actually atomic, "spans" (the "things in parens"). It's up to you, the programmer, to decide if 2 adjacent spans are a single word, or 2 words to be separated with a space, or (with tabular data) belong to different cells.
Mode "1" tries to consolidate spans, adding spaces, but is not very good at it (see words glued together):

```xml
<block>
<line>
<span bbox="288 62 568 62" font="Times-Bold" size="14.0000">
<char bbox="288 62 299 62" c="A"/>
<char bbox="299 62 308 62" c="B"/>
<char bbox="308 62 316 62" c="S"/>
<char bbox="316 62 325 62" c="T"/>
<char bbox="325 62 335 62" c="R"/>
<char bbox="335 62 345 62" c="A"/>
<char bbox="345 62 355 62" c="C"/>
<char bbox="355 62 365 62" c="T"/>
<char bbox="365 62 376 62" c="O"/>
<char bbox="376 62 384 62" c="F"/>
<char bbox="384 62 394 62" c="U"/>
<char bbox="394 62 404 62" c="N"/>
<char bbox="404 62 412 62" c="S"/>
<char bbox="412 62 421 62" c="T"/>
<char bbox="421 62 432 62" c="R"/>
<char bbox="432 62 442 62" c="A"/>
<char bbox="442 62 450 62" c="P"/>
<char bbox="450 62 459 62" c="P"/>
<char bbox="459 62 468 62" c="E"/>
<char bbox="468 62 478 62" c="D"/>
<char bbox="478 62 483 62" c="("/>
<char bbox="483 62 493 62" c="A"/>
<char bbox="493 62 507 62" c="W"/>
<char bbox="507 62 517 62" c="A"/>
<char bbox="517 62 527 62" c="R"/>
<char bbox="527 62 537 62" c="D"/>
<char bbox="537 62 547 62" c="E"/>
<char bbox="547 62 557 62" c="D"/>
<char bbox="557 62 561 62" c=")"/>
</span>
</line>
</block>
```

and also introduces "lines" and "blocks".
Again, not too bright (halves of 2 cells in the header row end up in one "block"):

```xml
<block>
<line>
<span bbox="415 73 498 73" font="Times-Roman" size="9.0000">
<char bbox="415 73 422 73" c="D"/>
<char bbox="422 73 424 73" c="i"/>
<char bbox="424 73 428 73" c="s"/>
<char bbox="428 73 432 73" c="c"/>
<char bbox="432 73 436 73" c="o"/>
<char bbox="436 73 441 73" c="u"/>
<char bbox="441 73 445 73" c="n"/>
<char bbox="445 73 448 73" c="t"/>
<char bbox="448 73 450 73" c=" "/>
<char bbox="450 73 458 73" c="%"/>
<char bbox="458 73 466 73" c=" "/>
<char bbox="466 73 472 73" c="D"/>
<char bbox="472 73 475 73" c="i"/>
<char bbox="475 73 478 73" c="s"/>
<char bbox="478 73 482 73" c="c"/>
<char bbox="482 73 487 73" c="o"/>
<char bbox="487 73 491 73" c="u"/>
<char bbox="491 73 496 73" c="n"/>
<char bbox="496 73 498 73" c="t"/>
</span>
</line>
</block>
```

I'd not use mode "1", but mode "0". Find the spans containing your "jacket" string. Their vertical offsets are the table row boundaries. From your 2 files, columns have constant offsets. From here you should have an idea how to find the content of individual cells.
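The row-and-column idea can be sketched roughly as follows. Everything here is an assumption for illustration: the XML is a hand-written sample in the span/char shape shown above (not real Ghostscript output for these files), the column offsets 50/100/150 are made up, and `column_for` is a hypothetical helper. Spans are keyed by their vertical offset (one row each) and assigned to whichever column's left edge is nearest.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hand-written sample: bbox is "x0 y0 x1 y1", one <char> per character.
my $xml = <<'XML';
<span bbox="50 100 60 100" font="Times-Roman" size="9.0000">
<char bbox="50 100 55 100" c="7"/>
<char bbox="55 100 60 100" c="7"/>
</span>
<span bbox="100 100 106 100" font="Times-Roman" size="9.0000">
<char bbox="100 100 106 100" c="A"/>
</span>
<span bbox="150 100 160 100" font="Times-Roman" size="9.0000">
<char bbox="150 100 155 100" c="4"/>
<char bbox="155 100 160 100" c="2"/>
</span>
<span bbox="50 111 60 111" font="Times-Roman" size="9.0000">
<char bbox="50 111 55 111" c="7"/>
<char bbox="55 111 60 111" c="7"/>
</span>
<span bbox="150 111 160 111" font="Times-Roman" size="9.0000">
<char bbox="150 111 155 111" c="4"/>
<char bbox="155 111 160 111" c="3"/>
</span>
XML

# Assumed left edges of three columns (say Jacket, Awd, Contractor Code).
my @col_x = (50, 100, 150);

# Return the index of the column whose left edge is nearest to $x.
sub column_for {
    my ($x) = @_;
    my ($best) = sort { abs($col_x[$a] - $x) <=> abs($col_x[$b] - $x) } 0 .. $#col_x;
    return $best;
}

# $rows{$y}[$col] = text; spans sharing a vertical offset form one table row.
my %rows;
while ($xml =~ m{<span\s+bbox="(\d+)\s(\d+)\s\d+\s\d+"[^>]*>(.*?)</span>}gs) {
    my ($x, $y, $body) = ($1, $2, $3);
    my $text = join '', $body =~ /\bc="([^"]*)"/g;   # glue the chars of one span
    $rows{$y}[ column_for($x) ] = $text;
}

my @table;
for my $y (sort { $a <=> $b } keys %rows) {
    push @table, join '|', map { $_ // '' } @{ $rows{$y} }[0 .. $#col_x];
}
print "$_\n" for @table;
```

Real output would be better handled with a proper XML parser, and a multi-line cell (such as a wrapped name) would need rows defined as ranges between jacket spans rather than exact offsets.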
Re^2: Regular Expression to Parse Data from a PDF by kevyt (Scribe) on Feb 27, 2020 at 15:46 UTC
Thanks, I used this line yesterday and it printed similar output, but I could not determine how to find the correct data.

```perl
my $str = $doc->getPageContent($p, $opts{verbose});
```

```
1 0 0 1 516.16 280.238 Tm 0 G [ (T) 35 (imoth) 5 (y)] TJ
1 0 0 1 549.055 280.238 Tm 0 G [ (T) 74 (.)] TJ
1 0 0 1 516.16 269.238 Tm 0 G [ (Cole)] TJ
1 0 0 1 566.537 280.238 Tm 0 G [ (02/06/2020)] TJ
1 0 0 1 616.914 280.238 Tm 0 G [ (\(615\))] TJ
1 0 0 1 638.658 280.238 Tm 0 G [ (713-0205)] TJ
1 0 0 1 692.48 280.238 Tm 0 G [ (These)] TJ
1 0 0 1 716.222 280.238 Tm 0 G [ (are)] TJ
1 0 0 1 729.461 280.238 Tm 0 G [ (for)] TJ
1 0 0 1 742.205 280.238 Tm 0 G [ (a)] TJ
1 0 0 1 748.451 280.238 Tm 0 G [ (total)] TJ
1 0 0 1 766.703 280.238 Tm 0 G [ (of)] TJ
1 0 0 1 776.45 280.238 Tm 0 G [ (25,000)] TJ
1 0 0 1 692.48 269.238 Tm 0 G [ (total)] TJ
1 0 0 1 710.732 269.238 Tm 0 G [ (lan) 15 (yards)] TJ
1 0 0 1 743.339 269.238 Tm 0 G [ (made)] TJ
1 0 0 1 765.083 269.238 Tm 0 G [ (o) 15 (v) 15 (erseas)] TJ
1 0 0 1 692.48 258.238 Tm 0 G [ (and)] TJ
1 0 0 1 707.726 258.238 Tm 0 G
```
Re: Regular Expression to Parse Data from a PDF by LanX (Saint) on Feb 27, 2020 at 10:34 UTC
Hi Kevin

If you want help with a regex, then you should show us the input strings and the desired results. See also SSCCE.

On a side note: I'm personally using `pdftohtml -xml` to parse PDFs.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
Re^2: Regular Expression to Parse Data from a PDF by kevyt (Scribe) on Feb 27, 2020 at 15:30 UTC
Thank you. I will load it into MySQL or Excel, so a comma-separated file will work best.

Thanks,
Kevin
Re: Regular Expression to Parse Data from a PDF by brostad (Monk) on Feb 27, 2020 at 12:49 UTC
I think your code mostly works. I downloaded it and just modified the regular expression to this:

```perl
# Expected formats
# >>>A 430-08870 BKR PRINTING $ 1,090.00 $ 155.00 M 5 20 $ 1,035.50 Mark Bengtzen 01/22/2020 (801) 532-5363<<<
# >>>420-31784 GRAFIKSHOP CORP. DBA FALCON $ 945.00 $ 110.00 M 1 20 $ 935.55 Mei-Ing Hoffman 01/22/2020 (713) 977-2555<<<
if ( $line =~ m/^
        (?<Awd> \w \s+)?
        (?<ContractorCode> \d{3}-\d{5}) \s+
        (?<Name> \S [^\$]+) \s+
        \$ \s* (?<Amount> \S+) \s+
        \$ \s* (?<AddlRate> \S+ \s+ \w) \s+
        (?<DiscountDays> \S+) \s+
        (?<DiscountPercent> \S+) \s+
        \$ \s* (?<DiscountPrice> \S+ ) \s+
        (?<Bidder> \w \D+ \S ) \s+
        (?<DateReceived> \d\d\/\d\d\/\d{4} ) \s+
        (?<PhoneNum> [(]\d{3}[)] \s \d{3}-\d{4} )
    /x ) {
    say "Found Contractor '", $+{Name}, "' (", $+{ContractorCode}, ") and bidder '", $+{Bidder} , "' (", $+{DateReceived},")";
}
else {
    say "Failed to parse line: >>>", $line, "<<<";
}
```

This works for all the rows (except the headline) on https://contractorconnection.gpo.gov/abstract/746810
Re: Regular Expression to Parse Data from a PDF by Fletch (Bishop) on Feb 27, 2020 at 18:44 UTC
Another possibility to try: if you're on a Linux-y system and have the poppler package available, it includes a `pdftotext` utility (RHEL has it in its 'poppler-utils' RPM) that might work for you. Open a pipe from something like `pdftotext -layout foo.pdf -` and see if that gets what you need from your PDF for your purposes.
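Opening such a pipe from Perl might look like the sketch below. It assumes `pdftotext` from poppler is installed and on the PATH; `read_command_lines` is a hypothetical helper name, and the PDF path comes from the command line rather than from anything in this thread.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Run a command and return its output lines. The list form of a pipe open
# bypasses the shell, so odd characters in file names need no quoting.
sub read_command_lines {
    my @cmd = @_;
    open my $fh, '-|', @cmd or die "Cannot run '$cmd[0]': $!";
    chomp(my @lines = <$fh>);
    close $fh or die "'$cmd[0]' failed (exit status $?)";
    return @lines;
}

# Only runs when a PDF path is given, e.g.: perl this_script.pl foo.pdf
if (@ARGV) {
    # "-layout" asks pdftotext to keep the physical layout; "-" sends text to stdout.
    my @lines = read_command_lines('pdftotext', '-layout', $ARGV[0], '-');
    print "$_\n" for @lines;
}
```

Because `-layout` tries to keep each table row on one physical line, the row regexes discussed earlier in the thread could then be applied line by line instead of splitting on the jacket number.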
Re^2: Regular Expression to Parse Data from a PDF by LanX (Saint) on Feb 27, 2020 at 21:04 UTC
> Open a pipe from something like 'pdftotext -layout'

Not really, `pdftohtml -xml` is far better; see "Parsing PDFs by text position?"

Cheers Rolf