Re: Regular Expression to Parse Data from a PDF

G'day Kevin,

Firstly, I'm not a user of CAM::PDF; in fact, I didn't even have it installed. I suspect the getPageText() method is not the best choice for this: as you noted, you can't split lines easily and dollar amounts have an embedded space. I can't suggest a better choice; perhaps another monk can.

I would strongly recommend that you do not write lengthy regexes the way you did in the last example in your code: they are incredibly difficult to read; even more difficult to maintain; and extremely error-prone. See my code below for a much better way to do this. Also, take a look at Regexp::Debugger: I find it very helpful and, in fact, used it to check some of the fiddlier parts of the regex in the code below.

I've ignored the PDF download part of the code. You didn't ask about that: I'm assuming you've got that working satisfactorily. I just downloaded the two PDFs you referenced and accessed them from a local disk.

Some notes on how I've dealt with lack of information:

You said you wanted to "capture all of the columns except comments". You said nothing about header information or the various totals, so I've simply ignored them.
You also said nothing about output: I've captured the columnar data; I'll leave you to decide what you want to do with it.
The two example PDFs you linked only had one page each, so looping through all pages seems somewhat superfluous; however, I've left that for loop almost exactly as you wrote it.
It's also unclear whether you want to capture data by page, document, or some other grouping: again, I'll leave you to decide.

Here's the code:

#!/usr/bin/env perl

use strict;
use warnings;

use constant {
    AMOUNT => 3,
    ADDL_RATE_PER => 4,
    DISCOUNT_PRICE => 6,
};

use CAM::PDF;
use Data::Dumper;

my $jacket_id = $ARGV[0];
my $pdf_file = "pm_11113472_$jacket_id.pdf";

my $pdf = CAM::PDF::->new($pdf_file) or die $CAM::PDF::errstr;

my $re = qr{(?x:
    \A
    \s*?
    ((?:A|))                # Awd
    \s+
    (\d+-\d+)               # Contractor Code
    \s+
    ([^\$]+?)               # Name
    \s+
    (\$\s[0-9,.]+)          # Amount
    \s+
    (\$\s[0-9,.]+\s[A-Z])   # Add'l Rate/PER
    \s+
    ([0-9.]+\s+\d+)         # Discount % Days
    \s+
    (\$\s[0-9,.]+)          # Discount Price
    \s+
    ([\D]+?)                # Bidders Name
    \s+
    (\S+)                   # Date Received
    \s+
    (\(\d+\)\s\d+-\d+)      # Phone Number
)};

for my $page_num (1 .. $pdf->numPages) {
    my $text = $pdf->getPageText($page_num);
    my @lines;
    my $wanted_line = 0;

    for my $line (split /$jacket_id/, $text) {
        next unless $wanted_line++;
        my @fields = $line =~ $re;
        $fields[AMOUNT] =~ y/ //d;
        $fields[ADDL_RATE_PER] =~ s/ //;
        $fields[DISCOUNT_PRICE] =~ y/ //d;
        push @lines, [ $jacket_id, @fields ];
    }

    print Dumper(\@lines);
}
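
One thing the loop above does not do is check whether the regex matched: on a failed match @fields is empty and the clean-up lines will emit "uninitialized value" warnings. Here is a minimal sketch of a guard, assuming a cut-down three-field regex and two made-up input lines; tidy_amount() and fields_or_undef() are hypothetical helper names, not part of CAM::PDF or the code above.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper: strips the space that follows the dollar sign,
# e.g. '$ 1,090.00' becomes '$1,090.00'.
sub tidy_amount {
    my ($amount) = @_;
    (my $tidy = $amount) =~ y/ //d;
    return $tidy;
}

# Hypothetical helper: returns the captured fields as an array ref,
# or undef when the regex does not match the line.
sub fields_or_undef {
    my ($line, $re) = @_;
    my @fields = $line =~ $re;
    return @fields ? \@fields : undef;
}

# Cut-down regex for illustration only: code, name and amount.
my $demo_re = qr{(?x:
    \A
    \s*
    (\d+-\d+)           # Contractor Code
    \s+
    ([^\$]+?)           # Name
    \s+
    (\$\s[0-9,.]+)      # Amount
)};

# Made-up sample lines: one that matches and one that does not.
for my $line (' 430-08870 BKR PRINTING $ 1,090.00', 'Total bids: 10') {
    my $fields = fields_or_undef($line, $demo_re);

    if (! defined $fields) {
        warn "Skipping unmatched line: '$line'\n";
        next;
    }

    $fields->[2] = tidy_amount($fields->[2]);
    print join('|', @$fields), "\n";
}
```

Whether an unmatched line should be skipped, logged or treated as fatal depends on what you want to do with the data; the warn here is just one option.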

Here's the first part of the output using your first example PDF:

$ ./pm_11113472_pdf_parse.pl 746810
$VAR1 = [
          [
            '746810',
            'A',
            '140-89226',
            'UNION HOERMANN PRESS',
            '$844.00',
            '$15.00 C',
            '1 20',
            '$835.56',
            'Randy Sigman',
            '01/22/2020',
            '(563) 582-3631'
          ],
          [
            '746810',
            '',
            '190-38407',
            'GRAPHIC VISIONS',
            '$869.00',
            '$140.00 M',
            '0.5 20',
            '$864.66',
            'Howard Roskosky',
            '01/22/2020',
            '(301) 987-5586'
          ],

Full output for both example PDFs:

$ ./pm_11113472_pdf_parse.pl 746810
$VAR1 = [
          [
            '746810',
            'A',
            '140-89226',
            'UNION HOERMANN PRESS',
            '$844.00',
            '$15.00 C',
            '1 20',
            '$835.56',
            'Randy Sigman',
            '01/22/2020',
            '(563) 582-3631'
          ],
          [
            '746810',
            '',
            '190-38407',
            'GRAPHIC VISIONS',
            '$869.00',
            '$140.00 M',
            '0.5 20',
            '$864.66',
            'Howard Roskosky',
            '01/22/2020',
            '(301) 987-5586'
          ],
          [
            '746810',
            '',
            '040-13121',
            'BONADA ENTERPRISES/BLUE EARTH',
            '$902.00',
            '$0.18 E',
            '1 7',
            '$902.00',
            'fernando',
            '01/22/2020',
            '(323) 272-6430'
          ],
          [
            '746810',
            '',
            '420-52700',
            'LITHO PRESS, INC.',
            '$941.00',
            '$18.00 C',
            '1 20',
            '$931.59',
            'Tim Sankey',
            '01/22/2020',
            '(210) 333-1711'
          ],
          [
            '746810',
            '',
            '420-31784',
            'GRAFIKSHOP CORP. DBA FALCON',
            '$945.00',
            '$110.00 M',
            '1 20',
            '$935.55',
            'Mei-Ing Hoffman',
            '01/22/2020',
            '(713) 977-2555'
          ],
          [
            '746810',
            '',
            '430-08870',
            'BKR PRINTING',
            '$1,090.00',
            '$155.00 M',
            '5 20',
            '$1,035.50',
            'Mark Bengtzen',
            '01/22/2020',
            '(801) 532-5363'
          ],
          [
            '746810',
            '',
            '190-28460',
            'DOYLE PRINTING',
            '$1,177.00',
            '$227.00 M',
            '5 20',
            '$1,118.15',
            'Michael Carey',
            '01/22/2020',
            '(301) 991-2637'
          ],
          [
            '746810',
            '',
            '120-71652',
            'PRODUCTION PRESS, INC.',
            '$1,357.00',
            '$232.00 M',
            '0.25 20',
            '$1,353.61',
            'Brad Racey',
            '01/22/2020',
            '(217) 243-3353'
          ],
          [
            '746810',
            '',
            '450-34976',
            'GABRO GRAPHICS INC.',
            '$1,940.00',
            '$295.00 M',
            '2 20',
            '$1,901.20',
            'Tony Gabro',
            '01/22/2020',
            '(703) 464-8588'
          ],
          [
            '746810',
            '',
            '130-13540',
            'BOWMAN DISPLAY DIGITAL IMAGING',
            '$9,327.91',
            '$1.86 E',
            '0 0',
            '$9,327.91',
            'Sara Veld',
            '01/22/2020',
            '(219) 595-6542'
          ]
        ];

$ ./pm_11113472_pdf_parse.pl 746819
$VAR1 = [
          [
            '746819',
            'A',
            '120-64255',
            'NOOR INTERNATIONAL CORP',
            '$387.86',
            '$7.75 C',
            '1 20',
            '$383.98',
            'Max Saleem',
            '01/23/2020',
            '(847) 985-2300'
          ],
          [
            '746819',
            '',
            '040-44026',
            'IMAGE SQUARE INC',
            '$463.00',
            '$0.09 E',
            '0 0',
            '$463.00',
            'Ash Soudbash',
            '01/22/2020',
            '(310) 586-2333'
          ],
          [
            '746819',
            '',
            '190-43435',
            'HUB LABELS, INC.',
            '$731.00',
            '$14.62 C',
            '1 20',
            '$723.69',
            'Kim Clark',
            '01/23/2020',
            '(301) 671-2230'
          ],
          [
            '746819',
            '',
            '090-28380',
            'DOUGLASS SCREEN PRINTERS',
            '$800.00',
            '$140.00 M',
            '0.5 20',
            '$796.00',
            'Debbie Carrigan',
            '01/23/2020',
            '(863) 899-7130'
          ],
          [
            '746819',
            '',
            '480-79295',
            'SERIGRAPHIC SCREEN PRINT',
            '$800.00',
            '$0.16 E',
            '0.5 20',
            '$796.00',
            'Teri Tropple',
            '01/22/2020',
            '(800) 657-6740'
          ],
          [
            '746819',
            '',
            '120-77235',
            'DRI-STICK DECAL/RYDIN DECAL',
            '$1,150.00',
            '$0.00 N',
            '0 0',
            '$1,150.00',
            'Lori Haberstich',
            '01/23/2020',
            '(800) 448-1991'
          ]
        ];

— Ken


Re^2: Regular Expression to Parse Data from a PDF by kevyt (Scribe) on Feb 27, 2020 at 16:35 UTC
This is very cool. Thanks! The next step will be to grab the Title, Quantity and a few fields from links like this one: https://contractorconnection.gpo.gov/RequestOpenJobs/770893

Thanks very much!
Re^2: Regular Expression to Parse Data from a PDF by kevyt (Scribe) on Feb 28, 2020 at 02:25 UTC
Ken, Thanks very much for your help. It's working great but I forgot about one issue. They might add a "R-1" or "R-2" to the far left column if there is a revision. I have not used perl much since 2006 and I rarely used regex. I also tried to get some of the comments but that won't be important going forward.

Example with R-1: https://contractorconnection.gpo.gov/abstract/777292
Example without R-1: https://contractorconnection.gpo.gov/abstract/777293

I also need to install CAM::PDF so I can run it on linux.

#!/usr/bin/perl -w

# use warnings;
# use strict;
use CAM::PDF;
use LWP::Simple;
use Data::Dumper;

use constant {
    AMOUNT => 0,
    ADDL_RATE_PER => 0,
    DISCOUNT_PRICE => 0,
};

#### These will be used to load different database tables #####
$companies = 'c:\Users\Kevin\Documents\dev\data_files\gpo_companies.csv';
$bids = 'c:\Users\Kevin\Documents\dev\data_files\gpo_bids.csv';
$awards = 'c:\Users\Kevin\Documents\dev\data_files\gpo_awards.csv';
$solicit = 'c:\Users\Kevin\Documents\dev\data_files\gpo_solicitations.csv';
$log_file = 'c:\Users\Kevin\Documents\dev\data_files\gpo_log.csv';

#### This file will be imported into excel (temp. solution so I won't have to create the db tables now)
$all_file = 'c:\Users\Kevin\Documents\dev\data_files\gpo_abstract_data.csv';

open (COMPANY, ">> $companies") or die ("Can't open the output file $!");
open (BID, ">> $bids") or die ("Can't open the output file $!");
open (AWARD, ">> $awards") or die ("Can't open the output file $!");
open (SOLICIT, ">> $solicit") or die ("Can't open the output file $!");
open (LOG, ">> $log_file") or die ("Can't open the output file $!");
open (OUT, ">> $all_file") or die ("Can't open the output file $!");

print OUT "Jacket_ID,Award,Contractor_Code,Company_Name,Amount,Addl_Rate,Addl_Rate_Per,Discount_Percent,Discount_Days,Discount_Price,Bidders_Name,Date_Received,Phone_Number\n";

my @ns_headers = (
    'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset' => 'iso-8859-1,*,utf-8',
    'Accept-Language' => 'en-US',
);

my $jacket_id = 777390;

# Get the most recent data first
while ($jacket_id > 700000){
    sleep (2);
    $jacket_id --;

    my $ua = LWP::UserAgent->new;
    $ua->timeout(5);

    # Is the site available?
    print $jacket_id . "\n";
    my $response = $ua->get('https://contractorconnection.gpo.gov/abstract/'. $jacket_id , @ns_headers);

    if ( $response =~ /Abstract Unavailable/){
        print LOG $jacket_id . ",Unavailable\n";
        next;
    }

    my $pdf = CAM::PDF->new($response->content) || print LOG $jacket_id . ",ERROR,\n". next;

    my $re = qr{(?x:
        \A
        \s?
        ((?:A|))                    # Awd - 0
        \s+
        (\d+-\d+)                   # Contractor Code - 1
        \s+
        ([^\$]+?)                   # Name - 2
        \s+
        (\$\s[0-9,.]+)              # Amount - 3
        \s+
        ### (\$\s[0-9,.]+\s[A-Z])   # Add'l Rate/PER - 4
        (\$\s[0-9,.]+)              # Add'l Rate - 4
        \s+
        ([^\$]+?)                   # Add'l Rate's Per - 5
        \s+
        ### ([0-9.]+\s+\d+)         # Discount % Days - 6
        ([0-9.]+)                   # Discount % - 6
        \s+
        (\d+)                       # Discount Days - 7
        \s+
        (\$\s[0-9,.]+)              # Discount Price - 8
        \s+
        ([\D]+?)                    # Bidders Name - 9
        \s+
        (\S+)                       # Date Received - 10
        \s+
        (\(\d+\)\s\d+-\d+)          # Phone Number - 11
    )};

    for my $page_num (1 .. $pdf->numPages) {
        my $text = $pdf->getPageText($page_num);
        my @lines;
        my $wanted_line = 0;

        for my $line (split /$jacket_id/, $text) {
            # print $line;
            next unless $wanted_line++;
            my @fields = $line =~ $re;
            $fields[AMOUNT] =~ y/ //d;
            $fields[ADDL_RATE_PER] =~ s/ //;
            $fields[DISCOUNT_PRICE] =~ y/ //d;

            $fields[3] =~ s/\s+//g;    # Remove the space between the $ and digit
            $fields[4] =~ s/\s+//g;    # Remove the space between the $ and digit
            $fields[8] =~ s/\s+//g;    # Remove the space between the $ and digit

            foreach (@fields){
                $_ =~ s/\,//;
            }

            push @lines, [ $jacket_id, @fields ];

            # Contractor Code  Company Name  Bidders Name  Phone Number
            print COMPANY $fields[1] . ",". $fields[2] . ",". $fields[9] . ",". $fields[11] . "\n";

            # Title  Quantity  Contact  Winning_Contractor
            print SOLICIT $jacket_id . ",,,,". $fields[1] . "\n";

            if($fields[0] =~ /A/){
                # Contractor Code  Date Received
                print AWARD $jacket_id . ",". $fields[1] . ",". $fields[10] . "\n";
            }

            # Contractor Code  Amount  Add'l Rate  Add'l Rate's Per  Discount Days  Discount %  Discount Price
            print BID $jacket_id . ",". $fields[1] . ",". $fields[3] . ",". $fields[4] . ",". $fields[5] . ",". $fields[7] . ",". $fields[6] . ",". $fields[8] . "\n";

            print OUT $jacket_id . ",". $fields[0] . ",". $fields[1] . ",". $fields[2] . ",". $fields[3] . ",". $fields[4] . ",". $fields[5] . ",". $fields[6] . ",". $fields[7] . ",". $fields[8] . ",". $fields[9] . ",". $fields[10] . ",". $fields[11] . "\n";

            # foreach my $field (@fields){
            #     print $field . ",";
            # }
            # print "\n";
        }

        # print Dumper(\@lines);
    }
} # End while ()
Re^3: Regular Expression to Parse Data from a PDF by kcott (Archbishop) on Feb 28, 2020 at 06:28 UTC
'They might add a "R-1" or "R-2" to the far left column if there is a revision.'

You just need to extend the regex to handle that. Here's an example:

#!/usr/bin/env perl

use strict;
use warnings;

use Data::Dumper;

my $jid = '777';
my $text = 'header 777 111 777 A 222 777R-1 333 777R-2 A 444';

my $re = qr{(?x:
    \A
    (R-\d+|)
    \s?
    (A|)
    \s
    (\d+)
)};

my @lines;
my $wanted_line = 0;

for my $line (split /$jid/, $text) {
    next unless $wanted_line++;
    my @fields = $line =~ $re;
    push @lines, [ $jid . shift(@fields), @fields ];
}

print Dumper(\@lines);

Output:

$VAR1 = [
          [
            '777',
            '',
            '111'
          ],
          [
            '777',
            'A',
            '222'
          ],
          [
            '777R-1',
            '',
            '333'
          ],
          [
            '777R-2',
            'A',
            '444'
          ]
        ];

'print ... $fields[1] . ",". $fields[3] . ",". $fields[4] . ",". ...'

Here's an example to show a better way to handle that:

$ perl -e 'my @x = qw{a b c d e f}; print join ",", @x[0,3,4]'
a,d,e

On an unrelated note, there are problems with your open statements. Use of package variables can lead to all sorts of bugs that are hard to track down. Your six error messages are identical: how will you know which file generates "Can't open the output file ..."? Look to using lexical filehandles and the 3-argument form of open. Consider the autodie pragma: you'll do less work and get better error reporting.

-- Ken
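
The points above about lexical filehandles, 3-argument open, autodie and array slices can be put together roughly as follows. This is only a sketch: csv_line() is a hypothetical helper, the record is sample data taken from the earlier output, and a temporary directory stands in for the real data_files directory.

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use autodie;    # open and close now die with the file name and reason on failure

use File::Temp qw{tempdir};

# Hypothetical helper: joins the chosen fields of one record into a
# comma-separated line. A plain join assumes no field contains a comma
# or a quote; a CSV module such as Text::CSV would be safer for real data.
sub csv_line {
    my ($fields, @indices) = @_;
    return join(',', @{$fields}[@indices]) . "\n";
}

# Assumption: a temporary directory is used here purely for the demo.
my $dir = tempdir(CLEANUP => 1);
my $bids_file = "$dir/gpo_bids.csv";

# Sample record: Awd, Contractor Code, Name, Amount.
my @fields = ('A', '140-89226', 'UNION HOERMANN PRESS', '$844.00');

# Lexical filehandle and 3-argument open; no "or die" needed under autodie.
open my $bids_fh, '>>', $bids_file;
print {$bids_fh} csv_line(\@fields, 1, 3);
close $bids_fh;

# Read the file back to show what was written.
open my $in_fh, '<', $bids_file;
print <$in_fh>;
close $in_fh;
```

With autodie in effect, a failed open reports the path and the system error without any hand-written message, which addresses the "six identical messages" problem.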
Re^4: Regular Expression to Parse Data from a PDF by kevyt (Scribe) on Feb 28, 2020 at 06:52 UTC
Thanks Ken! I will read this in the morning. It's 1:51 AM here :) The goal is to help someone determine which bids they are losing and manually doing it is very time consuming. Thanks! Kevin

