I have not been able to parse a few fields from a PDF file using CAM::PDF and regular expressions. Can someone suggest how I might accomplish this?
I noticed that CAM::PDF changes $100 to $ 100.
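That spacing could probably be undone before matching amounts. A minimal sketch, assuming the only change is whitespace inserted between the dollar sign and the digits (the helper name tighten_dollars is made up for illustration):

```perl
use strict;
use warnings;

# Remove whitespace that sits between a dollar sign and the digits
# that follow it, e.g. "$ 100" becomes "$100". Text where "$" is not
# followed by digits is left untouched.
sub tighten_dollars {
    my ($text) = @_;
    $text =~ s/(?<=\$)\s+(?=\d)//g;
    return $text;
}

print tighten_dollars('Bid: $ 1,250.00 and $ 100'), "\n";
# prints "Bid: $1,250.00 and $100"
```

The lookbehind and lookahead keep the dollar sign and the digits out of the match, so only the whitespace is deleted.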
I was not able to split on \n, so I split the text on the ID number (the jacket ID) in the far-right column instead.
The AWD column marks the company that won.
I would like to capture all of the columns except comments.
Here are two example files:
https://contractorconnection.gpo.gov/abstract/746810
https://contractorconnection.gpo.gov/abstract/746819
Thanks
Kevin
#!/usr/bin/perl
use warnings;
use strict;
use CAM::PDF;
use LWP::UserAgent;

my @ns_headers = (
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset'  => 'iso-8859-1,*,utf-8',
    'Accept-Language' => 'en-US',
);
my $jacket_id = 746810;
my $ua = LWP::UserAgent->new;
# $ua->timeout(5); # Is the site available?
my $response = $ua->get('https://contractorconnection.gpo.gov/abstract/' . $jacket_id, @ns_headers);
die "Download failed: " . $response->status_line . "\n" unless $response->is_success;
my $pdf = CAM::PDF->new($response->content) || die "$CAM::PDF::errstr\n";
# my $pdf = CAM::PDF->new('C:\dev\perl\file.pdf') || die "$CAM::PDF::errstr\n";
# print $pdf->toString();
for my $page (1 .. $pdf->numPages) {
    my $text = $pdf->getPageText($page);
    my @lines = split /$jacket_id\s+/, $text;    # split on Jacket ID and a space
    foreach (@lines) {
        print "\n$_\n";
        if (/^A/) {    # A at the beginning of a line is the award winner
            print "Award winner\n";    # was "print $1", but this regex captures nothing
        }
        if (/^(\d+-)(\d+)/) {    # Contractor code
            print "Contractor code $1$2\n";
        }
        if (/(\w+)\s+\$/) {    # Does not work
            print "Name $1\n";    # Name
        }
        # if (/\$?([0-9]{1,3},([0-9]{3},)*[0-9]{3}|[0-9]+)(\.[0-9][0-9])?$)/) {    # Does not work (the final ")" is unmatched)
        #     print "Amount $1\n";    # Amount
        # }
        # if (1) {    # Date
        #     print "Date $1";
        # }
    }
}
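
One direction that might work is a single regex with named captures instead of several separate tests. This is only a sketch: the record layout (an optional A marker, a contractor code, a name, then a dollar amount) is an assumption based on the comments in the script above, not on the actual text CAM::PDF returns, and parse_record and the sample line are invented for illustration.

```perl
use strict;
use warnings;

# Assumed layout of one record (hypothetical, not verified against the PDFs):
#   [A] <digits>-<digits> <contractor name> <dollar sign> <amount>
my $record_re = qr{
    ^ \s*
    (?<award>  A \s+ )?                       # optional award marker
    (?<code>   \d+ - \d+ ) \s+                # contractor code
    (?<name>   .+? ) \s+                      # contractor name, as short as possible
    \$ \s*                                    # dollar sign, possibly followed by a space
    (?<amount> [\d,]+ (?: \. \d{2} )? )       # digits with optional commas and cents
}x;

# Return a hash reference of fields, or undef if the line does not match.
sub parse_record {
    my ($line) = @_;
    return undef unless $line =~ $record_re;
    my %f = %+;                               # copy the named captures
    ( my $amount = $f{amount} ) =~ tr/,//d;   # drop thousands separators
    return {
        awarded => exists $f{award} ? 1 : 0,
        code    => $f{code},
        name    => $f{name},
        amount  => $amount,
    };
}

my $sample = 'A 123-45678 ACME PRINTING CO $ 1,234.50';   # made-up line
if ( my $rec = parse_record($sample) ) {
    print join( ' | ', @{$rec}{qw(awarded code name amount)} ), "\n";
}
# prints "1 | 123-45678 | ACME PRINTING CO | 1234.50"
```

Because the name is matched non-greedily up to the first dollar sign, multi-word names are kept whole, which the (\w+)\s+\$ test in the script cannot do. A date field could be added as another named group once its format in the extracted text is known.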