PerlMonks  

Extracting data from a PDF to a spreadsheet

by NonProgrammer (Initiate)
on Jun 21, 2011 at 23:19 UTC

NonProgrammer has asked for the wisdom of the Perl Monks concerning the following question:

So I'm new to Perl, and somewhat new to programming. I see the power of Perl and how it can save me hours and hours of work. I've been trying to write some code, but I'm having trouble getting started. I have some searchable PDF files containing some structured information from a database I've been able to extract. The info is broken up nicely and I can split it order by order. My problem is that each order has a different number of invoices associated with it. What I need to do is extract information from each order, then extract information from each invoice, then print it out to a spreadsheet. Here's a sample of how my data is structured:

    Orders Number 1
    ... some stuff...
    _________
    Class: Invoice 1
    ...some stuff...
    __________
    Payment Detail - Payment ID 1
    ...some stuff...
    _________

    Orders Number 2
    ... some stuff...
    ________
    Class: Invoice 1
    some stuff...
    ___________
    Payment Detail - Payment ID 1
    ...some stuff...
    Payment Detail - Payment ID 2
    ...some stuff...
    Payment Detail - Payment ID 3
    ...some stuff...
    ________

If anyone can help, it would be greatly appreciated.


Replies are listed 'Best First'.
Re: Extracting data from a PDF to a spreadsheet
by runrig (Abbot) on Jun 21, 2011 at 23:37 UTC

    You'll probably want to first convert the PDF files to text; I like to use pdftotext. You can call it from Perl with the system function if you like (once it's installed, that is). Then you'll want to open the files, read from them, and maybe use split or regular expressions to parse the data. You can use Spreadsheet::WriteExcel to create Excel spreadsheets, or Text::CSV (or Text::CSV_XS) to create CSV files (that some people think are spreadsheets anyway).

    Hope that helps, and welcome to PerlMonks. Remember that we are not a code writing service, but if you show some code and have a specific problem, we can help you with it.
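To make the steps above a little more concrete, here is a rough sketch of the pipeline: run pdftotext, pick a few fields out of the text, and write them with Text::CSV. It is only a sketch under assumptions: the labels ('Customer ID', 'Name', 'City'), the "Label value" one-per-line layout, the output name orders.csv, and the helper names pdf_to_text, extract_fields and write_csv are all made up for illustration, and the real PDF text may well need different patterns.

```perl
use strict;
use warnings;

# Run pdftotext and return the extracted text. The list form of the
# pipe open bypasses the shell, so odd file names need no quoting.
# A trailing '-' tells pdftotext to write to standard output.
sub pdf_to_text {
    my ($pdf) = @_;
    open my $fh, '-|', 'pdftotext', '-layout', $pdf, '-'
        or die "Cannot run pdftotext: $!";
    local $/;                      # slurp everything at once
    my $text = <$fh>;
    close $fh or die "pdftotext failed on $pdf (exit status $?)";
    return $text;
}

# Pull "Label value" pairs out of the text. The labels and the
# one-pair-per-line layout are assumptions about the PDF contents.
sub extract_fields {
    my ($text, @labels) = @_;
    my %field;
    for my $label (@labels) {
        # A failed match leaves the field undefined.
        ($field{$label}) = $text =~ /^\s*\Q$label\E\s+(.*?)\s*$/m;
    }
    return \%field;
}

# Write a header row plus data rows as CSV. Text::CSV comes from CPAN
# and is loaded only when this sub is actually called.
sub write_csv {
    my ($path, $header, $rows) = @_;
    require Text::CSV;
    my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
        or die "Cannot use Text::CSV: " . Text::CSV->error_diag;
    open my $out, '>', $path or die "Cannot write $path: $!";
    $csv->print($out, $_) for $header, @$rows;
    close $out or die "Cannot close $path: $!";
}

# Only does real work when PDF names are given on the command line.
if (@ARGV) {
    my @labels = ('Customer ID', 'Name', 'City');   # hypothetical labels
    my @rows;
    for my $pdf (@ARGV) {
        my $f = extract_fields(pdf_to_text($pdf), @labels);
        push @rows, [ map { $f->{$_} // '' } @labels ];
    }
    write_csv('orders.csv', \@labels, \@rows);
}
```

Something like `perl extract.pl orders1.pdf orders2.pdf` would then produce one CSV row per PDF. Letting Text::CSV do the quoting means a comma or a quote inside a field cannot break the output.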

      Thanks for the guidance. I do have some code, but I didn't figure it would help to see it if I didn't provide the file it's trying to extract data from. But I'll place it here anyway.

      while(<STDIN>) {
          @section = split /Class: Invoice/, $_;
          @AdminData = split /\n/, $section[0];
          @BodyTemp = split /Administrative Data:/, $_;
          @Body = split /Reply: click here/, $BodyTemp[0];
          @Splitterhold = split/Payment Detail - Payment ID /, $_;
          foreach $Splitterhold(@Splitterhold) {
              $Splitterhold =~ s/InvoiceDate /Invoice Dateĉ /g;
              $Splitterhold =~ s/Customer ID /CustomerIDĉ /g;
              $Splitterhold =~ s/^Phone /Phoneĉ /g;
              $Splitterhold =~ s/Txn Type Post Day Amount \(USD\)\n/InvoiceDateĉ /g;
              $Splitterhold =~ s/Card Type Card Number Exp Date BIN\n/CreditCardĉ /g;
              $Splitterhold =~ s/Name /Nameĉ /g;
              $Splitterhold =~ s/Address Line 1 /Addressĉ /g;
              $Splitterhold =~ s/City /Cityĉ /g;
              $Splitterhold =~ s/State /Stateĉ /g;
              $Splitterhold =~ s/Email Address /EmailAddressĉ /g;
              $Splitterhold =~ s/Home phone number /Homephonenumberĉ /g;
              $Splitterhold =~ s/Last modified on /Lastmodifiedonĉ /g;
          }
          #@sector = split /Payment Detail -/, $section[1], /administration>/;
          if ($#Splitterhold > 0) {
              for ($x = 0; $x < $#Splitterhold; $x++) {
                  @Split = split/\n/, $Splitterhold[$x];
                  @parse = split /ĉ/, @Split;
                  if ($#parse > 0) {
                      $parse[0] =~ s/\W//g;
                      $parse[1] =~ s/\-//g;
                      @AO{$parse[0]} = $parse[1];
                  }
                  if ($#parsezero > 0) {
                      $parsezero[1]=~ s/\-//g;
                      $IV{$parsezero[0]} = $parsezero[1];
                      @IVone = push (@IV, @IV);
                      print $IV;
                  }
              }
          }
          $Body[1] =~ s/^one$/1/gi;
          $Body[1] =~ s/^two$/2/gi;
          $Body[1] =~ s/^three$/3/gi;
          $Body[1] =~ s/^four$/4/gi;
          $Body[1] =~ s/^five$/5/gi;
          $Body[1] =~ s/^six$/6/gi;
          $Body[1] =~ s/^seven$/7/gi;
          $Body[1] =~ s/^eight$/8/gi;
          $Body[1] =~ s/^nine$/9/gi;
          $Body[1] =~ s/^zero$/0/gi;
          $Body[1] =~ s/0ne/1/gi;
          @PostingBody = split/\n/, $Body[1];
          for ($x = 0; $x < $#PostingBody; $x++) {
              $PostingBody[$x] =~ s/\s//gi;
              $PostingBody[$x] =~ s/\W//gi;
              $MC = NULL;
              if ($PostingBody[$x] =~ m/\d{3}.*\d{3}.*\d{4}/) {
                  $PostingBody[$x] =~ s/\D//gi;
                  $PostingBody[$x] =~ s/\W//g;
                  $MC{'Digits'} = $PostingBody[$x];
              }
          }
          @elements=('Digits');
          for($x=0; $x< @elements; $x++) {
              print ($MC{$elements[$x]}."\t\t");
              $MC = "";
          }
          @elements=("PostID","Location","posted","Reply","Postersage","Partner",
                     "AdType","PaidAd","AdPrice","Whitelisted","Name","Phone","Email","UserCreated","Settings",
                     "Referrer","IP","AdCreated");
          for($x=0; $x< @elements; $x++) {
              print(@AO{$elements[$x]}."\t");
              $AO = "";
          }
          @elements=("Lastmodifiedon", "InvoiceDate", "CreditCard", "Name", "Address",
                     "City", "State", "EmailAddress", "Homephonenumber", "CustomerID");
          for($x=0; $x< @elements; $x++) {
              print (@IVone{$elements[$x]}."\t");
              $IV = "";
          }
          {
              print "\n";
          }
      }
        You can supply a bit of data for posting in a self-contained example by using the DATA handle, e.g.:
            while (<DATA>) {
                print "Got: $_";
            }
            __END__
            one
            two
            three
        Try to post the minimum amount of code and data that demonstrates the problem you're having (and fix your closing code tag).
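One thing worth knowing when the records span several lines: by default `while (<STDIN>)` hands over one line at a time, so a split on a marker like "Class: Invoice" never sees a whole order. A minimal sketch of reading one order per iteration by changing the input record separator `$/`, assuming every order begins with the literal text "Orders Number " as in the sample. The in-memory filehandle and the sample string below are stand-ins for the pdftotext output (or a DATA section), and read_orders is a made-up helper name.

```perl
use strict;
use warnings;

# Read a filehandle one order at a time by making the order marker
# the input record separator.
sub read_orders {
    my ($fh) = @_;
    local $/ = "Orders Number ";
    my @orders;
    while (my $chunk = <$fh>) {
        chomp $chunk;                 # strips the trailing marker, if present
        next unless $chunk =~ /\S/;   # skip the empty chunk before the first marker
        push @orders, "Orders Number $chunk";
    }
    return @orders;
}

# Sample text modelled on the layout in the question.
my $sample = <<'END';
Orders Number 1
... some stuff ...
Class: Invoice 1
... some stuff ...
Payment Detail - Payment ID 1
... some stuff ...
Orders Number 2
... some stuff ...
Class: Invoice 1
... some stuff ...
Payment Detail - Payment ID 1
... some stuff ...
Payment Detail - Payment ID 2
... some stuff ...
Payment Detail - Payment ID 3
... some stuff ...
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @orders = read_orders($fh);
close $fh;

printf "Read %d orders\n", scalar @orders;
```

Each element of @orders then holds the full text of one order, which is the unit the later split and regex steps want to work on.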
Re: Extracting data from a PDF to a spreadsheet
by wind (Priest) on Jun 21, 2011 at 23:28 UTC
Re: Extracting data from a PDF to a spreadsheet
by 7stud (Deacon) on Jun 22, 2011 at 01:05 UTC
    My problem is that each order has different number of invoices associated with it.
    The data you posted has one invoice per order.
      I guess it does. Let's say that one order had 1 invoice and another had 4 invoices. How would I go about capturing the information from each if they don't have the same number of invoices?
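One possible way to handle a varying count is to not care about the count at all: split the order on the invoice marker, get back however many pieces there are, and loop over them. A rough sketch, assuming the markers "Class: Invoice" and "Payment Detail - Payment ID" are each followed by a number as in the sample; parse_order and order_rows are hypothetical helper names, and the second invoice in the sample text is invented to show the variable-count case.

```perl
use strict;
use warnings;

# Break one order into its invoices, and each invoice into payments.
# split with a lookahead keeps the marker at the start of each piece
# and returns as many pieces as there are markers: 1, 4, or any number.
sub parse_order {
    my ($order_text) = @_;
    my ($head, @invoice_texts) = split /(?=Class: Invoice)/, $order_text;
    my ($number) = $head =~ /Orders Number\s+(\d+)/;
    my @invoices;
    for my $inv (@invoice_texts) {
        my ($inv_head, @pay_texts) =
            split /(?=Payment Detail - Payment ID)/, $inv;
        my ($inv_no) = $inv_head =~ /Class: Invoice\s+(\d+)/;
        # A match in list context returns its captures, so each
        # payment piece contributes its ID (or nothing if no match).
        my @payments = map { /Payment ID\s+(\d+)/ } @pay_texts;
        push @invoices, { number => $inv_no, payments => \@payments };
    }
    return { number => $number, invoices => \@invoices };
}

# Flatten the nested structure into one tab-separated row per payment,
# repeating the order and invoice numbers on every row.
sub order_rows {
    my ($order) = @_;
    my @rows;
    for my $inv (@{ $order->{invoices} }) {
        for my $pay (@{ $inv->{payments} }) {
            push @rows, join "\t", $order->{number}, $inv->{number}, $pay;
        }
    }
    return @rows;
}

my $order = parse_order(<<'END');
Orders Number 2
... some stuff ...
Class: Invoice 1
... some stuff ...
Payment Detail - Payment ID 1
... some stuff ...
Class: Invoice 2
... some stuff ...
Payment Detail - Payment ID 2
... some stuff ...
Payment Detail - Payment ID 3
... some stuff ...
END

print "$_\n" for order_rows($order);
```

Repeating the order and invoice numbers on each row keeps the spreadsheet rectangular no matter how many invoices or payments an order has.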
Re: Extracting data from a PDF to a spreadsheet
by LanX (Saint) on Jun 23, 2011 at 01:36 UTC
