in reply to Extracting data from a PDF to a spreadsheet

You'll probably want to convert the PDF files to text first; I like to use pdftotext. Once it's installed, you can call it from Perl with the system command if you like. Then open the resulting files, read from them, and use split or regular expressions to parse the data. You can use Spreadsheet::WriteExcel to create Excel spreadsheets, or Text::CSV (or Text::CSV_XS) to create CSV files (which some people think of as spreadsheets anyway).
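A rough sketch of that pipeline, in case it helps to see the pieces together. It is untested against your PDFs and makes several assumptions: that the extracted text contains "Label: value" lines, that the column names 'Name' and 'Invoice Date' and the output file invoices.csv suit your data (all three are made up for illustration), and that a piped open is acceptable in place of system so the text comes straight back without a temporary file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run pdftotext (installed separately) and return the extracted text.
# "-layout" tries to preserve the page layout; "-" sends the text to STDOUT.
sub pdf_to_text {
    my ($pdf) = @_;
    open my $fh, '-|', 'pdftotext', '-layout', $pdf, '-'
        or die "Cannot run pdftotext on $pdf: $!";
    local $/;    # slurp the whole output at once
    my $text = <$fh>;
    close $fh or die "pdftotext failed on $pdf (exit status $?)";
    return $text;
}

# Collect "Label: value" pairs into a hash reference.
# ASSUMPTION: the real text may be laid out differently; adjust the regex.
sub parse_fields {
    my ($text) = @_;
    my %field;
    for my $line (split /\n/, $text) {
        next unless $line =~ /^\s*([^:]+?)\s*:\s*(.*?)\s*$/;
        $field{$1} = $2;
    }
    return \%field;
}

# Write a header row, then one row per record, using Text::CSV.
# The module is loaded here so the parsing code runs without it installed.
sub write_csv {
    my ($out, $columns, @records) = @_;
    require Text::CSV;
    my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
        or die "Cannot create Text::CSV object";
    open my $fh, '>:encoding(UTF-8)', $out or die "Cannot write $out: $!";
    $csv->print($fh, $columns);
    for my $rec (@records) {
        # Missing fields become empty cells.
        $csv->print($fh, [ map { $rec->{$_} // '' } @$columns ]);
    }
    close $fh or die "Cannot close $out: $!";
}

# Only do real work when PDF names are given on the command line.
if (@ARGV) {
    my @columns = ('Name', 'Invoice Date');    # hypothetical column names
    my @records = map { parse_fields(pdf_to_text($_)) } @ARGV;
    write_csv('invoices.csv', \@columns, @records);
}
```

Run it as `perl script.pl invoice1.pdf invoice2.pdf` (file names are placeholders). Keeping the parsing in its own sub means you can feed it a pasted chunk of text and check the result before any PDF is involved.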

Hope that helps, and welcome to PerlMonks. Remember that we are not a code writing service, but if you show some code and have a specific problem, we can help you with it.

Re^2: Extracting data from a PDF to a spreadsheet
by Anonymous Monk on Jun 22, 2011 at 16:15 UTC

    Thanks for the guidance. I do have some code, but I didn't think it would help to see it without the file it's trying to extract data from. I'll post it here anyway.

    while(<STDIN>) {
        @section = split /Class: Invoice/, $_;
        @AdminData = split /\n/, $section[0];
        @BodyTemp = split /Administrative Data:/, $_;
        @Body = split /Reply: click here/, $BodyTemp[0];
        @Splitterhold = split/Payment Detail - Payment ID /, $_;
        foreach $Splitterhold(@Splitterhold) {
            $Splitterhold =~ s/InvoiceDate /Invoice Dateĉ /g;
            $Splitterhold =~ s/Customer ID /CustomerIDĉ /g;
            $Splitterhold =~ s/^Phone /Phoneĉ /g;
            $Splitterhold =~ s/Txn Type Post Day Amount \(USD\)\n/InvoiceDateĉ /g;
            $Splitterhold =~ s/Card Type Card Number Exp Date BIN\n/CreditCardĉ /g;
            $Splitterhold =~ s/Name /Nameĉ /g;
            $Splitterhold =~ s/Address Line 1 /Addressĉ /g;
            $Splitterhold =~ s/City /Cityĉ /g;
            $Splitterhold =~ s/State /Stateĉ /g;
            $Splitterhold =~ s/Email Address /EmailAddressĉ /g;
            $Splitterhold =~ s/Home phone number /Homephonenumberĉ /g;
            $Splitterhold =~ s/Last modified on /Lastmodifiedonĉ /g;
        }
        #@sector = split /Payment Detail -/, $section[1], /administration>/;
        if ($#Splitterhold > 0) {
            for ($x = 0; $x < $#Splitterhold; $x++) {
                @Split = split/\n/, $Splitterhold[$x];
                @parse = split /ĉ/, @Split;
                if ($#parse > 0) {
                    $parse[0] =~ s/\W//g;
                    $parse[1] =~ s/\-//g;
                    @AO{$parse[0]} = $parse[1];
                }
                if ($#parsezero > 0) {
                    $parsezero[1]=~ s/\-//g;
                    $IV{$parsezero[0]} = $parsezero[1];
                    @IVone = push (@IV, @IV);
                    print $IV;
                }
            }
        }
        $Body[1] =~ s/^one$/1/gi;
        $Body[1] =~ s/^two$/2/gi;
        $Body[1] =~ s/^three$/3/gi;
        $Body[1] =~ s/^four$/4/gi;
        $Body[1] =~ s/^five$/5/gi;
        $Body[1] =~ s/^six$/6/gi;
        $Body[1] =~ s/^seven$/7/gi;
        $Body[1] =~ s/^eight$/8/gi;
        $Body[1] =~ s/^nine$/9/gi;
        $Body[1] =~ s/^zero$/0/gi;
        $Body[1] =~ s/0ne/1/gi;
        @PostingBody = split/\n/, $Body[1];
        for ($x = 0; $x < $#PostingBody; $x++) {
            $PostingBody[$x] =~ s/\s//gi;
            $PostingBody[$x] =~ s/\W//gi;
            $MC = NULL;
            if ($PostingBody[$x] =~ m/\d{3}.*\d{3}.*\d{4}/) {
                $PostingBody[$x] =~ s/\D//gi;
                $PostingBody[$x] =~ s/\W//g;
                $MC{'Digits'} = $PostingBody[$x];
            }
        }
        @elements=('Digits');
        for($x=0; $x< @elements; $x++) {
            print ($MC{$elements[$x]}."\t\t");
            $MC = "";
        }
        @elements=("PostID","Location","posted","Reply","Postersage","Partner",
                   "AdType","PaidAd","AdPrice","Whitelisted","Name","Phone","Email","UserCreated","Settings",
                   "Referrer","IP","AdCreated");
        for($x=0; $x< @elements; $x++) {
            print(@AO{$elements[$x]}."\t");
            $AO = "";
        }
        @elements=("Lastmodifiedon", "InvoiceDate", "CreditCard", "Name", "Address", "City", "State", "EmailAddress", "Homephonenumber", "CustomerID");
        for($x=0; $x< @elements; $x++) {
            print (@IVone{$elements[$x]}."\t");
            $IV = "";
        }
        { print "\n"; }
    }
      You can supply a bit of data for posting in a self-contained example by using the DATA handle, e.g.:
      while (<DATA>) {
          print "Got: $_";
      }
      __END__
      one
      two
      three
      Try to post the minimum amount of code and data that demonstrates the problem you're having (and fix your closing code tag).