kevyt has asked for the wisdom of the Perl Monks concerning the following question:
I have not been able to parse a few fields from a pdf file with CAM::PDF or regular expressions. Can someone offer help on how I may accomplish the task?
I noticed that CAM::PDF changes $100 to $ 100.
I was not able to split on \n so I split the line on the Id number in the far right column.
The column AWD is the company that won.
I would like to capture all of the columns except comments.
Here are two example files:
https://contractorconnection.gpo.gov/abstract/746810
https://contractorconnection.gpo.gov/abstract/746819
Thanks
Kevin
#!/usr/bin/perl -w
use warnings;
use strict;
use CAM::PDF;
use LWP::UserAgent;    # the code below calls LWP::UserAgent->new

my @ns_headers = (
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset'  => 'iso-8859-1,*,utf-8',
    'Accept-Language' => 'en-US',
);

my $jacket_id = 746810;
my $ua = LWP::UserAgent->new;
# $ua->timeout(5);

# Is the site available?
my $response = $ua->get('https://contractorconnection.gpo.gov/abstract/' . $jacket_id, @ns_headers);

my $pdf = CAM::PDF->new($response->content) || die "$CAM::PDF::errstr\n";
# my $pdf = CAM::PDF->new('C:\dev\perl\file.pdf') || die "$CAM::PDF::errstr\n";
# print $pdf->toString();

for my $page (1 .. $pdf->numPages) {
    my $text  = $pdf->getPageText($page);
    my @lines = split /$jacket_id\s+/, $text;    # split on Jacket ID and a space
    foreach (@lines) {
        print "\n$_\n";
        if (/^(A)/) {                # A at the beginning of a line is the Award winner
            print $1;
        }
        if (/^(\d+\-)(\d+)/) {       # Contractor Code
            print "Contractor code " . $1 . $2 . "\n";
        }
        if (/(\w+)\s+\$/) {          # Does not work
            print "Name " . $1 . "\n";    # Name
        }
        # if (/\$?([0-9]{1,3},([0-9]{3},)*[0-9]{3}|[0-9]+)(\.[0-9][0-9])?$)/) {    # Does not work
        #     print "Amount " . $1 . "\n";    # Amount
        # }
        # if (1) {    # Date
        #     print "Date " . $1;
        # }
    }
}
Re: Regular Expression to Parse Data from a PDF
by kcott (Archbishop) on Feb 27, 2020 at 11:50 UTC
G'day Kevin,

Firstly, I'm not a user of CAM::PDF; in fact, I didn't even have it installed. I suspect the getPageText() method is not the best choice for this: as you noted, you can't split lines easily and dollar amounts have an embedded space. I can't advise of a better choice; perhaps another monk can.

I would strongly recommend that you do not write lengthy regexes the way you did in the last example in your code: they are incredibly difficult to read, even more difficult to maintain, and extremely error-prone. See my code below for a much better way to do this. Also, take a look at Regexp::Debugger: I find it very helpful and, in fact, used it to check some of the fiddlier parts of the regex in the code below.

I've ignored the PDF download part of the code. You didn't ask about that: I'm assuming you've got that working satisfactorily. I just downloaded the two PDFs you referenced and accessed them from a local disk.

Some notes on how I've dealt with lack of information:
Here's the code:
Here's the first part of the output using your first example PDF:
Open the spoiler to see full output for both example PDFs.
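The readable-regex style recommended above can be sketched as follows. This is not Ken's listing: the sample row, its column order, and the names `$row_re` and `parse_row` are assumptions made for illustration, and the real getPageText() output may be spaced differently. The sketch uses the /x modifier and named captures so that each field sits on its own commented line.

```perl
use strict;
use warnings;

# Hypothetical row, modelled on the fields described in the question
# (award flag, contractor code, name, amount, date, jacket ID).
my $sample_row = 'A 123-45678 ACME PRINTING $ 1,234.56 02/27/2020 746810';

my $row_re = qr{
    \A \s*
    (?<award>  A )?                        \s*   # optional award marker
    (?<code>   \d+ - \d+ )                 \s+   # contractor code
    (?<name>   .+? )                       \s+   # contractor name, non-greedy
    \$ \s*                                       # dollar sign, optional space after it
    (?<amount> \d{1,3} (?: , \d{3} )* (?: \. \d{2} )? )  \s+
    (?<date>   \d{2} / \d{2} / \d{4} )     \s+   # date as mm/dd/yyyy (assumed)
    (?<jacket> \d+ )                             # jacket ID in the last column
    \s* \z
}x;

# Return a hash reference of fields, or nothing if the row does not match.
sub parse_row {
    my ($line) = @_;
    return unless $line =~ $row_re;
    my %field = %+;                          # copy the named captures
    $field{award} = defined $field{award} ? 1 : 0;
    $field{amount} =~ tr/,//d;               # drop thousands separators
    return \%field;
}

my $parsed = parse_row($sample_row);
print join(', ', map { "$_=$parsed->{$_}" } sort keys %$parsed), "\n";
```

Because the pattern tolerates optional whitespace after the dollar sign, it accepts both "$100" and the "$ 100" form that CAM::PDF produces.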
-- Ken
by kevyt (Scribe) on Feb 27, 2020 at 16:35 UTC
by kevyt (Scribe) on Feb 28, 2020 at 02:25 UTC
Ken, Thanks very much for your help. It's working great but I forgot about one issue. They might add a "R-1" or "R-2" to the far left column if there is a revision. I have not used Perl much since 2006 and I rarely used regex. I also tried to get some of the comments but that won't be important going forward.
Example with R-1: https://contractorconnection.gpo.gov/abstract/777292
Example without R-1: https://contractorconnection.gpo.gov/abstract/777293
I also need to install CAM::PDF so I can run it on Linux.
by kcott (Archbishop) on Feb 28, 2020 at 06:28 UTC
'They might add a "R-1" or "R-2" to the far left column if there is a revision.' You just need to extend the regex to handle that. Here's an example:
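One way to extend a regex for the optional revision marker might look like the sketch below. The sample rows and the names `$rev_re` and `split_revision` are hypothetical; the only idea taken from the thread is that "R-1" or "R-2" may or may not appear in the far left column.

```perl
use strict;
use warnings;

# Hypothetical rows: the second carries a revision marker in the first column.
my @sample_rows = (
    '123-45678 ACME PRINTING $ 100.00',
    'R-1 123-45678 ACME PRINTING $ 100.00',
);

my $rev_re = qr{
    \A \s*
    (?: (?<revision> R-\d+ ) \s+ )?    # optional revision marker
    (?<code> \d+ - \d+ )               # contractor code
}x;

# Return (revision, code); revision is an empty string when absent.
sub split_revision {
    my ($row) = @_;
    return unless $row =~ $rev_re;
    my $revision = defined $+{revision} ? $+{revision} : '';
    return ($revision, $+{code});
}

for my $row (@sample_rows) {
    my ($revision, $code) = split_revision($row);
    print "code=$code revision=", ($revision eq '' ? 'none' : $revision), "\n";
}
```

Wrapping the marker and its trailing whitespace in one optional non-capturing group keeps the rest of the pattern unchanged for rows without a revision.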
Output:
print ... $fields[1] . ",". $fields[3] . ",". $fields[4] . ",". ...
Here's an example to show a better way to handle that:
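A minimal sketch of the idea, assuming the goal is simply to print a few selected elements of @fields separated by commas. The field values and the index list `@wanted` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical field list standing in for one parsed row.
my @fields = ('R-1', '123-45678', 'A', 'ACME PRINTING', '100.00', '02/27/2020');

# Name the wanted columns once, then take an array slice and join it.
my @wanted   = (1, 3, 4);
my $csv_line = join ',', @fields[@wanted];

print "$csv_line\n";
```

Changing which columns are printed then means editing one list rather than a chain of concatenations. If a field can itself contain a comma, a CSV module such as Text::CSV would be a safer choice than a plain join.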
On an unrelated note, there are problems with your open statements. Use of package variables can lead to all sorts of bugs that are hard to track down. Your six error messages are identical: how will you know which file generates "Can't open the output file ..."? Look to using lexical filehandles and the 3-argument form of open. Consider the autodie pragma: you'll do less work and get better error reporting.
-- Ken
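The advice on lexical filehandles, 3-argument open, and autodie can be sketched like this. The file name awards.csv and the temporary directory are assumptions so that the example can run anywhere; the original open statements are not shown in the thread.

```perl
use strict;
use warnings;
use autodie;                       # open and close now die with the file name and reason
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/awards.csv";      # hypothetical output file

# Lexical handle ($out) instead of a package-level bareword such as OUT,
# and the mode given as its own argument.
open my $out, '>', $path;
print {$out} "123-45678,ACME PRINTING,100.00\n";
close $out;

open my $in, '<', $path;
my @lines = <$in>;
close $in;

print "read back: $lines[0]";
```

With autodie in effect there is no need to write "or die ..." after each open, and a failure reports which file and which mode were involved.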
by kevyt (Scribe) on Feb 28, 2020 at 06:52 UTC
Re: Regular Expression to Parse Data from a PDF
by vr (Curate) on Feb 27, 2020 at 12:13 UTC
(OT, not really Perl) That approach won't work, in general. Text extraction from PDF always involves some level of heuristics, especially with tables and/or formatting. CAM::PDF is very naive about extraction and is good for simple checks only, for limited subset of plain English. You may wish to take a look at CAM::PDF::getPageContent output:
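For orientation, here is a sketch of pulling the text-showing strings out of a content stream. The stream fragment is hypothetical (real ones are far longer), and the helper name `shown_strings` is made up; with CAM::PDF the input would come from `$pdf->getPageContent($page)`. The sketch handles backslash-escaped parentheses but ignores nested unescaped parentheses, octal escapes, and non-ASCII encodings.

```perl
use strict;
use warnings;

# Hypothetical content-stream fragment in PDF operator syntax.
my $content = <<'END';
BT
/F1 9 Tf
1 0 0 1 36 700 Tm
(Contractor) Tj
1 0 0 1 120 700 Tm
[(Bid) -250 (Amount)] TJ
1 0 0 1 36 688 Tm
(ACME \(East\)) Tj
ET
END

# Collect every parenthesised string, in the order it appears.
sub shown_strings {
    my ($stream) = @_;
    my @strings;
    while ($stream =~ m{ \( ( (?: [^()\\] | \\. )* ) \) }xsg) {
        my $s = $1;
        $s =~ s/\\([()\\])/$1/g;    # undo \( \) and \\ escapes
        push @strings, $s;
    }
    return @strings;
}

print "$_\n" for shown_strings($content);
```

Everything between the strings (the Tm, Tf, Tj and TJ operators and their numbers) is positioning and formatting, which is why no space characters need to appear inside the parentheses.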
In (very) simple English, what's inside parentheses is text content to show; what's in between (you guessed it) are positioning and formatting commands. And we are lucky that, in this trivial case, text has single-byte plain-ASCII encoding, so we can actually read it from source. If you scroll down, there are no space characters in parens. That's why, if we try to select and copy in Firefox, and paste into a text editor, we'd get an ugly glued-together mess. So, FF is even more naive about text extraction than our CAM::PDF. The spaces appear to be present because of positioning of words.

(Of course it's not always so, for all PDFs out there. Some use spaces. Some use kerning. Some use a single text object (bracketed between a BT/ET pair, as the whole page in your file) per each and every character. Thing to remember: PDF is always machine-generated stuff on a long and familiar TIMTOWTDI leash, and intended to be consumed by machines. Better not worry nor ask too many "why?")

CAM::PDF has spaces in its extracted text, even, as you noticed, where they should not be. It decided to play safe, but simple. Usually (not always...) text is split between text-showing operators (TJ and friends) into chunks not less than a word. So, if we want to join chunks on extraction and are too lazy to analyze horizontal offsets, let's insert a space. (Actually, Adobe Reader is smart enough to add spaces where appropriate, for this file.)

===

OK, I'd try (and I did, in the past) to investigate XML produced by Ghostscript. See here. Mode "0" is low level, mode "1" tries heuristics to combine text chunks, but fails for your file, on quick and casual inspection, see further. (Note, I've seen the GS "txtwrite" device have issues/regressions in some releases, YMMV.)

Mode "0", apart from the top "page" level, has "char" leaf nodes, with decoded character and calculated position (and also font/size), and intermediate, but actually atomic, "spans" (the "things in parens"). It's up to you, programmer, to decide if 2 adjacent spans are a single word, or they are 2 words to be separated with a space, or (with tabular data) belong to different cells. Mode "1" tries to consolidate spans, adding spaces, but is not very good at it (see words glued together):
and also introduces "lines" and "blocks". Again, not too bright (halves of 2 cells in header row end up in one "block"):
I'd not use mode "1", but mode "0". Find spans containing your "jacket" string. Their vertical offsets are table rows boundaries. From your 2 files, columns have constant offsets. From here you should have an idea how to find individual cells content.
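The mode "0" approach can be sketched as below, under several loud assumptions: the XML shape (span elements with a bbox attribute, char elements with a c attribute) is written from memory of the txtwrite output, all coordinates are invented, and the column left edges in `@columns` and the 2-unit gap threshold are hypothetical values you would measure from your own files. For simplicity the sketch uses each span's own vertical offset as the row key rather than locating the jacket spans first.

```perl
use strict;
use warnings;

# Assumed shape of Ghostscript txtwrite output in mode "0"; values are invented.
my $gs_xml = <<'END';
<page>
<span bbox="36 700 48 710" font="Helvetica" size="9.0000">
<char bbox="36 700 42 710" c="A"/>
<char bbox="42 700 48 710" c="C"/>
</span>
<span bbox="48 700 60 710" font="Helvetica" size="9.0000">
<char bbox="48 700 54 710" c="M"/>
<char bbox="54 700 60 710" c="E"/>
</span>
<span bbox="200 700 205 710" font="Helvetica" size="9.0000">
<char bbox="200 700 205 710" c="$"/>
</span>
<span bbox="206 700 221 710" font="Helvetica" size="9.0000">
<char bbox="206 700 211 710" c="1"/>
<char bbox="211 700 216 710" c="0"/>
<char bbox="216 700 221 710" c="0"/>
</span>
</page>
END

# Hypothetical column left edges, in ascending order.
my @columns = ([name => 30], [amount => 190]);

# Turn each span into { x0, x1, y, text }.
sub parse_spans {
    my ($text) = @_;
    my @spans;
    while ($text =~ m{<span \s+ bbox="([\d.]+) \s+ ([\d.]+) \s+ ([\d.]+) \s+ [\d.]+" [^>]*> (.*?) </span>}xsg) {
        my ($x0, $y, $x1, $body) = ($1, $2, $3, $4);
        my $str = join '', $body =~ /\bc="([^"]*)"/g;
        push @spans, { x0 => $x0, x1 => $x1, y => $y, text => $str };
    }
    return @spans;
}

# The column is the last one whose left edge is at or before x.
sub column_for {
    my ($x) = @_;
    my $found;
    for my $col (@columns) {
        $found = $col->[0] if $x >= $col->[1];
    }
    return $found;
}

# Build { row offset => { column => text } }, adding a space only across wide gaps.
sub rows_from_spans {
    my (@spans) = @_;
    my (%row, %last_x1);
    for my $s (sort { $a->{x0} <=> $b->{x0} } @spans) {
        my $col = column_for($s->{x0});
        next unless defined $col;
        my $key = "$s->{y}/$col";
        my $gap = exists $last_x1{$key} ? $s->{x0} - $last_x1{$key} : 0;
        $row{ $s->{y} }{$col} .= ($gap > 2 ? ' ' : '') . $s->{text};
        $last_x1{$key} = $s->{x1};
    }
    return \%row;
}

my $table = rows_from_spans(parse_spans($gs_xml));
for my $y (sort { $b <=> $a } keys %$table) {
    print join(' | ', map { $table->{$y}{$_} // '' } map { $_->[0] } @columns), "\n";
}
```

Deciding on the gap threshold is exactly the "single word, two words, or different cells" judgement described above; here a small gap glues "$" and "100" back together.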
by kevyt (Scribe) on Feb 27, 2020 at 15:46 UTC
Re: Regular Expression to Parse Data from a PDF
by LanX (Saint) on Feb 27, 2020 at 10:34 UTC
If you want help with regex, then you should show us the input strings and the desired results. See also SSCCE.
On a side note: I'm personally using pdftohtml -xml to parse PDFs.
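As a rough illustration of working with that XML, the sketch below groups text elements into rows by their top attribute and orders each row by left. The sample document is invented, the helper name `rows_by_top` is made up, and a regex stands in for a real XML parser (XML::LibXML or XML::Twig would be more robust) only to keep the example dependency-free.

```perl
use strict;
use warnings;

# Invented sample in the shape pdftohtml -xml emits: one text element per chunk,
# positioned with top/left/width/height attributes.
my $pdf_xml = <<'END';
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="120" left="300" width="40" height="12" font="0">$100.00</text>
<text top="120" left="54" width="80" height="12" font="0">ACME PRINTING</text>
<text top="136" left="54" width="70" height="12" font="0">BETA PRESS</text>
<text top="136" left="300" width="40" height="12" font="0">$250.00</text>
</page>
END

# Return a list of rows; each row is an array reference of cell texts, left to right.
sub rows_by_top {
    my ($text) = @_;
    my %by_top;
    while ($text =~ m{<text \s+ top="(\d+)" \s+ left="(\d+)" [^>]*> (.*?) </text>}xsg) {
        push @{ $by_top{$1} }, [ $2, $3 ];    # [left, content]
    }
    my @rows;
    for my $top (sort { $a <=> $b } keys %by_top) {
        my @cells = map { $_->[1] } sort { $a->[0] <=> $b->[0] } @{ $by_top{$top} };
        push @rows, \@cells;
    }
    return @rows;
}

print join(' | ', @$_), "\n" for rows_by_top($pdf_xml);
```

Real files may place cells of one row at slightly different top values, in which case the grouping key would need a tolerance rather than an exact match.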
Cheers Rolf
by kevyt (Scribe) on Feb 27, 2020 at 15:30 UTC
Re: Regular Expression to Parse Data from a PDF
by brostad (Monk) on Feb 27, 2020 at 12:49 UTC
This works for all the rows (except the header row) on https://contractorconnection.gpo.gov/abstract/746810
Re: Regular Expression to Parse Data from a PDF
by Fletch (Bishop) on Feb 27, 2020 at 18:44 UTC
Another possibility to try: if you're on a linux-y system and have the poppler package available which has a pdftotext (RHEL has it in its 'poppler-utils' RPM) that might work for you. Open a pipe from something like pdftotext -layout foo.pdf - and see if that gets what you need from your PDF for your purposes.
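Opening such a pipe might look like the sketch below. It assumes pdftotext is on the PATH, and it assumes that columns in the -layout output are separated by runs of two or more spaces, which will not hold for every table. The helper names `split_columns` and `layout_lines` and the sample line are hypothetical.

```perl
use strict;
use warnings;

# Split one line of layout-preserved text into cells on runs of 2+ spaces.
sub split_columns {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;
    return split /\s{2,}/, $line;
}

# Read pdftotext output through a pipe; the list form of open avoids the shell.
sub layout_lines {
    my ($pdf_file) = @_;
    open my $fh, '-|', 'pdftotext', '-layout', $pdf_file, '-'
        or die "Cannot run pdftotext: $!";
    chomp(my @lines = <$fh>);
    close $fh or die "pdftotext failed (exit status $?)";
    return @lines;
}

if (@ARGV) {
    # Usage: perl script.pl foo.pdf
    print join(' | ', split_columns($_)), "\n"
        for grep { /\S/ } layout_lines($ARGV[0]);
}
else {
    # No file given: demonstrate on a hypothetical layout line.
    print join(' | ', split_columns('123-45678   ACME PRINTING      $100.00   746810')), "\n";
}
```

Single spaces inside a cell, such as the one in a two-word company name, survive the split because only wider gaps are treated as column boundaries.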
The cake is a lie.
by LanX (Saint) on Feb 27, 2020 at 21:04 UTC
Not really, pdftohtml -xml is far better, see Parsing PDFs by text position?
Cheers Rolf