Re^3: parsing a pdf with CAM::PDF

There are a few problems in your updated code:

(1) The regexes won’t work as you intend. In a regex, an unescaped dot matches any character (except newline), so to match the literal dots in an IP address you must backslash them: \. And for the URI and HOST, you want the capture to end at the first whitespace, so use \S+

(2) No need to re-open the PDF file each time through the foreach loop.

(3) As bulk88 pointed out, having created an object ($doc), you should call an instance method on it: $doc->parseAny($pdfString);

(4) However, I’m not sure if that’s the method you want. From the module’s documentation, it appears getPageText might be the right choice.
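
To illustrate point (1), here is a minimal sketch of the difference between an unescaped and an escaped dot, and of \S+ stopping at the first whitespace. The sample strings and variable names are made up for the demonstration; they are not taken from your PDF.

```perl
use strict;
use warnings;

# Unescaped dots: each "." matches any character except newline
my $loose  = qr/^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}$/;
# Escaped dots: each "\." matches only a literal dot
my $strict = qr/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/;

# Hypothetical input that is clearly not an IP address
my $not_an_ip = '10x20y30z40';

print "loose pattern matches '$not_an_ip'\n"  if $not_an_ip =~ $loose;
print "strict pattern matches '$not_an_ip'\n" if $not_an_ip =~ $strict;

# \S+ captures up to, but not including, the first whitespace character
my ($uri) = 'Request URI: /index.html HTTP/1.1' =~ /Request URI:\s*(\S+)/;
print "Captured URI: $uri\n";
```

Only the loose pattern accepts the bogus string, which is why the dots need escaping.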

Applying these fixes to your code (and assuming the PDF document contains only 1 page):

#! perl
use strict;
use warnings;
use CAM::PDF;

my $filename    = 'view1.pdf';
my $output_file = 'test.txt';
my @pdfStrings  = (
                      qr/Source IP:\s*(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/,
                      qr/Request URI:\s*(\S+)/,
                      qr/HOST:\s*(\S+)/,
                  );

# On failure, CAM::PDF reports the reason in $CAM::PDF::errstr, not $!
my $pdf = CAM::PDF->new($filename)
    or die "Cannot open '$filename' as a PDF file: $CAM::PDF::errstr";
my $doc = $pdf->getPageText(1);

open(my $output_fh, '>', $output_file)
    or die "Failed to open file '$output_file' for writing: $!";

foreach my $search_string (@pdfStrings)
{
    my ($find) = $doc =~ /$search_string/;

    print $output_fh $find, "\n" if defined $find;
}

close($output_fh)
    or die "Failed to close file '$output_file': $!";

This should work. However, when I create a test PDF file using Word, I find that $pdf->getPageText(1) returns a string containing the text of the PDF file but with extra newlines inserted. (I cannot see any reason for this.) These newlines can cause the regexes to fail, because a literal space in a pattern such as "Request URI:" will not match a newline. :-( But if your input PDF files are created differently, perhaps they won’t give rise to this problem?
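
If the extra newlines do turn up in your files, one possible workaround (a sketch only, which I have not tried against real PDF output) is to collapse every run of whitespace to a single space before matching. The $text string below is an invented stand-in for what getPageText might return, and the variable names are my own.

```perl
use strict;
use warnings;

# Hypothetical page text with stray newlines, standing in for getPageText(1)
my $text = "Source IP:\n192.168.0.1\nRequest\nURI: /login.php\nHOST:\n\nexample.com\n";

# Collapse each run of whitespace (newlines included) into one space
(my $flat = $text) =~ s/\s+/ /g;

# Same patterns as in the script above, applied to the flattened text
my ($ip)   = $flat =~ /Source IP:\s*(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/;
my ($uri)  = $flat =~ /Request URI:\s*(\S+)/;
my ($host) = $flat =~ /HOST:\s*(\S+)/;

for my $found ($ip, $uri, $host) {
    print "$found\n" if defined $found;
}
```

Note that this only helps when the newlines fall between words. If a newline lands in the middle of a value (inside the IP address, say), flattening turns it into a space and the capture will still be cut short.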

HTH,

Athanasius <°(((>< contra mundum
