Hello diamondsandperls,

There are a few problems in your updated code:

(1) The regexes won’t work as you want. For example, in a regex a single dot matches any character (except newline). To get the literal dots in an IP address, you must backslash them: \. And for the URI and HOST, you want the capture to end at the first whitespace, so use \S+

(2) No need to re-open the PDF file each time through the foreach loop.

(3) As bulk88 pointed out, having created an object ($doc), you should call an instance method on it: $doc->parseAny($pdfString);

(4) However, I’m not sure if that’s the method you want. From the module’s documentation, it appears getPageText might be the right choice.
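
To make point (1) concrete, here is a small sketch that needs no PDF at all. The sample strings are made up for illustration, and the `.+` pattern is only a stand-in for any capture that does not stop at whitespace; it is not taken from your code.

```perl
use strict;
use warnings;

# Made-up sample strings, for illustration only.
my $not_an_ip = 'Source IP: 10x20y30z40';
my $request   = 'Request URI: /index.html HTTP/1.1';

# Unescaped dots match any character, so this accepts a non-address.
my ($loose)  = $not_an_ip =~ /Source IP:\s*(\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3})/;

# Escaped dots match only literal dots, so this one fails to match.
my ($strict) = $not_an_ip =~ /Source IP:\s*(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/;

# .+ runs to the end of the line; \S+ stops at the first whitespace.
my ($greedy) = $request =~ /Request URI:\s*(.+)/;
my ($token)  = $request =~ /Request URI:\s*(\S+)/;

printf "loose:  %s\n", defined $loose  ? $loose  : '(no match)';
printf "strict: %s\n", defined $strict ? $strict : '(no match)';
printf "greedy: %s\n", $greedy;
printf "token:  %s\n", $token;
```

The first pattern happily captures 10x20y30z40, while the escaped version rejects it; likewise only \S+ stops the URI capture before the trailing HTTP/1.1.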

Applying these fixes to your code (and assuming the PDF document contains only 1 page):

    #! perl

    use strict;
    use warnings;
    use CAM::PDF;

    my $filename    = 'view1.pdf';
    my $output_file = 'test.txt';

    # One pattern per field; each captures a single value.
    my @pdfStrings = (
        qr/Source IP:\s*(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/,
        qr/Request URI:\s*(\S+)/,
        qr/HOST:\s*(\S+)/,
    );

    # Open the PDF once, outside the loop.
    # CAM::PDF reports failures in $CAM::PDF::errstr rather than $!.
    my $pdf = CAM::PDF->new($filename)
        or die "Cannot open '$filename' as a PDF file: $CAM::PDF::errstr";

    # Text of page 1 only.
    my $doc = $pdf->getPageText(1);

    open(my $output_fh, '>', $output_file)
        or die "Failed to open file '$output_file' for writing: $!";

    foreach my $search_string (@pdfStrings)
    {
        my ($find) = $doc =~ /$search_string/;

        # Test definedness so that a captured "0" is not silently dropped.
        print $output_fh $find, "\n" if defined $find;
    }

    close($output_fh)
        or die "Failed to close file '$output_file': $!";

In principle, this should work. However, when I create a test PDF file using Word, I find that $pdf->getPageText(1) returns a string containing the text of the PDF file but with extra newlines inserted. (I cannot see any reason for this.) And these newlines can cause the regexes to fail wherever one lands inside a label such as "Source IP:". :-( But if your input PDF files are created differently, perhaps they won’t give rise to this problem?
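
If the extra newlines do turn up in your files, one possible workaround is to collapse all whitespace before matching. The following is only a sketch under assumptions: normalize_whitespace is a hypothetical helper of my own (not part of CAM::PDF), and the sample string merely imitates the kind of output I saw. It will not help if a newline lands in the middle of a value, e.g. inside the IP address itself.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): collapse every run of
# whitespace, newlines included, into a single space.
sub normalize_whitespace {
    my ($text) = @_;
    $text =~ s/\s+/ /g;    # newlines, tabs, repeated spaces -> one space
    $text =~ s/^ //;       # drop a leading space, if any
    $text =~ s/ $//;       # drop a trailing space, if any
    return $text;
}

# Made-up sample imitating page text with stray newlines.
my $raw = "Source\nIP:\n192.168.1.10\nRequest\nURI: /index.html\nHOST:\n\nexample.com\n";

my $clean = normalize_whitespace($raw);

my ($ip)   = $clean =~ /Source IP:\s*(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/;
my ($uri)  = $clean =~ /Request URI:\s*(\S+)/;
my ($host) = $clean =~ /HOST:\s*(\S+)/;

# Print whichever fields were found.
print "$_\n" for grep { defined } $ip, $uri, $host;
```

In the script above, this would amount to passing the result of $pdf->getPageText(1) through the helper before the foreach loop.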

HTH,

Athanasius <°(((><contra mundum

