Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have a 5000-page PDF monster that I have to parse. I need only the pages with a particular text in the header. Could you give me a hint whether I can do this with Perl, and with which module? Many thanks! WD

Re: Parsing PDF header (footer)
by ww (Archbishop) on Jun 09, 2011 at 12:07 UTC
      Well... actually I tried this earlier and again just now, but without success. I would be grateful if you could give me a hint as to which module could do the job. You need not write the code for me, just point me in the right direction. Thanks.

        What did you try, exactly?

        Also, Super Search will turn up many discussions relating to PDF documents. Maybe they will be helpful to you.

Re: Parsing PDF header (footer)
by wind (Priest) on Jun 09, 2011 at 20:30 UTC
    Try CAM::PDF.
    use strict;
    use warnings;
    use CAM::PDF;

    my $filename = 'foo.pdf';
    my $pdf = CAM::PDF->new($filename);

    # Pages are numbered from 1 in CAM::PDF.
    for my $pagenum (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($pagenum);
        if ($text =~ /looking for text/) {
            print "Found on $pagenum\n";
        }
    }
      Thank you very much! Actually, I must select according to the text in the header, not just the page number. But now I have a direction. Thanks again!
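
For the header requirement, one possible approach is to look only at the first few lines of each page's extracted text instead of the whole page. The following is a minimal sketch, not a tested solution for any particular PDF: header_matches and pages_with_header are hypothetical helper names, and the idea that the header is the first three non-blank lines returned by getPageText is an assumption. Text comes back in content-stream order, which need not match the visual layout, so the header may turn up elsewhere in the text for some documents.

```perl
use strict;
use warnings;

# Hypothetical helper: treat the first $max_lines non-blank lines of a
# page's extracted text as its "header" and test them against $pattern.
# The default of 3 lines is an assumption; tune it for the document.
sub header_matches {
    my ($text, $pattern, $max_lines) = @_;
    $max_lines //= 3;
    return 0 unless defined $text;

    my @lines = grep { /\S/ } split /\n/, $text;
    return 0 unless @lines;

    my $last   = $#lines < $max_lines - 1 ? $#lines : $max_lines - 1;
    my $header = join "\n", @lines[0 .. $last];
    return $header =~ $pattern ? 1 : 0;
}

# Hypothetical helper: return the page numbers whose header matches $pattern.
sub pages_with_header {
    my ($filename, $pattern, $max_lines) = @_;
    require CAM::PDF;    # loaded lazily so header_matches() works without it
    my $pdf = CAM::PDF->new($filename)
        or die "Cannot open $filename\n";
    return grep { header_matches($pdf->getPageText($_), $pattern, $max_lines) }
           1 .. $pdf->numPages();
}

# Usage: perl script.pl foo.pdf 'special text'
if (@ARGV == 2) {
    my ($file, $needle) = @ARGV;
    print "Found on $_\n" for pages_with_header($file, qr/\Q$needle\E/);
}
```

If the header does not come out at the top of the extracted text, printing getPageText for a couple of sample pages should show where it lands, and the line selection can be adjusted accordingly.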