pankaj_it09 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to parse PDF files as shown below:

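my ($fileName) = @_;

# Construct the object once instead of parsing the file twice;
# bail out if new() returns a false value, as the original check did.
my $pdf = CAM::PDF->new($fileName) or return;

my $numOfPages = $pdf->numPages();
for my $i (1 .. $numOfPages) {
    # $text is assumed to be declared elsewhere; each page overwrites it.
    $text = $pdf->getPageText($i);
}
$pdf->cleanse();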

This is the error:
Expected stream open tag 20 stream^Mq^Mq^M323 0 0 188 0 0 cm^M /I1 Do^MQ^M...

Some PDF files are parsed successfully, but others fail with this error.

Re: Parsing PDF file
by Corion (Patriarch) on Nov 19, 2008 at 08:42 UTC

    My guess is that the PDF files are broken. The ^M displayed in the error message could indicate that the files were transferred via FTP in ASCII mode, which would have corrupted them. There is nothing Perl can do to help.
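
The `stream^M` in the error text shows a `stream` keyword followed by a carriage return alone. The PDF specification requires that keyword to be followed by CRLF or a lone LF, so a damaged line ending at that spot may be what the parser rejects. Below is a minimal core-Perl sketch for spotting this. `count_cr_only_stream_tags` is a hypothetical helper name, not part of CAM::PDF, and compressed stream data could in principle contain the same byte sequence, so a nonzero count is a hint rather than proof:

```perl
use strict;
use warnings;

# Hypothetical helper (not a CAM::PDF function): count "stream" keywords
# that are followed by a lone carriage return.
sub count_cr_only_stream_tags {
    my ($bytes) = @_;
    # (?<!end) skips "endstream"; (?!\n) excludes the valid CRLF case.
    my $count = () = $bytes =~ /(?<!end)stream\r(?!\n)/g;
    return $count;
}
```

The function takes the raw file contents, so the file should be read in binary mode (for example with `open my $fh, '<:raw', $path`) to keep Perl from translating line endings.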

Re: Parsing PDF file
by moritz (Cardinal) on Nov 19, 2008 at 08:41 UTC
    Have you checked that the PDFs are actually sane (i.e. not damaged)? Can other programs open them?

    And have you read the COMPATIBILITY section of the manual? Is the version of your PDF file supported?
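
A rough way to pre-screen files before handing them to the parser is sketched here in core Perl. It only checks two structural markers that every PDF should carry, a `%PDF-` header at the start and an `%%EOF` marker near the end, so it catches truncated or mislabeled files but not every kind of damage. Both `slurp_binary` and `pdf_sanity_problems` are hypothetical helper names, and the 1024-byte tail window is an assumption:

```perl
use strict;
use warnings;

# Read a file as raw bytes, with no newline translation.
sub slurp_binary {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;    # slurp mode: read the whole file at once
    my $bytes = <$fh>;
    close $fh;
    return defined $bytes ? $bytes : '';
}

# Hypothetical helper: return a list of structural problems found in $bytes.
sub pdf_sanity_problems {
    my ($bytes) = @_;
    my @problems;

    # Strict check: the header must be the very first thing in the file.
    push @problems, 'missing %PDF- header'
        unless $bytes =~ /\A%PDF-\d\.\d/;

    # The end-of-file marker is expected close to the end (window size assumed).
    my $tail = length($bytes) > 1024 ? substr($bytes, -1024) : $bytes;
    push @problems, 'missing %%EOF marker near end of file'
        unless $tail =~ /%%EOF/;

    return @problems;
}
```

A caller could run `pdf_sanity_problems(slurp_binary($fileName))` first and skip any file for which the returned list is non-empty.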

Re: Parsing PDF file
by pankaj_it09 (Scribe) on Nov 16, 2009 at 13:37 UTC
    Is it required to destroy the PDF object?

    Which of the routines below should be used to destroy the PDF object?

    $doc->getValue($object)
    For INTERNAL use. Dereference a data object, return a value. Given a node object of any kind, returns raw scalar object: hashref, arrayref, string, number. This function follows all references, and descends into all objects.

    $doc->getObjValue($objectnum)
    For INTERNAL use. Dereference a data object, and return a value. Behaves just like the getValue() function, but used when all you know is the object number.

    $doc->dereference($objectnum)

    $doc->dereference($name, $pagenum)
    For INTERNAL use. Dereference a data object, return a PDF object as a node. This function makes heavy use of the internal object cache. Most (if not all) object requests should go through this function. $name should look something like '/R12'.
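
Going by the descriptions quoted above, these routines look up data inside a PDF; none of them destroys the Perl object. In Perl generally, an object is freed once the last reference to it goes away, either when its lexical variable goes out of scope or when it is explicitly undefined. The sketch below demonstrates that with a stand-in class. `Demo::Handle` is hypothetical and not part of CAM::PDF, and the sketch assumes nothing else (such as a global variable or a circular reference) still holds the object:

```perl
use strict;
use warnings;

# Hypothetical stand-in class that counts how many instances were destroyed.
package Demo::Handle;
our $destroyed = 0;
sub new     { my ($class) = @_; return bless {}, $class; }
sub DESTROY { $Demo::Handle::destroyed++ }

package main;

sub process_one {
    my $handle = Demo::Handle->new();    # lexical, like "my $pdf = ..."
    return 'done';                       # $handle goes out of scope on return
}

# Case 1: the object is freed automatically when the sub returns.
process_one();
print "after sub returns: $Demo::Handle::destroyed destroyed\n";

# Case 2: the object is freed explicitly by dropping the only reference.
my $h = Demo::Handle->new();
undef $h;
print "after undef: $Demo::Handle::destroyed destroyed\n";
```

Applied to the code in the question, a `my $pdf` declared inside the subroutine would be released the same way when the subroutine returns, with no dedicated destroy call.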