hurix_03 has asked for the wisdom of the Perl Monks concerning the following question:

How can I find corrupted PDF files in Perl?

I have used the code below. Can anyone correct it? The program loops only once and then exits. How can I get an error report?

use CAM::PDF;
use PDF;

undef $/;
my $path = $ARGV[0];
opendir(DIR, $path) || die $!;
@all = grep /\.pdf/, readdir(DIR);
closedir(DIR);

foreach $aa (@all) {
    my $pdf = PDF->new($aa);
    if ($pdf->IsaPDF) {
        print "is a pdf file\n"
    }
    else {
        print "pdf $aa is corrupted\n";
    }
}

Replies are listed 'Best First'.
Re: I have a problem in finding corrupted PDF files
by shmem (Chancellor) on Dec 10, 2007 at 13:07 UTC
    How can i get the error report

    Saying

    use strict; use warnings;

    is always a good idea. What happens if PDF->new($aa) fails and doesn't return an object? The IsaPDF method might not be applicable on $pdf in that case.

    Maybe the directory contains only one pdf file? You could check the contents of your @all array with

    print "\@all = (",join(',',map { "'$_'" } @all),")\n";

    before entering the loop.

    Did your script produce any output? If so, post it. And why did you undef $/ ?
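
    A minimal sketch of the same directory scan under strict and warnings, assuming the directory comes in as the first argument (defaulting to '.'). The helper name list_pdfs is made up for illustration; the pattern is anchored so that only names ending in .pdf match.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: return the sorted names of all *.pdf entries in $dir.
    sub list_pdfs {
        my ($dir) = @_;
        opendir(my $dh, $dir) or die "Cannot open directory '$dir': $!\n";
        # Anchor the pattern so names like "notes.pdf.bak" are not picked up.
        my @names = sort grep { /\.pdf\z/i } readdir($dh);
        closedir($dh);
        return @names;
    }

    my $path = @ARGV ? $ARGV[0] : '.';
    my @all  = list_pdfs($path);
    print "\@all = (", join(',', map { "'$_'" } @all), ")\n";
    ```

    With every variable declared via my, a typo in a variable name becomes a compile-time error instead of a silent empty value.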

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: I have a problem in finding corrupted PDF files
by almut (Canon) on Dec 10, 2007 at 15:05 UTC

    Your usage is basically correct, but there are several things to note. The filenames in @all (as returned by readdir) won't include any directory components of the path, so unless your $ARGV[0] is always '.', you probably want

    my $pdf = PDF->new("$path/$aa");

    Next, and more importantly, PDF->new() does die in some cases, which will make the entire script terminate... To work around this, you need to eval { ... } that call:

    foreach my $aa (@all) {
        my $pdf;
        eval { $pdf = PDF->new("$path/$aa") };
        if (!$@ && $pdf->IsaPDF) {
            print "$aa is a pdf file\n"
        }
        else {
            print "$aa is corrupted\n";
        }
    }

    Lastly, I'm not sure at all if the very simple method used to identify PDF files, i.e.

    # from PDF/Parse.pm
    sub IsaPDF {
        return ($_[0]->{Header} != undef) ;
    }

    with Header being extracted as the version component of the basic PDF header (e.g. "%PDF-1.6" --> "1.6"), would actually identify more subtle corruption in the file. So, unless the problem is severe enough to make the parser die, the error would likely go unnoticed...
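
    To illustrate how little a header-only test sees, here is a rough sketch in core Perl that checks two structural markers: the "%PDF-x.y" header at the start of the file and an "%%EOF" marker near the end. The function name looks_like_pdf and the 1024-byte tail window are assumptions chosen for illustration; neither comes from the PDF module. A file that passes both checks can still be damaged in between, so treat this as a cheap first filter only.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: cheap structural check, not a real PDF parser.
    # Returns 1 if the file starts with a "%PDF-x.y" header and has an
    # "%%EOF" marker within its last 1024 bytes, 0 otherwise.
    sub looks_like_pdf {
        my ($file) = @_;
        open(my $fh, '<:raw', $file) or return 0;

        # Read the first few bytes and look for the version header.
        my ($head, $tail) = ('', '');
        my $read = read($fh, $head, 16);
        return 0 unless defined $read && $head =~ /\A%PDF-\d\.\d/;

        # Read the tail of the file and look for the end-of-file marker.
        my $size     = -s $fh;
        my $tail_len = $size < 1024 ? $size : 1024;
        seek($fh, -$tail_len, 2) or return 0;    # 2 = SEEK_END
        $read = read($fh, $tail, $tail_len);
        close($fh);
        return 0 unless defined $read;
        return $tail =~ /%%EOF/ ? 1 : 0;
    }
    ```

    A truncated download, for example, usually keeps its header but loses the trailing "%%EOF", so it fails the second test even though a header-only check would accept it.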

      Thanks for your reply, but I'm still getting the following errors:

      1) "Bad object reference" in the line with ("$path/$aa"). 2) If the PDF file is corrupted, it shows the error "Can't read cross-reference section, according to trailer" and terminates the program.

      I want to get a list of the corrupted PDF files.

        If the PDF file is corrupted, it shows the error "Can't read cross-reference section, according to trailer" and terminates the program.

        This is exactly one of the cases I had in mind when I said "PDF->new() does die in some cases". As this is a simple, straightforward die message in PDF/Parse.pm (line 82),

        sub ReadCrossReference_pass1 {
            ...
            $_=PDF::Core::PDFGetline ($fd,\$offset);
            die "Can't read cross-reference section, according to trailer\n" if +! /xref\r?\n?/ ;
            ...
        }

        I'm pretty sure (actually, I've tried it) that it will be caught, if you wrap the call in an eval block with curly braces, as shown in my previous reply. That's exactly what that eval BLOCK form (Perl's exception handling mechanism) is for.

        If you still can't keep the script running in that case, please post the exact code you're using.
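
        For reference, a minimal sketch of the eval BLOCK pattern that collects the failing names into a list rather than printing them one by one. The helper find_corrupted and its callback argument are hypothetical; the callback stands in for whatever check you use, for example sub { PDF->new($_[0])->IsaPDF } with the module from this thread. The demo check below is a stub so the example runs without any PDF files.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: run $check on every name in @$files and return
        # the names for which $check died or returned a false value.
        sub find_corrupted {
            my ($check, $files) = @_;
            my @bad;
            for my $file (@$files) {
                # eval BLOCK traps any die inside $check; on a die it returns
                # undef and the loop simply moves on to the next file.
                my $ok = eval { $check->($file) };
                push @bad, $file unless $ok;
            }
            return @bad;
        }

        # Stand-in check for demonstration: dies for one name, fails for another.
        my $demo_check = sub {
            my ($file) = @_;
            die "Can't read cross-reference section, according to trailer\n"
                if $file eq 'broken.pdf';
            return $file ne 'empty.pdf';
        };

        my @bad = find_corrupted($demo_check, [ 'good.pdf', 'broken.pdf', 'empty.pdf' ]);
        print "corrupted: $_\n" for @bad;
        ```

        Because the die happens inside the eval, the loop reaches 'empty.pdf' even though 'broken.pdf' raised an exception before it.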

        Hi: You can try a popular PDF file recovery tool called Advanced PDF Repair to repair your PDF file. It is a powerful tool to repair corrupt or damaged PDF files. Detailed information about Advanced PDF Repair can be found at http://www.datanumen.com/apdfr/ and you can also download a free demo version at http://www.datanumen.com/apdfr/apdfr.exe

        Alan