in reply to Re^4: blank pdf generated using PDF::API2 (Updated)
in thread blank pdf generated using PDF::API2

I'd like to provide the pdf file ... I don't really know how ...

How about going the other way? What happens when you run your code (or, indeed, the other monks' code) against some 100+ page document they seem to be having success with, e.g., Modern Perl?


Give a man a fish:  <%-{-{-{-<

Re^6: blank pdf generated using PDF::API2 (Updated)
by lennelei (Acolyte) on Jul 21, 2017 at 11:25 UTC

    First of all, I'm sorry if everything is not clear: English is not my native language.

    If you want the full story, here it is (you can skip this part :). We (my company) use specific OCR software to process PDF bills. I wrote a Perl script that extracts PDF files from emails, or retrieves them from our MFT software, and renames them for normalization. When I learned that our OCR software doesn't work well with big PDFs, I tried adding some code to the script to check whether a file has more than 100 pages and, in that case, keep only the first 100 pages (and drop the rest).

    As I didn't want to bother you with all the details, I only kept the part that cuts the PDF in my first post.

    To summarize: for any PDF, I need to keep at most the first 100 pages (if the PDF has 15 pages, I leave it untouched; if it has 654 pages, I create a new PDF containing pages 1 to 100 inclusive).

    ---- End of the story ----

    Once again, my script works (99.9% of the time): my problem is not how to write it, but why it fails for one (only one) PDF and what I can do about it (if I can do anything)!

    I didn't try the script against the "Modern Perl" file because, unfortunately, I don't have it (yet), but I have lots of 100+ page PDFs (up to 600 pages) and all of them (but one) are processed correctly by my script.

    I would like to give you this specific PDF, which probably contains something that prevents PDF::API2 from processing it correctly, but I cannot because it contains customer information (I'm looking for a way to obfuscate the content).

    What's strange is that I managed to extract 100 pages from this specific pdf using sejda or CAM::PDF and the extractPages method.

    But with PDF::API2, it's not working.
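
    For reference, the CAM::PDF route mentioned above can be sketched roughly as follows. This is a minimal sketch, not the exact script that was used: the helper names range_to_keep and truncate_with_cam_pdf are hypothetical, and it assumes that extractPages accepts a range string such as '1-100' and that cleanoutput writes the result, as described in the CAM::PDF documentation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a page-range string for the pages to keep, or undef when
# the document is short enough to be left untouched.
sub range_to_keep {
    my ($total, $max) = @_;
    return $total > $max ? "1-$max" : undef;
}

# Hypothetical helper: write the first $max pages of $in to $out.
# Returns 1 if a truncated copy was written, 0 if nothing was done.
sub truncate_with_cam_pdf {
    my ($in, $out, $max) = @_;
    no warnings 'once';    # $CAM::PDF::errstr is only mentioned once here
    require CAM::PDF;      # loaded lazily; range_to_keep needs no module
    my $doc = CAM::PDF->new($in)
        or die "Cannot open $in: $CAM::PDF::errstr\n";
    my $range = range_to_keep($doc->numPages(), $max);
    return 0 unless defined $range;
    $doc->extractPages($range);    # removes every page outside the range
    $doc->cleanoutput($out);       # writes the modified document
    return 1;
}

# Usage: perl truncate.pl input.pdf output.pdf
if (@ARGV == 2) {
    my ($in, $out) = @ARGV;
    my $done = truncate_with_cam_pdf($in, $out, 100);
    print $done ? "Wrote $out\n" : "$in left untouched\n";
}
```

    Because this approach removes pages from the original document instead of importing them into a new one, it does not have to copy page content between files.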

      Try this program against your problem PDF. Which version of PDF::API2 do you have?

      #!/usr/bin/perl
      use strict;
      use warnings;
      use PDF::API2;

      my $file = 'some.pdf';
      my $pdf = PDF::API2->open($file);
      my $pages = $pdf->pages();
      printf "PDF Version : %s\n",$pdf->version();
      printf "Pages : %s\n",$pdf->pages();
      for my $n (1..$pages){
        my $page = $pdf->openpage($n);
        printf "Page %3d Media %5.2f %5.2f %5.2f %5.2f\n",$n,$page->get_mediabox;
      }
      poj

        Hi,

        Before giving you the answers, I may have an idea: it seems that the problematic PDF is encrypted. I think PDF::API2 fails because it tries to copy some sort of raw content from an encrypted PDF into a non-encrypted file (which produces blank pages because the data is incorrect). CAM::PDF might work because it starts from the original PDF and then removes the unwanted pages, leaving the file encrypted (I presume that sejda either does the same or first decrypts the content before copying it).

        The PDF-API2 folder in my Strawberry Perl installation indicates version 2.033:

        D:\Perl\cpan\build\PDF-API2-2.033-ze3hij\lib\PDF\API2.pm

        Here is the result of your script with just one printf added on line 11 to check the encryption (printf "isEncrypted : %s\n",$pdf->isEncrypted();):

        PDF Version : 1.3
        Pages : 540
        isEncrypted : 1
        Page   1 Media  0.00  0.00 595.00 864.00
        Page   2 Media  0.00  0.00 595.00 864.00
        Page   3 Media  0.00  0.00 595.00 864.00
        Page   4 Media  0.00  0.00 595.00 864.00
        Page   5 Media  0.00  0.00 595.00 864.00
        Page   6 Media  0.00  0.00 595.00 864.00
        Page   7 Media  0.00  0.00 595.00 864.00
        Page   8 Media  0.00  0.00 595.00 864.00
        Page   9 Media  0.00  0.00 595.00 864.00
        Page  10 Media  0.00  0.00 595.00 864.00
        Page  11 Media  0.00  0.00 595.00 864.00
        Page  12 Media  0.00  0.00 595.00 864.00
        ...and so on until page 540 (values are exactly the same)

      Hello again lennelei,

      This works as expected, based on your last update (To summarize: for any PDF, I need to keep at most the first 100 pages (if the PDF has 15 pages, I leave it untouched; if it has 654 pages, I create a new PDF containing pages 1 to 100 inclusive).).

      It creates a new 100-page PDF if the source PDF has more than 100 pages.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use PDF::API2;

      my $file='test.pdf';
      my $newpdf = PDF::API2->new();
      my $oldpdf = PDF::API2->open($file);

      if ($oldpdf->pages() > 100) {
          printf " (%d pages)\n", $oldpdf->pages();
          for my $page_nb (1..100) {
              $newpdf->importpage($oldpdf, $page_nb, $page_nb);
          }
          $newpdf->saveas("_".$file);
      }

      Hope this helps, BR.

      Seeking for Perl wisdom...on the process of learning...not there...yet!
        Hi, I know this works as expected :s but not for one particular file! After lots of testing, I think it might be because the problem PDF is password protected (probably against modification, since no password is requested to open the file).
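
        If the encryption hypothesis is right, one way to avoid silently producing blank pages would be to check isEncrypted() before importing anything. The following is only a sketch under that assumption: choose_strategy and truncate_pdf are hypothetical helper names, and what to do in the 'fallback' case (for example, handing the file to another tool) is left open.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decide what to do with a PDF from its page count and encryption flag.
#   'leave'    : short enough, keep the file as it is
#   'fallback' : too long but encrypted, so importing pages is not trusted
#   'import'   : too long and not encrypted, safe to copy pages
sub choose_strategy {
    my ($pages, $encrypted, $max) = @_;
    return 'leave'    if $pages <= $max;
    return 'fallback' if $encrypted;
    return 'import';
}

# Hypothetical helper: copy the first $max pages of $in to $out with
# PDF::API2, but only when the source is not flagged as encrypted.
# Returns the strategy that was chosen.
sub truncate_pdf {
    my ($in, $out, $max) = @_;
    require PDF::API2;    # loaded lazily; choose_strategy needs no module
    my $old      = PDF::API2->open($in);
    my $strategy = choose_strategy($old->pages(),
                                   $old->isEncrypted() ? 1 : 0, $max);
    if ($strategy eq 'import') {
        my $new = PDF::API2->new();
        $new->importpage($old, $_, $_) for 1 .. $max;
        $new->saveas($out);
    }
    return $strategy;
}

# Usage: perl guard.pl input.pdf output.pdf
if (@ARGV == 2) {
    my ($in, $out) = @ARGV;
    my $strategy = truncate_pdf($in, $out, 100);
    warn "$in is encrypted, not truncated with PDF::API2\n"
        if $strategy eq 'fallback';
    print "$in: $strategy\n";
}
```

        With this guard, the one problematic file would be reported instead of being replaced by a blank 100-page document.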
      ... I have lots of 100+ page PDFs (up to 600 pages) and all of them (but one) are processed correctly by my script.

      Ok, I understand better now. I had thought you were having problems with 100+ page PDFs in general.


      Give a man a fish:  <%-{-{-{-<

        No, only with one file :(

        Thank you all for your messages and your patience; it's sometimes hard to explain problems online :)