Re^5: blank pdf generated using PDF::API2 (Updated)

Re^6: blank pdf generated using PDF::API2 (Updated) by lennelei (Acolyte) on Jul 21, 2017 at 11:25 UTC
First of all, I'm sorry if not everything is clear: English is not my native language.

If you want the full story, here it is (you can skip this part :). We (my company) have a specific OCR software to handle PDF bills. I wrote a Perl script that extracts PDF files from emails or retrieves them from our MFT software, then renames them for normalization. As I learned that our OCR software doesn't work well with big PDFs, I tried to add some code to the script to check whether a file has more than 100 pages and, in that case, keep only the first 100 pages (and drop the rest). As I didn't want to bother you with all the details, I only kept the part that cuts the PDF in my first post.

To summarize: for any PDF, I need to keep at most the first 100 pages (if the PDF is 15 pages, I leave it untouched; if it's 654 pages, I create a new PDF containing pages 1 to 100 inclusive).

---- End of the story ----

Once again, my script is working (99.9% of the time): my problem is not how to write it, but why it fails for one (only one) PDF and what I can do about it (if I can do anything)!

I didn't try the script against the "Modern Perl" file because unfortunately I don't have it (yet), but I have a lot of 100+ page PDFs (up to 600 pages) and all but one are processed correctly by my script.

I would like to provide you with this specific PDF, which probably has something that prevents PDF::API2 from processing it correctly, but I cannot as it contains customer information (I'm looking for a way to obfuscate the content).

What's strange is that I managed to extract 100 pages from this specific PDF using sejda or CAM::PDF and the `extractPages` method. But with PDF::API2, it's not working.
Re^7: blank pdf generated using PDF::API2 (Updated) by poj (Abbot) on Jul 21, 2017 at 11:37 UTC
Try this program against your problem PDF. What version of PDF::API2 do you have?

```perl
#!/usr/bin/perl
use strict;
use warnings;
use PDF::API2;

my $file = 'some.pdf';
my $pdf = PDF::API2->open($file);
my $pages = $pdf->pages();
printf "PDF Version : %s\n", $pdf->version();
printf "Pages : %s\n", $pdf->pages();
for my $n (1..$pages) {
    my $page = $pdf->openpage($n);
    printf "Page %3d Media %5.2f %5.2f %5.2f %5.2f\n", $n, $page->get_mediabox;
}
```

poj
Re^8: blank pdf generated using PDF::API2 (Updated) by lennelei (Acolyte) on Jul 21, 2017 at 13:26 UTC
Hi,

Before giving you the answers, I may have an idea: it seems that the problematic PDF is encrypted. I think that PDF::API2 doesn't work because it tries to copy some sort of raw content from an encrypted PDF into a non-encrypted file (which produces blank pages because the data is incorrect). CAM::PDF might work because it starts with the original PDF and then removes the unwanted pages, leaving the file encrypted (I presume that sejda either does the same or first decrypts the content before copying it).

The PDF-API2 folder in my Strawberry Perl installation gives 2.033: `D:\Perl\cpan\build\PDF-API2-2.033-ze3hij\lib\PDF\API2.pm`

Here is the result of your script with just a printf added on line 11 to check the encryption (`printf "isEncrypted : %s\n",$pdf->isEncrypted();`):

```
PDF Version : 1.3
Pages : 540
isEncrypted : 1
Page   1 Media  0.00  0.00 595.00 864.00
Page   2 Media  0.00  0.00 595.00 864.00
Page   3 Media  0.00  0.00 595.00 864.00
Page   4 Media  0.00  0.00 595.00 864.00
Page   5 Media  0.00  0.00 595.00 864.00
Page   6 Media  0.00  0.00 595.00 864.00
Page   7 Media  0.00  0.00 595.00 864.00
Page   8 Media  0.00  0.00 595.00 864.00
Page   9 Media  0.00  0.00 595.00 864.00
Page  10 Media  0.00  0.00 595.00 864.00
Page  11 Media  0.00  0.00 595.00 864.00
Page  12 Media  0.00  0.00 595.00 864.00
```

...and so on until page 540 (the values are exactly the same).
Re^9: blank pdf generated using PDF::API2 (Updated) by poj (Abbot) on Jul 21, 2017 at 15:38 UTC
Re^7: blank pdf generated using PDF::API2 (Updated) by thanos1983 (Parson) on Jul 21, 2017 at 12:19 UTC
Hello again lennelei,

This works as expected, based on your last update (*To summarize: for any PDF, I need to keep at most the first 100 pages (if the PDF is 15 pages, I leave it untouched; if it's 654 pages, I create a new PDF containing pages 1 to 100 inclusive).*). It creates a new 100-page PDF if the original has more than 100 pages.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use PDF::API2;

my $file = 'test.pdf';
my $newpdf = PDF::API2->new();
my $oldpdf = PDF::API2->open($file);

# Only build a truncated copy when the source has more than 100 pages.
if ($oldpdf->pages() > 100) {
    printf " (%d pages)\n", $oldpdf->pages();
    for my $page_nb (1..100) {
        # Copy source page N to position N of the new document.
        $newpdf->importpage($oldpdf, $page_nb, $page_nb);
    }
    $newpdf->saveas("_".$file);
}
```

Hope this helps, BR.

Seeking for Perl wisdom...on the process of learning...not there...yet!
Re^8: blank pdf generated using PDF::API2 (Updated) by lennelei (Acolyte) on Jul 21, 2017 at 13:30 UTC
Hi,

I know this works as expected :s but not for one given file! After lots of testing, I think it might be because the problem PDF is password protected (probably against modification, since no password is requested to open the file).
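If the encryption hypothesis is right, one option would be to check the flag before importing pages and leave encrypted files to another tool. The following is only a minimal sketch under that assumption: `plan_for` and `truncate_pdf` are hypothetical helper names, the 100-page limit comes from the description above, and `isEncrypted` is the method used earlier in this thread with PDF::API2 2.033 (newer releases may name it differently). It has not been run against the problem file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: decide what to do with a PDF from its page count
# and encryption flag. Returns 'keep', 'skip-encrypted' or 'truncate'.
sub plan_for {
    my ($page_count, $is_encrypted, $max_pages) = @_;
    return 'keep'           if $page_count <= $max_pages;
    return 'skip-encrypted' if $is_encrypted;
    return 'truncate';
}

# Hypothetical wrapper around the importpage loop posted earlier in the thread.
sub truncate_pdf {
    my ($file, $max_pages) = @_;
    require PDF::API2;    # loaded lazily so plan_for() works without the module

    my $old  = PDF::API2->open($file);
    my $plan = plan_for($old->pages(), $old->isEncrypted(), $max_pages);

    if ($plan eq 'skip-encrypted') {
        # Assumption: pages imported from an encrypted source come out blank,
        # so the file is left untouched for another tool to handle.
        warn "$file is encrypted; not truncating it with PDF::API2\n";
    }
    elsif ($plan eq 'truncate') {
        my $new = PDF::API2->new();
        $new->importpage($old, $_, $_) for 1 .. $max_pages;
        $new->saveas("_$file");    # same naming as the earlier script
    }
    return $plan;
}

# Process every file named on the command line; print the decision taken.
print truncate_pdf($_, 100), " $_\n" for @ARGV;
```

Run as `perl truncate.pl bill1.pdf bill2.pdf` (the script and file names are placeholders). Files flagged `skip-encrypted` could then be passed to whichever tool already handles them, such as the CAM::PDF or sejda route mentioned above.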
Re^9: blank pdf generated using PDF::API2 (Updated) by thanos1983 (Parson) on Jul 21, 2017 at 14:11 UTC
Re^10: blank pdf generated using PDF::API2 (Updated) by lennelei (Acolyte) on Jul 21, 2017 at 14:42 UTC
Some notes below your chosen depth have not been shown here
Re^7: blank pdf generated using PDF::API2 (Updated) by AnomalousMonk (Archbishop) on Jul 21, 2017 at 11:59 UTC
*... I have a lot of 100+ page PDFs (up to 600 pages) and all but one are processed correctly by my script.*

Ok, I understand better now. I had thought you were having problems with 100+ page PDFs in general.

Give a man a fish: `<%-{-{-{-<`
Re^8: blank pdf generated using PDF::API2 (Updated) by lennelei (Acolyte) on Jul 21, 2017 at 14:01 UTC
No, only with one file :( Thank you all for your messages and your patience; it's sometimes hard to explain problems online :)