Pharazon has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am relatively new to programming in Perl and am learning in a trial-by-fire kind of way, since I'm working directly on projects for work. My current problem is that I have a very large number of PDFs (22,000+) with varying numbers of pages. I need to take those PDFs and combine them into single files split by a few groups. For example, my groups will be one-page, two-page, three-to-four-page, five-to-eight-page, and nine-plus-page documents. I have the start of a solution that seems to work fairly well, but I have no real reference point for how good or bad my solution is other than my own opinion.

I was hoping that you could take a look at the code and see if I'm on the right track, or if there are some egregious errors that need to be addressed. I assume my coding is very childlike, so please let me know if things should be formatted differently than what I have chosen. I am reading a ton of resource material and that's progressing well, but because of the size of these files and the time it's going to take to process them, I was hoping to lean on your expertise to make sure it's not horribly less efficient than it should be. I just wanted to get some guidance before I continued on.

Here is what I have so far. If you have any questions, please let me know, and thanks in advance!

**NOTE** The delete-page subroutine is where I stopped, so it's currently commented out. I have to use it to clean up the blank page that is created when the group files are first made. If I try not adding a page, I get an uninitialized-value error because it makes a blank PDF file.

#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
use PDF::API2;

#Define initial directory
my $dir = "C:\\Users\\user name\\Desktop\\fpb uluro\\";

#Open data folder in directory, pull file names into an array, then close directory
opendir(my $dh, $dir."data2") || die "can't opendir $dir"."data2: $!";
my @files = grep {!/^\./ && -f "$dir"."data2/$_" } readdir($dh);
closedir $dh;

#Create the group files to be used for sorting and appending
create_pdf_files();

my $pdf = "";
my $prev_group = "";
my $count = 0;
while (@files) {
    #Increment counter and keep a cleaner window for the in-progress display
    $count++;
    system $^O eq 'MSWin32' ? 'cls' : 'clear';
    print "Processing Record #$count. $files[0]\n";

    #Grab the next file in the directory and get its page-count group
    my $append_pdf = shift(@files);
    my $group = get_page_cnt($append_pdf);

    #Load initial group file if just starting
    $pdf = CAM::PDF->new($dir."finished\\".$group) if $count == 1;

    #Check to see if the group is the same as the previous one
    #If not, write out the current group file and load the new one
    if ($group ne $prev_group && $prev_group ne "") {
        $pdf->cleanoutput($dir."finished\\".$prev_group);
        $pdf = CAM::PDF->new($dir."finished\\".$group);
    }

    #Remember the current group for the next pass
    $prev_group = $group;

    #Load pdf from directory and append to group file
    my $otherpdf = CAM::PDF->new($dir."data2\\".$append_pdf);
    $pdf->appendPDF($otherpdf);

    #After the last file is appended make sure to output, since there
    #won't be a next group change to trigger the output above
    $pdf->cleanoutput($dir."finished\\".$prev_group) if not @files;
}
delete_blank_page();

sub get_page_cnt {
    #Open the pdf whose pages need counting
    my $pdf_pages = PDF::API2->open($dir."data2\\".$_[0]);

    #Get a page count from the pdf
    my $pdf_page_cnt = $pdf_pages->pages;

    #Close the pdf to remove the structure from memory
    $pdf_pages->end();

    #Use the page count to pick the base pdf; boundaries match the
    #groups described above: 1, 2, 3-4, 5-8, 9+
    my $s = $pdf_page_cnt == 1 ? "one_page.pdf"
          : $pdf_page_cnt == 2 ? "two_page.pdf"
          : $pdf_page_cnt <= 4 ? "three_to_four_page.pdf"
          : $pdf_page_cnt <= 8 ? "five_to_eight_page.pdf"
          :                      "nine_plus_page.pdf"; # default
    return $s;
}

sub create_pdf_files {
    #Create blank pdf files for sorting pdfs based on page count
    for my $name (qw(one_page two_page three_to_four_page
                     five_to_eight_page nine_plus_page)) {
        my $create_pdf = PDF::API2->new(-file => $dir."finished\\$name.pdf");
        $create_pdf->page();
        $create_pdf->saveas();
    }
    return 1;
}

sub delete_blank_page {
    #Remove the placeholder page created by create_pdf_files()
    #for my $name (qw(one_page two_page three_to_four_page
    #                 five_to_eight_page nine_plus_page)) {
    #    my $del_pdf = CAM::PDF->new($dir."finished\\$name.pdf");
    #    $del_pdf->deletePage(1);
    #    $del_pdf->cleanoutput($dir."finished\\$name.pdf");
    #}
    return 1;
}

Re: Am I on the right track?
by AnomalousMonk (Archbishop) on Jun 19, 2015 at 00:31 UTC
    my @files = grep {!/^\./ && -f "$dir"."data2/$_" } readdir($dh);

    A small point: If you're trying to filter out the . and .. directories from what you are reading from the directory handle, be aware that  !/^\./ will also filter out files named something like '.foo'. The idiomatic pattern to use here would be  !/^\.\.?$/ or maybe  m{ \A [.]{1,2} \z }xms instead.
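
    For example (untested), that grep line could become:

    my @files = grep { !/^\.\.?$/ && -f "$dir"."data2/$_" } readdir($dh);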


    Give a man a fish:  <%-(-(-(-<

Re: Am I on the right track?
by graff (Chancellor) on Jun 19, 2015 at 02:29 UTC
    I'd raise a couple of concerns: First, are you sure that concatenating thousands of small pdfs into a few (presumably) huge pdfs is really going to give you whatever benefit you're hoping for? Why are you doing this in the first place?

    (If I were trying to reduce the quantity of files, and/or organize the files by relative size, I'd sort them into a few zip archive files. Maybe combined pdfs are very handy and flexible and quick to open and search through with easy random access - I don't know - but I know that this is true for zip files.)

    Second, it looks like your approach will be doing a lot of closing and reopening of those few large output files, and I'd worry that this might lead to a lot of thrashing, especially as the output files get bigger and you still have thousands more input files to append (and sort?). It would make more sense to scan all the inputs first, use a hash of arrays (or hash of hashes) to build up an overview of the inventory, and then create each of the outputs with a single, tightly nested loop - that is something like:

    my %lists_by_size;
    # get list of file names into @inp_files
    for my $file ( @inp_files ) {
        my $group = get_page_count( $file );
        push @{$lists_by_size{$group}}, $file;
    }
    for my $group ( keys %lists_by_size ) {
        # open output pdf (or output zip file) for this group
        for my $file ( @{$lists_by_size{$group}} ) { # use sort here if you like
            # append this file to the output
        }
        # close the output
    }
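
    To make that concrete, here's an untested sketch of the same two-phase idea using your existing get_page_cnt() and CAM::PDF calls; note that opening the first file of each group as the base document would also sidestep the blank placeholder page you mention:

    my %lists_by_size;
    for my $file ( @files ) {
        push @{ $lists_by_size{ get_page_cnt($file) } }, $file;
    }
    for my $group ( keys %lists_by_size ) {
        my @members = @{ $lists_by_size{$group} };
        # the first file in the group becomes the base document,
        # so no blank starter pdf is ever needed
        my $pdf = CAM::PDF->new( $dir."data2\\".shift(@members) );
        for my $file ( @members ) {
            $pdf->appendPDF( CAM::PDF->new($dir."data2\\".$file) );
        }
        # each output file is written exactly once
        $pdf->cleanoutput( $dir."finished\\".$group );
    }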

      Sorry it has taken me so long to respond. I had an emergency project come in that has taken my attention away from this one for the last little while.

      The reason I am building the large PDF files is that they are going to be passed to a piece of printing software, which requires that only one file be passed to it per print job. We have decided to break things down along the lines I defined to help with the processing overhead, both on my end and again in production.

      The way I have the opening/closing structured was meant to help minimize it by only switching files once the page count determined that the PDF needed to go in a different file, but after reading your comment and thinking about it more, if I assume the worst case and every set is as mixed as possible, then I would indeed be doing a large number of opens/closes. However, do you think the performance hit would be greater than opening/closing the individual PDFs twice as opposed to once, with a higher number of combined-file opens/closes?

        I'm not sure I understand your question. It looks to me like the OP code opens every input file two times, once to get its page count, and once to append its content to a chosen output file. (Your "get_page_cnt()" sub only returns a group name, not a file handle or pdf content.) My suggestion is no different in that regard.

        Where my suggestion differs is that all the inputs are scanned first, before any output is done, and then there's a nested loop: for each output "group" file, create it, then for each input file in that group, concatenate its content. (If none of the inputs fall into a given group, there's no need to create an output file for that group.)

        Opening and closing each output file exactly once is bound to involve less overhead on the whole, compared to randomly closing and reopening output files (but I have no idea whether the difference will be noticeable in wall-clock time).

        Another thing to consider is whether you have to worry about an upper bound on the amount of data you can concatenate into one pdf file for a single print job. If so, I think my suggested approach would make it easier to manage that, because you can work out all the arithmetic for partitioning before creating any outputs.
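
        As an untested sketch of that partitioning arithmetic (the $max_pages limit and the raw_page_count() helper are hypothetical stand-ins), splitting one group's list from %lists_by_size into print-job-sized chunks could look like:

        my $max_pages = 500;                # hypothetical per-job page limit
        my (@chunks, @chunk);
        my $pages_so_far = 0;
        for my $file ( @{ $lists_by_size{$group} } ) {
            my $n = raw_page_count($file);  # hypothetical: returns the raw page count
            # close off the current chunk before it would exceed the limit
            if ( @chunk && $pages_so_far + $n > $max_pages ) {
                push @chunks, [ @chunk ];
                @chunk = ();
                $pages_so_far = 0;
            }
            push @chunk, $file;
            $pages_so_far += $n;
        }
        push @chunks, [ @chunk ] if @chunk; # flush the final partial chunk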

Re: Am I on the right track?
by Laurent_R (Canon) on Jun 18, 2015 at 21:19 UTC
    Hi Pharazon,

    I can't really comment on the way you use the PDF modules, as I don't know them.

    But, overall, I find your Perl code well written, well structured, clean, and sensible, sometimes even idiomatic, and usually in agreement with the good coding practices generally accepted by the Perl community.

    The only thing that I am wondering about (though this may be my lack of knowledge of the PDF modules) is that you don't seem to worry about error checking when you open a PDF file or do other similar OS operations. Again, I don't know these modules, but I would tend to think that you need to check whether opening a file failed and do something about it if so.
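
    For instance (untested, and assuming CAM::PDF->new() returns undef on failure with the reason in $CAM::PDF::errstr, while PDF::API2->open() dies on error; $path stands in for whichever file is being opened), the checks might look like:

    my $doc = CAM::PDF->new($path)
        or die "Can't read '$path': $CAM::PDF::errstr\n";

    my $pages = eval { PDF::API2->open($path) }
        or die "Can't open '$path' with PDF::API2: $@";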

    So, to me, yes, you are on the right track, with the possible exception of error checking.

      I will likely need to add in some amount of error checking, though it should end up being fairly light, because the environment this will run in is highly controlled, so things shouldn't go too crazy. As of right now I fully control the test environment and have been focusing on performance, but I added a reminder to take time after the performance work to flesh out error checking before moving on.

      Thank you for the feedback on the code itself as well. The language is nice and seems very strong, but it writes much differently from what I'm used to, so I'm glad that I'm getting that part fairly correct.