in reply to Am I on the right track?

I'd raise a couple of concerns: First, are you sure that concatenating thousands of small pdfs into a few (presumably) huge pdfs is really going to give you whatever benefit you're hoping for? Why are you doing this in the first place?

(If I were trying to reduce the quantity of files, and/or organize the files by relative size, I'd sort them into a few zip archive files. Maybe combined pdfs are very handy and flexible and quick to open and search through with easy random access - I don't know - but I know that this is true for zip files.)

Second, it looks like your approach will be doing a lot of closing and reopening of those few large output files, and I'd worry that this might lead to a lot of thrashing, especially as the output files get bigger and you still have thousands more input files to append (and sort?). It would make more sense to scan all the inputs first, use a hash of arrays (or hash of hashes) to build up an overview of the inventory, and then create each of the outputs with a single, tightly nested loop - that is something like:

my %lists_by_size;

# get list of file names into @inp_files

for my $file ( @inp_files ) {
    my $group = get_page_count( $file );
    push @{$lists_by_size{$group}}, $file;
}

for my $group ( keys %lists_by_size ) {
    # open output pdf (or output zip file) for this group
    for my $file ( @{$lists_by_size{$group}} ) {   # use sort here if you like
        # append this file to the output
    }
    # close the output
}
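To make the placeholders a little more concrete, here's a minimal sketch of the zip-archive variant; it assumes Archive::Zip from CPAN and a hypothetical get_page_count() helper (the pdf variant would put your pdf module's append and save calls in the same two spots):

use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES );

my @inp_files = glob('*.pdf');                 # or however you build the input list
my %lists_by_size;

for my $file ( @inp_files ) {
    my $group = get_page_count( $file );       # hypothetical helper, as in the skeleton above
    push @{ $lists_by_size{$group} }, $file;
}

for my $group ( keys %lists_by_size ) {
    my $zip = Archive::Zip->new();             # one archive per size group
    $zip->addFile( $_ ) for sort @{ $lists_by_size{$group} };
    $zip->writeToFileNamed( "group_$group.zip" ) == AZ_OK
        or die "could not write archive for group $group";
}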

Re^2: Am I on the right track?
by Pharazon (Acolyte) on Jul 07, 2015 at 20:09 UTC

    Sorry it has taken me so long to respond. I had an emergency project come in that has taken my attention away from this one for the last little while.

    The reason I am building the large pdf files is that they are going to be passed to a piece of printing software, which requires that only one file be passed to it per print job. We have decided to break things down along the lines I have defined to help with the processing overhead, both on my end and then again in production.

    The way I have the opening/closing structured was meant to minimize it: an output file is only closed and reopened when the page count shows that the current pdf belongs in a different output file. After reading your comment and thinking about it more, though, if I assume the worst case, where the input is as mixed as possible, then I would indeed be doing a large number of opens/closes. However, do you think that performance hit would be greater than the cost of opening/closing each individual pdf twice instead of once, even with the higher number of combined-file opens/closes?

      I'm not sure I understand your question. It looks to me like the OP code opens every input file two times, once to get its page count, and once to append its content to a chosen output file. (Your "get_page_cnt()" sub only returns a group name, not a file handle or pdf content.) My suggestion is no different in that regard.

      Where my suggestion differs is that all the inputs are scanned first, before any output is done, and then there's a nested loop: for each output "group" file, create it, then for each input file in that group, concatenate its content. (If none of the inputs fall into a given group, there's no need to create an output file for that group.)

      Opening and closing each output file exactly once is bound to involve less overhead on the whole, compared to randomly closing and reopening output files (but I have no idea whether the difference will be noticeable in wall-clock time).
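      If you do want a number, one rough way to get it is to wall-clock both versions over the same input set. This sketch assumes two hypothetical subs, run_streaming_version() and run_grouped_version(), each wrapping one complete approach:

      use strict;
      use warnings;
      use Time::HiRes qw( gettimeofday tv_interval );

      for my $case ( [ 'open/close as needed' => \&run_streaming_version ],
                     [ 'scan first, one pass' => \&run_grouped_version  ] ) {
          my ( $label, $code ) = @$case;
          my $t0 = [ gettimeofday ];
          $code->();                            # hypothetical sub wrapping one of the two approaches
          printf "%-22s %.2f seconds\n", $label, tv_interval( $t0 );
      }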

      Another thing to consider is whether you have to worry about an upper bound on the amount of data you can concatenate into one pdf file for a single print job. If so, I think my suggested approach would make it easier to manage that, because you can work out all the arithmetic for partitioning before creating any outputs.
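      As a rough sketch of that partitioning (building on %lists_by_size and the hypothetical get_page_count() helper from the skeleton above, with an assumed per-job page limit):

      my $max_pages_per_job = 10_000;           # assumed limit; use whatever the print software actually allows

      for my $group ( keys %lists_by_size ) {
          my @jobs  = ( [] );                   # page-limited batches of input files for this group
          my $pages = 0;
          for my $file ( @{ $lists_by_size{$group} } ) {
              my $count = get_page_count( $file );   # hypothetical helper from the skeleton above
              if ( $pages + $count > $max_pages_per_job and @{ $jobs[-1] } ) {
                  push @jobs, [];               # current batch is full: start a new output file
                  $pages = 0;
              }
              push @{ $jobs[-1] }, $file;
              $pages += $count;
          }
          # now create one output pdf (one print job) per element of @jobs
      }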

        It's possible that my code isn't doing what I think it is, but I don't think I am opening the input files twice.

        If I'm correct in my thinking (based on my interpretation of the module's documentation), what it does is open an input pdf and check how many pages it has. Then it looks to see whether the output file that is currently open (if there is one) is the one this pdf needs to be added to. If it is, it appends the input to the output and moves on to the next input. If not, it writes out the output that is currently open (clearing it from memory and clearing the stream of data that was being built), then opens the correct output file, appends the input to it, and moves on to the next input.

        If successive input files belong to the same output file, that output stays open and keeps having data added to its stream (the term might not be right) until an input file that goes elsewhere is opened, at which point the output is written. So I should only be opening the input files once; but if every other file were supposed to go in a different output file, then I would be doing a lot of opening and closing of the output files. Realistically, though, large swaths of the data belong in the 1- and 2-page files, with the others being more sporadic.
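        A rough sketch of the control flow I'm describing (with hypothetical open_output_for(), append_pdf(), and write_and_close() helpers standing in for the actual module calls) would be:

        my ( $current_group, $current_out );

        for my $file ( @inp_files ) {
            my $group = get_page_count( $file );      # classify this input by page count
            if ( !defined $current_group or $group ne $current_group ) {
                write_and_close( $current_out ) if $current_out;   # hypothetical: flush the output that's open
                $current_out   = open_output_for( $group );        # hypothetical: open (or reopen) that group's file
                $current_group = $group;
            }
            append_pdf( $current_out, $file );        # hypothetical: add this input to the open output
        }
        write_and_close( $current_out ) if $current_out;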

        I am going to test the hash method you mentioned regardless, but I hope that makes my intentions and code a bit clearer.