Pharazon has asked for the wisdom of the Perl Monks concerning the following question:
Hello Monks,
I am relatively new to programming in Perl and am learning in a trial-by-fire kind of way, since I'm working directly on projects for work. My current problem is that I have a very large number of PDFs (22,000+) with varying numbers of pages. I need to combine those PDFs into single files, split into a few groups by page count. For example, my groups will be one-page, two-page, three- and four-page, five- through eight-page, and nine-plus-page documents. I have the start of a solution that seems to work fairly well, but I have no real reference point for how good or bad it is other than my own opinion.
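To make the grouping concrete, here is a minimal sketch of the page-count-to-group mapping as a small lookup table. The names `@GROUPS` and `group_for` are placeholders chosen for illustration; the output file names are the ones used in the script further down, and the bounds simply follow the groups listed above.

```perl
use strict;
use warnings;

# Upper page bound (inclusive) and output file for each group, in order.
# Illustrative table: the bounds mirror the groups described in the post.
my @GROUPS = (
    [ 1, 'one_page.pdf' ],
    [ 2, 'two_page.pdf' ],
    [ 4, 'three_to_four_page.pdf' ],
    [ 8, 'five_to_eight_page.pdf' ],
);

# Return the group file name for a given page count.
sub group_for {
    my ($page_count) = @_;
    for my $group (@GROUPS) {
        return $group->[1] if $page_count <= $group->[0];
    }
    return 'nine_plus_page.pdf';    # nine or more pages
}

# Show the mapping for a few sample page counts.
print "$_ -> ", group_for($_), "\n" for 1, 2, 3, 4, 5, 8, 9, 40;
```

Keeping the bounds in one table means a changed group boundary is a one-line edit, and the file names cannot drift away from the limits they describe.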
I was hoping that you could take a look at the code and see if I'm on the right track or if there are some egregious errors that need to be addressed. I assume my coding is very childlike, so please let me know if things should be formatted differently than what I have chosen. I am reading a ton of resource material and that's progressing well, but because of the size of these files and the time that it's going to take to process them, I was hoping to lean on your expertise to make sure it's not horribly less efficient than it should be. I just wanted to get some guidance before I continued on.
Here is what I have so far, if you have any questions please let me know, and thanks in advance!!
**NOTE** The delete-page subroutine is where I stopped, so it's currently commented out. I have to use it to clean up the blank page that is created by making the files initially. I tried not adding a page, but I get a non-initialized error if it makes a blank PDF file.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use CAM::PDF;
    use PDF::API2;

    # Define initial directory
    my $dir = "C:\\Users\\user name\\Desktop\\fpb uluro\\";

    # Open data folder in directory, pull file names into an array, then close directory
    opendir(my $dh, $dir."data2") || die "can't opendir ${dir}data2: $!";
    my @files = grep { !/^\./ && -f $dir."data2/$_" } readdir($dh);
    closedir $dh;

    # Create the group files to be used for sorting and appending
    create_pdf_files();

    my $pdf        = "";
    my $prev_group = "";
    my $count      = 0;

    while (@files) {
        # Increment counter and keep cleaner window for in progress display
        $count++;
        system $^O eq 'MSWin32' ? 'cls' : 'clear';
        print "Processing Record #$count. $files[0]\n";

        # Grab the next file in the directory and get page count
        my $append_pdf = shift(@files);
        my $group      = get_page_cnt($append_pdf);

        # Load initial group file if just starting
        $pdf = CAM::PDF->new($dir."finished\\".$group) if $count == 1;

        # Check to see if page count is the same as previous for sorting
        # If not update current group file and load new one
        if ($group ne $prev_group && $prev_group ne "") {
            $pdf->cleanoutput($dir."finished\\".$prev_group);
            $pdf = CAM::PDF->new($dir."finished\\".$group);
        }

        # Update group file for sorting
        $prev_group = $group;

        # Load pdf from directory and append to group file
        my $otherpdf = CAM::PDF->new($dir."data2\\".$append_pdf);
        $pdf->appendPDF($otherpdf);

        # After the last file is appended make sure to output since there
        # won't be a next group to trigger the output above
        $pdf->cleanoutput($dir."finished\\".$prev_group) if not @files;
    }

    delete_blank_page();

    sub get_page_cnt {
        # Open the pdf to be added to the base pdf
        my $pdf_pages = PDF::API2->open($dir."data2\\".$_[0]);

        # Get a page count from the pdf
        my $pdf_page_cnt = $pdf_pages->pages;

        # Close the pdf to remove the structure from memory
        $pdf_pages->end();

        # Use the page count of the pdf to determine which base pdf to use.
        # Bounds match the file names: 1, 2, 3-4, 5-8, 9+
        # (they were < 4, < 8, < 15, which did not match the names).
        my $s = $pdf_page_cnt == 1 ? "one_page.pdf"
              : $pdf_page_cnt <  3 ? "two_page.pdf"
              : $pdf_page_cnt <  5 ? "three_to_four_page.pdf"
              : $pdf_page_cnt <  9 ? "five_to_eight_page.pdf"
              :                      "nine_plus_page.pdf";    # default

        return $s;
    }

    sub create_pdf_files {
        # Create blank pdf files for sorting pdfs based on page count
        my $create_pdf = PDF::API2->new(-file => $dir."finished\\".'one_page.pdf');
        $create_pdf->page();
        $create_pdf->saveas();

        $create_pdf = PDF::API2->new(-file => $dir."finished\\".'two_page.pdf');
        $create_pdf->page();
        $create_pdf->saveas();

        $create_pdf = PDF::API2->new(-file => $dir."finished\\".'three_to_four_page.pdf');
        $create_pdf->page();
        $create_pdf->saveas();

        $create_pdf = PDF::API2->new(-file => $dir."finished\\".'five_to_eight_page.pdf');
        $create_pdf->page();
        $create_pdf->saveas();

        $create_pdf = PDF::API2->new(-file => $dir."finished\\".'nine_plus_page.pdf');
        $create_pdf->page();
        $create_pdf->saveas();

        return 1;
    }

    sub delete_blank_page {
        #my $del_pdf = CAM::PDF->new($dir."finished\\one_page.pdf");
        #$del_pdf->deletePage(1);
        #$del_pdf->cleanoutput($dir."finished\\one_page.pdf");

        #$del_pdf = CAM::PDF->new($dir."finished\\two_page.pdf");
        #$del_pdf->deletePage(1);

        #$del_pdf = CAM::PDF->new($dir."finished\\three_to_four_page.pdf");
        #$del_pdf->deletePage(1);

        #$del_pdf = CAM::PDF->new($dir."finished\\five_to_eight_page.pdf");
        #$del_pdf->deletePage(1);

        #$del_pdf = CAM::PDF->new($dir."finished\\nine_plus_page.pdf");
        #$del_pdf->deletePage(1);

        return 1;
    }
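The loop above writes out the current group file and reloads another one whenever two consecutive input files fall in different groups, so with files in directory order the growing group PDFs may be rewritten many times. One possible alternative, sketched here under assumptions, is to bucket the file names by group first and then build each group file in a single pass, seeding it with its first member so that no blank starter page is needed. The helpers `bucket_by_group` and `merge_group` are hypothetical names, the demo uses made-up file names and page counts instead of real PDFs, and the actual merge call is left commented out for that reason.

```perl
use strict;
use warnings;

# Hypothetical helper: bucket file names by group name.
# $group_of is a code ref that returns the group for one file name
# (in the script above, get_page_cnt would fill that role).
sub bucket_by_group {
    my ($group_of, @files) = @_;
    my %bucket;
    push @{ $bucket{ $group_of->($_) } }, $_ for @files;
    return \%bucket;
}

# Hypothetical helper: merge @sources into $out_file, starting from the
# first source instead of an empty one-page document.
sub merge_group {
    my ($out_file, @sources) = @_;
    return 0 unless @sources;    # empty group: write nothing

    require CAM::PDF;            # loaded only when there is work to do
    my $merged = CAM::PDF->new(shift @sources)
        or die "cannot open first PDF: $CAM::PDF::errstr";

    for my $file (@sources) {
        my $next = CAM::PDF->new($file)
            or die "cannot open $file: $CAM::PDF::errstr";
        $merged->appendPDF($next);
    }
    $merged->cleanoutput($out_file);    # each group file is written once
    return 1;
}

# Demo with made-up names and page counts (assumption: no real PDFs here).
my %pages = ('a.pdf' => 1, 'b.pdf' => 3, 'c.pdf' => 1, 'd.pdf' => 12);
my $buckets = bucket_by_group(
    sub { $pages{ $_[0] } == 1 ? 'one_page.pdf' : 'other.pdf' },
    sort keys %pages,
);

for my $group (sort keys %$buckets) {
    print "$group: @{ $buckets->{$group} }\n";
    # With real files this would be something like:
    # merge_group("finished/$group", map { "data2/$_" } @{ $buckets->{$group} });
}
```

The trade-off is that a whole group is held in memory until it is written, which may matter for the largest groups; splitting a bucket into several output files would be one way to bound that.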
Replies are listed 'Best First'.
Re: Am I on the right track?
by AnomalousMonk (Archbishop) on Jun 19, 2015 at 00:31 UTC

Re: Am I on the right track?
by graff (Chancellor) on Jun 19, 2015 at 02:29 UTC
by Pharazon (Acolyte) on Jul 07, 2015 at 20:09 UTC
by graff (Chancellor) on Jul 12, 2015 at 19:23 UTC
by Pharazon (Acolyte) on Jul 13, 2015 at 17:08 UTC
by graff (Chancellor) on Jul 18, 2015 at 01:28 UTC

Re: Am I on the right track?
by Laurent_R (Canon) on Jun 18, 2015 at 21:19 UTC
by Pharazon (Acolyte) on Jul 07, 2015 at 20:14 UTC