ImJustAFriend has asked for the wisdom of the Perl Monks concerning the following question:
Good morning, Monks. I'm working on a concept script to replace a KSH script we use (concept = not final var names, debug statements, no real comments, etc.). The script parses an input file that can be up to 5 million lines long. It does two things with that data: it builds an array out of all the values from one field via "split", and it takes a substring of those same values to use as hash keys. What I want it to do next is loop through the hash keys and fork a new process for each one, in parallel, to grep through the array for that value and get a count. My code is below. I am asking for help with three issues:
1.) The forking works, but it seems to get done serially. I would like it to happen in parallel for speed purposes.
2.) I am having issues passing the hash into the sub and/or returning the data back out. I get the following errors every run (multiple instances of each), and my "out" files don't get written successfully:
a. Use of uninitialized value in concatenation (.) or string at ./GetNPANXXCount.pl line 98.
b. Use of uninitialized value in numeric comparison (<=>) at ./GetNPANXXCount.pl line 101.
c. Use of uninitialized value in printf at ./GetNPANXXCount.pl line 102.
3.) It seems like there would be some room for improvement on this code for efficiency to speed things along, but I can't seem to find it. Any suggestions would be very appreciated!!!
Here's my code thus far:
#Sample Line From BIGFILE
#{9999991234ff00aa},9999991234,1,"Y",0,0,{55760FFC56837F3E}

my %minhash = ();
my %npanxxhash = ();
my @npanxxarray;
my $key;
my $value;
my $npanxx;
my $npanxxcnt;
my $in = "BIGFILE.out.gz";
my $out_min = "npanxx_minsort.out";
my $out_cnt = "npanxx_cntsort.out";

open IN, "/bin/gunzip -c $in |" or die "IN: $!\n";
open OUT_MIN, ">", "$out_min" or die "OUT_MIN: $!\n";
open OUT_CNT, ">", "$out_cnt" or die "OUT_CNT: $!\n";

print "Time: " . time . "\n";
print "Processing $in...\n";

while (<IN>) {
    if ( $_ =~ m/^{.*$/ ) {
        #Grab 9999991234 from line above
        my ($a,$MIN,$c,$d,$e,$f) = split( /,/ );
        $minhash{$MIN} = undef;
    }
}
close IN;

print "Time: " . time . "\n";
print "Massaging Data...\n";

while ( ($key, $value) = each(%minhash) ) {
    #Get just 999999 from above
    $npanxx = substr($key, 0, 6);
    push(@npanxxarray, $npanxx);
    $npanxxhash{$npanxx} = undef;
}
undef $key;
undef $value;

print "Time: " . time . "\n";
print "Getting Counts...\n";

foreach $key (sort keys %npanxxhash) {
    &CountAndHash($key,\@npanxxarray,\%npanxxhash);
    # $npanxxcnt = grep (/$key/, @npanxxarray);
    # $npanxxhash{$key} = $npanxxcnt;
}

print "Time: " . time . "\n";
print "Generating Flat Files...\n";

foreach $key (sort keys %npanxxhash) {
    print OUT_MIN "$key $npanxxhash{$key}\n";
}
foreach $key (sort { $npanxxhash{$a} <=> $npanxxhash{$b} } keys %npanxxhash) {
    printf OUT_CNT "%-7s %s\n", $key, $npanxxhash{$key};
}

print "Time: " . time . "\n";
print "Complete...\n";

sub CountAndHash {
    my ($key, $arrayref, $hashref) = @_;
    my %hashref;
    if (!defined(my $pid = fork())) {
        die "Cannot fork to child: $!\n";
    } elsif ($pid == 0) {
        #print "Launching child process...\n";
        $npanxxcnt = grep (/$key/, $arrayref);
        $hashref{$key} = $npanxxcnt;
        exit;
    } else {
        my $ret = waitpid($pid,0);
        print "PID $ret completed...\n";
    }
    return ($npanxxcnt, $hashref);
}
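A note on issues 1 and 2 as they appear in CountAndHash above. The parent calls waitpid immediately after each fork, so it blocks until that child exits and only one child ever runs at a time. Whatever the child stores (in $npanxxcnt or in the new lexical %hashref, which is unrelated to the $hashref argument) lives only in the child's copy of memory and is gone when it exits, which leaves every value in %npanxxhash undef and produces the warnings listed. The grep is also handed the scalar $arrayref rather than the list @$arrayref. The following is only a minimal sketch of one possible way around this, not necessarily what the thread settled on: start every child first, let each report its count through a pipe, and reap afterwards. The helper name count_in_parallel and the six sample values are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): count exact matches
# of each key in @$items, one child process per key, all running at once.
sub count_in_parallel {
    my ($keys, $items) = @_;
    my %reader_for;    # pid => read end of that child's pipe

    for my $key (@$keys) {
        pipe(my $reader, my $writer) or die "Cannot create pipe: $!\n";
        my $pid = fork();
        die "Cannot fork: $!\n" unless defined $pid;

        if ($pid == 0) {
            # Child: count, report through the pipe, and exit.
            close $reader;
            my $count = grep { $_ eq $key } @$items;
            print {$writer} "$key $count\n";
            close $writer;
            exit 0;
        }

        # Parent: keep the read end and move on without waiting.
        close $writer;
        $reader_for{$pid} = $reader;
    }

    # Collect results only after every child has been started.
    my %count_for;
    for my $pid (keys %reader_for) {
        my $fh   = $reader_for{$pid};
        my $line = <$fh>;
        close $fh;
        waitpid($pid, 0);
        next unless defined $line;
        chomp $line;
        my ($key, $count) = split ' ', $line;
        $count_for{$key} = $count;
    }
    return \%count_for;
}

# Illustrative data only.
my @npanxx = qw(999999 999999 555123 999999 555123 212555);
my $counts = count_in_parallel([qw(212555 555123 999999)], \@npanxx);
printf "%-7s %s\n", $_, $counts->{$_} for sort keys %$counts;
```

This sketch starts one process per key with no upper limit, which would not be sensible for thousands of keys. In practice the number of concurrent children would need a cap, for example by forking in batches or by using a CPAN module such as Parallel::ForkManager.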
Thanks in advance for your help!!
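A note on issue 3. The commented-out grep inside the foreach rescans the whole array once per key, so the work grows as (number of keys) x (number of array elements). A hash increment can produce the same counts while the file is being read, with no second pass and no fork. This is a minimal sketch under assumptions: it takes the input as an array of lines shaped like the sample line in the script, the helper name count_npanxx is hypothetical, and all but the first sample line are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): count distinct MINs
# per 6-digit NPANXX prefix in a single pass over the input lines.
sub count_npanxx {
    my ($lines) = @_;
    my (%seen_min, %count_for);
    for my $line (@$lines) {
        next unless $line =~ /^\{/;             # only data lines
        my (undef, $min) = split /,/, $line;    # second field is the MIN
        next unless defined $min;
        next if $seen_min{$min}++;              # count each MIN once
        $count_for{ substr($min, 0, 6) }++;
    }
    return \%count_for;
}

# Illustrative data only; the second line repeats the first on purpose.
my @sample = (
    '{9999991234ff00aa},9999991234,1,"Y",0,0,{55760FFC56837F3E}',
    '{9999991234ff00aa},9999991234,1,"Y",0,0,{55760FFC56837F3E}',
    '{9999995678ff00aa},9999995678,1,"Y",0,0,{55760FFC56837F3E}',
    '{5551230001ff00aa},5551230001,1,"Y",0,0,{55760FFC56837F3E}',
);
my $npanxx_counts = count_npanxx(\@sample);

# Sort by count like the OUT_CNT loop, with the key as a tie-break.
for my $npanxx (
    sort { $npanxx_counts->{$a} <=> $npanxx_counts->{$b} or $a cmp $b }
    keys %$npanxx_counts
) {
    printf "%-7s %s\n", $npanxx, $npanxx_counts->{$npanxx};
}
```

In the real script the same loop body would sit inside while (<IN>), so neither @npanxxarray nor the separate "Massaging Data" pass would be needed.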
UPDATE UPDATE UPDATE 2014-08-09
Thanks to one and all for your assistance. I am going to abandon this question, as I have totally redone my code per aitap's suggestion below, but now I have a question related to the new code that's not pertinent here. Thanks again for the help, monks!!
Replies are listed 'Best First'.

Re: Forking On Foreach Keys In Hash, Passing Hash To Sub, And Speed/Efficiency Recommendations
by aitap (Curate) on Aug 08, 2014 at 15:32 UTC
by ImJustAFriend (Scribe) on Aug 08, 2014 at 16:10 UTC

Re: Forking On Foreach Keys In Hash, Passing Hash To Sub, And Speed/Efficiency Recommendations
by atcroft (Abbot) on Aug 08, 2014 at 14:52 UTC

Re: Forking On Foreach Keys In Hash, Passing Hash To Sub, And Speed/Efficiency Recommendations
by GotToBTru (Prior) on Aug 08, 2014 at 15:20 UTC
by Anonymous Monk on Aug 09, 2014 at 00:20 UTC

Re: Forking On Foreach Keys In Hash, Passing Hash To Sub, And Speed/Efficiency Recommendations
by graff (Chancellor) on Aug 09, 2014 at 15:46 UTC