Good morning, Monks. I'm working on a concept script to replace a KSH script we use (concept = var names not final, debug statements left in, no real comments, etc.). The script parses an input file that can be up to 5 million lines long and does two things with the data: it builds an array from the values in one field (via "split"), and it takes a substring of those same values to use as hash keys. I then want it to loop through the hash keys and fork a new process for each one, in parallel, to grep through the array for that value and get a count. My code is below. I'm asking for help with three issues:

1.) The forking works, but it seems to get done serially. I would like it to happen in parallel for speed purposes.
2.) I am having issues passing the hash into the sub and/or returning the data back out. I get the following errors every run (multiple instances of each), and my "out" files don't get written successfully:
a. Use of uninitialized value in concatenation (.) or string at ./GetNPANXXCount.pl line 98.
b. Use of uninitialized value in numeric comparison (<=>) at ./GetNPANXXCount.pl line 101.
c. Use of uninitialized value in printf at ./GetNPANXXCount.pl line 102.
3.) There seems to be room to make this code more efficient and speed things along, but I can't find it. Any suggestions would be very appreciated!!!

Here's my code thus far:

#Sample Line From BIGFILE
#{9999991234ff00aa},9999991234,1,"Y",0,0,{55760FFC56837F3E}

my %minhash = ();
my %npanxxhash = ();
my @npanxxarray;
my $key;
my $value;
my $npanxx;
my $npanxxcnt;
my $in = "BIGFILE.out.gz";
my $out_min = "npanxx_minsort.out";
my $out_cnt = "npanxx_cntsort.out";

open IN, "/bin/gunzip -c $in |" or die "IN: $!\n";
open OUT_MIN, ">", "$out_min" or die "OUT_MIN: $!\n";
open OUT_CNT, ">", "$out_cnt" or die "OUT_CNT: $!\n";

print "Time: " . time . "\n";
print "Processing $in...\n";
while (<IN>) {
    if ( $_ =~ m/^{.*$/ ) {
        #Grab 9999991234 from line above
        my ($a,$MIN,$c,$d,$e,$f) = split( /,/ );
        $minhash{$MIN} = undef;
    }
}
close IN;

print "Time: " . time . "\n";
print "Massaging Data...\n";
while ( ($key, $value) = each(%minhash) ) {
    #Get just 999999 from above
    $npanxx = substr($key, 0, 6);
    push(@npanxxarray, $npanxx);
    $npanxxhash{$npanxx} = undef;
}
undef $key;
undef $value;

print "Time: " . time . "\n";
print "Getting Counts...\n";
foreach $key (sort keys %npanxxhash) {
    &CountAndHash($key,\@npanxxarray,\%npanxxhash);
    # $npanxxcnt = grep (/$key/, @npanxxarray);
    # $npanxxhash{$key} = $npanxxcnt;
}

print "Time: " . time . "\n";
print "Generating Flat Files...\n";
foreach $key (sort keys %npanxxhash) {
    print OUT_MIN "$key $npanxxhash{$key}\n";
}
foreach $key (sort { $npanxxhash{$a} <=> $npanxxhash{$b} } keys %npanxxhash) {
    printf OUT_CNT "%-7s %s\n", $key, $npanxxhash{$key};
}

print "Time: " . time . "\n";
print "Complete...\n";

sub CountAndHash {
    my ($key, $arrayref, $hashref) = @_;
    my %hashref;
    if (!defined(my $pid = fork())) {
        die "Cannot fork to child: $!\n";
    } elsif ($pid == 0) {
        #print "Launching child process...\n";
        $npanxxcnt = grep (/$key/, $arrayref);
        $hashref{$key} = $npanxxcnt;
        exit;
    } else {
        my $ret = waitpid($pid,0);
        print "PID $ret completed...\n";
    }
    return ($npanxxcnt, $hashref);
}
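A few things in CountAndHash would explain issues 1 and 2. The parent calls waitpid right after each fork, so only one child is ever running at a time. The grep runs over $arrayref (a single reference) rather than @$arrayref. "my %hashref" declares a brand-new hash that has nothing to do with $hashref. And a forked child works on its own copy of memory, so nothing it assigns ever reaches the parent's %npanxxhash. Below is a minimal sketch of one way around this, assuming a pipe per child is acceptable: fork all the children first, have each one print its count down its pipe, and only then read the results and reap. The sample data and the name %reader_for are made up for illustration; they are not from the original script.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the post's @npanxxarray and its hash keys.
my @npanxxarray = qw(999999 999999 888888 777777 888888 999999);
my @keys        = qw(777777 888888 999999);

# Fork every child before waiting on any of them, so they run at the
# same time. A child cannot change the parent's hash, so it writes its
# count down a pipe instead.
my %reader_for;
for my $key (@keys) {
    pipe(my $reader, my $writer) or die "Cannot create pipe: $!\n";
    my $pid = fork();
    die "Cannot fork to child: $!\n" unless defined $pid;

    if ($pid == 0) {                  # child
        close $reader;
        # grep over the array itself; in scalar context it returns a count.
        my $count = grep { $_ eq $key } @npanxxarray;
        print {$writer} "$count\n";
        close $writer;
        exit 0;
    }

    close $writer;                    # parent keeps only the read end
    $reader_for{$key} = $reader;
}

# Only now collect the results, then reap the children.
my %npanxxhash;
for my $key (@keys) {
    my $fh = $reader_for{$key};
    chomp(my $count = <$fh>);
    close $fh;
    $npanxxhash{$key} = $count;
}
1 while wait() != -1;

print "$_ $npanxxhash{$_}\n" for sort keys %npanxxhash;
```

This forks one process per key, as the original does. With a large number of distinct keys that may be more processes than the machine handles comfortably, so a real version would probably want to cap how many children run at once.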

Thanks in advance for your help!!
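On issue 3, the grep-per-key approach rescans the whole array once for every distinct key. The same counts can be had in a single pass by incrementing a hash entry per value, which may make forking unnecessary. A minimal sketch, assuming the unique MINs are already in hand (the @mins sample below is made up; in the script these would be the keys of %minhash):

```perl
use strict;
use warnings;

# Hypothetical sample of unique MINs (the keys of %minhash in the post).
my @mins = qw(9999991234 9999995678 8888881234 7777770000 8888884321 9999990000);

# One pass: bump the counter for each MIN's first six digits.
my %npanxxhash;
$npanxxhash{ substr($_, 0, 6) }++ for @mins;

# Sorted by NPANXX, as for npanxx_minsort.out.
print "$_ $npanxxhash{$_}\n" for sort keys %npanxxhash;

# Sorted by count, as for npanxx_cntsort.out.
for my $key (sort { $npanxxhash{$a} <=> $npanxxhash{$b} } keys %npanxxhash) {
    printf "%-7s %s\n", $key, $npanxxhash{$key};
}
```

With this shape the separate @npanxxarray is no longer needed, since the hash holds both the keys and their counts.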

UPDATE UPDATE UPDATE 2014-08-09

Thanks to one and all for your assistance. I am going to abandon this question as I have totally redone my code per aitap's suggestion below - but now I have a question related to the new code that's not pertinent here.

Thanks again for the help, monks!!


In reply to UPDATED: Forking On Foreach Keys In Hash, Passing Hash To Sub, And Speed/Efficiency Recommendations by ImJustAFriend
