Good morning Monks. I'm working on a concept script to replace a KSH script we use (concept = not final var names, debug statements still in, no real comments, etc.). The script parses an input file that can be up to 5 million lines long. It does two things with that data: it builds an array out of the values in one field (extracted via "split"), and it takes a substring of those same values to use as hash keys. I then want to loop through the hash keys and, for each one, fork a new process in parallel that greps through the array for that value and gets a count. My code is below. The issues I am asking for help on are threefold:

1.) The forking works, but the children appear to run serially. I would like them to run in parallel for speed.
2.) I am having issues passing the hash into the sub and/or returning the data back out. I get the following errors every run (multiple instances of each), and my "out" files don't get written successfully:
a. Use of uninitialized value in concatenation (.) or string at ./GetNPANXXCount.pl line 98.
b. Use of uninitialized value in numeric comparison (<=>) at ./GetNPANXXCount.pl line 101.
c. Use of uninitialized value in printf at ./GetNPANXXCount.pl line 102.
3.) There seems to be room to make this code more efficient and speed things along, but I can't find it. Any suggestions would be greatly appreciated!
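
On point 3, the approach described above greps the whole array once per key, so the work grows with (number of keys) x (number of values). A minimal sketch of a single-pass alternative follows, assuming all that is needed is the number of distinct MINs per six-digit prefix. The sample lines and the names %seen_min and %npanxx_count are hypothetical, made up for illustration; this is not presented as the final script.

```perl
use strict;
use warnings;

# Hypothetical sample lines in the BIGFILE layout (comma-separated, MIN in field 2).
my @lines = (
    '{9999991234ff00aa},9999991234,1,"Y",0,0,{55760FFC56837F3E}',
    '{9999995678ff00aa},9999995678,1,"Y",0,0,{55760FFC56837F3E}',
    '{9999991234ff00aa},9999991234,1,"Y",0,0,{55760FFC56837F3E}',  # repeated MIN
    '{5551230001ff00aa},5551230001,1,"Y",0,0,{55760FFC56837F3E}',
);

my (%seen_min, %npanxx_count);

for my $line (@lines) {
    next unless $line =~ /^\{/;             # data lines start with "{"
    my $min = (split /,/, $line)[1];        # second field, e.g. 9999991234
    next if $seen_min{$min}++;              # count each distinct MIN only once
    $npanxx_count{ substr($min, 0, 6) }++;  # first six digits, e.g. 999999
}

# Report sorted by prefix, then sorted by count (ties broken by prefix).
printf "%-7s %s\n", $_, $npanxx_count{$_} for sort keys %npanxx_count;
printf "%-7s %s\n", $_, $npanxx_count{$_}
    for sort { $npanxx_count{$a} <=> $npanxx_count{$b} or $a cmp $b }
        keys %npanxx_count;
```

With the counts accumulated while reading, no second array, no grep, and no fork are needed for this step; in the real script the loop would read from the gunzip pipe instead of @lines.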

Here's my code thus far:

#Sample Line From BIGFILE
#{9999991234ff00aa},9999991234,1,"Y",0,0,{55760FFC56837F3E}
my %minhash = ();
my %npanxxhash = ();
my @npanxxarray;
my $key;
my $value;
my $npanxx;
my $npanxxcnt;
my $in = "BIGFILE.out.gz";
my $out_min = "npanxx_minsort.out";
my $out_cnt = "npanxx_cntsort.out";

open IN, "/bin/gunzip -c $in |" or die "IN: $!\n";
open OUT_MIN, ">", "$out_min" or die "OUT_MIN: $!\n";
open OUT_CNT, ">", "$out_cnt" or die "OUT_CNT: $!\n";

print "Time: " . time . "\n";
print "Processing $in...\n";

while (<IN>) {
        if ( $_ =~ m/^{.*$/ ) {
                #Grab 9999991234 from line above
                my ($a,$MIN,$c,$d,$e,$f) = split( /,/ );
                $minhash{$MIN} = undef;
        }
}
close IN;

print "Time: " . time . "\n";
print "Massaging Data...\n";

while ( ($key, $value) = each(%minhash) ) {
        #Get just 999999 from above
        $npanxx = substr($key, 0, 6);
        push(@npanxxarray, $npanxx);
        $npanxxhash{$npanxx} = undef;
}
undef $key;
undef $value;

print "Time: " . time . "\n";
print "Getting Counts...\n";

foreach $key (sort keys %npanxxhash) {
        &CountAndHash($key,\@npanxxarray,\%npanxxhash);
#       $npanxxcnt = grep (/$key/, @npanxxarray);
#       $npanxxhash{$key} = $npanxxcnt;
}

print "Time: " . time . "\n";
print "Generating Flat Files...\n";

foreach $key (sort keys %npanxxhash) {
        print OUT_MIN "$key $npanxxhash{$key}\n";
}

foreach $key (sort { $npanxxhash{$a} <=> $npanxxhash{$b} } keys %npanxxhash) {
        printf OUT_CNT "%-7s %s\n", $key, $npanxxhash{$key};
}

print "Time: " . time . "\n";
print "Complete...\n";

sub CountAndHash {
        my ($key, $arrayref, $hashref) = @_;
        my %hashref;

        if (!defined(my $pid = fork())) {
        die "Cannot fork to child: $!\n";
        } elsif ($pid == 0) {
                #print "Launching child process...\n";
                $npanxxcnt = grep (/$key/, $arrayref);
                $hashref{$key} = $npanxxcnt;
                exit;
        } else {
                my $ret = waitpid($pid,0);
                print "PID $ret completed...\n";
        }

        return ($npanxxcnt, $hashref);
}
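
A note on points 1 and 2, offered as a sketch rather than a definitive fix. CountAndHash calls waitpid right after each fork, so the parent blocks until that child exits before starting the next one. The child also writes into its own copy of memory (and into the freshly declared %hashref rather than the passed reference), so nothing reaches the parent and the values in %npanxxhash stay undef. In addition, grep is given $arrayref itself rather than @$arrayref. The sketch below forks every child first, has each child report through a pipe, and only then collects results. The sample data and the name %reader_for are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in data: one six-digit prefix per distinct MIN.
my @npanxxarray = qw(999999 999999 555123 555123 555123 212555);
my %npanxxhash  = map { $_ => undef } @npanxxarray;

my %reader_for;    # child PID => read end of that child's pipe

# Phase 1: start every child before waiting on any of them.
foreach my $key (sort keys %npanxxhash) {
    pipe(my $reader, my $writer) or die "Cannot create pipe: $!\n";
    my $pid = fork();
    die "Cannot fork to child: $!\n" unless defined $pid;

    if ($pid == 0) {
        # Child: count exact matches and send "key count" to the parent.
        close $reader;
        my $count = grep { $_ eq $key } @npanxxarray;
        print {$writer} "$key $count\n";
        close $writer;
        exit 0;
    }

    # Parent: keep the read end and move straight on to the next key.
    close $writer;
    $reader_for{$pid} = $reader;
}

# Phase 2: read each child's answer, then reap the child.
foreach my $pid (keys %reader_for) {
    my $fh   = $reader_for{$pid};
    my $line = <$fh>;
    close $fh;
    waitpid($pid, 0);
    next unless defined $line;
    chomp $line;
    my ($key, $count) = split ' ', $line;
    $npanxxhash{$key} = $count;
}

printf "%-7s %s\n", $_, $npanxxhash{$_} for sort keys %npanxxhash;
```

One child per key is only workable for a small number of keys. With thousands of distinct prefixes, the number of simultaneous children would probably need a cap, and the cost of forking may well exceed the cost of the counting itself.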

Thanks in advance for your help!!

UPDATE UPDATE UPDATE 2014-08-09

Thanks to one and all for your assistance. I am going to abandon this question as I have totally redone my code per aitap's suggestion below - but now I have a question related to the new code that's not pertinent here.

Thanks again for the help, monks!!

In reply to UPDATED: Forking On Foreach Keys In Hash, Passing Hash To Sub, And Speed/Efficiency Recommendations by ImJustAFriend
