ImJustAFriend has asked for the wisdom of the Perl Monks concerning the following question:

Greetings and good evening. On Friday morning I posted a question about a concept script I'm working on to replace a KSH script, which led to my code being totally revamped. Now I have an issue with my new code that I am hoping someone will be able to help me out with. Here goes...

First, a Cliff's Notes explanation of what I'm trying to do. I have an input file that can be up to 5 million lines long, with each line I am interested in (and the bulk of the file) formatted thusly:
{9999991234ff00aa},9999991234,1,"Y",0,0,{55760FFC56837F3E}
I need to grab field 2 (min), substring it to first 6 characters (NPANXX), and count the MINs that start with the NPANXX and output to 2 files that are both patterned as "NPANXX<space>MIN count" - one line per NPANXX record. File 1 is sorted by MIN, file 2 is sorted by count. Here's my code thus far:

#!/usr/bin/perl

my %counts = ();
my %list = ();
my $keys;
my $total;
my $in = "BIGFILE.out.gz";
my $out_min = "npanxx_minsort.out";
my $out_cnt = "npanxx_cntsort.out";
my $debug_out = "Test.debug";

open IN, "/bin/gunzip -c $in |" or die "IN: $!\n";
open OUT_MIN, ">", "$out_min" or die "OUT_MIN: $!\n";
open OUT_CNT, ">", "$out_cnt" or die "OUT_CNT: $!\n";
open DEBUG, ">", "$debug_out" or die "DEBUG: $!\n";

print "Processing $in...\n";
$total = 0;
while (<IN>) {
    chomp($_);
    next unless m/^{.*$/;
    my $min = (split ',')[1];
    my $npanxx = substr($min, 0, 6);
    push (@{$list{$npanxx}}, $min);
    #LogToFile("Test.out", "$_ -> $min & $npanxx\n");
    $counts{$npanxx} += 1;
    $total++;
}

print "$_ $counts{$_}\n" for (sort keys %counts);
print "Total: $total\n";

print "Generating Flat Files...\n";
foreach $key (sort keys %counts) {
    print OUT_MIN "$key $counts{$key}\n";
}
`echo "Total: $total" >> npanxx_minsort.out`;

foreach $key (sort { $counts{$a} <=> $counts{$b} } keys %counts) {
    printf OUT_CNT "%-7s %s\n", $key, $counts{$key};
}
`echo "Total: $total" >> npanxx_cntsort.out`;

sub LogToFile {
    my ($file, $msg) = @_;
    my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time);
    $year += 1900;
    $mon += 1;
    if ($sec <= 9) { $sec = "0" . $sec; }
    if ($min <= 9) { $min = "0" . $min; }
    if ($hour <= 9) { $hour = "0" . $hour; }
    if ($mon <= 9) { $mon = "0" . $mon; }
    if ($mday <= 9) { $mday = "0" . $mday; }
    my $stamp = $year . "/" . $mon . "/" . $mday . " " . $hour . ":" . $min . ":" . $sec;
    my $str = "[" . $stamp . "] " . $msg;
    open FILE, ">>", $file;
    print FILE $str;
    close FILE;
}

Everything works awesome - it does EXACTLY what it's supposed to do... with one minor flaw. Turns out the KSH script this will replace is de-duping the MIN that's being captured prior to counting it. I have struggled with implementing something to perform the de-duping to solve this issue since yesterday and can't see the answer in my head to fix this. I think it's going to probably go as an "unless" next to "$counts{$npanxx} += 1" but I have tried about every permutation of code I can think of there and nothing works. I think it is my lack of understanding on that AoH situation.

I would be very grateful to anyone who can get me to the right place on this one. Once this issue is resolved, I can move onto something else... LOL!!!!

Thanks all, as always!!

Replies are listed 'Best First'.
Re: Find If Value Exists In Array Of Hashes
by NetWallah (Canon) on Aug 10, 2014 at 05:58 UTC
    • You seem to be collecting info in %list, but never use it. Why bother ?
    • You will need to explain what you mean by "de-duping the MIN that's being captured prior to counting it". The hash %counts, by its very nature, de-dupes the key. Perhaps an example of what the ksh does that your code does not would help.
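
If "de-duping the MIN" means that each distinct MIN should count only once toward its NPANXX, one common way to do that is a second hash keyed on the full MIN. The following is only a sketch under that assumption: the sample lines are made up in the format shown in the question, and %seen is a name introduced here, not something from the original script.

```perl
use strict;
use warnings;

# Hypothetical sample lines in the question's format; the second repeats a MIN.
my @lines = (
    '{9999991234ff00aa},9999991234,1,"Y",0,0,{55760FFC56837F3E}',
    '{9999991234ff00aa},9999991234,1,"N",0,0,{55760FFC56837F3E}',
    '{9999995678ff00aa},9999995678,1,"Y",0,0,{55760FFC56837F3E}',
    '{8888881234ff00aa},8888881234,1,"Y",0,0,{55760FFC56837F3E}',
);

my (%seen, %counts);
for my $line (@lines) {
    next unless $line =~ /^\{/;            # only the data lines
    my $min = (split /,/, $line)[1];       # field 2 is the MIN
    next if $seen{$min}++;                 # skip a MIN that was already counted
    $counts{ substr($min, 0, 6) }++;       # count it under its NPANXX
}

# Prints "888888 1" and then "999999 2" for the sample above.
print "$_ $counts{$_}\n" for sort keys %counts;
```

The trade-off is memory: %seen holds one key per distinct MIN, which for a 5 million line input can mean several million keys.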

            Profanity is the one language all programmers know best.

      Thanks NetWallah for the questions. While getting sample data to answer them, I noticed there is a way to de-dup built right into the data stream lines (field 4, the "Y" in my sample, has to be "Y" -> if it's "N", it's a duplicate). If you hadn't asked for clarification, I never would have seen that!!

      Thank you so much for leading me to a FAR simpler solution than I imagined!!!
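
For anyone landing here later, the field 4 check described above might look roughly like this. It is a sketch, not the code that was actually deployed: it assumes field 4 carries its literal double quotes (as in the sample line), and the sample lines themselves are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample lines; field 4 is "Y" for a first occurrence, "N" for a duplicate.
my @lines = (
    '{9999991234ff00aa},9999991234,1,"Y",0,0,{55760FFC56837F3E}',
    '{9999991234ff00aa},9999991234,1,"N",0,0,{55760FFC56837F3E}',
    '{9999995678ff00aa},9999995678,1,"Y",0,0,{55760FFC56837F3E}',
    '{8888881234ff00aa},8888881234,1,"Y",0,0,{55760FFC56837F3E}',
);

my %counts;
my $total = 0;
for my $line (@lines) {
    next unless $line =~ /^\{/;                      # only the data lines
    my ($min, $flag) = (split /,/, $line)[1, 3];     # fields 2 and 4
    next unless $flag eq '"Y"';                      # "N" marks a duplicate, so skip it
    $counts{ substr($min, 0, 6) }++;
    $total++;
}

print "$_ $counts{$_}\n" for sort keys %counts;
print "Total: $total\n";
```

Compared with tracking every MIN in a hash, this needs no extra memory, but it relies entirely on the upstream system setting the flag correctly.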

        Instead of all of those if clauses to add leading zeroes to the date / time elements could you not just use sprintf ? (untested)
        my $stamp = sprintf '%02d/%02d/%02d.%02d.%02d.%02d', $year, $mon, $mday, $hour, $min, $sec;
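
A slightly fuller sketch of the same idea, keeping the "YYYY/MM/DD HH:MM:SS" layout that LogToFile builds by hand. The sub name timestamp and its optional epoch argument are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: formats an epoch time (default: now) as "YYYY/MM/DD HH:MM:SS".
sub timestamp {
    my $epoch = @_ ? $_[0] : time;
    my ($sec, $min, $hour, $mday, $mon, $year) = localtime($epoch);
    # localtime returns years since 1900 and a zero-based month
    return sprintf '%04d/%02d/%02d %02d:%02d:%02d',
        $year + 1900, $mon + 1, $mday, $hour, $min, $sec;
}

print "[", timestamp(), "] some log message\n";
```

The core POSIX module offers another route: strftime('%Y/%m/%d %H:%M:%S', localtime) should produce the same layout without the manual year and month adjustments.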