in reply to Re^2: I sense there is a simpler way...
in thread I sense there is a simpler way...
Is gobbling an entire file into an array considered bad form? . . .
One should always be aware of the efficiency concern. If you're sure the file will never be "too big", slurping (as it's called) shouldn't be a problem. Otherwise, you'd do well to do per-record reading/processing where practical.
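To make the contrast concrete, here's a minimal sketch of both styles; the filename is just a placeholder:

```perl
use strict;
use warnings;

# Slurping: the entire file is read into memory at once.
open my $fh, '<', 'records.txt' or die "Cannot open records.txt: $!";
my @lines = <$fh>;
close $fh;
print "Slurped ", scalar @lines, " lines\n";

# Per-record processing: only the current line is held in memory.
open $fh, '<', 'records.txt' or die "Cannot open records.txt: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    # ... process $line here ...
}
close $fh;
```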
Calin's solution is good. If you want a little extra efficiency, you can buy it with memory, i.e. an extra data structure. The solution below maintains a separate hash for the keys that are known to be duplicates, so that at the end we iterate only over that hash. This pays off when the number of duplicate keys is significantly smaller than the total number of keys.
(Not tested.)

```perl
my( %keys, %dup );

while (<STDIN>) {
    chomp;
    if ( /PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/ ) {
        my( $id, $key ) = ( $1, $2 );

        if ( exists $dup{$key} ) {        # already found to be a dup
            push @{ $dup{$key} }, $id;
        }
        elsif ( exists $keys{$key} ) {    # only seen once before
            push @{ $dup{$key} }, delete($keys{$key}), $id;
        }
        else {                            # first time seen
            $keys{$key} = $id;
        }

        # check if any key has init caps (not allowed)
        if ( $key =~ /^[A-Z]\w*/ ) {
            print "Id: $id - $key\n";
        }
    }
}

print "\nDuplicated keys:\n\n";
for my $key ( keys %dup ) {
    print "Key: $key\n";
    print "\tId: $_\n" for @{$dup{$key}};
}
```
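The delete($keys{$key}) is what keeps %dup small: the first-seen id is moved into %dup only at the moment a second occurrence of the key shows up, so %dup ends up holding nothing but the keys that actually repeat. If you save the above as, say, find_dups.pl (name made up), you'd run it as perl find_dups.pl < yourfile.txt, since it reads from STDIN.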