I think your two-pass approach is fine in principle. In a single-pass approach you can't know in advance whether a record has duplicates, so you have to keep the IDs of all records in memory just in case. Whether that is feasible depends on the expected size of the file.

A suggestion for a single-pass approach would be to make $dup{$key} an array ref:

my %dup;   # key => array ref of every ID seen with that key

while (<STDIN>) {
   chomp;
   if (m/PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/) {
      my ($id, $key) = ($1, $2);
      push @{$dup{$key}}, $id;  # this is modified!
      # report any key that starts with a capital letter (not allowed)
      if ($key =~ m/^[A-Z]/) {
         print "Id: $id - $key\n";
      }
   }
}

print "Duplicated keys:\n";

for my $key (keys %dup) {
   my $ids   = $dup{$key};   # array ref of IDs for this key
   my $count = @$ids;        # an array in scalar context gives its element count
   next unless $count > 1;
   print "$key ($count)\n";
   print "Id: $_\n" for @$ids;
}

Update: I failed to see that you read the whole file into @lines to begin with. Code modified to avoid this. In that light, my comment about single-pass vs. two-pass becomes a bit irrelevant.

More

"This means that I first go through the file once to detect duplicates, and then go through the file again once for each duplicate found. I can't help but think that there is a more elegant and efficient way of doing things. My code is shown below:"

This confused me at first, because I didn't read your code carefully. In your original code you don't actually go through the file twice (in I/O terms): you read the whole file line by line into an array, then loop over that array twice, populating a hash in the first pass. My solution also goes through the file only once (in a while loop), populating a nested data structure (a hash of arrays); a second loop then goes over the elements of that hash and prints those with more than one ID.
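To make the hash-of-arrays idea concrete, here is a minimal self-contained sketch of the same two steps (group, then filter) on in-memory data. The sample lines and the helper names group_ids and duplicates are made up for illustration; the real input format is known only through the regex above.

```perl
use strict;
use warnings;

# Hypothetical sample input, shaped only to satisfy the regex used above.
my @sample = (
    'PROBABLECAUSE(1, x, alpha)',
    'PROBABLECAUSE(2, x, beta)',
    'PROBABLECAUSE(3, x, alpha)',
);

# Step 1: group IDs by key into a hash of array refs.
sub group_ids {
    my @lines = @_;
    my %ids_for;
    for my $line (@lines) {
        next unless $line =~ m/PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/;
        push @{ $ids_for{$2} }, $1;   # autovivifies the array ref on first use
    }
    return \%ids_for;
}

# Step 2: keep only the keys that collected more than one ID.
sub duplicates {
    my ($ids_for) = @_;
    my %dups;
    for my $key (keys %$ids_for) {
        my $ids = $ids_for->{$key};
        $dups{$key} = $ids if @$ids > 1;
    }
    return \%dups;
}

my $dups = duplicates(group_ids(@sample));
for my $key (sort keys %$dups) {
    my $ids = $dups->{$key};
    printf "%s (%d): %s\n", $key, scalar @$ids, join(', ', @$ids);
}
```

With the sample data this prints "alpha (2): 1, 3". The keys are sorted here only to make the output order predictable; hash order in Perl is otherwise unspecified.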

As for writing the whole program in a single loop: it's not possible, because what you are doing is essentially a group-by, and a group is only complete once all input has been read. Random_Walk above cheats by assuming that any given textual key has at most one duplicate.
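The group-by constraint shows up as soon as you try to report from inside the reading loop. The following is only a sketch with a made-up helper (stream_report) and made-up (id, key) pairs: it emits a line as soon as a key is seen for the second time, so the IDs of one key end up scattered through the output and no per-key count is available.

```perl
use strict;
use warnings;

# Returns the report lines in the order a single loop would emit them.
# Each argument is an array ref: [ $id, $key ].
sub stream_report {
    my @pairs = @_;
    my (%first_id, %seen, @out);
    for my $pair (@pairs) {
        my ($id, $key) = @$pair;
        if (++$seen{$key} == 1) {
            $first_id{$key} = $id;   # nothing to report yet, just remember it
            next;
        }
        # On the second occurrence we learn that the first ID was a duplicate too.
        push @out, "$key: $first_id{$key}" if $seen{$key} == 2;
        push @out, "$key: $id";
    }
    return @out;
}

my @report = stream_report(
    [1, 'alpha'], [2, 'beta'], [3, 'alpha'], [4, 'beta'], [5, 'alpha'],
);
print "$_\n" for @report;
```

Here the third alpha ID (5) is printed after the beta lines, and the total for alpha is not known until the input ends. Collecting into the hash of arrays first and printing afterwards is what keeps each group together.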

In reply to Re: I sense there is a simpler way... by calin
in thread I sense there is a simpler way... by HelgeG
