I think your two-pass approach is fine in principle. Because you can't know in advance if a record has duplicates, you'll have to keep the ID of all records in memory just in case, in a single-pass approach. Whether this is feasible, it depends on the expected size of the file.

A suggestion for a single-pass approach would be to make $dup{key} an array-ref:

while (<STDIN>) { chomp; if (m/PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/) { ($id, $key)= ($1, $2); push @{$dup{$key}}, $id; #this is modified! # check if any key has init caps (not allowed) if ($key =~ m/^[A-Z]\w*/) { print "Id: $id - $key\n"; } } } print "Duplicated keys:\n"; for my $key (keys %dup) { my ($ids, $count) = map {$_, scalar @$_} $dup{$key}; next unless $count > 1; print "$key ($count)\n"; print "Id: $_\n" for @$ids; }

Update: I failed to see that you read-in the whole file in @lines to begin with. Code modified to avoid this. My comment about single-pass / two pass becomes a bit irrelevant in the new light.

More

This means that I first go through the file once to detect duplicates, and then go through the file again once for each duplicate found. I can't help but think that there is a more elegant and efficient way of doing things. My code is shown below:

This confused me at first, because I didn't read your code carefully. Actually, in your original code, you don't go through the file twice (in I/O terms). You actually read the whole file line by line into an array, then loop over that array twice, populating a hash in the first pass. My solution also goes through the file only once (in a while loop), populating a deep data structure (hash of arrays), then, in a second loop, it goes over the elements of that hash printing those with more than one ID.

As for writing the whole program in a single loop it's not possible, because you have to basically group-by. Random_Walk above cheats by assuming there can be a maximum of a single duplicate for any given textual key.


In reply to Re: I sense there is a simpler way... by calin
in thread I sense there is a simpler way... by HelgeG

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.