Counting unique instances from Array Sort

ewhitt has asked for the wisdom of the Perl Monks concerning the following question:

Re: Counting unique instances from Array Sort by Skeeve (Parson) on Jan 07, 2008 at 09:22 UTC
The answer is given more than once here. But again: `my %ip_count; ++$ip_count{$_} for (@ipAddresses);` `s$$([},&%#}/&/]+}%&{});#$&&s&&$^X.($'^"%]=\&(\|?{%` `+`.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re^2: Counting unique instances from Array Sort by ewhitt (Scribe) on Jan 07, 2008 at 09:53 UTC
I am not sure I follow. What I am trying to do is take: `my @ipAddresses = ("172.16.16.1 ", "172.16.16.1 ", "172.16.16.1 ", "172.16.16.2 ");` and be able to print something like this during or after the sort: `172.16.16.1,3 172.16.16.2,1` Thanks!
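For what it's worth, the counting hash from the first reply gets to that output with a couple more lines. This is only a minimal sketch: it assumes the trailing spaces in the sample data are unintended and can be trimmed, and `count_ips` is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref mapping each address to its count.
sub count_ips {
    my @addresses = @_;
    my %ip_count;
    for my $ip (@addresses) {
        # Assumption: surrounding whitespace is noise, so strip it first.
        (my $clean = $ip) =~ s/^\s+|\s+$//g;
        ++$ip_count{$clean};
    }
    return \%ip_count;
}

my @ipAddresses = ("172.16.16.1 ", "172.16.16.1 ", "172.16.16.1 ", "172.16.16.2 ");
my $ip_count = count_ips(@ipAddresses);

# Plain string sort of the keys; fine for this sample, but it is not a
# numeric sort by octet.
print "$_,$ip_count->{$_}\n" for sort keys %$ip_count;
```

With the sample array this prints `172.16.16.1,3` and `172.16.16.2,1` on separate lines. The counting does not depend on the order of the input, so the sort only needs to happen when the keys are printed.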
Re: Counting unique instances from Array Sort by ikegami (Patriarch) on Jan 07, 2008 at 09:45 UTC
The idiomatic way of eliminating duplicates creates a hash of counts as a byproduct. `my %counts; my @uniqueIPs = grep !$counts{$_}++, @ipAddresses; for my $ip (sort @uniqueIPs) { print("$ip ($counts{$ip})\n"); }`
Re^2: Counting unique instances from Array Sort: explanation needed by bradcathey (Prior) on Jan 07, 2008 at 13:22 UTC
Okay, I've seen and used this a hundred times, but I have to come clean... I have no idea what is going on "under the hood" with this: `my %counts; my @uniqueIPs = grep !$counts{$_}++, @ipAddresses;` I'd like to have someone explain it to me in a sentence or two. But let me take a stab at it first:
1. The first element in the list `@ipAddresses` becomes the first key in the hash `%counts`.
2. And because the first element of `@ipAddresses` ($_) is not equal to the first element of `@uniqueIPs` (mainly because it's empty), the first element of `@ipAddresses` ($_) is "pushed" onto `@uniqueIPs`.
3. HERE'S WHERE I GET LOST: What is incrementing that value? The fact that it satisfies the condition of not matching, therefore being true? And let's say the next element in `@ipAddresses` is the same as the first, and because it IS equal to the first element of `@uniqueIPs` it is not pushed.
QUESTION: Why would the value of that first key get incremented? So, is the condition asking "if it is not equal then increment it"? What am I not getting? Thanks.
-- Brad
"The important work of moving the world forward does not wait to be done by perfect men." George Eliot
Re^3: Counting unique instances from Array Sort: explanation needed by ikegami (Patriarch) on Jan 07, 2008 at 17:35 UTC
`my @b = grep EXPR, @a;` is equivalent to `my @b; foreach (@a) { if (EXPR) { push @b, $_; } }`

In this case: `my @uniqueIPs; foreach (@ipAddresses) { if (!$counts{$_}++) { push @uniqueIPs, $_; } }`

As for post-incrementing, `$x++` is equivalent to `my $orig_x = $x; ++$x; $orig_x`

In this case:
`my @uniqueIPs;
foreach (@ipAddresses) {
    my $old_count = $counts{$_};
    $counts{$_}++;               # Add one to count.
    if (!$old_count) {           # If it's the first time we've seen it,
        push @uniqueIPs, $_;     # save it
    }
}`

So, `my %counts; my @uniqueIPs = grep !$counts{$_}++, @ipAddresses;` is short for
`my %counts;
my @uniqueIPs;
foreach (@ipAddresses) {
    $counts{$_}++;               # Add one to count.
    if ($counts{$_} == 1) {      # If it's the first time we've seen it,
        push @uniqueIPs, $_;     # save it
    }
}`

It is dense, but it's tried and true. Just add a comment for the less learned readers.
`# Remove duplicate IP addresses by counting
# the number of times each address occurs.
my %counts;
my @uniqueIPs = grep !$counts{$_}++, @ipAddresses;`

Oh, by the way, you could also write it as follows if it's less confusing:
`# Remove duplicate IP addresses by counting
# the number of times each address occurs.
my %counts;
my @uniqueIPs = grep ++$counts{$_} == 1, @ipAddresses;`
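To check the equivalence rather than take it on faith, the one-liner and the spelled-out loop can be run side by side. A small sketch; the sub names `unique_with_grep` and `unique_with_loop` and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# The one-line idiom: an element is kept only the first time it is seen.
sub unique_with_grep {
    my %counts;
    my @unique = grep { !$counts{$_}++ } @_;
    return (\@unique, \%counts);
}

# The same logic written as an explicit loop.
sub unique_with_loop {
    my (%counts, @unique);
    foreach my $item (@_) {
        my $old_count = $counts{$item};     # value before incrementing
        $counts{$item}++;                   # always add one to the count
        push @unique, $item if !$old_count; # first sighting: keep it
    }
    return (\@unique, \%counts);
}

my @ipAddresses = qw(172.16.16.2 172.16.16.1 172.16.16.1 172.16.16.1);
my ($g_unique, $g_counts) = unique_with_grep(@ipAddresses);
my ($l_unique, $l_counts) = unique_with_loop(@ipAddresses);

print "grep: @$g_unique\n";
print "loop: @$l_unique\n";
print "$_ ($g_counts->{$_})\n" for sort @$g_unique;
```

Both versions keep the first occurrence of each address in input order and leave the same counts in the hash.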
Re^4: Counting unique instances from Array Sort: explanation needed by bradcathey (Prior) on Jan 07, 2008 at 19:09 UTC
Re^5: Counting unique instances from Array Sort: explanation needed by ikegami (Patriarch) on Jan 07, 2008 at 20:19 UTC
Re^3: Counting unique instances from Array Sort: explanation needed by Anonymous Monk on Jan 07, 2008 at 14:20 UTC
This question deserves a longer and clearer answer than I have time to give, but the first step to understanding this idiom is the answer to the question "Why would the value of that first key get incremented?": the increment happens because it is always explicitly applied to the value of the current key of the `%counts` hash. There is nothing whatsoever conditional about it. The other key to figuring out the idiom is that it is a post-increment, so the expression yields the value from before the increment.
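Both points can be seen by recording what `$counts{$key}++` evaluates to on each pass. A minimal sketch; `trace_post_increment` is a hypothetical name used only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: records what `$counts{$key}++` evaluates to at each step.
sub trace_post_increment {
    my @keys = @_;
    my (%counts, @returned);
    for my $key (@keys) {
        # The increment always happens; the expression's value is the count
        # from *before* the increment (0 for a key not seen yet).
        push @returned, $counts{$key}++;
    }
    return (\@returned, \%counts);
}

my ($returned, $counts) = trace_post_increment(qw(a a b a));
print "returned: @$returned\n";              # returned: 0 1 0 2
print "a=$counts->{a} b=$counts->{b}\n";     # a=3 b=1
```

The zeros in the first line mark first sightings. These are the cases that `!` turns into true, so they are the elements `grep` keeps.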
Re^3: Counting unique instances from Array Sort: explanation needed by NetWallah (Canon) on Jan 07, 2008 at 17:00 UTC
Ok - here is a functional decomposition of "my @uniqueIPs = grep !$counts{$_}++, @ipAddresses;":

my @uniqueIPs; # Just declare it as an empty array.
# The resulting value in %counts is equivalent to the execution of this statement:
$counts{$_}++ for @ipAddresses;
# This is the important piece.
# The hash %counts uses each IP address as a key.
# The first time an IP address is encountered, the key-value pair is created, with the missing value treated as zero.
# The (++) increments that to 1.
# The next time that IP is encountered, the value is incremented. So, at the end, the value for each key
# contains the count of occurrences for that IP address (key).
...
grep !$counts{$_}++, @ipAddresses
# "grep" evaluates the expression for each element ($counts{$_} is the VALUE), and the negation (!)
# turns a zero value into true. Because the operator used is post-increment ($blah++), the value seen
# before negation will be zero for first-seen IP addresses only. For second and subsequent
# sightings of the IP, the pre-negation value will be non-zero, and the post-negation value will be false.
# "grep" keeps only the elements for which the expression is true, effectively returning all the KEYS of %counts,
# which are all the unique IPs.

"As you get older three things happen. The first is your memory goes, and I can't remember the other two... " - Sir Norman Wisdom
Re: Counting unique instances from Array Sort by ambrus (Abbot) on Jan 07, 2008 at 09:26 UTC
See the question "How can I print out a word-frequency or line-frequency summary?" in perlfaq6.
Re: Counting unique instances from Array Sort by locked_user sundialsvc4 (Abbot) on Jan 07, 2008 at 19:02 UTC
The essential idea here is that we are using a hash to keep track of the counts. A hash, as you well know, is a very efficient way to look up a moderate number of values based on a “key” such as an IP address. Now here's where the coder decided to take advantage of one of Perl's many “shortcuts.” He “knew” that if you increment ("`++`") the value for a hash key that doesn't exist yet, Perl will “helpfully” treat that key as though it did exist and had the value zero. As for me, I don't like to see code like that. The case where a particular key does not yet exist in a hash structure is logically distinct from the case where it does. Therefore, I prefer to see that distinction expressly taken care of within the code, even if the resulting code is “inefficient.” So I prefer to have something very pedantic, like this (complete with comments!):
`# Maintain a running count of all the unique IP-addresses seen ...
my %ip_occurs;
foreach my $key (@address_list) {
    if (defined($ip_occurs{$key})) {
        $ip_occurs{$key}++;      # seen again ...
    } else {
        $ip_occurs{$key} = 1;    # first time ...
    }
}`
(The above code has been edited to fix an obvious `dmub tpyo` ...) Notice that it does not matter in what order the keys are scanned when putting them into the hash table. There is no reason to sort the keys in this loop. Instead, you will sort the keys when you extract them from the hash table:
`foreach my $key (sort keys %ip_occurs) {
    print "$key occurs $ip_occurs{$key} times\n";
}`
(Caution: extemporaneous Perl! Not responsible for tpy0s...)
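A runnable sketch of the explicit approach described above, under a few assumptions: the counts live in a hash keyed by the address string, the first-sighting test uses `exists` (a substitution made here for `defined`; either works for this purpose), and `count_explicitly` is a hypothetical wrapper name.

```perl
use strict;
use warnings;

# Hypothetical helper: counts occurrences with an explicit first-sighting check.
sub count_explicitly {
    my @address_list = @_;
    my %ip_occurs;
    foreach my $key (@address_list) {
        if (exists $ip_occurs{$key}) {
            $ip_occurs{$key}++;      # seen again
        }
        else {
            $ip_occurs{$key} = 1;    # first time
        }
    }
    return \%ip_occurs;
}

my $ip_occurs = count_explicitly(qw(172.16.16.2 172.16.16.1 172.16.16.1 172.16.16.1));

# Sort only on the way out; the counting loop never needed an order.
foreach my $key (sort keys %$ip_occurs) {
    print "$key occurs $ip_occurs->{$key} times\n";
}
```

The resulting counts are the same as with the bare `++` shortcut. The difference is only in how visibly the first-sighting case is handled.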