Why does this hash remove duplicate lines

kangaroobin has asked for the wisdom of the Perl Monks concerning the following question:

I was searching around for a way to remove duplicate lines from a text file that weren't necessarily adjacent to one another, while maintaining the line order. I found this solution by BrowserUk:

#! perl -sw
use strict;
my %lines;
#open DATA, $ARGV[0] or die "Couldn't open $ARGV[0]: $!\n";
while (<DATA>) {
    print if not $lines{$_}++;
}

__DATA__
this is a line
this is another line
yet another
and yet another still
this is a line
more
and more
and even more
this is a line
and this
and that
but not the other cos its a family website:)

It worked for me, but I don't know why. I still don't fully see the magic of hashes. Can someone explain why this works? Specifically the `if not $lines{$_}++` construct? I really don't see how incrementing works here. Also, could you replace `if not` with `unless`?

Thanks!


Replies are listed 'Best First'.
Re: Why does this hash remove duplicate lines by mr_mischief (Monsignor) on Mar 06, 2008 at 05:31 UTC
`$lines{$_}` is a hash entry, with the value of `$_` this time through the loop as the key. `$lines{$_}++` increments the value of the hash entry with the key of `$_`. Since you're creating the key the first time you see a specific line, the value for that key is undefined. The first time you increment an undefined value in Perl you get 1. Because the `++` comes after the entry, the test sees the value from before the increment.
`print if not $lines{$_}++;`
Undefined values are false, and 1 or higher are determined to be true values in boolean context. Basically, it's saying "Note this line. Note this line has been seen again if there's already a note about it. Print the line if it has not been seen before."
The reason order is preserved is simply that the printing happens at the time the line is seen, and the while loop and diamond operator read the lines in order.
As for whether you can use `unless`, that's probably simple enough for you to try and see, isn't it? ;-)
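To make the "note this line" description concrete, here is a minimal sketch that records what the post-increment hands back on each pass. The `@returned` array and the sample lines are made up for illustration; they are not part of the original script.

```perl
use strict;
use warnings;

my %lines;
my @returned;   # hypothetical: what $lines{...}++ evaluates to on each pass

for my $line ("a\n", "b\n", "a\n", "a\n") {
    # Post-increment yields the OLD count (a missing entry counts as 0),
    # then stores old count + 1 back into the hash.
    my $seen_before = $lines{$line}++;
    push @returned, $seen_before;
}

print "returned: @returned\n";              # returned: 0 0 1 2
print "final count for a: $lines{qq(a\n)}\n";   # final count for a: 3
```

Only the passes that return 0 would print in the original loop, which is why each distinct line appears once.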
Re^2: Why does this hash remove duplicate lines by chromatic (Archbishop) on Mar 06, 2008 at 08:18 UTC
"Undefined values are false, and 1 or higher are determined to be true values in boolean context."
Undefined values, 0, and the empty string are false. Everything else is true in boolean context, even negative numbers.
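A quick way to check the truth rules is to run a handful of values through a boolean test. This is only a sketch; the labels and the `%truth` hash are illustrative. It also includes the string "0", which Perl treats as false as well, and the string "0.0", which is true.

```perl
use strict;
use warnings;

# Each case is a [label, value] pair; the labels are only for printing.
my @cases = (
    [ 'undef'        => undef ],
    [ '0'            => 0     ],
    [ 'empty string' => ''    ],
    [ 'string "0"'   => '0'   ],
    [ '1'            => 1     ],
    [ '-1'           => -1    ],
    [ 'string "0.0"' => '0.0' ],
);

my %truth;
for my $case (@cases) {
    my ($label, $value) = @$case;
    # The ternary puts $value in boolean context.
    $truth{$label} = $value ? 'true' : 'false';
    print "$label is $truth{$label}\n";
}
```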
Re^3: Why does this hash remove duplicate lines by mr_mischief (Monsignor) on Mar 06, 2008 at 17:19 UTC
Yes, that's true. Negative numbers happen to be irrelevant to the example, which is the reason I didn't mention them. The clarification may help, but I didn't see that as the OP's source of misunderstanding the example.
Re: Why does this hash remove duplicate lines by ysth (Canon) on Mar 06, 2008 at 05:31 UTC
$lines{$_}++ counts how many times a line has been seen, but because it is a post-increment, not a pre-increment (see perlop), what the "if not" is checking is the number of times it has been seen before the current time. This number will be 0 the first time only. The "if not" (yes, unless would work instead) is the same as "if 0 == $lines{$_}++". See also http://perldoc.perl.org/perlfaq4.html#How-can-I-remove-duplicate-elements-from-a-list-or-array%3f
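The difference between post- and pre-increment is easy to see side by side. The sketch below uses `unless` in place of `if not` and collects results into arrays rather than printing directly; the sample list and variable names are made up. The last part is the list form of the idiom, which I believe is close to what the perlfaq4 entry linked above shows.

```perl
use strict;
use warnings;

my @input = ("x", "y", "x", "z", "y");

# Post-increment with unless: tests the count from BEFORE this occurrence.
my (%post, @kept_post);
for my $item (@input) {
    push @kept_post, $item unless $post{$item}++;
}

# Pre-increment: tests the count AFTER incrementing, which is never 0,
# so nothing ever passes the test.
my (%pre, @kept_pre);
for my $item (@input) {
    push @kept_pre, $item unless ++$pre{$item};
}

# List form of the same idiom: keep an element only on first sight.
my %seen;
my @unique = grep { !$seen{$_}++ } @input;

print "post-increment kept: @kept_post\n";   # post-increment kept: x y z
print "pre-increment kept:  @kept_pre\n";    # nothing kept
print "grep kept:           @unique\n";      # grep kept:           x y z
```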
Re: Why does this hash remove duplicate lines by halfcountplus (Hermit) on Mar 06, 2008 at 05:59 UTC
An interesting technique, but all it amounts to is testing for the existence of a hash element, which can also be done (perhaps more intuitively) thus:
my %lines;
while (<DATA>) {
    unless (exists($lines{$_})) {print "$_"}
    $hash{$_}="done"; # element defined
}
Re^2: Why does this hash remove duplicate lines by olus (Curate) on Mar 06, 2008 at 11:38 UTC
Nice try, but you should have tested it. `%lines` and `%hash` are two different hashes :). `strict` and `warnings` would have told you that.
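For reference, here is a sketch of the `exists`-based variant with the hash name made consistent. It reads from a small made-up list instead of `DATA` and collects into `@output` so the result can be checked; those names are assumptions for the example. With `use strict` in place, assigning to an undeclared `%hash` as in the snippet above would be rejected at compile time (Global symbol "%hash" requires explicit package name).

```perl
use strict;
use warnings;

my %lines;
my @output;   # collected instead of printed line by line

my @input = ("this is a line\n", "more\n", "this is a line\n", "and more\n");

for my $line (@input) {
    # Keep the line only if no entry for it exists yet.
    push @output, $line unless exists $lines{$line};
    $lines{$line} = "done";   # same hash that the exists test reads
}

print @output;
```

The stored value does not matter here, since `exists` only asks whether the key is present.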