kangaroobin has asked for the wisdom of the Perl Monks concerning the following question:

I was searching around for a way to remove duplicate lines from a text file that weren't necessarily adjacent to one another, while maintaining the line order. I found this solution by BrowserUk:

#! perl -sw
use strict;
my %lines;
#open DATA, $ARGV[0] or die "Couldn't open $ARGV[0]: $!\n";
while (<DATA>) {
    print if not $lines{$_}++;
}
__DATA__
this is a line
this is another line
yet another
and yet another still
this is a line
more and more and even more
this is a line
and this and that
but not the other cos its a family website:)

It worked for me, but I don't know why. I still don't fully see the magic of hashes. Can someone explain why this works? Specifically the if not $lines{$_}++ structure? I really don't see how incrementing works here. Also, could you replace if not with unless?

Thanks!

Re: Why does this hash remove duplicate lines
by mr_mischief (Monsignor) on Mar 06, 2008 at 05:31 UTC
     $lines{$_} is a hash entry, with the value of $_ this time through the loop as the key.

     $lines{$_}++ is incrementing the value of the hash entry with the key of $_.

    Since you're creating the key the first time you see a specific line, the value for that key is undefined. The first time you increment an undefined value in Perl you get 1.

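     print if not $lines{$_}++; The postfix ++ returns the value the entry held before the increment, so the test sees an undefined value the first time a line appears and 1 or more on every repeat. Undefined values are false, and 1 or higher are determined to be true values in boolean context.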

    Basically, it's saying "Note this line. Note this line has been seen again if there's already a note about it. Print the line if it has not been seen before."

    The reason order is preserved is simply because the printing is happening at the time the line is seen and the while loop and diamond operator are reading the lines in order.

    As for whether you can use unless, that's probably simple enough for you to try and see, isn't it? ;-)
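The explanation above can be traced in a small sketch. The `dedupe` helper and its sample input are hypothetical, not part of BrowserUk's script; it pulls the old count into `$old` so the two halves of the postfix `++` are visible separately, and it uses `unless` to show that it behaves the same as `if not` here.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns its arguments with
# later duplicates dropped, keeping first-seen order.
sub dedupe {
    my %lines;
    my @kept;
    for my $line (@_) {
        # Postfix ++ yields the old count (undef the first time, which is
        # false), then stores old count + 1 back into the hash entry.
        my $old = $lines{$line}++;
        push @kept, $line unless $old;    # "unless X" means "if not X"
    }
    return @kept;
}

my @out = dedupe("a\n", "b\n", "a\n", "c\n", "b\n");
print @out;    # prints a, b and c, one per line
```

Writing `print unless $lines{$_}++;` in the original loop should therefore give the same output as `print if not $lines{$_}++;`.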

      Undefined values are false, and 1 or higher are determined to be true values in boolean context.

      Undefined values, 0, the string "0", and the empty string are false. Everything else is true in boolean context -- even negative numbers.

        Yes, that's true. Negative numbers happen to be irrelevant to the example, which is the reason I didn't mention them. The clarification may help, but I didn't see that as the OP's source of misunderstanding the example.
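For anyone who wants to check the truth rules discussed in this sub-thread, here is a minimal sketch. The `truth` helper and the list of cases are made up for illustration; the true/false results follow Perl's documented rules for boolean context.

```perl
use strict;
use warnings;

# Hypothetical helper: reports how Perl treats a scalar in boolean context.
sub truth { return $_[0] ? 'true' : 'false' }

my @cases = (
    [ 'undef',  undef ],    # false
    [ '0',      0     ],    # false
    [ '"0"',    '0'   ],    # false: the one-character string "0"
    [ '""',     ''    ],    # false: the empty string
    [ '1',      1     ],    # true
    [ '-1',     -1    ],    # true: negative numbers are true
    [ '"0.0"',  '0.0' ],    # true: a non-empty string other than "0"
);

for my $case (@cases) {
    printf "%-8s %s\n", $case->[0], truth($case->[1]);
}
```

In the duplicate-removal loop only undef and the positive counts 1, 2, 3 and so on ever occur, which is why the shorter statement was enough for that example.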
Re: Why does this hash remove duplicate lines
by ysth (Canon) on Mar 06, 2008 at 05:31 UTC
Re: Why does this hash remove duplicate lines
by halfcountplus (Hermit) on Mar 06, 2008 at 05:59 UTC

    An interesting technique, but all it amounts to is testing for the existence of a hash element, which can also be done (perhaps more intuitively) thus:

    my %lines;
    while (<DATA>) {
        unless (exists($lines{$_})) { print "$_" }
        $hash{$_} = "done";    # element defined
    }

      Nice try, but you should have tested it. %lines and %hash are two different hashes :). strict and warnings would have told you that.
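A corrected sketch of the exists-based idea, using one hash throughout, might look like the following. The `dedupe_exists` helper and its sample input are hypothetical; the point is only that the hash being tested and the hash being written must be the same one.

```perl
use strict;
use warnings;

# Hypothetical helper: removes later duplicates using exists() instead of
# the post-increment trick, keeping first-seen order.
sub dedupe_exists {
    my %lines;
    my @kept;
    for my $line (@_) {
        next if exists $lines{$line};    # seen before: skip it
        $lines{$line} = 1;               # record it in the same hash we test
        push @kept, $line;
    }
    return @kept;
}

print dedupe_exists("this is a line\n", "yet another\n", "this is a line\n");
```

With `use strict` in force, writing to an undeclared `%hash` by mistake fails at compile time, so the mix-up is caught before the script runs.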