Baratski has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone.

I'm trying to remove duplicates from an array using the following:

undef %saw; @out = grep(! $saw{$_}++, @array);

When I loop through @out, I find that duplicates are still there.

@array contains paths that have been pushed into it. Each path has an identical root, like so:

Root/abc/file.............
Root/def/ghijk/file.............
Root/file.............

Could the "/" delimiter be messing things up for me? Also, if you'll notice the "............." at the end of each path, these are packed records from a binary file. There is some arbitrary binary data at the end of each record. I've tried unpacking each record into a hex string before pushing them into @array but that didn't seem to help either. Any insight much appreciated. Thanks.

Replies are listed 'Best First'.
Re: grep confusion
by Zaxo (Archbishop) on Oct 01, 2005 at 19:33 UTC

    Your code looks fine. The slashes will not interfere with forming hash keys.
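
    As a quick check of that claim, here is a minimal sketch using made-up paths (the sample data is hypothetical, not the poster's records). It applies the same grep/%saw idiom to strings containing "/" and keeps the first occurrence of each:

```perl
use strict;
use warnings;

# Hypothetical sample paths, with two deliberate duplicates.
my @array = qw(
    Root/abc/file
    Root/def/ghijk/file
    Root/abc/file
    Root/file
    Root/file
);

# The idiom from the question: keep an element only the first time it is seen.
my %saw;
my @out = grep { !$saw{$_}++ } @array;

print "$_\n" for @out;
```

    Only three distinct paths should be printed, in their original order, so the slashes play no part in the problem.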

    The problem appears to be that the trailing data is making each element of @array unique. If there is no way of distinguishing where a path ends and data starts, you have an intractable problem. Ideally, the data should start with ASCII NUL, which cannot be part of a file name. Then,

    my %saw; @out = grep {!$saw{$_}++} map {substr $_, 0, index($_,"\0")} @array;
    If you don't have that luxury, can you characterize what the filename extensions are? Matching those can locate the end of the path for you.
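
    One caveat with the one-liner above: index returns -1 when a record contains no NUL, and substr with a length of -1 then drops the last character. A slightly more defensive sketch, assuming (hypothetically) that each record is a path, a NUL, then arbitrary bytes, could split on the first NUL instead:

```perl
use strict;
use warnings;

# Hypothetical records: path, NUL terminator, then arbitrary binary data.
my @array = (
    "Root/abc/file\0\x01\x02",
    "Root/abc/file\0\xff\x10",
    "Root/file\0\x00\x07",
    "Root/nonul",                # no terminator at all; kept unchanged
);

# Keep only the part before the first NUL, then remove duplicates.
my %saw;
my @out = grep { !$saw{$_}++ }
          map  { (split /\0/, $_, 2)[0] } @array;

print "$_\n" for @out;
```

    The first two records differ only in their trailing bytes, so they collapse into a single path once the data after the NUL is discarded.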

    After Compline,
    Zaxo

Re: grep confusion
by polypompholyx (Chaplain) on Oct 01, 2005 at 19:30 UTC
    The code you posted should work. I think the items you believe are duplicates are not: the data at the end of each string makes them different. For example, "Root/file\0\0" ne "Root/file\0". Either the bug lies in your unpacking, or you need a regex or similar to remove the irrelevant junk from the end of the items in @array.
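
    To see such invisible differences, one option is to hex-dump the elements. This is only a diagnostic sketch; the two strings are the ones from the comparison above, and @hex is a name chosen for illustration:

```perl
use strict;
use warnings;

# Two strings that print identically on most terminals but are not equal.
my @array = ("Root/file\0\0", "Root/file\0");

# Hex-dump each element so the trailing bytes become visible.
my @hex = map { unpack 'H*', $_ } @array;

printf "%2d bytes: %s\n", length($array[$_]), $hex[$_] for 0 .. $#array;
print "elements differ\n" if $array[0] ne $array[1];
```

    The dumps differ only in the number of trailing 00 bytes, which is enough to make the two strings distinct hash keys.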
Re: grep confusion
by Baratski (Acolyte) on Oct 02, 2005 at 10:17 UTC
    The consensus is valid. The code is correct, and I should have cleaned up the records before storing them. For the life of me I don't know why I didn't realize this. You all, once again, have shown me the light. Thank you.

    BTW,

    $rec =~ s/\W+$//;

    The line above shows how I snipped off the arbitrary data at the end of each record.

    Thanks again. :-)

      Keep in mind that if there's a \w character in the middle of the binary data, that won't work:

      $ perl | od -tx1
      my $c = 65.66.20.21.67.23.25.10;
      $c =~ s/\W+$//;
      print "$c\n";
      __OUTPUT__
      0000000 41 42 14 15 43 0a
      0000006

      Note that the two bytes (14 and 15) between "B" (42) and "C" (43) are still there.
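
      One possible workaround, sketched here under the assumption that paths consist only of word characters, ".", "/" and "-" (an assumption, not something stated in the thread), is to anchor at the front and keep the leading run of path characters instead of trimming from the end. The record below is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical record: a path followed by junk that contains a word character ("C").
my $rec = "Root/abc/file\x14\x15C\x17\x19";

# Trimming from the end stops at the "C", so some junk stays attached.
(my $tail_trim = $rec) =~ s/\W+$//;

# Anchoring at the start keeps only the leading run of assumed path characters.
my ($path) = $rec =~ m{^([\w./-]+)};

printf "tail trim: %d bytes left\n", length $tail_trim;
print  "front anchor: $path\n";
```

      This still fails if the first byte of the binary data happens to be one of the allowed path characters, so it reduces the problem rather than eliminating it.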

      --
      David Serrano

Re: grep confusion
by ambrus (Abbot) on Oct 02, 2005 at 08:12 UTC

    Like the others, I think this code should work. Is it possible that the duplicates are not real duplicates because the binary data after the paths differs?