grep confusion

Baratski has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone.

I'm trying to remove duplicates from an array using the following:

undef %saw;
@out = grep(! $saw{$_}++, @array);

When I loop through @out, I find that duplicates are still there.

@array contains paths that have been pushed into it. Each path has an identical root, like so:

Root/abc/file.............
Root/def/ghijk/file.............
Root/file.............

Could the "/" delimiter be messing things up for me? Also, if you'll notice the "............." at the end of each path, these are packed records from a binary file. There is some arbitrary binary data at the end of each record. I've tried unpacking each record into a hex string before pushing them into @array but that didn't seem to help either. Any insight much appreciated. Thanks.

Replies are listed 'Best First'.
Re: grep confusion by Zaxo (Archbishop) on Oct 01, 2005 at 19:33 UTC
Your code looks fine. The slashes will not interfere with forming hash keys. The problem appears to be that the trailing data is making each element of @array unique. If there is no way of distinguishing where a path ends and data starts, you have an intractable problem. Ideally, the data should start with ASCII NUL, which cannot be part of a file name. Then, `my %saw; @out = grep {!$saw{$_}++} map {substr $_, 0, index($_,"\0")} @array;` If you don't have that luxury, can you characterize what the filename extensions are? Matching those can locate the end of the path for you. After Compline, Zaxo
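A minimal sketch of the NUL-delimiter idea above, assuming each record really does carry a NUL byte between the path and the binary data. The `path_of` helper and the sample records are hypothetical, not taken from the original data. Unlike the one-liner, it leaves a record untouched when no NUL is found, because `index` returns -1 in that case and `substr $_, 0, -1` would chop the last character.

```perl
use strict;
use warnings;

# Hypothetical helper: return the part of a record before the first NUL,
# or the whole record if it contains no NUL at all.
sub path_of {
    my ($rec) = @_;
    my $end = index($rec, "\0");
    return $end < 0 ? $rec : substr($rec, 0, $end);
}

# Made-up sample records: path, NUL, then arbitrary binary data.
my @array = (
    "Root/abc/file\0\x01\x02",
    "Root/abc/file\0\xff\x10",
    "Root/file\0\x07",
    "Root/file",
);

# Same %saw idiom as in the question, but keyed on the path only.
my %saw;
my @out = grep { !$saw{$_}++ } map { path_of($_) } @array;

print "$_\n" for @out;
```

With these sample records, @out ends up holding just Root/abc/file and Root/file, in first-seen order.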
Re: grep confusion by polypompholyx (Chaplain) on Oct 01, 2005 at 19:30 UTC
The code you posted should work. I think the items you believe should be duplicates are not: the data at the end of the string makes them different: `"Root/file\0\0" ne "Root/file\0"`. I think the bug lies in your unpacking, or that you need to do some regexing or similar to remove the irrelevant junk on the end of the items of `@array`.
Re: grep confusion by Baratski (Acolyte) on Oct 02, 2005 at 10:17 UTC
The consensus is valid. The code is correct, and I should have cleaned up the records before storing them. For the life of me I don't know why I didn't realize this. You all, once again, have shown me the light. Thank you. BTW, `$rec =~ s/\W+$//;` Just above shows how I snipped off the arbitrary data at the end of each record. Thanks again. :-)
Re^2: grep confusion by Hue-Bond (Priest) on Oct 02, 2005 at 18:51 UTC
Keep in mind that if there's a \w character in the middle of the binary data, that won't work:

$ perl | od -tx1
my $c=65.66.20.21.67.23.25.10;
$c =~ s/\W+$//;
print "$c\n";
__OUTPUT__
0000000 41 42 14 15 43 0a
0000006

Note that the two bytes (14 and 15) between "B" (42) and "C" (43) are still there. -- David Serrano
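One possible way around this caveat is to anchor the match at the start of the record rather than the end. The sketch below assumes that a path consists only of word characters, '.', '/' and '-', and that the binary data does not begin with one of those characters; both are assumptions about the data, and `leading_path` is a hypothetical helper. It reuses the byte sequence from the example above.

```perl
use strict;
use warnings;

# Assumption: a path contains only word characters, '.', '/' and '-'.
# Hypothetical helper: return the leading run of such characters.
sub leading_path {
    my ($rec) = @_;
    my ($path) = $rec =~ m{\A([\w./-]+)};
    return defined $path ? $path : '';
}

# Same bytes as the example above: "AB", two binary bytes, "C", more binary.
my $c = "AB\x14\x15C\x17\x19\x0a";

(my $trimmed = $c) =~ s/\W+$//;   # strips only the trailing non-word run
my $path = leading_path($c);      # stops at the first non-path byte

# %vd prints the ordinal of each character, joined by dots.
printf "%vd\n", $trimmed;
printf "%vd\n", $path;
```

If the binary data can start with a word character, this heuristic will keep that byte as part of the path, so a real delimiter or a known record layout remains the more robust choice.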
Re: grep confusion by ambrus (Abbot) on Oct 02, 2005 at 08:12 UTC
Like the others, I think that this code should work. Is it possible that the duplicates are not real duplicates, because there is some difference in the binary data after the paths?