New Novice has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I would like to clean up a file containing some numbers by eliminating duplicates. The numbers are separated by newlines, one number per line.

My idea is to read the file into an array and then write each element (each number) to a new file unless it has already been processed. To do that, however, I would need to check whether a scalar (the new number) is identical to any element of an array (all numbers already processed). I did not find any such "contained in" function for arrays. Note that the operation involves an array (the checklist) and a scalar (the new number).
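
In rough, untested code, the plan looks like this, with contains() written by hand for now (the file names are placeholders):

my @seen;   # the checklist of numbers already processed
open my $in,  '<', 'numbers.txt'      or die $!;
open my $out, '>', 'numbers_uniq.txt' or die $!;
while (my $num = <$in>) {
    chomp $num;
    next if contains($num, @seen);   # the "contained in" test
    print $out "$num\n";
    push @seen, $num;
}

sub contains {   # hand-rolled for now
    my ($value, @list) = @_;
    for (@list) { return 1 if $_ eq $value }
    return 0;
}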

Presumably something like this exists (as it is a fundamental concept in mathematics).

Hints would be greatly appreciated!

Re: Getting rid of duplicates
by davorg (Chancellor) on Sep 29, 2004 at 14:34 UTC
      The author's problem was to remove duplicate elements from an input file, not an array. While reading the elements from the input file into an array and then applying the FAQ-given solution is one way of solving the author's problem, there are other solutions available that are simpler and more efficient (see my follow-up post for one of them).

      I agree that the FAQ is the first place to look for answers and that we ought to point to it in responses to questions like this. But I also think that we ought to read questions carefully and answer the whole of the question actually asked, even if the FAQ answers a similar question for us. While it might be easy for some to convert the FAQ's answer into an answer to the original question, that task might be beyond some readers.

      Updated: I removed the last half of this post because it was overly preachy. All I really wanted to say was: when we post pointers to the FAQ, let's also spend a few moments relating them back to the context of the original problem. I'm leaving this big Updated notice here as a humbling reminder to myself to stay off the soapbox.

      Cheers,
      Tom

Re: Getting rid of duplicates
by rlb3 (Deacon) on Sep 29, 2004 at 14:12 UTC
    Hello,

    You may want to use hashes.

    This is untested.
    my %store;
    foreach (<DATA>) {
        chomp;
        $store{$_} = 1;
    }
    print join ",", keys %store;

    __DATA__
    1236
    3232
    1236
    4323
    4323
    Something like that may work for you.

    rlb3

      I have a utility function I often use for this sort of thing. See below:

      my(@numbers) = <DATA>;
      chomp @numbers;
      print join("\n", uniqStrings(@numbers));

      sub uniqStrings {
          my %temp = ();
          @temp{@_} = ();   # hash slice: one key per input value, duplicates collapse
          return keys %temp;
      }

      __DATA__
      1236
      3232
      1236
      4323
      4323

      That will actually run a little faster because the hash-slice assignment (@temp{@_} = ()) builds the lookup hash in one Perl-level operation instead of one hash store per loop iteration. Note that, like the version above, it does not preserve the input order. Hope that helps.
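
      If you want to check that on your own data, a quick comparison with the core Benchmark module might look like this (the sample numbers are invented; results will vary with perl version and data):

      use Benchmark qw(cmpthese);

      my @nums = map { int rand 1000 } 1 .. 10_000;   # invented sample data

      cmpthese(-1, {
          # one hash store per element, in a Perl-level loop
          loop_store => sub {
              my %seen;
              $seen{$_} = 1 for @nums;
              my @uniq = keys %seen;
          },
          # all keys stored at once via a hash slice
          hash_slice => sub {
              my %seen;
              @seen{@nums} = ();
              my @uniq = keys %seen;
          },
      });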

      May the Force be with you
Re: Getting rid of duplicates
by tmoertel (Chaplain) on Sep 29, 2004 at 14:49 UTC
    The following one-liner does what you want and has the advantages of handling multiple textual representations of the same number and preserving the order of the input lines. I expect that both are important to you, because otherwise you would just have used sort -nu to solve your problem:
    perl -lne 'print unless $counts{0+$_}++' input.txt > output.txt
    We use the -lne command-line switches to cause Perl to read each line of input, strip off the line break, and then execute the following code on the result:
    print unless $counts{0+$_}++
    The code prints the current line if the count of times we have seen it so far is zero. We use the hash %counts to keep track of the counts. Note the 0+ inside of the hash index. It ensures that the input lines are interpreted as numbers so that, for example, "1" and "1.0" are considered to be the same for the sake of duplicate removal.
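
    As a quick illustration of that normalization, here is a file-free version of the same idea (the values in __DATA__ are made up):

    my %counts;
    while (my $line = <DATA>) {
        chomp $line;
        # 0+$line turns "1" and "1.0" into the same hash key, 1
        print "$line\n" unless $counts{ 0 + $line }++;
    }

    __DATA__
    1
    1.0
    2
    2
    3

    This prints 1, 2, and 3; "1.0" is dropped as a duplicate of "1".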

    Cheers,
    Tom

Re: Getting rid of duplicates
by periapt (Hermit) on Sep 29, 2004 at 14:19 UTC
    If you are not married to a perl solution, use the unix sort utility:
    sort -n -u < infile.txt

    If you need a perl solution, try:
    perl -e'my %list; chomp, $list{$_} = 1 while <>; print "$_\n" for sort { $a <=> $b } keys %list' < infile.txt


    PJ
    use strict; use warnings; use diagnostics;
Re: Getting rid of duplicates
by Arunbear (Prior) on Sep 29, 2004 at 14:19 UTC
    You can use a hash to 'remember' which numbers have already been seen:
    use strict;
    use warnings;

    my %numbers;
    open my $in, "infile" or die $!;
    open my $out, ">outfile" or die $!;
    while (<$in>) {
        chomp;
        if (not exists $numbers{$_}) {
            print $out "$_\n";
            $numbers{$_}++;
        }
    }
    This method preserves the original order of the numbers. In general, you can test a hash for containment via the exists function; for arrays you would need grep or the first function from List::Util.
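    For example, either of these tests whether a value occurs in an array (the sample values are made up):

    use List::Util qw(first);

    my @processed = (1236, 3232, 4323);   # numbers seen so far
    my $new = 4323;

    # grep in scalar context returns the number of matches
    print "seen\n" if grep { $_ == $new } @processed;

    # first returns the first matching element (undef if none)
    # and stops scanning as soon as it finds a match
    print "seen\n" if defined first { $_ == $new } @processed;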
Re: Getting rid of duplicates
by jZed (Prior) on Sep 29, 2004 at 14:24 UTC
    my(@array, %hash);
    for (1, 2, 3, 2) {
        push @array, $_ unless $hash{$_}++;
    }
    print @array;   # prints 123 (no duplicates)
Re: Getting rid of duplicates
by terra incognita (Pilgrim) on Sep 29, 2004 at 18:36 UTC
    Another one using a hash, this is a modified character frequency example from perlretut. This will sort and also handle negative numbers. Comments on where I can improve this code and what practices I should stay away from are appreciated.
    use strict;

    local $/;                          # slurp mode: read all of DATA at once
    my $f = <DATA>;
    my %chars;
    $f =~ s/(.+)/$chars{$1}++;$1/eg;   # final $1 puts each match back unchanged
    print "'$_'\n" foreach (sort { $a <=> $b } keys %chars);

    __DATA__
    1
    1
    2
    2
    3
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    -12
    -3
Re: Getting rid of duplicates
by johndageek (Hermit) on Sep 29, 2004 at 16:40 UTC
    In my simple mind, if the input file is sorted, I would do the following:

    open my $in, "file" or die "cannot open input file: $!\n";
    my $prev_record = '';   # initialize so the first comparison is defined
    while (<$in>) {
        print if $_ ne $prev_record;
        $prev_record = $_;
    }

    Please read disclaimers by all monks that contain the word not.

    Enjoy!
    Dageek