Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a large file which contains lines similar to
at12
at13
bn23
bn23
and it goes on and on. Somewhere in the file most of the lines are duplicated, but I am interested in the entries that do not have a duplicate. I have tried using the Unix uniq utility with no luck. Is it possible to use Perl to get around this? Thanks

Replies are listed 'Best First'.
Re: removing non-duplicates
by davidrw (Prior) on Jul 11, 2005 at 19:36 UTC
    Why didn't unix uniq work? I believe that uniq -u file.txt is what you're looking for. Note that the file needs to be sorted first:
    sort file.txt | uniq -u > unique_lines.txt
      Although sort file | uniq works, why not just use...
      sort -u

      One world, one people

        Because that doesn't do what the OP wanted. Quotes from the respective man pages:
        • sort: -u outputs only the first of an equal run (i.e. one copy of every distinct line)
        • uniq: -u only prints unique lines (i.e. only the lines that appear exactly once)
        [me@host tmp]$ cat /tmp/t
        A
        B
        A
        C
        [me@host tmp]$ sort -u /tmp/t
        A
        B
        C
        [me@host tmp]$ sort /tmp/t | uniq -u
        B
        C
      That gives you one copy of each distinct line, whereas the OP wanted only the lines that appear exactly once. It could be done with uniq -c plus grep and cut, but at that point you just want to do it in Perl. Gar. Should have double-checked that -u option. Good answer.

      Caution: Contents may have been coded under pressure.
        heh. score one more for the *nix cmdline utils :)

        Just for the sake of argument/exercise, even if there were only -c I would still do it on the command line (these are also handy if you want lines that show up exactly N times, since -u only helps when N==1):
        # using perl:
        uniq -c /tmp/d | perl -ne '($n,$s)=split(/\t/,$_,2); print $s if $n == 1'

        # using grep/cut (make sure that's a real tab after the 1 in the grep):
        uniq -c /tmp/d | egrep '^ *1 ' | cut -d\t -f2
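        A pure-Perl sketch of the same "exactly N times" idea, which avoids the pre-sort entirely; the choice of N==2 is only an assumption for illustration:

        # hypothetical example: print the lines that occur exactly twice,
        # keeping their original order (no sort or uniq needed)
        perl -ne 'push @lines, $_; $count{$_}++;
                  END { print grep { $count{$_} == 2 } @lines }' /tmp/d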
Re: removing non-duplicates
by Fang (Pilgrim) on Jul 11, 2005 at 19:27 UTC
      Is it best to read the file into an array and, for each element, ignore it if it matches another? Or a hash, with key and value equal to parts of each line?

        It really depends on what you want to do in the end. Do you want to create a new file with all the duplicate entries removed? Do you want to keep one instance of each unique entry? Or do you simply need a report about the entries?

        From what you told us, I'd say there's no need to read the entire file into memory; something like the following should do.

        #!/usr/bin/perl
        use strict;
        use warnings;

        my %seen;
        my $file = "/path/to/your/file";

        open(MYFILE, "<", $file) or die "Could not open '$file' for reading: $!";
        while (<MYFILE>) {
            chomp;          # strip the newline so the key is just the entry itself
            $seen{$_}++;
        }
        close MYFILE;

        # Now every unique entry has a value of 1 in the hash %seen
        print "Unique entries:\n";
        print "$_\n" for (grep { $seen{$_} == 1 } keys %seen);
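        Given the OP's sample lines (at12, at13, bn23, bn23), this would print at12 and at13, in whatever order keys %seen happens to return them, since bn23 is the only entry that appears more than once.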
Re: removing non-duplicates
by jdporter (Paladin) on Jul 11, 2005 at 19:43 UTC
    my @lines = <>;
    my %count;
    $count{$_}++ for @lines;
    my @unique_lines = grep { $count{$_} == 1 } @lines;
    print for @unique_lines; # or whatever you do with the result
    This solution does not require the input to be pre-sorted, and it preserves the original order of the lines printed.
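    If the file is too large to hold every line in memory (the OP does say it is a large file), here is a minimal sketch of a two-pass variant of the same counting idea; the file name is an assumption:

    use strict;
    use warnings;

    # first pass: count occurrences; only one hash entry per distinct line stays in memory
    my $file = 'file.txt';    # hypothetical file name
    my %count;
    open my $fh, '<', $file or die "Cannot open '$file': $!";
    $count{$_}++ while <$fh>;
    close $fh;

    # second pass: print the lines that occurred exactly once, in their original order
    open $fh, '<', $file or die "Cannot open '$file': $!";
    while (<$fh>) {
        print if $count{$_} == 1;
    }
    close $fh;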
Re: removing non-duplicates
by tlm (Prior) on Jul 12, 2005 at 00:43 UTC

    Here's a command-line variant of other solutions offered:

    % perl -ne 'push @a, $_; $h{$_}++; END { print for grep $h{$_}==1, @a }' foo.txt
    If you don't care to preserve the original order of lines then:
    % perl -ne '$h{$_}++; END { print for grep $h{$_}==1, keys %h }' foo.txt

    the lowliest monk

Re: removing non-duplicates
by gopalr (Priest) on Jul 12, 2005 at 05:28 UTC

    TIMTOWTDI

    use strict;
    use List::MoreUtils qw(uniq);

    my @array = <DATA>;
    @array = uniq(@array);    # one copy of each distinct line
    print "\n@array";

    __DATA__
    at12
    at12
    at12
    at13
    bn23
    bn23
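    A side note: uniq keeps one copy of each distinct line, whereas the OP asked for the lines that occur exactly once. If your List::MoreUtils is recent enough to export singleton, a sketch along the same lines would be:

    use strict;
    use List::MoreUtils qw(singleton);   # assumes a version that provides singleton()

    my @array = <DATA>;
    # singleton() keeps only the values that occur exactly once
    print singleton(@array);

    __DATA__
    at12
    at12
    at12
    at13
    bn23
    bn23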
Re: removing non-duplicates
by anonymized user 468275 (Curate) on Jul 12, 2005 at 09:52 UTC
    Apart from simply adding the -u option to unix sort, here are two perl options:

    1) preserving the existing order:

    perl -e 'while(<>){ $_{$_} or print $_; $_{$_}=1;}' <file
    2) sorted
    perl -e 'while(<>){ $_{$_}=1;} print sort keys %_;' <file
    Note: I consider %_ safe enough for quick perl -e usage (without any modules), but not for normal programming, where the hash %_ should be replaced by a properly named and declared one, as in the sketch below.
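    For instance, a minimal sketch of the first (order-preserving) one-liner rewritten as a normal script, with a declared lexical hash in place of %_; the file name is an assumption:

    use strict;
    use warnings;

    my %seen;    # properly named and declared replacement for %_
    open my $fh, '<', 'file.txt' or die "Cannot open file.txt: $!";   # hypothetical file name
    while (<$fh>) {
        print unless $seen{$_}++;    # print only the first occurrence of each line
    }
    close $fh;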

    One world, one people