in reply to Removing repeated lines from file

The MD5 hash suggestion got me thinking.

It seems like the two big obstacles are 1) the duplicate lines are not necessarily adjacent, and you cannot sort the file to make them so, and 2) there's too much data to hold in memory at once.

What if we could get around obstacle 2? Perhaps if we used some lossless compression on your input, we could reduce its storage requirement. If the compression is lossless (i.e., the original can be reconstructed with perfect fidelity from its compressed image), then if we compress two unique lines, their compressed results will also be unique.

Depending on how much compression you are able to get, you may very well be able to process your input "in memory".

OK I guess it really doesn't solve the storage problem per se, just kind of avoids it. It's possible that even with compression, your input stream is just too big.
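
Here is a minimal sketch of the idea (my illustration, not tested against your data): keep only a fixed-size fingerprint of each line in a hash rather than the line itself. Digest::MD5 is a core module, and its 16-byte digests are not truly lossless, so there is a vanishingly small collision risk; a real lossless compressor such as Compress::Zlib would avoid even that, at the cost of larger keys.

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    # Remember a 16-byte digest per line instead of the whole line.
    my %seen;
    while (my $line = <STDIN>) {
        my $key = md5($line);              # fixed size regardless of line length
        print $line unless $seen{$key}++;  # print each line the first time only
    }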

Re: Re: Removing repeated lines from file
by matth (Monk) on Jun 24, 2003 at 14:39 UTC
    As my duplicate lines tend to be clustered I have tried this code:
    while(<INFILE>){
        #print "here\n";
        if ($_ =~ /^(.{1,50})\t{0,50}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})/){
            #print "here_B\n";
            $bumb = "";
            $gene_id = $1;
            if ($2 =~ /.{1,50}/){ push (@alternative_ids, $2) }
            else { push (@alternative_ids, $bumb); }
            if ($3 =~ /.{1,50}/){ push (@alternative_ids, $3) }
            else { push (@alternative_ids, $bumb); }
            if ($4 =~ /.{1,50}/){ push (@alternative_ids, $4) }
            else { push (@alternative_ids, $bumb); }
            if ($5 =~ /.{1,50}/){ push (@alternative_ids, $5) }
            else { push (@alternative_ids, $bumb); }
            if ($6 =~ /.{1,50}/){ push (@alternative_ids, $6) }
            else { push (@alternative_ids, $bumb); }
            if ($7 =~ /.{1,50}/){ push (@alternative_ids, $7) }
            else { push (@alternative_ids, $bumb); }
            if ($8 =~ /.{1,50}/){ push (@alternative_ids, $8) }
            else { push (@alternative_ids, $bumb); }
            if ($9 =~ /.{1,50}/){ push (@alternative_ids, $9) }
            else { push (@alternative_ids, $bumb); }
            print @alternative_id;
            foreach (@alternative_ids){
                $old_line = $new_line;
                $alternative_id = $_;
                $new_line = "$gene_id\t$alternative_id\n";
                @record_previous_lines = qw/blank/;
                if ($new_line ne $old_line){ # discount consecutive repeats of identical lines
                    $switch = 1;
                    foreach (@record_previous_lines){
                        if ($_ =~ /$new_line.$old_line/){
                            $switch = 2;
                        }
                    }
                    if ($switch = 1){ # don't print lines that have previously been repeated
                        print OUT_B "$new_line$old_line";
                    }
                    if (@record_previous_lines > 200){ # don't let this array get bigger than 200
                        shift @record_previous_lines; # shift deletes the first value of an array
                    }
                    push (@record_previous_lines, $new_line.$old_lines);
                }
            }
            undef @alternative_ids;
        }
    }
    The last bit doesn't seem to be working as I want. Can anyone throw light on this? Sorry I'm not being more specific.

      In

      if ($switch = 1){ # don't print lines that have previously been repeated
          print OUT_B "$new_line$old_line";
      }
      should that not be
            if ($switch == 1)
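
      A quick illustration of the difference (an untested snippet, just to show the point): = inside a condition assigns, and the condition then tests the assigned value, so that branch is always taken.

          use warnings;

          my $switch = 2;
          print "always prints\n"   if $switch = 1;   # assignment: condition is the value 1, always true
          print "only when equal\n" if $switch == 1;  # comparison: what was intended
          # with warnings enabled Perl will typically complain:
          #   Found = in conditional, should be == at ...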

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      OK well if you had said "I expect to find duplicate lines within 200 lines of the original" then we could have helped you sooner!

      :)

      So ... how do you want that "last bit" to work, and how is it really working?

        It doesn't seem to get to $switch = 2, even when I remove the . in the regexp of that loop.
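
        A guess, based only on reading the code above (so take it with a grain of salt): @record_previous_lines is reset to qw/blank/ at the top of every pass through the outer foreach, so by the time the inner match runs it only ever contains the single word "blank", and $switch can never become 2. The push at the bottom also uses $old_lines (note the extra s), which is never set. Something along these lines, with the reset moved outside the loop and a plain string comparison instead of a regex, might behave more like you expect:

            # hypothetical rework of the "seen this pair before?" check
            my $pair = $new_line . $old_line;
            $switch = 1;
            foreach my $prev (@record_previous_lines) {
                if ($prev eq $pair) {   # exact string comparison, no regex metacharacter or newline worries
                    $switch = 2;
                    last;
                }
            }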

      How clustered are your lines? Here are a couple of tweakable previous-line recognizers.

      #
      # remember everything, probably uses too much memory.
      #
      {
          my %seen;
          sub seen_complete {
              return 1 if exists $seen{$_[0]};
              $seen{$_[0]} = ();
              return 0;
          }
      }

      #
      # remember last N lines only.
      #
      {
          my %seen;
          my $remember = 200;
          my @memory;
          sub seen_fixed {
              return 1 if exists $seen{$_[0]};
              delete $seen{shift(@memory)} if @memory > $remember;
              push @memory, $_[0];
              $seen{$_[0]} = ();
              return 0;
          }
      }

      #
      # remember N buckets of lines with X lines per bucket.
      #
      {
          my @bucket = ( {} );
          my $numbuckets = 2;
          my $bucketsize = 200;
          sub seen_bucket {
              foreach (@bucket) {
                  return 1 if exists $_->{$_[0]};
              }
              if (keys %{$bucket[-1]} >= $bucketsize) {
                  shift @bucket if @bucket >= $numbuckets;
                  push @bucket, {};
              }
              $bucket[-1]->{$_[0]} = ();
              return 0;
          }
      }

      I only tested the last one, and only sorta tested it at that.

      while (<>) {
          print unless seen_bucket($_);
      }
      __END__
      Ten sets of 1..400 should get uniq'd to 1..400

      $ perl -le 'for(1..10){for(1..400){print}}' | perl dup.pl | wc -l
      400

      Ten sets of 1..401 should get uniq'd to (1..401) x 10,
      because 2 buckets of 200 lines hold up to 400

      $ perl -le 'for(1..10){for(1..401){print}}' | perl dup.pl | wc -l
      4010