in reply to Find duplicate lines from the file and write it into new file.
The two obvious approaches (setting aside tied hashes, databases, etc.) are both crippled. Memory requirements will prevent you from reading the whole file into a hash (as you've already found), while a pure line-by-line approach (read a line, then scan the rest of the file for a duplicate) would be crippling in terms of time. So why not try a hybrid: read in, say, 10,000 lines, check the rest of the file for duplicates of those, then read the next 10,000?
The following code is neither complete nor fully tested (e.g. EOF handling?).
use strict;
use warnings;

my $file      = 'data.txt';
my $thiscount = 0;    # lines read in the current batch
my $fullcount = 0;    # lines read from the file so far
my $max       = 10;   # batch size - change this to, say, 10000
my %lines;            # lines seen in the current batch

open(INPUT, '<', $file) or die "Cannot open $file: $!";
while (<INPUT>) {
    chomp;
    # duplicates within the current batch are caught as we read
    if (exists $lines{$_}) {
        print "duplicate line (on read):$_\n";
    }
    else {
        $lines{$_} = 1;
    }
    $thiscount++;
    $fullcount++;

    # once a full batch is in memory, scan the rest of the file for
    # duplicates of anything in the batch, then start a fresh batch
    if ($thiscount >= $max) {
        my $checkcount = 0;
        open(CHECK, '<', $file) or die "Cannot open $file: $!";
        while (<CHECK>) {
            $checkcount++;
            # only compare against lines beyond the point already read
            if ($checkcount > $fullcount) {
                chomp;
                if (exists $lines{$_}) {
                    print "duplicate line (on check):$_\n";
                }
            }
        }
        close(CHECK);
        undef %lines;
        $thiscount = 0;
    }
}
close(INPUT);
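One gap worth noting: the original question asks for the duplicates to be written into a new file, and the script above just prints them to STDOUT. The quickest fix is shell redirection (perl finddups.pl > dups.txt, with both names made up here); otherwise, open a dedicated output handle and point both print statements at it. A minimal sketch of the changes, assuming dups.txt as the output name:

# open an output file for the duplicates (file name assumed)
open(OUT, '>', 'dups.txt') or die "Cannot open dups.txt: $!";

# ...then, inside the loops above, write to the handle instead of STDOUT:
print OUT "duplicate line (on read):$_\n";
print OUT "duplicate line (on check):$_\n";

# ...and close the handle once the main read loop is done
close(OUT);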
Tom Melly, pm@tomandlu.co.uk

map{$a=1-$_/10;map{$d=$a;$e=$b=$_/20-2;map{($d,$e)=(2*$d*$e+$a,$e**2 -$d**2+$b);$c=$d**2+$e**2>4?$d=8:_}1..50;print$c}0..59;print$/}0..20