Run this simple program, which does only minimal processing, against your data and post the results. This will help eliminate one potential source of your problem (I/O) and give a better indication of your data than just a size of 50M.

#!/usr/bin/perl
use strict;

my $t0 = time;
my $file1 = $ARGV[0] || 'ficc.txt';
my $file2 = $ARGV[1] || 'fic.txt';

# count lines and words in the first file
my $count1=0; my $words1=0;
open FICC,'<',$file1 or die "$file1 : $!";
while (<FICC>) {
  my @words = split /\s+/,lc $_;
  $words1 += @words;
  ++$count1;
}
close FICC;

# count lines and words in the second file
my $count2=0; my $words2=0;
open FIC,'<',$file2 or die "$file2 : $!";
while (<FIC>) {
  my @words = split /\s+/,lc $_;
  $words2 += @words;
  ++$count2;
}
close FIC;

my $dur = int time-$t0;
print "
File1 : $count1 lines $words1 words in $file1
File2 : $count2 lines $words2 words in $file2
Time  : $dur seconds\n";
poj

Re^6: compare two text file line by line, how to optimise
by thespirit (Novice) on Feb 28, 2016 at 13:25 UTC
    Hi,

    File1 : 3874004 lines 6050371 words in file1
    File2 : 4305242 lines 6457863 words in file2
    Time  : 33 seconds

    Thanks

      OK, now try this version with grep added:

      #!/usr/bin/perl
      use strict;

      my $t0 = time;
      my $file1 = $ARGV[0] || 'ficc.txt';
      my $file2 = $ARGV[1] || 'fic.txt';

      # first file : count lines and words, record each distinct word
      my %uniq1=();
      my $count1=0; my $words1=0;
      open FICC,'<',$file1 or die "$file1 : $!";
      while (<FICC>) {
        my @words = split /\s+/,lc $_;
        ++$uniq1{$_} for @words;
        $words1 += @words;
        ++$count1;
      }
      close FICC;
      my $uniq1 = scalar keys %uniq1;

      # second file : same counts, plus a grep against the words of file 1
      my %uniq2=();
      my $count2=0; my $words2=0;
      open FIC,'<',$file2 or die "$file2 : $!";
      while (my $line = <FIC>) {
        my @words = split /\s+/,lc $line;
        ++$uniq2{$_} for @words;
        $words2 += @words;
        ++$count2;
        my @match = grep $uniq1{$_}, @words;
      }
      close FIC;
      my $uniq2 = scalar keys %uniq2;

      my $dur = int time-$t0;
      print "
      File1 : $count1 lines $words1 words $uniq1 unique in $file1
      File2 : $count2 lines $words2 words $uniq2 unique in $file2
      Time  : $dur seconds\n";
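
To see what the added grep step does on its own, here is a minimal sketch on a few in-memory lines instead of the real files. The arrays @file1_lines and @file2_lines and their contents are made up for illustration, and it uses split ' ' (which skips leading whitespace) rather than split /\s+/, so treat it as an approximation of the test script, not a replacement for it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data standing in for the two files.
my @file1_lines = ("the quick brown fox", "jumps over");
my @file2_lines = ("The lazy dog", "fox jumps high");

# Build a lookup hash of every lowercased word in the first data set.
my %seen1;
for my $line (@file1_lines) {
    ++$seen1{$_} for split ' ', lc $line;
}

# For each line of the second data set, keep only the words
# that also occur somewhere in the first data set.
my @matches;
for my $line (@file2_lines) {
    my @words = split ' ', lc $line;
    push @matches, grep { $seen1{$_} } @words;
}

print "matched: @matches\n";
```

Each word costs one hash lookup, so the grep itself should add little time compared with reading and splitting the lines.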

      These are the results for my i5-2500K

      File1 : 4000000 lines 7998273 words 6379952 unique in ficc1.txt
      File2 : 4000000 lines 11999843 words 9364684 unique in fic1.txt
      Time  : 37 seconds
      
      poj

        File1 : 3874004 lines 6050371 words 2413 unique in trans3
        File2 : 4305242 lines 6457863 words 2313 unique in gh3-3.n
        Time  : 96 seconds

        I work with an old Core 2 Duo T7100, 1.8 GHz.