Re^5: compare two text file line by line, how to optimise

Run this simple program, which does only minimal processing, against your data and post the results. This will help eliminate one potential source of your problem (I/O) and give a better picture of your data than just a size of 50M.

#!/usr/bin/perl
use strict;
my $t0 = time;

my $file1 = $ARGV[0] || 'ficc.txt';
my $file2 = $ARGV[1] || 'fic.txt';

my $count1=0; my $words1=0;
open FICC,'<',$file1 or die "$file1 : $!";
while (<FICC>) {
  my @words = split /\s+/,lc $_;
  $words1 += @words;
  ++$count1;
}
close FICC;

my $count2=0; my $words2=0;
open FIC,'<',$file2 or die "$file2 : $!";
while (<FIC>) {
  my @words = split /\s+/,lc $_;
  $words2 += @words;
  ++$count2;
}
close FIC;

my $dur = int time-$t0;
print "
File1 : $count1 lines $words1 words in $file1
File2 : $count2 lines $words2 words in $file2
Time  : $dur seconds\n";

poj


Re^6: compare two text file line by line, how to optimise by thespirit (Novice) on Feb 28, 2016 at 13:25 UTC
hi

File1 : 3874004 lines 6050371 words in file1
File2 : 4305242 lines 6457863 words in file2
Time  : 33 seconds

Thanks
Re^7: compare two text file line by line, how to optimise by poj (Abbot) on Feb 28, 2016 at 13:47 UTC
Ok, now try this with grep added

#!/usr/bin/perl
use strict;
my $t0 = time;

my $file1 = $ARGV[0] || 'ficc.txt';
my $file2 = $ARGV[1] || 'fic.txt';

my %uniq1=();
my $count1=0; my $words1=0;
open FICC,'<',$file1 or die "$file1 : $!";
while (<FICC>) {
  my @words = split /\s+/,lc $_;
  ++$uniq1{$_} for @words;
  $words1 += @words;
  ++$count1;
}
close FICC;
my $uniq1 = scalar keys %uniq1;

my %uniq2=();
my $count2=0; my $words2=0;
open FIC,'<',$file2 or die "$file2 : $!";
while (my $line = <FIC>) {
  my @words = split /\s+/,lc $line;
  ++$uniq2{$_} for @words;
  $words2 += @words;
  ++$count2;
  my @match = grep $uniq1{$_}, @words;
}
close FIC;
my $uniq2 = scalar keys %uniq2;

my $dur = int time-$t0;
print "
File1 : $count1 lines $words1 words $uniq1 unique in $file1
File2 : $count2 lines $words2 words $uniq2 unique in $file2
Time  : $dur seconds\n";

These are the results for my i5-2500K

File1 : 4000000 lines 7998273 words 6379952 unique in ficc1.txt
File2 : 4000000 lines 11999843 words 9364684 unique in fic1.txt
Time  : 37 seconds

poj
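For anyone who wants to see what the added grep line does without two multi-million-line files, here is a minimal sketch on in-memory data. The sample lines and the @matches array are made up for illustration and are not part of the script above. It also uses split ' ' rather than split /\s+/, on the assumption that an empty first word from leading whitespace is not wanted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data standing in for the two input files
my @lines1 = ("the quick brown fox", "jumps over");
my @lines2 = ("The lazy dog", "quick fox runs");

# Build a lookup hash of every lower-cased word in the first data set
my %uniq1;
for my $line (@lines1) {
    ++$uniq1{$_} for split ' ', lc $line;
}

# For each line of the second data set, keep only the words
# that also occur somewhere in the first data set
my @matches;
for my $line (@lines2) {
    my @words = split ' ', lc $line;
    my @match = grep { $uniq1{$_} } @words;   # hash lookup per word
    push @matches, \@match;
}

print join(' ', @$_), "\n" for @matches;
```

Each word costs one hash lookup, so the work per line grows with the number of words on that line and not with the size of the first file.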
Re^8: compare two text file line by line, how to optimise by thespirit (Novice) on Feb 28, 2016 at 15:51 UTC
File1 : 3874004 lines 6050371 words 2413 unique in trans3
File2 : 4305242 lines 6457863 words 2313 unique in gh3-3.n
Time  : 96 seconds

I work with an old Core 2 Duo T7100, 1.8 GHz
Re^9: compare two text file line by line, how to optimise by poj (Abbot) on Feb 28, 2016 at 16:05 UTC
Re^10: compare two text file line by line, how to optimise by thespirit (Novice) on Feb 28, 2016 at 16:31 UTC
Some notes below your chosen depth have not been shown here