Re^4: compare two text file line by line, how to optimise

Re^5: compare two text file line by line, how to optimise by poj (Abbot) on Feb 28, 2016 at 12:57 UTC
Run this simple program, which does only minimal processing, against your data and post the results. This will help eliminate one potential source of your problem (I/O) and give a better indication of your data than just a size of 50M.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $t0 = time;
    my $file1 = $ARGV[0] || 'ficc.txt';
    my $file2 = $ARGV[1] || 'fic.txt';

    # Count lines and words in the first file
    my $count1 = 0;
    my $words1 = 0;
    open FICC, '<', $file1 or die "$file1 : $!";
    while (<FICC>) {
        my @words = split /\s+/, lc $_;
        $words1 += @words;
        ++$count1;
    }
    close FICC;

    # Count lines and words in the second file
    my $count2 = 0;
    my $words2 = 0;
    open FIC, '<', $file2 or die "$file2 : $!";
    while (<FIC>) {
        my @words = split /\s+/, lc $_;
        $words2 += @words;
        ++$count2;
    }
    close FIC;

    my $dur = int time - $t0;
    print "File1 : $count1 lines $words1 words in $file1\n";
    print "File2 : $count2 lines $words2 words in $file2\n";
    print "Time  : $dur seconds\n";

poj
Re^6: compare two text file line by line, how to optimise by thespirit (Novice) on Feb 28, 2016 at 13:25 UTC
hi

    File1 : 3874004 lines 6050371 words in file1
    File2 : 4305242 lines 6457863 words in file2
    Time  : 33 seconds

Thanks
Re^7: compare two text file line by line, how to optimise by poj (Abbot) on Feb 28, 2016 at 13:47 UTC
Ok, now try this, with grep added:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $t0 = time;
    my $file1 = $ARGV[0] || 'ficc.txt';
    my $file2 = $ARGV[1] || 'fic.txt';

    # First file: count lines and words, and record each distinct word
    my %uniq1  = ();
    my $count1 = 0;
    my $words1 = 0;
    open FICC, '<', $file1 or die "$file1 : $!";
    while (<FICC>) {
        my @words = split /\s+/, lc $_;
        ++$uniq1{$_} for @words;
        $words1 += @words;
        ++$count1;
    }
    close FICC;
    my $uniq1 = scalar keys %uniq1;

    # Second file: same counts, plus a hash lookup of every word against file 1
    my %uniq2  = ();
    my $count2 = 0;
    my $words2 = 0;
    open FIC, '<', $file2 or die "$file2 : $!";
    while (my $line = <FIC>) {
        my @words = split /\s+/, lc $line;
        ++$uniq2{$_} for @words;
        $words2 += @words;
        ++$count2;
        my @match = grep $uniq1{$_}, @words;   # words also present in file 1
    }
    close FIC;
    my $uniq2 = scalar keys %uniq2;

    my $dur = int time - $t0;
    print "File1 : $count1 lines $words1 words $uniq1 unique in $file1\n";
    print "File2 : $count2 lines $words2 words $uniq2 unique in $file2\n";
    print "Time  : $dur seconds\n";

These are the results on my i5-2500K:

    File1 : 4000000 lines 7998273 words 6379952 unique in ficc1.txt
    File2 : 4000000 lines 11999843 words 9364684 unique in fic1.txt
    Time  : 37 seconds

poj
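A possible follow-up, not part of the scripts above: `time` only has one-second resolution, so if you want to see which phase (building the hash or the grep lookups) costs more, something like this rough sketch with the core Time::HiRes module might help. The `timed` helper and the in-memory sample lines are made up for illustration, and it uses `split ' '` rather than `split /\s+/` so that leading whitespace does not produce an empty first word.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: run a code block, return (elapsed seconds, result).
sub timed {
    my ($code) = @_;
    my $t0     = [gettimeofday];
    my $result = $code->();
    return ( tv_interval($t0), $result );
}

# Tiny in-memory stand-ins for the two files (made-up sample data).
my @file1 = ( "Alpha beta\n", "gamma\n" );
my @file2 = ( "beta delta\n", "epsilon GAMMA alpha\n" );

# Phase 1: build the lookup hash from the first data set.
my ( $t_build, $uniq1 ) = timed( sub {
    my %seen;
    for my $line (@file1) {
        ++$seen{$_} for split ' ', lc $line;
    }
    return \%seen;
} );

# Phase 2: count words of the second data set that occur in the hash.
my ( $t_match, $matches ) = timed( sub {
    my $n = 0;
    for my $line (@file2) {
        $n += grep { $uniq1->{$_} } split ' ', lc $line;
    }
    return $n;
} );

printf "build : %.6f s (%d unique words)\n", $t_build, scalar keys %$uniq1;
printf "match : %.6f s (%d matching words)\n", $t_match, $matches;
```

To try it on real data, replace the two arrays with loops over the opened files; the per-phase timings should then show whether the time goes into reading and hashing or into the lookups.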
Re^8: compare two text file line by line, how to optimise by thespirit (Novice) on Feb 28, 2016 at 15:51 UTC
Re^9: compare two text file line by line, how to optimise by poj (Abbot) on Feb 28, 2016 at 16:05 UTC