in reply to compare two text file line by line, how to optimise
The following code builds two 42 MByte test files (2 million lines each), then runs the analysis. The analysis phase takes about three and a half minutes to run, going by the two timestamps the script prints.
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $kTestLines = 2000000;
    my $kMinSame   = 2;
    my $testA      = 'testA.txt';
    my $testB      = 'testB.txt';

    srand (1);
    buildTestFile($_) for $testA, $testB;

    open my $inA, '<', $testA or die "Can't open $testA: $!\n";
    open my $inB, '<', $testB or die "Can't open $testB: $!\n";
    my %aKeys;

    print scalar localtime, "\n";

    while (my $aLine = <$inA>) {
        chomp $aLine;
        my @keys = map {lc} split /\s+/, $aLine;
        push @{$aKeys{$_}}, $. for @keys;
    }

    while (my $bLine = <$inB>) {
        chomp $bLine;
        my @bWords = map {lc} split (/\s/, $bLine);
        my %lineAHits;

        ++$lineAHits{$_}
            for map {@{$aKeys{$_}}} grep {exists $aKeys{$_}} @bWords;

        my @matchALines = grep {$lineAHits{$_} >= $kMinSame} keys %lineAHits;

        next if !@matchALines;

        printf "%s:\n %s\n", join (', ', @matchALines), $bLine;
    }

    print scalar localtime, "\n";

    sub buildTestFile {
        my ($fName) = @_;

        open my $fOut, '>', $fName or die "Can't create '$fName': $!\n";

        for (1 .. $kTestLines) {
            my %words = map {$_ => undef}
                map {join '', map {randLetter()} 1 .. 4} 1 .. 4;

            print $fOut join (' ', keys %words), "\n";
        }
    }

    sub randLetter {
        return chr 65 + rand 26;
    }
Prints (with about 2800 lines omitted):
    Sat Feb 27 11:43:38 2016
    704856:
     CYXG GWVB OYLX YNWJ
    849378:
     ECML DIYS APPF OYLR
    ...
    1090468:
     VSIR CKVJ GWIV IOXN
    1327692:
     YJOQ YOJT NCZL VCSA
    740815:
     ZZJN WYVG EETN QADD
    Sat Feb 27 11:47:08 2016
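To make the technique easier to follow, here is a minimal sketch of the same idea on a few in-memory lines instead of files. It builds a word-to-line-number index for "file A", then counts how many words each candidate line shares with it. The helper names `buildIndex` and `matchingLines` and the sample data are made up for illustration and are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build an inverted index: lowercased word => list of 1-based line numbers.
# (Hypothetical helper, named for this sketch only.)
sub buildIndex {
    my @lines = @_;
    my %index;

    for my $lineNum (1 .. @lines) {
        push @{$index{lc $_}}, $lineNum for split ' ', $lines[$lineNum - 1];
    }

    return \%index;
}

# Return, in numeric order, the indexed line numbers that share at least
# $minSame words with $line. (Hypothetical helper, named for this sketch only.)
sub matchingLines {
    my ($index, $line, $minSame) = @_;
    my %hits;

    for my $word (map {lc} split ' ', $line) {
        next if !exists $index->{$word};
        ++$hits{$_} for @{$index->{$word}};
    }

    return sort {$a <=> $b} grep {$hits{$_} >= $minSame} keys %hits;
}

# Made-up sample data standing in for the two files.
my @fileA = ('alpha beta gamma', 'delta epsilon', 'beta gamma zeta');
my $index = buildIndex(@fileA);

for my $bLine ('Gamma beta', 'delta omega') {
    my @matches = matchingLines($index, $bLine, 2);
    print "$bLine: ", (@matches ? join(', ', @matches) : 'no match'), "\n";
}
```

With this sample data the sketch prints `Gamma beta: 1, 3` and `delta omega: no match`. As in the full script, a word repeated within one line is counted once per occurrence, so real data with duplicate words may need de-duplicating first.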
Re^2: compare two text file line by line, how to optimise
by thespirit (Novice) on Feb 27, 2016 at 12:11 UTC
by hippo (Archbishop) on Feb 28, 2016 at 10:19 UTC
by thespirit (Novice) on Feb 28, 2016 at 11:09 UTC
by poj (Abbot) on Feb 28, 2016 at 12:57 UTC
by thespirit (Novice) on Feb 28, 2016 at 13:25 UTC

Re^2: compare two text file line by line, how to optimise
by thespirit (Novice) on Mar 02, 2016 at 22:41 UTC
by GrandFather (Saint) on Mar 03, 2016 at 06:44 UTC