Re: compare two text file line by line, how to optimise

The following code builds two 42 MByte test files (2 million lines each), then runs the analysis. The analysis phase takes about three and a half minutes to run.

#!/usr/bin/perl
use strict;
use warnings;

my $kTestLines = 2000000;
my $kMinSame   = 2;
my $testA      = 'testA.txt';
my $testB      = 'testB.txt';

srand (1);
buildTestFile($_) for $testA, $testB;

open my $inA, '<', $testA or die "Can't open $testA: $!\n";
open my $inB, '<', $testB or die "Can't open $testB: $!\n";

my %aKeys;

print scalar localtime, "\n";

while (my $aLine = <$inA>) {
    chomp $aLine;

    # split ' ' skips leading whitespace, so no empty-string keys are created
    my @keys = map {lc} split ' ', $aLine;

    push @{$aKeys{$_}}, $. for @keys;
}

while (my $bLine = <$inB>) {
    chomp $bLine;

    # split ' ' treats runs of whitespace as one separator (no empty words)
    my @bWords = map {lc} split ' ', $bLine;
    my %lineAHits;

    ++$lineAHits{$_} for map {@{$aKeys{$_}}} grep {exists $aKeys{$_}} @bWords;

    my @matchALines = grep {+$lineAHits{$_} >= $kMinSame} keys %lineAHits;

    next if !@matchALines;

    printf "%s:\n    %s\n", join (', ', @matchALines), $bLine;
}

print scalar localtime, "\n";


sub buildTestFile {
    my ($fName) = @_;

    open my $fOut, '>', $fName or die "Can't create '$fName': $!\n";

    for (1 .. $kTestLines) {
        my %words = map {$_ => undef} map {
            join '', map {randLetter()} 1 .. 4
        } 1 .. 4;

        print $fOut join (' ', keys %words), "\n";
    }
}


sub randLetter {
    return chr 65 + rand 26;
}

Prints (with about 2800 lines omitted):

Sat Feb 27 11:43:38 2016
704856:
    CYXG GWVB OYLX YNWJ
849378:
    ECML DIYS APPF OYLR
...
1090468:
    VSIR CKVJ GWIV IOXN
1327692:
    YJOQ YOJT NCZL VCSA
740815:
    ZZJN WYVG EETN QADD
Sat Feb 27 11:47:08 2016
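
For readers who find the full script dense, the matching technique it uses can be sketched on a few in-memory lines. This is only an illustration: the sample data and the names @fileA, @fileB and @report are made up, and the results are sorted here purely to make the output predictable.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the two input files
my @fileA = ('red green blue', 'cat dog bird', 'red dog fish');
my @fileB = ('green red moon', 'fish sun star', 'dog bird red');
my $kMinSame = 2;

# Pass 1: index file A as word => [line numbers containing that word]
my %aKeys;
for my $i (0 .. $#fileA) {
    push @{$aKeys{lc $_}}, $i + 1 for split ' ', $fileA[$i];
}

# Pass 2: for each file B line, count the words it shares with each A line
my @report;
for my $bLine (@fileB) {
    my %lineAHits;
    for my $word (map {lc} split ' ', $bLine) {
        next if !exists $aKeys{$word};
        ++$lineAHits{$_} for @{$aKeys{$word}};
    }

    # Keep only A lines sharing at least $kMinSame words with this B line
    my @matchALines = sort {$a <=> $b}
        grep {$lineAHits{$_} >= $kMinSame} keys %lineAHits;
    push @report, join(', ', @matchALines) . ": $bLine" if @matchALines;
}

print "$_\n" for @report;
# prints:
# 1: green red moon
# 2, 3: dog bird red
```

The point of the index is that each line of the second file is looked up in a hash instead of being compared against every line of the first file.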

Premature optimization is the root of all job security


Re^2: compare two text file line by line, how to optimise by thespirit (Novice) on Feb 27, 2016 at 12:11 UTC
Thank you for this code, which I don't understand :) I don't understand the utility of sub buildTestFile; can you please explain? This code is really hard for me to understand. What is the utility of %words, which we don't use in any other part of the code, especially when we use the grep? Thank you
Re^3: compare two text file line by line, how to optimise by hippo (Archbishop) on Feb 28, 2016 at 10:19 UTC
"I don't understand the utility of sub buildTestFile, please can you explain ?"

GrandFather has posted an SSCCE (Short, Self-Contained, Correct Example), which is the best way to illustrate a situation in code. Rather than distributing countless MB of input data (which would have been rather impolite), the SSCCE builds the files on the fly. This is what `buildTestFile` does.

"What is the utility of %words, which we don't use in any other part of the code"

Using the hash forces uniqueness, as this is a property of hash keys.
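
A minimal sketch of that uniqueness property, using a made-up word list (the names @generated and @unique are illustrative and do not appear in the original script):

```perl
use strict;
use warnings;

# Hypothetical input containing a repeated word
my @generated = qw(ABCD WXYZ ABCD QRST);

# Same idiom as buildTestFile: words become hash keys, values are unused
my %words = map {$_ => undef} @generated;

# Hash key order is not defined, so sort for a stable display
my @unique = sort keys %words;
print "@unique\n";    # prints: ABCD QRST WXYZ
```

The repeated ABCD collapses into a single key, which is why each generated test line contains distinct words.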
Re^4: compare two text file line by line, how to optimise by thespirit (Novice) on Feb 28, 2016 at 11:09 UTC
Hi. But I don't want to test! I have data, and I am looking for real results, not random output. When I remove buildTestFile and use just the rest of the code, it takes days to process 50 MB of data, not the 3 minutes stated. This code is also slow, like all the others, on my computer with 2 GB of RAM :( Regards
Re^5: compare two text file line by line, how to optimise by poj (Abbot) on Feb 28, 2016 at 12:57 UTC
Re^6: compare two text file line by line, how to optimise by thespirit (Novice) on Feb 28, 2016 at 13:25 UTC
Some notes below your chosen depth have not been shown here
Re^2: compare two text file line by line, how to optimise by thespirit (Novice) on Mar 02, 2016 at 22:41 UTC
I tested this code. I removed the sub buildTestFile because, as I understand it, it only takes part of the file, and this code is not as quick as the author said! It is very slow, like the other code, and it uses a great amount of RAM, 900 MB; my old code uses only 100 MB.
Re^3: compare two text file line by line, how to optimise by GrandFather (Saint) on Mar 03, 2016 at 06:44 UTC
How long did the test code as written take to run on your computer? From the information you have given us so far, it looks like the only answer is to get a modern computer with enough memory to make the task faster. Many ways of making algorithms faster involve using more memory (which is fast) to avoid doing as much disk I/O (which is slow). If your computer doesn't have enough memory, there may be no way to speed up the processing. That said, if we knew why you are trying to do this search, we might be able to suggest a better solution.

Premature optimization is the root of all job security
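
If memory rather than time turns out to be the limit, one possible variation is to keep each word's line numbers in a packed string instead of an array of scalars, since every array element carries its own overhead while pack 'N' uses four bytes per number. This is a sketch under assumptions, not something measured in this thread: the sample data and the name @redLines are made up, and line numbers are assumed to fit in 32 bits.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the first input file
my @fileA = ('red green blue', 'cat dog bird', 'red dog fish');

# Index: word => string of packed 32-bit line numbers (4 bytes each)
my %aKeys;
for my $i (0 .. $#fileA) {
    for my $word (map {lc} split ' ', $fileA[$i]) {
        $aKeys{$word} .= pack 'N', $i + 1;
    }
}

# Unpack on lookup to recover the list of line numbers
my @redLines = unpack 'N*', $aKeys{red};
print "@redLines\n";    # prints: 1 3
```

The trade-off is extra CPU time for packing and unpacking on every access, so whether it helps depends on the data and the machine.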