compare two text file line by line, how to optimise

thespirit has asked for the wisdom of the Perl Monks concerning the following question:

I have a program that reads two text files and compares each line word by word. If two lines share a certain number of terms, the program displays the line.

#!/usr/bin/perl

print "bonjour\n";
open(FIC, $ARGV[0]);
open(FICC, $ARGV[1]);

  my @a = ();
  my @b = ();
  my $l=0;
  my $v=0;
  my $g=0;
my $h=0;
my $t=0;
my $q=0;
print "choose the outpu file name\n";


chomp(my $fic2=<STDIN>);
open(FIC2, ">$fic2");

#---------------------------------------------------
# initialisation des variables
#---------------------------------------------------

$i=0;
$j=0;
$u=0;
$v=0;
$t=0;
$kk=0;
$total=0;

while (<FICC>) {
my $ligne=$_;

$b[$i]=lc($ligne);




$i++;
}




while (<FIC>) {
my $ligne=$_;

$a[$j]=lc($ligne);



$j++;
}

foreach my $che(@b){

chomp($che);


@aa=split(/\s/,$che);
$u++;




foreach my $kh(@a){


 
chomp($kh);

@bb=split(/\s/,$kh);




$v++;

 $t=0;$total=0;
for ($l=0;$l<=$#bb;$l++){



for ($m=0;$m<=$#aa;$m++){

# here compare the word of each line

if(($bb[$l] eq $aa[$m]) ){

$t++;

$m++;
$kk=1;

# if the tow termes are identical so $kk=1;
goto che
}


}
che:

if($kk==1)
{

#calculate the number of identic terms per line with $number

$total++; 
}

$kk=0;
}
 #print the retrieved line
 print FIC2 "$u: $kh\n";
}

$v=0;

}

The problem with this code is that it is too slow with files of about 50 MB. How can I speed it up? Thank you.

File 1 contains many lines, for example:

chirac prime paris
chirac prime jacques
chirac prime president
chirac paris france
chirac paris french

File 2:

chirac presidential migration
chirac presidential paris
chirac prime president
chirac presidential 007
chirac paris migration
chirac paris french

Output:

1: chirac prime paris
1: chirac prime jacques
1: chirac prime president 
1: chirac paris france
1: chirac paris french
2: chirac prime paris
2: chirac prime jacques
2: chirac prime president 
2: chirac paris france
2: chirac paris french
3: chirac prime paris
3: chirac prime jacques
3: chirac prime president 
3: chirac paris france
3: chirac paris french
4: chirac prime paris
4: chirac prime jacques
4: chirac prime president 
4: chirac paris france
4: chirac paris french
5: chirac prime paris
5: chirac prime jacques
5: chirac prime president 
5: chirac paris france
5: chirac paris french
6: chirac prime paris
6: chirac prime jacques
6: chirac prime president 
6: chirac paris france
6: chirac paris french


Replies are listed 'Best First'.
Re: compare two text file line by line, how to optimise by NetWallah (Canon) on Feb 24, 2016 at 22:02 UTC
This code could be considerably shortened, and made easier to read as well as faster, if you redo it in idiomatic Perl. To get you started in that direction:

* Use meaningful variable names.
* Error-check each file open by adding open(...) or die "Could not open <filename>: $!";
* Do not repeatedly loop over the same arrays. Do all necessary processing in a single loop over the array.
* Use appropriate data structures: use a hash for word lookups.
* Use subroutines for repeated code.

In addition, for questions to PM, be clear on what the inputs and expected outputs look like. Also, a clear problem statement and a polite request would probably induce someone to write it for you, since we have seen that you have put effort into doing this.

"Think of how stupid the average person is, and realize half of them are stupider than that." - George Carlin
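As a rough illustration of the points above (not NetWallah's own code), the following sketch shows an error-checked open, a single pass over a file, and a hash used for word lookups. The helper names `read_lines` and `word_set` are hypothetical and were chosen only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): read a file into an array of
# chomped, lower-cased lines in a single pass, with an error-checked open.
sub read_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Could not open $path: $!";
    my @lines;
    while (my $line = <$fh>) {
        chomp $line;
        push @lines, lc $line;
    }
    close $fh;
    return @lines;
}

# Hypothetical helper: build a hash whose keys are the words of one line,
# so that a word lookup is a single `exists` test instead of an inner loop.
sub word_set {
    my ($line) = @_;
    my %set = map { $_ => 1 } split ' ', $line;
    return \%set;
}

my $set = word_set('chirac prime paris');
for my $word (qw(paris france)) {
    my $status = exists $set->{$word} ? 'found' : 'missing';
    print "$word: $status\n";
}
```

The hash lookup costs roughly the same no matter how many words the line holds, which is why it can replace the innermost `for` loop of the original program.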
Re^2: compare two text file line by line, how to optimise by thespirit (Novice) on Feb 26, 2016 at 11:29 UTC
Hi,

Sorry if my message disturbed you! In other forums, the moderators don't accept terms like "hello" or "hi", so I assumed that this forum also wants only the subject and the question.

Regards
Re^3: compare two text file line by line, how to optimise by NetWallah (Canon) on Feb 26, 2016 at 18:59 UTC
"Hello", "Hi" and other greetings are acceptable, but not necessary, as a part of the message content. Avoid putting them in the SUBJECT, because they distract at a place where you need to be concise. "Think of how stupid the average person is, and realize half of them are stupider than that." - George Carlin	[reply]
Re: compare two text file line by line, how to optimise by Discipulus (Canon) on Feb 24, 2016 at 22:26 UTC
Welcome, thespirit.

perltidy can beautify your code in a blink, but no one but you can put more effort into choosing variable names. I say this because it has happened to me: a few weeks passed and I could no longer understand my own code. It is even more important if someone else looks at your code: a well-chosen name can make the difference, and a few more keystrokes are worth it in the long run.

Anyway, if you just want the intersection of two lists, that is a FAQ: see How can I find the union/difference/intersection of two arrays?. Using `exists` on a hash of keys can be useful. If I've understood your question, you can try something like:

# untested
while (<$first_fh>){
    chomp;
    my %infirst;
    # every word becomes a key of a temporary hash
    map{$infirst{$_}++} split ' ',$_;
    my $ref_line = <$second_fh>;
    chomp $ref_line;
    print grep {exists $infirst{$_}} split ' ',$ref_line;
    die "$file_one exhausted!" if eof($first_fh);
    die "$file_two exhausted!" if eof($second_fh);
}

For large files, be careful never to evaluate `<$fh>` in list context: see the recent Re: How to read in large files.

L*
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
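The `exists` idea can be sketched for a single pair of lines as follows. This is only an illustration under assumptions: the helper name `count_common` and the threshold `$min_same` are hypothetical, and the sample lines are taken from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count how many distinct words two lines have in common.
sub count_common {
    my ($line_a, $line_b) = @_;
    # Every word of line A becomes a key of a temporary hash.
    my %in_a = map { $_ => 1 } split ' ', lc $line_a;
    my %seen;
    # Keep each word of line B once, and only if it also occurs in line A.
    my @common = grep { exists $in_a{$_} && !$seen{$_}++ } split ' ', lc $line_b;
    return scalar @common;
}

my $min_same = 2;    # assumed threshold, chosen only for this example
my @pairs = (
    [ 'chirac prime paris',  'chirac presidential paris' ],
    [ 'chirac paris french', 'chirac presidential 007'   ],
);
for my $pair (@pairs) {
    my $n = count_common(@$pair);
    printf "%d common word(s): %s\n", $n, $n >= $min_same ? 'match' : 'no match';
}
```

Comparing every line of one file with every line of the other this way is still quadratic in the number of lines; the hash only removes the innermost word-by-word loop.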
Re: compare two text file line by line, how to optimise by GrandFather (Saint) on Feb 25, 2016 at 02:50 UTC
Cleaned up and using a test harness (i.e. some internal test data and print to stdout), your code looks like this:

#!/usr/bin/perl
use strict;
use warnings;

my $file1 = <<FILE;
chirac prime paris
chirac prime jacques
chirac prime president
chirac paris france
chirac paris french
FILE

my $file2 = <<FILE;
chirac presidential migration
chirac presidential paris
chirac prime president
chirac presidential 007
chirac paris migration
chirac paris french
FILE

#open my $inA, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!\n";
#open my $inB, '<', $ARGV[1] or die "Can't open $ARGV[1]: $!\n";
open my $inA, '<', \$file1;
open my $inB, '<', \$file2;

#print "bonjour\n";
#print "choose the output file name\n";
#
#chomp(my $fic2 = <STDIN>);
#open my $outFile, '>', $fic2 or die "Can't create $fic2: $!\n";

my @aLines;

while (my $ligne = <$inA>) {
    chomp $ligne;
    push @aLines, lc($ligne);
}

while (my $che = <$inB>) {
    chomp $che;

    my @bWords = split(/\s/, $che);

    foreach my $kh (@aLines) {
        my @aWords = split(/\s/, $kh);
        my $total = 0;

        for my $bWord (@bWords) {
            my $matched;

            for my $aWord (@aWords) {
                $matched = $bWord eq $aWord;
                last if $matched;
            }

            $total++ if $matched;
        }

        #print the retrieved line
        #print $outFile "$.: $kh\n";
        print "$.: $kh\n";
    }
}

Prints:

1: chirac prime paris
1: chirac prime jacques
1: chirac prime president
1: chirac paris france
1: chirac paris french
2: chirac prime paris
2: chirac prime jacques
2: chirac prime president
2: chirac paris france
2: chirac paris french
3: chirac prime paris
3: chirac prime jacques
3: chirac prime president
3: chirac paris france
3: chirac paris french
4: chirac prime paris
4: chirac prime jacques
4: chirac prime president
4: chirac paris france
4: chirac paris french
5: chirac prime paris
5: chirac prime jacques
5: chirac prime president
5: chirac paris france
5: chirac paris french
6: chirac prime paris
6: chirac prime jacques
6: chirac prime president
6: chirac paris france
6: chirac paris french

which doesn't work as advertised: every line of the first file is printed for every line of the second, because $total is counted but never tested. Making this code generate the wrong answer more quickly is possible, but probably not what you actually want to do!

Maybe you should more fully describe what it is you want to achieve? Are you looking for matching lines (in which case the word matching stuff and nested loops is bogus), or do you want to match lines that have some minimum number of matching words, or something else? We can't tell unless you tell us. Tell us why you are doing this and we may be able to make better guesses!

Premature optimization is the root of all job security
Re^2: compare two text file line by line, how to optimise by thespirit (Novice) on Feb 26, 2016 at 14:26 UTC
Thanks again. What I want exactly is to match lines that have a minimum number of matching words.
Re^2: compare two text file line by line, how to optimise by thespirit (Novice) on Feb 26, 2016 at 11:28 UTC
Thank you for your reply. Your code is well written and concise :) This is what my code does, and the output that I wrote in the post is an error. I tested your code; it is as slow as mine, because it does exactly the same processing. What do you think about storing the second file, or both files, in a hash table?
Re^3: compare two text file line by line, how to optimise by poj (Abbot) on Feb 26, 2016 at 12:27 UTC
"the output that i wrote in the post is an error"

But you haven't shown what the correct output should be, so we can only guess what you are trying to do. Here's my guess, matching a combination of words from FIC with lines in FICC:

#!/usr/bin/perl
use strict;

my @FIC = ();
#open FIC,'<','fic.txt' or die "$!";
#while (my $line = <FIC>){
#  next unless $line =~ /\S/;
#  my @words = split /\s+/,$line;
#  push @FIC,[ @words ];
#}
#close FIC;

@FIC = (
  [ qw(chirac prime paris)],
  [ qw(chirac prime jacques) ],
  [ qw(chirac prime president) ],
  [ qw(chirac paris france) ],
  [ qw(chirac paris french) ],
);

my $u=0;
open FICC,'<','ficc.txt' or die "$!";
#open OUT, '>','output.txt' or die "$!";
while (my $line = <FICC>){
  ++$u;
  next unless $line =~ /\S/; # skip blank lines
  for my $ar (@FIC){
    my @matched = grep $line=~/$_/,@$ar;
    if (@matched == @$ar){
      print "$u: $line matched all words : @matched\n\n";
      #print OUT "$u: $line matched all words : @matched\n\n";
      last;
    }
  }
}
close FICC;
#close OUT

__DATA__
chirac presidential migration
chirac presidential paris
jacques chirac has been the prime minster and the president
chirac presidential 007
chirac paris migration
chirac aaa french bbb paris ccc

poj
Re^4: compare two text file line by line, how to optimise by thespirit (Novice) on Feb 26, 2016 at 14:23 UTC
Re^5: compare two text file line by line, how to optimise by poj (Abbot) on Feb 27, 2016 at 13:30 UTC
Re: compare two text file line by line, how to optimise by stevieb (Canon) on Feb 24, 2016 at 21:20 UTC
Welcome to the Monastery, thespirit!

Please provide a small sample of your input file, and an example of expected output. Put these in code tags as you've done with your code.

It'd also help us (and you) if you used saner indenting. Typically this is one indent per block (a tab usually, often set to 4 spaces, but that part isn't necessary).
Re: compare two text file line by line, how to optimise by GrandFather (Saint) on Feb 26, 2016 at 23:44 UTC
The following code builds two 42 MByte test files (2 million lines each), then runs the analysis. The analysis phase takes about three minutes to run.

#!/usr/bin/perl
use strict;
use warnings;

my $kTestLines = 2000000;
my $kMinSame = 2;
my $testA = 'testA.txt';
my $testB = 'testB.txt';

srand (1);
buildTestFile($_) for $testA, $testB;

open my $inA, '<', $testA or die "Can't open $testA: $!\n";
open my $inB, '<', $testB or die "Can't open $testB: $!\n";
my %aKeys;

print scalar localtime, "\n";

while (my $aLine = <$inA>) {
    chomp $aLine;

    my @keys = map {lc} split /\s+/, $aLine;

    push @{$aKeys{$_}}, $. for @keys;
}

while (my $bLine = <$inB>) {
    chomp $bLine;

    my @bWords = map {lc} split (/\s/, $bLine);
    my %lineAHits;

    ++$lineAHits{$_} for map {@{$aKeys{$_}}} grep {exists $aKeys{$_}} @bWords;

    my @matchALines = grep {+$lineAHits{$_} >= $kMinSame} keys %lineAHits;

    next if !@matchALines;

    printf "%s:\n %s\n", join (', ', @matchALines), $bLine;
}

print scalar localtime, "\n";

sub buildTestFile {
    my ($fName) = @_;

    open my $fOut, '>', $fName or die "Can't create '$fName': $!\n";

    for (1 .. $kTestLines) {
        my %words = map {$_ => undef}
            map {
                join '', map {randLetter()} 1 .. 4
            } 1 .. 4;

        print $fOut join (' ', keys %words), "\n";
    }
}

sub randLetter {
    return chr 65 + rand 26;
}

Prints (with about 2800 lines omitted):

Sat Feb 27 11:43:38 2016
704856:
 CYXG GWVB OYLX YNWJ
849378:
 ECML DIYS APPF OYLR
...
1090468:
 VSIR CKVJ GWIV IOXN
1327692:
 YJOQ YOJT NCZL VCSA
740815:
 ZZJN WYVG EETN QADD
Sat Feb 27 11:47:08 2016

Premature optimization is the root of all job security
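The idea behind `%aKeys` (an inverted index from each word to the lines of the first file that contain it) may be easier to follow on a few in-memory lines. The sketch below is an illustration under assumptions, not GrandFather's code: the helper names `build_index` and `matching_lines` are hypothetical, and unlike the code above it counts each distinct word of a line only once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map each word to the list of (1-based) line numbers
# of file A that contain it. This is the role %aKeys plays in the post above.
sub build_index {
    my @lines = @_;
    my %index;
    for my $n (1 .. @lines) {
        # De-duplicate the words of the line so each line is listed once per word.
        my %uniq = map { $_ => 1 } split ' ', lc $lines[$n - 1];
        push @{ $index{$_} }, $n for keys %uniq;
    }
    return \%index;
}

# Hypothetical helper: return the line numbers of A that share at least
# $min_same distinct words with $line_b, in ascending order.
sub matching_lines {
    my ($index, $line_b, $min_same) = @_;
    my %uniq = map { $_ => 1 } split ' ', lc $line_b;
    my %hits;
    for my $word (grep { exists $index->{$_} } keys %uniq) {
        ++$hits{$_} for @{ $index->{$word} };
    }
    return sort { $a <=> $b } grep { $hits{$_} >= $min_same } keys %hits;
}

my @file_a = ('chirac prime paris', 'chirac prime jacques', 'paris france');
my $index  = build_index(@file_a);
for my $line_b ('chirac presidential paris', 'jacques prime minister') {
    my @found = matching_lines($index, $line_b, 2);
    printf "%s => %s\n", $line_b, @found ? join(', ', @found) : 'none';
}
```

Each line of the second file is compared only against lines of the first file that share at least one word with it, rather than against every line. The price is that the whole index must sit in memory, and a word that occurs on most lines (such as "chirac" in the sample data) still yields a long list of candidates.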
Re^2: compare two text file line by line, how to optimise by thespirit (Novice) on Feb 27, 2016 at 12:11 UTC
Thank you for this code, which I don't understand :) I don't understand the purpose of sub buildTestFile; can you please explain? This code is really hard for me to understand. What is the purpose of %words, which we don't use in any other part of the code, especially when we use the grep? Thank you
Re^3: compare two text file line by line, how to optimise by hippo (Archbishop) on Feb 28, 2016 at 10:19 UTC
"I don't understand the utility of sub buildTestFile, please can you explain ?"

GrandFather has posted an SSCCE, which is the best way to illustrate a situation in code. Rather than distribute countless MB of data as the input (which would have been rather impolite), the SSCCE builds the input files on the fly. This is what `buildTestFile` does.

"What is the utility of %words! that we don't use in any other part of the code"

Using the hash forces uniqueness, as this is a property of hash keys.
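A small sketch of that property, using a hypothetical helper `unique_words` that is not part of the thread's code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the distinct items of a list, sorted.
# Storing each item as a hash key collapses duplicates, because a hash
# holds any given key at most once.
sub unique_words {
    my %words = map { $_ => undef } @_;
    return sort keys %words;
}

my @words = unique_words(qw(ABCD WXYZ ABCD QRST));
print scalar(@words), " distinct words: @words\n";
```

In `buildTestFile` the same trick means that a generated line never contains the same random word twice, so a line of the test file may occasionally hold fewer than four words.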
Re^4: compare two text file line by line, how to optimise by thespirit (Novice) on Feb 28, 2016 at 11:09 UTC
Re^5: compare two text file line by line, how to optimise by poj (Abbot) on Feb 28, 2016 at 12:57 UTC
Some notes below your chosen depth have not been shown here
Re^2: compare two text file line by line, how to optimise by thespirit (Novice) on Mar 02, 2016 at 22:41 UTC
I tested this code. I removed the sub buildTestFile because, as I understand it, it takes only part of the file. This code is not as quick as the author said! It is very slow, like the other code, and it takes a great amount of RAM: 900 MB, whereas my old code takes only 100 MB.
Re^3: compare two text file line by line, how to optimise by GrandFather (Saint) on Mar 03, 2016 at 06:44 UTC
How long did the test code as written take to run on your computer?

From the information you have given us so far, it looks like the only answer is to get a modern computer with sufficient memory to allow using the memory to make the task faster. Many ways of making algorithms faster involve using more memory (which is fast) to avoid having to do as much disk I/O (which is slow). If your computer doesn't have enough memory then there may be no way to speed up the processing.

That said, if we knew why you are trying to do this search we may be able to suggest a better solution.

Premature optimization is the root of all job security