on matching content of two text files

by sarvan (Sexton)
on Aug 11, 2011 at 05:29 UTC (#919798=perlquestion)

sarvan has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I have the following script to compare two text files.
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $f1 = $ARGV[0];
    open FILE1, "$f1" or die "Could not open file \n";
    my $f2 = 'res1.txt';
    open FILE2, "$f2" or die "Could not open file \n";
    my $outfile = $ARGV[1];
    my @outlines;
    my $n  = 0;
    my $n1 = 0;
    my $n2 = 0;

    foreach (<FILE1>) {
        my $y = 0;
        my $outer_text = $_;
        seek(FILE2, 0, 0);
        foreach (<FILE2>) {
            my $inner_text = $_;
            if ($outer_text eq $inner_text) {
                $y = 1;
                print "$outer_text, Match found \n";
                $n++;
                last;
            }
        }
        if ($y != 1) {
            print "$outer_text,No Match Found \n";
            push(@outlines, $outer_text);
            $n1++;
        }
        $n2++;
    }

    print "Total No.of queries:$n2\n";
    print "No.of matched entries:$n\n";
    print "No.of Mis-matched entries:$n1\n";
    my $precision = $n / $n2;
    print "The precision is:$precision\n";

    open (OUTFILE, ">Nonmatch") or die "Cannot open $outfile for writing \n";
    print OUTFILE @outlines;
    close OUTFILE;
    close FILE1;
    close FILE2;

The script compares the two text files and reports the number of matches and the number of mismatches. It also writes the mismatched lines to a separate file.

My question is this: when comparing the two files, if the first file has a sentence on line 2 (for example) and the second file has an empty line at line 2, the script counts that as a mismatch as well. What I want is to differentiate the no-result case (i.e. an empty line) from a real mismatch.

How can I do that? Please suggest an approach. Thanks.

Replies are listed 'Best First'.
Re: on matching content of two text files
by jethro (Monsignor) on Aug 11, 2011 at 09:12 UTC

    You know that the running time of your algorithm grows quadratically, with the number of lines in one file multiplied by the number of lines in the other? And that you reread one of the files for every single line of the other file? As long as the two files are small that's OK, but as soon as the second file grows larger than your in-memory disk cache you will get running times of hours.

    Also what happens if one file has a line "x" and the other file has 100 lines "x"? Your algorithm will note that as a match even though 99 of the "x" lines in one file have no corresponding line in the other.
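To make the duplicate problem concrete, here is a minimal sketch that counts how often each line occurs on each side, so that 100 copies of "x" against a single "x" show up as 99 lines without a partner. The helper names `count_lines` and `unmatched_count` are made up for illustration, and in-memory arrays stand in for the two files.

```perl
use strict;
use warnings;

# Count how many times each line occurs (hypothetical helper).
sub count_lines {
    my @lines = @_;
    my %count;
    $count{$_}++ for @lines;
    return \%count;
}

# Number of lines in the first list that have no partner in the second,
# treating every occurrence as needing its own partner (hypothetical helper).
sub unmatched_count {
    my ($first, $second) = @_;
    my $c1 = count_lines(@$first);
    my $c2 = count_lines(@$second);
    my $unmatched = 0;
    for my $line (keys %$c1) {
        my $have = $c2->{$line} // 0;    # 0 if the line never occurs
        $unmatched += $c1->{$line} - $have if $c1->{$line} > $have;
    }
    return $unmatched;
}

# Stand-ins for the two files: 100 "x" lines versus one "x" and one "y".
my @file1 = ("x\n") x 100;
my @file2 = ("x\n", "y\n");
print "Unmatched lines in file 1: ", unmatched_count(\@file1, \@file2), "\n";
```

Whether each occurrence should need its own partner is a design decision; the original script treats a single "x" in the second file as a match for every "x" in the first.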

    You want to differentiate empty lines from different lines at the same line number. At the same time you compare *all* lines in one file with every line in the other file. How do you want to count this? Is a mismatch on the same line but a match on a different line worth 1 point (as you have now), and a match on the same line worth 2 points? What then is a line worth when the line at the same position is empty? And what does the summary at the end tell you then, except a rather meaningless number?
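If the comparison is meant to be position by position, one possible reading of the question can be sketched as follows. This is only an illustration under stated assumptions: the helper name `classify_by_position` is hypothetical, in-memory arrays stand in for the files, and a blank or missing line in the second file is reported as 'empty' instead of 'mismatch'.

```perl
use strict;
use warnings;

# Classify each line number as 'match', 'mismatch', or 'empty'
# (hypothetical helper; 'empty' means the second file has a blank or
# missing line at that position).
sub classify_by_position {
    my ($first, $second) = @_;
    my @result;
    for my $i (0 .. $#$first) {
        my $a_line = $first->[$i];
        my $b_line = $i <= $#$second ? $second->[$i] : '';
        chomp($a_line, $b_line);    # compare without trailing newlines
        if    ($b_line =~ /^\s*$/) { push @result, 'empty' }
        elsif ($a_line eq $b_line) { push @result, 'match' }
        else                       { push @result, 'mismatch' }
    }
    return @result;
}

# Stand-ins for the two files.
my @file1 = ("alpha\n", "beta\n", "gamma\n");
my @file2 = ("alpha\n", "\n",     "delta\n");
my @status = classify_by_position(\@file1, \@file2);

# Tally the three categories separately.
my %tally;
$tally{$_}++ for @status;
printf "%-9s %d\n", $_, $tally{$_} // 0 for qw(match mismatch empty);
```

With three separate counters the summary no longer folds "no result" into "mismatch", which is the distinction the question asks for.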

    OK, first suggestion: use the diff utility (always installed on any Unix variant, and it should be available for Windows too) or a Diff CPAN module, as someone else suggested. If not, think carefully about what you want. If you really want to compare every line of one file with every line of the other and the file sizes are smaller than gigabytes, use a hash to store one file, i.e.

    my %file1;
    my $linenumber = 0;
    foreach (<FILE1>) {
        $file1{$_} = $linenumber++;
    }

    Then you can use the hash to look up any line of the other file, and it even tells you the line number where that line was found. But if the same line occurs multiple times in that file, it will only tell you the last line number at which it was found. Some more effort (more complicated data structures) would be necessary to differentiate between them.
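One such "more complicated data structure" is a hash of arrays, where each distinct line maps to every line number at which it occurs. The following is a minimal sketch, not the only way to do it: the helper name `index_lines` is hypothetical, an in-memory array stands in for the file, and line numbers are counted from 1 here (the snippet above counts from 0).

```perl
use strict;
use warnings;

# Map each distinct line to the list of line numbers where it occurs
# (hypothetical helper; line numbers start at 1 here).
sub index_lines {
    my @lines = @_;
    my %where;
    my $linenumber = 0;
    for my $line (@lines) {
        $linenumber++;
        push @{ $where{$line} }, $linenumber;
    }
    return \%where;
}

# Stand-in for the first file: "x" occurs on lines 1 and 3.
my @file1 = ("x\n", "y\n", "x\n");
my $where = index_lines(@file1);

# Look up lines of the "other file" in the index.
for my $query ("x\n", "z\n") {
    (my $shown = $query) =~ s/\n\z//;    # strip newline for display only
    if (my $hits = $where->{$query}) {
        print "'$shown' found at line(s) @$hits\n";
    } else {
        print "'$shown' not found\n";
    }
}
```

Each lookup is still a single hash access, so the whole comparison stays roughly linear in the combined size of the two files.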

Re: on matching content of two text files
by Skeeve (Parson) on Aug 11, 2011 at 06:10 UTC

    Why not use diff?

    Or one of the Diff modules on CPAN?

