h@kim has asked for the wisdom of the Perl Monks concerning the following question:

I know that comparing two files is a typical problem and there are many discussions of it, but I have a somewhat different problem while working with text files:

I have two text files which may differ in the number of lines. I want to compare the two files, find the lines which differ, and then tag all the differences in both files. For example, here are the contents of my files:

File1.txt:

This is the first line.
This line is just appeared in File1.txt.
you can see this line in both files.
this line is also appeared in both files.
this line and,
this one are merged in File2.txt.

File2.txt:

This is the first line.
you can see this line in both files.
this line is also appeared in both files.
this line and, this one are merged in File2.txt.

After processing, I want both files to look like this:

File1.txt:

This is the first line.
<Diff>This line is just appeared in File1.txt.</Diff>
you can see this line in both files.
this line is also appeared in both files.
<Diff>this line and,</Diff>
<Diff>this one are merged in File2.txt.</Diff>

File2.txt:

This is the first line.
<Diff></Diff>
you can see this line in both files.
this line is also appeared in both files.
<Diff>this line and, this one are merged in File2.txt.</Diff>
<Diff></Diff>

How can I do this? I know that tools such as diff could help me, but how can I convert their output into this format?

Thank you in advance.

Re: Tagging the differencies in both files
by roboticus (Chancellor) on Jun 17, 2012 at 13:19 UTC

    h@kim:

    Did you check http://cpan.org? I did a quick search using "Diff" and found a few modules that look like they'd be useful to you. If you look them over, you may find that one gives you a set of results in a structure that greatly simplifies your task. The ones I thought looked interesting are: String::Diff, Text::Diff, and Algorithm::Diff. The last one looks like it returns the results in a very useful form for tasks like what you're attempting. I think I'll install it and goof around this afternoon.
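
    For what it's worth, here is a minimal sketch of how Algorithm::Diff's sdiff could produce roughly the output described above. The input and output file names, and the <Diff> markup, are just assumptions taken from the OP's example:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Algorithm::Diff qw(sdiff);

    # read both files into arrays of lines (file names are placeholders)
    my @f1 = do { open my $fh, '<', 'File1.txt' or die "File1.txt: $!"; <$fh> };
    my @f2 = do { open my $fh, '<', 'File2.txt' or die "File2.txt: $!"; <$fh> };
    chomp(@f1, @f2);

    open my $out1, '>', 'File1.tagged.txt' or die "File1.tagged.txt: $!";
    open my $out2, '>', 'File2.tagged.txt' or die "File2.tagged.txt: $!";

    # sdiff returns one [flag, line_from_f1, line_from_f2] triple per position:
    # 'u' = same in both, 'c' = changed, '-' = only in File1, '+' = only in File2
    for my $hunk ( sdiff(\@f1, \@f2) ) {
        my ($flag, $l1, $l2) = @$hunk;
        if ($flag eq 'u') {
            print {$out1} "$l1\n";
            print {$out2} "$l2\n";
        }
        else {
            # for '-' the File2 side is the empty string, for '+' the File1 side is,
            # which yields the empty <Diff></Diff> placeholders from the example
            print {$out1} "<Diff>$l1</Diff>\n";
            print {$out2} "<Diff>$l2</Diff>\n";
        }
    }

    This writes to new *.tagged.txt files rather than overwriting the originals; rewriting in place is a separate step.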

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Tagging the differencies in both files
by zentara (Cardinal) on Jun 17, 2012 at 13:52 UTC
    Hi, the manual way of doing this is to stuff the lines of both files into a hash and see which keys end up with a count of 1 (a count greater than 1 means the line appears in both files). Below is the basic outline of such a script, but you would need to expand the hashes to record which file each line came from and possibly its line number, which you can get from $. while reading:
    my $line = $.;   # $. holds the current input line number
    Then you would need to rewrite your File1 and File2. Some methods for that are shown in Search Replace String Not Working on text file: either seek and truncate, or reopen the filehandle with '>' (a small seek/truncate sketch follows the outline below).
    #!/usr/bin/perl
    use strict;
    use warnings;

    open (FILE1, '<', 'File1.txt') or die "Unable to open file1.txt for reading : $!";
    open (FILE2, '<', 'File2.txt') or die "Unable to open file2.txt for reading : $!";

    my %lines;
    while ( <FILE1> ) { chomp; $lines{$_}++ }
    while ( <FILE2> ) { chomp; $lines{$_}++ }

    open (FILE3, '>', 'File3.txt') or die "Unable to open file3.txt for writing : $!";

    for ( keys %lines ) {
        next if $lines{$_} > 1;
        print FILE3 "$_\n";
    }
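
    And here is a small illustration of the seek-and-truncate idea for rewriting a file in place (File1.txt is just a placeholder; the actual edits to @lines are left out):

    # rewrite File1.txt in place after modifying its lines
    open my $fh, '+<', 'File1.txt' or die "Unable to open File1.txt: $!";
    my @lines = <$fh>;
    # ... modify @lines here, e.g. wrap the changed lines in <Diff>...</Diff> ...
    seek $fh, 0, 0;              # rewind to the beginning of the file
    print {$fh} @lines;
    truncate $fh, tell($fh);     # chop off any leftover bytes from the old contents
    close $fh;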

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
      And how would you handle two files which have the same lines but in a different sequence? Or when the files have multiple identical lines?

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
        That's why I left it to the OP :-) and remarked that you would need to expand the hashes to contain all the line data per file. Just as an initial brainstorm: in addition to the hash where you count duplicates, you would have two other hashes with each line as key and $filename:linenumber as value. Then, in a relatively complex logic loop, you would first find the duplicates, then re-loop through each file, testing each line for duplicates and comparing the $filename:linenumber values via hash-key lookups. I'm sure that with enough diligence it can be done, because all the information is available in the three hashes; a rough sketch of those three hashes follows below. Of course, those are just my first thoughts, and someone else may know a sweeter way involving less logic. You could also look at tkdiff; it isn't Perl/Tk, but it does color highlighting the way you desire.
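
        Something along these lines, as a very rough sketch of the three hashes (file names are placeholders, and the final loop only reports the unique lines rather than rewriting the files):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my %count;     # how many times each line occurs across both files
        my %where1;    # line text => "File1.txt:linenumber"
        my %where2;    # line text => "File2.txt:linenumber"

        for my $pair ( ['File1.txt', \%where1], ['File2.txt', \%where2] ) {
            my ($name, $where) = @$pair;
            open my $fh, '<', $name or die "Unable to open $name: $!";
            while ( my $line = <$fh> ) {
                chomp $line;
                $count{$line}++;
                $where->{$line} = "$name:$.";
            }
            close $fh;
        }

        # a count of 1 means the line appears in only one of the files
        for my $line ( grep { $count{$_} == 1 } keys %count ) {
            my $loc = $where1{$line} // $where2{$line};
            print "only in $loc: $line\n";
        }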

        I'm not really a human, but I play one on earth.
        Old Perl Programmer Haiku ................... flash japh