Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi there, can someone tell me which approach to follow for this? I have two files that contain information as follows:

--DATA1--           --DATA2--
Monk Apple One      Sam Orange Two
Sam Orange Two      Sue Apple One
Sue Apple One       Monk Apple One
Mike Bannana One    Don Apple Two

What I need to do is compare the Data1 file with the Data2 file and print the differences to a Data3 file. I only print the name of the person, so my Data3 will look like this:

--DATA3--
Don
Mike
I was following an approach of using arrays
my @Found = ();
while (my $line = <DATA1>) {
    chomp $line1;
    while (my $line2 = <DATA2>) {
        chomp $line2;
        if ("$line1" eq "$line2") {}
        else { push @Found, $line2 }
    }
}
open (DATA3, "data3") or die;
{ print @Found; }
It is not finding the difference between the two files. Can someone help? Thanks a lot.

Re: comparing two files
by Roy Johnson (Monsignor) on Mar 09, 2004 at 15:49 UTC
    If the order of lines isn't important, then you should use hashes:
    Update: refactored to use one hash and to print only the first column.
    my %seen;
    #open DATA1 here
    ++$seen{$_} while (<DATA1>);
    #close DATA1 here
    #open DATA2 here
    while (<DATA2>) {
        # Print any lines that are found, that weren't in DATA1
        print((split)[0], "\n") unless (defined(delete $seen{$_}));
    }
    # print what's left
    print((split)[0], "\n") for (keys %seen);
    If this quick and dirty approach isn't what you're looking for, check out the Algorithm::Diff module.
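For anyone who wants to try the single-hash idea without creating files first, here is a minimal sketch that runs it over in-memory arrays holding the sample data from the question. The helper name `unique_names` is made up for this illustration, and the result is sorted only so that the output order is predictable.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the first column of
# every line that appears in only one of the two lists. Whole lines are
# compared, as in the reply above.
sub unique_names {
    my ($lines1, $lines2) = @_;
    my %seen;
    $seen{$_} = 1 for @$lines1;    # remember every line of the first list
    my @names;
    for my $line (@$lines2) {
        # delete returns the stored value if the line was in the first list
        next if defined delete $seen{$line};
        push @names, (split ' ', $line)[0];
    }
    # whatever is still in %seen occurred only in the first list
    push @names, map { (split ' ', $_)[0] } keys %seen;
    return sort @names;
}

my @data1 = ('Monk Apple One', 'Sam Orange Two', 'Sue Apple One', 'Mike Bannana One');
my @data2 = ('Sam Orange Two', 'Sue Apple One', 'Monk Apple One', 'Don Apple Two');

print "$_\n" for unique_names(\@data1, \@data2);    # prints Don, then Mike
```

Because each matching line is deleted from the hash as soon as it is seen, a line that occurs in both lists but is repeated in the second one would be reported on its second occurrence. If that matters for your data, test with `exists` instead and delete nothing.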

    The PerlMonk tr/// Advocate
Re: comparing two files
by Limbic~Region (Chancellor) on Mar 09, 2004 at 15:55 UTC
    Anonymous Monk,
    You have not mentioned a few things I consider very important. Do you need to know which file the extra line(s) came from? Can one file contain the same line more than once, and is that relevant? Is order important?

    I will give you both standard responses when this question is asked.

    • Use a hash
    • Use diff
    #!/usr/bin/perl
    use strict;
    use warnings;

    open (FILE1, '<', 'file1.txt') or die "Unable to open file1.txt for reading : $!";
    open (FILE2, '<', 'file2.txt') or die "Unable to open file2.txt for reading : $!";

    my %lines;
    while ( <FILE1> ) { chomp; $lines{$_}++ }
    while ( <FILE2> ) { chomp; $lines{$_}++ }

    open (FILE3, '>', 'file3.txt') or die "Unable to open file3.txt for writing : $!";
    for ( keys %lines ) {
        next if $lines{$_} > 1;
        print FILE3 "$_\n";
    }
    If you are not on a *nix system with diff, or if you have not installed a *nix toolkit for Win32, you can find a pure Perl implementation of diff here and of sort here. You may also want to look into Perltidy. This question gets asked a lot, so you may also want to look at our Q and A section in the future.
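The questions above about which file a line came from and about repeated lines can both be handled with a small variation on the hash approach. The following is only a sketch under my own assumptions: the helper name `split_unique` is invented, and it works on arrays of already-chomped lines rather than on file handles. Instead of counting, it sets one bit per file, so a line repeated inside a single file is not mistaken for a line common to both.

```perl
use strict;
use warnings;

# Hypothetical helper: takes two array refs of chomped lines and returns
# two array refs: lines found only in the first list, and lines found
# only in the second list.
sub split_unique {
    my ($lines1, $lines2) = @_;
    my %where;
    $where{$_} |= 1 for @$lines1;    # bit 1: seen in the first file
    $where{$_} |= 2 for @$lines2;    # bit 2: seen in the second file
    my @only1 = sort grep { $where{$_} == 1 } keys %where;
    my @only2 = sort grep { $where{$_} == 2 } keys %where;
    return (\@only1, \@only2);
}

# 'a' is repeated in the first list but never appears in the second
my ($only1, $only2) = split_unique([ 'a', 'a', 'b' ], [ 'b', 'c' ]);
print "only in file 1: @$only1\n";    # a
print "only in file 2: @$only2\n";    # c
```

With a plain counter, the repeated 'a' would reach a count of 2 and be skipped as if it were in both files.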

    Cheers - L~R

Re: comparing two files
by TomDLux (Vicar) on Mar 09, 2004 at 16:03 UTC

    If you are on a Unix platform, comm is the utility you want. It sorts the lines of the two input files into three categories: 'unique to file A', 'unique to file B', and 'common to both'. Command-line flags control which of those categories are output.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

      As the comm(1) manual says, you'd have to sort(1) the files first. There are some versions of uniq(1) that will also do it, should comm be missing.
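If comm is not available, the same three-way split can be sketched in Perl. This is not the comm source, just an illustration of why the input has to be sorted: both lists are walked in step, and the smaller of the two current lines is known to be missing from the other list. The helper name `comm_like` is hypothetical, and it sorts its input itself using Perl's default string order.

```perl
use strict;
use warnings;

# Hypothetical helper: returns three array refs holding the lines unique
# to the first list, the lines unique to the second list, and the lines
# common to both.
sub comm_like {
    my ($first, $second) = @_;
    my @left  = sort @$first;     # the merge below relies on sorted input
    my @right = sort @$second;
    my (@only_left, @only_right, @both);
    my ($i, $j) = (0, 0);
    while ($i < @left && $j < @right) {
        my $order = $left[$i] cmp $right[$j];
        if    ($order < 0) { push @only_left,  $left[$i++] }
        elsif ($order > 0) { push @only_right, $right[$j++] }
        else               { push @both, $left[$i]; $i++; $j++ }
    }
    # anything left over once one list is exhausted is unique to the other
    push @only_left,  @left[$i .. $#left];
    push @only_right, @right[$j .. $#right];
    return (\@only_left, \@only_right, \@both);
}

my @data1 = ('Monk Apple One', 'Sam Orange Two', 'Sue Apple One', 'Mike Bannana One');
my @data2 = ('Sam Orange Two', 'Sue Apple One', 'Monk Apple One', 'Don Apple Two');

my ($only1, $only2, $both) = comm_like(\@data1, \@data2);
print "only in DATA1: $_\n" for @$only1;    # Mike Bannana One
print "only in DATA2: $_\n" for @$only2;    # Don Apple Two
```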

      Sören

Re: comparing two files
by graff (Chancellor) on Mar 10, 2004 at 05:43 UTC
    Your examples didn't make this clear... what would you want as output if your two input files were:
    -- FILE 1 --        -- FILE 2 --
    Monk apple one      Monk apple one
    Punk apple two      Punk peach two
    John peach three    John peach two
    Jane plum three     Jack plum three
    The question is: Which of the following best describes your task?
    • The comparison consists of using just the first column as the "key" field, and you just want to print the keys that are unique to one file or the other.
    • The comparison involves whole lines -- the first column of a line is printed if the other file does not contain an exact match for the whole line.
    If your task is like the first one, you could check out a command line utility script that I posted here. If it's the latter, a different approach would be needed.
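To make the difference between the two readings concrete, here is a small sketch (not the utility script mentioned above) that runs both on the example files. The helper `unmatched_names` and the two key functions are names invented for this illustration. The only thing that changes between the two readings is what counts as the comparison key.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the sorted first-column names of all lines
# whose key, as computed by $key_of, does not occur in the other list.
sub unmatched_names {
    my ($lines1, $lines2, $key_of) = @_;
    my (%in1, %in2);
    $in1{ $key_of->($_) } = 1 for @$lines1;
    $in2{ $key_of->($_) } = 1 for @$lines2;
    my %names;
    for my $line (@$lines1) {
        $names{ (split ' ', $line)[0] } = 1 unless $in2{ $key_of->($line) };
    }
    for my $line (@$lines2) {
        $names{ (split ' ', $line)[0] } = 1 unless $in1{ $key_of->($line) };
    }
    return sort keys %names;
}

my $first_column = sub { (split ' ', $_[0])[0] };    # first reading: key is column one
my $whole_line   = sub { $_[0] };                    # second reading: key is the whole line

my @file1 = ('Monk apple one', 'Punk apple two', 'John peach three', 'Jane plum three');
my @file2 = ('Monk apple one', 'Punk peach two', 'John peach two',   'Jack plum three');

print join(' ', unmatched_names(\@file1, \@file2, $first_column)), "\n";    # Jack Jane
print join(' ', unmatched_names(\@file1, \@file2, $whole_line)),   "\n";    # Jack Jane John Punk
```

Under the first reading only Jane and Jack are reported. Under the second, Punk and John show up as well, because their lines differ in the other columns.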