Delimited files Comparision

harishnuti has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks i need your advice on my below requirement and the suggestions on effort i put to achieve it

* i have 2 delimited files which needs to be compared
* The headers for this delimited files is present in another file
* i need to skip some fields while comparision
* finally i should print the differences between these 2 delimited files i.e. only fields which are different along with header in good report format.
A Extract of File1 shown below
!A90000!,!B00500!,!1!,!0!,!2!,!23/02/1998!,!2!,!1!,!B005AU!
!A90000!,!B00500!,!2!,!0!,!2!,!23/02/1998!,!2!,!1!,!B005AU!
!A90000!,!B00500!,!3!,!0!,!2!,!23/02/1998!,!2!,!1!,!B005AU!
!A90000!,!B00500!,!4!,!0!,!2!,!23/02/1998!,!2!,!1!,!B005AU!
!I912ST!,!SG!,!30/08/2004!,!288!,!0!
!I912ST!,!SN!,!30/08/2004!,!249!,!0!
!I912ST!,!TR!,!30/08/2004!,!56!,!0!
!I912ST!,!WC!,!30/08/2004!,!211!,!0!

A Extract of file2 shown below
!A90000!,!B00500!,!0!,!0!,!2!,!23/02/1998!,!2!,!1!,!B005AU!
!A90000!,!B00500!,!3!,!0!,!2!,!23/02/1998!,!2!,!1!,!B005AU!
!A90000!,!B00500!,!3!,!4!,!2!,!23/02/1998!,!2!,!1!,!B005AU!
!A90000!,!B00500!,!4!,!0!,!8!,!23/02/1998!,!2!,!1!,!B005AU!
!I912ST!,!SG!,!30/08/2004!,!248!,!0!
!I912ST!,!SN!,!30/08/2004!,!290!,!0!
!I912ST!,!FZ!,!30/08/2004!,!56!,!0!
!I912ST!,!WC!,!30/08/2004!,!211!,!1!

Header file will be below format
A90000,CT03,1, 1,00240
A90000,CT03,1, 2,00020
A90000,CT03,1, 3,00090
A90000,CT03,1, 4,00690
A90000,CT03,0, 5,01040
A90000,CT03,0, 6,00241
A90000,CT03,0, 7,00243
A90000,CT03,0, 8,00710
A90000,CT03,0, 9,00242
I91200,CW5Y,1, 1,17447
I91200,CW5Y,0, 2,01040
I91200,CW5Y,0, 3,04010
I91200,CW5Y,0, 4,04620
and so on ..... A90000 , I91200 etc , need to generalize this solution
The last column are header for the data in above files
my Code below

# Read header file and store it in hash
open(FILE,"header.txt");
while(<FILE>){
      chomp; next if (/^\s*$/); my @fields = split(/\,/,$_);
      push(@{$hashkeys{$fields[0]}{fields}},$fields[4]);
      # Iam pushing all headers for each table i.e. for field[0] into 
+hash array
}
CLOSE FILE;
# open file1 and file2 and readthem into Hashes %hash1 and %hash2 as b
+elow
open FILE,"file1";
while ( <FILE> ){
     my @arr = split ( /\,/,$_ );
     my $key = join("-->",@arr);
     $hash1{$key} = 1;
}
# llly for file2 i.e. %hash2 contains above.
# The below arrays gives me lines which are different
my @missing1 = grep ! exists $hash1{ $_ }, keys %hash2;
my @missing2 = grep ! exists $hash2{ $_ }, keys %hash1;
# now iam reformatting the arrays w.r.t header file
# Now i want to show output as below..
# Cols which are different in file1 
00240 00020 00090 00690 01040 00241 00243 00710 00242
              1
              2  
                     4
                           8
# Similarly for file2 
# i have done it in lengthy way , with no luck and no formatting
[download]

i can paste the whole code, but its in pretty bad shape, if there is any better way to achieve above formatting, i love to know.
iam happy even if any suggestions are given to achieve above formatting, i can try experimenting in my code

Comment on Delimited files Comparision Download Code

Replies are listed 'Best First'.
Re: Delimited files Comparision by TedPride (Priest) on Jul 28, 2008 at 04:43 UTC
Not exactly sure what the numbers 1, 2, 4, 8 represent, but here's a 5 or 10 minute hack job: use strict; my ($handle, @header, @file1, @file2, $row, $col, @output); ### Load header data open ($handle, 'header.txt') \|\| die "Can't load header file.\n"; while (<$handle>) { @_ = split /,/, $_; $_ = pop @_; chomp; push @header, $_; } close ($handle); ### Load file 1 as array of arrays open ($handle, 'file1.txt') \|\| die "Can't load data file 1.\n"; while (<$handle>) { push @file1, [split /,/, $_]; } close ($handle); ### Load file 2 open ($handle, 'file2.txt') \|\| die "Can't load data file 2.\n"; while (<$handle>) { push @file2, [split /,/, $_]; } close ($handle); ### Output header line print join "\t", @header; print "\n"; ### For each line, output x for fields with a difference for ($row = 0; $row <= $#file1; $row++) { @output = (); for ($col = 0; $col <= $#header; $col++) { push @output, $file1[$row][$col] ne $file2[$row][$col] ? 'x' : ''; } print join "\t", @output; print "\n"; } [download]	[reply] [d/l]
Re^2: Delimited files Comparision by harishnuti (Beadle) on Jul 28, 2008 at 06:56 UTC
Well , thats Amazing, i would like to thanq for effort and spending some time in looking into my problem.. sorry to say , the solution works absolutely fine for the above data, but rather i have 2 more points as below. * The file's shown is only an extract of large file, actual file has got 1 million records of which A9000 is one entry. * The file1 and file2 and header contains different tables data, i have Appended one more set of data .. How do i extend the same solution, Can we use Hash to store and then loop on differences? i have one more requirement, the date field i should skip based on some position in another file, i can handle this one.. can i have some idea on the updated set of data, i will try my best to extend your solution.	[reply]
Re^3: Delimited files Comparision by harishnuti (Beadle) on Jul 30, 2008 at 01:33 UTC
Thanks iam able to resolve the issue using methods of hash	[reply]