Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi All, I am new to Perl and I need to compare two flat files, say file X and file Y. Both have a similar structure: they are |-delimited files with around a million records each. Each record has a key value, namely the 3rd column of the record. I need to find all records present in file X but not in file Y and append 'I' to them. Similarly, I have to find all records present in file Y but not in file X and append 'D' to them. Finally, I need to find all records that differ between file X and file Y and append 'U' to them. A record (row) can have any number of fields (columns). Can you please help me here?

Replies are listed 'Best First'.
Re: file delta detection
by Anonymous Monk on May 06, 2011 at 10:05 UTC
Re: file delta detection
by anonymized user 468275 (Curate) on May 06, 2011 at 10:34 UTC
    You don't state the capabilities of the machine you are running on or the width of the records. For example, if they are 10K wide, you would need a monstrous machine to load the files into Perl memory.

    So to provide a reasonably general solution, that should work for most machines, I would do quite a lot with unix power tools, starting with:

    awk 'BEGIN{ FS="|" } { print $3 "|" $0 }' < file1 | sort > mod1.sor
    awk 'BEGIN{ FS="|" } { print $3 }'        < file1 | sort > keys1.sor
    awk 'BEGIN{ FS="|" } { print $3 "|" $0 }' < file2 | sort > mod2.sor
    awk 'BEGIN{ FS="|" } { print $3 }'        < file2 | sort > keys2.sor
    comm -12 keys1.sor keys2.sor > xk.sor
    comm -13 mod1.sor mod2.sor > t.sor
    comm -23 mod1.sor mod2.sor > d.sor
    The xk.sor file has the keys common to both files. t.sor (with the key column now prepended) contains the mixture of records destined to be either I or U (it must be one or the other), and d.sor contains the mixture of D and U records from the older file (or rather, the file holding the older data of the two inputs to this process, also now with the key prepended).

    Now you can load xk.sor into Perl as hash keys to identify the 'U' records in t.sor (all the others being 'I'). The xk.sor hash can be used similarly to eliminate the 'U' records from d.sor (all the others being 'D'). You can remove the key column we prepended at output time.

    In regard to the Perl language elements needed: hashes to hold the keys, the split function to break delimited records into arrays, shift to remove the first element from an array, open to open files for input or output, and the <> operator to read from files. Need anything more for this? (Update: apart from print to output the lines with their appendages, minus the key-on-the-front which we used to make unix sort work without difficulty. unix sort does have a key-definition facility, but it is too awkward with delimiters; on the other hand, unix sort has built-in disk-swapping facilities for huge-file processing.)
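    A minimal Perl sketch of that hash step (my own illustration, not from the original post; the xk.sor/t.sor sample contents are made up, and the 'I'/'U' letters follow the original question):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample inputs standing in for the comm output above:
# xk.sor holds the keys common to both files; t.sor holds the
# key-prepended records from the newer file.
open(my $FH, '>', 'xk.sor') or die $!;
print $FH "k2\n";
close($FH);
open($FH, '>', 't.sor') or die $!;
print $FH "k1|a|b|k1|x\n", "k2|c|d|k2|y\n";
close($FH);

# Load the common keys into a hash.
my %common;
open(my $XK, '<', 'xk.sor') or die "xk.sor: $!";
while (my $k = <$XK>) {
    chomp $k;
    $common{$k} = 1;
}
close($XK);

# Records whose key appears in both files become 'U'; the rest,
# present only in the newer file, become 'I'.  shift drops the key
# column that was prepended to make unix sort easy.
open(my $T,   '<', 't.sor')   or die "t.sor: $!";
open(my $OUT, '>', 'out.txt') or die "out.txt: $!";
while (my $line = <$T>) {
    chomp $line;
    my @fields = split /\|/, $line;
    my $key    = shift @fields;
    my $tag    = $common{$key} ? 'U' : 'I';
    print $OUT join('|', @fields), "|$tag\n";
}
close($T);
close($OUT);
```

    The same hash, consulted while reading d.sor, separates the 'U' rows from the 'D' rows.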

    One world, one people

Re: file delta detection
by fidesachates (Monk) on May 06, 2011 at 12:56 UTC
    My fellow monk has already provided a good perl solution. I just want to add that if you're on a unix box, your problem can be solved with a couple unix commands.
    cut -f3 -d| X | sort > outx.txt; cut -f3 -d| Y | sort > outy.txt; comm outx.txt outy.txt;

    Fair warning, I haven't tested these lines. There may be syntax typos. If there are, just read the man pages for cut, sort, and comm. Enjoy.

    UPDATE: Simon provides a much more encompassing answer with unix solutions.
      There may be syntax typos.

      For sure. You need to put backslash in front of the pipe symbol when you intend to use it as a literal "vertical bar" character (or you could put quotes around it):

      cut -f3 -d\| X | sort > outx.txt
      cut -f3 -d'|' Y | sort > outy.txt
      Provided that part is done right, I prefer using "cut" rather than "awk" -- it's just easier (and probably faster).
Re: file delta detection
by clueless newbie (Curate) on May 06, 2011 at 16:22 UTC
    All assuming that the key value is unique! Sorry about the long lines ...
    #!/usr/bin/perl
    # Just For A Lark
    use strict;
    use warnings;
    use Digest::MD5;

    # Hash for X
    my %X_h;
    {
        open(my $IN,'<','X') or die "Can't open 'X'! $!";
        # Populate hash with third field and SizeAndMD5.
        # Note: split needs /\|/ -- a bare '|' pattern is an empty
        # alternation and would split the record into single characters.
        while (<$IN>) {
            chomp;
            $X_h{(split /\|/)[2]}=SizeAndMD5($_);
        };
        close($IN) or die "Can't close 'X'! $!";
    };
    ### %X_h

    # Processing Y
    my %Y_h;
    {
        open(my $IN,'<','Y') or die "Can't open 'Y'! $!";
        open(my $OUT,'>','Y.new') or die "Can't open 'Y.new'! $!";
        while (<$IN>) {
            chomp;
            my $key=(split /\|/)[2];
            $Y_h{$key}=SizeAndMD5($_);
            unless (exists $X_h{$key}) {
                # No key in X so append a "D"
                print $OUT "$_|D\n";
            } elsif ($Y_h{$key} ne $X_h{$key}) {
                # The corresponding record in X is different
                print $OUT "$_|U\n";
            } else {
                # The corresponding record in X is the same
                print $OUT "$_|\n";
            };
        };
        close($OUT) or die "Can't close 'Y.new'! $!";
        close($IN) or die "Can't close 'Y'! $!";
    };
    ### %Y_h

    # Re-processing X
    {
        open(my $IN,'<','X') or die "Can't open 'X'! $!";
        open(my $OUT,'>','X.new') or die "Can't open 'X.new'! $!";
        while (<$IN>) {
            chomp;
            my $key=(split /\|/)[2];
            unless (exists $Y_h{$key}) {
                # No key in Y so append an "I"
                print $OUT "$_|I\n";
            } elsif ($X_h{$key} ne $Y_h{$key}) {
                # The corresponding record in Y is different
                print $OUT "$_|U\n";
            } else {
                # The corresponding record in Y is the same
                print $OUT "$_|\n";
            };
        };
        close($OUT) or die "Can't close 'X.new'! $!";
        close($IN) or die "Can't close 'X'! $!";
    };
    exit;

    # Returns Size and MD5 of the string as "%8.8lx%32.32s"
    sub SizeAndMD5 {
        return sprintf("%8.8lx%32.32s",length($_[0]),Digest::MD5::md5_hex($_[0]));
    };
Re: file delta detection
by graff (Chancellor) on May 07, 2011 at 18:07 UTC
    The unix-oriented replies will probably help the most, given the quantity of data you have, but in case the size of the key field isn't terribly large, an old perl utility of mine might be useful here: cmpcol.

    If the two input files are really large (lots of wide fields in each row), you probably want to read them just once to get the key distribution info, then just once more to create the updated version of each file. To that end, cmpcol would do the first step like this:

    cmpcol -d '|' -us X:3 Y:3 > xy-keys.union

    (updated to include ":3" as the key-column specifier on each file name)

    The output will have one line per distinct key value, followed by space, then a token to indicate where the key was found, e.g.:

    1st_key_value <1
    2nd_key_value <12
    3rd_key_value <2
    ...
    (Keys are listed in ascii-betic sorted order.) Keys found in both files get <12, keys only in X or only in Y get <1 or <2 respectively. (If a key occurs more than once in a given file, you'll see a "+" next to the file number, i.e.: <+12 or <12+ or <+12+ for "non-unique in file 1, in file 2, in both files", respectively.)

    Once you have that output, it's simple to append "|I" to records in X, and "|D" to records in Y. Adding "|U" (to file Y? to file X? to both files?) is trickier, because you have to work out whether the '<12' keys occur with identical or differing content in the two source files, and depending on the quantity of data, this might cause issues with memory consumption. But it's worth making a first try with an approach that's easy to code up -- for simplicity, I'll assume that the "|U" thing only needs to go into file Y (if it only needs to go into file X, just change the code to read Y first, then X):

    #!/usr/bin/perl
    use strict;

    my %key;
    open( C, '<', 'xy-keys.union' ) or die "xy-keys.union: $!\n";
    while (<C>) {
        chomp;
        # [ ] matches the literal space before the token (needed under /x,
        # which strips bare whitespace from the pattern); dup.key (+) marks
        # are ignored
        my ( $k, $v ) = ( /^ (.*) [ ] < \+? ([12]+) \+? $/x );
        $key{$k} = $v;
    }
    my %common;
    open( I, '<', 'X' ) or die "X: $!\n";
    open( O, '>', 'X.out' ) or die "X.out: $!\n";
    while (<I>) {
        my $k = ( split /\|/ )[2];
        if ( $key{$k} eq '1' ) {
            s/$/|I/;
        }
        else {
            $common{$k} = $_;
        }
        print O;
    }
    open( I, '<', 'Y' ) or die "Y: $!\n";
    open( O, '>', 'Y.out' ) or die "Y.out: $!\n";
    while (<I>) {
        my $k = ( split /\|/ )[2];
        if ( $key{$k} eq '2' ) {
            s/$/|D/;
        }
        elsif ( $_ ne $common{$k} ) {
            s/$/|U/;
        }
        print O;
    }
    Note that if you have duplicate keys with varying data in the first file (i.e. multiple rows with the same key but different values in other fields), and those keys also show up in the second file, there will probably be trouble. The above approach only keeps track of one row value for a given key.

    (update: added the missing file handle arg 'O' on the print statements)
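    Regarding that duplicate-key caveat, a hedged sketch (my own, with made-up sample rows, not part of the original post) of one way around it: keep every X row per key in a hash of arrays, so a Y row is only flagged 'U' when no X row with that key matches it exactly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample rows standing in for file X; real code would read them
# from disk.  The key is the 3rd pipe-delimited column.
my @x_rows = ("a|b|k1|1", "c|d|k1|2", "e|f|k2|3");

# Hash of arrays: remember every X row seen for a key, not just the last.
my %rows_for_key;
for my $row (@x_rows) {
    my $key = (split /\|/, $row)[2];
    push @{ $rows_for_key{$key} }, $row;
}

# A Y row counts as unchanged only if it matches one of X's rows for the
# same key exactly; otherwise it would get the 'U' tag.
sub is_unchanged {
    my ($y_row) = @_;
    my $key = (split /\|/, $y_row)[2];
    return 0 unless exists $rows_for_key{$key};
    return 0 + grep { $_ eq $y_row } @{ $rows_for_key{$key} };
}
```

    This trades memory for correctness on duplicate keys, so it only suits inputs whose duplicated rows fit comfortably in RAM.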