Re: file delta detection

The unix-oriented replies will probably help the most, given the quantity of data you have, but in case the size of the key field isn't terribly large, an old perl utility of mine might be useful here: cmpcol.

If the two input files are really large (lots of wide fields in each row), you probably want to read them just once to get the key distribution info, then just once more to create the updated version of each file. To that end, cmpcol would do the first step like this:

cmpcol -d '|' -us X:3 Y:3 > xy-keys.union

(updated to include ":3" as the key-column specifier on each file name)

The output will have one line per distinct key value, followed by space, then a token to indicate where the key was found, e.g.:

1st_key_value <1
2nd_key_value <12
3rd_key_value <2
...

(Keys are listed in ascii-betic sorted order.) Keys found in both files get <12, keys only in X or only in Y get <1 or <2 respectively. (If a key occurs more than once in a given file, you'll see a "+" next to the file number, i.e.: <+12 or <12+ or <+12+ for "non-unique in file 1, in file 2, in both files", respectively.)
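In case it helps to see that token format spelled out, here is a minimal sketch of pulling one listing line apart. The sub name parse_union_line and the hash fields are made up for illustration (they are not part of cmpcol), and I'm assuming that for a key found in only one file, a "+" on either side of the file number means "non-unique in that file".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of cmpcol): split one line of the union
# listing into the key, the file set, and per-file duplicate flags.
sub parse_union_line {
    my ($line) = @_;
    chomp $line;
    my ( $key, $lead, $files, $trail ) =
        $line =~ /^(.*) <(\+?)(12|1|2)(\+?)$/
        or return;    # not a line in the expected format

    # For '12', a leading "+" means non-unique in file 1 and a trailing "+"
    # means non-unique in file 2.  ASSUMPTION: for a key found in only one
    # file, a "+" on either side means non-unique in that file.
    my $either = ( $lead || $trail ) ? 1 : 0;
    my $dup1 = $files eq '12' ? ( $lead  ? 1 : 0 )
             : $files eq '1'  ? $either
             :                  0;
    my $dup2 = $files eq '12' ? ( $trail ? 1 : 0 )
             : $files eq '2'  ? $either
             :                  0;

    return { key => $key, files => $files, dup1 => $dup1, dup2 => $dup2 };
}

# Sample lines in the format described above
for my $line ( "1st_key_value <1", "2nd_key_value <+12", "3rd_key_value <2" ) {
    my $r = parse_union_line($line) or next;
    printf "%s: files=%s dup1=%d dup2=%d\n", @{$r}{qw(key files dup1 dup2)};
}
```

The pattern matches the single space before "<" literally, so the captured key carries no trailing blank.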

Once you have that output, it's simple to append "|I" to the records whose keys occur only in X, and "|D" to the records whose keys occur only in Y. Adding "|U" (to file Y? to file X? to both files?) is trickier, because you have to work out whether the '<12' keys occur with identical or differing content in the two source files. That means holding the rows for those keys from one file in memory while reading the other, so depending on the quantity of data, memory consumption might become a problem. But it's worth making a first try with an approach that's easy to code up. For simplicity, I'll assume that the "|U" thing only needs to go into file Y (if it only needs to go into file X, just change the code to read Y first, then X):

#!/usr/bin/perl
use strict;

my %key;
open( C, '<', 'xy-keys.union' ) or die "xy-keys.union: $!\n";
while (<C>) {
    chomp;
    # ignore the dup-key (+) marks; \s keeps the separator space out of the key
    my ( $k, $v ) = ( /^ (.*) \s < \+? ([12]+) \+? $/x );
    $key{$k} = $v;
}

my %common;
open( I, '<', 'X' ) or die "X: $!\n";
open( O, '>', 'X.out' ) or die "X.out: $!\n";
while (<I>) {
    my $k = ( split /\|/ )[2];
    if ( $key{$k} eq '1' ) {
        s/$/|I/;
    }
    else {
        $common{$k} = $_;
    }
    print O;
}
open( I, '<', 'Y' ) or die "Y: $!\n";
open( O, '>', 'Y.out' ) or die "Y.out: $!\n";
while (<I>) {
    my $k = (split /\|/)[2];
    if ( $key{$k} eq '2' ) {
        s/$/|D/;
    }
    elsif ( $_ ne $common{$k} ) {
        s/$/|U/;
    }
    print O;
}

Note that if you have duplicate keys with varying data in the first file (i.e. multiple rows with the same key but different values in other fields), and those keys also show up in the second file, there will probably be trouble. The above approach only keeps track of one row value for a given key.
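To find out ahead of time whether that caveat applies to your data, one option is to scan the key listing for keys that are in both files and flagged non-unique in file 1. This is only a sketch: the sub name risky_keys is made up for illustration, and the sample listing is fed in through an in-memory file handle so the snippet runs on its own; in practice you would open 'xy-keys.union' instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the keys that occur in both files and are
# marked non-unique in file 1, i.e. tokens "<+12" and "<+12+".
sub risky_keys {
    my ($fh) = @_;
    my @risky;
    while ( my $line = <$fh> ) {
        chomp $line;
        push @risky, $1 if $line =~ /^(.*) <\+12\+?$/;
    }
    return @risky;
}

# Made-up sample listing; replace with: open( my $fh, '<', 'xy-keys.union' )
my $sample = "a <12\nb <+12\nc <1\nd <12+\ne <+12+\n";
open( my $fh, '<', \$sample ) or die "in-memory open: $!\n";
print "$_\n" for risky_keys($fh);
close $fh;
```

If that list comes back empty, keeping one row per key should be safe for the X-then-Y order used above. If you swap the order to read Y first, check for the trailing "+" instead.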

(update: added the missing file handle arg 'O' on the print statements)
