The Unix-oriented replies will probably help the most, given the quantity of data you have, but if the key field isn't terribly large, an old Perl utility of mine might be useful here: cmpcol.

If the two input files are really large (lots of wide fields in each row), you probably want to read them just once to get the key distribution info, then just once more to create the updated version of each file. To that end, cmpcol would do the first step like this:

cmpcol -d '|' -us X:3 Y:3 > xy-keys.union

(updated to include ":3" as the key-column specifier on each file name)

The output will have one line per distinct key value, followed by space, then a token to indicate where the key was found, e.g.:

1st_key_value <1
2nd_key_value <12
3rd_key_value <2
...

(Keys are listed in ascii-betic sorted order.) Keys found in both files get <12, keys only in X or only in Y get <1 or <2 respectively. (If a key occurs more than once in a given file, you'll see a "+" next to the file number, i.e.: <+12 or <12+ or <+12+ for "non-unique in file 1, in file 2, in both files", respectively.)
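
In case cmpcol isn't at hand, here is a rough stand-in for that first step, just to make the output format concrete. It is a sketch, not the real utility: the sub names key_counts and key_union are made up for illustration, the rows are in-memory sample data rather than the real X and Y files, and the "+" placement for keys that repeat in only one file ("<+1", "<2+") is my assumption, extrapolated from the three combined forms described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how often each key (0-based column $col) occurs in a list of
# '|'-delimited rows. (Hypothetical helper, not part of cmpcol.)
sub key_counts {
    my ( $col, @rows ) = @_;
    my %count;
    for my $row (@rows) {
        chomp( my $line = $row );
        my $k = ( split /\|/, $line, -1 )[$col];
        $count{$k}++ if defined $k;
    }
    return \%count;
}

# Build "key <token" lines: <1, <2 or <12, with "+" next to the file
# number whose key is non-unique. Keys come out in sorted order.
sub key_union {
    my ( $c1, $c2 ) = @_;
    my %all = map { $_ => 1 } keys %$c1, keys %$c2;
    my @out;
    for my $k ( sort keys %all ) {
        my $tok = '<';
        if ( $c1->{$k} ) { $tok .= ( $c1->{$k} > 1 ? '+' : '' ) . '1' }
        if ( $c2->{$k} ) { $tok .= '2' . ( $c2->{$k} > 1 ? '+' : '' ) }
        push @out, "$k $tok";
    }
    return @out;
}

# Sample data only: the key is the third field, as with X:3 Y:3.
my @x = ( "a|b|k1|x\n", "a|b|k2|x\n", "a|b|k2|y\n" );
my @y = ( "a|b|k2|x\n", "a|b|k3|x\n" );
print "$_\n" for key_union( key_counts( 2, @x ), key_counts( 2, @y ) );
```

With the sample rows, k2 occurs twice in the first list and once in the second, so it comes out as "k2 <+12", while k1 and k3 get "<1" and "<2".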

Once you have that output, it's simple to append "|I" to records in X, and "|D" to records in Y. Adding "|U" (to file Y? to file X? to both files?) is trickier, because you have to work out whether the '<12' keys occur with identical or differing content in the two source files, and depending on the quantity of data, this might cause issues with memory consumption. But it's worth making a first try with an approach that's easy to code up -- for simplicity, I'll assume that the "|U" thing only needs to go into file Y (if it only needs to go into file X, just change the code to read Y first, then X):

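#!/usr/bin/perl
use strict;

# Load the key distribution written by cmpcol: key => '1', '2' or '12'.
my %key;
open( C, '<', 'xy-keys.union' ) or die "xy-keys.union: $!\n";
while (<C>) {
    chomp;
    # ignore dup.key (+) marks; the \s+ keeps the separating space out of the key
    my ( $k, $v ) = ( /^ (.*) \s+ < \+? ([12]+) \+? $/x );
    $key{$k} = $v;
}

# Pass over X: append "|I" to rows found only in X, and remember the
# rows whose keys also occur in Y.
my %common;
open( I, '<', 'X' ) or die "X: $!\n";
open( O, '>', 'X.out' ) or die "X.out: $!\n";
while (<I>) {
    my $k = ( split /\|/ )[2];
    if ( $key{$k} eq '1' ) {
        s/$/|I/;
    }
    else {
        $common{$k} = $_;
    }
    print O;
}

# Pass over Y: append "|D" to rows found only in Y, and "|U" to rows
# whose key is in both files but whose content differs from X.
open( I, '<', 'Y' ) or die "Y: $!\n";
open( O, '>', 'Y.out' ) or die "Y.out: $!\n";
while (<I>) {
    my $k = ( split /\|/ )[2];
    if ( $key{$k} eq '2' ) {
        s/$/|D/;
    }
    elsif ( $_ ne $common{$k} ) {
        s/$/|U/;
    }
    print O;
}
close O;
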
Note that if you have duplicate keys with varying data in the first file (i.e. multiple rows with the same key but different values in other fields), and those keys also show up in the second file, there will probably be trouble. The above approach only keeps track of one row value for a given key.

(update: added the missing file handle arg 'O' on the print statements)
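
If holding every common row from X in memory turns out to be too much, one possible variation (a sketch only, not something I've run against real data) is to keep a fixed-size digest of each common row instead of the row itself. The sub name tag_rows and the in-memory sample rows are made up for illustration; Digest::MD5 is a core module, and the sketch assumes the rows are plain byte strings. It still tracks only one row per key, so the duplicate-key caveat applies here too, and two different rows could in principle share a digest, though that is very unlikely for ordinary data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Tag X and Y rows, keeping only a 16-byte MD5 digest per common key
# instead of the full X row. $key maps key => '1', '2' or '12'.
# (Hypothetical helper; the key is taken from the third field.)
sub tag_rows {
    my ( $key, $x_rows, $y_rows ) = @_;
    my ( %digest, @x_out, @y_out );

    for my $row (@$x_rows) {
        chomp( my $line = $row );
        my $k = ( split /\|/, $line, -1 )[2];
        if ( $key->{$k} eq '1' ) { $line .= '|I' }              # only in X
        else                     { $digest{$k} = md5($line) }   # also in Y
        push @x_out, $line;
    }
    for my $row (@$y_rows) {
        chomp( my $line = $row );
        my $k = ( split /\|/, $line, -1 )[2];
        if    ( $key->{$k} eq '2' )         { $line .= '|D' }   # only in Y
        elsif ( md5($line) ne $digest{$k} ) { $line .= '|U' }   # content differs
        push @y_out, $line;
    }
    return ( \@x_out, \@y_out );
}

# Sample data only; in practice %key would come from xy-keys.union.
my %key = ( k1 => '1', k2 => '12', k3 => '2', k4 => '12' );
my ( $x_out, $y_out ) = tag_rows(
    \%key,
    [ "a|b|k1|x\n",       "a|b|k2|x\n", "a|b|k4|same\n" ],
    [ "a|b|k2|CHANGED\n", "a|b|k3|x\n", "a|b|k4|same\n" ],
);
print "X: $_\n" for @$x_out;
print "Y: $_\n" for @$y_out;
```

The trade-off is an extra digest computation per row in exchange for a memory footprint that no longer grows with row width.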


In reply to Re: file delta detection by graff
in thread file delta detection by Anonymous Monk
