You don't state the capabilities of the machine you are running on or the width of the records. For example, if they are 10K wide, you would need a monstrous machine to load the files into Perl memory.
So, to provide a reasonably general solution that should work on most machines, I would do most of the work with unix power tools, starting with:
awk 'BEGIN{ FS="|" } { print $3 "|" $0 }' < file1 | sort > mod1.sor
awk 'BEGIN{ FS="|" } { print $3 }' < file1 | sort > keys1.sor
awk 'BEGIN{ FS="|" } { print $3 "|" $0 }' < file2 | sort > mod2.sor
awk 'BEGIN{ FS="|" } { print $3 }' < file2 | sort > keys2.sor
comm -12 keys1.sor keys2.sor > xk.sor
comm -13 mod1.sor mod2.sor > t.sor
comm -23 mod1.sor mod2.sor > d.sor
The xk.sor file holds the keys common to both files. t.sor (with the key column now prepended) holds the records destined to become either T or U (each must be one or the other), and d.sor holds the mixture of D and U records from the older file (more precisely, whichever of the two input files carries the older data), also with the key prepended.
Now you can load xk.sor into Perl as hash keys to identify the 'U' records in t.sor (all the others being T). The same hash can be used to pick out the 'U' records to drop from d.sor (all the others being D). The prepended key column can be stripped off at output time.
As for the Perl language elements needed: hashes to hold the keys, split to break delimited records into arrays, shift to remove the first element from an array, open to open files for input or output, and the <> operator to read from them. Need anything more for this? (Update: apart from print, to output the lines with their appendages minus the key we prepended to make unix sort straightforward. sort does have a key-definition facility, but it is awkward with delimiters, and sort's built-in disk-spilling is what makes huge-file processing tractable.)
My fellow monk has already provided a good perl solution. I just want to add that if you're on a unix box, your problem can be solved with a couple of unix commands:
cut -f3 -d'|' X | sort > outx.txt
cut -f3 -d'|' Y | sort > outy.txt
comm outx.txt outy.txt
(comm with no flags prints three columns: keys only in X, keys only in Y, and keys in both.)
Fair warning, I haven't tested these lines. There may be syntax typos. If there are, just read the man pages for cut, sort, and comm. Enjoy.
UPDATE: Simon provides a much more encompassing answer with unix solutions.
cut -f3 -d\| X | sort > outx.txt
cut -f3 -d'|' Y | sort > outy.txt
Provided that part is done right, I prefer using "cut" rather than "awk" -- it's just easier (and probably faster).
All assuming that the key value is unique!
Sorry about the long lines ...
#!/usr/bin/perl
# Just For A Lark
use strict;
use warnings;
use Digest::MD5;

# Hash for X
my %X_h;
{
    open(my $IN, '<', 'X')
        or die "Can't open 'X'! $!";
    # Populate hash: third field => SizeAndMD5 of the whole record
    while (<$IN>) {
        chomp;
        $X_h{ (split /\|/)[2] } = SizeAndMD5($_);
    }
    close($IN)
        or die "Can't close 'X'! $!";
}
### %X_h

# Processing Y
# Hash for Y
my %Y_h;
{
    open(my $IN, '<', 'Y')
        or die "Can't open 'Y'! $!";
    open(my $OUT, '>', 'Y.new')
        or die "Can't open 'Y.new'! $!";
    # Populate hash: third field => SizeAndMD5 of the whole record
    while (<$IN>) {
        chomp;
        my $key = (split /\|/)[2];
        $Y_h{$key} = SizeAndMD5($_);
        if (!exists $X_h{$key}) {            # No key in X so append a "D"
            print $OUT "$_|D\n";
        }
        elsif ($Y_h{$key} ne $X_h{$key}) {   # The corresponding record in X is different
            print $OUT "$_|U\n";
        }
        else {                               # The corresponding record in X is the same
            print $OUT "$_|\n";
        }
    }
    close($OUT)
        or die "Can't close 'Y.new'! $!";
    close($IN)
        or die "Can't close 'Y'! $!";
}
### %Y_h

# Re-processing X
{
    open(my $IN, '<', 'X')
        or die "Can't open 'X'! $!";
    open(my $OUT, '>', 'X.new')
        or die "Can't open 'X.new'! $!";
    # Compare each record's third field against Y's hash
    while (<$IN>) {
        chomp;
        my $key = (split /\|/)[2];
        if (!exists $Y_h{$key}) {            # No key in Y so append an "I"
            print $OUT "$_|I\n";
        }
        elsif ($X_h{$key} ne $Y_h{$key}) {   # The corresponding record in Y is different
            print $OUT "$_|U\n";
        }
        else {                               # The corresponding record in Y is the same
            print $OUT "$_|\n";
        }
    }
    close($OUT)
        or die "Can't close 'X.new'! $!";
    close($IN)
        or die "Can't close 'X'! $!";
}

exit;

# Returns Size and MD5 of the string as "%8.8lx%32.32s"
sub SizeAndMD5 {
    return sprintf("%8.8lx%32.32s", length($_[0]), Digest::MD5::md5_hex($_[0]));
}
The unix-oriented replies will probably help the most, given the quantity of data you have, but in case the size of the key field isn't terribly large, an old perl utility of mine might be useful here: cmpcol.
If the two input files are really large (lots of wide fields in each row), you probably want to read them just once to get the key distribution info, then just once more to create the updated version of each file. To that end, cmpcol would do the first step like this:
cmpcol -d '|' -us X:3 Y:3 > xy-keys.union
(updated to include ":3" as the key-column specifier on each file name)
The output will have one line per distinct key value, followed by a space, then a token indicating where the key was found, e.g.:
1st_key_value <1
2nd_key_value <12
3rd_key_value <2
...
(Keys are listed in ascii-betic sorted order.) Keys found in both files get <12, keys only in X or only in Y get <1 or <2 respectively.
(If a key occurs more than once in a given file, you'll see a "+" next to the file number, i.e.: <+12 or <12+ or <+12+ for "non-unique in file 1, in file 2, in both files", respectively.)
Once you have that output, it's simple to append "|I" to records in X, and "|D" to records in Y. Adding "|U" (to file Y? to file X? to both files?) is trickier, because you have to work out whether the '<12' keys occur with identical or differing content in the two source files, and depending on the quantity of data, this might cause issues with memory consumption. But it's worth making a first try with an approach that's easy to code up -- for simplicity, I'll assume that the "|U" thing only needs to go into file Y (if it only needs to go into file X, just change the code to read Y first, then X):
#!/usr/bin/perl
use strict;

my %key;
open( C, '<', 'xy-keys.union' ) or die "xy-keys.union: $!\n";
while (<C>) {
    chomp;
    my ( $k, $v ) = ( /^ (.*) \s < \+? ([12]+) \+? $/x );   # ignore dup.key (+) marks
    $key{$k} = $v;
}

my %common;
open( I, '<', 'X' ) or die "X: $!\n";
open( O, '>', 'X.out' ) or die "X.out: $!\n";
while (<I>) {
    my $k = ( split /\|/ )[2];
    if ( $key{$k} eq '1' ) {
        s/$/|I/;
    }
    else {
        $common{$k} = $_;
    }
    print O;
}

open( I, '<', 'Y' ) or die "Y: $!\n";
open( O, '>', 'Y.out' ) or die "Y.out: $!\n";
while (<I>) {
    my $k = ( split /\|/ )[2];
    if ( $key{$k} eq '2' ) {
        s/$/|D/;
    }
    elsif ( $_ ne $common{$k} ) {
        s/$/|U/;
    }
    print O;
}
Note that if you have duplicate keys with varying data in the first file (i.e. multiple rows with the same key but different values in other fields), and those keys also show up in the second file, there will probably be trouble. The above approach only keeps track of one row value for a given key.
(update: added the missing file handle arg 'O' on the print statements)