
You don't state the capabilities of the machine you are running on or the width of the records. If they are, say, 10K wide, you would need a monstrous machine to load the files into Perl's memory.

So, for a reasonably general solution that should work on most machines, I would do much of the work with Unix power tools, starting with:

# -F"'" sets the field separator to a single quote; $3 is the key field.
# file1 holds the older data, file2 the newer.
awk -F"'" '{ print $3 "|" $0 }' file1 | sort > mod1.sor
awk -F"'" '{ print $3 }' file1 | sort > keys1.sor
awk -F"'" '{ print $3 "|" $0 }' file2 | sort > mod2.sor
awk -F"'" '{ print $3 }' file2 | sort > keys2.sor
comm -12 keys1.sor keys2.sor > xk.sor   # keys present in both files
comm -13 mod1.sor mod2.sor > t.sor      # records found only in file2
comm -23 mod1.sor mod2.sor > d.sor      # records found only in file1

The xk.sor file has the keys common to both files. t.sor (with the key column now prepended) contains the mixture of records destined to be either T or U (each must be one or the other), and d.sor contains the mixture of D and U records from the older file (or rather, the file holding the older data of the two inputs to this process), also with the key prepended.

Now you can load xk.sor into Perl as hash keys to identify the 'U' records in t.sor (all the others being T). The same hash can be used to eliminate the 'U' records from d.sor (all the others being D). You can strip the prepended key column at output time.

As for the Perl language elements needed: hashes to hold the keys; the split function to split delimited records into arrays; shift to remove the first element from an array; open to open files for input or output; and the <> operator to read from files. Need anything more for this? (Update: apart from print, to output the lines with their appendages, minus the key on the front. We prepended that key to make Unix sort work without difficulty: sort does let you define a key, but that is too awkward with these delimiters. On the other hand, Unix sort has built-in disk-swapping facilities for processing huge files.)
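
Putting those pieces together, here is a minimal sketch of the Perl step, not a finished program. It assumes that the key itself never contains "|", that xk.sor holds one key per line, and that the tag is appended to each record after a tab. The output format and the sub names load_keys and tag_records are my own choices for illustration. It uses split with a limit of 2 rather than split plus shift, so any "|" inside the record itself survives.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Load the common keys (one per line) into a hash for fast lookups.
sub load_keys {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my %seen;
    while (my $key = <$fh>) {
        chomp $key;
        $seen{$key} = 1;
    }
    close $fh;
    return \%seen;
}

# Stream a "key|record" file line by line and tag each record.
# $tag_common is used when the key is in %$keys, $tag_other when it is not.
# An undef tag means "drop these records".
sub tag_records {
    my ($path, $keys, $tag_common, $tag_other, $out) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';
        # Limit of 2: split off the prepended key only, keep the record whole.
        my ($key, $record) = split /\|/, $line, 2;
        my $tag = exists $keys->{$key} ? $tag_common : $tag_other;
        next unless defined $tag;
        # Assumed output format: original record, a tab, then the tag.
        print {$out} "$record\t$tag\n";
    }
    close $fh;
}

# Run only when given the three file names: xk.sor t.sor d.sor
if (@ARGV == 3) {
    my ($xk, $t, $d) = @ARGV;
    my $keys = load_keys($xk);
    tag_records($t, $keys, 'U',   'T', \*STDOUT);  # newer side: U or T
    tag_records($d, $keys, undef, 'D', \*STDOUT);  # older side: drop U, keep D
}
```

Saved as, say, tag.pl (a made-up name), it would be run as: perl tag.pl xk.sor t.sor d.sor > delta.out. Only the keys are held in memory; the records themselves are streamed, which is the point of doing the heavy lifting with sort and comm first.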

One world, one people

In reply to Re: file delta detection by anonymized user 468275
in thread file delta detection by Anonymous Monk
