Others have said this as well, but the UNIX utilities (sort, diff, sed) will make your life MUCH easier here.

I dealt with this problem a few years ago when I worked for a now-dead price-comparison site. We were getting CSV/TSV data dumps from online vendors daily. Some of these files were 300+ MB in size (e.g. 500,000 books), and we only wanted what had changed since the previous dump.

Our system was a pretty complex perl app, with config files for each vendor that described what the format of the file was, how to clean it up (none of them delivered 100% clean CSV files), what column to sort on, etc.

The perl app didn't do any actual file processing itself - it was simply an easy way to handle config files and pass arguments to the various UNIX utils. It worked something like this:

Remove the header line (if needed), e.g. with tail -n +2 or sed 1d.
Use sed to clean up any potential issues, reformatting if necessary to make things easier for sort.
Use sort to re-order the file based on the unique-key column.
If necessary, use uniq to strip out duplicate rows (some of the vendors had multiple entries they didn't even know about).
diff the newly-generated file against the last one we processed, to see what changed.
Parse the results of the diff to determine which rows were adds, which were deletes, and which were changes. Those became the basis for SQL insert/delete/update statements against the main product DB.
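
The steps above can be sketched in plain shell. This is only an illustration, not the original app: it assumes tab-separated files with a header line and the unique key in the first column, and the function names normalize and classify are made up for the example. Real vendor files would need their own sed cleanup rules.

```shell
#!/bin/sh
# Sketch only. Assumptions: TSV input, one header line, unique key in column 1.
# "normalize" and "classify" are hypothetical helper names.

TAB=$(printf '\t')

# normalize RAW OUT
# Drop the header, clean the rows, sort on the key column, remove duplicate rows.
normalize() {
    tail -n +2 "$1" |                       # skip the header line
        tr -d '\r' |                        # strip DOS carriage returns
        sed 's/ *$//' |                     # strip trailing spaces
        LC_ALL=C sort -t "$TAB" -k1,1 |     # byte-wise sort on the key column
        uniq > "$2"                         # identical rows are now adjacent
}

# classify OLD_SORTED NEW_SORTED
# Read the diff and print one tab-separated line per changed key:
#   DELETE <key>     key only on the old side
#   INSERT <row>     key only on the new side
#   UPDATE <row>     key on both sides, row contents differ
# Only the changed rows are held in memory by awk.
classify() {
    diff "$1" "$2" | awk -F '\t' '
        /^< / { sub(/^< /, ""); old[$1] = $0 }
        /^> / { sub(/^> /, ""); new[$1] = $0 }
        END {
            for (k in old) {
                if (k in new) print "UPDATE\t" new[k]
                else          print "DELETE\t" k
            }
            for (k in new) {
                if (!(k in old)) print "INSERT\t" new[k]
            }
        }' | LC_ALL=C sort
}

# Example use (file names are placeholders):
#   normalize yesterday.tsv yesterday.sorted
#   normalize today.tsv     today.sorted
#   classify  yesterday.sorted today.sorted > changes.tsv
```

The classify output is what you would turn into insert/delete/update statements. Both files must be sorted the same way (hence LC_ALL=C in both places), otherwise diff will report rows as changed that merely moved.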

This saved our bacon. We were drowning in data (about 5G/day, when our average server was a 400MHz Pentium w/ 256M of RAM and 10G of storage), and only about 3-5% of the rows in any given file changed from the previous dump.

If your data is of any appreciable size, don't do the actual file-processing in perl; use the unix utils. In our experience they were much faster and more memory-efficient than anything we could do in perl.

In reply to Re: CSV Diff Utility by swngnmonk
in thread CSV Diff Utility by Limbic~Region
