in reply to CSV Diff Utility

Sort the files by the key field prior to comparison, rewriting each row so that the key field comes first, followed by the remaining fields in ASCIIbetical order (a potentially very expensive upfront operation).
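A minimal Perl sketch of that normalization, assuming the simplest case: hypothetical rows, naive comma splitting with no quoted or embedded commas, and the key in a known column. Each row is rewritten with the key first and the other fields in ASCIIbetical order, then the rows are sorted:

```perl
use strict;
use warnings;

# Rewrite a row so the key field comes first, followed by the
# remaining fields in ASCIIbetical (default string-sort) order.
# Naive split on commas -- valid only when fields contain no
# embedded commas or newlines.
sub normalize_row {
    my ($row, $key_idx) = @_;
    my @fields = split /,/, $row;
    my $key    = splice @fields, $key_idx, 1;   # pull out the key
    return join ',', $key, sort @fields;
}

# Hypothetical sample data with the key in column 0:
my @rows   = ( "b,2,x", "a,9,m", "a,1,z" );
my @sorted = sort map { normalize_row($_, 0) } @rows;
print "$_\n" for @sorted;
```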

If you have the unix-style "head", "tail", and "sort" utilities, you may be able to reduce the cost by opening the inputs this way:

    $old_header = `head -1 $old_csv`;   # split that later
    # get $new_header the same way, if it differs from $old_header
    open( OLD, "tail -n +2 $old_csv | sort |" ) or die $!;
    open( NEW, "tail -n +2 $new_csv | sort |" ) or die $!;
    # proceed with interleaved reading as planned...
Of course, this assumes that no row of csv data contains newlines, which may not hold. If some data fields may contain newlines, you'll need to parse the csv first, then sort; in that case you might consider storing key values and byte offsets/lengths in a hash, so you can sort the hash keys and rewrite the data records to a sorted file by seeking to and reading each record in turn.
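The offset-index idea might be sketched like this, under several simplifying assumptions: the key is the first comma-separated field, keys are unique, records are whole lines (no embedded newlines), and the file has no encoding layer so character and byte counts agree. The names `sort_csv_by_key`, `demo.csv`, and `demo.sorted` are hypothetical:

```perl
use strict;
use warnings;

sub sort_csv_by_key {
    my ($in_file, $out_file) = @_;
    open my $in, '<', $in_file or die "$in_file: $!";
    my $header = <$in>;

    # Pass 1: record each data record's key, byte offset, and length.
    # Assumes unique keys; duplicate keys would clobber earlier entries.
    my %index;                       # key => [ offset, length ]
    my $offset = tell $in;
    while ( my $line = <$in> ) {
        my ($key) = split /,/, $line, 2;
        $index{$key} = [ $offset, length $line ];
        $offset = tell $in;
    }

    # Pass 2: seek to each record in key order and copy it out.
    open my $out, '>', $out_file or die "$out_file: $!";
    print {$out} $header;
    for my $key ( sort keys %index ) {
        my ($pos, $len) = @{ $index{$key} };
        seek $in, $pos, 0 or die "seek: $!";
        my $record;
        read $in, $record, $len or die "read: $!";
        print {$out} $record;
    }
    close $out;
    close $in;
}

# Demo with a throwaway file:
open my $fh, '>', 'demo.csv' or die $!;
print {$fh} "id,name\n3,carol\n1,alice\n2,bob\n";
close $fh;
sort_csv_by_key( 'demo.csv', 'demo.sorted' );
```

Only the (small) key/offset index lives in memory; the record bodies stay on disk until the final rewrite, which is the point of the approach for large files.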

Re^2: CSV Diff Utility
by Limbic~Region (Chancellor) on Jun 23, 2004 at 12:34 UTC
    graff,
    I have considered *nix utilities, but they require more assumptions than just the absence of embedded newlines. For instance, if the key field is the 3rd column, you need to sort by the 3rd column, not the first. CSV can get quite messy, and even with the power of awk, the command will likely have to be changed for each new type of CSV encountered.

    Originally, I figured that if I needed to pre-process the file anyway, I might as well do the sort in Perl. That could be a "bad" idea, given that the size of these files is currently unknown (at least to me). I really like the idea of indexing the key information and sorting that rather than the whole file. I will give this some more thought and perhaps come up with a hybrid.
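    If the sort does move into Perl, a minimal sketch of sorting by an arbitrary key column via a Schwartzian transform, using hypothetical `@lines` data and the same naive comma splitting that would break on quoted fields:

```perl
use strict;
use warnings;

# Hypothetical sample rows; in practice these would come from the
# CSV file with the header stripped.
my @lines = ( "x,y,b", "p,q,a", "m,n,c" );

# Sort by an arbitrary key column -- here the 3rd (index 2) --
# using a Schwartzian transform so each key is extracted only once.
# The naive split /,/ mishandles quoted fields containing commas,
# so a messy CSV would need real parsing (e.g. Text::CSV) first.
my $key_col = 2;
my @sorted =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] }
    map  { [ (split /,/)[$key_col], $_ ] }
    @lines;

print "$_\n" for @sorted;
```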

    Cheers - L~R