File::Sort is really not suited to parsing CSV files properly. If they are simple enough, it's possible, but CSV files rarely remain simple enough.
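To illustrate what "not simple enough" can look like, here is a minimal sketch using only core Perl. It assumes the trouble comes from a quoted field that contains the delimiter, which is one common case and not necessarily the one the original poster hit. The sample rows are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two made-up CSV rows with two fields each. The first has a quoted field
# that contains a comma.
my @rows = ('"Smith, John",42', 'Doe,17');

# Naive field splitting, as any delimiter-based tool would do it.
my @counts;
for my $row (@rows) {
    my @fields = split /,/, $row;
    push @counts, scalar @fields;
    printf "%-20s => %d fields\n", $row, scalar @fields;
}
```

The first row comes out as three fields instead of two, so anything that picks a sort key by counting delimiters will key on the wrong column for that row.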

There is no single generic answer to how to compare two CSV files. It depends on whether you can read both files into memory, and on whether their fields match. The simplest method is to normalize both files, sort them, and then diff them.

The simplest way to normalize them is to parse them and write them back out. If you do this with the same module and the same options for each file, then in theory any rows with the same values will come out identical.

Normalizing with Text::CSV_XS is straightforward:

#!/usr/bin/perl

use Text::CSV_XS;
use warnings;
use strict;


{
    die("usage: $0 [<file>]\n") if @ARGV > 1;
    my($file, $fh);
    if (@ARGV) {
        $file = $ARGV[0];
        open($fh, '<', $file)
            || die("Unable to open file '$file': $!.\n");
    } else {
        $file = '-';
        $fh   = \*STDIN;
    } 

    my $csv = Text::CSV_XS->new({ binary => 1, eol => "\015\012" });
    while (my $row = $csv->getline($fh)) {
        $csv->print(\*STDOUT, $row);
    }

    # error_diag returns a list in list context, so concatenate it to get
    # just the message string.
    die("Error parsing CSV file '$file': " . $csv->error_diag . "\n")
        if $csv->error_diag and not $csv->eof;
}

(My first pass used *ARGV, but this results in some odd diagnostics and weird edge cases.)

At this point, you simply sort the output. Field values and the header are irrelevant; you're only trying to make all of your CSV files consistent so that diff can make some sense of them.

diff -u <(csv-normalize csv1.csv | sort) <(csv-normalize csv2.csv | sort)

This is the simplest and quickest way of comparing two CSV files. It has the advantage of being able to work on relatively large CSV files quickly, but it won't work if the field layout differs between them.
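If the only layout difference is column order, one possible workaround is to rewrite each file with its columns in sorted header-name order before sorting and diffing. The following is a rough sketch, not a tested tool. It assumes the first row of each file is a header, that column names are unique, and that both files use the same names. The `normalize_by_header` helper is a hypothetical name of mine, and it uses separate Text::CSV_XS objects for reading and writing so that the `eol` setting only affects output.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical helper: copy a CSV stream from $in to $out with the columns
# rearranged into sorted header-name order.
# Assumes the first row is a header and that column names are unique.
sub normalize_by_header {
    my ($in, $out) = @_;
    my $reader = Text::CSV_XS->new({ binary => 1 });
    my $writer = Text::CSV_XS->new({ binary => 1, eol => "\015\012" });

    my $header = $reader->getline($in)
        or die("No header row found.\n");
    my @names = sort @$header;

    # Tell the reader which name belongs to which column position, so
    # getline_hr can return each row as a hash keyed by column name.
    $reader->column_names(@$header);

    $writer->print($out, \@names);
    while (my $row = $reader->getline_hr($in)) {
        $writer->print($out, [ @{$row}{@names} ]);
    }

    die("Error parsing CSV: " . $reader->error_diag . "\n")
        unless $reader->eof;
}

# Only run against a file when one is named on the command line.
if (@ARGV) {
    open(my $fh, '<', $ARGV[0])
        || die("Unable to open file '$ARGV[0]': $!.\n");
    normalize_by_header($fh, \*STDOUT);
}
```

Used in place of csv-normalize in the diff pipeline, this puts the header line wherever sort happens to place it, which is fine for the same reason as before: both files get the same treatment.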

In reply to Re: File::Sort issues by Somni
in thread File::Sort issues by Anonymous Monk