
*winces* Ok. Some meta-coding knowledge seems to be demanded here.

Doing what you're doing is an exercise in normalization, not comparison.
Doing what you're doing is a good way to go insane.

To use what I discussed, we need to do each step in order.

1. Read in the data
2. Normalize the data
3. Compare the data

Reading it in is easy. @file1 = <FILE1>. Whee!
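If you want the read step to complain when a file is missing, a slightly more defensive version might look like the sketch below. The helper name read_lines and the lexical filehandle are my assumptions for illustration, not something from the original code; pass in whatever your file is actually called.

```perl
use strict;
use warnings;

# read_lines is a hypothetical helper, not part of the original post.
# It returns every line of the file at $path, newlines still attached.
sub read_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    my @lines = <$fh>;
    close $fh;
    return @lines;
}
```

You would then write my @file1 = read_lines($path_to_first_file); in place of the bareword filehandle read.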

The normalization part seems to be tripping you up. This step takes data from a source and manipulates it into a form you can easily work with. The idea is to populate a second data structure and then work only with that data structure.

So, you'd read into an array. For each element in that first array, you'd manipulate it and put it into a second data structure (hash, array, whatever). You'd then use that second data structure for any operations, such as comparisons. This way, you know that all your data sources speak the same language.
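As a toy illustration of that idea (the sample data is made up, and real normalization rules depend on your files), here is a sketch that turns three differently formatted spellings into keys of one hash:

```perl
use strict;
use warnings;

# Made-up sample data: the same name spelled in different ways.
my @raw = ("  Smith \n", "JONES\n", "smith\n");

my %normalized;
for my $item (@raw) {
    my $key = uc $item;        # Same case everywhere
    $key =~ s/^\s+|\s+$//g;    # Drop leading and trailing whitespace, newline included
    $normalized{$key} = 1;     # Duplicates collapse into one hash key
}

print join(", ", sort keys %normalized), "\n";
```

Both spellings of Smith end up under the single key SMITH, which is the whole point: after this step, nothing downstream has to care how the source wrote it.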

So, what you'd do is something like:

my @file1 = <FILE1>;
my %file1 = normalize_file1(@file1);

my @file2 = <FILE2>;
my %file2 = normalize_file2(@file2);

# Do the comparisons here. Use what I gave before.

So, we've brought it down to just the normalization procedure. As you've noticed, this is easily the most complex part of the whole deal. Let's take the first file as an example to work with. (You do the other one. *grins*)

Design: You're getting a comma-delimited line. You're interested in one field, which will be in one of two formats. What you want to compare is a manipulated form of that field. (This assumes that the name is the third field.)
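To make the manipulation concrete, here is a sketch of it applied to a single field value. The helper name name_key and the sample names are hypothetical; unlike the full routine, it returns nothing for a malformed name rather than dying.

```perl
use strict;
use warnings;

# name_key is a hypothetical helper for illustration only.
# It builds a key from the first word plus the first two letters of the last word.
sub name_key {
    my ($field) = @_;
    my @parts = split ' ', uc $field;            # split ' ' ignores leading whitespace
    return unless @parts == 2 || @parts == 3;    # Anything else is a bad name
    return "$parts[0] " . substr($parts[-1], 0, 2);
}

# A two-part and a three-part name produce the same key.
print name_key("Smith Steve"), "\n";
print name_key("smith j steve"), "\n";
```

Both calls print SMITH ST, so the two formats of the field become indistinguishable once normalized.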

sub normalize_file1 {
    my @file1 = @_;

    my %file1;

    LINE:
    foreach my $line (@file1) {
        chomp $line;    # Without this, a blank line still splits to one field
        my @fields = split /,/, $line;
        next LINE unless @fields;

        # Note the use of uc here. split ' ' also skips leading whitespace.
        my @name = split ' ', uc $fields[2];

        # Declared outside the if so it is still in scope for the hash assignment.
        my $name;
        if (@name == 3) { # Have middle name
            $name = "$name[0] " . substr($name[2], 0, 2);
        } elsif (@name == 2) { # No middle name
            $name = "$name[0] " . substr($name[1], 0, 2);
        } else { # Error state
            die "Bad name in normalize_file1(): $line\n";
        }
        $file1{$name} = 1;
    }

    return %file1;
}

(For the anal-retentive among you: yes, I could've used hashrefs and arrayrefs. Why confuse the issue when this works just as well algorithm-wise, if less efficiently?)

This will take the array of lines from FILE1 and return a hash whose keys look like "SMITH ST", for example. You would then write a similar function for FILE2. Now, don't go nuts about data-entry errors. Your program exists solely to take data and manipulate it. You're not writing an error-correction program here.
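With both hashes built, the comparison itself is little more than key lookups. The sketch below is one common way to do it and is not necessarily the code from the earlier reply; the two hashes are hard-coded stand-ins for what normalize_file1() and normalize_file2() would return, and the names in them are invented.

```perl
use strict;
use warnings;

# Stand-ins for the normalized hashes; the keys are made-up sample data.
my %file1 = ('SMITH ST' => 1, 'JONES AL' => 1);
my %file2 = ('SMITH ST' => 1, 'BROWN KI' => 1);

# Keys present in both hashes, and keys present in only one of them.
my @in_both       = sort grep {  exists $file2{$_} } keys %file1;
my @only_in_file1 = sort grep { !exists $file2{$_} } keys %file1;
my @only_in_file2 = sort grep { !exists $file1{$_} } keys %file2;

print "In both:       @in_both\n";
print "Only in file1: @only_in_file1\n";
print "Only in file2: @only_in_file2\n";
```

Because both sides were normalized to the same key format first, exists is all the comparison logic you need.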

------
We are the carpenters and bricklayers of the Information Age.

Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

In reply to Re: Re: Re: Comparing two files by dragonchild
in thread Comparing two files by bman
