in reply to Re: Re: Comparing two files
in thread Comparing two files

*winces* Ok. Some meta-coding knowledge seems to be demanded here.
  1. Doing what you're doing is an exercise in normalization, not comparison.
  2. Doing what you're doing is a good way to go insane.
To use what I discussed, we need to do each step in order.
  1. Read in the data
  2. Normalize the data
  3. Compare the data
Reading it in is easy. @file1 = <FILE1>. Whee!

The normalization part seems to be tripping you up. What this step entails is to take data from a source and manipulate it so that it is in a form you can easily work with. The idea is to then populate a second data structure, then work with that data structure.

So, you'd read into an array. For each element in that first array, you'd manipulate it and put it into a second data structure (hash, array, whatever). You'd then use that second data structure for any operations, such as comparisons. This way, you know that all your data sources speak the same language.

So, what you'd do is something like:

    my @file1 = <FILE1>;
    my %file1 = normalize_file1(@file1);

    my @file2 = <FILE2>;
    my %file2 = normalize_file2(@file2);

    # Do the comparisons here. Use what I gave before.

So, we've brought it down to just the normalization procedure. As you've noticed, this is easily the most complex part of the whole deal. Let's take the first file as an example to work with. (You do the other one. *grins*)

Design: You're getting a comma-delimited line. You're interested in one field. That field will be in one of two formats. What you want to compare is a manipulated form of that field. (This assumes that the name is the third field.)

    sub normalize_file1 {
        my @file1 = @_;
        my %file1;

        LINE:
        foreach my $line (@file1) {
            my @fields = split /,/, $line;
            next LINE unless @fields;

            # Note the use of uc here.
            my @name = split /\s+/, uc $fields[2];

            # Declared out here so it's still in scope for the hash assignment below.
            my $name;
            if (@name == 3) {
                # Have middle name
                $name = "$name[0] " . substr($name[2], 0, 2);
            }
            elsif (@name == 2) {
                # No middle name
                $name = "$name[0] " . substr($name[1], 0, 2);
            }
            else {
                # Error state
                die "Bad name in normalize_file1(): $line\n";
            }

            $file1{$name} = 1;
        }

        return %file1;
    }

(For those anal-retentive people, yes, I could've used hashrefs and listrefs. Why confuse the issue when this works just as well algorithm-wise, if less efficiently.)

This will take the array of lines from FILE1 and return a hash, whose keys are "SMITH ST", for example. You would then write a similar function for FILE2. Now, don't go nuts about data entry error. Your program exists solely to take data and manipulate it. You're not writing an error-correction program here.
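
Once both hashes exist, the comparison step is mostly key lookups. Here's a minimal sketch of that, assuming all you need is which names show up in both files and which show up in only one. The sample keys are made up, and the two hashes are filled in by hand as stand-ins for what normalize_file1() and normalize_file2() would return.

```perl
use strict;
use warnings;

# Hypothetical normalized hashes, standing in for the return values
# of normalize_file1() and normalize_file2().
my %file1 = ('SMITH ST' => 1, 'JONES AL' => 1, 'BROWN CH' => 1);
my %file2 = ('SMITH ST' => 1, 'BROWN CH' => 1, 'DAVIS MA' => 1);

# Names present in both files: a single hash lookup per key.
my @in_both = grep { exists $file2{$_} } sort keys %file1;

# Names present in only one of the files.
my @only_in_1 = grep { !exists $file2{$_} } sort keys %file1;
my @only_in_2 = grep { !exists $file1{$_} } sort keys %file2;

print "Both:       ", join(', ', @in_both),   "\n";
print "Only file1: ", join(', ', @only_in_1), "\n";
print "Only file2: ", join(', ', @only_in_2), "\n";
```

The sort is only there to make the output stable. Drop it if you don't care about order.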

------
We are the carpenters and bricklayers of the Information Age.

Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Re: Re: Re: Re: Comparing two files
by bman (Sexton) on Sep 13, 2001 at 16:23 UTC
    It's only fair to follow up on my own question. Thanks to "dragonchild" for the insights and tips on how to solve my issue. The good news is that I have solved that issue, but not the way you showed me (hashes). I know, some will say that that would be the best way to approach it, but at this point, at least, I try to use the skills I know to their fullest. Plus, I was in a time crunch, so I could not spend too much time on dissecting hashes and making them work. In any case, here is my final code that works:

    foreach (@matched) {
        my @record = split(/,/, $_);
        my $lic = $record[$#record];
        chomp($lic);
        my ($lname, $fname) = split(/\s+/, uc($record[5]));
        my $name = "$lname " . substr($fname, 0, 2);

        # If we find a matching record, then we write it out to a file
        if (grep /$name/, @phoneBook) {
            my @line = grep /$name/, @phoneBook;
            foreach my $rec (0 .. $#line) {
                if ($rec == 0) {
                    # In case I had more than one record (like the same user multiple times), I only limit results to 1
                    @line = split(/,/, $line[$rec]);
                    print RESULTS "$record[0],UNKNOWN,$lic,SLC,$line[0],$line[1],$line[2],$line[3],$line[4],$name\n";
                }
            }
        }
        else {
            chomp(@record);
            print UNMATCHED "$record[0],$record[5],$record[7],$record[2],$record[3],$record[4],UNKNOWN,\\N,\\N\n";
        }
    }

    Using Perl's grep function solved my problem in a couple of lines, rather than trying to write functions and use hashes.... :-)

    But again, thanks for all your help!

      First off, this iteration is significantly improved over the last version. This is a very good thing!

      Secondly, I'm glad you found my response useful. That you choose not to use hashes is your own business. I just call'em as I see'em.

      A few thoughts on your current solution:

      1. Good usage of my. I commend you.
      2. While chomp does work over a list, I had to look that up just now. I suspect that most Perl'ers wouldn't know that, either. I'd recommend commenting that. (Or, I'm just stupid, which is a distinct possibility!)
      3. chomp the line, then split it. It's more intuitive. Or, just do the chomp over @matched. Plus, you chomp @record when it's already been chomped above. That's redundant.
      4. You can find a better variable name than $lic. Don't shorten a variable name. You'll spend more time trying to figure out what that variable was than typing a longer name. Or, since you only use it once, why even create it?
      5. You do a grep twice through @phoneBook. It's better to get the matches, then check to see if you have any. Only one grep, which can be an expensive action.
      6. If you only want to work with the first match, why loop through all the matches? If you only want the first, use $line[0] or, even better, discard all the matches but the first.
      7. If you're doing something across 5 records, use an array slice or a for-loop. It's harder to read if it's all written out. I know that if I see five explicit accesses to an array, I'm looking for a reason. Hopefully, the reason isn't that the writer doesn't know about slicing or for-loops. :)

      chomp @matched;
      foreach my $match (@matched) {
          my @record = split(/,/, $match);
          my ($lname, $fname) = split(/\s+/, uc($record[5]));
          my $name = "$lname " . substr($fname, 0, 2);

          if (my ($first_matched) = grep /$name/, @phoneBook) {
              my @line = split /,/, $first_matched;
              print RESULTS "$record[0],UNKNOWN,$record[$#record],SLC,";
              print RESULTS "$line[$_]," for (0 .. 4);
              print RESULTS "$name\n";
          }
          else {
              print UNMATCHED "$record[$_]," for (0,5,7,2,3,4);
              print UNMATCHED "UNKNOWN,\\N,\\N\n";
          }
      }
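
      For the array-slice flavor of point 7, something like the following sketch would do. It builds the UNMATCHED line with a slice and join instead of a for-loop. The @record contents are made up for illustration, and it prints to STDOUT rather than the UNMATCHED filehandle so it runs on its own.

```perl
use strict;
use warnings;

# Hypothetical record, standing in for one split line of @matched.
my @record = ('1001', 'x', 'Salt Lake', 'UT', '84101',
              'SMITH STEVE', 'y', '555-1234', 'LIC42');

# The slice picks the fields in output order; join supplies the commas.
# Single quotes keep \N as a literal backslash followed by N.
my $unmatched_line = join ',', @record[0, 5, 7, 2, 3, 4], 'UNKNOWN', '\N', '\N';

print "$unmatched_line\n";
```

      The field order lives in one place, and join means there's no trailing-comma bookkeeping to get wrong.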

        This group is a gold mine indeed. :-) Some of the solutions you provided here as examples (the usage of print and so on) can HARDLY be found in any of the books, and I've been through quite a few of them. I learned something new again today... :-)

        Thanks a lot!