bman has asked for the wisdom of the Perl Monks concerning the following question:

Actually, the title is not quite true. What I am doing is comparing values from one file against another and, based on that comparison, writing to a third file. I am starting to go bananas here, and I have rewritten the code numerous times. Sometimes it worked and sometimes it didn't. Because a picture is worth a thousand words, here is a piece of code:

@matched = sort @matched;
@phoneBook = sort @phoneBook;
foreach (@phoneBook) {
    @fields = split(',', $_);
    my $fullname = $fields[0];
    $fullname =~ tr/[()]//d;
    $fullname =~ s/\s+/ /;
    foreach (@matched) {
        my @record = split(',', $_);
        my $lic = $record[$#record];
        chomp($lic);
        my ($lname, $fname) = split('\s+', $record[5]);
        my $firstn = substr($fname, 0, 2);
        my $name = "$lname " . "$firstn";
        $name =~ tr/[a-z]/A-Z/;
        if (defined $fullname =~ m/$name/g) {
            print RESULTS "$record[0],UNKNOWN,$lic,$fullname,$fields[1],$fields[2],$fields[3]\n";
        }
    }
}
-----------------

What I am doing is comparing names from one file (inner loop) against a bigger file (outer loop). If a name matches, I simply want to write it out. But it's not happening here. If everything goes fine, I should get about 154 matches. However, I'm getting over 90,000 of them (while the outer loop file contains only about 260 records). Can someone please tell me what's going on here?

On the same note, then, I want to find out how many records from the inner file do not match the outer. My thinking was "I would simply reverse the loops and do !~." This, however, also seems not to work.

At this point, I would greatly appreciate any hints or directions I should take to resolve it.

Thanks.

Replies are listed 'Best First'.
Re: Comparing two files
by japhy (Canon) on Sep 10, 2001 at 17:35 UTC
    Gah! tr/// IS character classes already. Please remove the brackets from the left-hand side. Also, don't use tr/// there. Just use a function like uc().
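
    A small sketch of both points, using made-up sample strings rather than the poster's data: the brackets in tr/// are treated as ordinary characters, and uc() does the upper-casing without any range at all.

```perl
use strict;
use warnings;

# Hypothetical sample value, not taken from the poster's files.
my $raw = "Smith [Jr] (Bob)";

# Brackets inside tr/// are literal characters, so [ and ] get deleted too.
(my $with_brackets = $raw) =~ tr/[()]//d;

# Without the brackets, only the parentheses are removed.
(my $parens_only = $raw) =~ tr/()//d;

# uc() upper-cases the whole string; no tr/// range needed.
my $upper = uc "smith st";

print "$with_brackets\n$parens_only\n$upper\n";
```

    The first result loses the square brackets as well as the parentheses, which is probably not what was intended.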

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Comparing two files
by dragonchild (Archbishop) on Sep 10, 2001 at 17:40 UTC
    I've got a feeling you don't understand the problem statement. This means you won't understand any of the possible solutions well enough to code them.

    Problem Statement: Take the values from two files and find those values that are the same. Print those to a third file.

    Problem Solution: Since it seems that you just want to find all values in file 2 that exist in file 1, you don't really care if there are duplicates in either file or not. Existence is the important thing. So, it sounds like hashes are your friend here.

        my @file1 = <FILE1>;
        my @file2 = <FILE2>;

        # Here is where you would normalize the data. Things like uc, lc,
        # ucfirst, s/\s//g, and the like.

        my (%file1, %file2);
        $file1{$_} = 1 foreach @file1;
        $file2{$_} = 1 foreach @file2;

        foreach my $value (sort keys %file1) {
            if ($file2{$value}) {
                print FILE3 "$value\n";
            }
        }

    By looking at things this way, you can then easily find out which in file 2 aren't in file 1.

        # Using the same data structures as above ...
        foreach my $value (sort keys %file2) {
            unless ($file1{$value}) {
                print FILE3 "$value\n";
            }
        }

    Also, this lends itself to a counting of the instances, if you expect duplicates and you care. You can change the hash populating part to

        my (%file1, %file2);
        $file1{$_}++ foreach @file1;
        $file2{$_}++ foreach @file2;

    You could then do something like:

        print FILE3 "$value : $file1{$value}\n";
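
    To see the hash approach end to end without touching any files, here is a minimal sketch. The arrays hold hypothetical, already-normalized values standing in for the two files, and the names @in_both and @only_in_two are just illustrative.

```perl
use strict;
use warnings;

# Hypothetical, already-normalized values standing in for the two files.
my @file1 = ('SMITH ST', 'JONES AL', 'SMITH ST', 'BROWN JO');
my @file2 = ('SMITH ST', 'BROWN JO', 'GREEN MA');

# Count occurrences so duplicates can be reported if you care about them.
my (%file1, %file2);
$file1{$_}++ foreach @file1;
$file2{$_}++ foreach @file2;

# Values from file 1 that also exist in file 2.
my @in_both = grep { $file2{$_} } sort keys %file1;

# Values from file 2 that do not exist in file 1.
my @only_in_two = grep { !$file1{$_} } sort keys %file2;

print "both: $_ (count in file 1: $file1{$_})\n" foreach @in_both;
print "only in file 2: $_\n" foreach @only_in_two;
```

    Each lookup is a single hash access, so there is no nested loop over both files.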

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

      I thought I understood the problem at hand. :-) Anyway, I'm here to learn, and I'm the first to admit that I lack the Perl finesse and technique to solve some of the problems I encounter. I also have to admit that hashes are not my strong suit in Perl right now (I'm still not too comfortable with them, although I can see a huge benefit in mastering them). That is why I chose the array approach and line-by-line comparison. Let's see if my thinking in doing so was flawed:
      Inner file:
      Consists roughly of 154 records. If I could match 154 against outer file, I would be very happy.
      Outer file:
      Consists of about 256 records which contain various info about a user
      Those two files are comma delimited and contain other info besides names. They vary in their number of fields.
      Problem:
      The only thing these two files have in common are some similarities in their names. To make things more difficult, the inner file is maintained manually, so human error has to be taken into account. Also, a name in the outer file might include a middle name.

      To make sure that I can achieve the maximum number of hits, I split the names of the inner file into:

      lastname substr($firstname, 0, 2)
      so I can catch:
      • Smith Steve
      • Smith Stephan
      but not
      • Smith Agnes
      • Smith John
      • Smith Adam
      and so on.
      Pseudocode:
      Only important elements are left below to make it short:


          @file1 = <PHONEBOOK>;
          @file2 = <MATCHED>;
          foreach (@file1) {
              @record = split(',', $_);   # so I can get name only
              $fullname = $record[0];
              $fullname =~ tr/()//d;      # I don't want ( or ) in it
              $fullname =~ s/\s+/ /g;     # Substitute one or more spaces anywhere with one space only (per space matched)
              # Now, I have $fullname with info I want to match against.
              foreach (@file2) {
                  @fields = split(',', $_);   # to get name only
                  my ($lname, $fname) = split(/\s+/, $fields[5]);   # so I can do a substring on the first name
                  my $name = "$lname " . substr($fname, 0, 2);
                  $name =~ tr/a-z/A-Z/;   # You can tell me to use uc() function but for now I will use what I know
                  if ($fullname =~ m/^$name/g) {
                      # at this point if a string from my inner loop matches the one from the outer, print it out
                  }
              }
          }

      So, in theory this code should work (I know it should because it was working at some point) and from what I wrote above, doesn't it look like I understand what I want to do?

      Using hashes here would be nice if each file had only one element per line, where the element from line 1 of the outer file would be the key for the element from line 1 of the inner file, which is not the case here.

        *winces* Ok. Some meta-coding knowledge seems to be demanded here.
        1. Doing what you're doing is an exercise in normalization, not comparison.
        2. Doing what you're doing is a good way to go insane.
        To use what I discussed, we need to do each step in order.
        1. Read in the data
        2. Normalize the data
        3. Compare the data
        Reading it in is easy. @file1 = <FILE1>. Whee!

        The normalization part seems to be tripping you up. What this step entails is to take data from a source and manipulate it so that it is in a form you can easily work with. The idea is to then populate a second data structure, then work with that data structure.

        So, you'd read into an array. For each element in that first array, you'd manipulate it and put it into a second data structure (hash, array, whatever). You'd then use that second data structure for any operations, such as comparisons. This way, you know that all your data sources speak the same language.

        So, what you'd do is something like:

            my @file1 = <FILE1>;
            my %file1 = normalize_file1(@file1);

            my @file2 = <FILE2>;
            my %file2 = normalize_file2(@file2);

            # Do the comparisons here. Use what I gave before.

        So, we've brought it down to just normalization procedure. As you've noticed, this is easily the most complex part of the whole deal. Let's take the first file as an example to work with. (You do the other one. *grins*)

        Design: You're getting a comma-delimited line. You're interested in one field. That field will be in one of two formats. What you want to compare is a manipulated form of that field. (This assumes that the name is the third field.)

            sub normalize_file1 {
                my @file1 = @_;
                my %file1;

                LINE: foreach my $line (@file1) {
                    my @fields = split /,/, $line;
                    next LINE unless @fields;

                    # Note the use of uc here.
                    my @name = split /\s+/, uc $fields[2];

                    # Declared here so it is still in scope after the if/elsif.
                    my $name;
                    if (@name == 3) {
                        # Have middle name
                        $name = "$name[0] " . substr($name[2], 0, 2);
                    }
                    elsif (@name == 2) {
                        # No middle name
                        $name = "$name[0] " . substr($name[1], 0, 2);
                    }
                    else {
                        # Error state
                        die "Bad name in normalize_file1(): $line\n";
                    }

                    $file1{$name} = 1;
                }

                return %file1;
            }

        (For those anal-retentive people, yes, I could've used hashrefs and listrefs. Why confuse the issue when this works just as well algorithm-wise, if less efficiently.)

        This will take the array of lines from FILE1 and return a hash, whose keys are "SMITH ST", for example. You would then write a similar function for FILE2. Now, don't go nuts about data entry error. Your program exists solely to take data and manipulate it. You're not writing an error-correction program here.
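
        The three steps (read, normalize, compare) can be sketched in one short script. Everything specific here is an assumption for illustration: name_keys() is a hypothetical helper, the sample lines are made up, the field positions are guesses, and the name field is assumed to read "Lastname Firstname [Middle]". Adjust it to whatever your real files look like.

```perl
use strict;
use warnings;

# Hypothetical helper: builds "LASTNAME FI" keys from one comma-delimited field.
# Assumes the field reads "Lastname Firstname [Middle]"; anything after the
# first name is ignored.
sub name_keys {
    my ($field_index, @lines) = @_;
    my %keys;
    foreach my $line (@lines) {
        chomp $line;
        my @fields = split /,/, $line;
        my $raw = $fields[$field_index];
        next unless defined $raw;
        $raw =~ tr/()//d;                        # drop parentheses
        my ($last, $first) = split ' ', uc $raw; # split on runs of whitespace
        next unless defined $first;              # skip names we cannot parse
        $keys{ "$last " . substr($first, 0, 2) } = 1;
    }
    return %keys;
}

# Made-up sample lines; the field positions are assumptions.
my @phonebook = ("Smith  (Steve),555-0100\n", "Jones Alan,555-0101\n");
my @matched   = ("17,smith stephan\n", "18,brown john\n");

# Step 2: normalize both sources into the same key format.
my %book  = name_keys(0, @phonebook);
my %match = name_keys(1, @matched);

# Step 3: compare using plain hash lookups.
my @hits   = grep {  $book{$_} } sort keys %match;
my @misses = grep { !$book{$_} } sort keys %match;

print "hit:  $_\n" foreach @hits;
print "miss: $_\n" foreach @misses;
```

        Once both sides are reduced to the same key format, the matches and the non-matches come out of the same two hashes, so there is no need to reverse any loops.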

        ------
        We are the carpenters and bricklayers of the Information Age.

        Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Re: Comparing two files
by physi (Friar) on Sep 10, 2001 at 17:35 UTC
    Well , your line:
    $fullname =~ s/\s+/ /;
    will only change the first run of whitespace into ' ' !!!
    You may want to change it to:
    $fullname =~ s/\s+/ /g;
    and try to change:
    my ($lname, $fname) = split('\s+', $record[5]);
    into
    my ($lname, $fname) = split( /\s+/, $record[5]);
    otherwise you split the string with "\s+" as a delimiter, not with ' ' !
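
    These points are easy to check with a few lines and a made-up sample string. As far as I know, split treats a string first argument as a pattern too, so the two split forms should behave alike here; the /.../ form is simply clearer. The sketch also looks at the defined ... =~ test from the original post, which is my own observation rather than something raised in this thread: a failed match returns an empty string, which is still defined, so that test looks true for every pair of records.

```perl
use strict;
use warnings;

my $messy = "Smith   Steve    J";    # made-up sample value

(my $once  = $messy) =~ s/\s+/ /;    # collapses the first run of whitespace only
(my $every = $messy) =~ s/\s+/ /g;   # collapses every run of whitespace

# split with a string pattern and with a regex literal
my @from_string = split('\s+', $messy);
my @from_regex  = split(/\s+/, $messy);

# defined() on a match result versus the match result itself
my $with_defined    = defined("SMITH STEVE" =~ m/JONES/) ? 1 : 0;
my $without_defined = ("SMITH STEVE" =~ m/JONES/)        ? 1 : 0;

print "once:  '$once'\n";
print "every: '$every'\n";
print "split forms agree\n" if "@from_string" eq "@from_regex";
print "defined test: $with_defined, plain test: $without_defined\n";
```

    If the defined test really is always true, dropping defined from the if condition would be the first thing to try for the runaway match count.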

    Hope this helps..

    -----------------------------------
    --the good, the bad and the physi--
    -----------------------------------