de2425 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks;

I was hoping that someone might be able to help me with a problem. I have a file with 2 lists of data, basically institution names. The data is similar but it is not exactly the same. For instance, one column might have:

UNIVERSITY OF ILLINOIS
and another might have:
University of Illinois at Chicago
What I am trying to do is split each string into an array of words. Then I want to set a criterion: if there are 3 or more matching words, a Y is printed. If not, an N is printed. Below is what I've tried so far. I know that I'm missing something simple but I cannot figure out what. Any suggestions? (btw, I know I should use my in front of the variables and such, but I get, um, complained at, for lack of a better term, when I do).

    !#/usr/bin/perl -w
    use strict;
    use warnings;

    open (IN, "C:/Documents and Settings/devans/Desktop/grant_author_name_mapped.txt");
    open (OUT, ">C:/Documents and Settings/devans/Desktop/grant_author_name_mapped_.txt");

    $count;

    while (<IN>){
        chomp;
        @t=split(/\t/,$_);

        $t[2]=~s/\(//gi;
        $t[2]=~s/\)//gi;
        $t[8]=~s/\(//gi;
        $t[8]=~s/\)//gi;
        ucfirst $t[2];

        @b=split(/\ /,$t[8]);
        @a=split(/\ /,$t[2]);
        foreach (@a){
            if ($_ and exists $b{$_}){
                $count++;
                print "$count\n";
            }
        }
        if ($count ge 3){
            print OUT "$t[1]\t$t[2]\t$t[3]\t$t[4]\tY\t$t[6]\t$t[7]\t$t[8]\n";
        }
        else{
            print OUT "$t[1]\t$t[2]\t$t[3]\t$t[4]\tN\t$t[6]\t$t[7]\t$t[8]\n";
        }
    }
    close OUT;
    close IN;

Replies are listed 'Best First'.
Re: Compare Arrays with a Count of Matches
by kennethk (Abbot) on Feb 19, 2009 at 16:24 UTC
    First, please do not put use strict; use warnings; at the head of a post if the code you are running doesn't use them. It is misleading and unhelpful. The point of those pragmas is to catch mistakes, such as the attempted use of an array as a hash on line 23 (exists $b{$_} looks in a hash %b, but only the array @b was ever populated). I also note that you have a bang-hash (!#) on your first line in place of a hash-bang (#!), which implies you added that to your post as well. Also note that the -w switch is redundant with use warnings;.
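For what it's worth, the array-as-hash mistake mentioned above can be sidestepped by building a real lookup hash from the second string's words. The following is only a minimal sketch: the name count_common_words is made up for illustration, and it assumes the comparison should be case-insensitive and that each distinct word counts once, neither of which the original post states.

```perl
use strict;
use warnings;

# Hypothetical helper: count the distinct words of $left that also occur in $right.
# Assumptions: matching is case-insensitive and parentheses are ignored.
sub count_common_words {
    my ($left, $right) = @_;

    # Build a lookup hash from the second string, so that exists()
    # is applied to a real hash rather than to an array.
    my %seen_in_right;
    $seen_in_right{ lc($_) } = 1 for grep { length } split /[\s()]+/, $right;

    my %counted;
    my $count = 0;
    for my $word (grep { length } split /[\s()]+/, $left) {
        my $key = lc($word);
        next if $counted{$key}++;                 # count each distinct word once
        $count++ if exists $seen_in_right{$key};
    }
    return $count;
}

my $n = count_common_words('UNIVERSITY OF ILLINOIS',
                           'University of Illinois at Chicago');
print $n >= 3 ? "Y\n" : "N\n";
```

Because the counter is a lexical inside the sub, it starts from zero for every pair of strings.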
Re: Compare Arrays with a Count of Matches
by hbm (Hermit) on Feb 19, 2009 at 16:42 UTC

    Other starter comments:

    1. When deleting parens from your string, you don't need /i; and I'm guessing you could replace them all before you split, like this?
      s/[()]//g;
      @t=split(/\t/,$_);
    2. Don't you want to reset $count? The easiest way to do that would be to put my $count; immediately below while (<IN>).
    3. You could simplify your primary print statement like this:
      my $yn = ($count >= 3) ? 'Y' : 'N';
      print OUT join("\t", @t[1..4], $yn, @t[6..8]), "\n";
      But are you overlooking the first element, $t[0]?
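As a rough sketch of point 3, the row can be rebuilt with array slices and a numeric comparison. The helper name format_row and the sample fields are hypothetical; the field positions (1..4, flag, 6..8) are taken from the original print statements.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild one output row from the tab-split fields,
# putting the Y/N flag where field 5 used to be.
sub format_row {
    my ($fields, $count) = @_;

    # Numeric comparison: with the string operator 'ge', a count of 10
    # would sort before 3 and be flagged N.
    my $yn = ($count >= 3) ? 'Y' : 'N';

    # Array slices take the @ sigil, even through a reference.
    return join("\t", @{$fields}[1..4], $yn, @{$fields}[6..8]) . "\n";
}

# Made-up sample row; element 0 is left out, as in the original output.
my @t = ('id', 'a', 'B UNIV', 'c', 'd', 'old', 'e', 'f', 'b univ');
print format_row(\@t, 2);
```

Whether element 0 belongs in the output is still the open question raised above.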
Re: Compare Arrays with a Count of Matches
by jethro (Monsignor) on Feb 19, 2009 at 17:29 UTC

    Maybe a more robust comparison would be to say "yes" when all the words that are in the string with fewer words are also in the other string, after prepositions and other glue-words like in, of, at, by, ... have been removed. That would have the advantage of matching strings with fewer than 3 words and strings where only the prepositions differ. Depends on the specifics of your data naturally.
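A minimal sketch of that subset idea, assuming a small hand-picked glue-word list and case-insensitive matching (both are assumptions, as are the names significant_words and names_match):

```perl
use strict;
use warnings;

# Assumed glue-word list; extend it to suit the data.
my %GLUE = map { ($_ => 1) } qw(in of at by the for and);

# Return the distinct non-glue words of a name, lowercased.
sub significant_words {
    my ($name) = @_;
    my %words;
    for my $w (split /[^A-Za-z0-9]+/, lc $name) {
        next if $w eq '' or $GLUE{$w};
        $words{$w} = 1;
    }
    return keys %words;
}

# True when every significant word of the shorter name occurs in the longer one.
sub names_match {
    my ($x, $y) = @_;
    my @wx = significant_words($x);
    my @wy = significant_words($y);
    my ($short, $long) = @wx <= @wy ? (\@wx, \@wy) : (\@wy, \@wx);
    return 0 unless @$short;                  # nothing left to compare
    my %in_long = map { ($_ => 1) } @$long;
    my @missing = grep { !$in_long{$_} } @$short;
    return @missing ? 0 : 1;
}

print names_match('UNIVERSITY OF ILLINOIS',
                  'University of Illinois at Chicago') ? "Y\n" : "N\n";
```

Note that under this rule a short name matches any longer name that contains all of its words, so whether it is an improvement really does depend on the data.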

      Depends on the specifics of your data naturally

      I've had to do it before ... luckily, I had a 'master' list of schools to work from, because it was for a state board of licensure, so I could be (reasonably) assured that all of the schools were accredited

      In my particular case, I ran into situations like the following:

      # there are four different schools in this list:
      U Maryland
      U Maryland College Park
      U Maryland Baltimore
      U Maryland Baltimore Campus
      U Maryland at Baltimore
      U Maryland Baltimore County
      U Baltimore

      Of course, it was _much_ worse than that, but there were some recognizable patterns (not including punctuation / capitalization):

      U of (state)
      Univ (state)
      U (state)
      University of (state)
      (state)
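One way such prefix patterns might be folded together is to rewrite the variants to a single canonical form before comparing. This is only a sketch under stated assumptions: canonical_name is a hypothetical helper, punctuation and capitalization are treated as meaningless, and the bare "(state)" form is left alone because recognizing it would need a list of state names (or the master list of schools).

```perl
use strict;
use warnings;

# Hypothetical normalizer: rewrite a leading "U", "U of", "Univ" or
# "University of" to one canonical "university of ..." form.
sub canonical_name {
    my ($name) = @_;
    my $n = lc $name;
    $n =~ s/[^a-z0-9\s]+/ /g;     # drop punctuation
    $n =~ s/\s+/ /g;              # collapse runs of whitespace
    $n =~ s/^\s+//;               # trim the front so the prefix rule can anchor
    # \b keeps a name such as "Utah State" from matching the bare "u".
    $n =~ s/^(?:university|univ|u)\b(?:\s+of\b)?\s*/university of /;
    $n =~ s/\s+$//;               # trim the end
    return $n;
}

print canonical_name($_), "\n"
    for 'U of Maryland', 'Univ. Maryland', 'U Maryland', 'University of Maryland';
```

All four sample inputs come out as the same string, after which an exact or word-based comparison can take over.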

      The messy part was when they started mixing universities and colleges ('Speed School' is the 'University of Louisville'; 'Clark School' is 'University of Maryland, College Park')