in reply to Compare Arrays with a Count of Matches

Maybe a more robust comparison would be to say "yes" when all the words in the string with fewer words are also in the other string, after prepositions and other glue words like in, of, at, by, ... have been removed. That would have the advantage of matching strings with fewer than 3 words and strings where only the prepositions differ. Depends on the specifics of your data naturally.
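A minimal sketch of that idea in Python, in case it helps to see it spelled out. The glue-word list and the names `GLUE_WORDS`, `content_words` and `loosely_matches` are made up for illustration; the list in particular would need tuning for real data.

```python
# Hypothetical glue-word list; extend or trim it to suit the data.
GLUE_WORDS = {"in", "of", "at", "by", "the", "and", "for"}

def content_words(text):
    """Lowercase, split on whitespace, and drop glue words."""
    return {w for w in text.lower().split() if w not in GLUE_WORDS}

def loosely_matches(a, b):
    """True when every content word of the shorter string is in the other."""
    wa, wb = content_words(a), content_words(b)
    if not wa or not wb:
        return False  # nothing left to compare after removing glue words
    shorter, longer = (wa, wb) if len(wa) <= len(wb) else (wb, wa)
    return shorter <= longer  # subset test

print(loosely_matches("University of Maryland", "University Maryland at College Park"))
```

Note that "fewer words" is judged here after the glue words are removed, and that a short name will match every longer name that contains all of its words, so this is probably only a first filter.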


Re^2: Compare Arrays with a Count of Matches
by jhourcle (Prior) on Feb 20, 2009 at 18:55 UTC
    Depends on the specifics of your data naturally

    I've had to do it before ... luckily, I had a 'master' list of schools to work from, because it was for a state board of licensure, so I could be (reasonably) assured that all of the schools were accredited.

    In my particular case, I ran into situations like the following:

    # there are four different schools in this list:
    U Maryland
    U Maryland College Park
    U Maryland Baltimore
    U Maryland Baltimore Campus
    U Maryland at Baltimore
    U Maryland Baltimore County
    U Baltimore

    Of course, it was _much_ worse than that, but there were some recognizable patterns (not including punctuation / capitalization):

    U of (state)
    Univ (state)
    U (state)
    University of (state)
    (state)
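For what it's worth, those patterns could be folded into one canonical form with something like the Python sketch below. The function name `normalize_state_university` and the choice of "university of ..." as the canonical form are assumptions for illustration, not what was actually used; it also assumes the input is already known to be one of these patterns, since it treats any bare name as a state.

```python
import re

def normalize_state_university(name):
    """Map 'U of X', 'Univ X', 'U X', 'University of X' and bare 'X'
    to the single form 'university of x'."""
    # Ignore punctuation and capitalization.
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    words = cleaned.split()
    # Drop a leading university marker and an optional 'of'.
    if words and words[0] in {"u", "univ", "university"}:
        words = words[1:]
        if words and words[0] == "of":
            words = words[1:]
    return "university of " + " ".join(words) if words else ""

for variant in ["U of Maryland", "Univ. Maryland", "U Maryland",
                "University of Maryland", "Maryland"]:
    print(normalize_state_university(variant))
```

The normalized names would still have to be checked against the master list, because campus suffixes are kept as-is and are not resolved to a particular school.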

    The messy part was when they started mixing universities and colleges ('Speed School' is the 'University of Louisville'; 'Clark School' is 'University of Maryland, College Park').