srdst13 has asked for the wisdom of the Perl Monks concerning the following question:
I have a legacy database (actually a single fully de-normalized table with MANY columns) that I need to move over to SQL and a more orderly data structure. However, much of the data was entered as free-text, so I have columns of things such as names, cities, etc. that vary slightly depending on who did the entering. In the simplest cases, extra white space or different capitalization are used. In more complex cases, nouns are spelled differently or even a word (like a middle name) is left out. So, I need to somehow consolidate these textual variations describing the same thing into a single entity. Concrete example:
Promessa High Spirits
Promessa high Spirits
Promessa Hgh Spirits
Promessa high spirits
These all represent the same animal. I would like to "discover" that (or at least be alerted to the situation) without having to go through the 50000 records like this that I have. Thanks, Sean
Re: Fuzzy matching of text strings
by planetscape (Chancellor) on Dec 14, 2005 at 14:12 UTC
|
| [reply] |
Re: Fuzzy matching of text strings
by samizdat (Vicar) on Dec 14, 2005 at 13:46 UTC
|
| [reply] |
Re: Fuzzy matching of text strings
by ww (Archbishop) on Dec 14, 2005 at 14:35 UTC
|
Above represent good advice, but it may be profitable (efficient) to normalize the capitalization before dealing with the more complex fuzzy matching needs found in item 3.
I would be seriously inclined to see if lc'ing everything, and then uc'ing first letter of each word minimizes the work.
However, this scheme is suggested on the basis of one snippet of your data; if you have to distinguish between Mr. MacHinery and (something) Machinery *OR* if capitalization on the output need be not only consistent but also "correct" -- for unknown values of correct -- you will need something far better than this simple-minded scheme. | [reply] |
Re: Fuzzy matching of text strings
by TedPride (Priest) on Dec 14, 2005 at 19:38 UTC
|
Remove all leading and trailing whitespace; change all instances of more than one whitespace character to a space; lowercase; remove punctuation. Then create a 3-character frequency count and match it against the 3-character frequency count for existing fields. If the percentage match is over a certain level, add possible matches to an array. Sort by percentage match.
Now either choose a record (or field) to merge with, or mark the record as unique and move on. Hopefully you won't have to manually deal with too many records. | [reply] |
Re: Fuzzy matching of text strings
by qq (Hermit) on Dec 14, 2005 at 16:24 UTC
|
The String::Approx way may not work well for addresses. You could attempt to normalize them before doing a comparison: eg with Geo::StreetAddress::US or the like.
I've never actually tried this, and I can imagine its not always straightforward. But you might end up with better data.
update:typo and removed redundant sentance.
| [reply] |
Re: Fuzzy matching of text strings
by ruoso (Curate) on Dec 14, 2005 at 17:19 UTC
|
| [reply] |
|
Thanks all for the answers. Finding the similarity between two strings is one (probably the largest) component of my problem. However, there is another component--finding groups of "matches". I guess that I could do all possible pairs and look for similarity between them, forming a graph-like structure connecting "matches" to each other and then look for disconnected components or some such thing. Any thoughts on this second part of the problem? There are any number of possible ways to do it in practice (Graph.pm or even SQL could probably handle it), but it would be great to hear thoughts on the issue.
Thanks again, Sean
| [reply] |
|
In fact, the process of developing each of the test subroutines was based on the results of the comparision using a subset of the data. What I did, in that case, was continuosly creating new tests and outputting to a csv file A, B and the comparision score. I stopped when I got a good result of both a limit score and having few false positives and false negatives.
I think you could do it in the same way, no need for anything much sofisticated, just a subset of the database and many runs improving the type of tests you make.
| [reply] |
Re: Fuzzy matching of text strings
by Anonymous Monk on Dec 14, 2005 at 18:26 UTC
|
I'm not sure about other databases, but Microsft SQL Server has a function called SOUNDEX, which returns a "similarity" string that you can use for comparisons.
e.g.
SELECT description, SOUNDEX(description) AS SOUNDEX FROM mytable
description SOUNDEX
-------------------------------------------- -------
Promessa High Spirits P652
Promessa high Spirits P652
Promessa Hgh Spirits P652
Promessa high spirits P652
And now for something completely different A530
And now, for something completely different A530
And now! Something completely different A530
Something completely different now S535
Description of SOUNDEX from manual:
SOUNDEX converts an alpha string to a four-character code to find similar-sounding words or names. The first character of the code is the first character of character_expression and the second through fourth characters of the code are numbers. Vowels in character_expression are ignored unless they are the first letter of the string. String functions can be nested. | [reply] [d/l] |
|
Soundex is a great tool, but in this case it is not doing anything. The reason the first four descriptions in your sample return the same soundex code is because they only processed the "Promess" portion of each record.
Basically:
1. Grab the first letter:
String: Promessa H...
Soundex: P
2. Remove all vowels in remaining string:
String: rmssH
Soundex: P
3. Condense duplicate letters:
String: rmsH
Soundex: P
4. Assign 3 digits from l-r based on following key:
1. b,p,f,v
2. c,s,k,g,i,q,x,z
3. d,t
4. l
5. m,n
6. r
String: rmsH
Soundex: P6 (6 is for r)
String: msH
Soundex: P65 (5 is for m)
String: sH
Soundex: P652 (2 is for s)
DONE AT 3 DIGITS!!! GO NO FURTHER.
If there are consecutive characters from the same group, such as in the name "Duck", (c and k are both in group 2), the resulting soundex would be D200 (zeros are added to pad right if we run out of letters to change to numbers).
In summary, soundex is not appropriate for longer strings comparison. If you use it, the following would all be grouped as P652:
Promessa National Bank
Promessing Fertilizer Company
Promessa High Spirits
Promessing With Me
Hope this clears up Soundex for everyone.
| [reply] |
|
| [reply] |
Re: Fuzzy matching of text strings
by nothingmuch (Priest) on Dec 15, 2005 at 16:03 UTC
|
| [reply] [d/l] [select] |
Re: Fuzzy matching of text strings
by Moron (Curate) on Dec 15, 2005 at 15:20 UTC
|
(untested) (updated to add more transformations)
#!/usr/bin/perl
# usage: $0 [-d "delimiter"] file
use strict;
use warnings;
my $delimiter = '\t';
if ( $ARGV[0] =~ /^\-(.)$/ ) {
$_ = $ARGV[1];
if ( /^\"(.*)\"$/ ) {
$delimiter = $1;
shift @ARGV; shift @ARGV;
}
}
my $file = join( '', @ARGV );
my %seen = ();
open my $fh, "<$file" or die "$!: $file\n";
while( <$fh> ) {
chomp;
my @cols = split( /$delimiter/ );
for ( my $col=0; $col <= $#cols; $col++ ) {
Check ( \%seen, $col[$col], $., $col+1 );
}
}
close $fh;
sub Check {
my ( $sref, $col, $lno, $colno ) = @_;
my $lc = Transform( $col );
my $xref;
if ( $xref = $sref -> { $lc } ) {
( $sref -> { $lc }{ $col } )
or print "Column $colno, Line $lno: ($col) "
. "varies from "
. ( values %$xref );
}
else {
$sref -> { $lc }{ $col } = "Column $colno, Line $lno: ($col)\n
+";
}
}
sub Transform{
my $lc = lc( shift());
# example of an additional transformation...
$lc =~ s/\s*//g;
return $lc;
}
| [reply] [d/l] |
|
|