Re^2: redundancy Checker

Re^3: redundancy Checker by jhourcle (Prior) on Jul 27, 2005 at 03:10 UTC
How do you determine redundant schools? I had to import data from a system that might have listed the 'University of Louisville Speed School' as 'UL', 'U of L', 'U Louisville', 'Univ. Louisville', 'Speed School', etc. If you're looking for exact string duplicates, it's fairly easy to do in SQL. Assuming we're looking for duplicated entries of field1, field2: `SELECT COUNT(*) AS duplicates, field1, field2 FROM some_table GROUP BY field1, field2 HAVING COUNT(*) > 1` Then you know which records to bother looking at, rather than having to go through the whole table.
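For the variant-spelling case, one common approach is to map each known variant to a canonical name before comparing. The following is only a sketch: the `%alias` table and the `clean` and `canonical` helpers are hypothetical names invented for illustration, and a real alias list would have to be built by hand from the imported data.

```perl
use strict;
use warnings;

# Hypothetical alias table: every key is a cleaned-up variant spelling,
# every value is the canonical name we want to store.
my %alias = (
    'ul'              => 'University of Louisville Speed School',
    'u of l'          => 'University of Louisville Speed School',
    'u louisville'    => 'University of Louisville Speed School',
    'univ louisville' => 'University of Louisville Speed School',
    'speed school'    => 'University of Louisville Speed School',
);

# Lowercase, drop punctuation, and collapse whitespace so that
# 'Univ. Louisville' and 'univ  louisville' produce the same lookup key.
sub clean {
    my ($name) = @_;
    my $key = lc $name;
    $key =~ s/[^a-z0-9 ]+//g;
    $key =~ s/\s+/ /g;
    $key =~ s/^ | $//g;
    return $key;
}

# Return the canonical name if the variant is known, otherwise the
# original string unchanged so a human can review it later.
sub canonical {
    my ($name) = @_;
    my $key = clean($name);
    return exists $alias{$key} ? $alias{$key} : $name;
}

print canonical($_), "\n" for 'UL', 'Univ. Louisville', 'Some Other College';
```

Once every name has been run through something like `canonical`, the exact-match SQL query becomes useful again, because the variants have collapsed into identical strings.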
Re^3: redundancy Checker by GrandFather (Saint) on Jul 27, 2005 at 02:13 UTC
At the time the second (redundant) data is entered, you should notice that there is already an entry in the database for the re-entered data. At that point you either throw away the redundant data or replace/edit the existing database entry. Perhaps you need to show us the sort of code you have currently and explain where the problem is? Perl is Huffman encoded by design.
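The check-on-entry idea might be sketched as follows. This is an assumption-laden illustration, not anyone's actual code: `%store` stands in for the database table, and `add_record` with its 'skip'/'replace' policy argument is a hypothetical helper. Against a real database the `exists` test would be a lookup query or a unique constraint.

```perl
use strict;
use warnings;

my %store;   # stand-in for the database table, keyed by the identifying field

# Add a record; if the key is already present either keep the old row
# (policy 'skip', the default) or overwrite it (policy 'replace').
# Returns 'inserted', 'skipped' or 'replaced' so the caller can report it.
sub add_record {
    my ($key, $row, $policy) = @_;
    $policy = 'skip' unless defined $policy;
    if (exists $store{$key}) {
        return 'skipped' if $policy eq 'skip';
        $store{$key} = $row;
        return 'replaced';
    }
    $store{$key} = $row;
    return 'inserted';
}

print add_record('big', { data2 => 'small' }), "\n";
print add_record('big', { data2 => 'large' }), "\n";
print add_record('big', { data2 => 'large' }, 'replace'), "\n";
```

Which policy is right depends on whether the newer entry should win, which is something only the person who owns the data can decide.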
Re^4: redundancy Checker by Ovid (Cardinal) on Jul 27, 2005 at 03:01 UTC
You still have to define "redundant". If you properly normalize addresses (something almost no one does), then each street should have one and only one entry in the database. However, five guys with the first name of "John" should probably not have that abstracted away into a single entry. Just because the data looks the same does not mean that it's the same thing. Further, and this is a heresy that many database purists would be horrified by, there are times that DBAs will deliberately leave data denormalized for performance reasons (though this should not be done until you've gone down other avenues of correcting the problem). We may be able to be more specific if you can describe at a higher level the problem you're trying to solve. Cheers, Ovid New address of my CGI Course.
Re^4: redundancy Checker by Anonymous Monk on Jul 27, 2005 at 03:04 UTC
Well, I haven't coded anything yet for the redundancy checker part. I am still planning how best to do it. An array? I've done the data input part, but that's just a simple SQL insert, and all the data is placed in the database, i.e.

data1 | data2 | data3 | data4 | data5
big | small | large | medium | good
extra | size | bad | small | nice
Re^5: redundancy Checker by GrandFather (Saint) on Jul 27, 2005 at 03:23 UTC
Ok, so give us a sample of the data, an indication of how much data there is, and the sort of redundancy checks you anticipate making. By the time you have done that you should have almost answered your own question, unless you enter the realms of normalising nasty data (see Ovid's comment) or you end up with a huge amount of data. A hash would be the natural data type for storing data that is supposed to have unique keys. Perl is Huffman encoded by design.
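As a rough illustration of the hash suggestion, the sketch below counts how often each pair of fields occurs and reports the pairs seen more than once. The rows are made up for the example, and the choice of the first two fields as the "key" is an assumption; the real key depends on what counts as redundant in your data.

```perl
use strict;
use warnings;

# Illustrative rows (made up); each is [field1, field2].
my @rows = (
    [ 'big',   'small' ],
    [ 'extra', 'size'  ],
    [ 'big',   'small' ],
    [ 'big',   'large' ],
);

# Count how often each (field1, field2) pair occurs. The fields are joined
# with a separator that should not appear in the data itself.
my %count;
$count{ join("\x1F", @$_) }++ for @rows;

# Keep only the pairs seen more than once, in sorted order.
my @dupes = grep { $count{$_} > 1 } sort keys %count;

for my $key (@dupes) {
    my ($f1, $f2) = split /\x1F/, $key;
    print "$count{$key} x ($f1, $f2)\n";
}
```

This holds every distinct key in memory, so it suits modest data sets; for a huge table, letting the database do the grouping is likely the better route.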