
For example, one user enters a list of data for a school. Then another user enters another list of data for the same school, then another user, and so on. How do I detect the redundant data? Thanks.

Re^3: redundancy Checker
by jhourcle (Prior) on Jul 27, 2005 at 03:10 UTC

    How do you determine redundant schools?

    I had to import data from a system that might have listed the 'University of Louisville Speed School' as 'UL', 'U of L', 'U Louisville', 'Univ. Louisville', 'Speed School', etc.
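One way to cope with name variants like these is a hand-built alias table that maps each known spelling to a single canonical name before any comparison is made. The sketch below is only an illustration in Python: `ALIASES` and `normalize_school` are hypothetical names, and the alias list contains only the variants quoted above.

```python
CANONICAL = "University of Louisville Speed School"

# Hypothetical alias table, keyed by lower-cased, whitespace-collapsed spelling.
# It has to be maintained by hand as new variants turn up.
ALIASES = {
    "ul": CANONICAL,
    "u of l": CANONICAL,
    "u louisville": CANONICAL,
    "univ. louisville": CANONICAL,
    "speed school": CANONICAL,
}

def normalize_school(name):
    """Return the canonical name for a known variant, else the trimmed input."""
    key = " ".join(name.lower().split())
    return ALIASES.get(key, name.strip())

print(normalize_school("  U of  L "))
print(normalize_school("Some Other School"))
```

Unknown names pass through unchanged, so a human still has to review anything the table does not cover.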

    If you're looking for exact string duplicates, it's fairly easy to do in SQL. Assuming we're looking for duplicated entries of field1, field2:

    SELECT COUNT(*) AS duplicates, field1, field2 FROM some_table GROUP BY field1, field2 HAVING COUNT(*) > 1

    Then you know which records to bother looking at, rather than having to go through the whole table.
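To try the grouping approach without a real database, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory table. The table and column names follow the query in this reply; the sample rows are made up. Note that it filters on COUNT(*) > 1, so a combination stored only twice is still reported, and it repeats COUNT(*) in the HAVING clause because not every database accepts a column alias there.

```python
import sqlite3

# In-memory database with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (field1 TEXT, field2 TEXT)")
conn.executemany(
    "INSERT INTO some_table (field1, field2) VALUES (?, ?)",
    [
        ("big", "small"),
        ("big", "small"),   # exact duplicate of the first row
        ("extra", "size"),
    ],
)

def find_duplicates(connection):
    """Return (count, field1, field2) for each combination stored more than once."""
    return connection.execute(
        "SELECT COUNT(*) AS duplicates, field1, field2 "
        "FROM some_table "
        "GROUP BY field1, field2 "
        "HAVING COUNT(*) > 1"
    ).fetchall()

duplicate_rows = find_duplicates(conn)
print(duplicate_rows)
```

Only the ("big", "small") combination comes back, with a count of 2.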

Re^3: redundancy Checker
by GrandFather (Saint) on Jul 27, 2005 at 02:13 UTC

    At the time the second (redundant) data is entered, you should notice that there is already an entry in the database for the re-entered data.

    At that point you either throw away the redundant data or replace/edit the existing database entry.
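A minimal sketch of that insert-time check, assuming the database itself is told what counts as "the same entry" through a UNIQUE constraint. The schema (`school_data` with `school`, `item`, `value`) and the `add_entry` helper are hypothetical. SQLite's INSERT OR IGNORE and INSERT OR REPLACE are used here; other databases spell this differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the UNIQUE constraint defines what "redundant" means.
conn.execute(
    "CREATE TABLE school_data ("
    " school TEXT NOT NULL,"
    " item TEXT NOT NULL,"
    " value TEXT,"
    " UNIQUE (school, item))"
)

def add_entry(connection, school, item, value, replace=False):
    """Insert a row; on a duplicate key keep the old row or overwrite it.

    Returns True if the table changed, False if the new data was discarded.
    """
    action = "REPLACE" if replace else "IGNORE"
    cursor = connection.execute(
        f"INSERT OR {action} INTO school_data (school, item, value) "
        "VALUES (?, ?, ?)",
        (school, item, value),
    )
    return cursor.rowcount > 0

first = add_entry(conn, "Speed School", "size", "big")
second = add_entry(conn, "Speed School", "size", "small")        # discarded
third = add_entry(conn, "Speed School", "size", "large", True)   # overwrites
stored = conn.execute("SELECT school, item, value FROM school_data").fetchall()
print(first, second, third, stored)
```

The second call is the "throw away" branch and the third is the "replace" branch; which one is right depends on whether later submissions should win.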

    Perhaps you need to show us the sort of code you have currently and explain where the problem is?


    Perl is Huffman encoded by design.

      You still have to define "redundant". If you properly normalize addresses (something almost no one does), then each street should have one and only one entry in the database. However, five guys with the first name of "John" should probably not have that abstracted away into a single entry. Just because the data looks the same does not mean that it's the same thing.

      Further, and this is a heresy that many database purists would be horrified by, there are times that DBAs will deliberately leave data denormalized for performance reasons (though this should not be done until you've gone down other avenues of correcting the problem).

      We may be able to be more specific if you can describe at a higher level the problem you're trying to solve.

      Cheers,
      Ovid

      New address of my CGI Course.

      Well, I haven't coded anything yet for the redundancy checker part. I am still planning how best to do it. An array?

      I've done the data input part, but that is just a simple SQL insert, and all the data are placed in the database

      i.e.

      data1 | data2 | data3 | data4  | data5
      big   | small | large | medium | good
      extra | size  | bad   | small  | nice

        Ok, so give us a sample of the data, an indication of how much data there is, and the sort of redundancy checks you anticipate making.

        By the time you have done that you should have almost answered your own question, unless you enter the realms of normalising nasty data (see Ovid's comment) or you end up with a huge amount of data.

        A hash would be the natural data type for storing data that is supposed to have unique keys.
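As a rough illustration of the hash idea, here is a sketch in Python, where the equivalent of a Perl hash is a dict (in Perl the same pattern is usually written with a `%seen` hash). It assumes "redundant" means every column matches exactly. The rows and the `split_unique` helper are made up for the example.

```python
# Hypothetical rows shaped like the five-column sample in this thread.
rows = [
    ("big", "small", "large", "medium", "good"),
    ("extra", "size", "bad", "small", "nice"),
    ("big", "small", "large", "medium", "good"),  # repeat of the first row
]

def split_unique(records):
    """Separate first occurrences from repeats, using a dict keyed on the row.

    Returns (unique, redundant), where redundant holds
    (position_of_repeat, position_of_first_occurrence) pairs.
    """
    seen = {}
    unique, redundant = [], []
    for position, record in enumerate(records):
        if record in seen:
            redundant.append((position, seen[record]))
        else:
            seen[record] = position
            unique.append(record)
    return unique, redundant

unique_rows, redundant_rows = split_unique(rows)
print(unique_rows)
print(redundant_rows)
```

If only some columns define identity, the key would be built from just those columns instead of the whole row.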

