Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I want to build a redundancy checker using Perl. How do I detect that data in my database (MySQL) is redundant?

Re: redundancy Checker
by GrandFather (Saint) on Jul 27, 2005 at 01:38 UTC

    First ask yourself "what comprises redundant data?".


    Perl is Huffman encoded by design.
      For example, one user inputs a list of data for a school. Then another user inputs another list of data for that school, and then another user, and so on. So how do I detect the redundant data? TQ

        How do you determine redundant schools?

        I had to import data from a system that might've had the 'University of Louisville Speed School' as 'UL', 'U of L', 'U Louisville', 'Univ. Louisville', 'Speed School', etc.

        If you're looking for exact string duplicates, it's fairly easy to do right in SQL, assuming we're looking for duplicated entries of field1, field2:

        SELECT COUNT(*) AS duplicates, field1, field2 FROM some_table GROUP BY field1, field2 HAVING duplicates > 1

        Then you know which records to bother looking at, rather than having to go through the whole table.
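
        A minimal Perl/DBI sketch of running that query and reporting the offending rows - the connection details, table, and field names are only placeholders carried over from the example above:

        use strict;
        use warnings;
        use DBI;

        # Placeholder connection details - adjust for your database.
        my $dbh = DBI->connect('dbi:mysql:database=school_db', 'user', 'password',
            { RaiseError => 1 });

        my $sth = $dbh->prepare(q{
            SELECT COUNT(*) AS duplicates, field1, field2
            FROM some_table
            GROUP BY field1, field2
            HAVING duplicates > 1
        });
        $sth->execute;

        # Each row here is a field1/field2 combination that appears more than once.
        while (my $row = $sth->fetchrow_hashref) {
            print "$row->{duplicates} copies of: $row->{field1}, $row->{field2}\n";
        }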

        At the time the second (redundant) record is entered, you should notice that there is already an entry in the database for the re-entered data.

        At that point you either throw away the redundant data or replace/edit the existing database entry.

        Perhaps you need to show us the sort of code you have currently and explain where the problem is?


        Perl is Huffman encoded by design.
Re: redundancy Checker
by spiritway (Vicar) on Jul 27, 2005 at 02:48 UTC

    One way that comes to mind is to use a sorted list of the data. When you get a new element, find its place in the list. If it's already there, then you don't need to add it. If it's not there, then add it in the proper place.

    HTH.
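
    A toy sketch of that idea in Perl, with made-up school names - a binary search finds the insertion point, and an exact match at that point means the element is already there:

    use strict;
    use warnings;

    # Keep @sorted in order; binary-search for each new item's place.
    my @sorted;

    sub add_if_new {
        my ($item) = @_;
        my ($lo, $hi) = (0, scalar @sorted);
        while ($lo < $hi) {
            my $mid = int(($lo + $hi) / 2);
            if ($sorted[$mid] lt $item) { $lo = $mid + 1 } else { $hi = $mid }
        }
        return 0 if $lo < @sorted && $sorted[$lo] eq $item;    # already present
        splice @sorted, $lo, 0, $item;                         # insert in place
        return 1;
    }

    for my $school ('Speed School', 'U of L', 'Speed School') {
        print "duplicate: $school\n" unless add_if_new($school);
    }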

      Um, wouldn't it be faster to just insert into a hash - when you get a new element, see if it's already in the hash?

      However, the OP is really asking much more about SQL than perl. Not that it's off-topic in my opinion, but answers should probably focus on the OP's problem area. Lists and hashes are unlikely to be the best solutions.

      The problem really is in defining what "duplicate" is. And then devising either tables or SQL queries (or some combination of both) to expose those duplicates. Even using lists or hashes, we still would need a better definition of what "duplicate" means to know what to sort on, or what to use as the hash key.
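
      For completeness, a tiny Perl sketch of the hash approach - the school_key() normalizer is hypothetical, and deciding what it should do is exactly the "what is a duplicate" question:

      use strict;
      use warnings;

      # Hypothetical normalizer: this only smooths out case, spacing and punctuation.
      sub school_key {
          my ($name) = @_;
          my $key = uc $name;
          $key =~ s/[^A-Z0-9]//g;    # drop spaces and punctuation
          return $key;
      }

      my %seen;
      for my $name (
          'University of Louisville Speed School',
          'university  of  louisville  Speed School',
          'U of L Speed School',
      ) {
          print "duplicate: $name\n" if $seen{ school_key($name) }++;
      }

      Note that 'U of L Speed School' still slips through, which is exactly why the definition of "duplicate" matters more than the choice of data structure.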

Re: redundancy Checker
by sk (Curate) on Jul 27, 2005 at 04:18 UTC
    From your question it does not look like you are evaluating a Perl approach vs. SQL, but anyway, here is my take on redundancy - if you want a Perl approach, I would use a hash, as tanktalus mentioned above. This assumes your data is not HUGE! If it is, then you want to keep the information indexed/sorted so that you can check for the key very quickly (also mentioned above).

    If you are looking for the SQL way, then you should consider setting that field as UNIQUE. If you try to insert or update a value that would violate this unique constraint, MySQL will raise a warning/error and not update (assuming you are not using the IGNORE option). If you are updating the data from Perl, then you should be able to trap the warning/failure to execute the query and report back to the user. SHOW WARNINGS in MySQL should give the warning that was generated.
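
    A rough sketch of trapping that failure from Perl/DBI - the table, column, and connection details are invented, and a UNIQUE index on school_name is assumed to exist already:

    use strict;
    use warnings;
    use DBI;

    # Placeholder connection details.
    my $dbh = DBI->connect('dbi:mysql:database=school_db', 'user', 'password',
        { RaiseError => 1, PrintError => 0 });

    my $sth = $dbh->prepare('INSERT INTO schools (school_name) VALUES (?)');

    # With RaiseError on, a duplicate-key violation dies, so trap it with eval.
    my $ok = eval { $sth->execute('University of Louisville Speed School'); 1 };
    if (!$ok) {
        print "Not inserted (probably a duplicate): $@";
    }

    If you want to tell a duplicate apart from other failures, MySQL reports duplicate keys as error 1062, which should show up in $dbh->err.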

    cheers

    SK

Re: redundancy Checker
by TedPride (Priest) on Jul 27, 2005 at 06:43 UTC
    First, you have to define what makes a school unique. In this case, it's probably the combination of state, city, or zip code (depending on how detailed the information is) and school name. Second, you have to standardize this information. Capitalization is your biggest problem - put the important fields in all caps. Third, make sure MySQL knows that the combination of these fields has to be unique. A UNIQUE INDEX will prevent multiple records from being submitted with the same field combination.
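
    Sketched in Perl/DBI, with the table layout and all names invented for illustration:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=school_db', 'user', 'password',
        { RaiseError => 1 });

    # One-time setup: the zip/name combination must be unique.
    $dbh->do('CREATE UNIQUE INDEX idx_school ON schools (zip, school_name)');

    # Standardize before storing: upper-case and collapse whitespace.
    sub standardize {
        my ($text) = @_;
        $text = uc $text;
        $text =~ s/\s+/ /g;
        $text =~ s/^ //;
        $text =~ s/ $//;
        return $text;
    }

    my $sth = $dbh->prepare('INSERT INTO schools (zip, school_name) VALUES (?, ?)');
    $sth->execute('40292', standardize(' university of  louisville Speed School '));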

    Fourth, duplicates will still get through due to misspellings or name variations. Make sure you have an easy way to list all the schools in a particular zip in alphabetic order, so you can identify duplicate entries with a minimum of trouble.

    Fifth, spam will also be a problem. Keep a cache of the last x number of submissions with IP and timestamp, and reject submissions from the same IP within x seconds of its last submission. You can also keep a log of submission attempts and run a secondary routine every now and then to identify bots.
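
    A toy sketch of that last idea - the cache here is an in-memory hash, where a real CGI setup would persist it to the database or a file between requests:

    use strict;
    use warnings;

    my %last_seen;            # IP address => time of last accepted submission
    my $min_interval = 30;    # seconds required between submissions per IP

    sub allow_submission {
        my ($ip) = @_;
        my $now = time;
        if (exists $last_seen{$ip} && $now - $last_seen{$ip} < $min_interval) {
            return 0;         # too soon - reject as probable spam
        }
        $last_seen{$ip} = $now;
        return 1;
    }

    print allow_submission('192.0.2.1') ? "accepted\n" : "rejected\n";
    print allow_submission('192.0.2.1') ? "accepted\n" : "rejected\n";    # rejected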