soon_j has asked for the wisdom of the Perl Monks concerning the following question:

Hi! I have a MySQL table that is updated by multiple users. There are chances that an entire row may be repeated a few times. There are cases also where only some column entries from a row may be repeated.

Here's the table:
questions TEXT,
option_a TEXT,
option_b TEXT,
option_c TEXT,
option_d TEXT

I need to check the table if there are cases an entry is repeated, or when the same question is repeated but with different options, or etc.

The INSERT routine IGNOREs repeated entries, so it's unlikely that an entire row may be repeated. So the only issue left is the possibility that the same question is repeated but with different entries in the options columns (option_a ... option_d). I need a routine in MySQL/Perl that returns the rows (or just the id) of the duplicate entries for a particular column.

Hope someone could enlighten me on this matter. I just need a quick and fast query that does this.

Thanks!!!!

20040204 Edit by BazB: Changed title from 'Needs a good routine for finding duplicate entries from a MySQL Table'

  • Comment on OT: routine for finding duplicate entries from a MySQL Table

Replies are listed 'Best First'.
Re: OT: routine for finding duplicate entries from a MySQL Table
by Caron (Friar) on Feb 04, 2004 at 07:12 UTC

    You need some SQL here. The following query will show you which questions have a duplicate value.

    SELECT questions, CONCAT(option_a, option_b, option_c, option_d) AS options, COUNT(*) AS howmany FROM yourtable GROUP BY questions, options HAVING howmany > 1

    But if you simply want to delete the duplicates, then use ALTER TABLE to add a unique index to your table. Notice that you may need to qualify your columns with the number of characters to include. (See the appropriate manual page for details.)

    ALTER IGNORE TABLE yourtable ADD UNIQUE KEY (questions, option_a, option_b, option_c, option_d) +;

    This will also prevent duplicates to happen in the future.

      The idea to add a UNIQUE index is OK, but the way you propose it is not correct.

      You are adding a UNIQUE KEY which consists of both the questions field and all the option_? fields, which means that for the same question but with a different set of options, you get a different entry in the index and hence the "NOT UNIQUE" error doesn't trigger.

      The solution is therefore to add a UNIQUE KEY on the questions field only and then you can add only one record with this question and all later records with the same question (but with perhaps different options) will not get added to the database.

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

        Well, the problem is that the OP wasn't very clear about what is a duplicate record. He mentioned an ID, but it isn't in the table description, so I assumed that a primary key existed, apart the columns that were mentioned.

        Anyway, if the problem is (more trivially) to find out if several records have the same "questions" column, then it's even simpler:

        SELECT questions, COUNT(*) AS howmany FROM yourtable GROUP BY questions HAVING howmany > 1

        To remove such duplicates, on the same vein:

        ALTER IGNORE TABLE yourtable ADD UNIQUE KEY (questions)