shaolin_gungfu has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a little MySQL database - actually it's not that little, it has about 252,000 records in it. The trouble is that many of these records contain duplicate information. I need to analyse each record and remove the duplicates.

Obviously each record has its own unique primary key, but I need to compare, say, 5 out of the total of 9 fields in each record, and if two records contain duplicate information in those fields, just drop one of them.

Now I could do this by taking record 1, looping in Perl and comparing it to every other record, then when I find one that matches, dropping that record. But that would take ages!!!

I was just wondering if anyone knew a cool, quick and easy MySQL command that would do it for me...?

MySQL seems to be good at stuff like that.

Yours hopefully,
Tom

Re: Removing duplicates in MySQL
by ferrency (Deacon) on May 01, 2002 at 14:58 UTC
    To get a list of duplicate records, you can join the table on itself. Something like this ought to do it:

    SELECT a.unique_id
    FROM my_table AS a, my_table AS b
    WHERE a.field1 = b.field1
      AND a.field2 = b.field2
      AND a.field3 = b.field3
      AND a.unique_id != b.unique_id
    This will return a list of all unique_ids where two rows share the same values of field1, field2, and field3. Note that it will return the unique_id of both matching rows, not just one of them.

    I don't have a suitable dataset to experiment on, but you might also be able to do a GROUP BY on the fields which must match, and select MAX(unique_id) instead, to grab one row ID from each group of matching rows (a sketch follows below). You might still need to use Perl to loop through and delete results, but at least you'll then be doing a linear operation in Perl instead of a quadratic one.
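
    Here's a minimal sketch of that approach, assuming DBI and the hypothetical names from the query above (my_table, unique_id, field1..field3; the connection details are made up too). Untested, so treat it as an outline rather than a recipe:

    use DBI;

    my $dbh = DBI->connect("dbi:mysql:mydb", "user", "password",
                           { RaiseError => 1 });

    # One survivor per group of matching fields: keep the largest unique_id.
    my $keep = $dbh->selectcol_arrayref(
        "SELECT MAX(unique_id) FROM my_table
         GROUP BY field1, field2, field3"
    );
    my %keep = map { $_ => 1 } @$keep;

    # Linear pass: delete every row whose id isn't a survivor.
    my $all = $dbh->selectcol_arrayref("SELECT unique_id FROM my_table");
    my $del = $dbh->prepare("DELETE FROM my_table WHERE unique_id = ?");
    $del->execute($_) for grep { !$keep{$_} } @$all;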

    Update: changed unique_id to a.unique_id

    You can group by a.unique_id to make sure you only get one row per unique id. However, in the case where you know that a large portion of your rows have duplicates, you're right: this still doesn't give you any information you don't already have. Slightly better would be SELECT a.unique_id, b.unique_id, so you at least know which row each row matches (see the snippet below). Still less than ideal, I agree.
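
    And a quick sketch of that paired-id variant, again using the hypothetical names from the query above (and reusing the $dbh handle from the previous sketch):

    # Pair each duplicated row with the highest-numbered row it matches.
    my $pairs = $dbh->selectall_arrayref(
        "SELECT a.unique_id, MAX(b.unique_id)
         FROM my_table AS a, my_table AS b
         WHERE a.field1 = b.field1
           AND a.field2 = b.field2
           AND a.field3 = b.field3
           AND a.unique_id != b.unique_id
         GROUP BY a.unique_id"
    );
    printf "row %d matches row %d\n", @$_ for @$pairs;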

    Alan

      So this is going to tell you which unique ids have duplicates (and how many duplicates they have, as that'll be the number of times they're spat out), but not much more than that.

      A table with one set of three duplicates and one pair is going to return each of the first three ids twice, and each of the pair once, for a total of eight records output, which then need much cleaning up.

      Nah, don't like it.

      --
      Tommy
      Too stupid to live.
      Too stubborn to die.

Re: Removing duplicates in MySQL
by tommyw (Hermit) on May 01, 2002 at 15:00 UTC

    You've got two different ways of doing this (I'm going to use the SQL I know, from Oracle, but the concept's the same if MySQL doesn't support exactly this syntax):

    DELETE FROM source dup
    WHERE EXISTS (
        SELECT NULL FROM source
        WHERE dup.field1 = source.field1
          AND dup.field2 = source.field2
          AND source.key < dup.key
    )
    which simply deletes all records which are identical on fields 1 & 2 to some record with a lower key value (extend the field list as necessary). How efficient this'll be depends on your database. But all the information is there, so it could be your best bet. If MySQL baulks at the correlated EXISTS, there's a MySQL-specific alternative sketched below.
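
    For MySQL specifically, a multi-table DELETE (which I believe MySQL supports from version 4.0) should do the same job. A hedged sketch via DBI, using the hypothetical source/key names above:

    # For every pair of rows agreeing on field1/field2,
    # drop the one with the higher key.
    $dbh->do(
        "DELETE dup FROM source AS dup, source AS orig
         WHERE dup.field1 = orig.field1
           AND dup.field2 = orig.field2
           AND dup.key > orig.key"
    );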

    The other option, using Perl, is along the lines of:

    my $data = $dbh->prepare("SELECT key, field1, field2 FROM source
                              ORDER BY field1, field2");
    my $del  = $dbh->prepare("DELETE FROM source WHERE key = ?");

    $data->execute;    # run the query before fetching

    my @last;
    while (my @record = $data->fetchrow_array()) {
        # The ORDER BY puts duplicates next to each other, so each row
        # only needs comparing against the previous one.
        if (@last && $record[1] eq $last[1] && $record[2] eq $last[2]) {
            $del->execute($record[0]);    # kill the later duplicate
        }
        @last = @record;
    }
    $data->finish();
    $del->finish();
    Disclaimer: I've just made this up, so I may have got the specifics wrong, but hopefully you get the general idea: keep pulling the records out, in order of the fields you want to match, and when you find a correspondence, kill the duplicate.

    --
    Tommy
    Too stupid to live.
    Too stubborn to die.