| dcz013 | dc restaurants | 0 | dcrestaurants
| dcz0013 | dc restaurants | 0 | dcrestaurants
| dcz013 | dc resturants | 1 |
| dcz0013 | dc resturants | 1 |
| dcz013 | dc american dining | 0 | dcamericandining
Yeah ... that's going to cause problems alright. Since each ID is bogus, you will need to create a
new one, but keep the old just in case. I suggest using an unsigned integer that is
auto-incremented by the RDBMS, but PHBs tend to like IDs with letters in them (don't listen to
'em!).
If I were in your shoes, I would create a new table and figure out a way to convert the
rows from the old table into the new one. Prune as you go ... some of those rows are bound to be
redundant or incorrect. You will no doubt not get it right in the first few attempts, so
prepare for that by having your script first DROP the new table and CREATE it from scratch.
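Untested sketch of what I mean, here in Python with the stdlib sqlite3 module (the table names searches/new_searches and the GROUP BY pruning rule are just made up for illustration; adapt to your schema):

```python
# Drop-and-rebuild migration sketch. Hypothetical tables: "searches" is
# the old mess, "new_searches" is the clean rebuild with a fresh
# auto-incremented ID (old ID kept around, just in case).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE searches (old_id TEXT, term TEXT, flag INTEGER);
    INSERT INTO searches VALUES
        ('dcz013',  'dc restaurants', 0),
        ('dcz0013', 'dc restaurants', 0),
        ('dcz013',  'dc resturants',  1);
""")

# Re-runnable on purpose: every run drops the new table and rebuilds it
# from scratch, so a botched attempt costs nothing.
conn.executescript("""
    DROP TABLE IF EXISTS new_searches;
    CREATE TABLE new_searches (
        id     INTEGER PRIMARY KEY AUTOINCREMENT,  -- the new, sane ID
        old_id TEXT,                               -- keep the old, just in case
        term   TEXT,
        flag   INTEGER
    );
    -- Prune as you go: collapse rows that are identical on (term, flag),
    -- keeping one representative old_id per group.
    INSERT INTO new_searches (old_id, term, flag)
        SELECT MIN(old_id), term, flag
        FROM searches
        GROUP BY term, flag;
""")

rows = conn.execute(
    "SELECT id, old_id, term, flag FROM new_searches ORDER BY id"
).fetchall()
for row in rows:
    print(row)
```

The one-duplicate-collapsing rule above is the simplest possible prune; in real life you'd probably fold the misspelled terms in too.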
Best of luck, this doesn't sound too fun ... :/
UPDATE:
OK, I think I might have a viable game plan.
1.4 million records is a lot, but you might just have enough memory to pull this off by using a hash to keep track of unique IDs (buying more RAM might be the thing to do).
When faced with this dilemma, I reach for the AnyDBM_File module, which comes with perl. This way, your memory requirements for the hash turn into disk space requirements, which are usually a lot more lax. YMMV.
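For anyone not on Perl, here's the same trick sketched with Python's stdlib dbm module (its rough analogue of AnyDBM_File); the records and the scratch file name are invented for the example:

```python
# The "seen" hash lives in an on-disk DBM file instead of RAM, so memory
# use stays flat no matter how many records stream past.
import dbm
import os

# Toy stand-in for the 1.4 million rows (invented sample data).
records = [
    ("dcz013",  "dc restaurants"),
    ("dcz0013", "dc restaurants"),
    ("dcz013",  "dc resturants"),
]

unique = []
with dbm.open("seen_ids.db", "n") as seen:   # "n": always start a fresh file
    for rec_id, term in records:
        key = rec_id.encode()
        if key not in seen:                  # first time we've seen this ID
            seen[key] = b"1"
            unique.append((rec_id, term))

# Clean up the scratch file(s); dbm backends differ in the suffixes used.
for name in os.listdir("."):
    if name.startswith("seen_ids.db"):
        os.remove(name)

print(unique)
```

Same idea, same trade-off: disk space instead of RAM, at the cost of slower lookups.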
thor
Actually, you'd wanna select only the duplicate ones, since you don't want to deal with the ones that are fine already, right?
SELECT DISTINCT *
FROM   TABLE_A AS A,
       TABLE_B AS B
WHERE  A.ID = B.ID
  AND  A.SECONDCOLUMN != B.SECONDCOLUMN
of course, doing a ..
SELECT COUNT(*)
FROM (
    SELECT ID, COUNT(*) AS CNT
    FROM TABLE_A
    GROUP BY ID
) AS COUNTS
WHERE CNT = 1
will get you an idea of how many truly unique records there are.
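You can sanity-check that counting query against throwaway data before pointing it at the real table; a quick sqlite3 run (TABLE_A and its rows are made up here):

```python
# Demo of the count-of-truly-unique-IDs query: IDs appearing exactly
# once in TABLE_A. Sample data is invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE_A (ID TEXT, TERM TEXT);
    INSERT INTO TABLE_A VALUES
        ('dcz013',  'dc restaurants'),
        ('dcz013',  'dc resturants'),
        ('dcz0013', 'dc restaurants'),
        ('dcz014',  'dc diners');
""")

(truly_unique,) = conn.execute("""
    SELECT COUNT(*)
    FROM (
        SELECT ID, COUNT(*) AS CNT
        FROM TABLE_A
        GROUP BY ID
    ) AS COUNTS
    WHERE CNT = 1
""").fetchone()

# dcz013 appears twice; dcz0013 and dcz014 appear once each.
print(truly_unique)
```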
++jeffa
Play that funky music white boy..