in reply to Re: Re: Database Record Order
in thread Database Record Order

| dcz013   | dc  restaurants    | 0    | dcrestaurants
| dcz0013  | dc  restaurants    | 0    | dcrestaurants
| dcz013   | dc  resturants     | 1    |
| dcz0013  | dc  resturants     | 1    |
| dcz013   | dc american dining | 0    | dcamericandining
Yeah ... that's going to cause problems, alright. Since each ID is bogus, you will need to create a new one, but keep the old one just in case. I suggest using an unsigned integer that is auto-incremented by the RDBMS, but PHBs tend to like IDs with letters in them (don't listen to 'em!).
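For concreteness, here is a hypothetical sketch of what that replacement table could look like (MySQL-style syntax; the table name, column names, and types are all made up, so adjust them to your actual data):

```sql
CREATE TABLE new_table (
    id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    old_id VARCHAR(32),   -- keep the bogus ID around, just in case
    name   VARCHAR(255),
    flag   TINYINT        -- whatever FLAG turns out to mean
);
```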

If I were in your shoes, I would create a new table and figure out a way to convert the rows from the old table into the new one. Prune as you go ... some of those rows have to be redundant or incorrect. You will no doubt not get it right on the first few attempts, so prepare for that by having your script first DROP the new table and re-CREATE it from scratch. Best of luck; this doesn't sound too fun ... :/

UPDATE:
OK, I think I might have a viable game plan.

The monkey wrench in my gears was these rows:

| wq12351y059 | hawkinschemical | 1    |
| wq12366y059 | healtheon corp. | 1    |
| wq12367y059 | healthgatedata. | 1    |
and the big caveat is that I have no idea what FLAG is for ... but here goes.

1.4 million records is a lot, but you might just have enough memory to pull this off by using a hash to keep track of unique IDs (buying more RAM might be the thing to do). Read each row one at a time:

my $sth = $dbh->prepare('SELECT * FROM [Production Words]');
$sth->execute;
# finish the rest of this line after you design the new table
my $new_sth = $dbh->prepare('INSERT INTO new_table ...');
my %hash;
while (my $row = $sth->fetchrow_hashref) {
Then, inside the while loop, pull apart each ID like so:
my ($str, $num) = $row->{ID} =~ /(\D+)(\d+)/;
my $new_id = lc($str) . int($num);
This will turn wq12351y059 into wq12351 and dcz0013 into dcz13.
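As a self-contained check of that normalization, the same regex can be wrapped in a small helper (the name `normalize_id` is invented here) and run against IDs from the table above:

```perl
use strict;
use warnings;

# normalize_id: lowercase the non-digit prefix and strip leading
# zeros from the first run of digits (hypothetical helper name)
sub normalize_id {
    my ($id) = @_;
    my ($str, $num) = $id =~ /(\D+)(\d+)/;
    return lc($str) . int($num);
}

print normalize_id('wq12351y059'), "\n";   # wq12351
print normalize_id('dcz0013'), "\n";       # dcz13
print normalize_id('dcz013'), "\n";        # dcz13
```

Note that `dcz013` and `dcz0013` collapse to the same key, which is exactly what lets the hash catch them as duplicates.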

Now we check to see if we have encountered this record before:

unless ($hash{$new_id}++) {
    # INSERT $row into the new table
    $new_sth->execute( ... );
}
else {
    # possibly fetch the existing row and check to see
    # if this redundant old row has better data
}
}
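Putting those pieces together, here is a minimal in-memory sketch. The sample rows are invented for illustration, and the real thing would INSERT via DBI instead of collecting into an array:

```perl
use strict;
use warnings;

# Invented sample rows standing in for $sth->fetchrow_hashref results
my @rows = (
    { ID => 'dcz013',      NAME => 'dc restaurants'  },
    { ID => 'dcz0013',     NAME => 'dc restaurants'  },
    { ID => 'wq12351y059', NAME => 'hawkinschemical' },
);

my (%seen, @kept);
for my $row (@rows) {
    my ($str, $num) = $row->{ID} =~ /(\D+)(\d+)/;
    my $new_id = lc($str) . int($num);
    next if $seen{$new_id}++;            # skip redundant rows
    push @kept, { %$row, NEW_ID => $new_id };
}

print "$_->{ID} => $_->{NEW_ID}\n" for @kept;
# dcz013 => dcz13
# wq12351y059 => wq12351
```

Only the first row of each normalized ID survives; `dcz0013` is dropped as a duplicate of `dcz013`.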
That's about the best i can think of right now. Again, best of luck. :)

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

Re^3: Database Record Order
by thor (Priest) on Dec 31, 2003 at 00:45 UTC
    1.4 million records is a lot, but you might just have enough memory to pull this off by using a hash to keep track of unique ID's (buying more RAM might be the thing to do)
    When faced with this dilemma, I reach for the AnyDBM_File module, which comes with perl. This way, your memory requirements for the hash turn into disk space requirements, which are usually a lot more lax. YMMV.
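    A minimal sketch of that approach (the filename and sample IDs are invented; AnyDBM_File picks whichever DBM library is available on your system):

```perl
use strict;
use warnings;
use AnyDBM_File;
use Fcntl;                      # for O_RDWR, O_CREAT
use File::Temp qw(tempdir);

# The %seen hash lives on disk instead of in RAM
my $dir = tempdir(CLEANUP => 1);
tie my %seen, 'AnyDBM_File', "$dir/seen_ids", O_RDWR|O_CREAT, 0644
    or die "cannot tie DBM file: $!";

for my $id (qw(dcz13 wq12351 dcz13)) {
    print "$id is a duplicate\n" if $seen{$id}++;
}

untie %seen;                    # flush and close the DBM file
```

    The tied hash is used exactly like the in-memory one, so the dedup loop itself does not change.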

    thor

Re^3: Database Record Order
by exussum0 (Vicar) on Dec 31, 2003 at 04:38 UTC
    Actually, you'd wanna select only the duplicate ones, since you don't want to deal with the ones that are fine already, right?
    SELECT DISTINCT *
    FROM TABLE_A AS A, TABLE_B AS B
    WHERE A.ID = B.ID
      AND A.SECONDCOLUMN != B.SECONDCOLUMN
    of course, doing a ..
    SELECT COUNT(*)
    FROM (
        SELECT ID, COUNT(*) AS CNT
        FROM TABLE_A
        GROUP BY ID
    ) AS COUNTS
    WHERE CNT = 1
    will get you an idea of how many truly unique records there are. ++jeffa

    Play that funky music white boy..