Another Endless Loop

I always enjoy a good bug-hunting story. Sometimes they can even teach you something, even if it's just not to do that again! This one isn't the most interesting, but I found it instructive.

I wrote a script some time ago to sanitize a list of email addresses for use in demos. The idea was to take a real list of email addresses and turn them into something safe. The easiest way I could think of to do this was to change the domain to a known safe one, like our own domain where I work: plusthree.com. The first version went something like (ignoring chunking, transactions, error checking):

        my $sth = 
          $dbh->prepare('UPDATE person SET email = ? WHERE person_id =
+ ?');
        foreach my $id (@ids) {
            my $email = $email_addresses->[$id];
            $email =~ s!\@.*!\@plusthree.com!;
            $sth->execute($email, $id);
        }
[download]

This worked fine until it hit the first duplicate. Email addresses in this DB have to be unique. When bob@microsoft.com and bob@villa.com both become bob@plusthree.com - boom! I figured I'd just have the code trap the error and put an underscore after the address until it worked, producing bob@plusthree.com, bob_@plusthree.com, bob__@plusthree.com, etc. The code looked like:

        my $sth = 
          $dbh->prepare('UPDATE person SET email = ? WHERE person_id =
+ ?');
        foreach my $id (@ids) {
            my $email = $email_addresses->[$id];
            $email =~ s!\@.*!\@plusthree.com!;

            while (1) {
                eval { $sth->execute($email, $id) };
                last if not $@;

                if ($@ =~ /duplicate/i) {
                    $email =~ s!\@!_@!;
                } else {
                    die $@;
                }
            }
        }
[download]

This worked great - until it didn't. One day I ran the script and it got halfway through a big list of emails and then stopped. Well, actually it didn't stop, it just stopped making progress. Looking at 'top' I could see MySQL was plenty busy so I decided to take a peek at the running query (using mytop, incidentally). I found:

   UPDATE person SET email = 'info___...___@plusthree.com' WHERE perso
+n_id = 408700;
[download]

Where I put in ... there were around 100,000 underscores. Ouch!

The cause was obvious once I saw the query. The email column is only 255 characters wide which means each prefix can only have a maximum of 254 dups (or less) before MySQL can't tell the difference and starts returning a duplicate error for every one.

The fix was fairly simple - I replaced the underscores with a numeric count. That should give the code plenty of room to breath.

I see a few lessons here. 1) Look hard at any while(1) loops in your code - be sure the exit condition will definitely be hit. 2) Beware of column overflow when expecting your database to detect duplicates - the DB can only check what it can store. And, perhaps, 3) switch to a DB that throws an error on column size violations rather than silently ignoring them. I'm hoping MySQL's new strict-mode will fullfill #3 but I haven't played with it yet.

-sam

Comment on Another Endless Loop Select or Download Code

Replies are listed 'Best First'.
Re: Another Endless Loop by ysth (Canon) on Feb 23, 2007 at 08:59 UTC
Just a note: there are a number of domain names that are reserved for using as examples in documentation, for DNS testing, or for constructing domain names guaranteed to be invalid: example.com; example.net; example.org; and anything under the top-level domains .test, .example, or .invalid. See RFC 2606. I'm looking for a job.	[reply]
Re: Another Endless Loop by itub (Priest) on Feb 23, 2007 at 09:46 UTC
I'm not sure exactly what do you mean by "safe", but perhaps a simpler solution would be to just append your domain name, turning bob@microsoft.com into bob@microsoft.com.plusthree.com? (The lessons are still good, of course.)	[reply]
Re^2: Another Endless Loop by ikegami (Patriarch) on Feb 23, 2007 at 19:24 UTC
Or if the intent is to hide the original domain, a hash or the original domain could be used instead.	[reply]
Re^3: Another Endless Loop by Fletch (Bishop) on Feb 23, 2007 at 19:49 UTC
And if you still want something "meaningful" map the hash into a word, or two words joined together, from a word list (like `/usr/dict/words`).	[reply]
Re^4: Another Endless Loop by samtregar (Abbot) on Feb 24, 2007 at 23:50 UTC
Re^5: Another Endless Loop by Fletch (Bishop) on Feb 25, 2007 at 16:51 UTC
Some notes below your chosen depth have not been shown here
Re^2: Another Endless Loop by samtregar (Abbot) on Feb 24, 2007 at 23:54 UTC
I meant "safe" in terms of not violating anyone's privacy. These email addresses are real ones and revealing them to other clients in a sales demo seems like a bad thing. If it's not actually illegal it probably violates our privacy policy! Just sticking something on the end doesn't seem like enough. -sam	[reply]
[OT] Re: Another Endless Loop by crashtest (Curate) on Feb 23, 2007 at 22:03 UTC
I am astounded that MySQL silently truncates your data if the column isn't wide enough. I had to see it to believe it, so I connected to the instance running on my machine and tried a little example. Sure enough: `mysql> describe x; +-------+---------+------+-----+---------+-------+ \| Field \| Type \| Null \| Key \| Default \| Extra \| +-------+---------+------+-----+---------+-------+ \| y \| char(3) \| YES \| \| NULL \| \| +-------+---------+------+-----+---------+-------+ 1 row in set (0.00 sec) mysql> insert into x values ('abcd'); Query OK, 1 row affected (0.00 sec) mysql> select y from x; +------+ \| y \| +------+ \| abc \| +------+ 1 row in set (0.03 sec)` [download] I am particularly offended by the message `Query OK`, when it clearly isn't. In what universe is it acceptable for a database - an application centered on the storage of data - to design an interface that makes it possible for the data you input to be different from the data it outputs? I am honestly curious if other databases exhibit this type of behavior. Is there some sort of precedence for this in older RDBMSs? I can't quite wrap my mind around it. My understanding was that although MySql started from humble beginnings, it was sufficiently advanced to see use in many production settings, and was quite popular. To soothe my nerves, I tried the equivalent operations in Oracle, and to my relief saw `ERROR at line 1: ORA-01401: inserted value too large for column` [download] I don't use MySql for production use (just for personal tinkering around), and think this has probably scared me off.	[reply] [d/l] [select]
Re: [OT] Re: Another Endless Loop by almut (Canon) on Feb 24, 2007 at 01:36 UTC
... I am particularly offended by the message Query OK, when it clearly isn't. I agree with you that this is questionable default behaviour, but at least it's documented, and (as of version 5.0.2) it seems you can enable a "strict mode" to have it error out: "If you assign a value to a CHAR or VARCHAR column that exceeds the column's maximum length, the value is truncated to fit. If the truncated characters are not spaces, a warning is generated. For truncation of non-space characters, you can cause an error to occur (rather than a warning) and suppress insertion of the value by using strict SQL mode. See Section 5.2.6, "SQL Modes"."	[reply]
Re^2: [OT] Re: Another Endless Loop by samtregar (Abbot) on Feb 24, 2007 at 23:48 UTC
Awesome. I was hoping strict mode would do that. We're going to v5.x soon and I'm definitely going to want to turn that on! -sam	[reply]
Re^3: [OT] Re: Another Endless Loop by dragonchild (Archbishop) on Feb 26, 2007 at 05:15 UTC
Re: MySQL foibles by hobbs (Monk) on Feb 26, 2007 at 22:53 UTC
MySQL became "good enough for production use" through a process of redefinition -- lots of people use MySQL in production, so obviously it's good enough for them, right? Seriously though, it has improved, a lot, over the years, but there are still places where not only does it do something stupid and harmful, but it doesn't even consider it important enough to tell you that it happened. You've found one of those. Hopefully the "strict mode" business eliminates most of those problems, but you know as well as I that the people who need it most won't use it at all. :\|	[reply]
Re: Another Endless Loop by Cody Pendant (Prior) on Feb 27, 2007 at 05:26 UTC
I ran into the same kind of error with numbers -- when you try to put a number over 127 into a `tinyint` column, it "truncates" it to 127 and gives no warning. The problem was clearly my fault, but I was amazed that I could type `"set number_column = 128 where foo"` and it would set it to 127 and report nothing out of the ordinary. I kind of understand the truncating of strings, but reducing numbers to fit seemed like it had violated a really important principle. Anyone know how to set that SQL Mode when working with DBI? ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss') =~y~b-v~a-z~s; print	[reply]