in reply to Which way is faster?

Recently I rediscovered something similar...

Before you spend a lot of time and effort trying to use a module, make sure it matters.

I've noticed that I tend to over-perlify/generify/engineer solutions since I started using Perl more frequently.

During a recent project I ended up needing to repeatedly check the uniqueness of about 7.8 million strings of about 15 characters each (stored in a flat file). My first solution, which I rejected after a quick test, was to just populate a hash and make sure I never added an element that already existed. Goodbye swap & CPU.
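That first attempt was essentially the following (a minimal sketch; the filename is invented for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Naive hash-based uniqueness check: every string becomes a hash key.
    my %seen;
    open my $fh, '<', 'strings.txt' or die "Can't open strings.txt: $!";
    while (my $line = <$fh>) {
        chomp $line;
        print "duplicate: $line\n" if $seen{$line}++;
    }
    close $fh;

With millions of keys, the per-key overhead of a Perl hash adds up fast, which is what pushed the machine into swap.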

The problem seemed like a fit for Set::IntSpan, so I installed it and tried it out. Unfortunately, the sets were quite large and insertion time didn't scale (it took several hours to complete). Next I thought I just needed some kind of "database", so I tried one or more of the built-in ones (e.g. SDBM or something similar). Same problem.
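The DBM attempt looked roughly like this (again, filenames are illustrative); tying the hash to an on-disk SDBM file keeps the keys out of RAM, but every insert becomes a disk operation, so insertion time was still the bottleneck:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Fcntl;
    use SDBM_File;

    # Tie a hash to an on-disk SDBM file so the keys don't live in memory.
    tie my %seen, 'SDBM_File', 'seen_db', O_RDWR | O_CREAT, 0666
        or die "Can't tie SDBM file: $!";

    open my $fh, '<', 'strings.txt' or die "Can't open strings.txt: $!";
    while (my $line = <$fh>) {
        chomp $line;
        print "duplicate: $line\n" if exists $seen{$line};
        $seen{$line} = 1;
    }
    close $fh;
    untie %seen;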

I started thinking about getting more serious and installing an SQL database, a working version of DB_File, or one of the "Sort" modules for sorting large files. Chances are good that one of these would have worked, but I was running out of time and needed to get this working.

I realized that I needed to take a step backward and think about how I would do this without Perl. I ended up just using the standard unix 'sort' utility and then having Perl run through the result to check that no two adjacent lines matched. This seems fairly low-tech, but it works in less than 10 minutes with a reasonable amount of memory, rather than several hours and/or lots of memory (and lots of installation/debugging/etc).
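The final version was essentially the sketch below (the filename is again an assumption). 'sort' does the heavy lifting, spilling to temporary files on disk instead of eating RAM, and Perl only has to remember the previous line:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Let the unix sort utility order the file, then scan the sorted
    # stream once, comparing each line to the one before it.
    open my $sorted, '-|', 'sort', 'strings.txt'
        or die "Can't run sort: $!";

    my ($prev, $dups) = (undef, 0);
    while (my $line = <$sorted>) {
        chomp $line;
        if (defined $prev && $line eq $prev) {
            $dups++;
            print "duplicate: $line\n";
        }
        $prev = $line;
    }
    close $sorted;
    print $dups ? "$dups duplicates found\n" : "all lines unique\n";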

The project is now over so I don't need something portable. Even if I did, I would probably do the same thing and leave the "optimization" (i.e. a supposedly cleaner solution) until after I had tested this further.