Re^2: How do I measure my bottle ?

Thanks you and your precious comments again!

I have to state my task formally:

My data files consist of two files transaction_info_file (TranactionID, ... personID ...) and Tranaction_detail_file (TranactionID, productID...) and they are related by one-to-many relationship. My task is to do some JOIN and SELECT/COUNT from those data. It sounds very simple before and I put everything in database with fine tuned indexes. However, size does matter.

Now I use Perl to something like Hash-Join, scan all event_info_file once, put eventid in a seen-hash, then extrieve records from event_detail_file by looking up seen-hash.

I've tried:

Database solution: SQL server, DB2, mySQL
Perl solution: Tie::File, Tie::Hash, DBI, my own implemented binary search.
another language: VB.NET

Hash-Join solution is the best among them though still slow. I understand it would be easier if I have larger storage and a well-designed datawarehouse. Is it a must if I only want to know some simple questions such as who is the best buyer ? how many customers do i have ? (this is not a routine work as those bussiness user)

by the way, i've tried pre-allocated method such as keys %hash=10_000_000,however, it seems no performance gain. Maybe this function is OS-dependent that it does not work on my windows-activeperl platform.

Comment on Re^2: How do I measure my bottle ?

Replies are listed 'Best First'.
Re^3: How do I measure my bottle ? by BrowserUk (Patriarch) on Mar 25, 2005 at 19:32 UTC
There may be a faster way to do this, but you'll need to detail more information: How many records in the transaction_info_file? How many records in the Transaction_detail_file? What fields are the records joined on? What is the nature of those fields? How long are they? what form to they take? Eg. AAAnnnnmmmmmm etc. And which, if any, parts of those fields are invarient? How often do you need to do this? How long does it currently take? What is your target for improvement? Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. Rule 1 has a caveat! -- Who broke the cabal?	[reply]
Re^3: How do I measure my bottle ? by eric256 (Parson) on Mar 25, 2005 at 21:50 UTC
If you know what criteria then you only need to remember the users who meet that criteria. It seems like you should be able to work that out somehow so that you are disregarding any info you don't need. Perhaps you could find a way to pair the info down some. Instead of storing everything in a database, just use the database to store your unique id's and the important totals. That should reduce the amount of space needed, and avoid haveing a giant hash in memory. I don't know how that would benchmark against your current strategy but it would seem that even if you can't load the whole data into the database you could figure out just enough to get your answers. Another solution might be to do a sort on the file thereby grouping your transactions togther, but my guess is that would be just as slow. Good luck and let us know if you find some cool solution ;) ___________ Eric Hodges	[reply]