in reply to How do I measure my bottle ?

Optimization involves much more than just two variables. You haven't taken memory allocation into consideration, for example.

I'd recommend that you set a goal -- the time by which you want your program to complete, given your existing hardware.

Then you write your program, and if it doesn't meet the goal time, you can work on optimization. I personally monitor what's going on using system utilities (vmstat, iostat, etc.). I don't know enough about Windows to offer recommendations on similar programs for it.
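For instance, here's a minimal sketch of checking a run against such a goal using the core Time::HiRes module (the 600-second goal and the process_records() stub are placeholders for your own numbers and code):

    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    my $goal  = 600;                  # hypothetical goal: finish in 10 minutes
    my $start = [gettimeofday];

    process_records();                # stand-in for the real work

    my $elapsed = tv_interval($start);
    printf "took %.1fs against a %ds goal\n", $elapsed, $goal;
    warn "missed the goal -- time to start optimizing\n" if $elapsed > $goal;

    sub process_records { }           # placeholder for your actual processing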

Also, from the info you've given (20M records, totalling 1GB), I'm guessing that memory may be your problem. You might want to use a tied hash, or just use a database directly.
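If memory does turn out to be the culprit, a tied hash is a small change to make. Here's a rough sketch using DB_File (the filename and key/value are made up):

    use strict;
    use warnings;
    use Fcntl qw(O_CREAT O_RDWR);
    use DB_File;

    # Tie the hash to an on-disk Berkeley DB file so the data doesn't
    # have to fit in RAM ('records.db' is a hypothetical filename).
    tie my %records, 'DB_File', 'records.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie records.db: $!";

    $records{'some_key'} = 'some value';   # reads/writes go through the file
    untie %records;

Per-key access is slower than an in-memory hash, but the data no longer has to fit within your RAM.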

Re^2: How do I measure my bottle ?
by tlm (Prior) on Mar 25, 2005 at 15:05 UTC

    One optimization I have seen used (though never profiled myself) is to tell perl how big a hash you will need:

    my %hash; keys %hash = 100_000;
    Without this line, perl has to rebuild its hash table several times as the number of keys grows. Think of it as the hash counterpart of pre-growing an array:

    my @array; $#array = 99_999;

    As I said, I've only seen this while reading source code; I don't know how much it really gets you.
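    If you're curious, a Benchmark comparison would put a number on it. A quick, untested sketch:

    use Benchmark qw(cmpthese);

    cmpthese( -3, {
        plain    => sub { my %h;                     $h{$_} = 1 for 1 .. 100_000 },
        presized => sub { my %h; keys(%h) = 100_000; $h{$_} = 1 for 1 .. 100_000 },
    } );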

    the lowliest monk

Re^2: How do I measure my bottle ?
by cbrain (Novice) on Mar 25, 2005 at 13:59 UTC

    Thank you all. Now I have a better sense of direction, since I have no experience with Unix or administrative work. Actually, I've tried Tie::File, a database, and Tie::Hash before, and each one has a problem that makes it infeasible for me. Tie::File and Tie::Hash are very slow (20 hours to read those files). And putting 1,000 GB (1 TB) into a database (I've tried DB2, SQL Server, MySQL) takes quite a lot of space, which is not an affordable solution for me.

    Thanks again, and happy weekend!

      I think you hit upon (and summarily eliminated) the best solution: a database. You have 1TB of data, and it's stored in flat files? Properly, this data should never have been put into files, but loaded directly into a database, so that you could simply query for the data you want as you want it. Yes, it takes up a fair bit more space than flat files. But you're trading space for speed. Disk space is cheap, CPU speed not so cheap.

      You say you're using an AMD64 machine. Are you running a 64-bit OS on it? If not, you may want to try that first - that may help your I/O speed somewhat, and probably will help your ability to use all your memory.

      Once you're using a 64-bit OS, it's time to get a 64-bit perl. With a 32-bit perl, you'll run out of address space long before you can load your data in.
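      To see what you're running now, perl's own build configuration will tell you. A quick sketch using the core Config module:

      use Config;

      # 8-byte pointers mean a 64-bit perl; 4 bytes means 32-bit.
      printf "pointer size: %d bytes\n", $Config{ptrsize};
      printf "native int size: %d bytes\n", $Config{ivsize};
      print  "this perl cannot address more than a few GB\n"
          if $Config{ptrsize} < 8;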

      Finally, you can get a 64-bit database. I know, I know, I'm harping on this. But let's face it: you have 1TB of data you're trying to work with, but only 2GB of RAM. The other 998GB of data will simply get swapped out to disk while you're loading it from disk. This is going to be incredibly slow. Use a database - it has algorithms and code that are intended to deal with this type of problem, written in highly-optimised (if you believe the TPC ratings) C code. Load data as you need it, discard it when you're done with it. Put as much logic as you can into your SQL statements, and let the database handle getting your data in the most efficient manner possible.
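      In case it helps, here's a rough sketch of that pattern with DBI -- the DSN, credentials, table, and column names are all made up, but the shape is the point: filter in SQL, fetch one row at a time:

      use strict;
      use warnings;
      use DBI;

      # Hypothetical DSN, credentials, table, and column names.
      my ($user, $pass) = ('me', 'secret');
      my $dbh = DBI->connect('dbi:mysql:mydata', $user, $pass,
                             { RaiseError => 1 });

      # Filter and sort in SQL so only the rows you need cross the wire.
      my $sth = $dbh->prepare(
          'SELECT id, total FROM records WHERE total > ? ORDER BY total DESC'
      );
      $sth->execute(1000);

      while ( my ($id, $total) = $sth->fetchrow_array ) {
          # work with one row at a time -- never the whole 1TB in memory
      }
      $sth->finish;
      $dbh->disconnect;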

      I really, honestly think that if you cannot afford the database storage, you can't afford any solution. Storage is relatively cheap, and trying to load everything into memory is simply going to fail. The Tie::* modules are likely your next best bet, as they probably also load and discard data as needed, allowing you to live within your 2GB of RAM.