Re: How do I measure my bottle ?
by Fletch (Bishop) on Mar 25, 2005 at 13:29 UTC
If this were a real OS :) I'd say you need to use whatever your particular flavour of iostat is to get statistics on things such as paging and how much time you're spending in disk wait (for Solaris, I'd also recommend a copy of the Porsche book by Cockcroft). I'm guessing there's probably some analogue for Windows (other than the Task Manager) that can collect similar statistics.
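Just as a rough, cross-platform check from inside Perl itself (not a replacement for iostat), you can compare wall-clock time against CPU time for the same run; if the wall clock is much larger, you're probably waiting on the disk. A minimal sketch, with a made-up input file name:

use strict;
use warnings;
use Time::HiRes qw(time);

my $file = 'big_input.txt';    # hypothetical input file

my $wall_start = time();
my ($user_start, $sys_start) = (times())[0, 1];

open my $in, '<', $file or die "Can't open $file: $!";
while (my $line = <$in>) {
    # ... the work you want to measure ...
}
close $in;

my $wall = time() - $wall_start;
my ($user_end, $sys_end) = (times())[0, 1];
my $cpu  = ($user_end - $user_start) + ($sys_end - $sys_start);

# Wall time much greater than CPU time suggests the process is
# sitting in disk (or swap) wait rather than doing real work.
printf "wall: %.2fs  cpu: %.2fs\n", $wall, $cpu;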
And "Perl", not "PERL".
Re: How do I measure my bottle ?
by RazorbladeBidet (Friar) on Mar 25, 2005 at 13:33 UTC
Thanks for your rapid reply; my test scripts are listed below:
script 1:
while (my $l = <IN>) {
}
script 2:
while (my $l = <IN>) {
    my $id = substr($l, 0, 33);
    $hash{$id} = 1;
}
Script 1 takes me around 15 seconds, whereas script 2 takes me 20 seconds.
To see how much of the extra time comes from the hash store itself (rather than from the substr), you could also time a variant that keeps the substr but assigns to a plain scalar:
while (my $l = <IN>) {
    my $id = substr($l, 0, 33);
    $hash = 1;    # plain scalar, so the hash store drops out of the picture
}
(Assuming, of course, that the compiler doesn't optimize any of this away.)
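If you want hard numbers on the relative cost of each piece, Benchmark can compare the variants directly. A rough sketch, assuming the lines are pre-read into an array so that disk I/O doesn't swamp the comparison (the record format here is made up):

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Fake records: a 33-character ID followed by some payload.
my @lines = map { sprintf "%033d some payload here\n", $_ } 1 .. 100_000;

cmpthese( -5, {
    loop_only   => sub { for my $l (@lines) { } },
    substr_only => sub { for my $l (@lines) { my $id = substr($l, 0, 33) } },
    substr_hash => sub {
        my %hash;
        for my $l (@lines) {
            my $id = substr($l, 0, 33);
            $hash{$id} = 1;
        }
    },
} );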
the lowliest monk
Re: How do I measure my bottle ?
by jhourcle (Prior) on Mar 25, 2005 at 13:39 UTC
Optimization is about much more than just those two variables. You haven't taken memory allocation into consideration, for example.
I'd recommend that you set a goal -- the time by which you want your program to complete, given your existing hardware.
Then you write your program, and if it doesn't meet the goal time, you can work on optimization. I personally monitor what's going on using system utilities (vmstat, iostat, etc.). I don't know enough about Windows to offer recommendations on similar programs there.
Also, from the info you've given (20M records, totalling 1GB), I'm guessing there is a chance that memory is your problem. You might want to use a tied hash, or just use a database directly.
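For example (a minimal sketch only; the file name and field width are placeholders), a tied hash via DB_File keeps the keys and values on disk so memory use stays roughly flat, at the cost of speed:

use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);

# The keys and values live in seen.db on disk rather than in RAM.
tie my %seen, 'DB_File', 'seen.db', O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "Cannot tie seen.db: $!";

open my $in, '<', 'records.txt' or die $!;    # hypothetical input file
while (my $line = <$in>) {
    my $id = substr($line, 0, 33);
    $seen{$id} = 1;
}
close $in;

untie %seen;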
Something else that might help is pre-sizing the hash:
my %hash;
keys %hash = 100_000;
Without this line, perl basically has to build a new hash table several times as the number of keys grows. (Think of it as the hash counterpart of pre-growing an array:
my @array;
$#array = 99_999;
)
I've only ever seen this done while reading source code, though; I don't know how much it really gets you.
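A quick (unscientific) way to see what it buys on a given build, assuming the key count is known up front (the count here is arbitrary):

use strict;
use warnings;
use Time::HiRes qw(time);

my $n = 1_000_000;

my $t0 = time();
my %plain;
$plain{$_} = 1 for 1 .. $n;
printf "no pre-sizing: %.2fs\n", time() - $t0;

my $t1 = time();
my %presized;
keys(%presized) = $n;    # pre-allocate the hash buckets
$presized{$_} = 1 for 1 .. $n;
printf "pre-sized:     %.2fs\n", time() - $t1;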
the lowliest monk
Thank you all. Now I have a better sense of direction, since I have no experience with Unix or with administrative work. Actually, I've tried Tie::File, a database, and Tie::Hash before, and each has a problem that makes it infeasible for me. Tie::File and Tie::Hash are very slow (20 hours to read those files), while putting 1,000 GB (1 TB) into a database (I've tried DB2, SQL Server, and MySQL) takes quite a bit more space than I can afford.
Thanks again, and happy weekend!
I think you hit upon (and summarily eliminated) the best solution: a database. You have 1TB of data. It's stored in flat files? Properly, this data should have never been put into files, but directly into a database, and you would simply query for the data you want as you want it. Yes, it takes up a fair bit more space than flat files. But you're trading space for speed. Disk space is cheap, CPU speed not so cheap.
You say you're using an AMD64 machine. Are you running a 64-bit OS on it? If not, you may want to try that first - that may help your I/O speed somewhat, and probably will help your ability to use all your memory.
Once you're using a 64-bit OS, it's time to get a 64-bit perl. With a 32-bit perl, you'll run out of address space long before you can load your data in.
Finally, then you can get a 64-bit database. I know, I know, I'm harping on this. But, let's face it. You have 1TB of data you're trying to work with, but only 2GB of RAM. The other 998GB of data will simply get swapped out to disk while you're loading it from disk. This is going to be incredibly!!!! slow. Use a database - it has algorithms and code that are intended to deal with this type of problem, written in highly-optimised (if you believe the TPC ratings) C code. Load data as you need it, discard it when you're done with it. Put as much logic as you can into your SQL statements, let the database handle getting your data in the most efficient manner possible.
I really, honestly, think that if you cannot afford the database storage, you can't afford any solution. Storage is relatively cheap, and trying to load everything into memory is simply going to fail. The Tie::* modules are likely your next best bet as they probably also load/discard data as needed, allowing you to live within your 2GB of RAM.
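As an illustration of what pushing the logic into SQL looks like (a sketch only; the DSN, table, and column names are all invented), a question like "who are my best buyers?" collapses into a single query and the database worries about how to execute it:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=sales', 'user', 'password',
                       { RaiseError => 1 });

my $sql = q{
    SELECT   customer_id, COUNT(*) AS purchases
    FROM     transactions
    GROUP BY customer_id
    ORDER BY purchases DESC
    LIMIT    10
};

# Print the ten heaviest buyers; the database does the grouping and sorting.
for my $row (@{ $dbh->selectall_arrayref($sql) }) {
    printf "%s\t%d\n", @$row;
}

$dbh->disconnect;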
Re: How do I measure my bottle ?
by jdporter (Paladin) on Mar 25, 2005 at 14:13 UTC
I agree that the problem as stated is fairly interesting. However, I wonder if maybe we have an X/Y situation here: rather than figuring out how to find the bottleneck(s) in your current algorithm, perhaps we could find a better algorithm for your underlying problem. Which is... ?
Re: How do I measure my bottle ?
by BrowserUk (Patriarch) on Mar 25, 2005 at 15:22 UTC
Almost all of that extra 5 seconds is spent allocating and reallocating memory as the hash is extended.
In theory, you could pre-extend your hash by using keys as an lvalue: keys %hash = 1234567;.
In reality, there are two problems with this.
- The number you assign is the number of "buckets" that are pre-allocated to the hash; but there is no way to pre-determine how many buckets your hash will need, even if you can predict how many keys you will assign.
- The space allocated to "buckets" is only a tiny proportion of the space required to build the hash.
If you run the following 1-liner:
perl -e"keys %hash = 2_000_000; <>; $hash{ $_ }++ for 1 .. 2_000_000;
+<>;"
When activity stops at the first input, the space allocated to the process will be around 18 MB.
If you then hit enter, the hash is populated and the space requirement grows to around 192 MB.
Most of the time is spent allocating the space for the individual keys and values, not the buckets (it seems?).
In the past, I've tried various schemes to grab the required space from the OS in a single chunk rather than in zillions of iddy-biddy bits, but I haven't found a way of doing this with the memory allocator used by the AS-built perls.
You can very quickly grab a large chunk of memory from the OS using 'X' x ( 200*1024**2); for example.
D:\>perl -we"'X'x(200*1024**2); <>; $h{ $_ }++ for 1 .. 2_000_000;<>"
Useless use of repeat (x) in void context at -e line 1.
Name "main::h" used only once: possible typo at -e line 1.
Despite the "Useless use of repeat (x) in void context" message, you'll see that at the first prompt, the requried 200 MB of space has been allocated to the process (almost instantly), but when you hit enter, the process then goes on to grab 200MB more space.
Despite that the initial 200 MB was never assigned to anything--that space is never reused. I've tried in vain to find a way to pursuade perl to reuse it.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.
Rule 1 has a caveat! -- Who broke the cabal?
Thank you again for your precious comments!
Let me state my task more formally:
My data consists of two files, a transaction_info_file (TransactionID, ... personID ...) and a transaction_detail_file (TransactionID, productID ...), related by a one-to-many relationship. My task is to do some JOIN and SELECT/COUNT work over those data. It sounded very simple at first, and I put everything into a database with fine-tuned indexes. However, size does matter.
Now I use Perl to do something like a hash join: scan the info file once, put each transaction ID into a 'seen' hash, then retrieve records from the detail file by looking them up in that hash (sketched below). I've tried:
- Database solutions: SQL Server, DB2, MySQL
- Perl solutions: Tie::File, Tie::Hash, DBI, my own binary-search implementation
- Another language: VB.NET
The hash-join solution is the best of these, though it is still slow. I understand it would be easier if I had more storage and a well-designed data warehouse, but is that a must if I only want to answer a few simple questions, such as who the best buyer is, or how many customers I have? (This is not routine work for me, as it would be for business users.)
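For reference, a stripped-down sketch of that hash-join pass (the file names, the tab-separated layout, and the person filter are placeholders):

use strict;
use warnings;

# Pass 1: remember the transaction IDs of interest.
my %seen;
open my $info, '<', 'transaction_info.txt' or die $!;
while (my $line = <$info>) {
    my ($tran_id, $person_id) = (split /\t/, $line)[0, 1];
    $seen{$tran_id} = 1 if $person_id eq 'P0000042';    # placeholder criterion
}
close $info;

# Pass 2: count the matching detail records in a single streaming pass.
my $count = 0;
open my $detail, '<', 'transaction_detail.txt' or die $!;
while (my $line = <$detail>) {
    my ($tran_id) = (split /\t/, $line)[0];
    $count++ if exists $seen{$tran_id};
}
close $detail;

print "matching detail records: $count\n";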
By the way, I've tried the pre-allocation method, e.g. keys %hash = 10_000_000; however, there seems to be no performance gain. Maybe this feature is OS-dependent and does not work on my Windows/ActivePerl platform.
If you know what the criteria are, then you only need to remember the users who meet them. It seems like you should be able to work that out somehow so that you are disregarding any info you don't need. Perhaps you could find a way to pare the info down some. Instead of storing everything in a database, just use the database to store your unique IDs and the important totals. That should reduce the amount of space needed and avoid having a giant hash in memory. I don't know how that would benchmark against your current strategy, but it seems that even if you can't load all the data into the database, you could load just enough to get your answers.
Another solution might be to do a sort on the file, thereby grouping your transactions together, but my guess is that would be just as slow.
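In case it's worth trying, a rough sketch of that sorted-file pass, assuming the detail file has already been sorted on its first (ID) column, e.g. with the system sort utility; because equal IDs are then adjacent, per-ID totals need only one small counter instead of a giant hash (the file name is hypothetical):

use strict;
use warnings;

my ($current_id, $count) = ('', 0);

open my $in, '<', 'detail_sorted.txt' or die $!;    # hypothetical pre-sorted file
while (my $line = <$in>) {
    my ($id) = (split /\t/, $line)[0];
    if ($id ne $current_id) {
        print "$current_id\t$count\n" if $current_id ne '';
        ($current_id, $count) = ($id, 0);
    }
    $count++;
}
print "$current_id\t$count\n" if $current_id ne '';
close $in;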
Good luck and let us know if you find some cool solution ;)
Re: How do I measure my bottle ?
by NateTut (Deacon) on Mar 25, 2005 at 20:10 UTC
IMNSHO the Teradata database is the way to go if you can afford it. It is a proprietary relational DB (with a nice DBI/DBD for Perl). It is an MPP system with multiple nodes that all run in parallel to service each request. It has loading utilities that can load data files like the ones you describe very quickly.
Probably way over your budget though...