in reply to How do I measure my bottle ?

Almost all of that extra 5 seconds is spent allocating and reallocating memory as the hash is extended.

In theory, you could pre-extend your hash by using keys as an lvalue: keys %hash = 1234567;.

In reality, there are two problems with this.

  1. The number you assign is the number of "buckets" that are pre-allocated to the hash; but there is no way to pre-determine how many buckets your hash will need, even if you can predict how many keys you will assign.
  2. The space allocated to "buckets" is only a tiny proportion of the space required to build the hash.

    If you run the following 1-liner:

    perl -e"keys %hash = 2_000_000; <>; $hash{ $_ }++ for 1 .. 2_000_000; +<>;"

    When activity stops at the first input, the space allocated to the process will be around 18 MB.

    If you then hit enter, the hash is populated and the space requirement grows to around 192 MB.

Most of the time is spent allocating the space for the individual keys and values, not the buckets, or so it seems.
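If you want a rough idea of how much the bucket pre-allocation alone buys you, a comparison along these lines (just a sketch, using the core Benchmark module and the same 2_000_000 keys as above) will show the difference, such as it is:

    use strict;
    use warnings;
    use Benchmark qw( timethese );

    my $n = 2_000_000;    # same key count as the one-liner above

    timethese( 3, {
        # let the hash grow its bucket array on demand
        default  => sub { my %h; $h{ $_ }++ for 1 .. $n },
        # pre-extend the bucket array using keys() as an lvalue
        presized => sub { my %h; keys %h = $n; $h{ $_ }++ for 1 .. $n },
    } );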

In the past, I've tried various schemes to grab the required space from the OS in a single chunk rather than in zillions of itty-bitty bits, but I haven't found a way of doing this with the memory allocator used by the AS-built perls.

You can very quickly grab a large chunk of memory from the OS using, for example, 'X' x ( 200*1024**2);.

D:\>perl -we"'X'x(200*1024**2); <>; $h{ $_ }++ for 1 .. 2_000_000;<>"
Useless use of repeat (x) in void context at -e line 1.
Name "main::h" used only once: possible typo at -e line 1.

Despite the "Useless use of repeat (x) in void context" message, you'll see that at the first prompt, the requried 200 MB of space has been allocated to the process (almost instantly), but when you hit enter, the process then goes on to grab 200MB more space.

Even though the initial 200 MB was never assigned to anything, that space is never reused. I've tried in vain to find a way to persuade perl to reuse it.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.
Rule 1 has a caveat! -- Who broke the cabal?

Re^2: How do I measure my bottle ?
by cbrain (Novice) on Mar 25, 2005 at 16:32 UTC

    Thank you again for your precious comments!

    Let me state my task more formally:

    My data consist of two files, transaction_info_file (TransactionID, ... personID ...) and Transaction_detail_file (TransactionID, productID, ...), related by a one-to-many relationship. My task is to do some JOINs and SELECT/COUNTs over that data. It sounded very simple at first, so I put everything in a database with fine-tuned indexes. However, size does matter.

    Now I use Perl to do something like a hash join: scan the whole event_info_file once, put each eventid into a "seen" hash, then retrieve the matching records from event_detail_file by looking them up in the seen hash.
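    In rough outline, the approach looks like the sketch below (simplified; it assumes tab-delimited files with the ID in the first column, which may differ from my real layout):

        use strict;
        use warnings;

        # Pass 1: remember every ID seen in the info file
        # (assumes tab-delimited records with the ID in the first field).
        my %seen;
        open my $info, '<', 'event_info_file' or die "event_info_file: $!";
        while ( <$info> ) {
            my( $id ) = split /\t/;
            $seen{ $id } = 1;
        }
        close $info;

        # Pass 2: pull out the matching records from the detail file.
        open my $detail, '<', 'event_detail_file' or die "event_detail_file: $!";
        while ( <$detail> ) {
            my( $id ) = split /\t/;
            print if $seen{ $id };
        }
        close $detail;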

    I've tried:

    • Database solutions: SQL Server, DB2, MySQL
    • Perl solutions: Tie::File, Tie::Hash, DBI, my own binary-search implementation
    • Another language: VB.NET
    The hash-join solution is the best among them, though still slow. I understand it would be easier if I had more storage and a well-designed data warehouse, but is that a must if I only want to answer some simple questions, such as who is the best buyer, or how many customers do I have? (This is not routine work like those business users do.)

    By the way, I've tried the pre-allocation method, such as keys %hash = 10_000_000; however, it seems to give no performance gain. Maybe this feature is OS-dependent and does not work on my Windows ActivePerl platform.

      There may be a faster way to do this, but you'll need to supply more information:

      • How many records in the transaction_info_file?
      • How many records in the Transaction_detail_file?
      • What fields are the records joined on?
      • What is the nature of those fields?
        • How long are they?
        • What form do they take?

          E.g. AAAnnnnmmmmmm, etc. And which, if any, parts of those fields are invariant?

      • How often do you need to do this?
      • How long does it currently take?
      • What is your target for improvement?

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco.
      Rule 1 has a caveat! -- Who broke the cabal?

      If you know what the criteria are, then you only need to remember the users who meet those criteria. It seems like you should be able to work that out somehow so that you are disregarding any info you don't need. Perhaps you could find a way to pare the info down some. Instead of storing everything in a database, just use the database to store your unique IDs and the important totals. That should reduce the amount of space needed and avoid having a giant hash in memory. I don't know how that would benchmark against your current strategy, but it seems that even if you can't load all the data into the database, you could figure out just enough to get your answers.
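      For example, if the question is "who is the best buyer", something along these lines keeps just one running total per person instead of every record (only a sketch: the file names and field positions are assumptions, and the same counters could just as easily live in a small database table):

          use strict;
          use warnings;

          # Map TransactionID -> PersonID from the info file, then keep one
          # counter per person while scanning the detail file.
          # (assumes tab-delimited files; adjust the field positions)
          my( %person_of, %items_bought );

          open my $info, '<', 'transaction_info_file' or die $!;
          while ( <$info> ) {
              my( $tid, $pid ) = ( split /\t/ )[ 0, 1 ];
              $person_of{ $tid } = $pid;
          }
          close $info;

          open my $detail, '<', 'transaction_detail_file' or die $!;
          while ( <$detail> ) {
              my( $tid ) = split /\t/;
              $items_bought{ $person_of{ $tid } }++ if exists $person_of{ $tid };
          }
          close $detail;

          my( $best ) = sort { $items_bought{ $b } <=> $items_bought{ $a } } keys %items_bought;
          print "Best buyer: $best ($items_bought{ $best } detail records)\n";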

      Another solution might be to do a sort on the file, thereby grouping your transactions together, but my guess is that would be just as slow.
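      If you did go the sorting route, the usual trick is to sort both files on the join key first (using an external sort that can handle files bigger than memory) and then walk the two sorted files in step, which needs almost no memory at all. Very roughly, with the field positions assumed:

          use strict;
          use warnings;

          # Both files pre-sorted on the join key (assumed to be column 1),
          # e.g. with GNU sort:  sort -t"\t" -k1,1 event_info_file > info.sorted
          open my $info,   '<', 'info.sorted'   or die $!;
          open my $detail, '<', 'detail.sorted' or die $!;

          my $d = <$detail>;
          while ( my $i = <$info> ) {
              my( $i_id ) = split /\t/, $i;
              # skip detail records whose key sorts before the current info key
              while ( defined $d and ( split /\t/, $d )[0] lt $i_id ) {
                  $d = <$detail>;
              }
              # emit every detail record that matches the current info key
              while ( defined $d and ( split /\t/, $d )[0] eq $i_id ) {
                  print $d;
                  $d = <$detail>;
              }
          }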

      Good luck and let us know if you find some cool solution ;)


      ___________
      Eric Hodges