cbrain has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

Here is my scenario: I have hundreds of plain text files, totalling 1,000 GB. My task is very simple: do one sequential read through them with <IN> and put some variables into a hash (maybe 20,000,000 records). To determine the bottleneck of my task, I performed two little experiments, as below:

  1. Simply reading a 500 MB file sequentially with <IN> takes around 15 secs. After I add one line such as $hash{$_}=1, it takes around 20 secs. ---> It's reasonable to think the bottleneck is the CPU's computing power.
  2. After promoting my PERL process's priority to High, the PERL thread's CPU utilization only gets to about 90% ---> maybe that implies my system's IO is too slow to feed data to my CPU?

I'm looking for a utility/script/method to determine whether my real bottleneck is IO (hardware) or CPU (the perl script). It seems that the traditional Benchmark module helps little here. Any advice would be very much appreciated!

By the way, my system's hardware configuration is: Athlon 64 2800 with 2 GB RAM, UDMA 133 IDE hard disk, OS Windows 2003.


Replies are listed 'Best First'.
Re: How do I measure my bottle ?
by Fletch (Bishop) on Mar 25, 2005 at 13:29 UTC

    If this were a real OS :) I'd say you need to use whatever your particular flavour of iostat is to get statistics on things such as paging and how much time you're spending in disk wait (and for Solaris I'd recommend a copy of the Porsche book by Cockcroft). I'm guessing there's probably some analogue for Windows (other than the Task Manager) that should be able to collect similar statistics.

    And "Perl", not "PERL".

Re: How do I measure my bottle ?
by RazorbladeBidet (Friar) on Mar 25, 2005 at 13:33 UTC
    There is a performance monitor in the Administrative Tools of most recent versions of Windows (not sure exactly what Win 2003 ships with)... that would probably be your simplest bet. You can view it on a per-process basis. I'm sure there's a Perl module to access this data if you so choose.

    However, I'm a little confused. Are you slurping the file into a scalar and using a 500 MB hash key? Or does the first line of that file take 15 seconds to read? Also, does the read take 15 seconds and the hash insert take 20 seconds (35 seconds total), or is the total 20 seconds?

    One option would be to have a multi-threaded program where one thread reads the file and the other works on the hash.
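
    Something like the sketch below, assuming an ithreads-enabled perl with Thread::Queue (the file name is a placeholder) -- though the per-line queueing overhead may well eat any gain:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $q = Thread::Queue->new();

    # Worker thread: owns the hash and does the inserts.
    my $worker = threads->create(sub {
        my %hash;
        while (defined(my $line = $q->dequeue())) {
            my $id = substr($line, 0, 33);
            $hash{$id} = 1;
        }
        return scalar keys %hash;    # return a summary, not the whole hash
    });

    # Main thread: reads the file and feeds the queue.
    open my $in, '<', 'data.txt' or die "open: $!";   # placeholder file name
    while (my $line = <$in>) {
        $q->enqueue($line);
    }
    close $in;

    $q->enqueue(undef);              # tell the worker we're done
    print "unique keys: ", $worker->join(), "\n";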
    --------------
    "But what of all those sweet words you spoke in private?"
    "Oh that's just what we call pillow talk, baby, that's all."
      Thanks for your rapid reply. My test scripts are listed as follows:

      script 1:

      while(my $l=<IN>){ }

      script 2:

      while(my $l=<IN>){ my $id=substr($l,0, 33); $hash{$id}=1; }

      Script 1 takes around 15 secs, whereas script 2 takes 20 secs.

        Then your hash insert is only taking 5 seconds (all other things being equal).

        There is the memory consideration, also (as stated below).

        Is this 20M records totalling 1GB or 1 TeraByte? (You mention 1,000 GB in your original post).

        Is there a reason you are using a hash? (in your example it looks like you could use an array, but I understand it is merely a "test")

        If you have many files (and it sounds like you do), you could slurp in the entire file (one file at a time) and do the inserts, which will increase your memory usage but may decrease CPU time. See File::Slurp.
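
        For instance (a rough sketch, assuming File::Slurp is installed; the glob pattern and field positions are placeholders):

        use strict;
        use warnings;
        use File::Slurp qw(read_file);

        my %hash;
        for my $file (glob 'data/*.txt') {      # placeholder pattern
            # whole file in memory, but only one file at a time
            my @lines = read_file($file);
            for my $line (@lines) {
                my $id = substr($line, 0, 33);
                $hash{$id} = 1;
            }
        }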
        --------------
        "But what of all those sweet words you spoke in private?"
        "Oh that's just what we call pillow talk, baby, that's all."

        If you want a handle on the cost of hash inserts, you may as well make the comparison more precise by making script 1 do something like:

        while(my $l=<IN>){ my $id=substr($l,0,33); $hash=1; }
        (Assuming, of course, that the compiler doesn't optimize any of this away.)

        the lowliest monk

Re: How do I measure my bottle ?
by jhourcle (Prior) on Mar 25, 2005 at 13:39 UTC

    Optimization is much more than just two variables. You haven't taken into consideration memory allocation, for example.

    I'd recommend that you set a goal -- the time by which you want your program to complete, given your existing hardware.

    Then you write your program, and if it doesn't meet the goal time, you can work on optimization. I personally monitor what's going on using system utilities (vmstat, iostat, etc.). I don't know enough about Windows to offer recommendations on similar programs there.

    Also, from the info you've given (20M records, totalling 1GB), I'm guessing there is a chance that memory is your problem. You might want to use a tied hash, or just directly use a database.
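
    For example, a disk-backed tied hash -- just a sketch, assuming DB_File is available; the file names are placeholders:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;                       # for O_RDWR, O_CREAT

    # The hash lives on disk, so memory stays flat (at the cost of speed).
    tie my %hash, 'DB_File', 'seen.db', O_RDWR|O_CREAT, 0666, $DB_HASH
        or die "Cannot tie seen.db: $!";

    open my $in, '<', 'data.txt' or die $!;   # placeholder file name
    while (my $line = <$in>) {
        my $id = substr($line, 0, 33);
        $hash{$id} = 1;              # written to seen.db, not kept in RAM
    }
    close $in;

    untie %hash;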

      One optimization I have seen used (though never profiled myself) is to tell perl how big a hash you will need:

      my %hash; keys %hash = 100_000;
      Without this line, perl basically has to build a new hash table several times as the number of keys grows. (Think of it as the hash counterpart of pre-growing an array:
      my @array; $#array = 99_999;
      )

      As I said, I've only seen this while reading source code; I don't know how much it really gets you.

      the lowliest monk

      Thank you all. Now I have a better direction, since I have no experience with Unix or administrative work. Actually, I've tried Tie::File, a database, and Tie::Hash before; each has a problem that makes it infeasible for me. Tie::File and Tie::Hash are very slow (20 hours to read those files), and putting 1,000 GB (1 TB) into a database (I've tried DB2, SQL Server, MySQL) takes quite a lot of space, which is not an affordable solution for me.

      Thanks again, and happy weekend!

        I think you hit upon (and summarily eliminated) the best solution: a database. You have 1TB of data. It's stored in flat files? Properly, this data should have never been put into files, but directly into a database, and you would simply query for the data you want as you want it. Yes, it takes up a fair bit more space than flat files. But you're trading space for speed. Disk space is cheap, CPU speed not so cheap.

        You say you're using an AMD64 machine. Are you running a 64-bit OS on it? If not, you may want to try that first - that may help your I/O speed somewhat, and probably will help your ability to use all your memory.

        Once you're using a 64-bit OS, it's time to get a 64-bit perl. With a 32-bit perl, you'll run out of address space long before you can load your data in.

        Finally, then you can get a 64-bit database. I know, I know, I'm harping on this. But, let's face it. You have 1TB of data you're trying to work with, but only 2GB of RAM. The other 998GB of data will simply get swapped out to disk while you're loading it from disk. This is going to be incredibly!!!! slow. Use a database - it has algorithms and code that are intended to deal with this type of problem, written in highly-optimised (if you believe the TPC ratings) C code. Load data as you need it, discard it when you're done with it. Put as much logic as you can into your SQL statements, let the database handle getting your data in the most efficient manner possible.
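
        As a rough illustration of what I mean -- the DSN, credentials, and table/column names below are all made up, and you'd need the appropriate DBD installed:

        use strict;
        use warnings;
        use DBI;

        # Hypothetical connection and schema -- adjust to your own.
        my $dbh = DBI->connect('dbi:mysql:dbname=sales', 'user', 'password',
                               { RaiseError => 1 });

        # Let the database do the join and the counting.
        my $sth = $dbh->prepare(q{
            SELECT i.personID, COUNT(*) AS purchases
            FROM   transaction_info   AS i
            JOIN   transaction_detail AS d ON d.transactionID = i.transactionID
            GROUP  BY i.personID
            ORDER  BY purchases DESC
        });
        $sth->execute;

        while (my ($person, $purchases) = $sth->fetchrow_array) {
            print "$person\t$purchases\n";
        }
        $dbh->disconnect;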

        I really, honestly, think that if you cannot afford the database storage, you can't afford any solution. Storage is relatively cheap, and trying to load everything into memory is simply going to fail. The Tie::* modules are likely your next best bet as they probably also load/discard data as needed, allowing you to live within your 2GB of RAM.

Re: How do I measure my bottle ?
by jdporter (Paladin) on Mar 25, 2005 at 14:13 UTC
    I agree that the problem as stated is fairly interesting. However, I wonder if maybe we have an X/Y situation here: rather than figuring out how to find the bottleneck(s) in your current algorithm, perhaps we could find a better algorithm for your underlying problem. Which is... ?
Re: How do I measure my bottle ?
by BrowserUk (Patriarch) on Mar 25, 2005 at 15:22 UTC

    Almost all of those extra 5 seconds is spent allocating and reallocating memory as the hash is extended.

    In theory, you could pre-extend your hash by using keys as an lvalue: keys %hash = 1234567;.

    In reality, there are two problems with this.

    1. The number you assign is the number of "buckets" that are pre-allocated to the hash; but there is no way to pre-determine how many buckets your hash will need, even if you can predict how many keys you will assign.
    2. The space allocated to "buckets" is only a tiny proportion of the space required to build the hash.

      If you run the following 1-liner:

      perl -e"keys %hash = 2_000_000; <>; $hash{ $_ }++ for 1 .. 2_000_000; <>;"

      When activity stops at the first input, the space allocated to the process will be around 18 MB.

      If you then hit enter, the hash is populated and the space requirement grows to around 192 MB.

    Most of the time is spent allocating the space for the individual keys and values, not the buckets (it seems?).

    In the past, I've tried various schemes to try and grab the required space from the OS in a single chunk rather than in zillions of iddy-biddy bits, but I haven't found a way of doing this using the memory allocator used by the AS-built perls.

    You can very quickly grab a large chunk of memory from the OS using 'X' x ( 200*1024**2); for example.

    D:\>perl -we"'X'x(200*1024**2); <>; $h{ $_ }++ for 1 .. 2_000_000;<>"
    Useless use of repeat (x) in void context at -e line 1.
    Name "main::h" used only once: possible typo at -e line 1.

    Despite the "Useless use of repeat (x) in void context" message, you'll see that at the first prompt, the requried 200 MB of space has been allocated to the process (almost instantly), but when you hit enter, the process then goes on to grab 200MB more space.

    Even though the initial 200 MB was never assigned to anything, that space is never reused. I've tried in vain to find a way to persuade perl to reuse it.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
    Rule 1 has a caveat! -- Who broke the cabal?

      Thank you again for your valuable comments!

      I have to state my task formally:

      My data consist of two kinds of files, transaction_info_file (TransactionID, ... personID ...) and Transaction_detail_file (TransactionID, productID, ...), and they are related by a one-to-many relationship. My task is to do some JOIN and SELECT/COUNT over those data. It sounded very simple at first, and I put everything into a database with fine-tuned indexes. However, size does matter.

      Now I use Perl to do something like a hash join: scan the whole event_info_file once, put each eventid into a seen-hash, then retrieve records from event_detail_file by looking them up in the seen-hash.
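
      Roughly, my code looks like the simplified sketch below (the file names, tab separator, and field positions are just placeholders for my real layout):

      use strict;
      use warnings;

      # Pass 1: remember the transaction IDs (and their persons) from the info file.
      my %seen;
      open my $info, '<', 'transaction_info.txt' or die $!;      # placeholder name
      while (my $line = <$info>) {
          my ($tran_id, $person_id) = (split /\t/, $line)[0, 1]; # placeholder layout
          $seen{$tran_id} = $person_id;
      }
      close $info;

      # Pass 2: stream the detail file and count per person via the seen-hash.
      my %count;
      open my $detail, '<', 'transaction_detail.txt' or die $!;  # placeholder name
      while (my $line = <$detail>) {
          my ($tran_id) = (split /\t/, $line)[0];
          next unless exists $seen{$tran_id};
          $count{ $seen{$tran_id} }++;
      }
      close $detail;

      # e.g. "who is the best buyer?"
      my ($best) = sort { $count{$b} <=> $count{$a} } keys %count;
      print "best buyer: $best ($count{$best} detail records)\n";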

      I've tried:

      • Database solution: SQL server, DB2, mySQL
      • Perl solution: Tie::File, Tie::Hash, DBI, my own implemented binary search.
      • another language: VB.NET
      The hash-join solution is the best among them, though still slow. I understand it would be easier if I had larger storage and a well-designed data warehouse. Is that a must if I only want to answer some simple questions, such as: who is the best buyer? How many customers do I have? (This is not routine work like a business user's.)

      By the way, I've tried the pre-allocation method, such as keys %hash = 10_000_000; however, it gives no apparent performance gain. Maybe this feature is OS-dependent and does not work on my Windows/ActivePerl platform.

        There may be a faster way to do this, but you'll need to provide more information:

        • How many records in the transaction_info_file?
        • How many records in the Transaction_detail_file?
        • What fields are the records joined on?
        • What is the nature of those fields?
          • How long are they?
          • What form do they take?

            E.g. AAAnnnnmmmmmm, etc. And which, if any, parts of those fields are invariant?

        • How often do you need to do this?
        • How long does it currently take?
        • What is your target for improvement?

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco.
        Rule 1 has a caveat! -- Who broke the cabal?

        If you know what the criteria are, then you only need to remember the users who meet them. It seems like you should be able to work that out somehow so that you are disregarding any info you don't need. Perhaps you could find a way to pare the info down some. Instead of storing everything in a database, just use the database to store your unique IDs and the important totals. That should reduce the amount of space needed and avoid having a giant hash in memory. I don't know how that would benchmark against your current strategy, but it seems that even if you can't load the whole data set into the database, you could figure out just enough to get your answers.

        Another solution might be to do a sort on the file, thereby grouping your transactions together, but my guess is that would be just as slow.
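
        For what it's worth, here is a sketch of that sort-then-merge idea, assuming both files have already been sorted on TransactionID (say, with an external sort utility) and are tab-separated (a placeholder layout):

        use strict;
        use warnings;

        # Assumes both files are already sorted (as text) on the first,
        # tab-separated field: the TransactionID.
        open my $info,   '<', 'info_sorted.txt'   or die $!;   # placeholder names
        open my $detail, '<', 'detail_sorted.txt' or die $!;

        my $d = <$detail>;
        while (my $i = <$info>) {
            my ($tid, $person) = (split /\t/, $i)[0, 1];

            # Skip detail rows whose ID sorts before the current info row.
            $d = <$detail> while defined $d && (split /\t/, $d)[0] lt $tid;

            # Emit every detail row that matches this transaction.
            while (defined $d && (split /\t/, $d)[0] eq $tid) {
                print "$person\t$d";     # one "joined" row per detail record
                $d = <$detail>;
            }
        }
        close $info;
        close $detail;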

        Good luck and let us know if you find some cool solution ;)


        ___________
        Eric Hodges
Re: How do I measure my bottle ?
by NateTut (Deacon) on Mar 25, 2005 at 20:10 UTC
    IMNSHO the Teradata database is the way to go if you can afford it. It is a proprietary relational DB (with a nice DBI/DBD for Perl). It is an MPP system with multiple nodes that all run in parallel to service each request. It has loading utilities that can load data files such as you describe very quickly. Probably way over your budget, though...