
performance of huge array looping

by kommesel (Sexton)
on Aug 14, 2002 at 10:54 UTC ( [id://190025] : perlquestion )

kommesel has asked for the wisdom of the Perl Monks concerning the following question:

I have a huge array (in the area of gigabytes) that I wish to loop over (with foreach and the like). However, once I have looped over an element I have no further need for it. It's quite tempting to use shift at each iteration to shrink the array and save memory. However, the overhead of resizing the array may cause an overall degradation of performance compared to just looping over the huge array. Can anyone give me a definite answer as to which will perform better (or tell me how I can empirically check it)?

P.S. The workstation will use swap files for sure.
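One way to check empirically is the core Benchmark module. A minimal sketch (the array size and loop bodies here are placeholders, not the poster's actual data; note that the shift variant has to copy the array each round, since shift destroys it, which taints the comparison somewhat):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Small stand-in for the gigabyte array; scale this up toward real data.
my @template = (1 .. 100_000);

cmpthese(-2, {
    # Iterate in place, leaving the array intact.
    foreach_loop => sub {
        my $sum = 0;
        $sum += $_ for @template;
    },
    # Destructively shift elements off; needs a fresh copy each run.
    shift_loop => sub {
        my @copy = @template;
        my $sum  = 0;
        $sum += shift @copy while @copy;
    },
});
```

Both loops visit every element and compute the same result; cmpthese prints a rate table comparing the two.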

Re: performance of huge array looping
by Abigail-II (Bishop) on Aug 14, 2002 at 11:16 UTC
    As you describe it, the (memory) bottleneck is when you start processing the array. Shifting off the elements at each iteration will save memory - but the memory isn't released back to the OS - it just means that it takes longer before Perl needs to ask for more memory from the OS. So, whether it helps to speed up the program depends on what else you do in the program - if you won't use much more memory, there's little gain. OTOH, shift is a fast operation, so there's not much reason not to do it.

    If however you are looking for a real performance boost, look into an algorithm that doesn't require the entire array to exist at one time. The fact you don't need an element anymore after you process it suggests you may not need all the elements before processing the first one either.


      I need three elements at a time.
        Then your array doesn't have to have more than 3 elements at a time.
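If only three elements are needed at a time, a sliding window over a filehandle avoids building the big array at all. A sketch, where process_window() is a placeholder for whatever work is done on each triple:

```perl
use strict;
use warnings;

# Placeholder for the real per-window work.
sub process_window { my @win = @_; return "@win" }

my @window;
my @results;
while (my $line = <DATA>) {
    chomp $line;
    push @window, $line;
    shift @window if @window > 3;      # never hold more than 3 elements
    push @results, process_window(@window) if @window == 3;
}
print "$_\n" for @results;            # prints "a b c" then "b c d"

__DATA__
a
b
c
d
```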


Re: performance of huge array looping
by RMGir (Prior) on Aug 14, 2002 at 11:58 UTC
    If you're dealing with that much data, do you REALLY need to load it all into memory in an array?

    Given that you're only looking at each item once, it would make more sense to either process the data line-by-line from a text file (if it's that kind of data), or iterate over it record-by-record from a database using DBI.

    Your program will run MUCH faster as a bonus, since you're not going to be swapping. This is definitely a case where loading things into memory won't be an optimization.

      The source is an ASCII file. I've thought about it, and basically all I need is to load the lines which begin with *| and work on them; their combined size is much smaller than the file. So I'll read line by line, push into an array only the lines that begin with *| (by the way, I need to use m/^\*\|/, don't I?), and loop over them, thus eliminating the need for swap.
        Makes sense...

        Yes, you would need to escape those characters in your regex.
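A minimal sketch of that filter, reading line by line and keeping only the lines that start with the (escaped) *| prefix:

```perl
use strict;
use warnings;

# Scan the input once; * and | are regex metacharacters, hence \*\|.
my @wanted;
while (my $line = <DATA>) {
    push @wanted, $line if $line =~ /^\*\|/;
}
print @wanted;    # only the two *| lines survive

__DATA__
*|keep me
ignore me
*|keep me too
```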

Re: performance of huge array looping
by demerphq (Chancellor) on Aug 14, 2002 at 11:23 UTC
    Pop is probably more suitable than shift, but luckily for you lhoward has written a definitive monograph on this subject.

    Imo, however, you will need to think of a different approach than the one you described here.


    Yves / DeMerphq
    Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)

Re: performance of huge array looping
by PhiRatE (Monk) on Aug 14, 2002 at 22:19 UTC
    It's this kind of comment which is fundamentally interesting because of what it *doesn't* say. The value, or lack thereof, in shifting out elements of the array to slowly reduce your memory requirements is insignificant in comparison to the huge burning question of why you need a once-only array in the gigabyte range in the first place.

    Even the (dubious) need to sort the entire array would probably be better handled by a chunked merge-sort writing clusters to disk rather than leaving it up to the kernel VM which, while good, cannot do the nice usage predictions you can do on your own paging system under those circumstances.
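A chunked merge-sort along those lines can be sketched as follows (the chunk size and the plain cmp comparison are assumptions; a real run would tune both):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Phase 1: sort fixed-size chunks in memory and spill each to a temp
# file. Phase 2: k-way merge of the sorted runs into the output handle.
sub external_sort {
    my ($in_fh, $out_fh, $chunk_size) = @_;
    my (@run_fhs, @chunk);

    while (my $line = <$in_fh>) {
        push @chunk, $line;
        if (@chunk >= $chunk_size) {
            push @run_fhs, spill(\@chunk);
            @chunk = ();
        }
    }
    push @run_fhs, spill(\@chunk) if @chunk;

    # Merge: repeatedly emit the smallest head line among the runs.
    my @heads = map { scalar <$_> } @run_fhs;
    while (grep { defined } @heads) {
        my ($min_i) = sort { $heads[$a] cmp $heads[$b] }
                      grep { defined $heads[$_] } 0 .. $#heads;
        print {$out_fh} $heads[$min_i];
        $heads[$min_i] = scalar readline $run_fhs[$min_i];
    }
}

# Sort one chunk, write it to a temp file, rewind for the merge phase.
sub spill {
    my ($chunk) = @_;
    my $fh = tempfile();
    print {$fh} sort @$chunk;
    seek $fh, 0, 0;
    return $fh;
}
```

Only one chunk plus one line per run is ever in memory at once, so peak usage is set by the chunk size rather than the data size.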

    In addition, there may be considerable statistical optimisations that can be done if more details of the nature of the data are known.

    So, let's be clear: there may be a modicum of performance to be gained by not shifting vs. shifting, but vastly more performance is likely to be gained by posting a better description of the problem here, so the algorithm junkies can tell you how to do it without chewing through gigabytes of RAM.

Re: performance of huge array looping
by Anonymous Monk on Aug 14, 2002 at 11:49 UTC
    I am not sure it will work for you, but it works for me: I change the array into a hash and use DBI to save it as a file, opening it when I want it and closing it when I am done - this saves memory. Your intention in looping is to find an element, so use your value as the key of the hash; the lookup will be quick.
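The in-memory half of that idea looks like the sketch below: the values become hash keys, so a membership test is a single lookup instead of a full scan. (The monk mentions DBI for the on-disk part; tying the hash to a disk file, e.g. with DB_File, would be another way to keep it out of RAM.)

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the real gigabyte list.
my @big_list = qw(alpha beta gamma);

# Index the values as hash keys: lookup is O(1), no loop needed.
my %index;
$index{$_} = 1 for @big_list;

print exists $index{beta} ? "found\n" : "missing\n";    # prints "found"
```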