in reply to Re^6: Producing a list of offsets efficiently
in thread Producing a list of offsets efficiently

To see the source of the 1/5 thing, look at the source of av.c. The newmax variable tracks Perl's buffering decisions quite precisely. That doesn't explain the consequences, though. To understand the consequences, consider what happens just after a reallocation. At that point all elements have been reallocated at least once, 5/6 of them at least twice, 25/36 of them at least three times, and so on. The average number of reallocations per element is therefore (1 + 5/6 + (5/6)**2 + ...), a geometric series that sums to 1/(1 - 5/6) = 6. So on average, each element has been reallocated 6 times. (Right before a reallocation that average drops to 5.)
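
A quick back-of-the-envelope check of that argument. This is a sketch only: it assumes a pure grow-by-1/5 policy and ignores the rounding and minimum sizes the real av.c applies.

    use strict;
    use warnings;

    my $elements = 400_000;
    my $max      = 4;        # current allocation
    my $copies   = 0;        # elements copied by reallocations so far
    for my $n ( 1 .. $elements ) {
        if ( $n > $max ) {
            $copies += $n - 1;             # everything stored so far gets copied
            $max = $n + int( $n / 5 );     # newmax-style growth: key + key/5
        }
    }
    printf "%.2f copies per element\n", $copies / $elements;   # lands in the 5-6 range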

The point about the amortized constant is critical, and the convergence to that behaviour is quite fast. For instance, if you break building the 400,000-element array into building 400 arrays of 1000 elements that you then push together, you'll find the same amortized constant coming into play both in building the 1000-element arrays and in building the 400,000-element array. So instead of eliminating that overhead, you're now paying it twice! (By separating the arrays you've kept Perl from deciding to allocate big enough chunks.)

If you want to reduce that overhead, you can change the constant fairly easily by pre-extending the array at intervals of your own choosing. Assign by index and keep track of the size of the array. Any time you're about to pass the current maximum size, assign undef to a spot that is twice as far. That changes the constant from 5 to 2, and the average number of reallocations per element from the 5-6 range down to the 2-3 range. (At the cost of a constant amount of logic per iteration. I don't know which constant will be bigger - this might be a loss.)
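
A minimal sketch of that pre-extension, assuming the offsets arrive one at a time from a hypothetical next_offset() routine:

    my @offsets;
    my $reserved = 8;
    $offsets[ $reserved - 1 ] = undef;     # reserve the first chunk up front
    my $i = 0;
    while ( defined( my $offset = next_offset() ) ) {
        if ( $i >= $reserved ) {
            $reserved *= 2;
            $offsets[ $reserved - 1 ] = undef;   # pre-extend to twice as far
        }
        $offsets[ $i++ ] = $offset;
    }
    $#offsets = $i - 1;                    # trim the unused tail when done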

Overall I'm going to guess, though, that with 400,000 offsets your biggest source of overhead is that you have 400,000 Perl scalars. If performance is a problem it might save effort to have the data packed into a long string. You might even use C to build said string.
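
For illustration, one way to do that packing in pure Perl. A sketch only, again assuming a hypothetical next_offset() source and that 32-bit offsets are enough:

    my $packed = '';
    while ( defined( my $offset = next_offset() ) ) {
        $packed .= pack 'N', $offset;      # 4 bytes per offset, one scalar total
    }
    my $count = length($packed) / 4;       # number of offsets stored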


Replies are listed 'Best First'.
Re^8: Producing a list of offsets efficiently
by BrowserUk (Patriarch) on May 30, 2005 at 07:37 UTC

    Okay. Thanks for that.

    assign undef to a spot that is twice as far ....

    Is this better/preferable to assigning to $#array?

    For instance if you break the 400,000 array into building 400 arrays of 1000 elements that you then push together, you'll find the same amortized constant coming into play both in building 1000 element arrays and in building the 400,000 element array. So instead of eliminating that overhead, you're now paying twice the overhead!

    That's where the AoA idea came up. Effectively making my index a two-level affair and reducing the reallocations by only building (and probably preallocating) 400 arrays of 1000 elements rather than one array of 400,000. The zeroth element of each of the 400 arrays would be an absolute value, but the other 999 would be relative to that base. Hence the adjustments required by an insert or delete affect (up to) 999 relative elements in the affected subarray plus (up to) 400 absolute base values, rather than 400,000 absolute values each time.
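
    Something like this is what I have in mind. A rough sketch only; offset_at and adjust are illustrative names, and the row size of 1000 is assumed:

        # $index is an array of 400 rows; $index->[$r][0] is an absolute
        # offset, $index->[$r][1..999] are offsets relative to that base.
        sub offset_at {
            my ( $index, $n ) = @_;
            my $row = $index->[ int( $n / 1000 ) ];
            my $col = $n % 1000;
            return $col ? $row->[0] + $row->[$col] : $row->[0];
        }

        # A length change of $delta bytes at line $n touches at most the
        # rest of one row plus the absolute bases of the later rows.
        sub adjust {
            my ( $index, $n, $delta ) = @_;
            my $r   = int( $n / 1000 );
            my $row = $index->[$r];
            $row->[$_]      += $delta for ( $n % 1000 ) + 1 .. $#$row;
            $index->[$_][0] += $delta for $r + 1 .. $#$index;
        }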

    If performance is a problem it might save effort to have the data packed into a long string.

    And that's the final piece in the puzzle. Packing the relative values into strings reduces the number of scalars by three orders of magnitude. For most purposes, offsets (line lengths) can be packed into 8 bits, reducing memory consumption further. By wrapping an exception-catching eval around the packing and looking for "Character in 'C' format wrapped in pack" errors, I can detect if a line goes over 255 chars, with the penalty of re-indexing to use 'n' if it happens. Ditto for 'n' -> 'N'.
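
    Roughly like this. A sketch: it assumes the 'pack' warnings category is made fatal so the wrap becomes trappable by eval, and get_relative_offsets() stands in for however one block of relative offsets is gathered:

        use warnings FATAL => 'pack';

        my @relative = get_relative_offsets();    # hypothetical source
        my $packed   = eval { pack 'C*', @relative };
        unless ( defined $packed ) {
            die $@ unless $@ =~ /wrapped in pack/;
            # A value exceeded 255: re-index this block with 16-bit offsets.
            $packed = pack 'n*', @relative;
        }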

    Moving to Inline::C or XS is an option if needed.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      Assigning to $#array is preferable; I just didn't think of that off the top of my head.

      The two-level index should work. The overhead of accessing it indirectly may, however, outweigh the benefit of avoiding reallocations.

      When it comes to strings, I was thinking of something simpler. Use 4 bytes per offset. Pack each offset into those 4 bytes. Sure, you can save more memory, but see if the simple approach is a big enough win. (It should certainly take less code, and it makes it easy to access the 432343rd offset; depending on what you do this could be a big win.)
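
      For example, with fixed 4-byte offsets, random access is a single substr. A sketch, where $packed is the string built elsewhere:

          my $n      = 432_343;                                # which offset to fetch
          my $offset = unpack 'N', substr $packed, $n * 4, 4;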

      Personally I'd avoid all of these approaches unless I knew that the naive approach had serious problems for my dataset. (Yes, you've indicated why you think that it may for you. This is a reminder for anyone else who might be reading this thread.)

        Yes, you've indicated why you think that it may for you.

        Not "may". It is.

        The primary intent is to reduce the space required to manipulate the file in memory. The secondary goal is to limit the effects of trading speed for space by doing it as efficiently as Perl allows. There will always be the overhead of tie involved which forms a watermark below which I cannot dip, but within that there is scope for economies.

