in reply to Re^7: Producing a list of offsets efficiently
in thread Producing a list of offsets efficiently
Okay. Thanks for that.
"assign undef to a spot that is twice as far ...."
Is this better/preferable to assigning to $#array?
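For illustration, a minimal sketch of the two preallocation idioms in question (the numbers and variable names are mine, not from the thread):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Idiom 1: extend the array directly via $#array.
    my @a;
    $#a = 399_999;                 # @a now has 400,000 (undef) elements
    print scalar @a, "\n";         # 400000

    # Idiom 2: assign undef to a distant slot; assigning past the end
    # grows the array to reach it (the "twice as far" suggestion would
    # use an index around 2 * 400_000 to leave headroom in the allocation).
    my @b;
    $b[399_999] = undef;
    print scalar @b, "\n";         # 400000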
"For instance, if you break the 400,000-element array into building 400 arrays of 1000 elements that you then push together, you'll find the same amortized constant coming into play both in building the 1000-element arrays and in building the 400,000-element array. So instead of eliminating that overhead, you're now paying the overhead twice!"
That's where the AoA idea came up: effectively making my index a two-level affair and reducing the reallocations by building (and probably preallocating) 400 arrays of 1000 elements rather than 1 array of 400,000. The zeroth element of each of the 400 sub-arrays would be an absolute value, and the other 999 would be relative to that base. Hence the adjustments required by an insert or delete affect (up to) 999 elements in the affected sub-array plus (up to) 400 absolute base values, rather than 400,000 absolute values each time.
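A minimal sketch of how that two-level index might look, assuming a chunk size of 1000; the names and helper subs are mine, not from the post:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $CHUNK = 1000;   # illustrative chunk size

    # Build the two-level index: element 0 of each sub-array is an
    # absolute offset (the base); the rest are relative to that base.
    sub build_index {
        my @abs = @_;                         # absolute offsets, sorted
        my @index;
        while ( my @slice = splice @abs, 0, $CHUNK ) {
            my $base = $slice[0];
            push @index, [ $base, map { $_ - $base } @slice[ 1 .. $#slice ] ];
        }
        return \@index;
    }

    # Fetch the n-th offset overall.
    sub lookup {
        my ( $index, $n ) = @_;
        my ( $chunk, $slot ) = ( int( $n / $CHUNK ), $n % $CHUNK );
        my $sub = $index->[$chunk];
        return $slot ? $sub->[0] + $sub->[$slot] : $sub->[0];
    }

    # After an insert/delete of $len chars following overall position $n,
    # only the later relatives in one sub-array and the bases of the
    # following sub-arrays need adjusting (<= 999 + <= 400 updates).
    sub shift_after {
        my ( $index, $n, $len ) = @_;
        my ( $chunk, $slot ) = ( int( $n / $CHUNK ), $n % $CHUNK );
        my $sub = $index->[$chunk];
        $sub->[$_]      += $len for $slot + 1 .. $#{ $sub };
        $index->[$_][0] += $len for $chunk + 1 .. $#{ $index };
    }

    my $idx = build_index( map { $_ * 80 } 0 .. 9_999 );   # 10,000 fake offsets
    print lookup( $idx, 4_321 ), "\n";                     # 345680
    shift_after( $idx, 4_321, 10 );                        # a line grew by 10 chars
    print lookup( $idx, 4_322 ), "\n";                     # 345770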
"If performance is a problem, it might save effort to have the data packed into a long string."
And that's the final piece of the puzzle: packing the relative values into strings reduces the number of scalars by three orders of magnitude. For most purposes, the offsets (line lengths) can be packed into 8 bits, reducing memory consumption further. By wrapping an eval around the packing and looking for "Character in 'C' format wrapped in pack" errors, I can detect when a line goes over 255 chars, with the penalty of re-indexing to use 'n' if that happens. Ditto 'n' -> 'N'.
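A minimal sketch of that pack-and-fall-back trick; I'm assuming the overflow is caught by making the 'pack' warning category fatal inside the eval, which is one way to turn the wrap warning into a trappable error (not necessarily how the real code does it):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Pack relative offsets with 'C' (8-bit); if any value won't fit,
    # the "Character in 'C' format wrapped in pack" warning is promoted
    # to a die, caught, and the data is re-packed with 'n' (16-bit).
    sub pack_offsets {
        my @rel = @_;                          # relative offsets (line lengths)

        my $packed = eval {
            use warnings FATAL => 'pack';      # make the wrap warning fatal here
            pack 'C*', @rel;
        };
        return ( 'C', $packed ) if defined $packed;

        if ( $@ =~ /Character in 'C' format wrapped in pack/ ) {
            # Re-index with 16-bit values; the same cascade would go
            # from 'n' to 'N' for offsets over 65535.
            return ( 'n', pack 'n*', @rel );
        }
        die $@;                                # some other failure
    }

    my ( $fmt, $str ) = pack_offsets( 10, 72, 300, 15 );
    print "format=$fmt, bytes=", length($str), "\n";   # format=n, bytes=8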
Moving to Inline::C or XS is an option if needed.
Replies are listed 'Best First'.

Re^9: Producing a list of offsets efficiently
by tilly (Archbishop) on May 30, 2005 at 07:53 UTC
by BrowserUk (Patriarch) on May 30, 2005 at 08:11 UTC