abitkin has asked for the wisdom of the Perl Monks concerning the following question:

I'm building a binary tree, which is using quite a lot of memory, and I'm conserned.

The data file that is being loaded is about 2 megs, though my data structure in perl takes about 100 megs.

I am using arrays instead of hashes every place that I can and perl objects for the nodes and the leaves.

I'm using leaf pushing to limit the data stored in nodes, and my tree is 31 nodes deep (though I'm storing prefixes which can be any number of nodes, most are 24 bits long thus creating 23 nodes and a leaf.)

In this I'm storing about 93,000 prefixes.

Although that sounds like a lot, remember this is a binary tree, so the number of internal nodes is somewhat limited.

Now should perl be using 100 megs? (I'm deleting null leaves and keeping refs rather than actual strings for the values in the leaves. I also do not have any circular refs or such. Also the object data is implemented as an array.) Here is the new function for both the node and the leaf:
# Node: sub new{ my $self = []; $self->[$node::child]=[]; $self->[$node::leaves] = undef; $self->[$node::myDepth] = undef; $self->[$const::getType] = "Node"; # Added/Changed by KIN # trying to limit mem usage # Date: Tuesday, August 06, 2002 @ 10:18 AM if(not($const::LSNmode == 1)){ $self->[$node::nodes] = undef; $self->[$node::bits] = undef; }else{ $self->[$node::lsnBits] = undef; $self->[$node::nodeBits] = undef; $self->[$node::isLSN] = undef; $self->[$node::reBits] = undef; $self->[$node::reNodes] = undef; } bless($self); return $self; } # Leaf: sub new{ my $self = []; $self->[$ptr::weight] = 0; $self->[$ptr::value] = undef; $self->[$const::getType] = "Leaf"; bless($self); return $self; }

Thanks,
Abitkin

Replies are listed 'Best First'.
Re: Possible Memory Usage Issues
by dws (Chancellor) on Aug 06, 2002 at 22:36 UTC
    Now should perl be using 100 megs?

    That's a bit hard to say without seeing a bit more of your code, including all of the "constants" that you're using as array indexes. If any one of those constants is a large number, you're going to be wasting lots of space in each Node and Leaf.

    You could do away with $cont::getType entirely. You needn't hang on to "Node" or "Leaf" in each object. Instead, use ref() to check which type of thing you're looking at, or add an isLeaf member function to each object, returning 0 or 1 as appropriate.

    Also, rather than hanging an anonymous array off of [$node::leaves], you could save a bit of space by keeping a separate [$node::left] and [$node::right]. The overhead for a slot within an array is lower than the overhead for an array, even if empty.

      Yes you're right, there's a lot of code to go with this, about 1400 lines.

      The reason I'm using an array, is based on command line options I'm passing in, I'm creating either a hex tree or a binary tree. I'm doing this becuase I'm using perl to get stats on the tree (as in the number of bits per prefix and such) for a piece of hardware we are putting a similar data structure in. This is basically our model, for the limiting case.

      Now as you may imagine, with this comes the idea that processing it takes a while. I've gotten it down to about 13 minutes on a dataset of 90,000+ records, using recursive calls and storing more data in nodes. If I were to add an isLeaf member function, I would have the cpu overhead of a function call every time I wanted to use that. I may switch it to 1/0, but doesn't that get rounded up to a 64-bit integer?

      I'm not seeing much space saved there.

      Thanks for the reply.
        If I were to add an isLeaf member function, I would have the cpu overhead of a function call every time I wanted to use that. I may switch it to 1/0, but doesn't that get rounded up to a 64-bit integer?

        You have a simple space/time tradeoff to make, and you've indicated that the problem is space. If the overhead isn't significant for a slot in the object to hold a boolean to indicate whether the object is a leaf, then do it. But if you're still tight on space, trade that boolean for a function call. Sure, a function call takes "more CPU", but do you know whether that additional time is significant (or even measurable)? I'll bet that the overhead of

        package Node; sub isLeaf { 0 } package Leaf; isLeaf { 1 }
        is barely measurable, and may be more than canceled out by not having to grow (at setup time) or access (at use time)the anonymous arrays that underly your objects.

Re: Possible Memory Usage Issues
by kvale (Monsignor) on Aug 06, 2002 at 22:35 UTC
    The figure you quote is not out of bounds for perl; the flexibility of perl's data structure typical take a lot of metadata to keep track of.

    If you are building a binary tree for searching, try a hash instead. It may have a lower multiplier for your data.

    -Mark

      Well, that would be good, it I was searching. What I'm doing is building a tree to represent a datastructure that we are going to be putting into some hardware. I'm not using the tree for searching as much as finding a bound of what the hardware should be.

      Thanks for your reply.
Re: Possible Memory Usage Issues
by perrin (Chancellor) on Aug 07, 2002 at 14:14 UTC
    Sounds like you're just slinging an awful lot of variables, even if they are passed by ref. Can you try using Berkeley DB in its BTree mode for this?
      I possibly could; but, I not only create the tree and get my stats, I also fold up nodes based on their leaf count, and re-calculate the values for the number of bits needed per prefix, the number of internal nodes, and the number of prefixes stored at each level. I don't thik that the Berkeley DB gives me enough flexability to do this, though I may be wrong.

      Oh and incase you're wondering, when I fold up the leaves, I do delete them, to try to get some space back. I also remove the weight from the leaves, as it is no longer needed.