JPaul has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

I come to you today with a logistical problem rather than a programming one.
Firstly, I am doing this on a linux box.
I am working on a script that deals with very large arrays and which will, if left to its own devices, fill up all the memory on the machine.
The most logical thing I can think of is a quick sub to swap in and out the contents of the array, so I'm not completely filling up the memory -- but the question is, for speed and efficiency, what should I swap the unused data out to?

Imagine a web spider that caches unchecked links as it goes by pushing them onto an array. Leave it for a few hours/days, and eventually you're out of memory.
I figure what makes sense is keeping a number of entries in memory (say, 50,000) and pushing all new URLs onto an out-of-memory stack. When I have depleted the 50,000 URLs in my array, I swap 50,000 back in from the stack and carry on with those.
Making sense? Good. So -- how do I store the stack? A friend suggested using an RDBMS, which would work - but could there be a faster way?
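
To make this concrete, here's roughly the shape I'm picturing -- an untested sketch, with the overflow file name made up:

    #!/usr/bin/perl -w
    use strict;

    my $MAX      = 50_000;               # URLs to keep in memory at once
    my $overflow = "/tmp/url_overflow";  # made-up path for the on-disk stack
    my $offset   = 0;                    # how far into the overflow file we've read back
    my @queue;                           # the in-memory chunk

    # Add a URL: keep it in RAM if there's room, otherwise append it to disk.
    sub add_url {
        my $url = shift;
        if (@queue < $MAX) {
            push @queue, $url;
        }
        else {
            open my $fh, ">>", $overflow or die "append $overflow: $!";
            print $fh "$url\n";
            close $fh;
        }
    }

    # Get the next URL to check, refilling from disk when RAM runs dry.
    sub next_url {
        refill() unless @queue;
        return shift @queue;
    }

    # Read the next chunk of spilled URLs back into memory.
    sub refill {
        open my $fh, "<", $overflow or return;   # nothing spilled yet
        seek $fh, $offset, 0;
        while (@queue < $MAX and defined(my $line = <$fh>)) {
            chomp $line;
            push @queue, $line;
        }
        $offset = tell $fh;                      # remember where we stopped
        close $fh;
        # (A real script would truncate the already-consumed part now and then.)
    }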

Thanks all,
JP

-- Alexander Widdlemouse undid his bellybutton and his bum dropped off --


Replies are listed 'Best First'.
Re: How to best handle memory intensive operations
by bikeNomad (Priest) on Jul 26, 2001 at 23:59 UTC
    If you can make your data look like a hash, you can use something like BerkeleyDB, which is quite good at handling large amounts of data (with caching, etc.). I wouldn't use an RDBMS unless I had relations (i.e. more than one table) or a need for a query language. You don't have either of these.

    By tuning the page size and cache size, and giving other hints to BerkeleyDB, you can get very good performance, and you can whip up something to try very quickly. I'd suggest using BerkeleyDB through its tied-hash interface with a BTree, then seeing if it's fast enough. The results may surprise you.
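
    Something along these lines, say -- an untested sketch, with the cache and page sizes pulled out of the air:

        use strict;
        use BerkeleyDB;

        # Tie a hash to an on-disk BTree; BerkeleyDB does its own page caching,
        # so only the cache stays in memory, not the whole data set.
        my %seen;
        tie %seen, 'BerkeleyDB::Btree',
            -Filename  => 'urls.db',
            -Flags     => DB_CREATE,
            -Cachesize => 8 * 1024 * 1024,   # 8 MB cache -- tune to taste
            -Pagesize  => 8 * 1024           # 8 KB pages -- likewise
            or die "Cannot tie urls.db: $BerkeleyDB::Error";

        # Then use it like any ordinary hash.
        $seen{'http://example.com/'} = 1;
        print "seen it\n" if exists $seen{'http://example.com/'};

        untie %seen;

    The cache size is what keeps memory use bounded; everything else lives in the file.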

      BerkeleyDB is an RDBMS.
      The only thing that makes it different is that it's embedded. I'll give it a try, however, on your advice.

      JP,

      -- Alexander Widdlemouse undid his bellybutton and his bum dropped off --

        BerkeleyDB is a DBMS, but not a relational one in the usual sense (unless you count the ability to do joins). Most people today think of an RDBMS as something that provides an SQL interface, multiple table columns, and the like; although you could build these on top of BerkeleyDB (as MySQL does), it doesn't have them by itself.
Re: How to best handle memory intensive operations
by John M. Dlugosz (Monsignor) on Jul 27, 2001 at 00:04 UTC
    Use a tie to your array or hash, so it lives in a file (database?) and swaps in transparently.

    There are already database/hash tie modules, but they might not cache the way you want.
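
    For instance, DB_File can tie an array straight to a file with its RECNO format -- untested, and the file name is only an example:

        use strict;
        use Fcntl;
        use DB_File;

        # Tie @urls to a file: the elements live on disk, not in RAM.
        my @urls;
        tie @urls, 'DB_File', 'url_queue.db', O_RDWR|O_CREAT, 0666, $DB_RECNO
            or die "Cannot tie url_queue.db: $!";

        push @urls, 'http://example.com/';   # written out to the file
        my $next = shift @urls;              # read back from the file

        untie @urls;

    If the stock modules don't cache the way you want, you can write your own tie class; the STORE method below is the rough shape of it.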

        sub STORE {
            my ($self, $index, $value) = @_;
            # ... if $index is in range, store the item in my array ...
            # ... else store it to the file ...
            # ... if I appended more elements and hit my high-water mark,
            #     swap out lower items and adjust my range ...
        }