Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I am trying to store approximately 50 - 100 million variable-length strings in an array, and of course this is not working due to memory constraints. I have converted the strings to unique integers and then stored the integers in a vec, which lets me store about 60 million before I run out of memory. My question is: has anyone else run into this problem, and how did you solve it? Thanks!
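Roughly, the vec approach I am using looks like this (the helper names and the 32-bit width here are just illustrative):

    use strict;
    use warnings;

    my $packed = '';    # one scalar holds every integer, bit-packed

    # Store integer $id for the string at position $pos.  Each entry
    # costs ~4 bytes, with none of the per-element overhead (SV
    # structures, pointers) of an ordinary Perl array.
    sub store_id {
        my ($pos, $id) = @_;
        vec($packed, $pos, 32) = $id;
    }

    sub fetch_id {
        my ($pos) = @_;
        return vec($packed, $pos, 32);
    }

    store_id(0, 42);
    store_id(999_999, 7);       # random access by position
    print fetch_id(0), "\n";    # prints 42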

Replies are listed 'Best First'.
Re: string arrays
by davido (Cardinal) on Oct 15, 2003 at 02:34 UTC
    Consider a design that allows the strings to reside somewhere more commodious than RAM.

    100 million strings, even at only 4 bytes each, come to about 380 MB (assuming essentially zero overhead).

    If you'll be doing much with the strings, that's too much for a "flat file". Sure, you could literally store 380 MB in a flat file, but access would be slow, and forget about writing to the 78,294,262nd string. A database solution is a good way to go for its random access and its scalability to such large datasets.
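    To make that concrete, here is a hypothetical sketch using DBI with DBD::SQLite; any RDBMS would do, and the table layout is just an assumption:

        use strict;
        use warnings;
        use DBI;

        # Store each string keyed by its position, so the 78,294,262nd
        # string is one indexed lookup (or update) away.
        my $dbh = DBI->connect('dbi:SQLite:dbname=strings.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });

        $dbh->do('CREATE TABLE IF NOT EXISTS strings (
                      pos INTEGER PRIMARY KEY,
                      str TEXT)');

        my $ins = $dbh->prepare(
            'INSERT OR REPLACE INTO strings (pos, str) VALUES (?, ?)');
        $ins->execute(78_294_261, 'replacement value');   # write anywhere
        $dbh->commit;

        my ($str) = $dbh->selectrow_array(
            'SELECT str FROM strings WHERE pos = ?', undef, 78_294_261);
        print "$str\n";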


    Dave


    "If I had my life to do over again, I'd be a plumber." -- Albert Einstein
      Sorry about that. I work with large text and am working on implementing suffix arrays in Perl, which is why I need to store so much data: the entire text needs to be stored in memory. I have tried the database approach using BerkeleyDB and DB_File; both are very nice but kill me on I/O. It simply takes too much time (weeks). I can convert the strings to integers and then store them in a vec. This seems to be working okay so far, but I was curious whether anyone had a better solution. Thanks for the quick responses, I didn't expect to get so many this morning!!
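      As a toy illustration of what a suffix array is (a real text of this size obviously can't be sorted this naively, hence the packed integers):

          use strict;
          use warnings;

          # A suffix array is the list of starting positions of the
          # text's suffixes, sorted in lexical order of those suffixes.
          my $text = 'banana';
          my @sa = sort { substr($text, $a) cmp substr($text, $b) }
                        0 .. length($text) - 1;
          print "@sa\n";    # 5 3 1 0 4 2
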
Re: string arrays
by Zaxo (Archbishop) on Oct 15, 2003 at 02:36 UTC

    You don't say what you're doing with all that data, which would help you get a useful answer. For many purposes, sticking them into a database of some kind would be the thing to do.

    After Compline,
    Zaxo

Re: string arrays
by Roger (Parson) on Oct 15, 2003 at 02:55 UTC
    Can you please tell us why you want to store 50 - 100 million variable-length strings in an array in the first place? Is it for sorting purposes? Or is it because you want to find a particular element in the array? There are lots of algorithms to choose from, depending on the problem you are dealing with. Perhaps you should give a bit of an overview of the type of problem you are trying to solve.

    In general, you could, however, store your strings in a file or database as suggested by davido and Zaxo, depending of course on your application.

    Personally I would favour the direct file approach, since database insertion is a relatively expensive operation. Unless, of course, you use commands like 'bcp' in Sybase (just an example), which do a native database import that is very fast.

      Fast native C import routines are pretty standard with any halfway decent RDBMS.

      MySQL:

          LOAD DATA INFILE '/tmp/blah.txt' INTO TABLE mytable
          FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

      MS SQL:

          BULK INSERT [table name] FROM [filename to insert]
          WITH (FIELDTERMINATOR = '\t', FIRSTROW=2, ROWTERMINATOR = '\n')

      Oracle:

          A two-stage control file insert methodology, similar to the above.
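      If the data isn't already in a loadable file, a hypothetical Perl helper to write the tab/newline-delimited input those loaders expect might look like this (the filename and layout are assumptions):

          use strict;
          use warnings;

          # Write one "position<TAB>string" record per line, matching
          # the FIELDS/LINES terminators given to the loaders above.
          open my $out, '>', '/tmp/blah.txt' or die "open: $!";
          my $pos = 0;
          while (my $str = <STDIN>) {
              chomp $str;
              print {$out} join("\t", $pos++, $str), "\n";
          }
          close $out or die "close: $!";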

      I would make the rash presumption that the user already has the data in a (text) file (hardly likely to have typed in 100 million strings by hand, as it would take literally years) and wants to manipulate them in some way. An RDBMS is probably the best solution. Multiple gigabytes of RAM is another possibility. A text file would seem to have rather dubious utility.

      We use big RAM or a DB depending on the task. We have one data-munging widget that takes several hours to run and consumes up to 4 GB of RAM. The same algorithm tied to disk to save memory took weeks to run. In this case the extra speed of in-memory processing easily offsets the RAM cost.

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print