dansmith has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on an application that needs to hold a lot of very simple text in memory: about 1,000,000 strings matching /[ACTGN]{300}/. However, Perl seems to very stubbornly insist on using 32 bits for each character. As a result, when I load a 100MB file, 400MB of memory is used. See for yourself:
/usr/bin/time -f "%MKb" -- perl -e "'A' x 1024 x 1024 x 100"

Is there any way to store this data in a more economical format? pack() doesn't help at all. I've tried a lot of hideous hacks to no avail. Any tricks or tips would be greatly appreciated.

One requirement though: I want to only use Perl core modules since I'll be distributing this application to people who may not know how to install additional modules. So Bit::Vector is out.

Thanks in advance,
   -Dan

Re: Reducing memory footprint of strings
by ikegami (Patriarch) on Aug 05, 2010 at 19:58 UTC

    However, Perl seems to very stubbornly insist on using 32 bits for each character.

    That's not true.

    ASCII characters require one byte.
    Other iso-8859-1 characters require one or two bytes.
    Other Unicode characters require up to four bytes. (Usually two. Practically, never four.)
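
    The byte counts above can be checked with core modules only. This is a minimal sketch, assuming the core `bytes` module's `bytes::length` as the way to count internal storage bytes. The sample strings and the helper name `char_and_byte_len` are illustrative and do not come from the thread.

```perl
use strict;
use warnings;
use bytes ();   # makes bytes::length available without enabling byte semantics

# Hypothetical helper: returns [characters, bytes of internal storage].
sub char_and_byte_len {
    my ($s) = @_;
    return [ length($s), bytes::length($s) ];
}

# Illustrative sample strings (assumptions, not data from the question).
my $ascii  = "ACGTN";      # plain ASCII
my $latin1 = "\xE9";       # e-acute, an iso-8859-1 character
my $greek  = "\x{3B1}";    # Greek alpha, outside iso-8859-1

# The same iso-8859-1 character after being upgraded to Perl's internal UTF-8.
my $latin1_upgraded = $latin1;
utf8::upgrade($latin1_upgraded);

for my $case (
    [ 'ASCII',              $ascii ],
    [ 'iso-8859-1',         $latin1 ],
    [ 'iso-8859-1 (utf8)',  $latin1_upgraded ],
    [ 'Greek alpha',        $greek ],
) {
    my ($label, $string) = @$case;
    my ($chars, $bytes)  = @{ char_and_byte_len($string) };
    printf "%-18s %d chars, %d bytes\n", $label, $chars, $bytes;
}
```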

    Both Devel::Size and a peek at the size of the internal string buffer show otherwise.

    $ perl -MDevel::Size=total_size -E'say total_size("A" x 1024 x 1024 x 100)'
    104857636
    $ perl -MDevel::Peek -E'Dump("A" x 1024 x 1024 x 100)' 2>&1 | grep LEN
    LEN = 104857604

    The rest of the memory was used to build the string and is available for reuse. There's also the possibility that you are making a copy of the string by using it as the return value of the file. (Do you see a difference from appending ";1"? I'm getting 0 from time.)

Re: Reducing memory footprint of strings
by dave_the_m (Monsignor) on Aug 05, 2010 at 19:56 UTC
    perl stores ASCII strings using one byte per character. What you are seeing is operations on strings calculating intermediate results that take extra storage. That can be reduced with careful coding. To improve matters further you can use the built-in vec function to store things more compactly.

    Dave.
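
    As a rough illustration of the vec suggestion, here is a minimal sketch that stores each base in 4 bits, so two bases share one byte. The code table, the 4-bit width, and the names `pack_bases` and `unpack_bases` are assumptions made for this example, not something given in the thread. vec only accepts widths that are powers of two, so the five symbols need 4 bits each; 2 bits would be enough if N never occurred.

```perl
use strict;
use warnings;

# Hypothetical encoding: each base maps to a small integer that fits in 4 bits.
my %CODE = (A => 0, C => 1, G => 2, T => 3, N => 4);
my @BASE = qw(A C G T N);

# Pack a sequence into a string holding two bases per byte.
sub pack_bases {
    my ($seq) = @_;
    my $packed = '';
    for my $i (0 .. length($seq) - 1) {
        my $base = substr($seq, $i, 1);
        my $code = $CODE{$base};
        die "unexpected base '$base'" unless defined $code;
        vec($packed, $i, 4) = $code;   # element $i, 4 bits wide
    }
    return $packed;
}

# Unpack $count bases. The count must be kept separately, because an
# odd-length sequence leaves an unused half byte at the end.
sub unpack_bases {
    my ($packed, $count) = @_;
    return join '', map { $BASE[ vec($packed, $_, 4) ] } 0 .. $count - 1;
}

my $seq    = 'ACGTNACGT';
my $packed = pack_bases($seq);
printf "%d bases stored in %d bytes\n", length($seq), length($packed);
print unpack_bases($packed, length($seq)), "\n";
```

    The trade-off is that every read or write now goes through vec, so access is slower than plain substr on an unpacked string.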

[SOLVED] Re: Reducing memory footprint of strings
by dansmith (Initiate) on Aug 05, 2010 at 20:48 UTC
    Hmmmm... looks like you two are right.

    I also put a sleep command in there, then looked at `top`. RES was right around the expected 100MB.

    I guess the `time` statistic is telling me something else or is plain wrong.
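
    A way to cross-check the numbers from inside the script itself, sketched under the assumption of a Linux system where /proc/self/status is readable. The helper name `proc_status_kb` is made up for this example. VmRSS is the current resident size (what `top` shows as RES) and VmHWM is the peak resident size.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the kB value of a field in /proc/self/status
# (for example VmRSS or VmHWM), or undef if the file or field is missing.
sub proc_status_kb {
    my ($field) = @_;
    open my $fh, '<', '/proc/self/status' or return;
    while (my $line = <$fh>) {
        return $1 if $line =~ /^\Q$field\E:\s+(\d+)\s+kB/;
    }
    return;
}

# Print a labelled reading, tolerating systems without /proc.
sub report {
    my ($label, $kb) = @_;
    if (defined $kb) {
        printf "%-22s %d kB\n", $label, $kb;
    }
    else {
        printf "%-22s not available\n", $label;
    }
}

my $size_mb = 100;                        # same size as in the question
my $count   = $size_mb * 1024 * 1024;

report('VmRSS before:', proc_status_kb('VmRSS'));
my $data = 'A' x $count;                  # build the large string
report('VmRSS after:',  proc_status_kb('VmRSS'));
report('VmHWM (peak):', proc_status_kb('VmHWM'));
printf "string length:         %d\n", length($data);
```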

    @ ikegami: I noticed the `time` command doesn't put out all the different benchmarks on some systems. Also, in order to define a custom format like "%M", be sure to call it directly at /usr/bin/time so an alias doesn't trip you up (like it did me).

    Thanks!