diotalevi has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing Inline::C wrappers over some code that reads and writes LZW compressed data (/usr/bin/compress and /usr/bin/uncompress). That code is already very speedy and copies the data one character at a time into a buffer. I'm having perl manage the buffer as a plain string and I'm growing it whenever the pointer reaches the end.

I'm not experienced with C. Tell me if I'm doing something wrong.

Is there a good value for SZSIZE? I just picked 8192 out of the air because it seemed reasonable. I'd like to avoid fragmenting memory too much. I'd also like to prevent unnecessary copying. Are there any good strategies for serving these goals? Is there some magic value associated with modern (or not so modern) perl that I'd want to follow?

#define SZSIZE 8192

SV* LZW_zread( ... ) {
    SV*     retval;
    u_char  *bp, *end_bp;
    u_int   offset;
    ...
    retval = NEWSV( ..., SZSIZE );
    SvPOK_only( retval );
    bp     = (u_char*)SvPVX( retval );
    end_bp = bp + SZSIZE;
    ...
    while ( ... ) {
        if ( bp == end_bp ) {
            /* When bp == end_bp the buffer must be grown.  bp is currently
               just off the end of the buffer and any operations on it are
               unsafe. */
            offset = bp - (u_char*)SvPVX( retval );
            sv_grow( retval, offset + SZSIZE );
            bp     = (u_char*)SvPVX( retval ) + offset;
            end_bp = bp + SZSIZE;
        }
        ...
        *bp++ = ...;
    }
    ...
    SvCUR_set( retval, bp - (u_char*)SvPVX( retval ) );
    return retval;
}

Replies are listed 'Best First'.
Re: Growing strings, avoiding copying and fragmentation?
by Courage (Parson) on Aug 09, 2003 at 15:12 UTC
    I consider growing by 8K to be perfect.

    Also, as far as I know, perl itself gets memory from the system each time twice as much as the previous allocation, and I find this strategy much worse than allocating in reasonable chunks.

    But why don't you use Compress::Zlib or a similar module, or even the C interface to a compression library (since you're already in Inline::C)?
    I think this approach would save you a lot of effort and add portability (Win32 may not have /usr/bin/compress).

    Courage, the Cowardly Dog

      The issue is that if I want to write a CPAN module that requires the ability to decompress .Z data, I need either a wrapper over the external binary /usr/bin/uncompress or to implement this in a perl library directly. I have a working wrapper, but I want something that doesn't have the overhead of forking out to /usr/bin/uncompress millions of times.

      Hence: Compress::Zlib::LZW. As is, there is no code on CPAN for handling the .Z data format. Compress::Zlib handles the 'deflate' and 'gzip' formats but doesn't handle the 'compress' format. It's that one that I'm targeting. I already wrote something that sends its output to a file handle, but if I want to be 5.005_03 safe then I figured I'd have to return a proper scalar instead of using PerlIO tricks.

      Now as to your actual answer - can you reword this: "Also, as far as I know, perl itself gets memory from the system each time twice as much as the previous allocation, and I find this strategy much worse than allocating in reasonable chunks."? I think you're speaking from more context than I have right now and I need you to be a bit more explicit.

        1. I thought that the "zlib" library could handle compress data, because I use the command "gzip -d filename.Z" to uncompress it, and it works.
        Maybe I am wrong and "zlib" does not handle it; in that case, which library does?

        Now I looked into the gzip sources from http://ftp.gnu.org/gnu/gzip/ and it appears that "gzip" uncompresses .Z data using its own algorithm, without zlib.
        I think using the appropriate file from there would be more portable, but that is not the easy way, of course.

        2. Please excuse me for being unclear.
        I meant the following behaviour. When my perl program eats a big amount of memory, it does so in a non-linear way.
        When I look in the task manager at how much memory the system has allocated to processes, I see it like a staircase, where the first step is small, the second step is twice as big, and each next step is twice as big as the previous one.
        When I see that perl has already got 256M, I expect it will take 512M next time, and usually at that point I decide whether to allow it to keep running...
        I remember this was discussed a little on p5p recently, but no special decision was made.

        Courage, the Cowardly Dog

Re: Growing strings, avoiding copying and fragmentation?
by samtregar (Abbot) on Aug 10, 2003 at 06:47 UTC
    You are (probably) the only one who can answer that question. How big does your buffer get on average? How concerned are you about wasting memory? If you're not too concerned, then setting SZSIZE to a really large value will probably give you the best performance. If you can accept a moderate amount of waste, then setting it to the average buffer size is probably good.

    While writing HTML::Template::JIT I faced a similar problem. HTML::Template::JIT needs to accumulate generated HTML in a single buffer, but if I guess too low then the buffer might get copied around in memory during a realloc(). The solution I chose was to guess how big the result would likely be, based on the size of the template and how many variables and loops were present.
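
    As a loose illustration of that kind of sizing heuristic (the function name and the padding weights below are invented for the sketch, not taken from HTML::Template::JIT's actual code):

        /* Invented sizing heuristic, roughly in the spirit described above:
         * start from the template's length and pad for each variable and
         * loop.  The padding amounts (64 and 512 bytes) are placeholders;
         * measure real output to pick sensible values. */
        static STRLEN guess_output_size(STRLEN template_len, int n_vars, int n_loops) {
            return template_len + (STRLEN)n_vars * 64 + (STRLEN)n_loops * 512;
        }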

    Of course, whenever you're optimizing, the first question should always be "does it really matter?". Try setting SZSIZE to 1 and do a benchmark run. Then set it to 8192 or some other highish value. If the difference is tiny, stop immediately and do some profiling to find out where your problem really is! (Note that if I followed this advice HTML::Template::JIT wouldn't even exist...)

    -sam

Re: Growing strings, avoiding copying and fragmentation?
by BrowserUk (Patriarch) on Aug 10, 2003 at 11:00 UTC

    As the data is compressed, you don't know the final size, so you can't preallocate the entire buffer. However, you probably do know the compressed size, and it is unlikely to shrink, so why not preallocate the buffer to the compressed size and then expand from there?

    You might even add a scaling factor to the compressed size. If the algorithm used averages 50% compression on the type of data you're handling, then double the compressed size and allocate that. If this turns out to be an overestimate and you end up shrinking, the cost is likely to be much less than realloc()ing the buffer many times.
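
    A rough sketch of that against the code in the original post, assuming the compressed length is available as compressed_len and guessing a 2:1 ratio (both are assumptions for illustration, not part of the original):

        /* Sketch: preallocate the output SV to twice the compressed size
         * (a guessed 2:1 ratio) and fall back to growing only if the
         * guess turns out to be low. */
        STRLEN  guess  = compressed_len * 2;
        SV     *retval = NEWSV( 0, guess );      /* preallocates guess+1 bytes */
        u_char *bp, *end_bp;

        SvPOK_only( retval );
        bp     = (u_char*)SvPVX( retval );
        end_bp = bp + guess;

        /* ... decompress, growing with sv_grow() only when bp hits end_bp ... */

        /* Any over-allocation simply remains as spare capacity in the SV. */
        SvCUR_set( retval, bp - (u_char*)SvPVX( retval ) );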


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

Re: Growing strings, avoiding copying and fragmentation?
by sgifford (Prior) on Aug 11, 2003 at 07:31 UTC

    Many memory allocators double their size each time they run out. That is, they would first allocate 8K, then, if that filled up, 16K, then 32K, etc. On very large files, this guarantees that you'll only do a logarithmic number of allocations and memory re-organizations, and you'll never have more than twice as much memory allocated as you need. A variation of this uses Fibonacci numbers, which grow more slowly than doubling but more quickly than increasing by a fixed size.
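
    Applied to the grow-on-demand branch from the original post, doubling might look roughly like this (an untested sketch; only the way the new size is computed changes):

        if ( bp == end_bp ) {
            /* Double the buffer instead of adding a fixed SZSIZE, so the
             * number of reallocations grows only logarithmically with the
             * final size. */
            STRLEN offset = bp - (u_char*)SvPVX( retval );
            STRLEN newlen = offset * 2;
            if ( newlen < SZSIZE )
                newlen = SZSIZE;                 /* floor for tiny buffers */

            sv_grow( retval, newlen + 1 );       /* +1 leaves room for a trailing NUL */
            bp     = (u_char*)SvPVX( retval ) + offset;
            end_bp = (u_char*)SvPVX( retval ) + newlen;
        }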

    Something like that might be worth a try, though you could find that the memory allocator is doing that already for you, and you're best off just trusting it.

    As always, try different things and then benchmark. One thing that might be easy to overlook in this case is whether your caller is likely to be allocating lots of other memory at the same time as your code is running. If only your code is running, you can use just about any allocation strategy without much trouble because you're the only memory consumer, so there will be little fragmentation. With other code running too, you may find that different strategies are important. One way to test this is to call your module multiple times, each working on a different file. That should stress-test your code and the allocator pretty well. Also make sure you try it with memory that Perl's allocating, to make sure your strategy plays well with Perl's.

Re: Growing strings, avoiding copying and fragmentation?
by waswas-fng (Curate) on Aug 11, 2003 at 07:39 UTC
    One thing you may want to watch is that most OSes will only release chunks of memory that are fully unused, meaning you have a better chance of releasing 256 MB of memory during a run of your application if it was allocated in one large chunk rather than in 256 smaller chunks.

    -Waswas
Re: Growing strings, avoiding copying and fragmentation?
by thor (Priest) on Aug 10, 2003 at 15:16 UTC
    For what it's worth, I usually do something like this to read compressed files:
    if ( $file =~ /\.(Z|gz)$/ ) {
        open( $fh, "gunzip -c $file |" ) or die "Couldn't fork gunzip: $!";
    }
    else {
        open( $fh, $file ) or die "Couldn't open $file: $!";
    }
    Sure, it's less portable, but IMHO, a whole lot easier to maintain. If you don't have gunzip, you can always use zcat.

    thor