
My musings on bit streams were thoroughly covered here and here, but just to bring it back on track:

I need to see a file as a bit stream. It should be opened, and asked for bits - get_bits(howmany), not in chunks of bytes or words, but BITS. I.e. I may ask it for 1 bit, for 17 bits, etc.

One solution (the simplest and fastest) to represent such a stream is by a string of 1s and 0s, that is read from the file once and unpack()-ed. But there's a problem with this approach: files may get huge (GBs), and memory usage is a problem. Holding such strings in memory is impossible.
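For reference, a minimal sketch of this first approach. The name slurp_bits is hypothetical (it is not part of the BitStream module), and the sketch assumes the file fits in memory. Since unpack("B*", ...) produces one character per bit, the bitstring takes roughly 8 times the size of the file, which is why this breaks down for large files.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the BitStream module): read a whole file
# and return it as a string of '0'/'1' characters, high bit of each byte first.
sub slurp_bits {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "unable to open $filename: $!\n";
    binmode($fh);
    local $/;                       # slurp mode: read everything at once
    my $bytes = <$fh>;
    close($fh);
    return unpack("B*", defined $bytes ? $bytes : "");
}

# Once the bitstring exists, asking for N bits is a plain substr.
# Demonstrated here on a literal instead of a file:
my $demo = unpack("B*", "\xA5\x0F");             # "1010010100001111"
my $pos  = 0;
my $one  = substr($demo, $pos, 1);  $pos += 1;   # "1"
my $five = substr($demo, $pos, 5);  $pos += 5;   # "01001"
```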

The other solution is slower, but unlimited in memory. Keep a buffer of some length (preferably long), and when a request gets beyond the current buffer, fetch the next one and adjust.

Today I hit the memory problem hard, so I implemented the second solution. I'd like to kindly ask my fellow monks for advice and guidance - can this be made faster? I need the fastest get_bits() function possible. Here are the constructor and the get_bits function of the BitStream object:

# buffer size, in bytes
use constant BUF_SIZE => 2048;


# Constructed with a filename 
#
sub new
{
    my $filename = $_[0];
    
    # lexical filehandle and 3-argument open, so several streams can coexist
    open(my $filehandle, '<', $filename)
        or die "$myname error: unable to open $filename: $!\n";
    binmode($filehandle);

    my $bytes_buf;
    my $bytes_read = read($filehandle, $bytes_buf, BUF_SIZE);
    my $bits_buf = unpack("B*", $bytes_buf);
    
    # Members:
    #
    # FILENAME
    # FILEHANDLE
    # CUR_BUF - the current buffer held in memory (a bitstring)
    # CUR_BUF_LEN - length of the current buffer (in bits)
    # BUF_NUM - the first buffer in the file is 0, the next 1, and so on
    # BUF_POS - position inside the current buffer
    #
    my $self = {};
    $self->{FILENAME}        = $filename;
    $self->{FILEHANDLE}     = $filehandle;
    $self->{CUR_BUF}         = $bits_buf;
    $self->{CUR_BUF_LEN}    = length($bits_buf);
    $self->{BUF_NUM}        = 0;
    $self->{BUF_POS}        = 0;
    
    print "$myname ($filename) created\n";
    
    bless($self, $myname);
}

# Gets a specified amount of bits. Default is 1
# If the request is for more bits than left in the stream, returns the ones
# left; an empty string is returned when the stream ends
#
sub get_bits
{
    my $self = shift;
    my $howmany = (defined $_[0]) ? $_[0] : 1;
    my $ret_str = "";
    
    ($howmany <= (BUF_SIZE * 8))
        or die "Please read in chunks no longer than ", BUF_SIZE * 8, " bits\n";
    
    my $n_bits_left_in_buf = $self->{CUR_BUF_LEN} - $self->{BUF_POS};
        
    # the request is over the buffer's end ?
    if ($n_bits_left_in_buf < $howmany)
    {    
        #~ print "$self->{CUR_BUF} $self->{BUF_POS} $n_bits_left_in_buf\n";
        # take what's left in the buffer
        $ret_str = substr($self->{CUR_BUF}, $self->{BUF_POS}, $n_bits_left_in_buf);
        my $howmany_left = $howmany - $n_bits_left_in_buf;

        # read the next buffer
        my $bytes_buf;
        my $bytes_read = read($self->{FILEHANDLE}, $bytes_buf, BUF_SIZE);
        
        # was the current buffer the last in the file ?
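        if (($self->{CUR_BUF_LEN} < BUF_SIZE * 8) or 
            (!$bytes_read))
        {
            # then we just read the last bits of the file. returning a string
            # shorter than $howmany signals to the caller that the stream ended.
            # mark the buffer as fully consumed, so that the next call returns
            # an empty string instead of the same leftover bits again
            #
            $self->{BUF_POS} = $self->{CUR_BUF_LEN};
            return $ret_str;
        }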
        
        # update buffer info
        $self->{BUF_NUM} += 1;
        $self->{CUR_BUF} = unpack("B*", $bytes_buf);
        $self->{CUR_BUF_LEN} = $bytes_read * 8;
        
        #~ print "> $self->{CUR_BUF} $self->{CUR_BUF_LEN} $howmany_left\n";
        
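        # complete the read from the new buffer; the last buffer of the file
        # may hold fewer bits than requested, so never move past its end
        $howmany_left = $self->{CUR_BUF_LEN} if $howmany_left > $self->{CUR_BUF_LEN};
        $ret_str .= substr($self->{CUR_BUF}, 0, $howmany_left);
        $self->{BUF_POS} = $howmany_left;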
    }
    else
    # the request still fits the current buffer
    {
        $ret_str = substr($self->{CUR_BUF}, $self->{BUF_POS}, $howmany);
        $self->{BUF_POS} += $howmany;
    }
    
    return $ret_str;
}

Notes:

I need the common case fast, naturally. Buffer refills are rare - the average chunk read is 128 bits, and the buffer is 16K bits long (can be longer...), so a refill happens about once in 128 calls. Reads that need no refill must be as fast as possible (AFAP :).
Ignore the test ($howmany <= (BUF_SIZE * 8)). I don't want to handle this special case; I'll probably forbid oversized requests by definition, so that get_bits() doesn't waste cycles checking for them on every call.
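
One direction that might be worth measuring, sketched under assumptions: keep only the unread bits in a lexical variable captured by a closure, and consume them with 4-argument substr, so the common case involves no hash lookups and no position bookkeeping. The name make_bit_reader is hypothetical and this is not the BitStream class above. Whether it is actually faster on a given perl should be checked with the Benchmark module, not taken on faith.

```perl
use strict;
use warnings;
use constant BUF_SIZE => 2048;    # bytes per refill, as in the code above

# Hypothetical alternative: returns a code reference that yields the next
# $n bits as a '0'/'1' string, a shorter string at the end of the stream,
# and an empty string once the stream is exhausted.
sub make_bit_reader {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "unable to open $filename: $!\n";
    binmode($fh);

    my $buf = "";    # unread bits only; consumed bits are removed
    my $eof = 0;

    return sub {
        my $n = @_ ? $_[0] : 1;

        # slow path: top up the buffer until it can satisfy the request
        while (length($buf) < $n and !$eof) {
            my $got = read($fh, my $bytes, BUF_SIZE);
            die "read error on $filename: $!\n" unless defined $got;
            if ($got) { $buf .= unpack("B*", $bytes) }
            else      { $eof = 1 }
        }

        # common case: 4-argument substr returns the first $n characters
        # and deletes them from $buf in the same call
        $n = length($buf) if $n > length($buf);
        return substr($buf, 0, $n, "");
    };
}
```

Usage would be along the lines of `my $get = make_bit_reader($file); my $bits = $get->(17);`. Because the refill loop tops the buffer up as often as needed, this sketch needs no upper limit on the request size.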

In reply to BitStream revisited by spurperl
