joe_b has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm looking for a faster way to de-interleave binary data. I have some source files (~200MB in size) of interleaved data that contain two data streams. The data is interleaved one bit at a time (e.g., if stream one is all 1's and stream two is all 0's, the file would read as "1010101010101010....", and I would like the resultant files to read "1111111..." and "000000...").

The code below works, but is slow (so it seems to me - it takes about 40 minutes to de-interleave a 192MB file on a modern laptop). I read in some data, split each bit into an array, the split that array into two sub arrays, then write it back out. I'm a newbie to Perl and programming in general, which may be obvious from my code.

until ($filesize <= $offset) { sysseek (INTERLEAVED_DATA, $offset, 0); sysread (INTERLEAVED_DATA, $buffer, 8192); @interleaved = split(//, unpack("b*", $buffer)); $buffer = ""; while (@interleaved){ push @uninterleaved_1, shift @interleaved; push @uninterleaved_2, shift @interleaved; } $uninterleaved_1 = pack "b*", join "", @uninterleaved_1; @uninterleaved_1 = ""; $uninterleaved_2 = pack "b*", join "", @uninterleaved_2; @uninterleaved_2 = ""; syswrite (UNINTERLEAVED_1, $uninterleaved_1); syswrite (UNINTERLEAVED_2, $uninterleaved_2); $uninterleaved_1 = ""; $uninterleaved_2 = ""; $offset = $offset + 8192; }
I've tried reading in bigger chunks of data, but it doesn't help. It seems it's all the unpacking and array operations that take the most time. Thanks.

Replies are listed 'Best First'.
Re: de-interleaving binary data
by Roy Johnson (Monsignor) on Jun 27, 2007 at 21:49 UTC
    I would recommend a lookup table, rather than repeatedly doing the work of de-interleaving. De-interleave every possible 2-byte input into two characters once, then read your data as 2-byte shorts and lookup the output characters.
    use strict; use warnings; my @masks = map 2**$_, 0..15; sub deinterlace { my $data = shift; my ($odd, $even) = (0, 0); $odd |= ($data & $masks[$_*2 ]) >> $_ for (0..7); $even |= ($data & $masks[$_*2+1]) >> ($_+1) for (0..7); [pack('C', $odd), pack('C', $even)]; } # Build lookup table my @lut; $lut[$_] = deinterlace($_) for (0..0xffff); # Example data my @d = unpack('S*', 'Some data to de-interlace.'. "\xfa" x 6 . "\0" x + 6); printf "%04x ", $_ for @d; print "\n"; # Building output strings my ($odd, $even) = ('', ''); for (@d) { my @bytepair = @{$lut[$_]}; $odd .= $bytepair[0]; $even .= $bytepair[1]; } printf "Lengths: %d and %d\n", length($odd), length($even); printf "%02x ", $_ for unpack('C*', $odd); print "\n---\n"; printf "%02x ", $_ for unpack('C*', $even); print "\n";

    Caution: Contents may have been coded under pressure.
Re: de-interleaving binary data
by BrowserUk (Patriarch) on Jun 28, 2007 at 00:49 UTC

    I'd usurp Roy Johnson's lookup tables, but use two strings and substr rather than an @AoA.

    A quick test on a 124MB file takes around a minute, with no great difference whether using 4096 byte or larger reads.

    #! perl -slw use strict; our $SIZE ||= 4096; my( $oddLU, $evenLU ) = ('') x 2; for my $i ( 0 .. 0xffff ) { my( $oddByte, $evenByte ) = ('') x 2; my $word = pack 'n', $i; vec( ${ $_ &1 ? \$evenByte : \$oddByte }, $_>>1, 1 ) = vec( $word, $_, 1 ) for 0 .. 15; $oddLU .= $oddByte; $evenLU .= $evenByte; } open my $inFH, '<:raw:perlio', $ARGV[ 0 ] or die $!; open my $oddFH, '>:raw:perlio', 'odd.bin' or die $!; open my $evenFH, '>:raw:perlio', 'even.bin' or die $!; local $/ = \$SIZE; print scalar localtime; while( my $chunk = <$inFH> ) { # print sysseek $inFH, 0, 1; local $/; my( $odds, $evens ) = ('') x 2; for my $idx ( unpack 'n*', $chunk ) { $odds .= substr $oddLU, $idx, 1; $evens .= substr $evenLU, $idx, 1; } print $oddFH $odds; print $evenFH $evens; } close $_ for $inFH, $oddFH, $evenFH; print scalar localtime; __END__ C:\test>u:ls -l 1Mx4096.db odd.bin even.bin -rw-rw-rw- 1 user group 124592128 Jul 28 2004 1Mx4096.db -rw-rw-rw- 1 user group 62296076 Jun 28 01:42 even.bin -rw-rw-rw- 1 user group 62296076 Jun 28 01:42 odd.bin C:\test>junk 1Mx4096.db Thu Jun 28 01:36:11 2007 Thu Jun 28 01:37:12 2007 C:\test>junk -SIZE=1048576 1Mx4096.db Thu Jun 28 01:38:42 2007 Thu Jun 28 01:39:44 2007 C:\test>junk -SIZE=10485760 1Mx4096.db Thu Jun 28 01:41:31 2007 Thu Jun 28 01:42:56 2007

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Wow. Thank you.

      Can I just send the rest of my project requirements? :)

      My run time went form ~2200 seconds to 72 seconds. I can't tell you how much I appreciate your help.

Re: de-interleaving binary data
by merlyn (Sage) on Jun 27, 2007 at 20:39 UTC
    How about something based on:
    my $data = "1024 bytes of data from the file"; my $odd = ""; my $even = ""; for my $bitpair (0..(4 * length $data - 1)) { my $twobits = vec($data, $bitpair, 2); vec($odd, $bitpair, 1) = 1 if $twobits & 2; vec($even, $bitpair, 1) = 1 if $twobits & 1; }
    I might have odd and even backwards. Experiment to find out. This will turn 1024 bytes into two 512-byte strings.

    I'm guessing your code is slow because you're carrying around a single bit as an entire scalar... huge amounts of data overhead.