in reply to Re^4: pack/unpack binary editing
in thread pack/unpack binary editing

How long is your current attempt taking?

The following code processes a 100MB file (including finding and recording 25 million hits of a 10-bit pattern) in ~20 seconds, and a 1GB file (250 million hits) in ~3 minutes 20 seconds.

I make that around half an hour to process your 9GB. And probably much less, as your hits will be less frequent and you can advance the buffer pointer by 480 bytes after each hit.

It uses a basic sliding buffer to process the file in 1 MB chunks with an overlap of enough bytes to ensure continuity. (You'll need to verify the math of the byte/bit offset calculations).
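As a cross-check on that math, here is a small, hypothetical helper (the name `offset_of` and the test values are mine, not taken from the script below) that maps a 0-based bit index within a given buffer back to an absolute byte/bit offset, using the fact that each buffer after the first starts $BSIZE - $overlap bytes further into the file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of the script below: map a 0-based bit
# index within buffer number $buffs back to an absolute byte/bit offset.
# Each buffer after the first starts $bsize - $overlap bytes further into
# the file, because $overlap bytes are carried over between reads.
sub offset_of {
    my( $buffs, $bsize, $overlap, $bitpos ) = @_;
    my $base = $buffs * ( $bsize - $overlap );  # file byte where this buffer starts
    my $byte = $base + int( $bitpos / 8 );
    my $bit  = $bitpos % 8;
    return( $byte, $bit );
}

# Buffer 0, bit index 13 lands in byte 1, bit 5.
my( $byte, $bit ) = offset_of( 0, 1_000_000, 2, 13 );
print "byte=$byte bit=$bit\n";   # prints byte=1 bit=5
```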

Code + some timings

#! perl -slw
use strict;

$|=1;

our $BSIZE ||= 1_000_000;   ## 'our', so a -BSIZE=nnn switch can override it

open my $fh, '+<:raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";

my $pattern = $ARGV[ 1 ] or die "no pattern supplied";
my $overlap = int( ( length( $pattern ) + 1 ) / 8 );

my $buffer = '';
my $buffs = 0;
my $found = 0;
while( sysread( $fh, $buffer, $BSIZE, length $buffer ) ) {
    ## Convert the buffer to ASCII-ized bits
    my $bits = unpack 'B*', $buffer;

    printf "\r$buffs: [$found] ";

    ## Search for the pattern
    my $p = 0;
    while( $p = 1 + index( $bits, $pattern, $p ) ) {
        ## And record the hits
        $found++;

        ## Calculate byte/bit offsets
#        my $byte = ( $buffs * $BSIZE )
#                 - ( $overlap * $buffs )
#                 + int( ( $p - 1 ) / 8 );
#        my $bit  = ( $p - 1 ) % 8;
#        printf "\rFound it at byte: $byte bit: $bit '%s'",
#            substr( $bits, $p-1, length( $pattern ) );
    }

    ## Keep track of the number of buffers processed
    $buffs++;

    ## Move enough bytes to the front of the buffer
    ## to ensure overlap.
    $buffer = substr( $buffer, -$overlap );
}

print "Found $found occurrences of '$pattern'";

__END__
[16:40:12.64] P:\test>429065 data\100millionbytes.dat 1111111111
100: [0] Found 0 occurrences of '1111111111'
[16:40:46.04] P:\test>

[16:41:37.46] P:\test>429065 data\100millionbytes.dat 1100000011
100: [24999982] Found 24999990 occurrences of '1100000011'
[16:41:56.28] P:\test>

[16:42:03.65] P:\test>429065 data\1000millionbytes.dat 1100000011
1000: [249999876] Found 249999900 occurrences of '1100000011'
[16:45:24.09] P:\test>

Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.

Replies are listed 'Best First'.
Re^6: pack/unpack binary editing
by tperdue (Sexton) on Feb 10, 2005 at 12:40 UTC
    Sorry for not getting back sooner. I was out of the office and was doing everything from memory. Here is the code I'm using; for a 1 GB file it takes a few hours. I forgot that I also have to change every 2-bit pair to a corresponding 2-bit sequence. I'm also only looking for the first sync pattern, once. I was hoping to do this without unpacking and later repacking the data; that's the biggest drawback.
    die "USAGE: $0 input\n" if scalar(@ARGV) < 1;

    $| = 1;
    $/ = \960;

    open IN, "$ARGV[0]" or die "$ARGV[0] : $!";
    open OUT, ">tmp2";

    my $y = 0;
    my $value = '';
    while (<IN>) {
        my $bits = unpack("b*", $_);
        @array = split(//, $bits);
        foreach $value (@array) {
            $y++;
            $tmp = $tmp . $value;
            if ($y == 2) {
                if ($tmp =~ /00/) {
                    $tmp = '11';
                }
                elsif ($tmp =~ /11/) {
                    $tmp = '00';
                }
                print OUT $tmp;
                $y = 0;
                $tmp = '';
            }
        }
    }
    close IN;
    close OUT;

    $/ = \7680;
    my $x = 0;

    # I LEFT THE OUTPUT AS ASCII-IZED BINARY AT THIS POINT,
    # THUS THE LARGE INCREASE IN FILE SIZE
    open IN, "tmp2";
    open OUT, ">tmp3";
    while (<IN>) {
        $_ =~ s/^.*?(11111011000010001111011100010000)/$1/ if $x == 0;
        $x = 1;
        print OUT pack("b*", $_);
    }
    close IN;
    close OUT;

    10-02-2005 Janitored by Arunbear - added code tags, as per Monastery guidelines

      Here is the code I'm using and for a 1 gig file this takes a few hours.

      Pardon me for saying so, but I am not surprised.

      Not only are you unpacking the data to ASCII-ized binary, you then go on to split that into an array of individual digits.

      You then use a loop to walk that array one digit at a time, collecting pairs so that you can replace '00' with '11', and '11' with '00'.

      All of this could be done with a couple of regexes.

      Replacing this:

      while (<IN>) {
          my $bits = unpack("b*", $_);
          @array = split(//, $bits);
          foreach $value (@array) {
              $y++;
              $tmp = $tmp . $value;
              if ($y == 2) {
                  if ($tmp =~ /00/) {
                      $tmp = '11';
                  }
                  elsif ($tmp =~ /11/) {
                      $tmp = '00';
                  }
                  print OUT $tmp;
                  $y = 0;
                  $tmp = '';
              }
          }
      }

      with this:

      while (<IN>) {
          my $bits = unpack("b*", $_);
          ## Match two bits at a time so '01'/'10' pairs cannot
          ## throw off the pair alignment.
          $bits =~ s[(..)][ $1 eq '00' ? '11' : $1 eq '11' ? '00' : $1 ]ge;
          print OUT $bits;
      }

      should do (untested) the same thing and will run very much more quickly.
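      A quick sanity check of the pairwise swap (the bit string and expected result here are made up for illustration; note the match consumes two bits per iteration, so '01'/'10' pairs cannot shift the alignment):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pairs '00' and '11' should swap; '01' and '10' should pass through.
my $bits = '0011011000111100';

( my $swapped = $bits ) =~ s[(..)][ $1 eq '00' ? '11' : $1 eq '11' ? '00' : $1 ]ge;

print "$swapped\n";   # prints 1100011011000011
```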

      Do I understand the logic of this code correctly?

      $/ = \7680;
      my $x = 0;

      # I LEFT THE OUTPUT AS ASCII-IZED BINARY AT THIS POINT,
      # THUS THE LARGE INCREASE IN FILE SIZE
      open IN, "tmp2";
      open OUT, ">tmp3";
      while (<IN>) {
          $_ =~ s/^.*?(11111011000010001111011100010000)/$1/ if $x == 0;
          $x = 1;
          print OUT pack("b*", $_);
      }
      close IN;
      close OUT;

      You are checking the first record only for the first occurrence of the sync pattern, and then discarding anything that precedes it?

      I.e. if the first record contains a partial frame, throw it away and thereby sync the rest of the file?

      If so, then the following code should be a complete replacement and run in a fraction of the time. The output file "tmp2" will be the final file you are after without creating the 9 GB intermediate.

      Let me know if it works, please, and also how long it takes. There are other things that could be done to speed this up, I think, but if the new runtime is acceptable, they may not be worth the extra effort.

      die "USAGE: $0 input\n" if scalar(@ARGV) < 1;

      $| = 1;
      $/ = \960;

      open IN, "$ARGV[0]" or die "$ARGV[0] : $!";
      open OUT, ">tmp2";

      while (<IN>) {
          my $bits = unpack("b*", $_);

          ## Replace '00' with '11' and vice versa, two bits at a
          ## time so '01'/'10' pairs cannot throw off the alignment.
          $bits =~ s[(..)][ $1 eq '00' ? '11' : $1 eq '11' ? '00' : $1 ]ge;

          ## Discard any partial frame from the front of the file.
          $bits =~ s/^.*?(?=11111011000010001111011100010000)// if $. == 1;

          print OUT pack 'b*', $bits;
      }
      close IN;
      close OUT;

        I will give this a shot. What I really need to do with the sync pattern is find every occurrence, extract it along with the following 476 bytes, and discard everything after the 476th byte up to the next sync.
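        For that variation, something along these lines might serve as a starting point. This is an untested sketch of my own, not code from the thread; it assumes the whole ASCII-ized bit string fits in memory, and that a frame is the 32-bit sync plus 476 data bytes, i.e. 480 bytes or 3840 bits (which matches the "advance 480 bytes per hit" figure above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Untested sketch: keep every frame (sync pattern plus the following
# 476 bytes) and discard everything between the end of one frame and
# the next sync. Operates on an ASCII-ized bit string held in memory.
my $sync  = '11111011000010001111011100010000';   # 32 bits = 4 bytes
my $frame = length( $sync ) + 476 * 8;            # 3840 bits = 480 bytes

sub extract_frames {
    my( $bits ) = @_;
    my $out = '';
    my $p = 0;
    while( ( $p = index( $bits, $sync, $p ) ) >= 0 ) {
        last if $p + $frame > length $bits;       # drop a trailing partial frame
        $out .= substr( $bits, $p, $frame );
        $p += $frame;                             # resume searching after this frame
    }
    return $out;
}

# Two frames separated and surrounded by junk bits.
my $data   = '10' . $sync . ( '0' x 3808 ) . '1' . $sync . ( '1' x 3808 );
my $frames = extract_frames( $data );
print length( $frames ), " bits kept\n";          # prints 7680 bits kept
```

For the real 9 GB file this would need to be combined with the sliding-buffer approach shown earlier rather than slurping everything at once.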