in reply to Re^3: pack/unpack binary editing
in thread pack/unpack binary editing

I have a frame of 480 bytes containing a sync word 10 bits long. I need to be able to find the start of my framing, which may not occur on a byte boundary. I also have to take into consideration that the file may skew occasionally, so I'd have to check frame by frame for my sync word. I've done this successfully using pack and unpack but it takes forever. I would like to slide through the file without having to pack or unpack.

Re^5: pack/unpack binary editing
by BrowserUk (Patriarch) on Feb 08, 2005 at 17:04 UTC

    How long is your current attempt taking?

    The following code processes a 100MB file (including finding and recording 25 million hits of a 10 bit pattern) in ~ 20 seconds, and a 1GB file (250 million hits) in ~ 3 minutes 20 seconds.

    I make that a round 1/2 hour to process your 9GB. And probably much less as your hits will be less frequent and you can advance the buffer pointer by 480 bytes after each hit.

    It uses a basic sliding buffer to process the file in 1 MB chunks with an overlap of enough bytes to ensure continuity. (You'll need to verify the math of the byte/bit offset calculations).

    Code + some timings
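For readers without the linked code, here is a minimal sketch of the sliding-buffer idea. It is not BrowserUk's code: the helper name find_bit_pattern, the 'b*' bit order (borrowed from the other code in this thread) and the search via index on a bit string are all assumptions. Instead of overlapping whole bytes, it carries the last length(pattern)-1 bits of each chunk into the next one, so a hit that straddles a chunk boundary is found exactly once.

```perl
use strict;
use warnings;

# Hypothetical helper: return the absolute bit offsets of every occurrence
# of $pattern (a string of '0'/'1' characters) in the data read from $fh,
# reading $chunk_bytes bytes at a time.
sub find_bit_pattern {
    my ($fh, $pattern, $chunk_bytes) = @_;
    my $keep      = length($pattern) - 1;  # bits carried between chunks
    my $tail      = '';                    # carried bits from the previous chunk
    my $bits_read = 0;                     # bits consumed before the current chunk
    my @hits;

    while (read($fh, my $chunk, $chunk_bytes)) {
        # Assumption: 'b*' (lowest bit of each byte first), as elsewhere in the thread.
        my $bits = $tail . unpack('b*', $chunk);
        my $base = $bits_read - length $tail;  # absolute offset of $bits' first bit

        my $pos = -1;
        while (($pos = index($bits, $pattern, $pos + 1)) >= 0) {
            push @hits, $base + $pos;
        }

        $bits_read += 8 * length $chunk;
        $tail = $keep == 0             ? ''
              : length($bits) > $keep ? substr($bits, -$keep)
              :                         $bits;
    }
    return @hits;
}

# Usage sketch with an in-memory file and a deliberately tiny chunk size.
my $data = pack 'b*', '000' . '1111101100' . ('0' x 20) . '1111101100';
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
binmode $fh;
for my $bit (find_bit_pattern($fh, '1111101100', 2)) {
    printf "hit at byte %d, bit %d\n", int($bit / 8), $bit % 8;
}
close $fh;
```

A bit offset n maps to byte int(n/8) and bit n % 8 within that byte, which is the byte/bit arithmetic that would need checking against the real data.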


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
      Sorry for not getting back sooner. I was out of the office and was doing everything from memory. Here is the code I'm using and for a 1 gig file this takes a few hours. I forgot that I have to change every 2 bits to a corresponding 2 bit sequence. I'm also only looking for the first sync pattern once. I was hoping to do this without unpacking and later packing the data. That's the biggest drawback.
      die "USAGE: $0 input\n" if scalar(@ARGV) < 1;
      $| = 1;
      $/ = \960;
      open IN, "$ARGV[0]" or die "$ARGV[0] : $!";
      open OUT, ">tmp2";
      my $y = 0;
      my $value = '';
      while (<IN>) {
          my $bits = unpack("b*", $_);
          @array = split(//, $bits);
          foreach $value (@array) {
              $y++;
              $tmp = $tmp . $value;
              if ($y == 2) {
                  if ($tmp =~ /00/) {
                      $tmp = '11';
                  } elsif ($tmp =~ /11/) {
                      $tmp = '00';
                  }
                  print OUT $tmp;
                  $y = 0;
                  $tmp = '';
              }
          }
      }
      close IN;
      close OUT;

      $/ = \7680;
      my $x = 0;
      # I LEFT THE OUTPUT AS ASCII-IZED BINARY AT THIS POINT
      # THUS THE LARGE INCREASE IN FILE SIZE
      open IN, "tmp2";
      open OUT, ">tmp3";
      while (<IN>) {
          $_ =~ s/^.*(11111011000010001111011100010000)/$1/ if $x == 0;
          $x = 1;
          print OUT pack("b*", $_);
      }
      close IN;
      close OUT;

      10-02-2005 Janitored by Arunbear - added code tags, as per Monastery guidelines

        Here is the code I'm using and for a 1 gig file this takes a few hours.

        Pardon me for saying so, but I am not surprised.

        Not only are you unpacking the data to asciized binary, you then go on to split that into an array of digits.

        Then you use a loop to go through that array one digit at a time, looking for pairs that match '00' so that you can replace them with '11', or '11' so that you can replace them with '00'.

        All this could be done with a couple of regexes.

        Replacing this:

        while (<IN>) {
            my $bits = unpack("b*", $_);
            @array = split(//, $bits);
            foreach $value (@array) {
                $y++;
                $tmp = $tmp . $value;
                if ($y == 2) {
                    if ($tmp =~ /00/) {
                        $tmp = '11';
                    } elsif ($tmp =~ /11/) {
                        $tmp = '00';
                    }
                    print OUT $tmp;
                    $y = 0;
                    $tmp = '';
                }
            }
        }

        with this:

        while (<IN>) {
            my $bits = unpack("b*", $_);
            ## Consume two digits at a time so that only aligned pairs are swapped
            $bits =~ s[(..)][ $1 eq '00' ? '11' : $1 eq '11' ? '00' : $1 ]ge;
            print OUT $bits;
        }

        should do (untested) the same thing and will run very much more quickly.
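One detail worth checking in any regex version of the pair swap is alignment. The original loop looks at digits 0-1, 2-3, 4-5 and so on, whereas an alternation such as (00|11) on its own can match a '11' that straddles two pairs. A small sketch of the difference follows; both helper names are made up for illustration.

```perl
use strict;
use warnings;

# Lookup table for one aligned pair of digits.
my %swap = ('00' => '11', '11' => '00', '01' => '01', '10' => '10');

# Hypothetical helper: swaps 00 <-> 11 on aligned pairs only, because every
# match consumes exactly two digits. A trailing odd digit is left alone.
sub swap_pairs_aligned {
    my ($bits) = @_;
    $bits =~ s/(..)/$swap{$1}/g;
    return $bits;
}

# Hypothetical helper: swaps the first 00 or 11 found from the current
# position, wherever it starts, so it can cross a pair boundary.
sub swap_pairs_unaligned {
    my ($bits) = @_;
    $bits =~ s[(00|11)][ $1 eq '00' ? '11' : '00' ]ge;
    return $bits;
}

for my $sample ('0110', '00011011') {
    printf "%-8s aligned=%s unaligned=%s\n",
        $sample, swap_pairs_aligned($sample), swap_pairs_unaligned($sample);
}
```

For the input '0110' the aligned pairs are '01' and '10', so the original loop would leave it unchanged, while the unaligned version sees the '11' in the middle and rewrites it.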

        Do I understand the logic of this code correctly?

        $/ = \7680;
        my $x = 0;
        # I LEFT THE OUTPUT AS ASCII-IZED BINARY AT THIS POINT
        # THUS THE LARGE INCREASE IN FILE SIZE
        open IN, "tmp2";
        open OUT, ">tmp3";
        while (<IN>) {
            $_ =~ s/^.*(11111011000010001111011100010000)/$1/ if $x == 0;
            $x = 1;
            print OUT pack("b*", $_);
        }
        close IN;
        close OUT;

        You are checking the first record only for the first occurrence of the sync pattern, and then discarding anything that precedes it?

        I.e., if the first record contains a partial frame, you throw it away and so sync the rest of the file?

        If so, then the following code should be a complete replacement and run in a fraction of the time. The output file "tmp2" will be the final file you are after without creating the 9 GB intermediate.

        Let me know if it works, please. Also how long it takes. There are other things that could be done to speed this up, I think, but if the new runtime is acceptable, they may not be worth the extra effort.

        die "USAGE: $0 input\n" if scalar(@ARGV) < 1;
        $| = 1;
        $/ = \960;    ## read fixed-size 960-byte records
        open IN, "$ARGV[0]" or die "$ARGV[0] : $!";
        open OUT, ">tmp2" or die "tmp2 : $!";
        binmode IN;
        binmode OUT;
        while (<IN>) {
            my $bits = unpack("b*", $_);
            ## Replace '00' with '11' and vice versa, one aligned pair at a time
            $bits =~ s[(..)][ $1 eq '00' ? '11' : $1 eq '11' ? '00' : $1 ]ge;
            ## Discard any partial frame from the front of the file
            ## (non-greedy, so only data before the first sync word goes).
            $bits =~ s/^.*?(?=11111011000010001111011100010000)// if $. == 1;
            print OUT pack 'b*', $bits;
        }
        close IN;
        close OUT;
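One thing to verify with this approach: once the partial frame has been trimmed, the first record may no longer hold a multiple of 8 bits, and pack 'b*' pads the last byte with zero bits. Those padding bits would land in the middle of the output stream. A possible way around it, sketched here under the assumption that leftover bits should simply be carried into the next record (make_bit_packer is a made-up name, not part of the code above):

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that packs successive bit strings,
# carrying any leftover bits (fewer than 8) into the next call instead of
# letting pack zero-pad them. Pass a true second argument on the last call
# to flush whatever is left, padded by pack as usual.
sub make_bit_packer {
    my $carry = '';
    return sub {
        my ($bits, $flush) = @_;
        $bits = $carry . $bits;
        my $whole = length($bits) - length($bits) % 8;  # bits in complete bytes
        $carry = substr $bits, $whole;
        my $out = pack 'b*', substr($bits, 0, $whole);
        if ($flush && length $carry) {
            $out .= pack 'b*', $carry;
            $carry = '';
        }
        return $out;
    };
}

# Usage sketch: two records, the first one 3 bits long after trimming.
my $packer  = make_bit_packer();
my $carried = $packer->('111') . $packer->('0000010101010', 1);
my $naive   = pack('b*', '111') . pack('b*', '0000010101010');
printf "carried: %d bytes, naive: %d bytes\n", length $carried, length $naive;
```

In the loop above this would replace the direct print OUT pack 'b*', $bits with a call to the closure, plus one flushing call after the loop.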

Re^5: pack/unpack binary editing
by samizdat (Vicar) on Feb 08, 2005 at 15:24 UTC
    I am in awe of BrowserUk's (and your) willingness to tackle this as a challenge.

    May I humbly offer a simpler suggestion? Do you have control of the file-generator's output? Can you -- by accepting a somewhat larger datafile -- align your data on 512-byte boundaries with padding? Then, the frame is always recognizable and there's room for an individual frame to grow or shrink a bit.
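If the generator can be changed, the padding idea might look roughly like this. It is only a sketch: the helper names, the use of NUL bytes as padding and the fixed 512-byte slot are assumptions, not anything the generator is known to support.

```perl
use strict;
use warnings;

use constant SLOT_BYTES => 512;   # assumed slot size, taken from the suggestion

# Hypothetical helper: pad one frame with NUL bytes up to the slot size.
sub pad_frame {
    my ($frame) = @_;
    die "frame longer than slot\n" if length($frame) > SLOT_BYTES;
    return pack('a' . SLOT_BYTES, $frame);   # 'a' pads with "\0"
}

# Hypothetical helper: byte offset of frame number $n (0-based) in a padded file.
sub frame_offset {
    my ($n) = @_;
    return $n * SLOT_BYTES;
}

# Usage sketch: a 480-byte frame occupies one 512-byte slot.
my $slot = pad_frame('x' x 480);
printf "slot is %d bytes; frame 3 starts at byte %d\n",
    length $slot, frame_offset(3);
```

With every frame starting at a multiple of the slot size, a reader could seek straight to a frame instead of searching for the sync word.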