TheMartianGeek has asked for the wisdom of the Perl Monks concerning the following question:

I need to read data from a binary file (no special formatting, just a standard binary file with a different extension) and, starting at a certain offset, find all blocks of data within it containing only one specific character. I have some code from some nice people on a Perl-related IRC channel (although I changed it a little) for doing that:
    $/ = \0x8000;
    my @addrlist = ();
    while (<$fh>) {
        $block = $_;
        while ($block =~ m{(\x00+)}g) {
            unless (length($1) < $FreespaceSize) {
                my $t1 = length($1);
                my $t2 = pos($block) - length($1);
                push(@addrlist, $t2);
                push(@addrlist, $t1);
            }
        }
    }
    return @addrlist;
But this only solves part of the problem. This takes into account that these blocks of data should not cross certain boundaries; in this case, they should not be split between groups of 0x8000 characters. (A block that goes from offset 0x7F70 to 0x8020 would count as two blocks.) There is, however, another factor involved: These blocks must not be part of a protected area marked off with a certain ASCII string.
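
One thing worth noting about the snippet above: pos() is relative to the record just read, so the recorded offsets restart at zero for every 0x8000-byte record. A minimal sketch of one way to get file offsets instead is to keep a running base offset. The name find_runs and its parameters are hypothetical, and protected areas are not handled here.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns [file_offset, length]
# pairs for runs of NUL bytes at least $min_len long. Reading fixed-size
# records means no run can cross a record boundary.
sub find_runs {
    my ( $fh, $blk_size, $min_len ) = @_;
    binmode $fh;
    local $/ = \$blk_size;    # a reference to an integer selects record reads
    my $base = 0;             # file offset of the record being examined
    my @runs;
    while ( defined( my $block = <$fh> ) ) {
        while ( $block =~ /\x00{$min_len,}/g ) {
            # $-[0] and $+[0] are the start and end of the match in $block
            push @runs, [ $base + $-[0], $+[0] - $-[0] ];
        }
        $base += length $block;
    }
    return @runs;
}
```

If the search has to start at a certain offset, seeking the handle first and initialising $base to that offset should be enough, provided the offset is itself on a boundary.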

Specifically, I want to search for blocks of character 0x00 of a certain minimum length. If the minimum length were 8, the character tag were "74 75 76 77" (with the following character determining how many bytes to protect), and the boundaries were after every 0x40 characters, and if this were the data...
09 43 4A 00 00 00 00 00 00 00 00 00 00 00 FC B0
DD 12 46 33 73 7A 8B 01 00 00 00 00 00 00 98 40
34 3F 79 6D DC 2A 2B 35 FF 90 FA 60 66 58 5A 21
40 06 88 F2 11 EE 65 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 44 88 CC 02
A0 74 75 76 77 09 00 00 00 00 00 00 00 00 00 AA
Here, there would be an 11-byte-long block at offset 0x03, a 9-byte-long block at offset 0x37, and a 12-byte-long block at offset 0x40. The 6-byte block of 00s at offset 0x18 does not count because the minimum qualifying length is 8, and the 9-byte block at offset 0x56 does not count because it is protected by the "74 75 76 77" tag. (And in the actual situation, the data protected by the tag could straddle a boundary!) The blocks at offsets 0x37 and 0x40 must be split up into two separate blocks rather than one 21-byte-long block because of the boundary between offsets 0x3F and 0x40. See what I'm getting at here?
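
The rules in this example can be sketched as follows. This is only an illustration under stated assumptions: the whole file fits in memory, the byte after the tag counts the protected bytes that follow it, and a run that touches a protected range is dropped entirely. The name find_free_blocks and its parameters are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread). Returns [offset, length] pairs.
# Assumptions: $data holds the whole file, and the byte after $tag gives the
# number of protected bytes FOLLOWING that length byte.
sub find_free_blocks {
    my ( $data, $boundary, $min_len, $tag ) = @_;

    # Collect protected ranges as [first_offset, last_offset].
    my @protected;
    while ( $data =~ /\Q$tag\E(.)/sg ) {
        my $first = $+[0];       # offset just past the length byte
        my $count = ord $1;
        push @protected, [ $first, $first + $count - 1 ] if $count;
    }

    my @found;
    while ( $data =~ /\x00+/g ) {
        my ( $start, $end ) = ( $-[0], $+[0] );    # $end is exclusive
        while ( $start < $end ) {
            # Cut the run at the next boundary so no piece crosses it.
            my $limit = ( int( $start / $boundary ) + 1 ) * $boundary;
            my $stop  = $limit < $end ? $limit : $end;
            my $last  = $stop - 1;

            # Count the protected ranges that overlap this piece.
            my $hits = grep { $start <= $_->[1] && $last >= $_->[0] } @protected;

            push @found, [ $start, $stop - $start ]
                if $stop - $start >= $min_len && !$hits;
            $start = $stop;
        }
    }
    return @found;
}
```

Called on the sample data with a boundary of 0x40, a minimum length of 8 and the tag "\x74\x75\x76\x77", this gives the blocks at 0x03, 0x37 and 0x40 described above. Because the protected ranges are computed over the whole buffer, a protected area that straddles a boundary needs no special handling in this sketch.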

As for what I want to do with the character blocks, well...I simply need a subroutine that finds them. In scalar context, it should return the offset at the start of the first block of data in the binary file that fits the criteria (or undefined if there are none), while in list context, it should return a list containing the offsets of ALL blocks that fit the criteria.
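
The scalar-versus-list behaviour described here is what Perl's built-in wantarray is for. A minimal sketch, where first_or_all is a hypothetical wrapper and @offsets stands in for whatever the search produced:

```perl
use strict;
use warnings;

# Hypothetical wrapper: the arguments stand in for the offsets found.
sub first_or_all {
    my @offsets = @_;
    return @offsets if wantarray;    # list context: every offset
    return $offsets[0];              # scalar context: first offset, or undef
}

my @all   = first_or_all( 3, 55, 64 );    # list context
my $first = first_or_all( 3, 55, 64 );    # scalar context
my $none  = first_or_all();               # scalar context, nothing found
```

A real search routine could check wantarray earlier and stop scanning at the first qualifying block when called in scalar context.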

Replies are listed 'Best First'.
Re: Finding large blocks consisting of a single character (but within certain parameters...)
by BrowserUk (Patriarch) on Mar 07, 2011 at 09:32 UTC

    Guessing the answers to my questions above, this works for your somewhat limited sample:

    #! perl -slw
    use feature qw[ state ];
    use strict;

    sub findEm {
        state $protected;
        my( $fh, $blkSize, $chr, $chrCount ) = @_;
        my @returns;
        local $/ = \$blkSize;

        while( <$fh> ) {
            my( @matches, @protected );

            ## If we had a protected zone that spans the block boundary
            ## start with the residual
            push @protected, $protected if defined $protected;

            ## Look for preliminary matches
            push @matches, $-[0] while m[(${chr}{$chrCount,})]g;

            ## skip ahead if there are none.
            next unless @matches;

            ## look for protected zones
            push @protected, [ $-[0], ord( $1 ) ]
                while m[\x74\x75\x76\x77(.)]g;

            ## If there are some, and the last spans off the end of this block
            ## record the residual for the next block
            if( @protected and
                $protected[ -1 ][ 0 ] + $protected[ -1 ][ 1 ] > $blkSize
            ) {
                $protected = [
                    0,
                    ( $protected[ -1 ][ 0 ] + $protected[ -1 ][ 1 ] ) % $blkSize
                ];
            }
            else {
                $protected = undef;
            }

            ## Destructively iterate the protected zones
            while( @protected ) {
                my( $start, $len ) = @{ pop @protected };

                ## comparing them against each match (backward)
                for my $iMatch ( reverse 0 .. $#matches ) {
                    my $match = $matches[ $iMatch ];

                    ## if this match precedes the start of
                    ## the current protected zone, next zone
                    last if $match < $start;

                    ## If this match is beyond the end of the current zone,
                    ## next match
                    next if $match > ( $start + $len );

                    ## The two overlap so discard the match
                    splice @matches, $iMatch, 1;
                }
            }

            ## Calculate the file offset of the current block
            my $fOffset = ( $. -1 ) * $blkSize;

            ## In a non list context
            unless( wantarray ) {
                ## undef unless we've at least one match
                return unless @matches;
                ## or the file offset of the first if we have one or more
                return $fOffset + $matches[ 0 ];
            }

            ## Map the match block offsets to file offsets and remember them
            push @returns, map $fOffset + $_, @matches;
        }

        ## return them
        return @returns;
    }

    my $fileData = pack 'H*', join'',split ' ', do{ local $/; <DATA> };

    open RAM, '<', \$fileData;
    print 'Scalar context: ', scalar findEm( \*RAM, 0x40, chr(0), 8 );
    close RAM;

    open RAM, '<', \$fileData;
    print 'List context ', join ', ', findEm( \*RAM, 0x40, chr(0), 8 );

    __DATA__
    09 43 4A 00 00 00 00 00 00 00 00 00 00 00 FC B0
    DD 12 46 33 73 7A 8B 01 00 00 00 00 00 00 98 40
    34 3F 79 6D DC 2A 2B 35 FF 90 FA 60 66 58 5A 21
    40 06 88 F2 11 EE 65 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 44 88 CC 02
    A0 74 75 76 77 09 00 00 00 00 00 00 00 00 00 AA

    Outputs:

    C:\test>891765
    Scalar context: 3
    List context 3, 55, 64

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Are these returned offsets, the offsets within the file of the 32k blocks containing matching blocks of null byte blocks? Or the offsets of the null byte blocks within the 32k blocks? Or the offset of null byte blocks within the file?

      They would be the offsets of null byte blocks within the file.

      What does it return in a scalar context if the first 32k block that contains a matching, qualifying block, contains more than one?

      The offset, within the file, of the first qualifying block.

      And as for that subroutine...well, I can't make head or tail out of most of it. Does it need to be that long? And why is the "state" necessary?
        And as for that subroutine...well, I can't make head or tail out of most of it

        Hm. Then I guess it would be easier if you detailed the bits you do understand?

        Does it need to be that long?

        It's shorter if you remove the comments, but I don't think I'll ever be able to make it as short as all those other answers you got.

        And why is the "state" necessary?

        It's not. Feel free to delete it.


Re: Finding large blocks consisting of a single character (but within certain parameters...)
by BrowserUk (Patriarch) on Mar 07, 2011 at 07:48 UTC
    In scalar context, it should return the offset at the start of the first block of data in the binary file that fits the criteria (or undefined if there are none), while in list context, it should return a list containing the offsets of ALL blocks that fit the criteria.

    Are these returned offsets, the offsets within the file of the 32k blocks containing matching blocks of null byte blocks? Or the offsets of the null byte blocks within the 32k blocks? Or the offset of null byte blocks within the file?

    What does it return in a scalar context if the first 32k block that contains a matching, qualifying block, contains more than one?

