
I need to read data from a binary file (no special formatting, just a standard binary file with a different extension) and, starting at a certain offset, find all blocks of data within it that contain only one specific character. I have some code from some nice people on a Perl-related IRC channel (although I changed it a little) for doing that:

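    # Read the file in fixed 0x8000-byte records so no match can span a boundary.
    local $/ = \0x8000;
    my @addrlist = ();
    while (my $block = <$fh>)
    {
        while ($block =~ m{(\x00+)}g)
        {
            my $len = length($1);
            next if $len < $FreespaceSize;
            # Note: this offset is relative to the start of the current record.
            my $offset = pos($block) - $len;
            push(@addrlist, $offset, $len);
        }
    }
    return @addrlist;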

But this only solves part of the problem. It does account for the rule that these blocks must not cross certain boundaries; in this case, they must not be split between groups of 0x8000 characters. (A block that goes from offset 0x7F70 to 0x8020 would count as two blocks.) There is, however, another factor involved: these blocks must not be part of a protected area marked off with a certain ASCII string.

Specifically, I want to search for blocks of character 0x00 of a certain minimum length. If the minimum length were 8, the character tag were "74 75 76 77" (with the following character determining how many bytes to protect), and the boundaries were after every 0x40 characters, and if this were the data...

09 43 4A 00 00 00 00 00 00 00 00 00 00 00 FC B0
DD 12 46 33 73 7A 8B 01 00 00 00 00 00 00 98 40
34 3F 79 6D DC 2A 2B 35 FF 90 FA 60 66 58 5A 21
40 06 88 F2 11 EE 65 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 44 88 CC 02
A0 74 75 76 77 09 00 00 00 00 00 00 00 00 00 AA

Here, there would be an 11-byte-long block at offset 0x03, a 9-byte-long block at offset 0x37, and a 12-byte-long block at offset 0x40. The 6-byte block of 00s at offset 0x18 does not count because the minimum qualifying length is 8, and the 9-byte block at offset 0x56 does not count because it is protected by the "74 75 76 77" tag. (And in the actual situation, the data protected by the tag could straddle a boundary!) The blocks at offsets 0x37 and 0x40 must be treated as two separate blocks rather than one 21-byte-long block because of the boundary between offsets 0x3F and 0x40. See what I'm getting at here?
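To make the example above concrete, here is a rough sketch of the search run against that sample data held in memory. The name `find_free_blocks` and its argument order are made up for illustration. It also rests on two assumptions I haven't confirmed: that the byte after the tag counts the protected bytes that follow it, and that a run touching a protected range at all should be rejected whole rather than trimmed.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [offset, length] pairs for qualifying runs.
sub find_free_blocks {
    my ($data, $min_len, $boundary, $tag) = @_;

    # Collect protected ranges as [start, end) pairs. Assumption: the byte
    # right after the tag is the number of protected bytes that follow it.
    my @protected;
    while ($data =~ /\Q$tag\E(.)/sg) {
        my $start = pos($data);
        push @protected, [$start, $start + ord($1)];
    }

    my @found;
    for (my $base = 0; $base < length($data); $base += $boundary) {
        # Searching one chunk at a time means a run can never cross a boundary.
        my $chunk = substr($data, $base, $boundary);
        while ($chunk =~ /(\x00+)/g) {
            my $len = length($1);
            next if $len < $min_len;
            my $start = $base + pos($chunk) - $len;
            my $end   = $start + $len;
            # Assumption: any overlap with a protected range rejects the run.
            next if grep { $start < $_->[1] && $_->[0] < $end } @protected;
            push @found, [$start, $len];
        }
    }
    return @found;
}

my $hex = <<'END';
09 43 4A 00 00 00 00 00 00 00 00 00 00 00 FC B0
DD 12 46 33 73 7A 8B 01 00 00 00 00 00 00 98 40
34 3F 79 6D DC 2A 2B 35 FF 90 FA 60 66 58 5A 21
40 06 88 F2 11 EE 65 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 44 88 CC 02
A0 74 75 76 77 09 00 00 00 00 00 00 00 00 00 AA
END
(my $packed = $hex) =~ s/\s+//g;
my $data = pack('H*', $packed);

# Prints "0x03 11", "0x37 9" and "0x40 12", one per line.
printf "0x%02X %d\n", @$_ for find_free_blocks($data, 8, 0x40, "\x74\x75\x76\x77");
```

This works on a string already in memory, so a real version reading from a filehandle would still need to handle a tag or protected area that straddles a read boundary.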

As for what I want to do with the character blocks, well...I simply need a subroutine that finds them. In scalar context, it should return the offset at the start of the first block of data in the binary file that fits the criteria (or undef if there are none), while in list context, it should return a list containing the offsets of ALL blocks that fit the criteria.
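The scalar-versus-list behaviour described above is what Perl's built-in `wantarray` is for. Here is a minimal sketch; the name `free_block_offsets` is hypothetical, and the offsets passed in are just stand-ins for whatever the actual search produces.

```perl
use strict;
use warnings;

# Hypothetical wrapper: @offsets stands in for the result of the real search.
sub free_block_offsets {
    my @offsets = @_;
    # wantarray is true in list context and false in scalar context.
    # $offsets[0] is undef when no block qualified.
    return wantarray ? @offsets : $offsets[0];
}

my @all   = free_block_offsets(0x03, 0x37, 0x40);   # all three offsets
my $first = free_block_offsets(0x03, 0x37, 0x40);   # 0x03 only
my $none  = free_block_offsets();                   # undef
```

A real version could also stop searching after the first hit when called in scalar context, instead of collecting every offset first.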

In reply to Finding large blocks consisting of a single character (but within certain parameters...) by TheMartianGeek
