
Related to another thread, I need to do simple pattern matching and replacing on a file in binary mode (because it's a binary file!).

The main difference will not be in the replacing, but in the reading and writing. Most substituting can be done by reading one line, substituting, writing, and repeating. This allows very large text files to be processed quickly: you never allocate a huge buffer to hold the entire file, and it doesn't matter if the file is too big to fit in memory.

For a binary file, you could get a similar process quite easily with $/ = \4096;, which would cause <IN> to read a 4096-byte chunk each time. Unfortunately, '77777' could end up with the first two characters at the end of one buffer and the last three characters at the beginning of the next buffer (for example), so s/77777/.../ would fail to substitute that case.
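To see the boundary problem in miniature, here is a small sketch. It assumes a made-up 9-byte sample string and a 4-byte record size (instead of 4096) so that '77777' is guaranteed to straddle two chunks; it reads from an in-memory filehandle rather than a real file.

```perl
use strict;
use warnings;

# Hypothetical sample data: '77777' occupies offsets 2..6, so with
# 4-byte records it is split between the first and second chunk.
my $data = "ab77777cd";

open my $in, '<', \$data or die "Can't open in-memory handle: $!";
binmode $in;

my $out = '';
{
    local $/ = \4;    # fixed-size records: each <$in> returns up to 4 bytes
    while ( my $chunk = <$in> ) {
        # No single chunk ever contains all five 7s.
        $chunk =~ s/77777/XXXXX/g;
        $out .= $chunk;
    }
}
close $in;

print $out eq $data ? "match was missed\n" : "match was replaced\n";
```

The chunks seen by the substitution are "ab77", "777c" and "d", so the data comes out unchanged.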

If your binary files are small enough to fit into memory (preferably fit into physical memory but fitting into virtual memory may still be 'fast enough'), then you can just slurp the whole file into a single scalar quite easily (using a 'slurp' module or setting $/ to undef, etc.).
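For the fits-in-memory case, a minimal sketch might look like the following. The name slurp_subst and its argument order are made up for illustration, and it assumes the replacement is a plain string, not something containing backreferences.

```perl
use strict;
use warnings;

# Hypothetical helper: read all of $path, substitute, write to $outpath.
# Returns the number of substitutions made.
sub slurp_subst {
    my ( $path, $outpath, $regex, $repl ) = @_;

    open my $in, '<:raw', $path or die "Can't read $path: $!";
    my $data = do { local $/; <$in> };    # undef $/ means "read everything"
    close $in;
    $data = '' unless defined $data;

    # s///g returns the count of replacements, or false if there were none.
    my $count = ( $data =~ s/$regex/$repl/g ) || 0;

    open my $out, '>:raw', $outpath or die "Can't write $outpath: $!";
    print {$out} $data;
    close $out or die "Can't close $outpath: $!";

    return $count;
}
```

The ':raw' layer does the same job as calling binmode on the handle after opening it.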

If your binary files are too big, then things get trickier. Probably the most general solution is to use a sliding window. Pick a string length that you are pretty sure is longer than any substring that you'll run into that matches your pattern:

sub binSubst {
    my( $infile, $outfile, $regex, $repl, $maxlen, $bufsize )= @_;
    binmode($infile);
    binmode($outfile);
    $bufsize ||= 16*1024;
    my $buf= '';
    # Read the next chunk, appending to any left-over bytes:
    while(  sysread( $infile, $buf, $bufsize, length($buf) )  ) {
        # Remember whether this pass matched, so a stale $+[0] is never used:
        my $matched= $buf =~ s/$regex/$repl/g;
        # How much to write out, unless...
        my $end= length($buf)-$maxlen;
        # (never negative, or substr would count from the end)
        $end= 0    if  $end < 0;
        # ... we matched after that point and so
        # should write up to the end of last match:
        $end= $+[0]    if  $matched  &&  $end < $+[0];
        # Write out what we can, removing it from the buffer:
        print $outfile substr($buf,0,$end,'');
    }
    # Write out any left overs:
    print $outfile $buf;
}
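One thing that may be worth checking before relying on the sub above: as far as I can tell, after s///g the offsets in @+ refer to the string as it was before the substitution, so the "write up to the end of the last match" step is only exact when the replacement is the same length as what it replaces (which is the usual case when patching binaries). The following is a variant sketch, not the code above: it finds matches with m//g and writes the pieces out itself, so it never needs offsets into a modified string. The name stream_subst is made up, it uses read rather than sysread so that it also works on in-memory handles, and it assumes $maxlen is at least 1 and $repl is a plain string.

```perl
use strict;
use warnings;

# Hypothetical variant of the sliding-window idea.
sub stream_subst {
    my ( $in, $out, $regex, $repl, $maxlen, $bufsize ) = @_;
    binmode $in;
    binmode $out;
    $bufsize ||= 16 * 1024;
    my $buf = '';
    # Read the next chunk, appending to any left-over bytes:
    while ( read( $in, $buf, $bufsize, length $buf ) ) {
        my $done = 0;    # how much of $buf has been dealt with so far
        while ( $buf =~ /$regex/g ) {
            # Copy the untouched bytes before the match, then the replacement.
            print {$out} substr( $buf, $done, $-[0] - $done ), $repl;
            $done = $+[0];
        }
        # Hold back the last $maxlen-1 bytes in case a match straddles the
        # chunk boundary, but never bytes that were already dealt with.
        my $keep = length($buf) - ( $maxlen - 1 );
        $keep = $done if $keep < $done;
        print {$out} substr( $buf, $done, $keep - $done );
        substr( $buf, 0, $keep, '' );
    }
    # Write out any left overs:
    print {$out} $buf;
}

# Usage with made-up data and a deliberately tiny 4-byte buffer:
my $src = "ab77777cd77777";
my $dst = '';
open my $in,  '<', \$src or die "Can't open input: $!";
open my $out, '>', \$dst or die "Can't open output: $!";
stream_subst( $in, $out, qr/77777/, '<seven>', 5, 4 );
close $out;
close $in;
```

Like the original, this still depends on $maxlen being a safe upper bound on the length of any match.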

- tye

In reply to Re: Pattern matching in binary mode (I/O) by tye
in thread Pattern matching in binary mode by punchcard_don
