in reply to Pattern matching in binary mode
Related to another thread, I need to do simple pattern matching and replacing on a file in binary mode (because it's a binary file!).
The main difference will not be in the replacing, but in the reading and writing. Most substituting can be done by reading one line, substituting, writing, and repeating. This allows very large text files to be processed quickly: there is no need to allocate a huge buffer to hold the entire file contents, and it does not matter if the file is too big to fit in memory.
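For reference, a minimal sketch of that read-substitute-write loop for text. The helper name subst_lines and the sample data are made up for this example, and in-memory filehandles stand in for real files:

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out, substituting one line at a time.
sub subst_lines {
    my( $in, $out, $regex, $repl )= @_;
    while( my $line= <$in> ) {
        $line =~ s/$regex/$repl/g;
        print $out $line;
    }
}

# In-memory filehandles stand in for real files here.
my $text= "one 77777\ntwo\nthree 77777 77777\n";
my $result= '';
open my $in,  '<', \$text   or die "open: $!";
open my $out, '>', \$result or die "open: $!";
subst_lines( $in, $out, qr/77777/, 'X' );
close $out;
print $result;
```

Only one line is held in memory at a time, which is why this scales to very large text files.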
For a binary file, you could get a similar process quite easily with $/ = \4096;, which would cause <IN> to read a 4096-byte chunk each time. Unfortunately, '77777' could end up with the first two characters at the end of one buffer and the last three characters at the beginning of the next buffer (for example), so s/77777/.../ would fail to substitute that case.
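A small sketch of that failure, assuming a made-up buffer in which '77777' is deliberately placed across the 4096-byte boundary (the filler bytes and variable names are only for illustration):

```perl
use strict;
use warnings;

# 4094 filler bytes put '77777' across the 4096-byte boundary:
# '77' ends the first record and '777' starts the second.
my $data= ( 'a' x 4094 ) . '77777' . ( 'b' x 10 );

open my $in, '<', \$data or die "open: $!";
binmode($in);

my( $chunks, $hits )= ( 0, 0 );
{
    local $/ = \4096;            # fixed-size records, not lines
    while( my $chunk= <$in> ) {
        $chunks++;
        my $n= () = $chunk =~ /77777/g;   # count matches in this chunk
        $hits += $n;
    }
}
my $whole= () = $data =~ /77777/g;        # count matches in the whole string
print "chunks=$chunks chunked_hits=$hits whole_hits=$whole\n";
```

The chunked loop sees two records and finds nothing, while matching against the whole string finds the one occurrence.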
If your binary files are small enough to fit into memory (preferably fit into physical memory but fitting into virtual memory may still be 'fast enough'), then you can just slurp the whole file into a single scalar quite easily (using a 'slurp' module or setting $/ to undef, etc.).
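As a rough sketch of the slurp approach using $/ set to undef (slurp_binary is a hypothetical helper name, and the sample bytes are invented; an in-memory filehandle stands in for a real file):

```perl
use strict;
use warnings;

# Hypothetical helper: read an entire filehandle into one scalar.
sub slurp_binary {
    my( $fh )= @_;
    binmode($fh);
    local $/;                    # undef: no record separator, read it all
    my $data= <$fh>;
    return defined $data ? $data : '';
}

# Sample binary data, including a newline and NUL bytes.
my $bytes= "\x00\x0177777\xff\n77777\x00";
open my $in, '<', \$bytes or die "open: $!";
my $buf= slurp_binary($in);

# With everything in one scalar, no match can straddle a boundary.
my $count= ( $buf =~ s/77777/ABCDE/g );
print "replaced $count occurrence(s)\n";
```

Using local keeps the change to $/ confined to the helper, so other reads in the program are unaffected.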
If your binary files are too big, then things get trickier. Probably the most general solution is to use a sliding window. Pick a string length that you are pretty sure is longer than any substring that you'll run into that matches your pattern:
    sub binSubst {
        my( $infile, $outfile, $regex, $repl, $maxlen, $bufsize )= @_;
        binmode($infile);
        binmode($outfile);
        $bufsize ||= 16*1024;
        my $buf= '';
        # Read the next chunk, appending to any left-over bytes:
        while( sysread( $infile, $buf, $bufsize, length($buf) ) ) {
            my $oldlen= length($buf);
            my $found= $buf =~ s/$regex/$repl/g;
            # How much to write out, unless...
            my $end= length($buf)-$maxlen;
            $end= 0 if $end < 0;
            # ... we matched after that point and so
            # should write up to the end of the last replacement.
            # $+[0] is an offset into the buffer as it was before
            # the substitution, so measure from the unchanged tail:
            if( $found ) {
                my $last= length($buf) - ( $oldlen - $+[0] );
                $end= $last if $end < $last;
            }
            # Write out what we can, removing it from the buffer:
            print $outfile substr($buf,0,$end,'');
        }
        # Write out any leftovers:
        print $outfile $buf;
    }
- tye
Re^2: Pattern matching in binary mode (I/O)
by kschwab (Vicar) on Oct 20, 2021 at 21:34 UTC