Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm looking for an elegant way to read in a binary file, and delete up to a certain byte pattern. For example, if I'm looking for

0x0D 0x0A
what would be the best way to delete everything up to, but NOT including, these bytes?

Should I use `getc` to test each byte and, on a match, grab the offset? How would I then delete up to that offset (minus two bytes)? (I heard that `getc` is frowned upon.)

Thanks for any pointers.

Replies are listed 'Best First'.
(Ovid) Re: Manipulating Binary files
by Ovid (Cardinal) on May 14, 2001 at 19:50 UTC
    getc is very slow and generally shunned. How large is the file? If it can be read into a scalar, you could use the following:
    $bytes =~ s/(?:[^\x0D]|\x0D(?!\x0A))*//;
    You could also write that as (the /s modifier is needed so that . also matches 0x0A bytes in the data):
    $bytes =~ s/.*?(?=\x0D\x0A)//s;
    It's easier to read, but it's very inefficient (see Death to Dot Star! for details).
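
    For comparison, here is a minimal sketch of the same in-memory idea using index and substr instead of a regex. The helper name strip_before_crlf and the sample data are made up for illustration. Leaving the data unchanged when the pair is absent is an assumption; the first substitution above would delete everything in that case.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return $bytes with everything
# before the first 0x0D 0x0A removed. The pair itself is kept.
sub strip_before_crlf {
    my ($bytes) = @_;
    my $pos = index( $bytes, "\x0D\x0A" );
    return $bytes if $pos < 0;    # assumption: leave data untouched if no match
    return substr( $bytes, $pos );
}

# Made-up sample: a lone 0x0A and a lone 0x0D appear before the real pair.
my $sample = "junk\x0Amore\x0D junk\x0D\x0Akeep\x0D\x0Ame";
my $result = strip_before_crlf($sample);
```

    Only the first occurrence of the pair matters here; later ones are passed through untouched.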

    Aside from that, I use read. Read in chunks of an appropriate size; when you find the pattern, strip what you don't need from that chunk, write the rest to a new file, and then copy the remainder of the input across. Of course, don't forget that if you read in, say, 20 bytes at a time, the two bytes you specify could be split, so you'll need to test whether 0x0D is at the end of one read and 0x0A is at the beginning of the next.

    Hideously untested code:

    #!/usr/bin/perl -w
    use strict;

    my $in_file  = 'file1.txt';
    my $out_file = 'file2.txt';

    open IN,  "< $in_file"  or die "Can't open $in_file for reading: $!";
    open OUT, "> $out_file" or die "Can't open $out_file for writing: $!";
    binmode IN;     # in case we're on a Windows system
    binmode OUT;

    my $buffer;
    my $flag      = 0;    # true once 0x0D 0x0A has been found
    my $last_byte = 0;    # true if the previous chunk ended in 0x0D

    while ( read( IN, $buffer, 1024 ) ) {
        if ( $last_byte and substr( $buffer, 0, 1 ) eq "\x0A" ) {
            # The pair was split across two reads: put back the 0x0D
            # that was discarded with the previous chunk.
            $buffer = "\x0D" . $buffer;
            $flag   = 1;
        }
        elsif ( $buffer =~ /\x0D\x0A/ ) {
            $flag = 1;
            $buffer =~ s/(?:[^\x0D]|\x0D(?!\x0A))*//;
        }
        last if $flag;
        $last_byte = substr( $buffer, -1 ) eq "\x0D";
    }

    if ( $flag ) {
        print OUT $buffer or die "Could not write data to $out_file: $!";
        while ( read( IN, $buffer, 1024 ) ) {
            print OUT $buffer or die "Could not write data to $out_file: $!";
        }
    }
    else {
        warn "Did not find '0x0D 0x0A' in $in_file";
    }
    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

(tye)Re: Manipulating Binary files
by tye (Sage) on May 14, 2001 at 20:12 UTC

    Being paranoid about really big files, I'd probably do:

    {
        local $/= \4096;
        binmode(INPUT);
        binmode(OUTPUT);
        while( <INPUT> ) {
            if( s/^.*?(\x0d\x0a)/$1/s ) {
                print OUTPUT $_;
                last;
            }
        }
        print OUTPUT $_ while <INPUT>;
    }
    But setting $/ to a reference to a block size is a recently added feature, so be aware that your version of Perl may not support it yet. In that case you can change <INPUT> to: read(INPUT,$_,4096)
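
    To illustrate what the $/ reference does, here is a small sketch that reads made-up data in fixed-size records. The 4-byte record size, the sample string, and the in-memory filehandle (which needs Perl 5.8 or later) are all assumptions chosen to keep the example self-contained.

```perl
use strict;
use warnings;

# 10 bytes of made-up data, read through an in-memory filehandle.
my $data = "ABCDEFGHIJ";
open my $fh, '<', \$data or die "Can't open in-memory handle: $!";
binmode $fh;

my @records;
{
    local $/ = \4;    # fixed-size records of up to 4 bytes
    while ( my $record = <$fh> ) {
        push @records, $record;
    }
}
close $fh;

print join( '|', @records ), "\n";    # prints ABCD|EFGH|IJ
```

    The last record is simply shorter when the data does not divide evenly into the block size.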

            - tye (but my friends call me "Tye")
      For a one-shot program, I'd be happy to use your code. However, if one is likely to use this repeatedly (which it doesn't sound like), then there is a potential bug. What happens if 0x0D is the 4,096th character and 0x0A is the 4,097th? That would be annoying to track down (ain't boundaries a pain?).
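
      A scaled-down sketch of that boundary case, assuming a 4-byte block size as a stand-in for 4096 and made-up data in which the pair straddles the first two blocks:

```perl
use strict;
use warnings;

# Made-up data: 0x0D is byte 4 and 0x0A is byte 5, so a 4-byte block
# size splits the pair across two reads.
my $data = "abc\x0D\x0Axyz";
open my $fh, '<', \$data or die "Can't open in-memory handle: $!";
binmode $fh;

# Search each block on its own, as a per-block match would.
my $found_per_block = 0;
while ( read( $fh, my $block, 4 ) ) {
    $found_per_block = 1 if $block =~ /\x0D\x0A/;
}
close $fh;

# Search the whole string at once for comparison.
my $found_whole = $data =~ /\x0D\x0A/ ? 1 : 0;

print "per-block search found pair: $found_per_block\n";    # 0
print "whole-string search found pair: $found_whole\n";     # 1
```

      The pair is present in the data, but neither block contains both bytes, so the per-block search never sees it.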

      Cheers,
      Ovid


        Oops.

        {
            binmode(INPUT);
            binmode(OUTPUT);
            local $_= "";
            while( read( INPUT, $_, 4096, length($_) ) ) {
                if( s/^.*?(\x0d\x0a)/$1/s ) {
                    print OUTPUT $_;
                    last;
                }
                substr( $_, 0, -1 )= "";
            }
            print OUTPUT $_ while read INPUT, $_, 4096;
        }
        Thanks for catching that.

                - tye (but my friends call me "Tye")
Re: Manipulating Binary files
by MeowChow (Vicar) on May 14, 2001 at 22:07 UTC
    Perhaps I'm missing something, but all the answers thus far seem terribly overcomplicated:
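    {
        local $/ = "\x0D\x0A";
        binmode INPUT;
        binmode OUTPUT;
        while (<INPUT>) {
            if ( $. == 1 ) {
                # Drop the first record but keep its 0D 0A terminator,
                # since the question asks to delete up to but not including it.
                print OUTPUT $/ if /\x0D\x0A\z/;
            }
            else {
                print OUTPUT;
            }
        }
    }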
       MeowChow                                   
                   s aamecha.s a..a\u$&owag.print

      If the binary file is very large and the "\x0d\x0a" comes late in the file, then <INPUT> is going to read most of the file into memory, which may fail due to the above considerations.

              - tye (but my friends call me "Tye")
        I would guess that the 0D-0A sequence appears with regularity in the file, considering that it's the binary record separator for DOS/Win32 systems, equivalent to "\n" in the *nix world.
           MeowChow                                   
                       s aamecha.s a..a\u$&owag.print