tperdue has asked for the wisdom of the Perl Monks concerning the following question:

I have a 3 gig binary file that I need to edit. I'm running into memory problems since I have to pack and unpack the file. I've tried running on Unix, Linux and Windows machines with no luck. Any ideas on what I can do? I basically need to open the file and edit it at a bit level. I need to find certain bit patterns and remove them. I've tried editing using hex characters, but have found that when I remove a character, instead of only 8 bits being removed, 16 get removed. I have no idea why that happens.

Per requests below is the code:

#!/usr/local/bin/perl -w

undef $/;  # Clear the record separator.
$| = 1;    # Don't buffer data. Write out when available.

open IN, "$ARGV[0]";
binmode IN;  # Use binary mode so windows/DOS won't complain

# Convert original file into binary and write to $final
while (<IN>) {  # Loop thru file and convert file to 1's/0's
    $array = unpack("B*", $_);
    open OUT, ">>tmp";
    print OUT "$array";
    close OUT;
    undef $array;
}
close IN;
print "Finished unpacking data.\n\n";

undef $/;
open IN, "tmp";

# Extract the necessary data bits
while (<IN>) {
    $_ =~ s/11110100.{8}(.{1520})11110100.{8}(.{464}).{1056}/$1$2/g;
    $final = pack("B*", $_);  # Convert data back to original binary format
    open OUT, ">>$ARGV[1]";
    print OUT "$final";
    close OUT;
    undef $final;
}
close IN;
unlink "tmp";
print "Finished converting to binary.\n\n";

20041021 Janitored by Corion: Added formatting

Replies are listed 'Best First'.
Re: Out of memory problems
by BrowserUk (Patriarch) on Oct 20, 2004 at 21:39 UTC

    It's difficult to understand your problem(s) on the basis of this rather sparse description of them. It's much easier to suggest solutions if you show us the code.

    For the memory problem, it's almost always possible to avoid running out of memory by only reading manageable chunks of the file at a time. See perlvar and the description for $/. In particular, setting $/ = \4096; reads 4 KB at a time with <fh>. Alternatively, look at sysread in perlfunc.
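    For example, a minimal chunked-read skeleton (a sketch only; the file name is hypothetical):

    use strict;

    local $/ = \4096;   # make <> return at most 4 KB per read
    open my $fh, '<:raw', 'data.bin' or die "data.bin: $!";
    while ( my $chunk = <$fh> ) {
        ## $chunk holds up to 4096 bytes; process it here
    }
    close $fh;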

    For your editing problem, it sounds like you're using a regex on binary data, but that data is being treated as Unicode. That's total speculation; it would be clarified by seeing the code, or at least a worked example of the input, the method of editing, and the actual and expected outputs.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      I updated my original question with the code I'm using. Hopefully that'll help.

        Hmm. I'm not surprised your processing is slow. Converting the whole 3 GB to ASCII-ized binary (8 characters per byte) will produce a 24 GB file, and then the search/replace over that, and the conversion back, run as separate stages over those 24 GB.

        Looking at your regex, it looks like your patterns are probably byte-aligned? Very few processes produce data that is not... but I realise that it is possible. This is how I would tackle the task, assuming that the data is byte-aligned.

        The technique is called a 'sliding buffer'. You read a buffer of at least double the size of the thing you are looking for. In this case, 384 x 2 = 768 bytes. You then apply the regex to the whole of that buffer, write out the first half of it, move the second half to the front, top it up from the input file and re-apply the regex to the whole thing.

        Note. That is a simplified description and the following code implements that simplified description by way of example only. It isn't tested and almost certainly won't work as is.

        #!/usr/local/bin/perl -w
        use strict;

        open IN,  '< :raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";
        open OUT, '> :raw', $ARGV[ 1 ] or die "$ARGV[ 1 ] : $!";

        my $buffer;
        sysread IN, $buffer, 384, 384;

        do {
            ## Move the second half of the buffer to the front.
            $buffer = substr( $buffer, 384 );

            ## and overwrite it with a new chunk
            sysread IN, $buffer, 384, length( $buffer );

            ## Apply the regex (/s so that . also matches newline bytes)
            $buffer =~ s[
                \xF4 .       ## The marker byte plus friend
                ( .{190} )   ## 1520 bits to retain
                \xF4 .       ## Second marker + friend
                ( .{58} )    ## 464 bits to retain
                .{132}       ## 1056 bits to discard
            ][$1$2]xsg;

            ## Write out the first half of the buffer
            syswrite OUT, $buffer, 384;
        } until eof IN;

        close IN;
        close OUT;

        There are some further complications that need to be addressed.

        • If your data is not byte aligned, then you can still unpack/pack the binary data on input/output and use your original regex--but avoid this if it is not necessary, as it will slow your process down by an order of magnitude.
        • As you are modifying the buffer when the regex matches, you will need to apply different handling when a match has occurred and when it has not.

          When no substitution has taken place, the above method should be okay.

          When a substitution has occurred, you will need to ensure that you do not re-process the part of the buffer up to the place where the end of the last substitution occurred--otherwise you could get a false match/substitution occurring. See pos in perlfunc and/or @- & @+ in perlvar for more information.

          Basically, you need to find out which part of the buffer has already been processed and write that out and only copy the remainder to the front of the buffer before refilling it to the x2 length.

          It will take thought, and testing (and test data), to get this right! It is non-trivial to do and harder to describe; a sketch of the carry logic follows this list.

        • Using a larger buffer.

          Once you have the process working, instead of using a x2 buffer, you could save considerable time by reading and writing much larger chunks-- 1, 2, 10, or even 100 MB would be possible.

          The only thing to remember is that if no match occurs you must retain the last 384 bytes from the buffer and place them at the front of the next buffer to ensure full coverage.

          If one (or more) substitutions have occurred, then you must retain everything from the end of the last substitution to the end--if that is less than 384 bytes--or the last 384 bytes otherwise.
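        To make that concrete, here is one way the carry logic might look (a sketch only, untested; it reuses the frame layout and regex from the code above, and the 1 MB chunk size is an arbitrary choice):

        #!/usr/local/bin/perl -w
        use strict;

        my $FRAME = 384;          # longest possible match
        my $CHUNK = 1024 * 1024;  # refill size; purely illustrative

        open my $in,  '<:raw', $ARGV[0] or die "$ARGV[0]: $!";
        open my $out, '>:raw', $ARGV[1] or die "$ARGV[1]: $!";

        my $buf = '';
        while ( read $in, $buf, $CHUNK, length $buf ) {
            next if length( $buf ) <= $FRAME;   # not enough to scan yet

            my $tail = $FRAME;                  # default carry: one frame
            if ( $buf =~ s/\xF4.(.{190})\xF4.(.{58}).{132}/$1$2/sg ) {
                ## $-[0] is the start of the last match; its replacement is
                ## 248 bytes (190 + 58), so the scan resumed at $-[0] + 248.
                ## Everything after that point was searched and found clean,
                ## so at most the final $FRAME bytes can hide a partial match.
                my $after = length( $buf ) - ( $-[0] + 248 );
                $tail = $after < $FRAME ? $after : $FRAME;
            }

            print $out substr( $buf, 0, length( $buf ) - $tail );
            $buf = $tail ? substr( $buf, -$tail ) : '';
        }

        $buf =~ s/\xF4.(.{190})\xF4.(.{58}).{132}/$1$2/sg;  # final sweep
        print $out $buf;
        close $in;
        close $out;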

        Processing your 3 GB this way should take no more than 1/2 hour if you can avoid the packing/unpacking. And not much more than a couple of hours if you cannot. These are ballpark figures!

        I hope that rather sketchy explanation makes some kind of sense. Good luck.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Out of memory problems
by Anonymous Monk on Oct 20, 2004 at 21:36 UTC
    How are you reading the file, and how many bytes are you reading at a time?

    Are you using <FH>, with or without changing $/? or read/sysread(FH)?

    If it is possible, then I suggest you read only parts of your file at a time (1 KB? 100 KB? 1 MB? ...). You can do this either by using read/sysread or by setting $/ to a reference to an integer, e.g. $/ = \1024; this way <FH> will/should only read 1024 bytes.
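    For example, the sysread form of that idea (a sketch only; the file name is hypothetical):

    use strict;

    open my $fh, '<:raw', 'data.bin' or die "data.bin: $!";
    my $buf;
    while ( my $got = sysread $fh, $buf, 1024 ) {
        ## $buf now holds $got bytes (up to 1024) read straight from the file
    }
    close $fh;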

      Here is the code I'm using (the same code as in the updated question above). Hopefully it'll help.

Re: Out of memory problems
by TedPride (Priest) on Oct 21, 2004 at 05:00 UTC
    Read the file in chunks, unpack each chunk, check for patterns in the last chunk plus the current chunk (if the patterns may extend over chunk boundaries), pack and print half the modified data to a second file, assign the other half to last chunk, rinse and repeat. There should be no need to ever have more than a few K of data in memory.
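    A sketch of that rinse-and-repeat loop (untested; the chunk size is arbitrary and the substitution is the one from the question):

    use strict;

    my $CHUNK = 4096;   # bytes per read; purely illustrative
    open my $in,  '<:raw', $ARGV[0] or die "$ARGV[0]: $!";
    open my $out, '>:raw', $ARGV[1] or die "$ARGV[1]: $!";

    my $last = '';
    while ( read $in, my $raw, $CHUNK ) {
        my $bits = $last . unpack( 'B*', $raw );  # patterns may span the seam
        $bits =~ s/11110100.{8}(.{1520})11110100.{8}(.{464}).{1056}/$1$2/g;
        ## flush the first half (rounded down to a whole byte), carry the rest
        my $half = 8 * int( length( $bits ) / 16 );
        print $out pack( 'B*', substr( $bits, 0, $half ) );
        $last = substr( $bits, $half );
    }
    print $out pack( 'B*', $last );               # flush whatever remains
    close $in;
    close $out;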
      I would have a problem with patterns extending over chunk boundaries. Do you have an example of how I would overcome this?
Re: Out of memory problems
by superfrink (Curate) on Oct 21, 2004 at 02:44 UTC
    Just a couple quick notes as I'm running out the door.

    Under Linux at least there could be a 2 gig file limit in the OS or system libraries, or in Perl itself if it was built without large-file support (perl -V:uselargefiles will tell you).

    Check how much RAM your process is allowed to use. In bash, run "ulimit -a" to see what your limits are. You have to be root to raise hard limits, and the limits are inherited by child processes. Oh, and the details should be in the "bash" man page.

    Also, depending on how you're dealing with the file (e.g. reading it all into an array), the actual system may be running out of RAM and/or swap space.

    Good luck. - chad
Re: Out of memory problems
by periapt (Hermit) on Oct 21, 2004 at 12:36 UTC

    I think TedPride has the right idea. Setting $INPUT_RECORD_SEPARATOR to undefined (undef $/) has the effect of causing the while (<IN>) statement to slurp the entire three+ gig file at once. From the docs:
    "Entirely undefining $/ makes the next line input operation slurp in the remainder of the file as one scalar value"
    Set $/ to some reasonable size that satisfies your memory limitations. Then you can match blocks of the file in much the way TedPride describes. Something like this:

    #undef $/;
    $/ = \2048;   # 2K blocks
    my $blocksz = 2048;
    my $final = '';
    open IN, "tmp";
    open OUT, ">>$ARGV[1]";   # more efficient when parsing more than one block

    # Extract the necessary data bits
    my $block01 = <IN>;
    my $block02 = '';
    while ($block02 = <IN>) {
        my $block = $block01 . $block02;
        $block =~ s/11110100.{8}(.{1520})11110100.{8}(.{464}).{1056}/$1$2/g;
        # $final = pack("B*", $block);                    # this is wrong
        $final = pack("B*", substr($block, 0, $blocksz)); # this should work
        print OUT "$final";
        $final = '';   # this is strictly unnecessary but does keep the variable clean
        $block01 = substr($block, -$blocksz);   # this moves the upper block down
    }
    $final = pack("B*", $block01);   # get final block
    print OUT "$final";
    close OUT;
    close IN;

    Update:

    Corrected a couple lines in code

    PJ
    use strict; use warnings; use diagnostics;
      I gave the updated code a shot; however, I'm still getting extra data added at the block boundaries. It's as if, instead of using what's left at the end of a boundary, it's adding more than that chunk. Any ideas? If you think it's better to correspond via email, please send me a note at tp@g-c-i.net
        Actually, in looking at the code a second time, the problem is with the $final = pack("B*", $block); statement. It should read $final = pack("B*", substr($block,0,BLOCKSZ));. Sorry about that. Please see the amended code above. (I used a variable $blocksz in the code in place of BLOCKSZ in this discussion.)

        $block = $block01.$block02 creates a single variable, $block, of size BLOCKSZ * 2 (4096 in my code). The substitution works across the read boundary of 2048 between blocks 01 and 02 for this one instance. The substitution will fail if the pattern crosses the upper boundary of $block02, since the pattern is incomplete there. Thus, after writing out $block01, you move $block02 into $block01 so that the next pattern substitution will catch any pattern that crosses that boundary. Actually, come to think of it, you should be assigning the upper BLOCKSZ of $block to $block01, i.e. $block01 = substr($block,-BLOCKSZ).

        As for speed, you could increase the size of your blocks maybe to 32768 or 65536 or larger if you have the memory.

        You're using some pretty big sequences in the substitution regex; I wonder if that isn't your biggest bottleneck. Is it possible to break up your pattern into parts? You might pick up some speed by using several smaller substitutions rather than one big one. I'm not a regex guru (sort of a novice, really), but it seems that there is the potential for a lot of backtracking in your regex, and that has got to take time. Maybe one of the more experienced monks can speak to that.
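        One way to cut the backtracking down (a sketch, untested, offered only as an illustration; $raw_block is a hypothetical name for one raw input block) is to hunt for the first marker with index and then check the second marker at its fixed offset, instead of letting the regex engine grope through the big wildcards:

        ## $raw_block is assumed to hold one raw input block (hypothetical name)
        my $bits = unpack 'B*', $raw_block;

        my $from   = 0;    # everything before here has been copied out
        my $search = 0;    # where to look for the next marker
        my $kept   = '';
        while ( ( my $hit = index( $bits, '11110100', $search ) ) >= 0 ) {
            last if $hit + 3072 > length $bits;   # incomplete frame at the end
            if ( substr( $bits, $hit + 1536, 8 ) eq '11110100' ) {
                $kept .= substr( $bits, $from, $hit - $from );  # unmatched lead
                $kept .= substr( $bits, $hit + 16,   1520 );    # first kept field
                $kept .= substr( $bits, $hit + 1552, 464 );     # second kept field
                $from = $search = $hit + 3072;                  # skip the frame
            }
            else {
                $search = $hit + 1;   # lone marker; keep scanning
            }
        }
        $kept .= substr( $bits, $from );   # copy the trailing remainder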

        The rest of the algorithm should be fairly quick. I would recommend that you move the file open operation open OUT, ">>tmp"; (and the related close op) out of your first loop. That will cut some of the overhead of opening and closing a file. pack and unpack are pretty efficient, so you probably can't squeeze any more out of those ops. I'm not sure if this matters much, but you don't have to undef $array each time in the first loop; there is a little overhead involved in reinitializing $array each time. Setting $array = '' will accomplish the same thing without forcing the loop to recreate $array each time through. Every little bit adds up, particularly when a loop repeats tens of thousands of times.

        I'll have to try benchmarking this sometime. Maybe after work ...

        Update: Running a simple benchmark on the undef vs. nullifying options produced this (786500 is approximately the number of reads necessary to absorb a file of ~3 GB in 4K chunks). The second option runs about 17% faster on the first test. And the second comparison, testing the open and close ops, ran over 900% faster even on a short run of 3 CPU seconds.
        use strict;
        use warnings;
        use diagnostics;
        use Benchmark qw(cmpthese);

        cmpthese(-60, {
            a => sub { for (0 .. 786500) { my $array = '1'; undef $array; } },
            b => sub { for (0 .. 786500) { my $array = '1'; $array = ''; } },
        });

        cmpthese(0, {
            a => sub {
                for (0 .. 10) {
                    my $array = '1';
                    open OUT, ">>tmp";
                    print OUT "$array";
                    undef $array;
                    close OUT;
                }
            },
            b => sub {
                open OUT, ">>tmp";
                for (0 .. 10) {
                    my $array = '1';
                    print OUT "$array";
                    undef $array;
                }
            },
        });

        __END__
        Benchmark: running a, b, each for at least 60 CPU seconds...
                 a: 62 wallclock secs (60.50 usr +  0.00 sys = 60.50 CPU) @  1.69/s (n=102)
                 b: 64 wallclock secs (62.31 usr +  0.00 sys = 62.31 CPU) @  1.97/s (n=123)
               Rate    a    b
        a    1.69/s   -- -15%
        b    1.97/s  17%   --
        Benchmark: running a, b, each for at least 3 CPU seconds...
                 a: 11 wallclock secs ( 0.03 usr +  3.75 sys =  3.78 CPU) @  2.12/s (n=8)
                 b:  8 wallclock secs ( 0.00 usr +  3.14 sys =  3.14 CPU) @ 23.24/s (n=73)
               Rate    a     b
        a    2.12/s   --  -91%
        b    23.2/s 998%    --

        PJ
        use strict; use warnings; use diagnostics;
Re: Out of memory problems
by periapt (Hermit) on Oct 27, 2004 at 14:13 UTC
    BrowserUk wasn't kidding: this is not a simple problem. How about something like the code below? This way, rather than reducing your original data block to what you need, you build the output block from what you need, block by block. I didn't test all of the edge conditions, but it worked on the main ones. Let me know if this helps.

    $/ = \384;   # one 384-byte frame per read
    my $blocksz = 384;
    my $final = '';
    my $block = '';
    my $lastfound = 0;
    my $blocklen = 0;
    open IN, "perlmonks66_tmp.txt";
    open OUT, ">>pm66out.txt";   # more efficient when parsing more than one block

    # Extract the necessary data bits
    my $block01 = <IN>;
    my $block02 = '';
    while ($block02 = <IN>) {
        # if pattern matches
        my $tmphold = '';
        do {
            $block = $block01 . $block02;
            $block02 = '';
            pos($block) = 0;        # position pointer at beginning
            $blocklen = length($block);
            $block =~ m/\G          # pick up where you left off
                (.*)                # match zero or more chars up to ...
                11110100            # match byte marker (0xF4)    ( 1 byte)
                .{8}                # match any seq of 8 chars    ( 1 byte)
                (.{1520})           # capture the next 1520 chars (190 bytes)
                11110100            # match byte marker (0xF4)    ( 1 byte)
                .{8}                # match any seq of 8 chars    ( 1 byte)
                (.{464})            # capture next 464 chars      ( 58 bytes)
                .{1056}             # match any seq of 1056 chars (132 bytes)
            /xg;
            if (defined(pos($block))) {
                # match found ==> use the captured parts
                my ($tmp1, $tmp2, $tmp3) = ($1, $2, $3);
                $tmphold .= join('', map { $_ ||= '' } ($tmp1, $tmp2, $tmp3));
                $block01 = substr($block, -($blocklen - $blocksz));
            } else {
                if (length($block) >= 2 * $blocksz) {
                    $tmphold .= substr($block, 0, $blocksz);
                    $block01  = substr($block, -($blocklen - $blocksz));
                } else {
                    $block01 = $block;
                    $block   = '';
                }
            }
        } while (length($block) >= $blocksz);
        # save the remaining unmatched/unchecked chars
        $final = $tmphold;
        print OUT "$final";
        $final = '';   # this is strictly unnecessary but does keep the variable clean
    }
    $final = pack("B*", $block01);   # get final block
    print OUT "$final";
    close OUT;
    close IN;
    exit;

    PJ
    use strict; use warnings; use diagnostics;
      Thanks to BrowserUk I was able to get the following code to work for byte-aligned data. However, I've run into a situation where some of the data isn't byte aligned, meaning the patterns I'm looking for go across byte boundaries, so I'll have to do this at a bit level. Any idea how to tweak the code to do this?

      #! perl -sw
      use strict;
      use bytes;

      open IN,  '< :raw', $ARGV[0] or die "$ARGV[0] : $!";
      open OUT, '> :raw', $ARGV[1] or die "$ARGV[1] : $!";

      ## Grab a double buffer load first time so we can check & correct alignment
      local $/ = \768;
      my $buf = <IN>;   ## Read two frames worth

      ## Check alignment. Assumes the \xF4 .{191} \xF4 sequence is unique per frame?
      $buf =~ m(\xF4.{191}\xF4);

      ## Record the offset to the first frame
      my $offset = $-[0];

      ## If there was an offset to the first match
      if ($offset != 0) {
          ## Chop off the leading junk
          substr($buf, 0, $offset, '');

          ## Top up the buffer to two full frames
          read(IN, $buf, $offset, 768 - $offset);

          warn "$offset bytes discarded from front of file.";
      }

      ## Process the first two whole frames
      print OUT unpack 'x2 a190 x2 a58 x132' x 2, $buf;

      ## Now process as before
      local $/ = \384;   ## Read file in 384-byte chunks.
      while (<IN>) {
          print OUT unpack 'x2 a190 x2 a58', $_;
      }

      close IN;
      close OUT;