in reply to pack/unpack binary editing

I'm looking at a 9 gig file and unpacking would result in 72 gigs.

There is absolutely no reason to convert your whole file into asciized binary.

You can access every bit in your file by using sysseek to position the read head and sysread to read a single byte or 4 bytes or 20.

You can then manipulate the bits of the bytes you read in using vec or boolean logic, or by unpacking to '0's & '1's if you prefer.
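As a minimal sketch of those three options, assuming a single byte has already been read into $byte (the value 0x0F is only an illustration):

```perl
use strict;
use warnings;

my $byte = "\x0F";                  # one byte, as sysread would return it: 00001111

# vec: bit 0 is the least significant bit of the byte.
my $bit0 = vec( $byte, 0, 1 );      # 1
vec( $byte, 7, 1 ) = 1;             # set the top bit: 10001111

# Boolean logic on the numeric value of the byte.
my $n = ord $byte;
$n &= 0xFE;                         # clear bit 0:  10001110
$n ^= 0x02;                         # toggle bit 1: 10001100
$byte = chr $n;

# Or look at the byte as '0's and '1's, most significant bit first.
my $bits = unpack 'B8', $byte;
print "$bits\n";                    # prints 10001100
```

Note that vec numbers bits from the low end of each byte, whereas unpack 'B8' lists them from the high end.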

Once you have twiddled your bits, you can write them back by repositioning the file position with sysseek and writing out the modified bytes using syswrite.
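A rough sketch of the whole read, modify, write cycle might look like the following. The sub name flip_bit and its arguments are made up for illustration, and it assumes the file is opened read/write and that the offset lies inside the file:

```perl
use strict;
use warnings;
use Fcntl qw( SEEK_SET );

# Flip one bit of one byte in place and return the new byte value.
# $bit counts from 0 (least significant) to 7.
sub flip_bit {
    my( $path, $offset, $bit ) = @_;

    open my $fh, '+<:raw', $path or die "open $path: $!";

    defined sysseek( $fh, $offset, SEEK_SET ) or die "sysseek: $!";
    my $got = sysread( $fh, my $byte, 1 );
    die "sysread: $!" unless defined $got;
    die "offset $offset is past the end of $path" unless $got == 1;

    $byte ^= chr( 1 << $bit );      # string xor toggles just that bit

    # sysread moved the position on by one byte, so seek back before writing.
    defined sysseek( $fh, $offset, SEEK_SET ) or die "sysseek: $!";
    defined syswrite( $fh, $byte, 1 ) or die "syswrite: $!";
    close $fh or die "close $path: $!";

    return ord $byte;
}
```

Something like flip_bit( 'data.bin', 700_004, 3 ) would then touch a single byte of the file, however large the file is.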

The only caveat is if you want to delete or insert bytes (or worse, odd bits); then you would have to do something a little different. That is usually not the case though. Most file formats are at the very least byte aligned.


Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.

Re^2: pack/unpack binary editing
by nobull (Friar) on Feb 08, 2005 at 12:51 UTC
    Why do you recommend sysseek/sysread/syswrite over just opening the file in binary mode and using seek/read/print?

      Mostly because I am sure that I won't be getting any interference from IOLayers, Unicode conversions or whatever. That may be paranoia, but I believe that I have had the situation where a random piece of binary data has looked sufficiently like unicode to cause it to be upgraded by some action. This may have been on 5.6.1, before the unicode support was sorted out, but why risk it?

      Also, if you're randomly accessing the file and reading bytes, any buffering Perl or the C-runtime does is unlikely to be helpful. I have some evidence that on Win32, you can get a non-useful interaction between PerlIO's caching efforts and those done by the OS itself. One I can avoid, the other not, so I avoid the one I can.

      Let me turn your question around: Why wouldn't you use sysread/syswrite/sysseek when processing a binary file?


        Mostly because I am sure that I won't be getting any interference from IOLayers, Unicode conversions or whatever. That may be paranoia, but I believe that I have had the situation where a random piece of binary data has looked sufficiently like unicode to cause it to be upgraded by some action.
        While indeed there are situations in which it can be necessary to use sys*(), it is also true that binmode(), or open()'s '<:raw' mode, should take care of your concerns altogether.

        Then one can use Perl's typical IO {operators,functions}. Since the OP underlined that he has to process a whole 9 GB file, chances are that it may be possible to do it one chunk (whatever this may mean, size-wise) at a time with good ol' while (<$fh>), provided that $/ is set accordingly (e.g. local $/=\512).
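        A minimal sketch of that chunk-at-a-time loop, assuming 512-byte records and using a count of the set bits as a stand-in for whatever the real per-chunk work is (the sub name is made up):

```perl
use strict;
use warnings;

# Count the 1 bits in a file, reading it in fixed-size records.
sub count_set_bits {
    my( $path ) = @_;

    open my $fh, '<:raw', $path or die "open $path: $!";
    local $/ = \512;                # <$fh> now returns at most 512 bytes per read

    my $total = 0;
    while( my $chunk = <$fh> ) {    # perl wraps this test in defined() for us
        # A '%32b*' checksum adds up the bits, i.e. counts the ones.
        $total += unpack '%32b*', $chunk;
    }
    close $fh;

    return $total;
}
```

        The last record is simply shorter when the file size is not a multiple of 512.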

        This may have been on 5.6.1, before the unicode support was sorted out, but why risk it?
        AFAIK unicode support has not been "sorted out"; although I neither need it nor have ever used it, my understanding is that it is being constantly improved. What has been sorted out is the automatic unicode handling (depending on an environment variable, which somehow forced *NIX users to use binmode() too, something they're not used to!)
Re^2: pack/unpack binary editing
by tperdue (Sexton) on Feb 08, 2005 at 13:41 UTC
    Actually I am doing something a little different. I have to work on bit boundaries and not byte. I may have to throw away a bit here and there.

      Hmm. Then you have a problem that will require a little more effort. How will you determine which bits to throw away?

      If you can decide that whilst moving through the file in a single direction, or if you can construct a set of "editing instructions" ("delete bit 3 of byte 700004", "insert '010' at bit 6 of byte 3002", etc.) whilst treating the file as read-only, then you can sort those instructions into byte/bit sequence.

      You can then do the editing in a second linear pass through the file. You would keep a running buffer (0 to 7 bits) of any odd bits, prepend those to each buffer as you read it in, make any modifications to that chunk of bits, and then write int( bits_in_memory / 8 ) bytes back out, retaining the leftover bits. Rinse and repeat till done.

      The problem with that is that when you re-order the editing instructions, you will need to adjust their byte/bit positions to account for the shifts caused by the edits that come earlier in the sequence. Not a hugely onerous task, but one that would need thorough testing on small files before you start screwing with the big one.
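      A minimal sketch of such a linear editing pass, handling deletions only. It assumes the positions are absolute bit offsets into the original file (counted from the most significant bit of byte 0) and already sorted ascending, which sidesteps the shifting problem rather than solving it. The sub name, the argument layout and the zero padding of the final byte are all assumptions:

```perl
use strict;
use warnings;

# Copy $in to $out, dropping the bits listed in @$delete.
sub delete_bits {
    my( $in, $out, $delete, $chunk_size ) = @_;
    $chunk_size ||= 4096;

    open my $ifh, '<:raw', $in  or die "open $in: $!";
    open my $ofh, '>:raw', $out or die "open $out: $!";

    my @todo  = @$delete;   # work on a copy of the caller's list
    my $carry = '';         # 0 to 7 leftover bits, as a string of '0'/'1'
    my $base  = 0;          # input bit offset of the current chunk

    while( 1 ) {
        my $n = sysread( $ifh, my $buf, $chunk_size );
        die "sysread: $!" unless defined $n;
        last unless $n;

        my $bits = unpack 'B*', $buf;       # only this chunk is expanded
        my $end  = $base + 8 * $n;

        # Pick out the deletions that fall in this chunk and apply them
        # right to left, so the remaining offsets stay valid.
        my @here;
        push @here, shift( @todo ) - $base while @todo && $todo[0] < $end;
        substr( $bits, $_, 1, '' ) for reverse @here;

        # Prepend the odd bits from last time, write the whole bytes,
        # and keep whatever is left over for the next chunk.
        $bits = $carry . $bits;
        my $whole = 8 * int( length( $bits ) / 8 );
        $carry = substr( $bits, $whole );
        print {$ofh} pack 'B*', substr( $bits, 0, $whole );

        $base = $end;
    }

    # Assumption: pad the final partial byte with zero bits.
    print {$ofh} pack 'B*', $carry . '0' x ( 8 - length $carry )
        if length $carry;

    close $ofh or die "close $out: $!";
    close $ifh;
    return;
}
```

      Writing to a second file rather than editing in place also leaves the original untouched while the logic is being tested on small files.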

      It really depends on your answer to the question I posed first: how will the sequence of edits be determined? The answer to that will define the best strategy.


        I have a frame of 480 bytes which contains a sync word 10 bits long. I need to be able to find the start of my framing, which may not occur on a byte boundary. I also have to take into consideration that the file may skew occasionally, so I'd have to check frame by frame for my sync word. I've done this successfully using pack and unpack but it takes forever. I would like to slide through the file without having to pack or unpack.