I'm looking at a 9 gig file and unpacking would result in 72 gigs.
There is absolutely no reason to convert your whole file into asciized binary.
You can access every bit in your file by using sysseek to position the read head and sysread to read a single byte or 4 bytes or 20.
You can then manipulate the bits of the bytes you read in using vec or boolean logic, or by unpacking to '0's & '1's if you prefer.
Once you have twiddled your bits, you can write them back by repositioning the file pointer with sysseek and writing out the modified bytes using syswrite.
The only caveat is that if you want to delete or insert bytes (or worse, odd bits), you would have to do something a little different. That is usually not the case though. Most file formats are at the very least byte aligned.
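As a rough illustration of the in-place approach described above, here is a minimal sketch that toggles a single bit without reading the rest of the file. The helper name flip_bit and its arguments are made up for this example. It assumes bits are numbered the way vec() numbers them (bit 0 is the least significant bit of the byte), and that your perl was built with large file support if the offsets go past 4 GB.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: toggle one bit of a file in place.
# $bit_in_byte counts from 0 (least significant bit), matching vec().
sub flip_bit {
    my ($path, $byte_offset, $bit_in_byte) = @_;

    # Read/write, no IO layers that could translate the data.
    open my $fh, '+<:raw', $path or die "open $path: $!";

    # Position the file pointer and read exactly one byte.
    defined sysseek($fh, $byte_offset, SEEK_SET) or die "sysseek: $!";
    my $got = sysread($fh, my $buf, 1);
    die "sysread: $!" unless defined $got;
    die "offset $byte_offset is past end of file\n" unless $got == 1;

    # vec() with a width of 1 addresses individual bits of the string.
    vec($buf, $bit_in_byte, 1) ^= 1;

    # Reposition (the read advanced the pointer) and write the byte back.
    defined sysseek($fh, $byte_offset, SEEK_SET) or die "sysseek: $!";
    my $wrote = syswrite($fh, $buf, 1);
    die "syswrite: $!" unless defined $wrote && $wrote == 1;

    close $fh or die "close: $!";
    return;
}
```

Something like flip_bit( 'big.dat', 700_004, 3 ) would then touch one byte of the file and nothing else. Reading 4 or 20 bytes at a time works the same way; only the length passed to sysread and syswrite changes.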
Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.
| [reply] |
Why do you recommend sysseek/sysread/syswrite over just opening the file in binary mode and using seek/read/print?
| [reply] |
Mostly because I am sure that I won't be getting any interference from IOLayers, Unicode conversions or whatever. That may be paranoia, but I believe that I have had the situation where a random piece of binary data has looked sufficiently like unicode to cause it to be upgraded by some action. This may have been on 5.6.1 before the unicode support was sorted out--but why risk it?
Also, if you're randomly accessing the file and reading bytes, any buffering Perl or the C-runtime does is unlikely to be helpful. I have some evidence that on Win32, you can get a non-useful interaction between PerlIO's caching efforts and those done by the OS itself. One I can avoid, the other not, so I avoid the one I can.
Let me turn your question around: Why wouldn't you use sysread/syswrite/sysseek when processing a binary file?
| [reply] |
Actually I am doing something a little different. I have to work on bit boundaries and not byte. I may have to throw away a bit here and there.
| [reply] |
Hmm. Then you have a problem that will require a little more effort. How will you determine which bits to throw away?
If you can do it whilst moving through the file in a single direction, or if you can construct a set of "editing instructions" ("delete bit 3 of byte 700004", "insert '010' at bit 6 of byte 3002", etc.) whilst treating the file as read-only, then sort those instructions into byte/bit sequence.
You can then do the editing in a second linear pass through the file. You would keep a running buffer (0 - 7 bits) of any odd bits, prepending those to the front of each buffer as you read it in. Make any modifications to that chunk of bits, then write int( bits-in-memory / 8 ) bytes back out, retaining the leftover bits. Rinse and repeat till done.
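The second pass described above might look something like the following. This is only a sketch under several assumptions of mine: the edit format, the names edit_bits and write_all, and the chunk size are all invented for illustration; positions are bit offsets into the original file (byte * 8 + bit, with bit 0 being the most significant bit of the byte, as unpack 'B*' sees it); there is at most one edit per position; and the result goes to a separate output file rather than back over the input. Only one chunk at a time is ever expanded to '0's and '1's.

```perl
use strict;
use warnings;

# Hypothetical edit format, positions in ORIGINAL-file bit offsets:
#   [ $pos, 'delete' ]          drop the bit at $pos
#   [ $pos, 'insert', '010' ]   insert these bits before the bit at $pos
sub edit_bits {
    my ($in_path, $out_path, $edits, $chunk_bytes) = @_;
    $chunk_bytes ||= 64 * 1024;

    my @todo = sort { $a->[0] <=> $b->[0] } @$edits;

    open my $in,  '<:raw', $in_path  or die "open $in_path: $!";
    open my $out, '>:raw', $out_path or die "open $out_path: $!";

    my $carry     = '';   # 0-7 leftover bits, as a '0'/'1' string
    my $chunk_bit = 0;    # original bit offset of the current chunk

    while (1) {
        my $got = sysread($in, my $buf, $chunk_bytes);
        die "sysread: $!" unless defined $got;
        last if $got == 0;

        my $bits = unpack 'B*', $buf;
        my $end  = $chunk_bit + length $bits;

        # Collect this chunk's edits, then apply them back to front so
        # earlier offsets in the chunk are not shifted by later edits.
        my @here;
        push @here, shift @todo while @todo && $todo[0][0] < $end;
        for my $e (reverse @here) {
            my ($pos, $op, $new) = @$e;
            my $at = $pos - $chunk_bit;
            if    ($op eq 'delete') { substr($bits, $at, 1, '') }
            elsif ($op eq 'insert') { substr($bits, $at, 0, $new) }
            else                    { die "unknown edit '$op'\n" }
        }

        # Prepend the leftover bits, write whole bytes, keep the rest.
        $bits = $carry . $bits;
        my $whole = int(length($bits) / 8) * 8;
        $carry = substr($bits, $whole);
        write_all($out, pack('B*', substr($bits, 0, $whole)));
        $chunk_bit = $end;
    }
    die "edit positions past end of input\n" if @todo;

    # Pad the final partial byte with zero bits (an arbitrary choice).
    write_all($out, pack('B8', $carry . '0' x (8 - length $carry)))
        if length $carry;

    close $in;
    close $out or die "close $out_path: $!";
    return;
}

# syswrite may write less than asked for, so loop until done.
sub write_all {
    my ($fh, $data) = @_;
    my $off = 0;
    while ($off < length $data) {
        my $n = syswrite($fh, $data, length($data) - $off, $off);
        die "syswrite: $!" unless defined $n;
        $off += $n;
    }
    return;
}
```

For instance, edit_bits( 'big.dat', 'big.new', [ [ 0, 'delete' ], [ 8, 'insert', '010' ] ] ) drops the very first bit and slips three bits in ahead of the second byte. Writing to a new file costs disk space, but it avoids the write position overtaking the read position when inserts outnumber deletes.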
The problem with that is that when you re-order the editing instructions, you will need to adjust the byte/bit positions to account for the shifts caused by edits that come earlier in the sequence. Not a hugely onerous task, but one that would need thorough testing on small files before you start screwing with the big one.
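To give a feel for the kind of adjustment involved, here is a small sketch for the simplest case only. It assumes deletions only, and it assumes each position was recorded as the bit offset seen after all the earlier deletions in the list had already been applied. The function name to_original_offsets is invented for the example. It converts such a list into sorted offsets into the untouched original file, which is the form a single linear pass wants.

```perl
use strict;
use warnings;

# Hypothetical helper: deletions only. Each input position is the bit
# offset as seen AFTER all earlier deletions in the list were applied.
# Returns the same deletions as sorted offsets into the original file.
sub to_original_offsets {
    my @sequential = @_;
    my @original;              # kept sorted ascending
    for my $pos (@sequential) {
        for my $gone (@original) {
            last if $gone > $pos;
            $pos++;            # a deleted bit sat at or before this one
        }
        @original = sort { $a <=> $b } @original, $pos;
    }
    return @original;
}

# Deleting offsets 2, 2, 0, 1 one after another removes the bits that
# originally lived at offsets 0, 2, 3 and 4.
print join(',', to_original_offsets(2, 2, 0, 1)), "\n";   # prints 0,2,3,4
```

This is quadratic in the number of edits and ignores inserts entirely, so treat it as a starting point for the small-file testing mentioned above rather than something to run against the big file as is.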
It really depends on your answer to the question I posed first: how will the sequence of edits be determined? The answer to that will define the best strategy.
| [reply] |