Okay. There was one definite error in the program I posted here, though funnily enough, the one-liner would have worked.
Due to my affectation of using the -l switch, the frames would have been re-written to the output file with newlines appended. Probably not good given it's a binary file :(
Here's the corrected code. Using printf rather than print 'fixes' the problem, because printf does not append the output record separator that -l sets. Alternatively, removing the -l would also work.
#! perl -slw
use strict;
use bytes;
open IN, '< :raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";
open OUT, '> :raw', $ARGV[ 1 ] or die "$ARGV[ 1 ] : $!";
local $/ = \384; ## Read file in 384 byte chunks.
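while( <IN> ) {
    ## Explicit format: otherwise printf would take the first unpacked chunk as its format string.
    printf OUT '%s%s', unpack 'x2 a190 x2 a58', $_;
}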
close IN;
close OUT;
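To make the difference concrete, here is a minimal sketch (not part of the original scripts) that writes the same two chunks with print and with printf while $\ is set the way -l sets it. It uses in-memory filehandles purely for illustration; the variable names are my own.

```perl
use strict;
use warnings;

# Write to in-memory filehandles so the result is easy to inspect.
my ( $with_print, $with_printf ) = ( '', '' );

{
    local $\ = "\n";    # what -l does to the output record separator

    open my $out1, '>', \$with_print or die "in-memory open failed: $!";
    print $out1 'AB', 'CD';              # $\ is appended after the list
    close $out1;

    open my $out2, '>', \$with_printf or die "in-memory open failed: $!";
    printf $out2 '%s%s', 'AB', 'CD';     # printf never appends $\
    close $out2;
}

printf "print wrote %d bytes, printf wrote %d bytes\n",
    length($with_print), length($with_printf);
```

Passing an explicit '%s%s' format also keeps any % bytes in the data from being interpreted as conversions.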
As a test, I used this program to generate a test file with 8.5 million records (~3.2GB):
#! perl -slw
use strict;
use bytes;
our $FRAMES ||= 1000;
binmode STDOUT, ':raw';
for ( 1 .. $FRAMES ) {
    ## 2 marker bytes + 190 kept + 2 marker bytes + 58 kept + 132 filler = 384 bytes.
    printf "\xf4X1" . '2' x 188 . "3\xf4X4"
        . '5' x 56 . '6' . 'X' x 132;
}
It generates frames of the form:
|X122 ... 223|X455 ... 556XX ... XX
which after processing should end up as:
122 ... 223455 ... 556
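As a quick sanity check of the format, this sketch pushes a single synthetic frame through the same unpack template. The frame is built exactly as in the generator; the variable names are mine.

```perl
use strict;
use warnings;

# One synthetic 384-byte frame, built the same way as in the generator.
my $frame = "\xf4X1" . '2' x 188 . "3\xf4X4" . '5' x 56 . '6' . 'X' x 132;

# Skip 2 marker bytes, keep 190, skip 2 marker bytes, keep 58;
# the trailing 132 filler bytes are simply never asked for.
my ( $first, $second ) = unpack 'x2 a190 x2 a58', $frame;

printf "frame: %d bytes, kept: %d + %d = %d bytes\n",
    length($frame), length($first), length($second),
    length($first) + length($second);
```

The two kept chunks total 248 bytes per 384-byte frame, which is where the expected output size comes from.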
I then used the one-liner to process it.
P:\test>401318-gen.pl -FRAMES=8500000 > 401318.dat
P:\test>dir 401318.dat
Volume in drive P has no label.
Volume Serial Number is BCCA-B4CC
Directory of P:\test
24/10/2004 06:54 3,264,000,000 401318.dat
1 File(s) 3,264,000,000 bytes
0 Dir(s) 51,747,094,528 bytes free
P:\test>perl -C0 -mbytes -e"BEGIN{$/=\384}" -ne"print unpack 'x2 a190 x2 a58', $_" <401318.dat >401318.out
P:\test>dir 401318*
Volume in drive P has no label.
Volume Serial Number is BCCA-B4CC
Directory of P:\test
24/10/2004 06:24 193 401318-gen.pl
24/10/2004 06:54 3,264,000,000 401318.dat
24/10/2004 07:02 2,108,000,000 401318.out
24/10/2004 06:24 303 401318.pl
4 File(s) 5,372,000,496 bytes
0 Dir(s) 49,639,051,264 bytes free
As you can see, the 8,500,000 x 384 = 3,264,000,000 bytes of the input file are reduced to 2,108,000,000 (8.5e6 x 248) as required. A visual inspection of the first and last 10 frames shows them to be correct.
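The visual inspection could also be automated. The following is only a sketch, assuming the output was produced from the test generator's frames; count_bad_records is a hypothetical helper, not something from the scripts above, and it is demonstrated on an in-memory "file" rather than the real 401318.out.

```perl
use strict;
use warnings;

# Hypothetical helper: returns how many 248-byte records read from $fh
# differ from the record the test generator's frames should reduce to.
sub count_bad_records {
    my ($fh) = @_;
    my $expected = '1' . '2' x 188 . '3' . '4' . '5' x 56 . '6';
    local $/ = \248;                 # read fixed-size 248-byte records
    my $bad = 0;
    while ( my $rec = <$fh> ) {
        $bad++ if $rec ne $expected;
    }
    return $bad;
}

# Demonstration: three good records followed by a short, damaged one.
my $good = '1' . '2' x 188 . '3' . '4' . '5' x 56 . '6';
my $data = $good x 3 . 'truncated';
open my $fh, '<', \$data or die "in-memory open failed: $!";
my $bad = count_bad_records($fh);
close $fh;

print "$bad bad record(s)\n";
```

For the real file, the same helper would be handed a handle opened with '<:raw' on the output file instead.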
The total time taken to process using the one-liner was 4 minutes! (A bit better than your 7 or 8 hours:)
You may also notice that I modified the unpack format to use 'a' rather than 'A' for the retained chunks. According to perlfunc, 'a' is for arbitrary binary data whereas 'A' is for ASCII text. When packing, the former null pads and the latter space pads short values, which is not a concern here. When unpacking, though, 'A' strips trailing spaces and nulls from the extracted field, while 'a' returns the bytes untouched, so 'a' is the right choice for binary frames.
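A small sketch of how the two formats behave when unpacking, using a made-up 4-byte field whose last two bytes are a null and a space:

```perl
use strict;
use warnings;

# A 4-byte field whose last two bytes happen to be a null and a space.
my $field = "ab\0 ";

my ($verbatim) = unpack 'a4', $field;   # bytes returned untouched
my ($stripped) = unpack 'A4', $field;   # trailing spaces and nulls removed

printf "a4 -> %d bytes, A4 -> %d bytes\n",
    length($verbatim), length($stripped);
```

With 'A', any frame chunk that happened to end in space or null bytes would come out short, which would silently shrink the output file.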