Re^7: Out of memory problems

Okay. There was one definite error in the program I posted here, though funnily enough, the one-liner would have worked.

Due to my affectation of using the -l switch, the frames would have been re-written to the output file with newlines appended. Probably not good given it's a binary file:(

Here's the corrected code. Using printf rather than print 'fixes' the problem. Alternatively, removing the -l would also.

#! perl -slw
use strict;
use bytes; 

open IN,  '< :raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";
open OUT, '> :raw', $ARGV[ 1 ] or die "$ARGV[ 1 ] : $!";

local $/ = \384; ## Read file in 384 byte chunks.

while( <IN> ) {
    printf OUT unpack 'x2 a190 x2 a58', $_;
}

close IN;
close OUT;
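
A minimal sketch of why the choice between print and printf matters here, using only core Perl. The -l switch sets $\ to "\n", which print appends on every call and printf never does. One caveat: handing the data itself to printf as the format string is only safe if the data never contains a '%'. Passing an explicit '%s%s' format sidesteps that. The helper name emit_frame, the synthetic frame, and the in-memory filehandles are illustrative assumptions, not part of the program above.

```perl
use strict;
use warnings;

# Hypothetical helper: write the two retained fields of one frame to $fh.
# The explicit '%s%s' format stops '%' bytes in the data from being
# treated as conversions, and printf never appends $\.
sub emit_frame {
    my( $fh, $frame ) = @_;
    printf {$fh} '%s%s', unpack 'x2 a190 x2 a58', $frame;
}

# One synthetic 384-byte frame whose first payload is all '%' characters.
my $frame = "\xf4X" . '%' x 190 . "\xf4X" . '5' x 58 . 'X' x 132;

my( $with_print, $with_printf ) = ( '', '' );
{
    local $\ = "\n";    # what the -l switch does to output

    open my $out, '>', \$with_print or die $!;
    print {$out} unpack 'x2 a190 x2 a58', $frame;   # $\ is appended
    close $out;

    open $out, '>', \$with_printf or die $!;
    emit_frame( $out, $frame );                     # $\ is not appended
    close $out;
}

print length( $with_print ), ' ', length( $with_printf ), "\n";
```

The print version comes out one byte longer per frame than the 248 bytes actually wanted, which is exactly the stray newline described above.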

As a test, I used this program to generate a test file with 8.5 million records (~3.2GB):

#! perl -slw
use strict;
use bytes;

our $FRAMES ||= 1000;

binmode STDOUT, ':raw';

for ( 1 .. $FRAMES ) {
    printf "\xf4X1" . '2' x 188 . "3\xf4X4" 
        . '5' x 56 . '6' . 'X' x 132;
}

It generates frames of the form:

|X122 ... 223|X455 ... 556XX ... XX

which after processing should end up as

122 ... 223455 ... 556

I then used the one-liner to process it.

P:\test>401318-gen.pl -FRAMES=8500000 > 401318.dat

P:\test>dir 401318.dat
 Volume in drive P has no label.
 Volume Serial Number is BCCA-B4CC

 Directory of P:\test

24/10/2004  06:54     3,264,000,000 401318.dat
               1 File(s)  3,264,000,000 bytes
               0 Dir(s)  51,747,094,528 bytes free

P:\test>perl -C0 -mbytes -e"BEGIN{$/=\384}" -ne"print unpack 'x2 a190 x2 a58', $_" <401318.dat >401318.out

P:\test>dir 401318*
 Volume in drive P has no label.
 Volume Serial Number is BCCA-B4CC

 Directory of P:\test

24/10/2004  06:24               193 401318-gen.pl
24/10/2004  06:54     3,264,000,000 401318.dat
24/10/2004  07:02     2,108,000,000 401318.out
24/10/2004  06:24               303 401318.pl
               4 File(s)  5,372,000,496 bytes
               0 Dir(s)  49,639,051,264 bytes free

As you can see, the 8,500,000 x 384 = 3,264,000,000 bytes of the input file are reduced to 2,108,000,000 (8.5e6 x 248) as required. A visual inspection of the first and last 10 frames shows them to be correct.
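
For anyone who wants to check those sizes, here is a small sketch of the arithmetic. Each 384-byte frame keeps 190 + 58 = 248 bytes. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $frames   = 8_500_000;   # frames in the test file
my $frame_in = 384;         # bytes per input frame
my $kept     = 190 + 58;    # bytes retained by 'x2 a190 x2 a58'

# %.0f rather than %d so the totals print correctly on 32-bit perls too.
printf "input:  %.0f bytes\n", $frames * $frame_in;
printf "output: %.0f bytes\n", $frames * $kept;
```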

The total time taken to process using the one-liner was 4 minutes! (A bit better than your 7 or 8 hours:)

You may also notice that I modified the unpack format to use 'a' rather than 'A' for the retained chunks. According to perlfunc, 'a' is defined for arbitrary binary data whereas 'A' is for ASCII text. When packing, the former null pads and the latter space pads if the variable does not contain enough data to satisfy the format. When unpacking, 'A' also strips trailing spaces and nulls whereas 'a' returns the bytes untouched, so 'a' is the safer choice for binary frames.
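
A quick way to see the difference between the two formats, as a minimal sketch on a made-up five-byte string (the data and variable names are illustrative only):

```perl
use strict;
use warnings;

# Five bytes of 'binary' data that happen to end in NUL, space, NUL.
my $data = "ab\x00 \x00";

my( $raw )     = unpack 'a5', $data;   # bytes returned untouched
my( $trimmed ) = unpack 'A5', $data;   # trailing spaces and NULs stripped

printf "a5 -> %d bytes, A5 -> %d bytes\n",
    length( $raw ), length( $trimmed );

# Packing short data shows the padding difference.
my $null_padded  = pack 'a4', 'ab';    # "ab\0\0"
my $space_padded = pack 'A4', 'ab';    # "ab  "
```

With 'A', a frame whose retained chunk happened to end in spaces or nulls would come back short, which would silently shift everything written after it.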

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon


Replies are listed 'Best First'.
Re^8: Out of memory problems by tperdue (Sexton) on Oct 26, 2004 at 10:48 UTC
I tried the above program but received an "Invalid conversion in printf:" error. I changed the printf to print and the program ran great. I'm now getting what I'd expect out of the program and it's extremely fast. Thanks for all your time and patience. What would you do if you had an occasion where the data wasn't aligned (i.e. the first pattern was offset by 10 bytes) and the user had no way of knowing?
Re^9: Out of memory problems by BrowserUk (Patriarch) on Oct 26, 2004 at 11:39 UTC
What would you do if you had an occasion where the data wasn't aligned (i.e. the first pattern was offset by 10 bytes) and the user had no way of knowing?

This reads a two-frame-sized chunk and uses a regex to discover the alignment of the first full frame within it. If the offset is non-zero, it discards that many bytes from the front of the buffer (and issues a warning to say so), then tops up the buffer to two full frames, thereby aligning the read pointer to the start of the third frame. It then processes and outputs the first two full frames, and processes the rest a frame at a time as before.

#! perl -sw
use strict;
use bytes;

open IN,  '< :raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";
open OUT, '> :raw', $ARGV[ 1 ] or die "$ARGV[ 1 ] : $!";

## Grab a double buffer load first time so we can check & correct alignment
local $/ = \768;
my $buf = <IN>; ## Read two frames worth

## Check alignment. Assumes the xf4 .191 xf4 is unique per frame?
## The /s modifier lets . match any byte, including newlines.
$buf =~ m[(\xF4.{191}\xF4)]s
    or die "No frame marker found in the first 768 bytes";

## Record the offset to the first frame
my $offset = $-[0];

## If there was an offset to the first match
if( $offset != 0 ) {
    ## Chop off the leading junk
    substr( $buf, 0, $offset, '' );
    ## Top up the buffer to two full frames
    read( IN, $buf, $offset, 768 - $offset );
    warn "$offset bytes discarded from front of file.";
}

## Process the first two whole frames
print OUT unpack 'x2 a190 x2 a58 x132' x 2, $buf;

## Now process as before
local $/ = \384; ## Read file in 384 byte chunks.

while( <IN> ) {
    print OUT unpack 'x2 a190 x2 a58', $_;
}

close IN;
close OUT;
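
The alignment check can be tried without a multi-gigabyte file. This is only a sketch on an in-memory buffer, assuming the same frame layout as the generator's output. The helper name first_frame_offset and the 'J' junk bytes are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offset of the first full frame
# in $buf, or undef if no pair of frame markers is found.
sub first_frame_offset {
    my( $buf ) = @_;
    # /s lets '.' match any byte, including "\n"
    return $buf =~ m[\xF4.{191}\xF4]s ? $-[0] : undef;
}

# One synthetic 384-byte frame in the generator's layout.
my $frame = "\xf4X1" . '2' x 188 . "3\xf4X4" . '5' x 56 . '6' . 'X' x 132;

# Two frames preceded by 10 bytes of junk.
my $buf = 'J' x 10 . $frame x 2;

my $offset = first_frame_offset( $buf );
print "first frame starts at byte $offset\n";

# Drop the junk, then pull the retained fields out of the first frame.
substr( $buf, 0, $offset, '' );
my $kept = join '', unpack 'x2 a190 x2 a58', $buf;
print length( $kept ), " bytes kept\n";
```

As with the full program, this relies on the \xF4 markers 192 bytes apart not turning up by chance inside the payload.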
Re^10: Out of memory problems by tperdue (Sexton) on Oct 26, 2004 at 12:31 UTC
Sorry for being a headache here, but working with binary data in Perl is new to me. What if the file isn't byte aligned, meaning I need to work at a bit level?
Re^11: Out of memory problems by BrowserUk (Patriarch) on Oct 26, 2004 at 12:45 UTC
Re^12: Out of memory problems by tperdue (Sexton) on Oct 26, 2004 at 13:14 UTC
Re^10: Out of memory problems by tperdue (Sexton) on Oct 26, 2004 at 16:50 UTC
Had an error when I ran the code. I'm getting a "Not enough arguments for read near $offset" error in the $buf .= read ( IN, $offset ); line. I modified it to $buf .= read ( IN, $buf, $offset) and then got an "'x' outside of string in unpack" error for the print OUT unpack 'x2 a190 x2 a58 x132' x 2, $buf line. I replaced the x with * but got a numerical error. What am I doing wrong?
Re^11: Out of memory problems by BrowserUk (Patriarch) on Oct 26, 2004 at 17:44 UTC