in reply to Re^6: Out of memory problems
in thread Out of memory problems

Okay. There was one definite error in the program I posted here, though funnily enough, the one-liner would have worked.

Due to my affectation of using the -l switch, the frames would have been re-written to the output file with newlines appended. Probably not good given it's a binary file:(

Here's the corrected code. Using printf rather than print 'fixes' the problem. Alternatively, removing the -l switch would also work.

#! perl -slw
use strict;
use bytes;

open IN,  '< :raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";
open OUT, '> :raw', $ARGV[ 1 ] or die "$ARGV[ 1 ] : $!";

local $/ = \384;   ## Read file in 384 byte chunks.

while( <IN> ) {
    printf OUT unpack 'x2 a190 x2 a58', $_;
}

close IN;
close OUT;
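For anyone wanting to see the effect of -l in isolation, here is a minimal sketch. It assumes that -l amounts to setting $\ to "\n" for output, and it writes to an in-memory file so nothing touches the disk. The helper name written_with and the 4 byte 'frame' are made up for illustration. Note that printf takes its first argument as a format string, so the sketch passes an explicit '%s' rather than the data itself; a '%' byte in the data would otherwise be read as a conversion.

```perl
use strict;
use warnings;

## Hypothetical helper: write $data to an in-memory file with -l style
## settings ($\ = "\n") using print or printf, and return what was written.
sub written_with {
    my( $how, $data ) = @_;
    my $out = '';
    open my $fh, '>:raw', \$out or die "in-memory open failed: $!";
    local $\ = "\n";                ## what the -l switch sets up for output
    if( $how eq 'print' ) { print  $fh $data }
    else                  { printf $fh '%s', $data }
    close $fh;
    return $out;
}

my $frame = "\x01\x02%\x03";        ## made-up 4 byte 'frame' containing a '%'

printf "print  wrote %d bytes\n", length written_with( 'print',  $frame );
printf "printf wrote %d bytes\n", length written_with( 'printf', $frame );
```

print appends $\ to every write, printf does not, which is why either dropping -l or switching to printf with a fixed format leaves the binary data alone.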

As a test, I used this program to generate a test file with 8.5 million records (~3.2GB):

#! perl -slw
use strict;
use bytes;

our $FRAMES ||= 1000;

binmode STDOUT, ':raw';

for ( 1 .. $FRAMES ) {
    printf "\xf4X1" . '2' x 188 . "3\xf4X4" . '5' x 56 . '6' . 'X' x 132;
}

It generates frames of the form:

|X122 ... 223|X455 ... 556XX ... XX

which after processing should end up as

122 ... 223455 ... 556
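As a quick check of that layout, here is a small sketch that builds one frame exactly as the generator does and runs it through the same unpack template. The helper name strip_frame is made up for illustration.

```perl
use strict;
use warnings;

## Hypothetical helper: keep only the two payload fields of one 384 byte frame.
sub strip_frame {
    my( $frame ) = @_;
    return join '', unpack 'x2 a190 x2 a58', $frame;
}

## One frame, built the same way as in the generator.
my $frame = "\xf4X1" . '2' x 188 . "3\xf4X4" . '5' x 56 . '6' . 'X' x 132;
my $kept  = strip_frame( $frame );

printf "%d byte frame -> %d bytes kept\n", length $frame, length $kept;
print substr( $kept, 0, 3 ), ' ... ', substr( $kept, -3 ), "\n";
```

Each x2 skips a marker byte plus the byte after it, a190 and a58 keep the two payload runs, and the trailing 132 bytes are simply never asked for.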

I then used the one-liner to process it.

P:\test>401318-gen.pl -FRAMES=8500000 > 401318.dat

P:\test>dir 401318.dat
 Volume in drive P has no label.
 Volume Serial Number is BCCA-B4CC

 Directory of P:\test

24/10/2004  06:54     3,264,000,000 401318.dat
               1 File(s)  3,264,000,000 bytes
               0 Dir(s)  51,747,094,528 bytes free

P:\test>perl -C0 -mbytes -e"BEGIN{$/=\384}" -ne"print unpack 'x2 a190 x2 a58', $_" <401318.dat >401318.out

P:\test>dir 401318*
 Volume in drive P has no label.
 Volume Serial Number is BCCA-B4CC

 Directory of P:\test

24/10/2004  06:24               193 401318-gen.pl
24/10/2004  06:54     3,264,000,000 401318.dat
24/10/2004  07:02     2,108,000,000 401318.out
24/10/2004  06:24               303 401318.pl
               4 File(s)  5,372,000,496 bytes
               0 Dir(s)  49,639,051,264 bytes free

As you can see, the 8,500,000 x 384 = 3,264,000,000 bytes of the input file are reduced to 2,108,000,000 (8.5e6 x 248) as required. A visual inspection of the first and last 10 frames shows them to be correct.
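The arithmetic is easy to reproduce for any frame count. A minimal sketch follows; the helper name expected_sizes is made up, and the 384 and 190 + 58 figures come from the frame layout and unpack template used in this thread.

```perl
use strict;
use warnings;

## Hypothetical helper: bytes read and bytes written for a given number of
## 384 byte frames, keeping the 190 + 58 payload bytes from each.
sub expected_sizes {
    my( $frames ) = @_;
    return ( $frames * 384, $frames * ( 190 + 58 ) );
}

my( $in, $out ) = expected_sizes( 8_500_000 );
print "input: $in bytes, output: $out bytes\n";
```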

The total time taken to process using the one-liner was 4 minutes! (A bit better than your 7 or 8 hours:)

You may also notice that I modified the unpack format to use 'a' rather than 'A' for the retained chunks. According to perlfunc, 'a' is intended for arbitrary binary data whereas 'A' is for space-padded text. When packing, the former null pads and the latter space pads if the variable does not contain enough data to satisfy the format. When unpacking, 'a' returns the bytes unchanged whereas 'A' strips trailing spaces and nulls, so 'a' is the safer choice for this application.
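A quick way to check what the two letters actually do is to try them on a field that ends in a space and a NUL. This is only a sketch with a made-up 4 byte field; as far as I can tell from perlfunc, the padding difference belongs to pack, while on unpack 'A' strips trailing spaces and nulls and 'a' does no stripping.

```perl
use strict;
use warnings;

my $field = "\x01\x02\x20\x00";   ## made-up 4 byte field ending in a space and a NUL

my( $raw )      = unpack 'a4', $field;   ## bytes returned untouched
my( $stripped ) = unpack 'A4', $field;   ## trailing spaces and NULs removed

printf "unpack: a4 gives %d bytes, A4 gives %d bytes\n",
    length $raw, length $stripped;

## The padding difference shows up on the pack side.
printf "pack:   a4 pads with 0x%02x, A4 pads with 0x%02x\n",
    ord substr( pack( 'a4', 'x' ), -1 ),
    ord substr( pack( 'A4', 'x' ), -1 );
```

If a retained chunk of a real frame happened to end in 0x20 or 0x00 bytes, 'A' would silently shorten the output, so the change to 'a' matters for binary data.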


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Re^8: Out of memory problems
by tperdue (Sexton) on Oct 26, 2004 at 10:48 UTC
    I tried the above program but received an "Invalid conversion in printf:" error. I changed the printf to print and the program ran great. I'm now getting what I'd expect out of the program and it's extremely fast. Thanks for all your time and patience. What would you do if you had an occasion where the data wasn't aligned (i.e. the first pattern was offset by 10 bytes) and the user had no way of knowing?
      What would you do if you had an occasion where the data wasn't aligned (i.e. the first pattern was offset by 10 bytes) and the user had no way of knowing?

      This reads a chunk two frames long and uses a regex to discover the alignment of the first full frame within it. If the offset is non-zero, it discards that many bytes from the front of the buffer (and issues a warning to say so), then tops up the buffer to two full frames, thereby aligning the read pointer to the start of the 3rd frame. It then processes and outputs the first two full frames, and processes the rest a frame at a time as before.

      #! perl -sw
      use strict;
      use bytes;

      open IN,  '< :raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";
      open OUT, '> :raw', $ARGV[ 1 ] or die "$ARGV[ 1 ] : $!";

      ## Grab a double buffer load first time so we can check & correct alignment
      local $/ = \768;
      my $buf = <IN>;   ## Read two frames worth

      ## Check alignment. Assumes the xf4 .191 xf4 is unique per frame?
      ## The /s lets '.' match newline bytes in the binary data.
      $buf =~ m[(\xF4.{191}\xF4)]s
          or die "No frame marker found in the first 768 bytes";

      ## Record the offset to the first frame
      my $offset = $-[0];

      ## If there was an offset to the first match
      if( $offset != 0 ) {
          ## Chop off the leading junk
          substr( $buf, 0, $offset, '' );

          ## Top up the buffer to two full frames
          read( IN, $buf, $offset, 768 - $offset );

          warn "$offset bytes discarded from front of file.";
      }

      ## Process the first two whole frames
      print OUT unpack 'x2 a190 x2 a58 x132 ' x 2, $buf;

      ## Now process as before
      local $/ = \384;   ## Read file in 384 byte chunks.

      while( <IN> ) {
          print OUT unpack 'x2 a190 x2 a58', $_;
      }

      close IN;
      close OUT;
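The alignment check can be tried on its own without a file. The sketch below is only an illustration: the helper name first_frame_offset is made up, the frame is the synthetic one from the test generator earlier in the thread, and the 10 junk bytes mimic the offset asked about. One caveat: in that synthetic layout the two markers sit 192 bytes apart in a 384 byte frame, so the pattern would match just as happily starting at the second marker. The approach relies on the first match really being a frame start.

```perl
use strict;
use warnings;

## Hypothetical helper: return the offset of the first marker pair in $buf,
## or undef if none is found. The /s lets '.' match "\n" bytes too.
sub first_frame_offset {
    my( $buf ) = @_;
    return $buf =~ m[\xF4.{191}\xF4]s ? $-[0] : undef;
}

## One synthetic frame in the same layout as the test generator.
my $frame = "\xf4X1" . '2' x 188 . "3\xf4X4" . '5' x 56 . '6' . 'X' x 132;

## 10 junk bytes followed by frames, cut to the 768 bytes of the first read.
my $buf = substr( 'J' x 10 . $frame x 3, 0, 768 );

my $offset = first_frame_offset( $buf );
print "first frame starts at offset $offset\n";
```

Once the offset is known, discarding that many bytes and reading the same number again leaves exactly two whole frames in the buffer and the file position on a frame boundary.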

        Sorry for being a headache here but doing binary with perl is new to me. What if the file isn't byte aligned, meaning I need to work at a bit level?
        I had an error when I ran the code. I'm getting a "Not enough arguments for read" error near $offset in the line $buf .= read ( IN, $offset );. I modified it to $buf .= read ( IN, $buf, $offset ) and then got an "'x' outside of string in unpack" error for the line print OUT unpack 'x2 a190 x2 a58 x132' x 2, $buf. I replaced the x with * but now get a numeric error. What am I doing wrong?