FreckledAvenger has asked for the wisdom of the Perl Monks concerning the following question:

I have written a program to parse a custom application log file. The file consists of "events" that I am interested in, amongst other junk that I am not. Each event has a number of attributes, and the script is written to accommodate new attributes that are added from time to time (which happens frequently in the application) without code modification.

The storage mechanism is an associative array whose key is the event number concatenated with a "|" and the attribute number (e.g. $event_number."|$attribute_number"). All events are read into memory and then, at the end of reading the file, written out in a comma-separated format.
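
To give a rough idea, the storage step looks something like this (parse_event() stands in for the real parsing code, and the /EVENT/ pattern is just illustrative):

# Simplified sketch of the storage step -- the real extraction of
# $event_number, $attribute_number and the value from a log line
# is more involved than shown here.
while ( my $line = <LOG> ) {
    next unless $line =~ /EVENT/;    # skip the junk lines (illustrative test)
    my ( $event_number, $attribute_number, $value ) = parse_event($line);
    $array{ $event_number . "|" . $attribute_number } = $value;
}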

This works fine for small logs, but when I started to process large logs recently I found that memory usage goes out of control. I therefore wrote a quick test script (based on the storage structure used by my program) to see how efficient associative arrays actually are:
for ( $index1=0; $index1 < 1000; $index1++ ) {
    for ( $index2=0; $index2 < 1000; $index2++ ) {
        $array{$index1."|$index2"} = "A";
    }
}
This had used nearly 100MB of RAM by the time it finished. Obviously large logs take ages to process, as the machine I run on pages itself to near death. Basically, the memory used by the data structure is far greater than the data actually stored.

Is there a more efficient way to store this information in memory? Or should I just accept that I will have to start storing the material in a temporary file (something I want to avoid because of the potential slowdown involved)?

Anyone got any good ideas?

Replies are listed 'Best First'.
Re: limiting associative array memory usage
by FrankG (Initiate) on Apr 07, 2001 at 16:12 UTC
    I would use dbmopen() or tie(). Using these you can bind the hash to a physical file. This made one of my scripts twice as fast, because it was quicker doing the I/O calls than letting memory fill up and start using swap space.

    See: perlman:perltie.
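
    A minimal sketch of tying the hash to a file with DB_File (one such DBM module; the file name "events.db" is just an example):

    use strict;
    use Fcntl;
    use DB_File;

    # Tie the hash to a Berkeley DB file on disk instead of keeping
    # everything in RAM.  "events.db" is only an example name.
    tie my %array, 'DB_File', 'events.db', O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "Cannot tie events.db: $!";

    $array{"42|7"} = "A";    # used exactly like an ordinary hash

    untie %array;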

    - FrankG

Re: limiting associative array memory usage
by Masem (Monsignor) on Apr 07, 2001 at 15:29 UTC
    Are your events and attributes numerical, or can they be converted to numbers?

    If so, then using the hash as you are is definitely inefficient. Your 'data' above only needs about 1MB of storage, but because you're using string keys, Perl has to keep far more than that in memory.

    I'd suggest two approaches. I'll assume that event numbers are *not* sparse (that is, you most likely generate consecutive event numbers from your logging program). If the attribute numbers are also not sparse, then use a flattened 2D-to-1D array:

    use constant MAX_ATTRIBS => 1000;   # upper bound on attributes per event -- set to suit your logs

    sub getArray {
        my ( $event, $attrib ) = @_;
        return $array[ $event * MAX_ATTRIBS + $attrib ];
    }

    sub putArray {
        my ( $event, $attrib, $value ) = @_;
        $array[ $event * MAX_ATTRIBS + $attrib ] = $value;
    }
    (Writing out results as 'event|attrib' can be done in the last step).
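
    That last step might look roughly like this ($max_event is assumed to have been tracked while reading the log):

    # Hypothetical final write-out, reconstructing the "event|attrib" keys
    # only at output time.
    for my $event ( 0 .. $max_event ) {
        for my $attrib ( 0 .. MAX_ATTRIBS - 1 ) {
            my $value = getArray( $event, $attrib );
            next unless defined $value;
            print "$event|$attrib,$value\n";
        }
    }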

    Alternatively, if the attributes are sparse, use a hash of lists, hashing on the attribute...

    sub putArray {
        my ( $event, $attrib, $value ) = @_;
        $hash{ $attrib } ||= [];                # create the list if not defined
        $hash{ $attrib }->[ $event ] = $value;
    }

    sub getArray {
        my ( $event, $attrib ) = @_;
        return $hash{ $attrib }->[ $event ];
    }
    Mind you, in both cases, if what you are putting in the array is 'large' (even a text string of a few characters), then in your 1000x1000 case you'll still be grinding memory. If you have sparse attributes, you can probably get away with larger strings before this happens. But in either case you will hit a limit with large event/attribute counts and meaningful data content. At that point I would resort to temporary files, and drop back to the 'flattened array' approach:
    sub putArray {
        my ( $event, $attrib, $value ) = @_;
        my $file = sprintf( "%06d-%06d", $event, $attrib );
        open FILE, '>' . $file or die $!;
        print FILE $value;
        close FILE;
    }

    sub getArray {
        my ( $event, $attrib ) = @_;
        my $file = sprintf( "%06d-%06d", $event, $attrib );
        return if !-e $file;
        open FILE, '<' . $file or die $!;
        my $value = <FILE>;
        close FILE;
        return $value;
    }
    Yes, it will be slow, but you will not be thrashing memory as your events and attributes expand.


    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
Re (tilly) 1: limiting associative array memory usage
by tilly (Archbishop) on Apr 07, 2001 at 23:22 UTC
    FrankG is right on target. I would recommend Berkeley DB, either through DB_File or BerkeleyDB. I believe that it will be substantially more memory-efficient than native Perl data structures, so you may be able to avoid using a temp file by using an in-memory database.
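
    A rough sketch with DB_File, which (per its documentation) gives you an in-memory database if you pass undef as the file name -- untested against your data:

    use strict;
    use Fcntl;
    use DB_File;

    # Passing undef as the file name asks DB_File for an in-memory
    # Berkeley DB database rather than a file on disk.
    tie my %event, 'DB_File', undef, O_RDWR | O_CREAT, 0666, $DB_BTREE
        or die "Cannot create in-memory database: $!";

    $event{"42|7"} = "A";    # same usage as a plain hash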

    Alternately if you have a database, depending on the types of manipulations you are doing it may be easier and faster to access it with DBI, create a scratch table, populate that, and then pull data back out rather than writing your own processing logic.
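
    Something along these lines, with a made-up DSN and table name -- adjust for whatever database you have:

    use strict;
    use DBI;

    # The DSN, user and password here are placeholders.
    my $dbh = DBI->connect( 'dbi:mysql:logs', 'user', 'password',
                            { RaiseError => 1 } );

    # Scratch table that disappears when the connection is closed.
    $dbh->do( 'CREATE TEMPORARY TABLE event_attr
                   (event INT, attrib INT, value VARCHAR(255))' );

    my $sth = $dbh->prepare(
        'INSERT INTO event_attr (event, attrib, value) VALUES (?, ?, ?)' );
    $sth->execute( 42, 7, 'A' );    # one row per parsed attribute

    # ... then pull the data back out, already grouped, once the log is read.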

    A few other notes. First of all, judging from your code you are still using C-style loops. Code that uses C-style loops will (like C) tend to have indexing errors (off-by-one, fencepost) among its most common bugs. If you switch to foreach-style loops you can eliminate those. Also, I find it very helpful to use strict religiously, which you don't appear to be doing. Sure, it is a (small) pain to type my everywhere, but it catches so many typos that it pays off in spades...
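
    For example, your test loop rewritten with strict and foreach-style loops:

    use strict;

    my %array;
    # Range-based foreach loops avoid the off-by-one risks of C-style loops.
    for my $index1 ( 0 .. 999 ) {
        for my $index2 ( 0 .. 999 ) {
            $array{"$index1|$index2"} = "A";
        }
    }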