Out of memory inefficient code?

tuxtutorials has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl monks, I need guidance on improving the performance of a Perl script. Please see the script below:

use strict;
use warnings;

open(OUTNOMATCH, ">nomatch.out") or die "Couldn't write to file $!";
open(OUTMATCH, ">match.out") or die "Couldn't write to file $!";

sub match_internal {  #Separate internal/external addresses

use Net::IP::Match::Regexp qw( create_iprange_regexp match_ip );

    my $my_ip = $_[0];
    my $regexp = create_iprange_regexp(
       qw( 192.168.0.0/16 10.10.0.0/16 192.3.3.0/23 192.168.24.0/21 10.0.0.0/8 )
    );
        if (match_ip($my_ip, $regexp)) {
            print OUTMATCH "$my_ip\n";
        }
        else {
            print OUTNOMATCH "$my_ip\n";
        }
}

sub uniq_ip { # locate and remove duplicate addresses
    my @list = @_;
    my @uniq_ip = keys %{{ map { $_ => 1 } @list }};
}

sub sortme { # sort all addresses
    my @array = @_;
    my %hashTemp = map { $_ => 1 } @array;
    my @array_out = sort keys %hashTemp;
}

sub main_loop { #main loop that performs all logic and munging

my @item;
my @uniq_out;

    while (<>) {
        (my $field1, my $field2) = split /DST=/, $_;
            if ($field2 =~ m/(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/) {
                push (@item, $1);
            }
    }
    @uniq_out = uniq_ip(@item);
        foreach(@uniq_out) {
            match_internal($_);
        }
    }

main_loop();

Here is a sample of the data that I am feeding it:

Nov 17 11:09:25 proxy02 kernel: OUTPUT LOGIN= OUT=eth0 SRC=11.11.11.0 DST=192.168.3.1 LEN=1420 TOS=0x00 PREC=0x00 TTL=64 ID=10523 DF PROTO=TCP SPT=3128 DPT=1921 WINDOW=16659 RES=0x00 ACK URGP=0
Nov 17 11:09:25 proxy02 kernel: OUTPUT LOGIN= OUT=eth0 SRC=11.11.11.0 DST=192.168.3.1 LEN=1420 TOS=0x00 PREC=0x00 TTL=64 ID=10525 DF PROTO=TCP SPT=3128 DPT=1921 WINDOW=16659 RES=0x00 ACK URGP=0
Nov 17 11:09:25 proxy02 kernel: OUTPUT LOGIN= OUT=eth0 SRC=11.11.11.0 DST=192.168.4.1 LEN=1420 TOS=0x00 PREC=0x00 TTL=64 ID=10527 DF PROTO=TCP SPT=3128 DPT=1921 WINDOW=16659 RES=0x00 ACK URGP=0
Nov 17 11:09:25 proxy02 kernel: OUTPUT LOGIN= OUT=eth0 SRC=11.11.11.0 DST=192.168.43.1 LEN=1420 TOS=0x00 PREC=0x00 TTL=64 ID=10529 DF PROTO=TCP SPT=3128 DPT=1921 WINDOW=16659 RES=0x00 ACK URGP=0
Nov 17 11:09:25 proxy02 kernel: OUTPUT LOGIN= OUT=eth0 SRC=11.11.11.0 DST=192.168.43.1 LEN=1420 TOS=0x00 PREC=0x00 TTL=64 ID=10531 DF PROTO=TCP SPT=3128 DPT=1921 WINDOW=16659 RES=0x00 ACK URGP=0
Nov 17 11:09:25 proxy02 kernel: OUTPUT LOGIN= OUT=eth0 SRC=11.11.11.0 DST=192.168.43.1 LEN=1420 TOS=0x00 PREC=0x00 TTL=64 ID=10533 DF PROTO=TCP SPT=3128 DPT=1921 WINDOW=16659 RES=0x00 ACK URGP=0
Nov 17 11:09:25 proxy02 kernel: OUTPUT LOGIN= OUT=eth0 SRC=11.11.11.0 DST=192.168.43.1 LEN=1420 TOS=0x00 PREC=0x00 TTL=64 ID=10535 DF PROTO=TCP SPT=3128 DPT=1921 WINDOW=16659 RES=0x00 ACK URGP=0

I am munging log files around 5.5G in size and getting out-of-memory errors. Any advice on script logic would be great. Thanks


Replies are listed 'Best First'.
Re: Out of memory inefficient code? by BrowserUk (Patriarch) on Feb 11, 2010 at 21:27 UTC
You create a list of IPs you find in `@item`;
You then pass them as a list into `uniq_ip(@item)`
Where you make a copy of that list `my @list = @_;`
You then make another copy when you pass it to map `map { $_ => 1 } @list`
Which you use to create an anonymous hash `{ map { $_ => 1 } @list }`
Then you create another list from its uniq keys `keys %{{ map { $_ => 1 } @list }}`
Which you assign to an array `my @uniq_ip = keys %{{ map { $_ => 1 } @list }};`
Which is returned as another list (last statement of the subroutine).
Which you assign to another array `@uniq_out = uniq_ip(@item);`
Which you iterate over `foreach(@uniq_out)`

That makes 9 or 10 copies of the list. It's no wonder you're running out of memory.

Try this. It avoids most of those copies:

use strict;
use warnings;

open(OUTNOMATCH, ">nomatch.out") or die "Couldn't write to file $!";
open(OUTMATCH, ">match.out") or die "Couldn't write to file $!";

sub match_internal {  #Separate internal/external addresses

    use Net::IP::Match::Regexp qw( create_iprange_regexp match_ip );

    my $my_ip = $_[0];
    my $regexp = create_iprange_regexp(
        qw( 192.168.0.0/16 10.10.0.0/16 192.3.3.0/23 192.168.24.0/21 10.0.0.0/8 )
    );
    if (match_ip($my_ip, $regexp)) {
        print OUTMATCH "$my_ip\n";
    }
    else {
        print OUTNOMATCH "$my_ip\n";
    }
}

sub sortme { # sort all addresses
    my @array = @_;
    my %hashTemp = map { $_ => 1 } @array;
    my @array_out = sort keys %hashTemp;
}

sub main_loop { #main loop that performs all logic and munging

    my %uniq;

    while (<>) {
        (my $field1, my $field2) = split /DST=/, $_;
        if ($field2 =~ m/(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/) {
            $uniq{ $1 } = 1;
        }
    }
    match_internal( $_ ) while $_ = each %uniq;
}

main_loop();

You're also re-creating this:

`my $regexp = create_iprange_regexp( qw( 192.168.0.0/16 10.10.0.0/16 192.3.3.0/23 192.168.24.0/21 10.0.0.0/8 ) );`

for every uniq IP you check...which probably doesn't cost you in extra memory, but is hugely wasteful of cpu (time).
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "I'd rather go naked than blow up my ass"
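BrowserUk's last point, building the matcher once instead of once per address, could be sketched as follows. This is a minimal illustration only: `build_matcher` is a hypothetical stand-in for `create_iprange_regexp` (it uses a crude prefix pattern rather than real CIDR matching, so it runs without the CPAN module), and the call counter exists only to show how often the expensive step happens.

```perl
use strict;
use warnings;

my $builds = 0;

# Hypothetical stand-in for create_iprange_regexp(): compiles a pattern
# for two fixed prefixes and counts how often it is called.
sub build_matcher {
    $builds++;
    return qr/\A(?:192\.168\.|10\.)/;
}

# Build the matcher once, outside the per-address loop, then reuse it.
sub classify {
    my @ips = @_;
    my $matcher = build_matcher();
    my (@internal, @external);
    for my $ip (@ips) {
        if ($ip =~ $matcher) { push @internal, $ip }
        else                 { push @external, $ip }
    }
    return (\@internal, \@external);
}

my ($in, $ex) = classify(qw( 192.168.3.1 11.11.11.0 10.10.4.2 ));
print "internal: @$in\n";
print "external: @$ex\n";
print "matcher built $builds time(s)\n";
```

In the original script the same effect should follow from moving the `create_iprange_regexp` call out of `match_internal` into a variable set once before the loop.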
Re: Out of memory inefficient code? by ikegami (Patriarch) on Feb 11, 2010 at 21:33 UTC
You're only extracting the IP addresses, so let's find out how many you have.

$ perl -wE'say 5.5*1024 / length("Nov 17 11:09:25 proxy02 kernel: OUTPUT LOGIN= OUT=eth0 SRC=11.11.11.0 DST=192.168.3.1 LEN=1420 TOS=0x00 PREC=0x00 TTL=64 ID=10523 DF PROTO=TCP SPT=3128 DPT=1921 WINDOW=16659 RES=0x00 ACK URGP=0\n")'
29.0309278350515

So you have about 30M IP addresses, many of which are duplicates.

$ perl -MDevel::Size=total_size -wE'my %h; for (1..100) { ++$h{ pack "C4", 1,2,3,$_ } } say total_size(\%h)/100'
47.56

At a rate of roughly 50 bytes per IP, you'd need 30M * 50 bytes = 1.5GB just for the data. That's a lot, but it might be small enough to avoid getting fancy, especially since many are duplicates.

use strict;
use warnings;

my $fn_in = 'internal.out';
my $fn_ex = 'external.out';

my @internals = (
    [ pack('C4', 10,0,0,0     ), pack('C4', 255,0,0,0     ) ],
#   [ pack('C4', 10,10,0,0    ), pack('C4', 255,255,0,0   ) ],
    # 192.3.3.0/23: the network address under this mask is 192.3.2.0
    [ pack('C4', 192,3,2,0    ), pack('C4', 255,255,254,0 ) ],
    [ pack('C4', 192,168,0,0  ), pack('C4', 255,255,0,0   ) ],
#   [ pack('C4', 192,168,24,0 ), pack('C4', 255,255,248,0 ) ],
);

sub is_internal {
    my $packed_ip = shift;
    for (@internals) {
        # Parentheses are required: eq binds more tightly than &.
        return 1 if ($packed_ip & $_->[1]) eq $_->[0];
    }
    return 0;
}

sub extract {
    open(my $fh_in, '>', $fn_in)
        or die("Can't create file $fn_in: $!\n");
    open(my $fh_ex, '>', $fn_ex)
        or die("Can't create file $fn_ex: $!\n");

    my %seen;
    while (<>) {
        # List assignment, so $ip gets the capture rather than a true/false result.
        my ($ip) = /DST=(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/
            or next;
        my $packed_ip = pack('C4', split(/\./, $ip));
        next if $seen{$packed_ip}++;
        print { is_internal($packed_ip) ? $fh_in : $fh_ex } "$ip\n";
    }
    undef %seen;  # Free mem.
}

sub sort_file {
    my ($fn) = @_;
    my @packed_ips;
    {
        open(my $fh, '<', $fn)
            or die("Can't open file $fn: $!\n");
        while (<$fh>) {
            chomp;
            push @packed_ips, pack('C4', split(/\./, $_));
        }
    }
    # Packed addresses sort correctly as plain strings.
    @packed_ips = sort @packed_ips;
    {
        open(my $fh, '>', $fn)
            or die("Can't create file $fn: $!\n");
        for (@packed_ips) {
            my $ip = join('.', unpack('C4', $_));
            print $fh "$ip\n";
        }
    }
}

sub main {
    extract();
    sort_file($fn_in);
    sort_file($fn_ex);
}

main();

Using a trie instead of a hash would reduce memory usage, and it would provide the results in sorted order.

By the way, two of your internal ranges are redundant with other ranges. I commented them out.
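The trie idea mentioned above can be sketched roughly like this, assuming one trie level per octet and plain nested hashes as nodes. The names `trie_insert` and `trie_walk` are made up for illustration. The sketch only demonstrates the deduplication and sorted-order properties; nested Perl hashes carry their own overhead, so any memory saving would depend on a more compact node representation than the one shown here.

```perl
use strict;
use warnings;

# Insert a dotted-quad address into a four-level trie of nested hashes.
# Inserting the same address twice has no further effect.
sub trie_insert {
    my ($trie, $ip) = @_;
    my ($o1, $o2, $o3, $o4) = split /\./, $ip;
    $trie->{$o1}{$o2}{$o3}{$o4} = 1;
    return;
}

# Walk the trie depth-first, visiting octets in numeric order at each
# level, and return the stored addresses as dotted-quad strings.
sub trie_walk {
    my ($node, @prefix) = @_;
    return join('.', @prefix) if @prefix == 4;
    return map { trie_walk($node->{$_}, @prefix, $_) }
           sort { $a <=> $b } keys %$node;
}

my %trie;
trie_insert(\%trie, $_)
    for qw( 192.168.43.1 10.0.0.1 192.168.3.1 10.0.0.1 );

print "$_\n" for trie_walk(\%trie);
```

Addresses sharing leading octets share the upper levels of the structure, which is where a compact implementation would get its savings.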
Re: Out of memory inefficient code? by ikegami (Patriarch) on Feb 11, 2010 at 22:20 UTC
Here, this offloads the memory-intensive work to the command-line utility `sort`. It can sort data that doesn't fit in memory, and it can remove duplicates in the process.

use strict;
use warnings;

use Fcntl      qw( SEEK_SET );
use File::Temp qw( tempfile );
use IPC::Open3 qw( open3 );

my $fn_in = 'internal.out';
my $fn_ex = 'external.out';

my @internals = (
    [ pack('C4', 10,0,0,0     ), pack('C4', 255,0,0,0     ) ],
#   [ pack('C4', 10,10,0,0    ), pack('C4', 255,255,0,0   ) ],
    # 192.3.3.0/23: the network address under this mask is 192.3.2.0
    [ pack('C4', 192,3,2,0    ), pack('C4', 255,255,254,0 ) ],
    [ pack('C4', 192,168,0,0  ), pack('C4', 255,255,0,0   ) ],
#   [ pack('C4', 192,168,24,0 ), pack('C4', 255,255,248,0 ) ],
);

sub is_internal {
    my $packed_ip = shift;
    for (@internals) {
        # Parentheses are required: eq binds more tightly than &.
        return 1 if ($packed_ip & $_->[1]) eq $_->[0];
    }
    return 0;
}

sub process_result {
    my ($child, $pid) = @_;
    # waitpid returns the pid (or -1); the exit status is left in $?.
    die("Can't collect child $child: $!\n") if waitpid($pid, 0) == -1;
    my $s = $? & 127;
    die("Child $child was killed from signal $s\n") if $s;
    my $e = $? >> 8;
    die("Child $child exited with code $e\n") if $e;
}

sub sort_file {
    my ($fh, $fn) = @_;

    # open3 works better with globs
    open(local *TO_SORT, '<&', $fh)
        or die("Can't dup input handle: $!\n");
    pipe(local *TO_CUT, local *FR_SORT)
        or die("Can't create pipe: $!\n");
    open(local *FR_CUT, '>', $fn)
        or die("Can't create file \"$fn\": $!\n");

    my $sort_pid = open3('<&TO_SORT', '>&FR_SORT', '>&STDERR', 'sort', '-u');
    close(FR_SORT);  # Drop our copy of the write end so cut sees EOF.
    my $cut_pid  = open3('<&TO_CUT', '>&FR_CUT', '>&STDERR', 'cut', '-f', '2-');

    process_result('sort', $sort_pid);
    process_result('cut',  $cut_pid);
}

sub main {
    my $fh_in = tempfile();
    my $fh_ex = tempfile();

    while (<>) {
        # List assignment, so $ip gets the capture rather than a true/false result.
        my ($ip) = /DST=(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/
            or next;
        my $packed_ip = pack('C4', split(/\./, $ip));
        # Hex key first, so a plain text sort orders the addresses numerically.
        print { is_internal($packed_ip) ? $fh_in : $fh_ex }
            unpack('H8', $packed_ip), "\t", $ip, "\n";
    }

    seek($fh_in, 0, SEEK_SET)
        or die("Can't seek temp file: $!\n");
    seek($fh_ex, 0, SEEK_SET)
        or die("Can't seek temp file: $!\n");

    sort_file($fh_in, $fn_in);
    sort_file($fh_ex, $fn_ex);
}

main();

Untested.
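The reply above writes each record as a fixed-width hex key, a tab, and the dotted address, so that a plain text sort puts addresses in numeric order before `cut` strips the key. A small sketch of why that works, with Perl's built-in `sort` and a hash standing in for `sort -u`, and with the made-up helper names `keyed` and `sort_unique_ips`:

```perl
use strict;
use warnings;

# Build a "hexkey\tip" record: eight hex digits, a tab, the dotted address.
sub keyed {
    my ($ip) = @_;
    return unpack('H8', pack('C4', split /\./, $ip)) . "\t$ip";
}

# Stand-in for `sort -u | cut -f 2-`: drop duplicate records, sort them
# as plain strings, then keep only the part after the first tab.
sub sort_unique_ips {
    my %seen;
    my @records = grep { !$seen{$_}++ } map { keyed($_) } @_;
    return map { (split /\t/, $_, 2)[1] } sort @records;
}

my @sorted = sort_unique_ips(
    qw( 192.168.43.1 10.0.0.1 192.168.3.1 10.0.0.1 9.255.0.1 )
);
print "$_\n" for @sorted;
```

Sorting the dotted strings directly would place 10.0.0.1 before 9.255.0.1; the zero-padded hex key avoids that because every key has the same width.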