tuxtutorials has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl monks, I need guidance on improving the performance of a Perl script. Please see the script below:

    use strict;
    use warnings;

    open(OUTNOMATCH, ">nomatch.out") or die "Couldn't write to file $!";
    open(OUTMATCH, ">match.out") or die "Couldn't write to file $!";

    sub match_internal {
        #Separate internal/external addresses
        use Net::IP::Match::Regexp qw( create_iprange_regexp match_ip );
        my $my_ip = $_[0];
        my $regexp = create_iprange_regexp(
            qw( 192.168.0.0/16 10.10.0.0/16 192.3.3.0/23 192.168.24.0/21 10.0.0.0/8 )
        );
        if (match_ip($my_ip, $regexp)) {
            print OUTMATCH "$my_ip\n";
        }
        else {
            print OUTNOMATCH "$my_ip\n";
        }
    }

    sub uniq_ip {
        # locate and remove duplicate addresses
        my @list = @_;
        my @uniq_ip = keys %{{ map { $_ => 1 } @list }};
    }

    sub sortme {
        # sort all addresses
        my @array = @_;
        my %hashTemp = map { $_ => 1 } @array;
        my @array_out = sort keys %hashTemp;
    }

    sub main_loop {
        #main loop that performs all logic and munging
        my @item;
        my @uniq_out;
        while (<>) {
            (my $field1, my $field2) = split /DST=/, $_;
            if ($field2 =~ m/(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/) {
                push (@item, $1);
            }
        }
        @uniq_out = uniq_ip(@item);
        foreach(@uniq_out) {
            match_internal($_);
        }
    }

    main_loop();

Here is a sample of the data that I am feeding it:

    Nov 17 11:09:25 proxy02 kernel: OUTPUT LOGIN= OUT=eth0 SRC=11.11.11.0 DST=192.168.3.1 LEN=1420 TOS=0x00 PREC=0x00 TTL=64 ID=10523 DF PROTO=TCP SPT=3128 DPT=1921 WINDOW=16659 RES=0x00 ACK URGP=0
    Nov 17 11:09:25 proxy02 kernel: OUTPUT LOGIN= OUT=eth0 SRC=11.11.11.0 DST=192.168.3.1 LEN=1420 TOS=0x00 PREC=0x00 TTL=64 ID=10525 DF PROTO=TCP SPT=3128 DPT=1921 WINDOW=16659 RES=0x00 ACK URGP=0
    Nov 17 11:09:25 proxy02 kernel: OUTPUT LOGIN= OUT=eth0 SRC=11.11.11.0 DST=192.168.4.1 LEN=1420 TOS=0x00 PREC=0x00 TTL=64 ID=10527 DF PROTO=TCP SPT=3128 DPT=1921 WINDOW=16659 RES=0x00 ACK URGP=0
    Nov 17 11:09:25 proxy02 kernel: OUTPUT LOGIN= OUT=eth0 SRC=11.11.11.0 DST=192.168.43.1 LEN=1420 TOS=0x00 PREC=0x00 TTL=64 ID=10529 DF PROTO=TCP SPT=3128 DPT=1921 WINDOW=16659 RES=0x00 ACK URGP=0
    Nov 17 11:09:25 proxy02 kernel: OUTPUT LOGIN= OUT=eth0 SRC=11.11.11.0 DST=192.168.43.1 LEN=1420 TOS=0x00 PREC=0x00 TTL=64 ID=10531 DF PROTO=TCP SPT=3128 DPT=1921 WINDOW=16659 RES=0x00 ACK URGP=0
    Nov 17 11:09:25 proxy02 kernel: OUTPUT LOGIN= OUT=eth0 SRC=11.11.11.0 DST=192.168.43.1 LEN=1420 TOS=0x00 PREC=0x00 TTL=64 ID=10533 DF PROTO=TCP SPT=3128 DPT=1921 WINDOW=16659 RES=0x00 ACK URGP=0
    Nov 17 11:09:25 proxy02 kernel: OUTPUT LOGIN= OUT=eth0 SRC=11.11.11.0 DST=192.168.43.1 LEN=1420 TOS=0x00 PREC=0x00 TTL=64 ID=10535 DF PROTO=TCP SPT=3128 DPT=1921 WINDOW=16659 RES=0x00 ACK URGP=0

I am munging log files around 5.5G in size and getting out-of-memory errors. Any advice on the script logic would be great. Thanks.

Replies are listed 'Best First'.
Re: Out of memory inefficient code?
by BrowserUk (Patriarch) on Feb 11, 2010 at 21:27 UTC

    1. You create a list of the IPs you find in @item;
    2. You then pass them as a list into uniq_ip(@item)
    3. Where you make a copy of that list: my @list = @_;
    4. You then make another copy when you pass it to map: map { $_ => 1 } @list
    5. Which you use to create an anonymous hash: { map { $_ => 1 } @list }
    6. Then you create another list from its unique keys: keys %{{ map { $_ => 1 } @list }}
    7. Which you assign to an array: my @uniq_ip = keys %{{ map { $_ => 1 } @list }};
    8. Which is returned as another list (it is the last statement of the subroutine).
    9. Which you assign to another array: @uniq_out = uniq_ip(@item);
    10. Which you iterate over: foreach(@uniq_out)

    That makes 9 or 10 copies of the list. It's no wonder you're running out of memory.

    Try this. It avoids most of those copies:

        use strict;
        use warnings;

        open(OUTNOMATCH, ">nomatch.out") or die "Couldn't write to file $!";
        open(OUTMATCH, ">match.out") or die "Couldn't write to file $!";

        sub match_internal {
            #Separate internal/external addresses
            use Net::IP::Match::Regexp qw( create_iprange_regexp match_ip );
            my $my_ip = $_[0];
            my $regexp = create_iprange_regexp(
                qw( 192.168.0.0/16 10.10.0.0/16 192.3.3.0/23 192.168.24.0/21 10.0.0.0/8 )
            );
            if (match_ip($my_ip, $regexp)) {
                print OUTMATCH "$my_ip\n";
            }
            else {
                print OUTNOMATCH "$my_ip\n";
            }
        }

        sub sortme {
            # sort all addresses
            my @array = @_;
            my %hashTemp = map { $_ => 1 } @array;
            my @array_out = sort keys %hashTemp;
        }

        sub main_loop {
            #main loop that performs all logic and munging
            my %uniq;
            while (<>) {
                (my $field1, my $field2) = split /DST=/, $_;
                if ($field2 =~ m/(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/) {
                    $uniq{ $1 } = 1;
                }
            }
            match_internal( $_ ) while $_ = each %uniq;
        }

        main_loop();

    You're also re-creating this:

        my $regexp = create_iprange_regexp(
            qw( 192.168.0.0/16 10.10.0.0/16 192.3.3.0/23 192.168.24.0/21 10.0.0.0/8 )
        );

    for every unique IP you check, which probably doesn't cost you extra memory but is hugely wasteful of CPU time.
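
    One way to avoid that cost is to build the matcher once, at file scope, and let the subroutine reuse it. The sketch below is core Perl only and illustrates the pattern rather than the module's API: build_matcher() is a hypothetical stand-in for create_iprange_regexp(), and it only handles octet-aligned prefixes written as strings.

```perl
use strict;
use warnings;

my $builds = 0;    # counts how often the stand-in matcher is compiled

# Hypothetical stand-in for create_iprange_regexp(): turns a list of
# dotted prefixes into a single anchored regex.
sub build_matcher {
    my @prefixes = @_;
    $builds++;
    my $alt = join '|', map { quotemeta } @prefixes;
    return qr/^(?:$alt)/;
}

# Built once, before any address is classified.
my $internal_re = build_matcher('192.168.', '10.');

sub is_internal {
    my ($ip) = @_;
    return $ip =~ $internal_re ? 1 : 0;
}

my @results = map { is_internal($_) } qw( 192.168.3.1 10.1.2.3 11.11.11.0 );
print "@results (matcher built $builds time(s))\n";
```

    With the real module, the same shape would presumably apply: call create_iprange_regexp() once outside match_internal() and pass or close over the result.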


Re: Out of memory inefficient code?
by ikegami (Patriarch) on Feb 11, 2010 at 21:33 UTC

    You're only extracting the IP addresses, so let's find out how many you have.

        $ perl -wE'say 5.5*1024 / length("Nov 17 11:09:25 proxy02 kernel: OUTPUT LOGIN= OUT=eth0 SRC=11.11.11.0 DST=192.168.3.1 LEN=1420 TOS=0x00 PREC=0x00 TTL=64 ID=10523 DF PROTO=TCP SPT=3128 DPT=1921 WINDOW=16659 RES=0x00 ACK URGP=0\n")'
        29.0309278350515

    That is 5.5 GB expressed in MB, divided by the 194-byte sample line, so you have about 30M lines and therefore about 30M IP addresses, many of which are duplicates.

        $ perl -MDevel::Size=total_size -wE'my %h; for (1..100) { ++$h{ pack "C4", 1,2,3,$_ } } say total_size(\%h)/100'
        47.56

    At a rate of roughly 50 bytes per IP, you'd need 30M * 50 bytes = 1.5GB just for the data. That's a lot, but it might be small enough to avoid getting fancy, especially since many are duplicates.
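
    The same back-of-the-envelope estimate can be written as a small script. This is only a sketch of the arithmetic above: the estimate() helper is a name made up for illustration, the 50 bytes per key is the rounded Devel::Size figure, and it assumes the worst case in which every address is unique.

```perl
use strict;
use warnings;

# Sample log line from the question, including its trailing newline.
my $sample = "Nov 17 11:09:25 proxy02 kernel: OUTPUT LOGIN= OUT=eth0 "
           . "SRC=11.11.11.0 DST=192.168.3.1 LEN=1420 TOS=0x00 PREC=0x00 "
           . "TTL=64 ID=10523 DF PROTO=TCP SPT=3128 DPT=1921 WINDOW=16659 "
           . "RES=0x00 ACK URGP=0\n";

# Hypothetical helper: returns (number of lines, bytes of hash data).
sub estimate {
    my ($file_bytes, $line_bytes, $bytes_per_key) = @_;
    my $lines = $file_bytes / $line_bytes;
    return ($lines, $lines * $bytes_per_key);
}

# 5.5 GB taken as 5.5 * 2**30 bytes; 50 bytes per hash entry.
my ($lines, $mem) = estimate(5.5 * 2**30, length($sample), 50);

printf "about %.1fM lines, about %.2f GB of hash data if every IP were unique\n",
    $lines / 1e6, $mem / 1e9;
```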

        use strict;
        use warnings;

        my $fn_in = 'internal.out';
        my $fn_ex = 'external.out';

        # Each entry is [ network, netmask ], both packed to 4 bytes.
        my @internals = (
            [ pack('C4', 10,0,0,0     ), pack('C4', 255,0,0,0     ) ],
        #   [ pack('C4', 10,10,0,0    ), pack('C4', 255,255,0,0   ) ],
            # 192.3.3.0/23: the masked network address is 192.3.2.0
            [ pack('C4', 192,3,2,0    ), pack('C4', 255,255,254,0 ) ],
            [ pack('C4', 192,168,0,0  ), pack('C4', 255,255,0,0   ) ],
        #   [ pack('C4', 192,168,24,0 ), pack('C4', 255,255,248,0 ) ],
        );

        sub is_internal {
            my $packed_ip = shift;
            for (@internals) {
                # Parentheses are required: eq binds tighter than &.
                return 1 if ($packed_ip & $_->[1]) eq $_->[0];
            }
            return 0;
        }

        sub extract {
            open(my $fh_in, '>', $fn_in) or die("Can't create file $fn_in: $!\n");
            open(my $fh_ex, '>', $fn_ex) or die("Can't create file $fn_ex: $!\n");
            my %seen;
            while (<>) {
                # List assignment, so $ip gets the capture, not the match count.
                my ($ip) = /DST=(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/ or next;
                my $packed_ip = pack('C4', split(/\./, $ip));
                next if $seen{$packed_ip}++;
                print { is_internal($packed_ip) ? $fh_in : $fh_ex } "$ip\n";
            }
            undef %seen;  # Free mem.
        }

        sub sort_file {
            my ($fn) = @_;
            my @packed_ips;
            {
                open(my $fh, '<', $fn) or die("Can't open file $fn: $!\n");
                while (<$fh>) {
                    chomp;
                    push @packed_ips, pack('C4', split(/\./, $_));
                }
            }
            # Packed addresses sort bytewise, which is numeric IP order.
            @packed_ips = sort @packed_ips;
            {
                open(my $fh, '>', $fn) or die("Can't create file $fn: $!\n");
                for (@packed_ips) {
                    my $ip = join('.', unpack('C4', $_));
                    print {$fh} "$ip\n";
                }
            }
        }

        sub main {
            extract();
            sort_file($fn_in);
            sort_file($fn_ex);
        }

        main();

    Using a trie instead of a hash would reduce memory usage, and it would provide the results in sorted order.
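
    To make the trie idea concrete, here is a minimal sketch that uses nested hashes, one level per octet, and walks them in numeric order to get sorted, de-duplicated output. The names trie_add() and trie_sorted() are made up for this illustration. Note the assumption: nested Perl hashes carry their own per-node overhead, so this shows the structure and the sorted traversal, not a guaranteed memory saving; a compact trie would need a denser node representation.

```perl
use strict;
use warnings;

my %trie;    # four levels of nested hashes, one per octet

# Insert a dotted-quad address; duplicates collapse onto the same leaf.
sub trie_add {
    my ($ip) = @_;
    my @o = split /\./, $ip;
    $trie{ $o[0] }{ $o[1] }{ $o[2] }{ $o[3] } = 1;
    return;
}

# Depth-first walk, visiting the octets at each level in numeric order.
sub trie_sorted {
    my ($node, $depth, @prefix) = @_;
    my @out;
    for my $octet (sort { $a <=> $b } keys %$node) {
        if ($depth == 3) {
            push @out, join '.', @prefix, $octet;
        }
        else {
            push @out, trie_sorted($node->{$octet}, $depth + 1, @prefix, $octet);
        }
    }
    return @out;
}

trie_add($_) for qw( 192.168.43.1 10.0.0.7 192.168.3.1 192.168.43.1 192.168.4.1 );
print "$_\n" for trie_sorted(\%trie, 0);
```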

    By the way, two of your internal ranges are redundant with other ranges. I commented them out.
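
    The redundancy can be checked mechanically. The following sketch, with the hypothetical helpers parse_cidr() and cidr_contains(), converts each CIDR block to a 32-bit network and mask and tests whether one block lies entirely inside another. It assumes well-formed IPv4 input and does no validation.

```perl
use strict;
use warnings;

# Convert "a.b.c.d/len" into (network, mask) as 32-bit integers.
sub parse_cidr {
    my ($cidr) = @_;
    my ($ip, $len) = split m{/}, $cidr;
    my $addr = unpack 'N', pack 'C4', split /\./, $ip;
    my $mask = $len == 0 ? 0 : (0xFFFFFFFF << (32 - $len)) & 0xFFFFFFFF;
    return ($addr & $mask, $mask);
}

# True if every address in $inner also falls inside $outer.
sub cidr_contains {
    my ($outer, $inner) = @_;
    my ($o_net, $o_mask) = parse_cidr($outer);
    my ($i_net, $i_mask) = parse_cidr($inner);
    # The outer prefix must be no longer than the inner one.
    return 0 if ($o_mask & $i_mask) != $o_mask;
    return (($i_net & $o_mask) == $o_net) ? 1 : 0;
}

for my $pair (
    [ '10.0.0.0/8',     '10.10.0.0/16'    ],
    [ '192.168.0.0/16', '192.168.24.0/21' ],
    [ '192.168.0.0/16', '192.3.3.0/23'    ],
) {
    printf "%s contains %s: %s\n", @$pair, cidr_contains(@$pair) ? 'yes' : 'no';
}
```

    The first two pairs are the ranges that were commented out; the third is not covered by any other range and has to stay.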

Re: Out of memory inefficient code?
by ikegami (Patriarch) on Feb 11, 2010 at 22:20 UTC
    This version offloads the memory-intensive work to the command-line utility sort, which can sort data that doesn't fit in memory and can remove duplicates in the process.

        use strict;
        use warnings;

        use Fcntl      qw( SEEK_SET );
        use File::Temp qw( tempfile );
        use IPC::Open3 qw( open3 );

        my $fn_in = 'internal.out';
        my $fn_ex = 'external.out';

        # Each entry is [ network, netmask ], both packed to 4 bytes.
        my @internals = (
            [ pack('C4', 10,0,0,0     ), pack('C4', 255,0,0,0     ) ],
        #   [ pack('C4', 10,10,0,0    ), pack('C4', 255,255,0,0   ) ],
            # 192.3.3.0/23: the masked network address is 192.3.2.0
            [ pack('C4', 192,3,2,0    ), pack('C4', 255,255,254,0 ) ],
            [ pack('C4', 192,168,0,0  ), pack('C4', 255,255,0,0   ) ],
        #   [ pack('C4', 192,168,24,0 ), pack('C4', 255,255,248,0 ) ],
        );

        sub is_internal {
            my $packed_ip = shift;
            for (@internals) {
                # Parentheses are required: eq binds tighter than &.
                return 1 if ($packed_ip & $_->[1]) eq $_->[0];
            }
            return 0;
        }

        # Reap the child and die unless it exited cleanly.
        sub process_result {
            my ($child, $pid) = @_;
            die("Can't collect child $child: $!\n") if waitpid($pid, 0) == -1;
            my $status = $?;   # waitpid returns the pid; the status is in $?
            my $s = $status & 127;
            die("Child $child was killed by signal $s\n") if $s;
            my $e = $status >> 8;
            die("Child $child exited with code $e\n") if $e;
        }

        sub sort_file {
            my ($fh, $fn) = @_;

            # open3 works better with globs
            open(local *TO_SORT, '<&', $fh) or die("Can't dup input handle: $!\n");
            pipe(local *TO_CUT, local *FR_SORT) or die("Can't create pipe: $!\n");
            open(local *FR_CUT, '>', $fn) or die("Can't create file \"$fn\": $!\n");

            my $sort_pid = open3('<&TO_SORT', '>&FR_SORT', '>&STDERR', 'sort', '-u');
            close(FR_SORT);  # drop our copy of the write end so cut sees EOF
            my $cut_pid  = open3('<&TO_CUT', '>&FR_CUT', '>&STDERR', 'cut', '-f', '2-');
            close(FR_CUT);

            process_result('sort', $sort_pid);
            process_result('cut',  $cut_pid);
        }

        sub main {
            my $fh_in = tempfile();
            my $fh_ex = tempfile();
            while (<>) {
                # List assignment, so $ip gets the capture, not the match count.
                my ($ip) = /DST=(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/ or next;
                my $packed_ip = pack('C4', split(/\./, $ip));
                # Fixed-width hex key first, so a plain text sort orders by address;
                # cut strips the key again afterwards.
                print { is_internal($packed_ip) ? $fh_in : $fh_ex }
                    unpack('H8', $packed_ip), "\t", $ip, "\n";
            }
            seek($fh_in, 0, SEEK_SET) or die("Can't seek temp file: $!\n");
            seek($fh_ex, 0, SEEK_SET) or die("Can't seek temp file: $!\n");
            sort_file($fh_in, $fn_in);
            sort_file($fh_ex, $fn_ex);
        }

        main();

    Untested.