-=Markus=- has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I'm a Perl newbie and have a question about merging data and finding reverse IP pairs. I have a tab-separated list of network traffic containing the source IP, destination IP and bytes transferred between the peers, as follows:

Source IP        Destination IP    Bytes
10.0.0.24        93.188.134.219    32684
120.137.205.48   10.0.0.171        258
10.0.0.26        84.124.185.220    432
10.0.0.10        84.31.180.236     1476
84.31.180.236    10.0.0.10         4273

I need to aggregate the bytes for each session, where a session is an IP pair regardless of direction (source/destination counts as the same session as destination/source). In the example data above, the last two lines should be aggregated as follows:

10.0.0.10        84.31.180.236     1476
84.31.180.236    10.0.0.10         4273

=>

10.0.0.10 84.31.180.236 5749

The order of the IPs doesn't matter. Finally the complete list of all data should be printed. Based on the above example the source data should finally be shown as:

10.0.0.24        93.188.134.219    32684
120.137.205.48   10.0.0.171        258
10.0.0.26        84.124.185.220    432
10.0.0.10        84.31.180.236     5749

I've created the following solution:

#!/usr/bin/perl
use strict;

my @lines;
open(D, $ARGV[0]) || die("Could not open file!\nUsage: $0 file ");
@lines = <D>;
close(D);

my %count;
foreach (@lines) {
    next if /^#|^(\s)*$/;
    chomp;
    my ($ipa, $ipb, $bytes) = split /\t\s?/;
    if((grep /$ipb/, %count) && (grep /$ipa/, (%{$count{$ipb}}))) {
        $count{$ipb}{$ipa}+=$bytes;
    }
    else {
        $count{$ipa}{$ipb}+=$bytes;
    }
}

foreach my $key(keys %count){
    foreach my $k(keys %{$count{$key}}){
        print "${key}\t$k\t$count{$key}->{$k}\n";
    }
}

That works well for a small amount of data (a few thousand lines) but is basically unusable for large inputs (I have over 77M lines to process).

I have been struggling to find a proper solution for the issue for the last three days but haven't progressed much. I would highly appreciate any help on this one. Thanks in advance! :)

Br, -=Markus=-

Ps.

How can the same thing (aggregating all data columns) be done for data containing multiple columns? For example:

Source IP        Destination IP    Bytes   Packets   Flows
10.0.0.10        84.31.180.236     1476    241       22
84.31.180.236    10.0.0.10         4273    15        3

=>

10.0.0.10        84.31.180.236     5749    256       25

Re: How to merge data in IP address pairs
by tobyink (Canon) on May 27, 2012 at 13:55 UTC

    The grepping is slowing you down unnecessarily. Just use sort to make sure that you always refer to the IP addresses in a predictable order.

    my $filename = shift @ARGV;
    die "Usage: $0 FILENAME" unless defined $filename;

    # $filename was shifted off @ARGV above, so open it by name.
    open my $fh, '<', $filename
        or die "Could not open file '$filename': $!";

    my %count;
    INPUT: while (<$fh>) {
        chomp;
        my ($ip1, $ip2, $bytes) = split /\s+/;
        # Sort the pair so that both directions share one hash key.
        ($ip1, $ip2) = sort ($ip1, $ip2);
        $count{$ip1, $ip2} += $bytes;
    }

    OUTPUT: {
        local $, = "\t";    # output field separator
        local $\ = "\n";    # output record separator
        foreach (keys %count) {
            my ($ip1, $ip2) = split $;, $_;
            print $ip1, $ip2, $count{$_};
        }
    }

    Adding extra columns is not much different.

    my $filename = shift @ARGV;
    die "Usage: $0 FILENAME" unless defined $filename;

    # $filename was shifted off @ARGV above, so open it by name.
    open my $fh, '<', $filename
        or die "Could not open file '$filename': $!";

    my %count;
    INPUT: while (<$fh>) {
        chomp;
        my ($ip1, $ip2, @data) = split /\s+/;
        # Sort the pair so that both directions share one hash key.
        ($ip1, $ip2) = sort ($ip1, $ip2);
        # Keep one running total per data column.
        $count{$ip1, $ip2}[$_] += $data[$_] for 0 .. $#data;
    }

    OUTPUT: {
        local $, = "\t";    # output field separator
        local $\ = "\n";    # output record separator
        foreach (keys %count) {
            my ($ip1, $ip2) = split $;, $_;
            print $ip1, $ip2, @{ $count{$_} };
        }
    }
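As a rough check of the multi-column idea, here is a minimal, self-contained sketch that applies it to the two sample rows from the question. The helper name `aggregate_pairs`, the inline sample lines, the comment/blank-line skip, and the tab-joined key (used here in place of the `$;` separator) are assumptions made for illustration, not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper (name and interface are assumptions): sum every
# data column per unordered IP pair. Takes a list of lines and returns a
# hash reference mapping "ipA<TAB>ipB" to an array reference of totals.
sub aggregate_pairs {
    my @lines = @_;
    my %totals;
    for my $line (@lines) {
        next if $line =~ /^#|^\s*$/;    # skip comments and blank lines
        my ($ip1, $ip2, @data) = split ' ', $line;
        # A plain string sort is enough: we only need a consistent
        # order for the pair, not a numerically meaningful one.
        my $key = join "\t", sort($ip1, $ip2);
        $totals{$key}[$_] += $data[$_] for 0 .. $#data;
    }
    return \%totals;
}

my $totals = aggregate_pairs(
    "10.0.0.10\t84.31.180.236\t1476\t241\t22",
    "84.31.180.236\t10.0.0.10\t4273\t15\t3",
);
print join("\t", $_, @{ $totals->{$_} }), "\n" for sort keys %$totals;
```

With those two rows the sketch prints a single tab-separated line: 10.0.0.10, 84.31.180.236, 5749, 256, 25, which matches the result asked for in the question.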
    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

      "The grepping is slowing you down unnecessarily"

      By the way, it's also plain weird. If you want to check for the existence of hash keys, then there's the exists keyword:

      if (exists $count{$ipb} and exists $count{$ipb}{$ipa})
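One reason to test both levels, as far as I can tell: asking `exists` about an inner key on its own makes Perl create the outer entry as a side effect (autovivification). The small sketch below illustrates that; the variable names and the sample hash contents are made up for the demonstration.

```perl
use strict;
use warnings;

# Sample data (an assumption for this demo): one known pair.
my %count = ( '84.31.180.236' => { '10.0.0.10' => 4273 } );

# Unguarded test: looking at the inner key creates $count{'10.0.0.24'}
# as an empty hash, even though the test itself is false.
my $found_unguarded = exists $count{'10.0.0.24'}{'93.188.134.219'};
my $keys_after_unguarded = keys %count;    # 2: the outer key now exists

delete $count{'10.0.0.24'};                # undo the side effect

# Guarded test: the first exists fails, so the inner lookup never runs.
my $found_guarded = ( exists $count{'10.0.0.24'}
        and exists $count{'10.0.0.24'}{'93.188.134.219'} );
my $keys_after_guarded = keys %count;      # 1: nothing was created

print "unguarded: $keys_after_unguarded keys, guarded: $keys_after_guarded keys\n";
```

On 77M lines the stray empty hashes from the unguarded form would also cost memory, so the two-step check is the safer habit.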
      Hi Tobyink,

      Thank you a million - works like a charm! You totally saved my week! :)

      Kind regards,

      -=Markus=-