How to merge data in IP address pairs

-=Markus=- has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I'm a Perl newbie and have a question about merging data and finding reversed IP pairs. I have a tab-separated list of network traffic containing the source IP, destination IP and bytes transferred between the peers, as follows:

Source IP         Destination IP    Bytes
10.0.0.24         93.188.134.219    32684
120.137.205.48    10.0.0.171        258
10.0.0.26         84.124.185.220    432
10.0.0.10         84.31.180.236     1476
84.31.180.236     10.0.0.10         4273

I need to aggregate the byte counts for each session, where a source/destination IP pair and its reverse (destination/source) count as the same session. In the example data above, the last two lines should be merged into a single line with the summed byte count, as follows:

10.0.0.10            84.31.180.236        1476
84.31.180.236        10.0.0.10            4273

10.0.0.10            84.31.180.236        5749

The order of the IPs doesn't matter. Finally, the complete aggregated list should be printed. For the example above, the output should be:

10.0.0.24         93.188.134.219    32684
120.137.205.48    10.0.0.171        258
10.0.0.26         84.124.185.220    432
10.0.0.10         84.31.180.236     5749

I've created the following solution:

#!/usr/bin/perl

use strict;

my @lines;
open(D, $ARGV[0]) || die("Could not open file!\nUsage: $0 file ");
@lines = <D>;
close(D);

my %count;
foreach (@lines) {
        next if /^#|^(\s)*$/;
        chomp;
        my ($ipa, $ipb, $bytes) = split /\t\s?/;
        if((grep /$ipb/, %count) && (grep /$ipa/, (%{$count{$ipb}})))
                {
                $count{$ipb}{$ipa}+=$bytes;
                }
        else
                {
                $count{$ipa}{$ipb}+=$bytes;
                }
        }

foreach my $key(keys %count){
        foreach my $k(keys %{$count{$key}}){
        print "${key}\t$k\t$count{$key}->{$k}\n";
        }
}

That works well for a small amount of data (a few thousand lines) but is basically unusable for large inputs (I have over 77M lines to process).

I have been struggling to find a proper solution for the issue for the last three days but haven't progressed much. I would highly appreciate any help on this one. Thanks in advance! :)

Br, -=Markus=-

Ps.

How can the same aggregation be done when the data contains several columns, each of which should be summed? For example:

Source IP        Destination IP    Bytes    Packets    Flows
10.0.0.10        84.31.180.236     1476     241        22
84.31.180.236    10.0.0.10         4273     15         3
=>
10.0.0.10        84.31.180.236     5749     256        25

Re: How to merge data in IP address pairs by tobyink (Canon) on May 27, 2012 at 13:55 UTC
The grepping is slowing you down unnecessarily. Just use `sort` to make sure that you always refer to the IP addresses in a predictable order.

my $filename = shift @ARGV;
die "Usage: $0 FILENAME" unless defined $filename;
# Open $filename, not $ARGV[0]: the shift above already removed it from @ARGV
open my $fh, '<', $filename or die "Could not open file '$filename': $!";

my %count;
INPUT: while (<$fh>) {
    chomp;
    my ($ip1, $ip2, $bytes) = split /\s+/;
    # Sorting gives A->B and B->A the same hash key
    ($ip1, $ip2) = sort ($ip1, $ip2);
    $count{$ip1, $ip2} += $bytes;
}

OUTPUT: {
    local $, = "\t";   # output field separator
    local $\ = "\n";   # output record separator
    foreach (keys %count) {
        # Multi-part hash keys are joined with $;
        my ($ip1, $ip2) = split $;, $_;
        print $ip1, $ip2, $count{$_};
    }
}

Adding extra columns is not much different.

my $filename = shift @ARGV;
die "Usage: $0 FILENAME" unless defined $filename;
open my $fh, '<', $filename or die "Could not open file '$filename': $!";

my %count;
INPUT: while (<$fh>) {
    chomp;
    my ($ip1, $ip2, @data) = split /\s+/;
    ($ip1, $ip2) = sort ($ip1, $ip2);
    # Keep one running total per data column
    $count{$ip1, $ip2}[$_] += $data[$_] for 0 .. $#data;
}

OUTPUT: {
    local $, = "\t";
    local $\ = "\n";
    foreach (keys %count) {
        my ($ip1, $ip2) = split $;, $_;
        print $ip1, $ip2, @{ $count{$_} };
    }
}

`perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'`
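The sorted-key idea above can be tried on the sample rows from the question without an input file. This is only a rough sketch under some assumptions: the helper name `aggregate_pairs` and the in-memory `@rows` list are made up for illustration, the key is joined with a tab instead of `$;`, and comment and blank lines are skipped the way the original script did.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): sums every data column
# per unordered IP pair. Input lines are whitespace-separated.
sub aggregate_pairs {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        next if $line =~ /^#|^\s*$/;    # skip comments and blank lines
        my ($ip1, $ip2, @data) = split ' ', $line;
        # String sort so that A->B and B->A produce the same key
        ($ip1, $ip2) = sort { $a cmp $b } ($ip1, $ip2);
        $count{"$ip1\t$ip2"}[$_] += $data[$_] for 0 .. $#data;
    }
    return \%count;
}

# Sample rows taken from the question.
my @rows = (
    "10.0.0.24\t93.188.134.219\t32684",
    "120.137.205.48\t10.0.0.171\t258",
    "10.0.0.26\t84.124.185.220\t432",
    "10.0.0.10\t84.31.180.236\t1476",
    "84.31.180.236\t10.0.0.10\t4273",
);

my $totals = aggregate_pairs(@rows);
print join("\t", $_, @{ $totals->{$_} }), "\n" for sort keys %$totals;
```

Each input line costs one two-element sort and one hash update, so the running time grows roughly linearly with the number of lines. Memory use depends on the number of distinct pairs, not on the number of lines, because nothing is slurped into an array.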
Re^2: How to merge data in IP address pairs by tobyink (Canon) on May 27, 2012 at 14:37 UTC
"The grepping is slowing you down unnecessarily" By the way, it's also plain weird. If you want to check for the existence of hash keys, then there's the `exists` keyword: `if (exists $count{$ipb} and exists $count{$ipb}{$ipa})` [download] `perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'`	[reply] [d/l] [select]
Re^2: How to merge data in IP address pairs by -=Markus=- (Initiate) on May 27, 2012 at 14:54 UTC
Hi Tobyink,

Thank you a million - works like a charm! You totally saved my week! :)

Kind regards, -=Markus=-