chargrill has asked for the wisdom of the Perl Monks concerning the following question:

I might have an XY Problem, so I'll explain the X part first.

X: I have a packet capture log and I'm trying to go through it to find anomalous traffic patterns. I've already figured out how to parse the log file, and store it in a data structure (maybe not important here, but I think it would be called an AoHoA) that seems to make sense to me for my purposes. I'm specifically interested in seeing the start of a TCP conversation from server foo to server bar, along with the reply back from server bar to server foo.

Y: I would like to sort the output and group related conversations together. I've created a Schwartzian transform in a for loop to go through my data structure, but I seem to have a problem implementing the sort algorithm.

This example code is not quite like my actual code, but produces results the same way. Given the following code:

#!/usr/bin/perl
use strict;
use warnings;

for my $line(
    map  { $_->[2] }
    sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] }
    map  {
        my @vals = split /\s/, $_;
        my $sourceport = ( split /:/, $vals[0] )[1];
        my $destport   = ( split /:/, $vals[2] )[1];
        [ $sourceport, $destport , $_ ]
    } <DATA>
){
    print $line;
}

__DATA__
10.10.10.5:1000 -> 10.10.10.10:8000
10.10.10.5:1000 -> 10.10.10.10:8000
10.10.10.6:1000 -> 10.10.10.10:8000
10.10.10.10:8000 -> 10.10.10.5:1000
10.10.10.10:8000 -> 10.10.10.5:1000
10.10.10.6:1000 -> 10.10.10.10:8000
10.10.10.7:1000 -> 10.10.10.10:8000
10.10.10.10:8000 -> 10.10.10.5:1000
10.10.10.10:8000 -> 10.10.10.6:1000
10.10.10.6:1001 -> 10.10.10.10:8000
10.10.10.5:1001 -> 10.10.10.10:8000
10.10.10.5:1001 -> 10.10.10.10:8000
10.10.10.6:1001 -> 10.10.10.10:8000
10.10.10.10:8000 -> 10.10.10.5:1001
10.10.10.10:8000 -> 10.10.10.6:1001

I get the following output:

Update: To make it more plain, this is sorting by having all traffic from server foo (sorted by source port) first, then all replies back from server bar (sorted by destination port).

10.10.10.5:1000 -> 10.10.10.10:8000
10.10.10.5:1000 -> 10.10.10.10:8000
10.10.10.6:1000 -> 10.10.10.10:8000
10.10.10.6:1000 -> 10.10.10.10:8000
10.10.10.7:1000 -> 10.10.10.10:8000
10.10.10.6:1001 -> 10.10.10.10:8000
10.10.10.5:1001 -> 10.10.10.10:8000
10.10.10.5:1001 -> 10.10.10.10:8000
10.10.10.6:1001 -> 10.10.10.10:8000
10.10.10.10:8000 -> 10.10.10.5:1000
10.10.10.10:8000 -> 10.10.10.5:1000
10.10.10.10:8000 -> 10.10.10.5:1000
10.10.10.10:8000 -> 10.10.10.6:1000
10.10.10.10:8000 -> 10.10.10.5:1001
10.10.10.10:8000 -> 10.10.10.6:1001

I would like the output instead to be sorted like this:

Update: To make it more plain, this is sorting by having traffic from server foo (sorted by source port) immediately followed by the reply from server bar before displaying the next 'conversation' from server foo (based on the next source port).

# space added for emphasis, don't need it in the output
10.10.10.5:1000 -> 10.10.10.10:8000
10.10.10.5:1000 -> 10.10.10.10:8000
10.10.10.10:8000 -> 10.10.10.5:1000
10.10.10.10:8000 -> 10.10.10.5:1000
10.10.10.10:8000 -> 10.10.10.5:1000
10.10.10.6:1000 -> 10.10.10.10:8000
10.10.10.6:1000 -> 10.10.10.10:8000
10.10.10.10:8000 -> 10.10.10.6:1000
10.10.10.7:1000 -> 10.10.10.10:8000
10.10.10.6:1001 -> 10.10.10.10:8000
10.10.10.5:1001 -> 10.10.10.10:8000
10.10.10.5:1001 -> 10.10.10.10:8000
10.10.10.10:8000 -> 10.10.10.5:1001
10.10.10.6:1001 -> 10.10.10.10:8000
10.10.10.10:8000 -> 10.10.10.6:1001

I've tried various incarnations of the intermediary sort, using different || and && operators and even comparing $a->[0] <=> $b->[1], but all to no avail. I'm sure that there's something relatively simple that I'm just not grokking, or leaving out. So if anyone could gently nudge me in the right direction, I would greatly appreciate it.



--chargrill
$,=42;for(34,0,-3,9,-11,11,-17,7,-5){$*.=pack'c'=>$,+=$_}for(reverse s +plit//=>$* ){$%++?$ %%2?push@C,$_,$":push@c,$_,$":(push@C,$_,$")&&push@c,$"}$C[$# +C]=$/;($#C >$#c)?($ c=\@C)&&($ C=\@c):($ c=\@c)&&($C=\@C);$%=$|;for(@$c){print$_^ +$$C[$%++]}

Re: Sorting out troubles with advanced-ish sort
by merlyn (Sage) on May 06, 2006 at 16:47 UTC
    I'm not seeing the pattern in your output, but if you're looking to sort "by conversation", you'll need to come up with a function that can take a given record and give back a unique key for "a conversation". That's the first hurdle. After that, it'll be a matter of running the right sorts on that, and perhaps subsorts within the sort.

    As an idea for a "conversation", presuming you mean "packets that go either way between host:port X and host:port Y", you could develop a signature for the conversation by sorting the host:port so that the lower host:port is always to the left, and then the signature is the concatenation of the two. For example, packets from 10.1:8000 to 10.2:5000 as well as packets from 10.2:5000 to 10.1:8000 would both be labeled as "10.1:8000-10.2:5000" in the signature of the conversation, by sorting the two hosts so that the lower host/port is to the left.
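The signature idea above can be sketched as a small function. This is only an illustration, not code from the thread: the name conversation_key is hypothetical, and it assumes each record looks like "host:port -> host:port" and that a plain string comparison is good enough to decide which endpoint is "lower".

```perl
use strict;
use warnings;

# Hypothetical helper: build a direction-independent key for a packet line
# of the form "host:port -> host:port" by putting the lower endpoint first.
sub conversation_key {
    my ($line) = @_;
    my ($src, $dst) = (split ' ', $line)[0, 2];          # fields 0 and 2; field 1 is "->"
    my ($low, $high) = sort { $a cmp $b } $src, $dst;    # plain string comparison
    return "$low-$high";
}

# Both directions of the same conversation yield the same key.
print conversation_key('10.1:8000 -> 10.2:5000'), "\n";
print conversation_key('10.2:5000 -> 10.1:8000'), "\n";
```

Sorting on this key first, and on something like the source port second, keeps both directions of a conversation next to each other.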

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      Ok, so it looks like your nudge nudged me enough. The loop now looks like this:

      for my $line(
          map  { $_->[2] }
          sort { $a->[0] cmp $b->[0] || $a->[1] <=> $b->[1] }
          map  {
              my @vals = split /\s/, $_;
              my( $sourceserv, $sourceport ) = (split /:/, $vals[0])[0,1];
              my( $destserv, $destport ) = (split /:/, $vals[2])[0,1];
              my( $low, $high ) = sort { $a cmp $b } ( "$sourceserv:$sourceport", "$destserv:$destport" );
              my( $key ) = $low . '-' . $high;
              [ $key, $sourceport , $_ ]
          } <DATA>
      ){
          print $line;
      }

      And the output as desired:

      10.10.10.5:1000 -> 10.10.10.10:8000
      10.10.10.5:1000 -> 10.10.10.10:8000
      10.10.10.10:8000 -> 10.10.10.5:1000
      10.10.10.10:8000 -> 10.10.10.5:1000
      10.10.10.10:8000 -> 10.10.10.5:1000
      10.10.10.5:1001 -> 10.10.10.10:8000
      10.10.10.5:1001 -> 10.10.10.10:8000
      10.10.10.10:8000 -> 10.10.10.5:1001
      10.10.10.6:1000 -> 10.10.10.10:8000
      10.10.10.6:1000 -> 10.10.10.10:8000
      10.10.10.10:8000 -> 10.10.10.6:1000
      10.10.10.6:1001 -> 10.10.10.10:8000
      10.10.10.6:1001 -> 10.10.10.10:8000
      10.10.10.10:8000 -> 10.10.10.6:1001
      10.10.10.7:1000 -> 10.10.10.10:8000

      Is this efficient? I don't know. Will it run "fast enough"? Also don't know. But I suppose I could let it chug along while I'm here on a Saturday, and if it's still running after I leave, so be it :) (Please don't ask how large the capture file is ;)

      Thanks much!



      --chargrill
        I'd recommend cleaning the map up a bit:
        map {
            my ($src, $dst) = (split)[0,2];
            my ($src_port) = $src =~ /:(\d+)/;
            ($src, $dst) = ($dst, $src) if $dst lt $src;
            [ "$src-$dst", $src_port, $_ ]
        }
        It just looks tidier to me. By the way, what about when you get an address like 10.10.10.14 as the source? That'll sort before 10.10.10.5, so perhaps you should convert the IPs into 4-byte sequences.
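One possible way to do that conversion is with pack, sketched below. The helper name endpoint_key is hypothetical, and the sketch assumes IPv4 addresses in dotted-quad form followed by ":port". It packs the four octets as unsigned bytes and the port as a 16-bit big-endian value, so an ordinary string comparison of the packed keys follows numeric order.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "a.b.c.d:port" into a fixed-width binary string
# (4 address bytes + 2-byte big-endian port) that string-compares in numeric order.
sub endpoint_key {
    my ($endpoint) = @_;
    my ($ip, $port) = split /:/, $endpoint;
    return pack 'C4 n', split(/\./, $ip), $port;
}

# With plain text keys, 10.10.10.14 would sort before 10.10.10.5.
my @sorted = sort { endpoint_key($a) cmp endpoint_key($b) }
    qw(10.10.10.14:1000 10.10.10.5:1001 10.10.10.5:1000);
print "$_\n" for @sorted;
```

In a Schwartzian transform the packed key would be computed once in the first map rather than inside the sort block as it is here.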

        Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
        How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
        Since print operates on a list you could dispense with the

        for my $line ( map { ... } sort { ... } map { ... } <DATA>) { print $line; }

        and just do

        print map { ... } sort { ... } map { ... } <DATA>;

        without the need to assign each line before printing it. It looks a little easier to my eye.

        Cheers,

        JohnGG

Re: Sorting out troubles with advanced-ish sort
by salva (Canon) on May 06, 2006 at 20:27 UTC
    The description of the problem (especially the X part) is too vague, but it seems to me that sorting is not the best way to solve it.

    To separate the packets by connection, you can use a hash of arrays (untested):

    my %conn;
    while(<DATA>) {
        my ($src_ip, $src_port, $dest_ip, $dest_port, @more) =
            /^([\d\.]+):(\d+) -> ([\d\.]+):(\d+) ...$/;
        my $conn = $conn{join('-', $src_ip, $src_port, $dest_port, $dest_ip)} ||= [];
        # $. can be used as a sequence number:
        push @{$conn}, [$., @more]
    }

    # and then analyze the sequence of packets for every connection:
    for my $key (keys %conn) {
        my $conn = $conn{$key};
        my $conn_back = $conn{join('-', reverse split /-/, $key)} || [];
        ...
    Using the sequence numbers taken from $. you should be able to analyze the flow of packets combining the entries in $conn and $conn_back.
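A minimal runnable sketch of that idea follows, under some assumptions that are not in the post: the input is a hard-coded list of sample lines with no extra fields, a counter ($seq) stands in for $., and the "analysis" is just a hypothetical report listing which sequence numbers went out and which came back for each connection.

```perl
use strict;
use warnings;

# Sample lines in the thread's "src:port -> dst:port" format (no extra fields).
my @lines = (
    '10.10.10.5:1000 -> 10.10.10.10:8000',
    '10.10.10.6:1000 -> 10.10.10.10:8000',
    '10.10.10.10:8000 -> 10.10.10.5:1000',
);

my %conn;
my $seq = 0;    # stands in for $. when reading from a file handle
for my $line (@lines) {
    $seq++;
    my ($src_ip, $src_port, $dest_ip, $dest_port) =
        $line =~ /^([\d.]+):(\d+) -> ([\d.]+):(\d+)/ or next;
    # The field order makes the reversed key describe the opposite direction.
    push @{ $conn{ join '-', $src_ip, $src_port, $dest_port, $dest_ip } }, $seq;
}

# For every direction seen, list its packets and those of the reverse direction.
my @report;
for my $key (sort keys %conn) {
    my $back_key = join '-', reverse split /-/, $key;
    my $back     = $conn{$back_key} || [];
    push @report, "$key: out=[@{ $conn{$key} }] back=[@$back]";
}
print "$_\n" for @report;
```

A connection whose back list is empty (here the one from 10.10.10.6) is a start of conversation with no reply in the capture, which may be the kind of anomaly worth looking at.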