in reply to Sorting out troubles with advanced-ish sort

I'm not seeing the pattern in your output, but if you're looking to sort "by conversation", you'll need to come up with a function that can take a given record and give back a unique key for "a conversation". That's the first hurdle. After that, it'll be a matter of running the right sorts on that, and perhaps subsorts within the sort.

As an idea for a "conversation", presuming you mean "packets that go either way between host:port X and host:port Y", you could develop a signature for the conversation by sorting the host:port so that the lower host:port is always to the left, and then the signature is the concatenation of the two. For example, packets from 10.1:8000 to 10.2:5000 as well as packets from 10.2:5000 to 10.1:8000 would both be labeled as "10.1:8000-10.2:5000" in the signature of the conversation, by sorting the two hosts so that the lower host/port is to the left.
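
A minimal sketch of that signature idea, assuming each endpoint is already available as a "host:port" string and that plain string comparison is good enough to pick the "lower" one. The name conversation_key is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: build a direction-independent key for a conversation.
# Both endpoints are "host:port" strings and are compared as plain strings.
sub conversation_key {
    my ($from, $to) = @_;
    my ($low, $high) = sort { $a cmp $b } ($from, $to);
    return "$low-$high";
}

# Both directions of the same conversation produce the same key.
print conversation_key('10.1:8000', '10.2:5000'), "\n";
print conversation_key('10.2:5000', '10.1:8000'), "\n";
```

Both calls print 10.1:8000-10.2:5000, so records from either direction can be grouped on that one key.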

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.

Re^2: Sorting out troubles with advanced-ish sort
by chargrill (Parson) on May 06, 2006 at 17:28 UTC

    Ok, so it looks like your nudge nudged me enough. The loop now looks like this:

    for my $line (
        map  { $_->[2] }
        sort { $a->[0] cmp $b->[0] || $a->[1] <=> $b->[1] }
        map  {
            my @vals = split /\s/, $_;
            my( $sourceserv, $sourceport ) = (split /:/, $vals[0])[0,1];
            my( $destserv, $destport )     = (split /:/, $vals[2])[0,1];
            my( $low, $high ) = sort { $a cmp $b }
                ( "$sourceserv:$sourceport", "$destserv:$destport" );
            my( $key ) = $low . '-' . $high;
            [ $key, $sourceport, $_ ]
        } <DATA>
    ){
        print $line;
    }

    And the output as desired:

    10.10.10.5:1000 -> 10.10.10.10:8000
    10.10.10.5:1000 -> 10.10.10.10:8000
    10.10.10.10:8000 -> 10.10.10.5:1000
    10.10.10.10:8000 -> 10.10.10.5:1000
    10.10.10.10:8000 -> 10.10.10.5:1000
    10.10.10.5:1001 -> 10.10.10.10:8000
    10.10.10.5:1001 -> 10.10.10.10:8000
    10.10.10.10:8000 -> 10.10.10.5:1001
    10.10.10.6:1000 -> 10.10.10.10:8000
    10.10.10.6:1000 -> 10.10.10.10:8000
    10.10.10.10:8000 -> 10.10.10.6:1000
    10.10.10.6:1001 -> 10.10.10.10:8000
    10.10.10.6:1001 -> 10.10.10.10:8000
    10.10.10.10:8000 -> 10.10.10.6:1001
    10.10.10.7:1000 -> 10.10.10.10:8000

    Is this efficient? I don't know. Will it run "fast enough"? Also don't know. But I suppose I could let it chug along while I'm here on a Saturday, and if it's still running after I leave, so be it :) (Please don't ask how large the capture file is ;)

    Thanks much!



    --chargrill
    $,=42;for(34,0,-3,9,-11,11,-17,7,-5){$*.=pack'c'=>$,+=$_}for(reverse s +plit//=>$* ){$%++?$ %%2?push@C,$_,$":push@c,$_,$":(push@C,$_,$")&&push@c,$"}$C[$# +C]=$/;($#C >$#c)?($ c=\@C)&&($ C=\@c):($ c=\@c)&&($C=\@C);$%=$|;for(@$c){print$_^ +$$C[$%++]}
      I'd recommend cleaning the map up a bit:
      map {
          my ($src, $dst) = (split)[0,2];
          my ($src_port) = $src =~ /:(\d+)/;
          ($src, $dst) = ($dst, $src) if $dst lt $src;
          [ "$src-$dst", $src_port, $_ ]
      }
      It just looks tidier to me. By the way, what about when you get an address like 10.10.10.14 as the source? That'll sort before 10.10.10.5, so perhaps you should convert the IPs into 4-byte sequences.

      Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
      How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
        By the way, what about when you get an address like 10.10.10.14 as the source? That'll sort before 10.10.10.5, so perhaps you should convert the IPs into 4-byte sequences.

        When packed, the strings will be shorter and we could use the default { $a cmp $b } compare function. That makes the packed version much faster, and it would require less memory. Note the ingenious use of a trailing 0 or 1 (recording whether the two endpoints were swapped) in lieu of $src.

        print
            map { substr($_, 13) }
            sort
            map {
                my ($src, $dst) = (split)[0, 2];
                $src = pack('C4n', split(/[.:]/, $src));
                $dst = pack('C4n', split(/[.:]/, $dst));
                $src lt $dst ? "$src${dst}0$_" : "$dst${src}1$_"
            } <DATA>;

        Actually, adding $_ is redundant, since we can reconstruct it. The following would cut the memory usage in half:

        print
            map {
                my @f = unpack('C4nC4na', $_);
                my $src = "$f[0].$f[1].$f[2].$f[3]:$f[4]";
                my $dst = "$f[5].$f[6].$f[7].$f[8]:$f[9]";
                ($src, $dst) = ($dst, $src) if $f[10];
                "$src -> $dst\n"
            }
            sort
            map {
                my ($src, $dst) = (split)[0, 2];
                $src = pack('C4n', split(/[.:]/, $src));
                $dst = pack('C4n', split(/[.:]/, $dst));
                $src lt $dst ? "$src${dst}0" : "$dst${src}1"
            } <DATA>;

        Both of the above have been tested. They only work if the addresses are IPv4 addresses in dotted form.
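
        To see why the packed keys fix the 10.10.10.14 versus 10.10.10.5 ordering, here is a small sketch. The helper name packed_endpoint is only for illustration, and like the code above it assumes dotted IPv4 "a.b.c.d:port" input:

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: pack "a.b.c.d:port" into 6 bytes (4 address bytes
        # plus a 16-bit big-endian port), so string comparison follows numeric order.
        sub packed_endpoint {
            my ($endpoint) = @_;
            return pack('C4n', split(/[.:]/, $endpoint));
        }

        my ($five, $fourteen) = ('10.10.10.5:1000', '10.10.10.14:1000');

        # As text, "10.10.10.14" sorts before "10.10.10.5" because '1' lt '5'.
        print "text:   ", ($fourteen lt $five ? "14 first" : "5 first"), "\n";

        # Packed, the last octet is compared as a single byte (5 vs 14).
        print "packed: ", (packed_endpoint($fourteen) lt packed_endpoint($five)
                           ? "14 first" : "5 first"), "\n";
        ```

        The text comparison reports "14 first" and the packed comparison reports "5 first". Each packed endpoint is 6 bytes, so two endpoints plus the flag give the 13 bytes skipped by substr($_, 13) above.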

        By the way, if you have memory problems, you could use an external sort tool as follows:

        {
            # Write one hex-encoded sort key per line for the external tool.
            open(local *TEMP, '>', $sort_input)
                or die "Cannot write $sort_input: $!";
            while (<DATA>) {
                my ($src, $dst) = (split)[0, 2];
                $src = pack('C4n', split(/[.:]/, $src));
                $dst = pack('C4n', split(/[.:]/, $dst));
                my $data = $src lt $dst ? "$src${dst}0" : "$dst${src}1";
                print TEMP (unpack('H*', $data), "\n");
            }
        }

        ...[ call external sort tool ]...

        {
            # Decode each sorted key back into the original record.
            open(local *TEMP, '<', $sort_output)
                or die "Cannot read $sort_output: $!";
            while (<TEMP>) {
                chomp;
                my @f = unpack('C4nC4na', pack('H*', $_));
                my $src = "$f[0].$f[1].$f[2].$f[3]:$f[4]";
                my $dst = "$f[5].$f[6].$f[7].$f[8]:$f[9]";
                ($src, $dst) = ($dst, $src) if $f[10];
                print("$src -> $dst\n");
            }
        }

        The conversion to hex is to avoid having newlines in your data.
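
        A quick illustration of the newline problem, as a sketch only (hex_key is a made-up name). In a 10.x.x.x address every octet 10 packs to byte 0x0A, which is the newline character, so the raw keys would confuse a line-based sort tool. The hex form has no such bytes and still sorts in the same order:

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: pack an "a.b.c.d:port" endpoint and hex-encode it.
        sub hex_key {
            my ($endpoint) = @_;
            return unpack('H*', pack('C4n', split(/[.:]/, $endpoint)));
        }

        # The raw packed form of a 10.x address contains newline bytes.
        my $raw = pack('C4n', split(/[.:]/, '10.10.10.14:8000'));
        print "raw key contains newline: ", ($raw =~ /\n/ ? "yes" : "no"), "\n";

        # The hex keys are plain text and sort the same way the raw bytes would.
        print "$_\n" for sort map { hex_key($_) }
            ('10.10.10.14:8000', '10.10.10.5:1000');
        ```

        This prints "yes" for the raw key, then 0a0a0a0503e8 followed by 0a0a0a0e1f40, so the .5 endpoint still sorts ahead of .14.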

      Since print operates on a list you could dispense with the

      for my $line ( map { ... } sort { ... } map { ... } <DATA>) { print $line; }

      and just do

      print map { ... } sort { ... } map { ... } <DATA>;

      without the need to assign each line before printing it. It looks a little easier to my eye.

      Cheers,

      JohnGG