in reply to Re: *Fastest* way to print a hash sorted by value
in thread *Fastest* way to print a hash sorted by value

I'm grabbing the data from our firewall logs. I figure there isn't a way around doing one pass through the logs before I can process them.
I.e., how many times does IP 10.0.0.10 go to 192.168.0.10?
I still have to figure out what the theoretical maximum for processing all of the data is (i.e. just read the file and do nothing).
I'm starting to think there isn't much room for me to speed my script up.
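Something along these lines might do for that baseline test (untested; the log file name is made up):

    use Time::HiRes qw(time);

    my $start = time;
    my $lines = 0;
    open my $fh, '<', 'fw.log' or die "open: $!";
    $lines++ while <$fh>;
    close $fh;
    printf "read %d lines in %.2f seconds\n", $lines, time - $start;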

Re^3: *Fastest* way to print a hash sorted by value (less Perl)
by tye (Sage) on Aug 08, 2003 at 23:13 UTC

    You might try using external programs that are optimized to deal with such situations efficiently. I wish standard 'sort' knew how to include a count when doing 'sort -u' (like 'uniq -c' does). But it might be worth trying to have your Perl script just extract the IP addresses and output "10.0.0.10 192.168.0.10\n" via a pipe to "sort | uniq -c | sort -rn | head -$N".
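    Something like this untested sketch, where the regex is only a placeholder for whatever the real log format needs and $N is however many top pairs you want:

        my $N = 20;
        open my $pipe, "| sort | uniq -c | sort -rn | head -$N"
            or die "pipe: $!";
        while (<>) {
            # placeholder pattern: first two dotted quads on the line
            my ($from, $to) = /(\d+\.\d+\.\d+\.\d+)\D+(\d+\.\d+\.\d+\.\d+)/
                or next;
            print $pipe "$from $to\n";
        }
        close $pipe;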

                    - tye

      It may be overkill or underkill, but if the statistics you need are contained in the TCP/UDP header (srcip, dstip, srcport, dstport, length, flags, ...), say for counting connections, packets, data, or protocols between hosts, then use flow-tools and Cflow.

      Parse the firewall logs with Perl, using the Cflow module to convert them to netflow format; that holds the data you find in the IP headers, and some more. Then use flow-tools to process them. There is a learning curve, but it's worth it.

      Take the simplest Perl version of, say, counting the number of packets and bytes transferred between a given pair of hosts, over enough time/data that the script takes 1 minute to run; the equivalent flow-tools command will do the same in 2 seconds. It does nice text-based reports, scan detection, and powerful filtering, and there's excellent support on the mailing list.

      If you're counting based on the data in the packet, like specific intrusion attempt types, then flow-tools won't do much good. If you have lots of header-type data, flow-tools rocks.
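      A rough, untested sketch of the Cflow reading side (the $srcip, $dstip, and $bytes flow variables are the ones Cflow exports, and @ARGV is assumed to hold the raw flow files):

          use Cflow qw(:flowvars find);

          my %total;
          # called once per flow record
          sub wanted {
              $total{"$srcip -> $dstip"} += $bytes;
          }
          find(\&wanted, @ARGV);

          # biggest talkers first
          for my $pair (sort { $total{$b} <=> $total{$a} } keys %total) {
              print "$total{$pair}\t$pair\n";
          }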

Re: Re: Re: *Fastest* way to print a hash sorted by value
by esh (Pilgrim) on Aug 08, 2003 at 23:23 UTC
    I know this is a Perl forum, but sometimes Perl by itself is not the best tool for the job. You might consider testing the speed of using another program like sort(1) on the file before processing it with Perl (or just run sort from your Perl program).

    Check out the helpful advice in this thread: Fast/Efficient Sort for Large Files

    Then, you can read in the lines until you have the top N values and you're done.

    If your log files are not easily sortable using sort(1) because of strange formatting and delimiters (like web server access logs), you may pre-process them (with Perl, of course) so that they have nice column separators for sort to recognize.
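    As a hedged sketch of that idea, assuming the counts already live in a hash (%count, $N, and the file name here are made up for illustration):

        my %count;   # assume this was filled while reading the log
        my $N = 10;

        open my $out, '>', 'counts.txt' or die "open: $!";
        print $out "$count{$_} $_\n" for keys %count;
        close $out;

        # let sort(1) do the heavy lifting, then read back only the first $N lines
        open my $in, "sort -rn counts.txt |" or die "sort: $!";
        my @top;
        while (<$in>) {
            push @top, $_;
            last if @top == $N;
        }
        close $in;
        print @top;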

Re: Re: Re: *Fastest* way to print a hash sorted by value
by sfink (Deacon) on Aug 09, 2003 at 05:11 UTC
    At least those can be compressed well. Let's see... something like:
    my %occurrences;
    while (<>) {
        # Extract the IP addresses
        my ($from_ip, $to_ip) = /...some regex.../ or next;
        # Losslessly compress them into an 8-byte key
        my $key = pack("C8", split(/\./, "$from_ip.$to_ip"));
        $occurrences{$key}++;
    }

    # Now convert the counts to small, easily sorted things
    my @data;
    while (my ($key, $count) = each %occurrences) {
        # Use network byte-order (big-endian) because it
        # will sort alphabetically in numeric order
        push @data, pack("N", $count) . $key;

        # To save space, shrink the hash. Note that it is
        # normally unsafe to modify the set of entries in
        # a hash while iterating over them, but deleting
        # the current entry returned by 'each' is explicitly
        # allowed (perldoc -f each).
        delete $occurrences{$key};
    }

    # Now we have a simple array of (packed) counts and IP
    # pairs. The easiest thing to implement would then be:
    return (reverse sort @data)[0..$N-1];

    # A trickier implementation that would avoid the superlinear
    # expense of sorting would be to grab a random element and
    # partition the array into a group of elements that are
    # greater than that element, and a group of those that are
    # smaller. As soon as the size of the group that is greater
    # exceeds $N, throw out the other group and just finish
    # creating the group of larger elements. If you're unlucky
    # and your "larger" group never gets up to $N elements,
    # start over again with a smaller pivot element.
    #
    # Repeat until you are left with a more manageable size,
    # and then just do the sort.
    while (@data > 10000) {   # Assuming $N < 10000
        my (@big, @small);

        # The first element may be "random" enough, since it's
        # in perl's hash order. But you really probably ought
        # to pick a better one, just in case.
        my $pivot = shift(@data);
        while (@data) {
            my $elt = shift(@data);
            ($elt gt $pivot) ? push(@big, $elt) : push(@small, $elt);
            if (@big >= $N) {
                while (@data) {
                    $elt = shift(@data);
                    push(@big, $elt) if $elt gt $pivot;
                }
                last;
            }
        }
        if (@big >= $N) {
            @data = @big;
        }
        else {
            # Put the pivot at the end so we don't reuse it!
            push @data, @big, @small, $pivot;
        }
    }
    Unpacking the result is left as an exercise for the reader. I didn't test this all that much, but it seems to more or less work, and the idea is the main thing.
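    For what it's worth, the unpacking might look something like this (assuming @data and $N from the code above):

        my @top = (reverse sort @data)[0 .. $N - 1];
        for my $rec (@top) {
            my ($count, @octets) = unpack("N C8", $rec);
            my $from = join '.', @octets[0 .. 3];
            my $to   = join '.', @octets[4 .. 7];
            print "$count\t$from -> $to\n";
        }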