http://qs1969.pair.com?node_id=25833

I was recently enjoying(sic) seeing 'script kiddies' trying to scan the ports on my home box. I wrote a script to parse the packet DENY entries out of my messages log file, and I realized that I should sort the list of IPs to make them easier to read.

Thoughts of complex sorting started to form in my head: split up the quads and start comparing. I started to look for an elegant way of doing this when I realized how much simpler (and potentially more efficient) it would be to pack each IP into its 4-byte binary form. "This has to be a much, much faster way," I thought to myself.

So I did a search on Google and found this page, written by Uri Guttman and Larry Rosler. They discuss this very thing, giving one example that sorts by breaking each IP into its octets and comparing them, and another that packs the IPs into binary strings.

Here are their two examples, slightly paraphrased:

    # naive: O(N*log N)
    my @sorted = sort {
        my @a = $a =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/;
        my @b = $b =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/;
        $a[0] <=> $b[0] ||
        $a[1] <=> $b[1] ||
        $a[2] <=> $b[2] ||
        $a[3] <=> $b[3]
    } @unsorted;

    # packed: O(N*log N)
    my @sorted = sort {
        pack('C4' => $a =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/)
        cmp
        pack('C4' => $b =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/)
    } @unsorted;
What gets me is the fact that they have the same Big-O, but when benchmarked:

    Benchmark: timing 5000 iterations of naive, packed...
         naive: 11 wallclock secs ( 9.91 usr + 0.32 sys = 10.23 CPU)
        packed:  8 wallclock secs ( 7.95 usr + 0.08 sys =  8.03 CPU)
there is an obvious difference.
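
For reference, a minimal sketch of how such a comparison might be run with the Benchmark module (the random @unsorted data here is invented for illustration; the original list came from the messages log):

    use Benchmark;

    # made-up addresses to sort
    my @unsorted = map { join '.', map { int rand 256 } 1 .. 4 } 1 .. 100;

    timethese(5000, {
        naive => sub {
            my @sorted = sort {
                my @a = $a =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/;
                my @b = $b =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/;
                $a[0] <=> $b[0] || $a[1] <=> $b[1] ||
                $a[2] <=> $b[2] || $a[3] <=> $b[3]
            } @unsorted;
        },
        packed => sub {
            my @sorted = sort {
                pack('C4' => $a =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/)
                cmp
                pack('C4' => $b =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/)
            } @unsorted;
        },
    });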

To be honest, I really hated Big-O analysis in college, and I think this is the very reason why. The packed version is clearly more efficient, but only by a constant factor - and constant factors are exactly what Big-O throws away, making the two algorithms 'equal'.

I am posting this mainly because I couldn't find any good IP sorting algorithms on PM, but I was also wondering how other Monks out there feel about Big-O.

Jeff Anderson
trying to be a better programmer every day . . .

UPDATE: Fri Aug 4 09:31:30 CDT 2000
Thanks to everyone for helping me understand the error of my ways - this whole time I was trying to compare apples to oranges; no wonder I had such a hard time understanding O(n). Big thanks to gryng for such a wonderful and heart-felt explanation. You rule.

At the risk of there being 1,000 replies by the time I finish mine:
by gryng (Hermit) on Aug 03, 2000 at 02:28 UTC
    Hiya, jeffa,

    When I first learned big-O notation, I had similar woes. However, I eventually realized that the idea behind big-O is to show how a particular technique scales (and thus which parts of a technique need to be written with speed in mind). This is in contrast to answering which of two algorithms is faster.

    We could find some nicely complex examples on PM (for instance, ones involving hashes, about which I have generally just said "a little slower"), but your example is simple, yet complex enough to show most of the following.

    In your example you lament that two algorithms with the same big-O notation produced fairly different speeds. As I mentioned before, big-O notation is not meant to compare the absolute speed of different algorithms; rather, it is supposed to say something about how they scale.

    For instance, since both algorithms are N*log N, they should both scale in the same way. That means that if you double the number of items, the running time grows by a factor of (2N*log(2N)) / (N*log N) = 2*(1 + log_N 2), where log_N 2 is "log base N of 2" (sorry for the heavy use of math). For large N that factor gets close to 2, so twice the data takes about twice as long to execute. It also means that the ratio of the two algorithms' speeds stays constant: a 15% speed difference stays 15% no matter how large the dataset.

    This in itself tells us two things. First, we don't have to use a really large value of N (only large enough to cancel any "noise") to know what our speed-ups or slow-downs will cost at larger N's. Second, the only thing we need to work on is the cost of our "constant-op" - in this case, the code between the {} of the sort. All of this can be important in more complex situations, and even in this one!

    Say we hadn't heard of the idea of the Schwartzian Transform. We could use big-O to give us a hint that would lead us to the same technique:

    For instance, let's look at your packed sort code:

    my @sorted = sort {
        pack('C4' => $a =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/)
        cmp
        pack('C4' => $b =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/)
    } @unsorted;

    Since big-O told us we have to make the stuff inside the {} faster, and since it's only one comparison, we don't really have much choice but to try to remove those calls to pack. Well, how can we do that? We have to have them, or we wouldn't be comparing properly. Aha! What if the values were already packed before the sort? Then the {} wouldn't have to do it, and we should be faster!

    Well, at least we hope so, because in fact our algorithm isn't really N*log N; it's 1 + N*log N. That is, we pay a constant cost to prepare the data and hand it to our {} code.

    Luckily we just have to hand sort a pointer to the list, not copy the whole thing or do other fancy stuff, which would turn that 1 into an N or worse!

    But oops - now that we have made the {} short by removing the packs, we have changed that simple pointer hand-off into "other fancy stuff": an N-operation transform! (Well, N*M, where M is the length of an item, but since that's constant....)

    So now our total cost is N + N*log N, slightly "worse" than the previous code according to big-O. But as we have said, big-O doesn't say how fast one algorithm will be against another, only how they scale.
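
    In Perl this precompute-then-sort idea is exactly the Schwartzian Transform mentioned above. A minimal sketch for this problem (assuming @unsorted holds the dotted quads) might look like:

        # pack each IP once (N packs), sort by the packed key, then unwrap
        my @sorted = map  { $_->[0] }
                     sort { $a->[1] cmp $b->[1] }
                     map  { [ $_, pack('C4' => /(\d+)\.(\d+)\.(\d+)\.(\d+)/) ] }
                     @unsorted;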

    What is hidden here are the hidden constant costs. The original code costs N*log(N) * (x + y), and the new code costs N*x + N*log(N)*y. We have just moved the cost x (the pack) out, leaving the cost y (the cmp) inside. x is much larger than y, and since N*log N grows faster than N, we can see that by moving x out we save more, paying it only N times instead of N*log N times.

    Confusing? Sorry, but that's exactly why big-O doesn't try to take these sorts of things into account. Instead you (more likely) just say that both the new and the old code are N*log N, since that is the part of each code's big-O that grows the fastest. Then you can say that the newer code is faster, because the cost for each N is smaller (no pack). You must append this statement with the words "for large N", just to be safe though :) .

    Ok, so now you see why we don't worry about using big-O to tell which algorithm is faster (though you can use it as a tool to find out, together with more detailed information). You also now know that you can use big-O notation to find where to concentrate your efforts when optimizing code.

    Now for the finale: let's look at why the second code is faster than the first. Using big-O, we can first notice that since both have the same big-O notation, the difference must be that the code inside one {} costs more than the other.

    So let's look at them:

    my @sorted = sort {
        my @a = $a =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/;
        my @b = $b =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/;
        $a[0] <=> $b[0] ||
        $a[1] <=> $b[1] ||
        $a[2] <=> $b[2] ||
        $a[3] <=> $b[3]
    } @unsorted;
    and
    my @sorted = sort {
        pack('C4' => $a =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/)
        cmp
        pack('C4' => $b =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/)
    } @unsorted;

    Now, the first thing to notice is that both pay the same cost to translate $a and $b into a list of four numbers. However, they store the results differently. One creates the arrays @a and @b; the other sends the results directly to a function.

    This is one place for a speed increase: the first code can be made faster by saying

        my $aref = [ $a =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/ ];

    Instead of allocating a named array and copying the match results into it, we allocate only a reference to an anonymous array holding them.
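
    A sketch of the whole comparator rewritten that way (same behavior, just arrayrefs instead of named arrays):

        my @sorted = sort {
            # keep the captures in anonymous arrays instead of copying into @a/@b
            my $aref = [ $a =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/ ];
            my $bref = [ $b =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/ ];
            $aref->[0] <=> $bref->[0] ||
            $aref->[1] <=> $bref->[1] ||
            $aref->[2] <=> $bref->[2] ||
            $aref->[3] <=> $bref->[3]
        } @unsorted;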

    The next place to look is the cmp vs. the <=>'s and ||'s. Since both create the whole list at once (using the regex), the short-circuit logic of the cmp (between bytes) and of the <=>'s and ||'s only saves you at most four more comparisons (and no data processing), so we can conclude up front that the savings here will probably be small. Looking further, we see that cmp is a single Perl op, whereas the <=>'s and ||'s are several combined, so the cmp is probably faster in this case (but again, not by much).

    The next thing is very subtle: the first code uses <=>, which forces the strings to be converted to numbers before they are compared. This means the <=> will probably be slower than a cmp (of course, simply switching to cmp would make the code incorrect, since as strings "10" sorts before "9"). These string-to-number conversions may or may not cost as much as the pack() operation itself (not counting the cost of the call).

    The last things to look at are what's left: the array lookups and the pack() call. The pack call has to be costly to make, but the operation itself is fast (and written in C). Array lookups are pretty much as fast as you can get, so the first code wins on this point.

    The conclusion is that you have to decide whether memory allocation and deallocation will cost more than two function calls. I would personally bet on the memory allocation being slower. At worst we have to ask the system itself for the memory, which can be very costly, even if it isn't likely. At best we have to ask perl for the memory, which probably costs at least half as much as a builtin like pack(). And then we have to deallocate that same memory, at approximately the same cost again.

    Now to test out some modifications:

    ... or not. My heavily loaded SMP 2.4.0-test kernel machine mixed with the Benchmark module does not seem to produce good system times (which may be part of my problem with random crashes) - I just got back negative times after waiting about a minute on a benchmark that was only supposed to run for 1 CPU second :P . Anyway, just take my word for it! lol

    Ciao,
    Gryn

    Does the hippo have the buddha nature?

RE: Sorting a list of IP addresses (aka Why I hate Big O)
by dws (Chancellor) on Aug 03, 2000 at 04:05 UTC
    It doesn't change the O'ness, but this looks like a job for the Orcish maneuver.
    my %cache = ();
    my @sorted = sort {
            # '||=' computes and caches each packed key the first time it is needed
            $cache{$a} ||= pack('C4' => $a =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/);
            $cache{$b} ||= pack('C4' => $b =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/);
            $cache{$a} cmp $cache{$b};
    } @unsorted;
RE: Sorting a list of IP addresses (aka Why I hate Big O)
by eduardo (Curate) on Aug 03, 2000 at 01:32 UTC
    Alright. What did big O teach you... did it tell you how fast something was going to run? No. Big O is a measure of asymptotic complexity... the entire theory of Big O is: for N data items, proportionally, how many steps M will I have to take? What it is designed to tell you is this: if an algorithm with a Big O of N took 10 seconds to do 10 items, then when we increase the number of items to 20, it will probably take somewhere near 20 seconds.

    Comparing two algorithms that have a similar Big O, although it might seem like an intuitive thing to do, is really comparing apples and oranges. Remember, and this should be your mantra: "we are not comparing runtimes, but asymptotic complexity." Let us say that you have two algorithms with the same Big O, say O(N), and data of size M for them to work on. Let us say that algorithm 1 takes time X to accomplish its task, and algorithm 2 takes time Y. All that Big O is telling you is that for M*2 data, algorithm 1 will take about X*2 and algorithm 2 about Y*2! It is their rate of CHANGE you are looking at (god, is that the 1st derivative?), not their actual values.

    Eek... i think that's right... anyone wanna jump in on this one with me?

      Yes, i will jump in with you, enthusiastically. Thank you for the cogent explanation.

      However, I wouldn't necessarily say that big O notation is a measure of the rate of change. (It is, sort of, but I don't think that's the best way to think of it.)

      Another way of thinking of it is this: take two algorithms, algorithm A, which is O(N*log N), and algorithm B, which is O(N^2). Now let's say that the execution time for algorithm A is 100 seconds when N=1000, and the execution time for algorithm B is 1 second when N=1000. At first glance B appears faster than A, and it certainly is for N=1000. But what Big O tells us is that no matter what execution times we measure for A and B at any particular value of N, A will be faster than B for all sufficiently large N.

      This is essentially what the theorem behind big O states: an algorithm with a 'smaller' big O will run faster than an algorithm with a 'larger' big O for all N > X, where X is finite but may be arbitrarily large.

      B may run 10^100 times faster than A at N = 10^1000, but since A has a 'smaller' big O, there is always some N beyond which A runs faster than B.
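
      To make that crossover concrete, here is a small sketch that hunts for it numerically. The per-step constants (100 for A, 1 for B) are invented purely for illustration:

          # hypothetical cost models: A is 100x slower per step but is O(N*log N)
          my $cost_a = sub { my $n = shift; 100 * $n * log($n) };
          my $cost_b = sub { my $n = shift; $n * $n };

          for (my $n = 2; $n <= 1_000_000; $n *= 2) {
              if ($cost_a->($n) < $cost_b->($n)) {
                  print "A overtakes B somewhere at or below N = $n\n";
                  last;
              }
          }

      With these made-up constants the loop reports N = 1024 (the exact crossover is a bit below that).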

      Okay, i'll stop rambling.
      -Mark

RE: Sorting a list of IP addresses (aka Why I hate Big O)
by DrManhattan (Chaplain) on Aug 03, 2000 at 16:25 UTC

    Read further down in the Guttman/Rosler paper to where they discuss the "packed-default" sort:

    @out = map substr($_, 4) =>
           sort
           map pack('C4' => /(\d+)\.(\d+)\.(\d+)\.(\d+)/) . $_ =>
           @in;

    The idea is to pack the entire list once before sorting it, then get each address back out with a substr afterwards. Using this technique you make only N packs and N substrs, instead of N*log(N) packs. It also has the advantage of letting sort run without calling a subroutine for each comparison, which is a huge speedup.
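
    A self-contained usage sketch of the same technique (the sample addresses are invented):

        # packed-default sort: prepend a 4-byte sort key, sort as plain
        # strings, then strip the key back off
        my @in  = qw(10.0.0.2 192.168.1.1 10.0.0.10 172.16.0.1);
        my @out = map substr($_, 4) =>
                  sort
                  map pack('C4' => /(\d+)\.(\d+)\.(\d+)\.(\d+)/) . $_ =>
                  @in;
        print "$_\n" for @out;  # 10.0.0.2 10.0.0.10 172.16.0.1 192.168.1.1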

    -Matt

RE: Sorting a list of IP addresses (aka Why I hate Big O)
by ferrency (Deacon) on Aug 03, 2000 at 00:55 UTC
    I think O() analysis definitely has its uses. It's good for determining "all else being equal, which algorithm will be faster?" It doesn't say anything about implementations. I'm pretty sure I could write a poor implementation of an O(n log n) algorithm that runs slower than a good implementation of an O(n^2) algorithm, for sufficiently small datasets. (How small? Small enough that the O(n log n) implementation is slower :) .) But in general, the O(n log n) solution will be faster as your dataset grows in size, and as you tweak the "all else" in your implementation to get it as close to "equal" as possible.

    I wasn't a huge fan of O() notation in college either, once we started analyzing the more complicated algorithms - not because it was inaccurate, but because it was difficult, and "no one uses those algorithms anyway." Now, however, I'm glad I did it, because after doing all the hard analysis, the easy analysis that I do use regularly (though not explicitly) comes intuitively.

    Alan

RE: Sorting a list of IP addresses (aka Why I hate Big O)
by plaid (Chaplain) on Aug 03, 2000 at 01:45 UTC
    Big-O notation doesn't work in so simple a way that you can take the complexities of two algorithms and compare them directly like that. Big-O notation relies heavily on exactly what 'N' is. That two algorithms are both O(N) or both O(N^2) doesn't mean a thing on its own, because in one algorithm the work done per N could be a large, time-consuming mathematical calculation, while in the other it could be a simple regular expression.

    The power of Big-O notation is being able to roughly predict how an algorithm will behave on different sizes of datasets, and then taking that information and tailoring your algorithm to get the best performance on your average-sized dataset. If one algorithm is O(N) and the other is O(N^2), the latter may still be the better choice in some cases - if the former has a much larger per-N cost, and you can ensure that there won't be enough data to negate it.

    Basically, it all boils down to needing much more testing than you did. Good efficiency takes many steps to reach. If you ran more benchmarks with different numbers of IP addresses, algorithms of different complexity, and so on, you'd start to see how Big-O notation helps you predict the results of future tests.

      There is a very interesting point here about Perl and algorithmic efficiency. Perl makes it very natural to use the highly optimized pattern-matching algorithms of the regular expression engine, and a good hashing algorithm.

      C, while much faster, makes it far more natural to repeatedly scan lists rather than store a key in a hash and just check exists.

      The result is that the same person will, in many circumstances, find that their Perl code is not only easier to write and read, but outperforms what they would have written in C. Sure, if they wrote their C carefully they would beat Perl every time. But natural Perl idioms tend to be better algorithmically than the natural C approaches, and algorithmic efficiency trumps raw performance.

      (This does not hold true for all programmers, YMMV, etc. But it is surprising how often it is true.)
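
      To illustrate with the classic case - "have I seen this value before?" - here is a sketch (the @items data is invented). The hash idiom does O(1) average work per item; the transliterated C habit rescans the list each time:

          my @items = qw(foo bar foo baz bar foo);

          # natural Perl: one hash lookup per item
          my %seen;
          my @unique = grep { !$seen{$_}++ } @items;

          # the natural C approach, transliterated: scan the results so far
          my @unique_slow;
          for my $item (@items) {
              push @unique_slow, $item
                  unless grep { $_ eq $item } @unique_slow;
          }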

RE: Sorting a list of IP addresses (aka Why I hate Big O)
by Maqs (Deacon) on Aug 03, 2000 at 22:01 UTC
    I'd try another solution :)
    Why not read all the IP addresses and convert each octet to binary with:

        unpack "B8", pack "C", $value;

    getting rid of the '.'s (because now they play no role), write these addresses to a text file, and then sort them by exec'ing the external 'sort'? Then convert them back to decimal, '.'-separated... and voilà.
    I haven't tested that, but it seems to me it would be quicker.
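
    A sketch of what that might look like (untested here too, per the above; File::Temp and the external sort(1) are assumed to be available):

        use File::Temp qw(tempfile);

        # write each address as a fixed-width 32-bit binary string
        my ($fh, $file) = tempfile();
        for my $ip (@unsorted) {
            print $fh join('', map { unpack 'B8', pack 'C', $_ } split /\./, $ip), "\n";
        }
        close $fh;

        # let the external sort order the fixed-width lines, then convert back
        my @sorted = map {
            join '.', map { unpack 'C', pack 'B8', $_ } /(.{8})(.{8})(.{8})(.{8})/;
        } `sort $file`;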

    /Maqs.
Re: Sorting a list of IP addresses (aka Why I hate Big O)
by Anonymous Monk on Dec 19, 2000 at 16:43 UTC