in reply to Re^3: Some code optimization
in thread Some code optimization

Let's look at a single iteration and define n as the number of genes. A gene is a start point plus a length, so n is the number of elements pushed during the creation of $simulation_h (n = 70,000 in the example).
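For concreteness, here is a minimal sketch of how such a hash might be populated. The actual layout of $simulation_h isn't shown in this thread, so the chr1 key, the field names, and the chromosome length are all hypothetical:

    # Hypothetical sketch -- the real layout of $simulation_h may differ.
    my $chr_len      = 4_600_000;    # assumed circular chromosome length
    my $simulation_h = {};

    for ( 1 .. 70_000 ) {            # n = 70,000 genes in the example
        push @{ $simulation_h->{chr1} },
            { start => int rand $chr_len, length => 500 + int rand 1_000 };
    }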

The main loop is indeed a nested loop, but it simply fetches one gene at a time from the hash, so the double loop by itself takes O(n) (before doing anything).
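Schematically, the traversal looks like this sketch (the keys and structure are assumptions, not the actual code): an outer loop over chromosomes plus an inner loop over that chromosome's genes still visits each gene exactly once.

    sub process_gene { }    # placeholder for the real per-gene work

    # Nested in form, but each gene is visited exactly once, so the
    # traversal itself is O(n) overall.
    for my $chr ( keys %$simulation_h ) {
        for my $gene ( @{ $simulation_h->{$chr} } ) {
            process_gene( $chr, $gene );
        }
    }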

Now, let's look at what is done inside the loop. We call two "problematic" subroutines, which in turn call some other subroutines. The "loops" you see in those other subroutines handle the case of split genes (a gene whose start is too close to the "end" of a circular chromosome has to be represented by two ranges instead of one). A range is therefore either a singleton (one array element) or "split" (two array elements).
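As a hedged illustration of the split-range idea (this mirrors what the range-building code presumably does; gene_to_ranges is a hypothetical name, not the actual subroutine):

    # Map a gene (start + length) on a circular chromosome of size $chr_len
    # to one or two linear [from, to] ranges (positions 0 .. $chr_len - 1).
    sub gene_to_ranges {
        my ( $start, $length, $chr_len ) = @_;
        my $end = $start + $length - 1;
        return [ [ $start, $end ] ] if $end < $chr_len;               # singleton
        return [ [ $start, $chr_len - 1 ], [ 0, $end - $chr_len ] ];  # split
    }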

So, yes, there are loops, but they certainly don't add another factor of n to the complexity, since each of them is bounded by a small constant -- at most 2 iterations -- so even the double loop that checks a pair of ranges against each other (each range "split" in the worst-case scenario) runs at most 4 times.
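In code, that pairwise check is roughly the following sketch (ranges_overlap is a hypothetical helper; each argument holds at most two [from, to] pairs):

    # Each range set has at most 2 elements, so this "double loop" runs
    # at most 2 x 2 = 4 times -- a constant, independent of n.
    sub ranges_overlap {
        my ( $ranges_a, $ranges_b ) = @_;
        for my $ra (@$ranges_a) {
            for my $rb (@$ranges_b) {
                return 1 if $ra->[0] <= $rb->[1] && $rb->[0] <= $ra->[1];
            }
        }
        return 0;
    }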

To conclude, the complexity is actually O(n) per iteration (again, n is the number of genes).

Re^5: Some code optimization
by davido (Cardinal) on Jun 18, 2010 at 08:52 UTC

    So if the outer loop over n is the only 'big loop', try putting a broadly scoped counter inside each of the 'little loops' (a unique counter for each), and prove to yourself that they're not running away by mistake. That's just one idea. Another is to check how many times you're calling that grep and the sort, and on how many elements; each of those is an implicit loop.

    The idea is to test your theory that the complexity is O(n); in testing, you may discover that something is not what it seems. I'm not suggesting that your theory is wrong, just that the implementation may not match the theory.

    If I were hunting down the problem myself, I would be very interested in how many iterations each of the loops runs through. Using a counter variable for each loop, scoped broadly so it doesn't reset, and then checking it at the end of the run, may smoke out the mole.
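    A minimal sketch of that instrumentation, assuming a stand-in subroutine (intersect_ranges and the loop body are placeholders, not the actual code):

        my %loop_count;    # broad scope: persists across all calls

        sub intersect_ranges {    # hypothetical stand-in for the real sub
            my ( $ranges_a, $ranges_b ) = @_;
            for my $ra (@$ranges_a) {
                for my $rb (@$ranges_b) {
                    $loop_count{inner}++;    # one tick per inner iteration
                    # ... the real intersection work goes here ...
                }
            }
        }

        END { printf "inner loop: %d iterations\n", $loop_count{inner} || 0 }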


    Dave

      I added a global counter in the innermost loop in intersect_legal_ranges. The counter value at the end is, as expected, equal to the number of iterations * n, plus some small overhead for the ranges that were randomly created as "split". E.g. for 10 iterations and 70,000 genes I got 702,810 (i.e. 70,281 per iteration when n = 70,000).

      Also note that the slowdown occurs even when the second problematic subroutine (is_contained) is not called at all (i.e. with the next statement before it turned on). The first problematic subroutine (gene_to_legal_range) does not use loops at all and is very simple (I hope you agree).