in reply to Re^3: Can I speed this up? (repetitively scanning ranges in a large array)
in thread Can I speed this up? (repetitively scanning ranges in a large array)

That's possible, but a bit confusing. In all biological databases, a circular genome of length 100 does have coordinate 100. It doesn't have coordinate zero. So, when I print a range that spans to the end of the genome, I must print it as x..100, not x..0, as this is the convention. So this representation might have some benefits, but it also brings the overhead of remembering to switch back to 'biological' coordinates whenever you read input or write output...

Also note not all genomes are circular. What would you do about those? If you put the first coordinate in the first position of the array (arr[0]), you will surely have to subtract 1 any time you output a coordinate. If you start from arr[1] you do what I currently do, but now you treat circular genomes and linear genomes differently any time you print or even calculate their length (if the genome is circular scalar(@arr) == genome size, but if it's linear scalar(@arr) == genome size + 1). Confusing...

Anyway, I must admit I'm not sure why are we focusing on this... that's really not the issue. Finally, note that in biology circles do have a start :) For each circular genome, a certain point was selected as '1'. This choice is actually not completely arbitrary - there are some rules for deciding where to place this landmark. Once a genome has been sequenced and published it has one and only one '1', and any reference to this genome will be relative to this landmark. Just for general knowledge...

Re^5: Can I speed this up? (repetitively scanning ranges in a large array)
by ikegami (Patriarch) on Nov 02, 2010 at 22:52 UTC

    does have coordinate 100.

    Wait, are you suggesting using the 2nd to 101st index for coordinates 1 to 100? And here I thought the alternative would have been to use the 1st index to the 100th index for coordinates 1 to 100. That's even worse. You can't use the mod at all even though the situation calls for it.

    not all genomes are circular. What would you do about those?

    It's not relevant, but I'd use the first index of the array for the first element of the genome. I'd personally use that for circular genomes too. What I did was propose a compromise.

    (if the genome is circular scalar(@arr) == genome size, but if it's linear scalar(@arr) == genome size + 1). Confusing...

    Exactly. The size of the genome (@genome) should be the size of the genome.

    Anyway, I must admit I'm not sure why are we focusing on this.

    To point out how silly it is to be focusing on such.

    But it is pertinent, as choosing a bad coordinate system will just add operations, slowing things down.

      Of course I can use the modulo operator (and I sure do). Anyway, this is a matter of taste. I originally used something like you suggested but later found it more convenient to have $genome[$i] always refer to genomic position $i, regardless of whether the genome is linear or circular (I have both kinds).

      I don't think I pay any performance price and it's more straightforward for me to be consistent with the biological conventions.
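The layout described here can be pictured with a minimal sketch. The variable names and the size of 100 are assumptions for illustration, not the poster's actual code; the only point is that leaving index 0 unused makes the array index equal the biological coordinate, at the cost of the array holding one more element than the genome has positions.

```perl
use strict;
use warnings;

# Hypothetical layout: index 0 is an unused placeholder, so that
# $genome[$i] holds the value for genomic position $i (1-based).
my $genome_size = 100;
my @genome = (undef, (0) x $genome_size);

my $last_position = $#genome;          # highest valid index, equals the genome size
my $element_count = scalar(@genome);   # one more than the genome size

print "last position: $last_position\n";
print "array elements: $element_count\n";
```

With this layout the genome length is `$#genome` rather than `scalar(@genome)`, which is the mismatch being argued about in this thread.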

        Of course I can use the modulo operator (and I sure do).

        I suppose you're right. It'll just look like

        (x - 1) % size + 1
        instead of
        x % size

        I don't think I pay any performance price

        Well, there are more ops, but the real cost is the added complexity and lost clarity that come from working with off-by-one numbers (first element of genome in second element of array).
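The two wrap-around expressions compared above can be tried side by side. This is only a sketch: the function names are made up for illustration, and it assumes `use integer` is not in effect, so that Perl's `%` returns a non-negative result for a positive right operand.

```perl
use strict;
use warnings;

# Hypothetical helpers, not taken from the thread.

# Wrap a 1-based coordinate onto a circular genome with $size positions.
# Valid results are 1..$size.
sub wrap_one_based {
    my ($x, $size) = @_;
    return ($x - 1) % $size + 1;
}

# Wrap a 0-based coordinate onto the same genome.
# Valid results are 0..$size-1.
sub wrap_zero_based {
    my ($x, $size) = @_;
    return $x % $size;
}

print wrap_one_based(100, 100), "\n";    # 100: the last position stays 100
print wrap_one_based(101, 100), "\n";    # 1: one step past the end wraps to the start
print wrap_one_based(0, 100), "\n";      # 100: one step before position 1
print wrap_zero_based(100, 100), "\n";   # 0: one step past the end wraps to the start
```

The 1-based form costs one extra subtraction and one extra addition per wrap, which is the "more ops" being weighed against keeping the biological convention.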