in reply to Re: Can I speed this up? (repetitively scanning ranges in a large array)
in thread Can I speed this up? (repetitively scanning ranges in a large array)

Obviously. The only reason I'm using coordinates that start from 1, not zero, is to be consistent with the input and output formats I'm using.

These are biological data, and the unfortunate convention in all biological databases I'm familiar with is to start counting from 1. The first position in any genome is 1. If I used zeros, I would have to remember to convert between the two systems each time I input/output, and from experience, I quickly forget to do that...

Re^3: Can I speed this up? (repetitively scanning ranges in a large array)
by ikegami (Patriarch) on Nov 02, 2010 at 19:07 UTC

    I would have to remember to convert between the two systems each time I input/output, and from experience, I quickly forget to do that...

    You say it's obvious, then you keep talking as if a circle has a start.

    No conversion is necessary. Just start at index one of the array for the item labeled 1, then keep going for 100 elements, which ends at element zero of the array.

    for (1..100) { print $result[$_ % 100]; }
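
    A slightly fuller sketch of the same idea, assuming a hypothetical circular genome of length 100 whose values are made-up labels. Coordinate N is stored at index N % 100, so coordinate 100 lands in element zero and a range that runs past the end simply wraps:

    ```perl
    use strict;
    use warnings;

    # Hypothetical data: a circular genome of length 100.
    # Coordinate N lives at index N % $size, so coordinate 100 lives at index 0.
    my $size = 100;
    my @result;
    $result[$_ % $size] = "pos$_" for 1 .. $size;

    # Walk a range that runs past the end: 98, 99, 100, then 1, 2, 3.
    my @seen = map { $result[$_ % $size] } 98 .. 103;
    print "@seen\n";    # pos98 pos99 pos100 pos1 pos2 pos3
    ```

    The array holds exactly 100 elements, and the only extra operation per lookup is the modulo.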
      That's possible, but a bit confusing. In all biological databases, a circular genome of length 100 does have coordinate 100. It doesn't have coordinate zero. So, when I print a range that spans to the end of the genome, I must print it as x..100, not x..0, as this is the convention. So this representation might have some benefits, but it also brings the overhead of remembering to switch back to 'biological' coordinates whenever you input/output...

      Also note that not all genomes are circular. What would you do about those? If you put the first coordinate in the first position of the array (arr[0]), you will surely have to subtract 1 any time you output any coordinate. If you start from arr[1] you do what I currently do, but now you treat circular genomes and linear genomes differently any time you print or even calculate their length (if the genome is circular scalar(@arr) == genome size, but if it's linear scalar(@arr) == genome size + 1). Confusing...
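
      To make the comparison concrete, here is a rough sketch of the three layouts discussed so far. The 5-base genome and all variable names are made up for illustration; they are not taken from the original code.

      ```perl
      use strict;
      use warnings;

      # Hypothetical 5-base genome, used only for illustration.
      my @bases = qw(A C G T G);
      my $size  = @bases;

      # Layout A: coordinate 1 at index 0, so every lookup subtracts 1.
      my @zero_based = @bases;

      # Layout B: coordinate 1 at index 1; index 0 is an unused placeholder.
      my @one_based = (undef, @bases);

      # Layout C (circular only): coordinate N at index N % $size,
      # so the last coordinate wraps around to index 0.
      my @wrapped;
      $wrapped[$_ % $size] = $bases[$_ - 1] for 1 .. $size;

      # Look up the last coordinate ($size) in each layout.
      my %last = (
          A => $zero_based[$size - 1],
          B => $one_based[$size],
          C => $wrapped[$size % $size],
      );

      printf "%s: array length %d, coordinate %d is %s\n", @$_
          for [ 'A', scalar(@zero_based), $size, $last{A} ],
              [ 'B', scalar(@one_based),  $size, $last{B} ],
              [ 'C', scalar(@wrapped),    $size, $last{C} ];
      ```

      All three find the same base; only layout B ends up with an array one element longer than the genome.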

      Anyway, I must admit I'm not sure why we are focusing on this... that's really not the issue. Finally, note that in biology circles do have a start :) For each circular genome, a certain point was selected as '1'. This choice is actually not completely arbitrary - there are some rules for deciding where to place this landmark. Once a genome has been sequenced and published it has one and only one '1', and any reference to this genome will be relative to this landmark. Just for general knowledge...

        does have coordinate 100.

        Wait, are you suggesting using the 2nd to 101st index for coordinates 1 to 100? And here I thought the alternative would have been to use the 1st index to the 100th index for coordinates 1 to 100. That's even worse. You can't use the mod at all even though the situation calls for it.

        not all genomes are circular. What would you do about those?

        It's not relevant, but I'd use the first index of the array for the first element of the genome. I'd personally use that for circular genomes too. What I did was propose a compromise.

        (if the genome is circular scalar(@arr) == genome size, but if it's linear scalar(@arr) == genome size + 1). Confusing...

        Exactly. The size of the array (scalar(@genome)) should be the size of the genome.

        Anyway, I must admit I'm not sure why we are focusing on this.

        To point out how silly it is to be focusing on such.

        But it is pertinent, as choosing a bad coordinate system will just add operations, slowing things down.