in reply to Re^3: Sorting Gigabytes of Strings Without Storing Them (bytes)
in thread Sorting Gigabytes of Strings Without Storing Them

Hm. I can't get this to work. There's a couple of obvious things:

  1. %#quad; => $#quad;
  2. /ge

But the main problem is my long standing trouble with your NestedLoops. You're passing in an anon array of four references to an array containing ACGT: [ ( \@bases ) x 4 ], but that results in Not an ARRAY reference... from the @$_ within the map, and I can't see how to put that right?

Also, for a proper comparison, you;d need a 'decode' operation.

Generally, I have tried various pure perl methods of packing and unpacking genomic data--there's one of my previous attempts here, and there are others kicking around--but given it would require saving 3 minutes (on 1e6 cycles) during the packing and unpacking to allow salva's in-memory sort approach to compete with the external sort, I fear that none of them are going to achieve much.

A pair of Inline::C routines might achieve sufficient performance to make an in-memery approach viable, but given the shortness on the strings and the need to transition perl->C->perl for every 34 chars, even that seems unlikely. If the inputs were always multiples of 4-bytes and always 4-byte aligned, maybe. But dealing with the edge cases where those two conditions cannot be guarenteed makes things much slower.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."