in reply to Bioinformatic task

Thank you all for your replies. The full task that I need to perform is to compare all possible sequence pairs combinations (so if I have 3 sequences then I will compare 1-2, 1-3 and 2-3) by sending them to an external program (for which I need only the DNA sequence and not need the header), and then if the comparison meets certain criteria then I save the corresponding headers into an output file (and speaking of which: would it be more time efficient if I keep all the output in memory and then save it to file at the end of run?). I originally used BioPerl which was very convenient, but unfortunately too slow: I really need to try and use memory as much as possible so it will not take months to finish.

Replies are listed 'Best First'.
Re^2: Bioinformatic task
by aquarium (Curate) on Nov 08, 2010 at 05:22 UTC
    typically how many sequences in the input file, and how long each DNA chain?
    what does your "compare" operation do? if it's testing for exact match then you don't need to..as you can test a sequence key exists before assigning. so if you put in this test for prior existence before assignment, you're essentially checking the incoming sequence against every sequence already loaded in the hash.
    and if the comparison of the DNA chains is more involved, then another structure more suited to the comparison operation could help make the task more managable/efficient.
    the hardest line to type correctly is: stty erase ^H
      a straight equality comparison can also be done by sort | uniq -c unix/linux utilities, which will give you the count of the common sequences. but you could also just work with the sort output with a script that notices when the sequence changes from the last read one. store the sequence identifier in second column and only use the first column (sequence) for the sorting.
      the hardest line to type correctly is: stty erase ^H
Re^2: Bioinformatic task
by tospo (Hermit) on Nov 08, 2010 at 09:44 UTC
    I'd suggest you post either some code here or a more detailed description of what your code acutally does. I have a suspicion that it is not a BioPerl problem but the design of the algorithm that makes it so slow. From your question I would guess that you are still learning Perl, right? One common beginner's mistake is to repeatedly read large sequence files where a single pass would have been sufficient. What are you using to compare the sequences? For example, if you are running something like ClustalW of all possible pairs of a large file of sequences then this will take forever. You could then speed this up by running a MEGABLAST or BLASTCLUST or something like that to first identify sequences with some similarity and then only do the global alignments for those where it makes sense. There is also a plethorea of new aligners for next-gen sequencing data out there, which could also be useful, depending on your data.
Re^2: Bioinformatic task
by TomDLux (Vicar) on Nov 08, 2010 at 19:39 UTC

    If you found the module was too slow, you're clearly talking about huge sequence, not merely a few thousand characters. If you're trying every pairing, seems to me that's going to be N * (N-1) / 2 pairs. You can choose any one of N for the first, but one fewer for the second; A-B is the same as B-A.

    All pairings means you need random access, so you want them in memory. Huge means You're going to use up your memory. Have you considered using a simple memory/file database, maybe BerkelyDB or SQLLite, to help manage the storage and swapping?

    is there any way you can do some hierarchival competition, rather than all-pairs?

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.