Re: Bioinformatic task
by GrandFather (Saint) on Nov 07, 2010 at 22:56 UTC
Define 'big'. One way to handle the problem for smaller values of 'big' is to read the entire file into memory then use a regular expression to cut it up for you.
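A minimal sketch of that slurp-and-split approach (the file name is hypothetical):

use strict;
use warnings;

# Slurp the whole file, then split it into records on the header
# lines. Assumes the (hypothetical) seqs.fasta fits in memory.
open my $fh, '<', 'seqs.fasta' or die "Can't open seqs.fasta: $!";
my $text = do { local $/; <$fh> };

# A zero-width match at the start of the string produces no empty
# leading field, so each chunk is one ">header\nsequence..." record.
for my $chunk ( split /^(?=>)/m, $text ) {
    my ( $header, @seq ) = split /\n/, $chunk;
    print "$header has ", length( join '', @seq ), " bases\n";
}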
For larger values of big you may be able to use something like:
use strict;
use warnings;

local $/ = ">Seq";    # read ">Seq"-delimited records instead of lines

while (<DATA>) {
    chomp;                  # strip the trailing ">Seq" delimiter, if any
    next if !length;        # skip the empty record before the first ">Seq"
    print "Record: $_\n";
}
__DATA__
>Seq1
AAATTTGGG.....
>Seq2
AGATTTACC.....
True laziness is hard work
Thanks for replying. I actually used the word 'big' wrongly: I must read the whole file into memory, with one structure for the headers and another one for the sequences. My computer definitely has enough RAM for that.
I strongly second what aquarium implies in Re: Bioinformatic task - using a single structure containing records with two parts is much better than trying to keep two parallel arrays in sync.
However, it's probably worth your while to tell us more about the problem you are trying to solve. I suspect there are other areas where you could use a little help with this problem!
True laziness is hard work
For the OP's purpose, don't use a regular expression just to cut up a string on a static delimiter. Regular expressions aren't the answer to every parsing problem in Perl: however simple the pattern, they are about 10x slower than a few lines of index and substr. Use substr and index (see my post at Re: Is Using Threads Slower Than Not Using Threads?), or, as you showed, redefine the input record separator and read "by line".
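A minimal sketch of that index/substr scan, assuming the whole file has already been slurped into $text (sample data inlined here):

use strict;
use warnings;

my $text = ">Seq1\nAAATTTGGG\n>Seq2\nAGATTTACC\n";

# Walk the string: find the start of the next record with index,
# then extract the current record with substr. No regex engine involved.
my $pos = index $text, '>';
while ( $pos >= 0 ) {
    my $next = index $text, "\n>", $pos + 1;
    my $len  = $next >= 0 ? $next + 1 - $pos : length( $text ) - $pos;
    print 'Record: ', substr( $text, $pos, $len );
    $pos = $next >= 0 ? $next + 1 : -1;
}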
Re: Bioinformatic task
by BrowserUk (Patriarch) on Nov 07, 2010 at 23:41 UTC
Does the DNA occupy multiple lines between the single header lines? (I.e., is this a standard FASTA format file?)
If so, this might work for you:
#! perl -slw
use strict;
use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 80;

local $/ = '>';            # read '>'-delimited records
my %sequences;

(undef) = scalar <DATA>;   ## Discard first delimiter

while ( my $record = <DATA> ) {
    my @lines = split "\n", $record;
    pop @lines if $lines[-1] eq '>';       # drop the trailing delimiter
    my $id = shift @lines;                 # first line is the header
    $sequences{ $id } = join '', @lines;   # rejoin the wrapped sequence
}

pp \%sequences;
__DATA__
>uc002yje.1 chr21:13973492-13974491
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
acgcgccctacactctggcatgggggaacccggccccgcagagccctgga
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
>uc002yje.2 chr21:13974492-13975432
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
acgcgccctacactctggcatgggggaaaaaacccggccccgcagagccctgga
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
>uc002yje.3 chr21:13975431-13976330
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
acgcgccctacactctggcatgggggaacccggccccgcagagggccctgga
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
Outputs:

c:\test>fasta
{
"uc002yje.1 chr21:13973492-13974491" => "cccctgccccaccgcaccctggattac
+tgcacgccaagaccctcacctgaacgcgccctacactctggcatgggggaacccggccccgcagagccc
+tggaCTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG",
"uc002yje.2 chr21:13974492-13975432" => "cccctgccccaccgcaccctggattac
+tgcacgccaagaccctcacctgaacgcgccctacactctggcatgggggaaaaaacccggccccgcaga
+gccctggaCTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG",
"uc002yje.3 chr21:13975431-13976330" => "cccctgccccaccgcaccctggattac
+tgcacgccaagaccctcacctgaacgcgccctacactctggcatgggggaacccggccccgcagagggc
+cctggaCTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG",
}
Re: Bioinformatic task
by aquarium (Curate) on Nov 07, 2010 at 23:23 UTC
if the header is always in that format then you wouldn't need to store the whole header at all, as it just becomes a contiguous sequence of increasing integers: $Sequences[1] is already known to be for header "Seq1" implicitly. and if the headers are not that straightforward, then depending exactly on the processing required, you'd probably opt for a hash of arrays or an array of hashes, which keeps each header directly tied to its sequence, rather than separate header and separate sequence data structures.
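A minimal sketch of the array-of-hashes layout (the field names are illustrative):

use strict;
use warnings;

# Each record keeps its header and sequence together, so the two can
# never drift out of sync the way parallel arrays can.
my @records = (
    { header => 'Seq1', sequence => 'AAATTTGGG' },
    { header => 'Seq2', sequence => 'AGATTTACC' },
);

print "$_->{header}: $_->{sequence}\n" for @records;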
the hardest line to type correctly is: stty erase ^H
Re: Bioinformatic task
by TomDLux (Vicar) on Nov 07, 2010 at 23:51 UTC
You don't say why it needs to be that way.
If your two arrays are a prerequisite for other code, then go for it. But I think it would be better to use a hash, with the header as the key and the sequence as the value. If you might have multiple instances of a header, use arrays of sequences as the values:
my %data;

while ( my $header = <$fh> ) {
    my $sequence = <$fh>;
    chomp( $header, $sequence );

    $data{$header} = $sequence;
    # or, to allow several sequences per header:
    push @{ $data{$header} }, $sequence;
}

# to read
print "$_ -> $data{$_}.\n" for sort keys %data;

# or, for the hash of arrays
for my $key ( sort keys %data ) {
    print "$key : $_.\n" for @{ $data{$key} };
}
As Occam said: Entia non sunt multiplicanda praeter necessitatem.
Re: Bioinformatic task
by j1n3l0 (Friar) on Nov 08, 2010 at 00:04 UTC
Perhaps the Bio::SeqIO module from the BioPerl distribution could help you here.
Why? It is slow, clumsy and memory hungry.
It requires the installation of literally thousands of lines of code to achieve what can be done more simply and efficiently in 4 lines of Perl.
It depends: if you know that this code will only ever run against a file whose format you know exactly, then that's certainly true. Otherwise it's quite easy, especially for a beginner, to rely on assumptions that may not always hold, e.g. that the sequence is always on a single line. Then you run your code against a different file and fail to realise that you have read only fractions of the real sequences, and all your results are invalid. I would say it is better to use a tried and tested module like Bio::SeqIO unless you are dealing with gigabyte-sized files and speed is crucial.
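For reference, reading a FASTA file with Bio::SeqIO looks roughly like this (the file name is hypothetical):

use strict;
use warnings;
use Bio::SeqIO;

# Bio::SeqIO reassembles multi-line sequences for you, so the
# "sequence on a single line" assumption never enters the picture.
my $in = Bio::SeqIO->new( -file => 'seqs.fasta', -format => 'fasta' );

while ( my $seq = $in->next_seq ) {
    printf "%s has %d bases\n", $seq->display_id, $seq->length;
}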
Re: Bioinformatic task
by uvnew (Acolyte) on Nov 08, 2010 at 02:36 UTC
Thank you all for your replies. The full task I need to perform is to compare all possible pairs of sequences (so if I have 3 sequences I will compare 1-2, 1-3 and 2-3) by sending them to an external program, for which I need only the DNA sequence and not the header. If a comparison meets certain criteria, I then save the corresponding headers to an output file. (Speaking of which: would it be more time-efficient to keep all the output in memory and save it to the file at the end of the run?) I originally used BioPerl, which was very convenient but unfortunately too slow: I really need to use memory as much as possible so it will not take months to finish.
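A minimal sketch of that all-pairs loop; compare_pair() is a hypothetical stand-in for the external program:

use strict;
use warnings;

my @headers   = ( 'Seq1', 'Seq2', 'Seq3' );
my @sequences = ( 'AAAT', 'AGAT', 'AAAT' );

# N sequences give N * (N - 1) / 2 pairs; collecting the hits in
# memory and writing once at the end avoids many small writes.
my @hits;
for my $i ( 0 .. $#sequences - 1 ) {
    for my $j ( $i + 1 .. $#sequences ) {
        push @hits, "$headers[$i]\t$headers[$j]\n"
            if compare_pair( $sequences[$i], $sequences[$j] );
    }
}

open my $out, '>', 'hits.txt' or die "Can't write hits.txt: $!";
print {$out} @hits;
close $out;

sub compare_pair { return $_[0] eq $_[1] }    # placeholder criterion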
typically how many sequences are in the input file, and how long is each DNA chain?
what does your "compare" operation do? if it's testing for an exact match then you don't need to compare at all: you can test whether a sequence already exists as a hash key before assigning it. with that test for prior existence before assignment, you're essentially checking the incoming sequence against every sequence already loaded into the hash (see the sketch below).
and if the comparison of the DNA chains is more involved, then another structure, more suited to the comparison operation, could help make the task more manageable/efficient.
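A minimal sketch of that exists-before-assign test, keyed on the sequence itself:

use strict;
use warnings;

# Keying the hash by sequence makes exact-duplicate detection a
# single hash lookup rather than a scan over every stored sequence.
my %seen;
while ( my $header = <DATA> ) {
    chomp $header;
    chomp( my $sequence = <DATA> );
    if ( exists $seen{$sequence} ) {
        print "$header duplicates $seen{$sequence}\n";
    }
    else {
        $seen{$sequence} = $header;
    }
}
__DATA__
>Seq1
AAATTTGGG
>Seq2
AAATTTGGG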
the hardest line to type correctly is: stty erase ^H
a straight equality comparison can also be done with the unix/linux sort | uniq -c utilities, which will give you the count of each common sequence. but you could also just work with the sort output, using a script that notices when the sequence changes from the last one read. store the sequence identifier in the second column and use only the first column (the sequence) for the sorting.
the hardest line to type correctly is: stty erase ^H
I'd suggest you post either some code here or a more detailed description of what your code actually does. I have a suspicion that it is not a BioPerl problem but the design of the algorithm that makes it so slow. From your question I would guess that you are still learning Perl, right? One common beginner's mistake is to repeatedly read large sequence files where a single pass would have been sufficient. What are you using to compare the sequences? For example, if you are running something like ClustalW on all possible pairs from a large file of sequences, then this will take forever. You could speed this up by first running MEGABLAST or BLASTCLUST or something like that to identify sequences with some similarity, and then only doing the global alignments where it makes sense. There is also a plethora of new aligners for next-gen sequencing data out there, which could also be useful, depending on your data.
If you found the module too slow, you're clearly talking about huge sequences, not merely a few thousand characters.
If you're trying every pairing, it seems to me that's going to be N * (N - 1) / 2 pairs:
you can choose any one of N for the first, but one fewer for the second, and A-B is the same as B-A.
All pairings means you need random access, so you want them in memory. Huge means you're going to use up your memory.
Have you considered using a simple memory/file database, maybe BerkeleyDB or SQLite, to help manage the storage and swapping?
Is there any way you can do some hierarchical competition, rather than all-pairs?
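A minimal sketch of the tied-hash idea using the core DB_File module (the file name is hypothetical):

use strict;
use warnings;
use DB_File;

# Tie the hash to an on-disk Berkeley DB file so the sequences are
# stored on disk instead of all living in RAM at once.
tie my %sequences, 'DB_File', 'sequences.db'
    or die "Cannot tie sequences.db: $!";

$sequences{Seq1} = 'AAATTTGGG';    # written to disk, not kept in RAM
print $sequences{Seq1}, "\n";

untie %sequences;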
As Occam said: Entia non sunt multiplicanda praeter necessitatem.