Re: Bioinformatic task
by GrandFather (Saint) on Nov 07, 2010 at 22:56 UTC
Define 'big'. One way to handle the problem for smaller values of 'big' is to read the entire file into memory then use a regular expression to cut it up for you.
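A minimal sketch of that slurp-and-split approach (the file name is hypothetical):

use strict;
use warnings;

# Slurp the whole file, then split it into records on the header
# lines. Assumes the (hypothetical) seqs.fasta fits in memory.
open my $fh, '<', 'seqs.fasta' or die "Can't open seqs.fasta: $!";
my $text = do { local $/; <$fh> };

# A zero-width match at the start of the string produces no empty
# leading field, so each chunk is one ">header\nsequence..." record.
for my $chunk ( split /^(?=>)/m, $text ) {
    my ( $header, @seq ) = split /\n/, $chunk;
    print "$header has ", length( join '', @seq ), " bases\n";
}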
For larger values of big you may be able to use something like:
use strict;
use warnings;

local $/ = ">Seq";    # read ">Seq"-delimited records instead of lines

while (<DATA>) {
    chomp;                  # strip the trailing ">Seq" delimiter, if any
    next if !length;        # skip the empty record before the first ">Seq"
    print "Record: $_\n";
}
__DATA__
>Seq1
AAATTTGGG.....
>Seq2
AGATTTACC.....
True laziness is hard work
Thanks for replying. I actually used the word 'big' wrongly: I must read the whole file into memory, with one structure for the headers and another one for the sequences. My computer definitely has enough RAM for that.
I strongly second what aquarium implies in Re: Bioinformatic task - using a single structure containing records with two parts is much better than trying to keep two parallel arrays in sync.
However, it's probably worth your while to tell us more about the problem you are trying to solve. I suspect there are other areas where you could use a little help with this problem!
True laziness is hard work
For the OP's purpose, don't use a regular expression just to cut up a string on a static delimiter. Regular expressions aren't the answer to every parsing problem in Perl: however simple the pattern, they are about 10x slower than a few lines of index and substr. Use substr and index (see my post at Re: Is Using Threads Slower Than Not Using Threads?), or, as you showed, redefine the input record separator and read "by line".
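A minimal sketch of that index/substr scan, assuming the whole file has already been slurped into $text (sample data inlined here):

use strict;
use warnings;

my $text = ">Seq1\nAAATTTGGG\n>Seq2\nAGATTTACC\n";

# Walk the string: find the start of the next record with index,
# then extract the current record with substr. No regex engine involved.
my $pos = index $text, '>';
while ( $pos >= 0 ) {
    my $next = index $text, "\n>", $pos + 1;
    my $len  = $next >= 0 ? $next + 1 - $pos : length( $text ) - $pos;
    print 'Record: ', substr( $text, $pos, $len );
    $pos = $next >= 0 ? $next + 1 : -1;
}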
Re: Bioinformatic task
by BrowserUk (Patriarch) on Nov 07, 2010 at 23:41 UTC
Does the DNA occupy multiple lines between the single header lines? (I.e., is this a standard FASTA format file?)
If so, this might work for you:
#! perl -slw
use strict;
use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 80;

local $/ = '>';            # read '>'-delimited records
my %sequences;

(undef) = scalar <DATA>;   ## Discard first delimiter

while ( my $record = <DATA> ) {
    my @lines = split "\n", $record;
    pop @lines if $lines[-1] eq '>';       # drop the trailing delimiter
    my $id = shift @lines;                 # first line is the header
    $sequences{ $id } = join '', @lines;   # rejoin the wrapped sequence
}

pp \%sequences;
__DATA__
>uc002yje.1 chr21:13973492-13974491
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
acgcgccctacactctggcatgggggaacccggccccgcagagccctgga
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
>uc002yje.2 chr21:13974492-13975432
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
acgcgccctacactctggcatgggggaaaaaacccggccccgcagagccctgga
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
>uc002yje.3 chr21:13975431-13976330
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
acgcgccctacactctggcatgggggaacccggccccgcagagggccctgga
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
Outputs:

c:\test>fasta
{
"uc002yje.1 chr21:13973492-13974491" => "cccctgccccaccgcaccctggattac
+tgcacgccaagaccctcacctgaacgcgccctacactctggcatgggggaacccggccccgcagagccc
+tggaCTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG",
"uc002yje.2 chr21:13974492-13975432" => "cccctgccccaccgcaccctggattac
+tgcacgccaagaccctcacctgaacgcgccctacactctggcatgggggaaaaaacccggccccgcaga
+gccctggaCTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG",
"uc002yje.3 chr21:13975431-13976330" => "cccctgccccaccgcaccctggattac
+tgcacgccaagaccctcacctgaacgcgccctacactctggcatgggggaacccggccccgcagagggc
+cctggaCTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG",
}
Re: Bioinformatic task
by aquarium (Curate) on Nov 07, 2010 at 23:23 UTC
if the header is always in that format then you wouldn't need to store the whole header at all, as it just becomes a contiguous sequence of increasing integers: $Sequences[1] is already known to be for header "Seq1" implicitly. and if the headers are not that straightforward, then depending exactly on the processing required, you'd probably opt for a hash of arrays or an array of hashes, which keeps each header directly tied to its sequence, rather than separate header and separate sequence data structures.
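A minimal sketch of the array-of-hashes layout (the field names are illustrative):

use strict;
use warnings;

# Each record keeps its header and sequence together, so the two can
# never drift out of sync the way parallel arrays can.
my @records = (
    { header => 'Seq1', sequence => 'AAATTTGGG' },
    { header => 'Seq2', sequence => 'AGATTTACC' },
);

print "$_->{header}: $_->{sequence}\n" for @records;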
the hardest line to type correctly is: stty erase ^H
Re: Bioinformatic task
by TomDLux (Vicar) on Nov 07, 2010 at 23:51 UTC
You don't say why it needs to be that way.
If your two arrays are a prerequisite for other code, then go for it. But I think it would be better to use a hash, with the header as the key and the sequence as the value. If you might have multiple instances of a header, use arrays of sequences as the values:
my %data;

while ( my $header = <$fh> ) {
    my $sequence = <$fh>;
    chomp( $header, $sequence );

    $data{$header} = $sequence;
    # or, to allow several sequences per header:
    push @{ $data{$header} }, $sequence;
}

# to read
print "$_ -> $data{$_}.\n" for sort keys %data;

# or, for the hash of arrays
for my $key ( sort keys %data ) {
    print "$key : $_.\n" for @{ $data{$key} };
}
As Occam said: Entia non sunt multiplicanda praeter necessitatem.
Re: Bioinformatic task
by j1n3l0 (Friar) on Nov 08, 2010 at 00:04 UTC
Perhaps the Bio::SeqIO module from the BioPerl distribution could help you here.
Why? It is slow, clumsy and memory hungry.
It requires the installation of literally thousands of lines of code to achieve what can be done more simply and efficiently in 4 lines of Perl.
It depends: if you know that this code will only ever run against a file whose format you know exactly, then that's certainly true. Otherwise it's quite easy, especially for a beginner, to rely on assumptions that may not always hold, e.g. that the sequence is always on a single line. Then you run your code against a different file and fail to realise that you have read only fractions of the real sequences, and all your results are invalid. I would say it is better to use a tried and tested module like Bio::SeqIO unless you are dealing with gigabyte-sized files and speed is crucial.
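For reference, reading a FASTA file with Bio::SeqIO looks roughly like this (the file name is hypothetical):

use strict;
use warnings;
use Bio::SeqIO;

# Bio::SeqIO reassembles multi-line sequences for you, so the
# "sequence on a single line" assumption never enters the picture.
my $in = Bio::SeqIO->new( -file => 'seqs.fasta', -format => 'fasta' );

while ( my $seq = $in->next_seq ) {
    printf "%s has %d bases\n", $seq->display_id, $seq->length;
}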
Re: Bioinformatic task
by uvnew (Acolyte) on Nov 08, 2010 at 02:36 UTC
Thank you all for your replies. The full task I need to perform is to compare all possible pairs of sequences (so if I have 3 sequences I will compare 1-2, 1-3 and 2-3) by sending them to an external program, for which I need only the DNA sequence and not the header. If a comparison meets certain criteria, I then save the corresponding headers to an output file. (Speaking of which: would it be more time-efficient to keep all the output in memory and save it to the file at the end of the run?) I originally used BioPerl, which was very convenient but unfortunately too slow: I really need to use memory as much as possible so it will not take months to finish.
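A minimal sketch of that all-pairs loop; compare_pair() is a hypothetical stand-in for the external program:

use strict;
use warnings;

my @headers   = ( 'Seq1', 'Seq2', 'Seq3' );
my @sequences = ( 'AAAT', 'AGAT', 'AAAT' );

# N sequences give N * (N - 1) / 2 pairs; collecting the hits in
# memory and writing once at the end avoids many small writes.
my @hits;
for my $i ( 0 .. $#sequences - 1 ) {
    for my $j ( $i + 1 .. $#sequences ) {
        push @hits, "$headers[$i]\t$headers[$j]\n"
            if compare_pair( $sequences[$i], $sequences[$j] );
    }
}

open my $out, '>', 'hits.txt' or die "Can't write hits.txt: $!";
print {$out} @hits;
close $out;

sub compare_pair { return $_[0] eq $_[1] }    # placeholder criterion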
typically how many sequences are in the input file, and how long is each DNA chain?
what does your "compare" operation do? if it's testing for an exact match then you don't need to compare at all: you can test whether a sequence already exists as a hash key before assigning it. with that test for prior existence before assignment, you're essentially checking the incoming sequence against every sequence already loaded into the hash (see the sketch below).
and if the comparison of the DNA chains is more involved, then another structure, more suited to the comparison operation, could help make the task more manageable/efficient.
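A minimal sketch of that exists-before-assign test, keyed on the sequence itself:

use strict;
use warnings;

# Keying the hash by sequence makes exact-duplicate detection a
# single hash lookup rather than a scan over every stored sequence.
my %seen;
while ( my $header = <DATA> ) {
    chomp $header;
    chomp( my $sequence = <DATA> );
    if ( exists $seen{$sequence} ) {
        print "$header duplicates $seen{$sequence}\n";
    }
    else {
        $seen{$sequence} = $header;
    }
}
__DATA__
>Seq1
AAATTTGGG
>Seq2
AAATTTGGG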
the hardest line to type correctly is: stty erase ^H
a straight equality comparison can also be done with the unix/linux sort | uniq -c utilities, which will give you the count of each common sequence. but you could also just work with the sort output, using a script that notices when the sequence changes from the last one read. store the sequence identifier in the second column and use only the first column (the sequence) for the sorting.
the hardest line to type correctly is: stty erase ^H
I'd suggest you post either some code here or a more detailed description of what your code actually does. I have a suspicion that it is not a BioPerl problem but the design of the algorithm that makes it so slow. From your question I would guess that you are still learning Perl, right? One common beginner's mistake is to repeatedly read large sequence files where a single pass would have been sufficient. What are you using to compare the sequences? For example, if you are running something like ClustalW on all possible pairs from a large file of sequences, then this will take forever. You could speed this up by first running MEGABLAST or BLASTCLUST or something like that to identify sequences with some similarity, and then only doing the global alignments where it makes sense. There is also a plethora of new aligners for next-gen sequencing data out there, which could also be useful, depending on your data.
If you found the module too slow, you're clearly talking about huge sequences, not merely a few thousand characters.
If you're trying every pairing, it seems to me that's going to be N * (N - 1) / 2 pairs:
you can choose any one of N for the first, but one fewer for the second, and A-B is the same as B-A.
All pairings means you need random access, so you want them in memory. Huge means you're going to use up your memory.
Have you considered using a simple memory/file database, maybe BerkeleyDB or SQLite, to help manage the storage and swapping?
Is there any way you can do some hierarchical competition, rather than all-pairs?
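A minimal sketch of the tied-hash idea using the core DB_File module (the file name is hypothetical):

use strict;
use warnings;
use DB_File;

# Tie the hash to an on-disk Berkeley DB file so the sequences are
# stored on disk instead of all living in RAM at once.
tie my %sequences, 'DB_File', 'sequences.db'
    or die "Cannot tie sequences.db: $!";

$sequences{Seq1} = 'AAATTTGGG';    # written to disk, not kept in RAM
print $sequences{Seq1}, "\n";

untie %sequences;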
As Occam said: Entia non sunt multiplicanda praeter necessitatem.