>gi|2695846|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavy chain, clone ScH 3.3
TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
CAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAA
CTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCGCCTACATGAGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGA
ATGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGAAGATTCGCCATCTCCAGAG
ACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACACTGCCGTGTATTACTGTGCTCGGGGC
GGGCTGGGGTGGTCCCTTGACTACTGGGGGAAAGGCACAATGATCACCGTAACTTCTGCTACGCCATCACCACCGACAGT
GTTTCCGCTTATGGAGTCATGTTGTTTGAGCGATATCTCGGGTCCTGTTGCTACGGGCTGCTTAGCAACCGGATTCTGCC
TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT

This flat file is composed of consecutive FASTA sequences (like the one above) and approaches 9 GB in size. What I am trying to do is slurp the entire file into an array (crazy, but my Unix system can handle it :-) and then parse out the individual sequences into sub-arrays. Because the sequences vary in length, I can't use @subarray = splice(@list, 0, 11); to pull out each one; I have to separate the sequences on the > symbol.

What would be the simplest way to say "slurp in the multiple lines of data between one > and the next, and place them into subarray X"? My main worry is coding this in a way so that Perl can keep its place along the way, so I don't get the same sequence pulled out 20,000 times rather than 20,000 different sequence arrays. As well, if there are any suggestions on making this as painless a memory hog as possible, I would greatly appreciate it.

I apologize if this seems a dumb question, but I'm a self-taught Perl hacker, and still pretty new, though playing with some nice Unix toys...
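One memory-friendly sketch (the filename big.fasta is a placeholder, and process_record stands in for whatever you want to do with each sequence): instead of slurping all 9 GB into one array, set Perl's input record separator $/ to ">" so each read from the filehandle returns exactly one FASTA record. Perl keeps its place in the file for you, so every record comes out once, and only one sequence is in memory at a time.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical per-record handler -- replace with your own processing.
sub process_record {
    my ($header, @seq_lines) = @_;
    my $sequence = join '', @seq_lines;
    printf "%s: %d bases\n", $header, length $sequence;
}

open my $fh, '<', 'big.fasta' or die "Cannot open big.fasta: $!";
{
    local $/ = ">";                  # each read ends at the next '>'
    while ( my $record = <$fh> ) {
        chomp $record;               # strip the trailing '>'
        next unless $record =~ /\S/; # skip the empty chunk before the first '>'
        my ($header, @seq_lines) = split /\n/, $record;
        process_record($header, @seq_lines);
    }
}
close $fh;
```

If you truly need all 20,000 sequences at once, push [$header, @seq_lines] onto a master array inside the loop instead of calling a handler, but streaming one record at a time is what keeps the memory footprint small.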
In reply to Refomating a large fasta file... by bioinformatics