Hello all,
The NCBI database, which contains sequence information for hundreds of species, lets you download major portions of its data so that sequence searches can be run locally rather than having too many people hog the bandwidth. I have downloaded a specific flat file containing sequence information in FASTA format, which looks like this:

>gi|2695846|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavy chain, clone ScH 3.3
TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
CAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAA
CTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCGCCTACATGAGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGA
ATGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGAAGATTCGCCATCTCCAGAG
ACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACACTGCCGTGTATTACTGTGCTCGGGGC
GGGCTGGGGTGGTCCCTTGACTACTGGGGGAAAGGCACAATGATCACCGTAACTTCTGCTACGCCATCACCACCGACAGT
GTTTCCGCTTATGGAGTCATGTTGTTTGAGCGATATCTCGGGTCCTGTTGCTACGGGCTGCTTAGCAACCGGATTCTGCC
TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT

This flat file is composed of consecutive FASTA sequences (like the one above) and approaches 9 GB in size. What I am trying to do is slurp the entire file into an array (crazy, but my Unix system can handle it :-) and then parse out the individual sequences into sub-arrays. Because the sequences vary in size, I can't use @subarray=splice(@list, 0, 11); to pull out each one; I have to separate the sequences based on the > symbol. What would be the simplest way to say "slurp in the multiple lines of data between > and > and place them into subarray X"? My main worry is coding this so that Perl keeps its place along the way, so I don't get the same sequence pulled out 20,000 times rather than 20,000 different sequence arrays. Also, if there are any suggestions for making this as painless a memory hog as possible, I would greatly appreciate them. I apologize if this seems a dumb question, but I'm a self-taught Perl hacker and still pretty new, though playing with some nice Unix toys....
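
To make the "between > and >" idea concrete, here is a minimal sketch of one possible approach: set Perl's input record separator `$/` to a newline followed by `>`, so that each read from the filehandle returns exactly one FASTA record and the filehandle itself keeps track of the position. Only one record is held in memory at a time, so the whole 9 GB file is never slurped. The subroutine name `next_fasta_record` and the command-line file argument are hypothetical, and the sketch assumes Unix line endings and that every header line starts with `>` in column one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the next FASTA record from an open filehandle
# as (header, sequence), or an empty list at end of file.
sub next_fasta_record {
    my ($fh) = @_;
    local $/ = "\n>";               # a record ends where the next header line begins
    while (defined(my $chunk = <$fh>)) {
        chomp $chunk;               # strip the trailing "\n>" separator, if present
        $chunk =~ s/^>//;           # only the very first record still carries its ">"
        my ($header, @lines) = split /\n/, $chunk;
        next unless defined $header && length $header;   # skip empty chunks
        return ($header, join('', @lines));               # sequence lines joined into one string
    }
    return;
}

# Assumed usage: the FASTA file name is given on the command line.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    while (my ($header, $seq) = next_fasta_record($fh)) {
        # Placeholder processing: print each header with its sequence length.
        printf "%s\t%d\n", $header, length $seq;
    }
    close $fh;
}
```

Because each call reads onward from wherever the previous call stopped, the same sequence cannot be returned twice. If sub-arrays really are needed, each returned pair could be pushed onto an array as `[$header, $seq]`, at the cost of holding everything in memory.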

Bioinformatics

In reply to Refomating a large fasta file... by bioinformatics
