If they are, as the previous post said, in FASTA or qual format, then BioPerl is second to none for parsing sequence formats (although it will complain that they don't look like sequences if they have funny characters in there; 'looks like you're using scores').
You say: 'I cannot store anything in arrays or variables since I have to parse 3 GB file.'.
'The possible size of an array is only limited by how much memory you have'.
Could you possibly post the code you've tried so we can see how you're storing stuff?
You may be able to combat the problem by using references.
Storing refs to nucleotides (an example only; I would need to see your code to tailor it better):
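my $a_ref = \'A';
my $c_ref = \'C';
my $g_ref = \'G';
my $t_ref = \'T';
my $n_ref = \'N';    # fallback for anything that is not A, C, G or T
# etc...
# Then store these references in an array reference:
my $base_ref = [];
while ( my $line = <$fh> ) {
    chomp $line;
    # Get the values you need; as an assumption, every character
    # on the line is treated as a base here
    for my $base ( split //, $line ) {
        my $nuc_ref = $base eq 'A' ? $a_ref
                    : $base eq 'C' ? $c_ref
                    : $base eq 'G' ? $g_ref
                    : $base eq 'T' ? $t_ref
                    :                $n_ref;
        push @{$base_ref}, $nuc_ref;
    }
}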
What is this doing?
Each element in the array is now just a reference to one of five shared scalars (A, T, C, G, N) rather than holding its own copy of a char. See perlreftut for more info.*
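For anyone who wants to try the idea without the real file, here is a minimal self-contained sketch. The helper name load_bases, the lookup hash %ref_for, and the in-memory sample data are my own assumptions for illustration, not anything from your code. It skips FASTA header lines and maps unknown characters to N. Bear in mind that each array element is still a full Perl scalar holding a reference, so it is worth measuring the memory use on a slice of your data before relying on this.

```perl
use strict;
use warnings;

# One shared scalar per symbol; every array element points at one of these.
my %ref_for = (
    A => \'A',
    C => \'C',
    G => \'G',
    T => \'T',
);
my $n_ref = \'N';    # fallback for anything that is not A, C, G or T

# Hypothetical helper: read sequence lines from a filehandle and return
# an array reference holding one shared reference per base.
sub load_bases {
    my ($fh) = @_;
    my @bases;
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^>/;    # skip FASTA header lines
        push @bases, map { $ref_for{ uc $_ } // $n_ref } split //, $line;
    }
    return \@bases;
}

# In-memory filehandle standing in for the real file (made-up sample data).
my $data = ">seq1\nACGT\nNNxA\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $base_ref = load_bases($fh);
close $fh;

# Dereference each element with $$ to get the character back.
print join( '', map { $$_ } @{$base_ref} ), "\n";
```

To get a base back out, dereference the element, e.g. ${ $base_ref->[0] }. The defined-or operator (//) needs Perl 5.10 or later.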
You may still have some trouble with the upper limit on array size etc., but I've read about 10,000 files into a single data structure before without a hitch. It's all about how you do it.
Update: If I've seriously overlooked something please say.
If you could post some examples of what you've tried we may be able to streamline it.
Hope that helps-
john
P.S. The first post had a good idea about using a database.
* See perlreftut for more information; I'm not sure I can explain it that well.