crayfish has asked for the wisdom of the Perl Monks concerning the following question:

I have some biological sequence data as follows:

>b_comp_seq1 ACGCGGGGGAATTT >b_comp_seq_2 ACGGGCTTTCACC ..... >b_comp_seq_64 ACCCGGGAATT
I want to extract these sequences into separate files, four sequences per file, with each file having a different name. That is, I have 64 sequences and I want to split them into 16 files of 4 sequences each.

Replies are listed 'Best First'.
Re: separate files
by kcott (Archbishop) on Jan 01, 2014 at 06:07 UTC

    G'day crayfish,

    Welcome to the monastery.

    As already pointed out in previous replies (and a private message), there are a number of problems with what you've posted that make it difficult for us to know exactly what you need help with.

    You've presented your data as a string. Delving into the HTML source, I see it's actually formatted into a number of records. choroba sent you a private message about <code>...</code> tags; if you don't know how to update your post, see "How do I change/delete my post?".

    In addition to the formatting of the data, there's potentially a problem with its accuracy. Note "seq1" (without underscore) and "seq_2" (with underscore). I'll leave you to fix that, if necessary.

    Your description of the required output is also somewhat vague. For instance, did you want all, some or none of the ">b_comp_..." headers included? It's much better to provide a small example of what you're expecting.

    Please read the guidelines in "How do I post a question effectively?". If you post a question with all the appropriate information, you're likely to get good answers in (generally) a short space of time. You might also like to look at "How (Not) To Ask A Question" which explains, in more detail, some of the issues with your original post.

    The following code presents techniques that may be suitable for your current needs. It's intentionally generic because, as stated, it's unclear exactly what you want.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my ($group_size, $group) = (2, 0);

    local $/ = "\n>";

    while (<DATA>) {
        print '*** Filename: XYZ_', ++$group, "\n" unless ($. - 1) % $group_size;
        chop;
        print '>' unless $. == 1;
        print;
    }

    print "\n";

    __DATA__
    >b_comp_seq1
    ACGCGGGGGAATTT
    >b_comp_seq_2
    ACGGGCTTTCACC
    >b_comp_seq3
    ACGCGGGGGAATTT
    >b_comp_seq_4
    ACGGGCTTTCACC

    Output:

    *** Filename: XYZ_1
    >b_comp_seq1
    ACGCGGGGGAATTT
    >b_comp_seq_2
    ACGGGCTTTCACC
    *** Filename: XYZ_2
    >b_comp_seq3
    ACGCGGGGGAATTT
    >b_comp_seq_4
    ACGGGCTTTCACC
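The grouping technique above only prints a placeholder line where each file would start. As a minimal sketch of how it might be extended to write real files, the following reads records with the same `$/ = "\n>"` separator and opens a new output file every `$group_size` records. The sub name `split_records`, the `.fa` suffix, and the temporary output directory are assumptions made for illustration; the `XYZ_` prefix is just the placeholder from the output above, since the question does not say how the files should be named.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Split FASTA-style records from $in_fh into files of $group_size records.
# Files are named "<prefix><n>.fa" (hypothetical naming scheme).
# Returns the list of file names written.
sub split_records {
    my ($in_fh, $group_size, $prefix) = @_;
    my ($out_fh, @files);
    my $count = 0;

    local $/ = "\n>";    # each read ends just before the next '>' header

    while (my $record = <$in_fh>) {
        chomp $record;             # drop the trailing "\n>" separator
        $record =~ s/\A>//;        # only the first record still starts with '>'
        $record =~ s/\s+\z//;      # the last record ends in a plain newline
        next unless length $record;

        # Start a new file at the beginning of every group.
        if ($count++ % $group_size == 0) {
            close $out_fh or die "Can't close file: $!" if $out_fh;
            my $name = $prefix . (@files + 1) . '.fa';
            open $out_fh, '>', $name or die "Can't open '$name' for write: $!";
            push @files, $name;
        }
        print {$out_fh} ">$record\n";
    }
    if ($out_fh) { close $out_fh or die "Can't close file: $!" }
    return @files;
}

# Demo: the four sample records, two per file, written to a temp directory.
my $sample = <<'FASTA';
>b_comp_seq1
ACGCGGGGGAATTT
>b_comp_seq_2
ACGGGCTTTCACC
>b_comp_seq3
ACGCGGGGGAATTT
>b_comp_seq_4
ACGGGCTTTCACC
FASTA

open my $in, '<', \$sample or die "Can't read sample data: $!";
my $dir   = tempdir(CLEANUP => 1);
my @files = split_records($in, 2, "$dir/XYZ_");
print "$_\n" for @files;
```

With a group size of 4 and an input of 64 records, the same sub would produce 16 files, which matches the split described in the question.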

    Finally, I don't know what expertise you have with Perl. A good starting point for reference documentation is perldoc.perl.org: it begins with an introduction, tutorials and FAQs; more in-depth discussions of topics follow. When you've actually written some code, feel free to ask about any specific difficulties you encounter.

    -- Ken

Re: separate files
by ww (Archbishop) on Dec 31, 2013 at 23:25 UTC

    Here's a start, since you've not (yet?) replied to Kenosis' very proper request that you show us what you've tried (and how it failed to satisfy your needs, with error and warning messages, verbatim) as a token of good faith that you're here to learn, not simply to have others do your work:

    #!/usr/bin/perl
    use 5.016;
    use warnings;

    # 1068826

    my $data = '>b_comp_seq1 ACGCGGGGGAATTT >b_comp_seq2 ACGGAATT >b_comp_seq3 GGGGGCTTTCACC >b_comp_seq4 TACCGGGAATT';
    my @data1 = split /\s/, $data;

    my $i=0;
    my ($data1, $fh, $fn);

    for $_(@data1) {
        $_ =~ />(.*)/;
        $fn = $1;
        if ( $i%2 == 0 ) {
            # even index: an identifier; the sequence is the next element
            open $fh, ">", $fn or warn "Can't open $fn for write, $!";
            ++$i;
            print $fh "$fn \t $data1[$i]";
            say " DEBUG: \$fn, \$data1[$i]: $fn, $data1[$i]";
            close $fh;
        } else {
            ++$i;
            #ALSO --
            # consider how to revert to writing to the first file for comp_seq_5
            # if that's the way you wish to order the elements; otherwise, revise
            # the code above to put the first four sequences and their source
            # id's into the first file written.
        }
    }

    =head output files contain...

    b_comp_seq1 ACGCGGGGGAATTT
    b_comp_seq2 ACGGAATT
    b_comp_seq3 GGGGGCTTTCACC
    b_comp_seq4 TACCGGGAATT

    =cut

    Left as a challenge for crayfish: adding sequences (and identifiers?) to the four files created here. Hint: hashes might be one way to go.
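One way the hash hint might play out, as a rough sketch rather than a definitive answer: collect the identifier/sequence pairs into a hash of arrays keyed by output file name, then write each file in one pass. The sub name `group_records` and the `group_N.txt` naming scheme are hypothetical, and the input is assumed to be whitespace-separated tokens that alternate between `>id` and sequence, as in the `$data` string above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group alternating '>id' / 'SEQUENCE' tokens into buckets of $per_file pairs.
# Returns a hash ref: file name => array ref of "id\tsequence" lines.
# The "group_N.txt" names are a hypothetical naming scheme.
sub group_records {
    my ($data, $per_file) = @_;
    my @tokens = split ' ', $data;
    my %files;
    my $n = 0;

    while (my ($id, $seq) = splice @tokens, 0, 2) {
        last unless defined $seq;    # skip a dangling id with no sequence
        $id =~ s/\A>//;              # drop the FASTA header marker
        my $name = 'group_' . (int($n++ / $per_file) + 1) . '.txt';
        push @{ $files{$name} }, "$id\t$seq";
    }
    return \%files;
}

# Demo: show which records would land in which file.
my $data = '>b_comp_seq1 ACGCGGGGGAATTT >b_comp_seq2 ACGGAATT >b_comp_seq3 GGGGGCTTTCACC >b_comp_seq4 TACCGGGAATT';
my $groups = group_records($data, 2);

for my $name (sort keys %$groups) {
    print "$name:\n";
    print "  $_\n" for @{ $groups->{$name} };
}
```

To produce the files themselves, each key could be opened for writing and its array printed line by line. Keeping the grouping separate from the file I/O makes it easy to check the split before anything is written to disk.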


    If you didn't program your executable by toggling in binary, it wasn't really programming!

Re: separate files
by Kenosis (Priest) on Dec 31, 2013 at 20:22 UTC

    Please share the code you attempted to use to parse this fasta file, along with your results.