in reply to concatenating multiple lines without using . operator

I think the safer approach to accomplishing this is to use something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::SeqIO;
use v5.10;    # or later... or change 'say' to 'print' X_x

my $fasta_in = "input.fa";
open my $fasta_out, ">", "output.fa"
    or die "Cannot open output.fa: $!";

my $seqio_in = Bio::SeqIO->new(
    -file   => $fasta_in,
    -format => 'Fasta',
);

my %seq_hash;

while ( my $seq_obj = $seqio_in->next_seq() ) {
    my $seq_id = $seq_obj->display_id();    # this is the sequence ID
    my $seq    = $seq_obj->seq();           # this is the actual sequence
    $seq_hash{$seq_id} = $seq;              # and hashed!

    # to print them to your screen in a "consolidated" FASTA format:
    say ">$seq_id";
    say $seq_hash{$seq_id};

    # to save to a file in a "consolidated" FASTA format:
    say $fasta_out ">$seq_id";
    say $fasta_out $seq_hash{$seq_id};
}

exit;

You can trim some of the stuff inside the while loop depending on what you actually want to do. For example, if you don't need to use the hash later, there is no point making it, etc.
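For readers who don't have BioPerl installed, the same consolidation can be sketched in core Perl. This is only an illustration, not the code tested in this post: `unwrap_fasta` is a hypothetical helper name, it assumes the whole file fits in memory, and it does none of the validation Bio::SeqIO does. In keeping with the thread title, it collects the pieces in an array and joins them instead of using the `.` operator.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): join the wrapped sequence
# lines of each FASTA record so every sequence ends up on one line.
sub unwrap_fasta {
    my ($text) = @_;
    my ( @out, @chunks );
    my $in_record = 0;

    for my $line ( split /\r?\n/, $text ) {
        if ( $line =~ /^>/ ) {
            # A header line: flush the previous record's sequence first.
            push @out, join( '', @chunks ) if $in_record;
            push @out, $line;
            @chunks    = ();
            $in_record = 1;
        }
        elsif ($in_record) {
            $line =~ s/\s+//g;                  # drop stray whitespace
            push @chunks, $line if length $line;
        }
        # Lines before the first header are ignored.
    }
    push @out, join( '', @chunks ) if $in_record;

    return @out ? join( "\n", @out ) . "\n" : '';
}

# Small demonstration with made-up data.
print unwrap_fasta(">seqA\nAACC\nGGTT\n>seqB\nGATC\nACAG\n");
```

Reading a real file would mean slurping it into a string first; for very large files, a line-by-line loop that prints each record as soon as the next header is seen would be the gentler choice.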

I've tested this and it works. A sample input and corresponding output can be found here: https://gist.github.com/2928252.

Re^2: concatenating multiple lines without using . operator
by Cristoforo (Curate) on Jun 14, 2012 at 19:40 UTC
    To keep everything in 'fasta' format, you probably want to use Bio::SeqIO's write_seq().

    Sample showing output writing:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Bio::SeqIO;

    my $in  = Bio::SeqIO->new( -file => "input1.txt", -format => 'fasta' );
    my $out = Bio::SeqIO->new( -file => '>test.dat',  -format => 'fasta' );

    while ( my $seq = $in->next_seq() ) {
        if ( $seq->id() =~ /^chr(\S*)$/ ) {
            $seq->display_id($1);    # change id
        }
        $out->write_seq($seq);
    }
    __END__
    *** input 1
    >chr1
    AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGC
    CAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAAT
    TTTATCTTTAGGCGGTATGCACTTTTAACAAAAAANNNNNNNNNNNNNNN
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    GCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAAC
    CAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCNNNN
    >chrM
    GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT
    TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG
    GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT
    CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA
    AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT
    GTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAANAATTTCCACC
    >GJKKTUG01DYDGC
    GGGTATTCCTTCTCCACCTTGCAGCTAACATCAGTGTTTCGTCTACTCAAGCACGCCAAC
    ACGCCCTAGAGCGCCCTGTCCAGGGGATGGCAACCAACTCTGACCCTGCAAGTGCAGCAG
    ACATGAGGAATACAAACTACAATCTTTTACTTGATGATGCAATGCCGGACAAACTCTAGA
    >F0Z7V0F01EDB3V
    AAGGCGAGNGGTATCACGCAGTAAGTTACGGTTTTCGGGTAACGCGTCNGNGGNACTAAC
    CCACGGNGGGTAACCCGTCNCTACCGGTATAGGACTAAGGTTACCGGAACGTCGTGGGGT
    ACCCCCCGGACGGGGACCGTCCCCTCATANAGTCAACNGTNTGAGATGGACTAACTCAAA
    CCTAGTTTCAAGTACTATTTAACTTACTTACGTTACCCGTAATTTCGGCGTTTAGAGGCG

    Output:
    >1
    AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAA
    AAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTAGGCGGTATGC
    ACTTTTAACAAAAAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    NNNNNNNNNNNNNNNNNNNNGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCA
    TACCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCNNNN
    >M
    GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT
    CGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC
    GCAGTATCTGTCTTTGATTCCTGCCTCATTCTATTATTTATCGCACCTACGTTCAATATT
    ACAGGCGAACATACCTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA
    ACAATTGAATGTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAANAATTTCCACC
    >GJKKTUG01DYDGC
    GGGTATTCCTTCTCCACCTTGCAGCTAACATCAGTGTTTCGTCTACTCAAGCACGCCAAC
    ACGCCCTAGAGCGCCCTGTCCAGGGGATGGCAACCAACTCTGACCCTGCAAGTGCAGCAG
    ACATGAGGAATACAAACTACAATCTTTTACTTGATGATGCAATGCCGGACAAACTCTAGA
    >F0Z7V0F01EDB3V
    AAGGCGAGNGGTATCACGCAGTAAGTTACGGTTTTCGGGTAACGCGTCNGNGGNACTAAC
    CCACGGNGGGTAACCCGTCNCTACCGGTATAGGACTAAGGTTACCGGAACGTCGTGGGGT
    ACCCCCCGGACGGGGACCGTCCCCTCATANAGTCAACNGTNTGAGATGGACTAACTCAAA
    CCTAGTTTCAAGTACTATTTAACTTACTTACGTTACCCGTAATTTCGGCGTTTAGAGGCG

    Chris

      My impression is that s/he wanted each sequence on a single line, whereas write_seq auto-formats FASTA output into lines of 60 nucleotides/amino acids. That's why I settled on:

      say $fasta_out $seq_hash{$seq_id};

      You should be able to set the width with $seq_obj->Bio::SeqIO::fasta::width($new_width). I'm able to set a new width and $seq_obj->Bio::SeqIO::fasta::width() returns this new width; however, I can't get it to actually print using the new width... it just reverts to 60. Any suggestions?

      -Mike

      edit: btw, the code I posted does keep the sequences in Fasta format.
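
On the width question above: as far as I can tell from the Bio::SeqIO::fasta documentation, `width` is an accessor on the stream object (the one returned by `Bio::SeqIO->new`) rather than on the sequence object, which may be why calling it through `$seq_obj` does not change the output. That reading is untested here. Independent of BioPerl, a chosen line width can also be produced in core Perl; the sketch below assumes a hypothetical helper named `wrap_seq` and a plain sequence string with no embedded whitespace.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): split a sequence string
# into chunks of at most $width characters, one chunk per output line.
sub wrap_seq {
    my ( $seq, $width ) = @_;
    die "width must be a positive integer\n"
        unless defined $width && $width =~ /^[1-9][0-9]*$/;

    my @lines;
    for ( my $pos = 0; $pos < length $seq; $pos += $width ) {
        push @lines, substr( $seq, $pos, $width );
    }
    return @lines;
}

# Small demonstration with made-up data: wrap a record at 8 columns.
my $id  = 'demo';
my $seq = 'ACGTACGTACGTACGTACGT';
print ">$id\n";
print "$_\n" for wrap_seq( $seq, 8 );
```

Passing a width at least as long as the sequence gives the single-line layout used in the `say` version above.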

        Hi Mike

        I meant no criticism of your post, but I'm not sure whether Bio::SeqIO can read a file where the whole sequence is on 1 line rather than 60 chars to a line. Perhaps it can.    :-)

        I just wanted readers to know that there is a 'write_seq()' method, so they don't have to write out the 'id', 'description' or 'sequence' by hand and risk errors doing so.

        Again, I didn't mean to be critical of your post, but just to make readers aware of the write_seq method. (And I wasn't aware of the 'width' method and how it might be used).

        Chris