monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Beloved Monks,

I have the following code that takes a sequence spanning multiple lines, then
concatenates every line that doesn't begin with '>' and prints the sequence index next to it.
See below for the output.

Currently the solution I came up with uses an *array*, and it somehow
looks clumsy. I couldn't think of a straightforward way that avoids using it.
In short, I'm after a more elegant solution.

The reason why I want to do this is that I need to process the concatenated sequence
on the fly, without having to store it in an array in the first place. There are practically
thousands of these sequences.

I wonder how the masters would approach this.

#!/usr/bin/perl -w
use strict;

my @string;   # Can I avoid using this array for the task?
while(<DATA>){
    s/\s//g;
    if (/^>/) {
        push(@string,'');
        next;
    }
    chomp;
    # concatenate current line
    # to the last array item
    $string[-1] .= $_;
}
print $_+1," : ", $string[$_],"\n" foreach (0..$#string);

__DATA__
> Seq 1 (two lines)
AAAAAAAAAAAAA
CCAAAAAAAAAAA
> Seq 2 (two lines)
AAAAAAAAAAAAA
AAAAAAAAAAAAA
> Seq 3 (one line)
TTTTTTTTTTTTAACTGAAGATTCGC

The desired output, which the current code also gives, is:

1 : AAAAAAAAAAAAACCAAAAAAAAAAA
2 : AAAAAAAAAAAAAAAAAAAAAAAAAA
3 : TTTTTTTTTTTTAACTGAAGATTCGC

Thanks so much beforehand.
Regards,
Edward

Replies are listed 'Best First'.
Re: How to avoid using array in concatenating string of multiple lines
by Zaxo (Archbishop) on Dec 09, 2004 at 10:58 UTC

    You can accumulate the current string in a scalar and then print and reset it as soon as you know you'll want to. The code will look a lot like what you have.

    #!/usr/bin/perl -w
    use strict;

    my ($string, $count);
    while(<DATA>){
        s/\s//g;
        next if !$count and /^>/;
        if (/^>/ and $count) {
            print $count, ' : ', $string, $/;
            $string = '';
            $count++;
            next;
        }
        $count || $count++;
        $string .= $_;
    }
    print $count, ' : ', $string, $/;

    __DATA__
    > Seq 1 (two lines)
    AAAAAAAAAAAAA
    CCAAAAAAAAAAA
    > Seq 2 (two lines)
    AAAAAAAAAAAAA
    AAAAAAAAAAAAA
    > Seq 3 (one line)
    TTTTTTTTTTTTAACTGAAGATTCGC

    I've removed a fencepost error by checking that the > line is not the first.

    Having a leading marker instead of a trailing one makes this a little awkward.

    After Compline,
    Zaxo
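
The accumulate-and-flush idea above can also be wrapped in a small iterator, so the caller sees one finished sequence at a time and nothing is kept in an array. This is only a minimal sketch: `make_seq_reader` is a hypothetical helper name, and the sample data is read from an in-memory filehandle instead of `DATA` to keep the example self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns a closure that yields
# one concatenated sequence per call, or undef once the input is exhausted.
sub make_seq_reader {
    my ($fh) = @_;
    my $seq;          # sequence being accumulated; undef until a header is seen
    my $done = 0;
    return sub {
        return if $done;
        while (my $line = <$fh>) {
            $line =~ s/\s+//g;               # drop the newline and stray blanks
            next unless length $line;
            if ($line =~ /^>/) {
                my $finished = $seq;
                $seq = '';
                # A header closes the previous record, if there was one.
                return $finished if defined $finished;
                next;
            }
            $seq = '' unless defined $seq;
            $seq .= $line;
        }
        $done = 1;
        return $seq;  # the last record has no following header to flush it
    };
}

my $fasta = join "\n",
    '> Seq 1 (two lines)', 'AAAAAAAAAAAAA', 'CCAAAAAAAAAAA',
    '> Seq 2 (two lines)', 'AAAAAAAAAAAAA', 'AAAAAAAAAAAAA',
    '> Seq 3 (one line)',  'TTTTTTTTTTTTAACTGAAGATTCGC', '';

open my $fh, '<', \$fasta or die "Can't open in-memory file: $!";
my $next = make_seq_reader($fh);
my $i = 0;
while (defined(my $s = $next->())) {
    print ++$i, " : $s\n";   # process each sequence as soon as it is complete
}
close $fh;
```

The leading-marker awkwardness is confined to the closure: the first header has nothing to flush, and the final record is returned when the input runs out.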

Re: How to avoid using array in concatenating string of multiple lines
by reneeb (Chaplain) on Dec 09, 2004 at 12:02 UTC
    You can also use Bio::FASTASequence::File:
    use Bio::FASTASequence::File;

    my $file = '/path/to/seq.fa';
    my $obj = Bio::FASTASequence::File->new($file);
    my $result_ref = $obj->get_result();

    my $counter = 1;
    foreach(keys(%{$result_ref})){
        print $counter,": ",$result_ref->{$_}->getSequence(),"\n";
        $counter++;
    }


    Your sequences:

    >Seq1 (two lines)
    AAAAAAAAAAAAA
    CCAAAAAAAAAAA
    >Seq2 (two lines)
    AAAAAAAAAAAAA
    AAAAAAAAAAAAA
    >Seq3 (one line)
    TTTTTTTTTTTTAACTGAAGATTCGC

Re: How to avoid using array in concatenating string of multiple lines
by snowcrash (Friar) on Dec 09, 2004 at 11:01 UTC
    #!/usr/bin/perl -w
    use strict;

    my $i = 0;
    while(<DATA>){
        s/\s//g;
        if (/^>/) {
            print "\n" if $i;
            print ++$i, " : ";
            next;
        }
        chomp;
        print;
    }
    print "\n";

Re: How to avoid using array in concatenating string of multiple lines
by stajich (Chaplain) on Dec 09, 2004 at 13:07 UTC
    I'd suggest using already written modules or looking at code from people who have already solved this problem. Bio::SeqIO for one. See the code in the next_seq method. The beauty is if you want to change the file format to genbank you replace 'fasta' with 'genbank'.
    use Bio::SeqIO;

    my $in = Bio::SeqIO->new(-format => 'fasta', -fh => \*DATA);
    my $i = 1;
    while( my $seq = $in->next_seq ) {
        print $i++, " : ", $seq->seq(), "\n";
    }

    __DATA__
    > Seq 1 (two lines)
    AAAAAAAAAAAAA
    CCAAAAAAAAAAA
    > Seq 2 (two lines)
    AAAAAAAAAAAAA
    AAAAAAAAAAAAA
    > Seq 3 (one line)
    TTTTTTTTTTTTAACTGAAGATTCGC

      Thanks so much for your reply stajich.

      Indeed the module is very very useful.
      However I find a problem while extending the usage. Perhaps you can give some advice.

      Suppose I am taking a file as input (the content of the file is similar to my OP), and wish to process that file in *multiple trials*. So I have the following code.

      #!/usr/bin/perl -w
      use strict;
      use Bio::SeqIO;

      my $file = $ARGV[0];

      open INFILE, "<$file" or die "$0: Can't open file $file: $!";

      for (my $trial = 1; $trial <=2; $trial++) {
          seek(INFILE,0,0); #This is line 10
          print "Trial $trial\n";
          my $i =1;
          my $in = Bio::SeqIO->new(-format => 'fasta', -fh => \*INFILE);
          while( my $seq = $in->next_seq ) {
              print $i++, " : ", $seq->seq(), "\n";
          }
      }

      My code above (specifically the Bio::SeqIO part) triggers these warnings when it reaches the second trial:

      Trial 1
      1 : TGCAATCACTAGCAAGCTCTCGCTGCCGTCACTAGCCTGTGG
      2 : GGGGCTAGGGTTAGTTCTGGANNNNNNNNNNNNNNNNNNNNN
      seek() on closed filehandle INFILE at test.pl line 10.
      Trial 2
      readline() on closed filehandle INFILE at /usr/lib/perl5/site_perl/5.8.0/Bio/Root/IO.pm line 440.

      I know that I can avoid these warnings by replacing the seek call with "open INFILE..".
      But I am curious how I can solve this problem if I intend to keep the seek call,
      since I find the solution neater that way. Hope to hear from you again.
      Regards,
      Edward
        Then open the file within the for-loop:

        #!/usr/bin/perl -w
        use strict;
        use Bio::SeqIO;

        my $file = $ARGV[0];

        for (my $trial = 1; $trial <=2; $trial++) {
            open INFILE, "<$file" or die "$0: Can't open file $file: $!";

            seek(INFILE,0,0); #This is line 10
            print "Trial $trial\n";
            my $i =1;
            my $in = Bio::SeqIO->new(-format => 'fasta', -fh => \*INFILE);
            while( my $seq = $in->next_seq ) {
                print $i++, " : ", $seq->seq(), "\n";
            }
        }

        Running the while loop

        while( my $seq = $in->next_seq ) {
            print $i++, " : ", $seq->seq(), "\n";
        }

        will read until the end of the filehandle. (That is why the loop ended in the first place.) So subsequently calling next_seq on the $in object will give you the error message you are seeing. You can:
        1. Create the SeqIO object outside of the trial loop, and put seek(INFILE,0,0) inside the loop. Note that if the SeqIO object gets destroyed (or goes out of scope), the filehandle is closed with it unless you pass the -noclose => 1 option when initializing the SeqIO object. But if you are going to do this, just move the initialization of the SeqIO object outside the loop.
        2. Re-open the file each time in the loop (move the Bio::SeqIO initialization into the trial loop).
        3. Or read all the sequences in at once and keep them in memory. (put them into an array).
          my @seqs;
          while(my $s = $in->next_seq ) {
              push @seqs, $s;
          }
          # now do your loop of trials
        #3 might not work depending on how many sequences and how big they are.
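
To illustrate the rewind-and-reread pattern from option 1 without depending on BioPerl, here is a core-Perl sketch. `count_sequences` is a hypothetical stand-in for the Bio::SeqIO loop, and the data comes from an in-memory filehandle; the point is only that the handle stays open across trials because nothing closes it, so a three-argument `seek` can rewind it each time. With Bio::SeqIO, the advice above about `-noclose => 1` is what keeps the handle open.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the per-trial work: counts FASTA header lines.
sub count_sequences {
    my ($fh) = @_;
    my $n = 0;
    while (my $line = <$fh>) {
        $n++ if $line =~ /^>/;
    }
    return $n;
}

my $data = ">Seq1\nAAAA\nCCCC\n>Seq2\nGGGG\n";
open my $fh, '<', \$data or die "Can't open in-memory file: $!";

for my $trial (1 .. 2) {
    # seek takes FILEHANDLE, POSITION, WHENCE; whence 0 means "from the start".
    seek($fh, 0, 0) or die "Can't rewind: $!";
    print "Trial $trial: ", count_sequences($fh), " sequences\n";
}

close $fh;   # closed once, after all trials are done
```

Using a lexical filehandle (`my $fh`) instead of the bareword `INFILE` also makes it easier to see where the handle's lifetime ends.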
Re: How to avoid using array in concatenating string of multiple lines
by Anonymous Monk on Dec 09, 2004 at 10:43 UTC
    $/ = "\n>";
    That is also clumsy, for several reasons:
    1. You have to delete the trailing ">", at the end of every string (but the last)
    2. You have to delete the leading ">", at the start of every line
    3. You still have to delete every newline, and chomp won't work, because of the changed setting for $/.

    Oh well. Such is life.

    s/^>//; s/>$//; tr/\n//d;
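
For what it's worth, the record-separator approach can be made to work in a few lines. The sketch below is one possible arrangement, not the poster's code: `each_record` is a hypothetical helper that calls a callback per sequence so nothing is stored. One detail worth checking against perlfunc: `chomp` removes whatever `$/` currently holds, so it does strip the trailing "\n>", although the newlines inside the record still need `tr`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reads one whole FASTA record per <$fh> call and
# hands (index, sequence) to the callback. Returns the number of records.
sub each_record {
    my ($fh, $callback) = @_;
    local $/ = "\n>";                # a record now ends at the next header marker
    my $n = 0;
    while (my $rec = <$fh>) {
        chomp $rec;                  # strips the trailing "\n>" (the current $/)
        $rec =~ s/^>//;              # only the first record keeps a leading ">"
        $rec =~ s/^[^\n]*\n?//;      # drop the header line itself
        $rec =~ tr/\n\r \t//d;       # remove line breaks and blanks in the sequence
        $callback->(++$n, $rec);
    }
    return $n;
}

my $fasta = join "\n",
    '> Seq 1 (two lines)', 'AAAAAAAAAAAAA', 'CCAAAAAAAAAAA',
    '> Seq 2 (two lines)', 'AAAAAAAAAAAAA', 'AAAAAAAAAAAAA',
    '> Seq 3 (one line)',  'TTTTTTTTTTTTAACTGAAGATTCGC', '';

open my $fh, '<', \$fasta or die "Can't open in-memory file: $!";
each_record($fh, sub {
    my ($index, $seq) = @_;
    print "$index : $seq\n";
});
close $fh;
```

Localizing `$/` inside the sub keeps the changed separator from leaking into other reads in the same program.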
Re: How to avoid using array in concatenating string of multiple lines
by pingo (Hermit) on Dec 09, 2004 at 10:44 UTC
    This is most likely not the best or prettiest way of doing it, but I think it works. :-)
    my $counter = 1;
    my $tmp = '';
    foreach(<DATA>) {
        chomp;
        if (!/^>/) {
            $tmp .= $_;
        } else {
            print $counter++, " : $tmp\n" if length $tmp;
            $tmp = "";
        }
    }
    print $counter, " : $tmp\n" if length $tmp;


    Update: Oops, missing chomp.
Re: How to avoid using array in concatenating string of multiple lines
by BrowserUk (Patriarch) on Dec 09, 2004 at 11:19 UTC

    Caveat: Different quotes for different folks!

    perl -ple"BEGIN{$/=qq'\n>'}s[>? Seq (\d+).*$][$1 : ]m; tr[\n.][]d" in >out

Re: How to avoid using array in concatenating string of multiple lines
by sasikumar (Monk) on Dec 09, 2004 at 11:49 UTC
    Hi, I feel this is better.

    use strict;
    open(DATA,"< c:\\temp.txt") || die "Sorry";
    print grep (!/^>/,<DATA>);

    As usual, there are lots of ways to do the same thing in Perl.
    Thanks
    Sasi Kumar

    Oops, sorry, I misunderstood. Here it goes:

    use strict;
    open(DATA,"< c:\\temp.txt") || die "Sorry";
    while(<DATA>){
        if ($_=~s/(^>\s+)//){
            print "\n";
        } else {
            chomp;
            print;
        }
    }
Re: How to avoid using array in concatenating string of multiple lines
by si_lence (Deacon) on Dec 09, 2004 at 11:09 UTC
    Yet another version
    si_lence
    use strict;

    my $count=1;
    while(<DATA>){
        s/\s//g;
        chomp;
        /^>/ ? print "\n " . $count++ . ": " : print;
    }

    __DATA__
    > Seq 1 (two lines)
    AAAAAAAAAAAAA
    CCAAAAAAAAAAA
    > Seq 2 (two lines)
    AAAAAAAAAAAAA
    AAAAAAAAAAAAA
    > Seq 3 (one line)
    TTTTTTTTTTTTAACTGAAGATTCGC
Re: How to avoid using array in concatenating string of multiple lines
by Fletch (Bishop) on Dec 09, 2004 at 11:12 UTC
    $ grep '^[ACGT]' foo | cat -n
         1  AAAAAAAAAAAAA
         2  CCAAAAAAAAAAA
         3  AAAAAAAAAAAAA
         4  AAAAAAAAAAAAA
         5  TTTTTTTTTTTTAACTGAAGATTCGC

    TMTOWTDI, some of them not involving perl at all . . .

    Update: D'oh! Never mind me. That'll teach me to try and make a cogent point on 5 hours sleep . . .

      Nice idea, but it doesn't do what the OP wants. He wants to concatenate each sequence into one line, then print it and its sequence number. TMTOWTDI is only useful if the other ways are equivalent.