rarenas has asked for the wisdom of the Perl Monks concerning the following question:

Hello. I started learning perl two weeks ago and I have a practice assignment to convert fasta files into tabular format using perl. I managed to write a program that converted a fasta into tbl but only when there is one sequence in the fasta file. If I have multiple sequences in a fasta file, I cannot manage to convert them to tbl properly. This is the code I wrote that works with one sequence in the fasta:

#!/usr/bin/perl
# fasta_to_tbl.pl
use strict;
use warnings;

die "Please specify suitable file\n" if (@ARGV != 1);
my ($fasta) = @ARGV;
my $outfile = "$fasta.tbl";

open(my $in, "<", "$fasta") or die "error reading $fasta. $!";
open(my $out, ">", "$outfile") or die "error creating $outfile. $!";

my $identifier = "";
my $union = "";
while (<$in>) {
    chomp;
    next unless m/\w/;
    if ($_ =~ m/>/) {
        # Identifier!
        $_ =~ s/>//;
        ($identifier) = split /\:|\s|\||,|;/, $_;
        print "$identifier\n";
    } else {
        # We have a line with sequence
        $union = $union . $_;
    }
}
print $out "$identifier\t$union\n";
close($in);
close($out);

I realized that it would be a lot better to use hashes instead of arrays to separate the different sequences. I want to have the sequence title/name be the key and the sequence be the value. I also thought it would be good to use the local command so that I can separate based on ">" symbol instead of by line because all fasta file titles start with that symbol. I am stuck on actually implementing those realizations and then using a loop to edit the formatting for each sequence. Any suggestions? Thank you in advance!

I am using the simple fasta file below for practice but do note that many fasta files contain extra information in the title and may have a space between the title and the sequence. We only want the name in tbl and not the extra information. The code above takes care of those extras only for one sequence.

Fasta format

>sequence1
ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCTCGCAGCCGCAGCCCGCGTGG
ACGCTCTCGCCTGAGCGCCGCGGACTAGCCCGGGTGGCC

>sequence2
CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGATCGGCGCCGGAGATTCGCGA
ACCCGACACTCCGCGCCGCCCGCCGGCCAGGACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCT
CACTGGCGCGCGGGCGAGCGCACGGGCGCTC

>sequence3
CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATATACCTTTCCAGACGCGGGAT
CTCCCCTCCCC

>sequence4
CAGCAGACATCTGAATGAAGAAGAGGGTGCCAGCGGGTATGAGGAGTGCATTATCGTTAATGGGAACTTC
AGTGACCAGTCCTCAGACACGAAGGATGCTCCCTCACCCCCAGTCTTGGAGGCAATCTGCACAGAGCCAG
TCTGCACACC

Tabular format

>sequence1	ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCTCGCAGCCGCAGCCCGCGTGGACGCTCTCGCCTGAGCGCCGCGGACTAGCCCGGGTGGCC
>sequence2	CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGATCGGCGCCGGAGATTCGCGAACCCGACACTCCGCGCCGCCCGCCGGCCAGGACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCTCACTGGCGCGCGGGCGAGCGCACGGGCGCTC
>sequence3	CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATATACCTTTCCAGACGCGGGATCTCCCCTCCCC
>sequence4	CAGCAGACATCTGAATGAAGAAGAGGGTGCCAGCGGGTATGAGGAGTGCATTATCGTTAATGGGAACTTCAGTGACCAGTCCTCAGACACGAAGGATGCTCCCTCACCCCCAGTCTTGGAGGCAATCTGCACAGAGCCAGTCTGCACACC
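
For reference, the idea described in the question (a hash keyed by sequence name, filled by reading one record at a time with a localized `$/`) could be sketched roughly as follows. This is only an illustration, not code from the thread: the `read_fasta` helper, the in-memory sample data and the separator list are assumptions. Because a hash does not preserve insertion order, the sketch also keeps a separate list of names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): read FASTA records from a
# filehandle into a hash of identifier => sequence, remembering input order.
sub read_fasta {
    my ($fh) = @_;
    my ( %seq_for, @order );

    local $/ = "\n>";    # read one whole record per iteration, not one line
    while ( my $record = <$fh> ) {
        chomp $record;            # strips the trailing "\n>" separator
        $record =~ s/\A\s*>//;    # only the first record still starts with ">"

        my ( $header, $body ) = split /\n/, $record, 2;
        next unless defined $header && $header =~ /\S/;

        # Keep only the name: everything up to the first separator character.
        my ($id) = $header =~ /\A\s*([^\s:|,;]+)/;
        next unless defined $id;

        $body = '' unless defined $body;
        $body =~ s/\s+//g;        # join wrapped sequence lines, drop blank lines

        # A repeated name overwrites the earlier sequence (an assumption).
        push @order, $id unless exists $seq_for{$id};
        $seq_for{$id} = $body;
    }
    return ( \%seq_for, \@order );
}

# Demo on an in-memory file so the sketch is self-contained.
my $fasta = ">sequence1 extra info\nACTG\nGGCC\n\n>sequence2|more\nTTAA\n";
open my $in, '<', \$fasta or die "cannot open in-memory file: $!";
my ( $seq_for, $order ) = read_fasta($in);
close $in;

print "$_\t$seq_for->{$_}\n" for @$order;
```

To use it on a real file, open the FASTA file instead of the in-memory string and print to an output filehandle.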

Re: Converting fasta (with multiple sequences) into tabular using perl
by 1nickt (Canon) on Dec 12, 2017 at 17:36 UTC

    Hi, here is one solution. I'm using Path::Tiny for file handling. Also localizing the special variable $/ (the input record separator), as you hinted at.

    use strict; use warnings; use feature 'say';
    use Path::Tiny;

    local $/ = '>';

    my $fh = path('./foo.fasta')->openr;

    while ( my $paragraph = <$fh> ) {
        chomp $paragraph;
        my @lines = split /\n/, $paragraph or next;
        my ( $identifier, $string );
        for my $line ( @lines ) {
            if ( $line =~ /(sequence\d+)/ ) {
                $identifier = $1;
            }
            else {
                $string .= $line;
            }
        }
        say "$identifier\t$string";
    }
    __END__
    I used the following input file:
    >sequence1
    ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCTCGCAGCCGCAGCCCGCGTGG
    ACGCTCTCGCCTGAGCGCCGCGGACTAGCCCGGGTGGCC

    > sequence2
    CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGATCGGCGCCGGAGATTCGCGA
    ACCCGACACTCCGCGCCGCCCGCCGGCCAGGACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCT
    CACTGGCGCGCGGGCGAGCGCACGGGCGCTC

    >randomstuff sequence3
    CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATATACCTTTCCAGACGCGGGAT
    CTCCCCTCCCC

    > sequence4 blahblah
    CAGCAGACATCTGAATGAAGAAGAGGGTGCCAGCGGGTATGAGGAGTGCATTATCGTTAATGGGAACTTC
    AGTGACCAGTCCTCAGACACGAAGGATGCTCCCTCACCCCCAGTCTTGGAGGCAATCTGCACAGAGCCAG
    TCTGCACACC
    and got the following output:
    $ perl foo.pl
    sequence1	ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCTCGCAGCCGCAGCCCGCGTGGACGCTCTCGCCTGAGCGCCGCGGACTAGCCCGGGTGGCC
    sequence2	CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGATCGGCGCCGGAGATTCGCGAACCCGACACTCCGCGCCGCCCGCCGGCCAGGACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCTCACTGGCGCGCGGGCGAGCGCACGGGCGCTC
    sequence3	CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATATACCTTTCCAGACGCGGGATCTCCCCTCCCC
    sequence4	CAGCAGACATCTGAATGAAGAAGAGGGTGCCAGCGGGTATGAGGAGTGCATTATCGTTAATGGGAACTTCAGTGACCAGTCCTCAGACACGAAGGATGCTCCCTCACCCCCAGTCTTGGAGGCAATCTGCACAGAGCCAGTCTGCACACC

    Hope this helps!


    The way forward always starts with a minimal test.

      Thank you! This was indeed helpful. :)

Re: Converting fasta (with multiple sequences) into tabular using perl
by jwkrahn (Abbot) on Dec 12, 2017 at 20:23 UTC

    This appears to do what you require:

    #!/usr/bin/perl
    # fasta_to_tbl.pl
    use strict;
    use warnings;

    die "Please specify suitable file\n" if @ARGV != 1;
    my ( $fasta ) = @ARGV;
    my $outfile = "$fasta.tbl";

    open my $in,  '<', $fasta   or die "error reading $fasta. $!";
    open my $out, '>', $outfile or die "error creating $outfile. $!";

    while ( <$in> ) {
        if ( /^>/ ) {
            # Identifier!
            # "\t" TAB character for "tabular format"
            s/[:\s|,;].*/\t/s;
            print $out $_;
            next;
        }
        s/\s//g;
        if ( !/\w/ || eof $in ) {
            # empty line or end of file
            print $out "$_\n\n";
        }
        else {
            # Data
            print $out $_;
        }
    }

      Thank you! This is helpful!

Re: Converting fasta (with multiple sequences) into tabular using perl
by kcott (Archbishop) on Dec 13, 2017 at 02:32 UTC

    G'day rarenas,

    Welcome to the Monastery.

    I'd probably use a technique more along these lines:

    #!/usr/bin/env perl

    use 5.014;
    use strict;
    use warnings;

    {
        local $/ = "\n>";
        while (<DATA>) {
            chomp;
            substr $_, 0, 1, '' if $. == 1;
            my ($head, $seq) = split /\n/, $_, 2;
            printf "%-15s %s\n", $head, $seq =~ y/\n//dr;
        }
    }

    __DATA__
    >SEQ1
    AAA
    >SEQ2
    AAA
    CCC
    >SEQ3
    AAA
    CCC
    GGG
    >SEQ4 plus ...
    AAA
    CCC
    GGG
    TTT

    Output:

    SEQ1            AAA
    SEQ2            AAACCC
    SEQ3            AAACCCGGG
    SEQ4 plus ...   AAACCCGGGTTT

    Notes:

    • If you are the Anonymous Monk to whom I responded with "Re: unique sequences" (a couple of days ago), some of that code should look familiar. If you're not that person, or just need a revision, see that post for actual I/O and links to documentation for various elements used in this post.
    • See "perl5140delta: Non-destructive substitution" for the 'r' modifier I've used with y///. If you're using a Perl version earlier than 5.14, you'll need to do the transliteration (or transobliteration as it's been called when the 'd' modifier is used) as a separate step prior to printf.
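
The second note above (doing the transliteration as a separate step on Perls older than 5.14) might look something like the following. It is a minimal sketch, not Ken's code: the `format_record` helper and the sample record are assumptions made for illustration, and the record is assumed to be already chomped with its leading ">" removed.

```perl
use strict;
use warnings;

# Hypothetical helper: format one "header\nseq\nseq..." record without
# relying on the /r modifier introduced in Perl 5.14.
sub format_record {
    my ($record) = @_;
    my ( $head, $seq ) = split /\n/, $record, 2;
    $seq = '' unless defined $seq;    # a header with no sequence lines
    $seq =~ tr/\n//d;                 # separate, in-place deletion of newlines
    return sprintf "%-15s %s\n", $head, $seq;
}

print format_record("SEQ2\nAAA\nCCC");
```

The only change from the 5.14 version is that `$seq` is modified in place before formatting, instead of being transformed inside the `printf` argument list.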

    — Ken

      Thank you Ken!

Re: Converting fasta (with multiple sequences) into tabular using perl
by Marshall (Canon) on Dec 13, 2017 at 17:47 UTC
    Here is another technique for your toolbox. I don't see the need to separate the identifier or create $union.
    #!/usr/bin/perl
    use strict;
    use warnings;

    while (my $line = <DATA>)
    {
        process_record($line) if $line =~ /^>/;
    }

    sub process_record
    {
        my $line = shift;
        $line =~ s/\s*$//;    #or chomp $line
        print "$line\t";
        while (defined ($line=<DATA>) and $line !~ /^\s*$/)
        {
            $line =~ s/\s*$//;    #or chomp $line
            print $line;
        }
        print "\n";
    }

    =Prints:
    >sequence1	ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCTCGCAGCCGCAGCCCGCGTGGACGCTCTCGCCTGAGCGCCGCGGACTAGCCCGGGTGGCC
    >sequence2	CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGATCGGCGCCGGAGATTCGCGAACCCGACACTCCGCGCCGCCCGCCGGCCAGGACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCTCACTGGCGCGCGGGCGAGCGCACGGGCGCTC
    >sequence3	CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATATACCTTTCCAGACGCGGGATCTCCCCTCCCC
    >sequence4	CAGCAGACATCTGAATGAAGAAGAGGGTGCCAGCGGGTATGAGGAGTGCATTATCGTTAATGGGAACTTCAGTGACCAGTCCTCAGACACGAAGGATGCTCCCTCACCCCCAGTCTTGGAGGCAATCTGCACAGAGCCAGTCTGCACACC
    =cut

    __DATA__
    >sequence1
    ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCTCGCAGCCGCAGCCCGCGTGG
    ACGCTCTCGCCTGAGCGCCGCGGACTAGCCCGGGTGGCC

    >sequence2
    CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGATCGGCGCCGGAGATTCGCGA
    ACCCGACACTCCGCGCCGCCCGCCGGCCAGGACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCT
    CACTGGCGCGCGGGCGAGCGCACGGGCGCTC

    >sequence3
    CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATATACCTTTCCAGACGCGGGAT
    CTCCCCTCCCC

    >sequence4
    CAGCAGACATCTGAATGAAGAAGAGGGTGCCAGCGGGTATGAGGAGTGCATTATCGTTAATGGGAACTTC
    AGTGACCAGTCCTCAGACACGAAGGATGCTCCCTCACCCCCAGTCTTGGAGGCAATCTGCACAGAGCCAG
    TCTGCACACC
    Update:
    I missed the idea that there could be some "cruft" after the id, because the example data provided did not include a case like that. I would recommend something like this change to process_record():
    (my $id) = $line =~ /^(>\w+)/;   #allow for cruft after id
    print "$id\t";
    In general, use a regex when you know what you want to extract. Use a split when you know what you want to throw away. With the above regex, it is not necessary to enumerate all of these potential non-word things like "|;(space)", etc. However, note that underscore "_" is a "word character" (any character valid in a Perl variable name is allowed for \w). If underscore could directly follow the sequence id, then of course the regex would need to change. However, I see no evidence that would be necessary.
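
The regex-versus-split rule of thumb above can be tried out with a small comparison. This is a sketch for illustration only: the helper names and the sample headers are made up, the regex is the `/^(>\w+)/` from the update, and the split pattern is a character-class version of the separator list from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: say what to KEEP (a ">" followed by word characters).
sub id_by_regex {
    my ($header) = @_;
    my ($id) = $header =~ /^(>\w+)/;
    return $id;
}

# Hypothetical helper: say what to THROW AWAY (everything from the first
# listed separator onward).
sub id_by_split {
    my ($header) = @_;
    my ($id) = split /[:\s|,;]/, $header;
    return $id;
}

# Made-up headers; the last two show where the two approaches differ.
for my $header ( '>sequence1', '>sequence2|gi 12345', '>seq_3;note', '>sequence5.1 desc' ) {
    printf "%-22s regex=%-12s split=%s\n",
        $header, id_by_regex($header), id_by_split($header);
}
```

With these inputs the underscore in `>seq_3` is kept by both approaches because `_` matches `\w`, while the `.` in `>sequence5.1` ends the regex match but is not in the split separator list, so the two helpers return different ids for that header.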

      Thanks for the response! The multi fasta file I put up is a very simple file but I know, from working with fasta files, that sometimes the titles can be complex and may be divided in strange ways. Sometimes there isn't a space between the name and the rest of the information but instead other symbols that can be used as separators. That's the reason why I had all those non-word symbols. But, as you say, it is not necessary if I know what I want to throw away. This really helps! Thanks again!

Re: Converting fasta (with multiple sequences) into tabular using perl
by rarenas (Acolyte) on Dec 14, 2017 at 10:53 UTC

    Thank you all for responding. All your responses were helpful. I find it funny that what I needed was so short and simple in code. Have a lot to learn. Thanks again! :)

      If you are working in this field, you may be interested in BioPerl.