FIJI42 has asked for the wisdom of the Perl Monks concerning the following question:

I have a sample subroutine that parses through two fasta-formatted files containing several genes/gene sequences and returns two sets of hashes: one with values as nucleotide sequences, and another with values as description information - both have gene names as keys.

Ultimately, I want to print the hash information into two separate files in fasta format (i.e. gene name and description in header, then sequence underneath, for each gene). However, I don't know how to print the information to the files in the correct format - any suggestions how to do this?

Here's my code so far

#!/usr/bin/perl use strict; use warnings; my ($file1, $file2) = @ARGV; my $out1 = 'output1.fasta'; my $out2 = 'output2.fasta'; open (FILE1, "<$file1") or die "Cannot open input file: $file1 !\n"; open (FILE2, "<$file2") or die "Cannot open input file: $file2 !\n"; open (my $output1, ">$out1") or die "Cannot open output file: $out1 !\ +n"; open (my $output2, ">$out2") or die "Cannot open output file: $out2 !\ +n"; my @file1=<FILE1>; my @file2=<FILE2>; my %File1=(); my %File1_D=(); my %File2=(); my %File2_D=(); Parse2Files (\@file1, \@file2, \%File1, \%File2, \%File1_D, \%File2_D) +; my @temp1a = keys %File1; ## gene name my @temp1b = values %File1_D; ## description my @temp1c = values %File1; ## gene seq print $output1 "@temp1a, @temp1b, @temp1c\n"; my @temp2a = keys %File2; ## gene name my @temp2b = values %File2_D; ## description my @temp2b = values %File2; ## gene seq print $output2 "@temp2a, @temp2b, @temp2c\n"; sub Parse2Files { my ($arrayref1, $arrayref2, $hashref1, $hashref2) = @_; my @file1=@{$arrayref1}; my @file2=@{$arrayref2}; my $f1dscp=''; ## File 1 - fasta description my $f1name=''; ## File 1 - fasta gene name my $f2dscp=''; ## File 2 - fasta description my $f2name=''; ## File 2 - fasta gene name for (my $i=0; $i<=$#file1; $i++) { chomp($file1[$i]); if ($file1[$i] =~ m/^>(\S+)\s(.+)/) { $f1name = $1; $f1dscp = $2; $File1_D{$f1name} = $f1dscp; } else{ $File1{$f1name} .= $file1[$i]; } } for (my $i=0; $i<=$#file2; $i++) { chomp($file2[$i]); if ($file2[$i] =~ m/^>(\S+)\s(.+)/){ $f2name = $1; $f2dscp = $2; $File2_D{$f2name} = $f2dscp; } else{ $File2{$f2name} .= $file2[$i]; } } return (%File1,%File2, %File1_D, %File2_D); }

Yes, the code is basically taking two fasta files and putting them back into another fasta-formatted file - I'm just toying with this to practice making output filed.

Replies are listed 'Best First'.
Re: Print Output to New File
by 1nickt (Canon) on Nov 03, 2017 at 10:52 UTC

    Hi, as was pointed out in a fashion by another monk, you have a great deal of duplication in your code. One of the great things about variables is that they are, um, variable. That is, they can hold variable values. Thus, you don't need a different variable for each piece of data you are processing. Then, once you are reusing variables, you will find that you can reuse code that processes them.

    Another important technique is to use meaningful variable names, so that a reader of code, or you a week later, can understand what is going on.

    So for example I might gently rewrite your code to contain something like:

    ... my @infiles = @ARGV; for my $file ( @infiles ) { my ($descriptions, $sequences) = ParseFile( $file ); ... # output the data here } ... sub ParseFile { my $file = shift; my %descriptions = my %sequences = (); open (my $fh, '<', $file) or die "Cannot open input file $file: +$!\n"; while ( my $line = <$fh> ) { chomp $line; if ( $line =~ m/^>(\S+)\s(.+)/ ) { $descriptions{$1} = $2; } else { $sequences{$1} = $.; } } return (\%descriptions, \%sequences); } ...
    (example code only)

    Hope this helps!


    The way forward always starts with a minimal test.
Re: Print Output to New File
by hippo (Archbishop) on Nov 03, 2017 at 09:35 UTC
    for my $key (keys %File1) { print "Gene: $key\nDescription: $File1_D{$key}\nData: $File1{$key} +\n"; }

    Once you understand this you can trivially convert it into FASTA format using a Template or format or your own routine.

Re: Print Output to New File
by holli (Abbot) on Nov 02, 2017 at 23:28 UTC
    Do you realize that you wrote the same program twice?


    holli

    You can lead your users to water, but alas, you cannot drown them.
Re: Print Output to New File (UPDATED)
by thanos1983 (Parson) on Nov 03, 2017 at 09:00 UTC

    Hello FIJI42,

    Regarding I want to print the hash information into two separate files in fasta format try sample of code below:

    #!/usr/bin/perl use strict; use warnings; use Data::Dumper; use YAML qw( DumpFile LoadFile ); my %hash = ( ';LCBO - Prolactin precursor - Bovine' => "a sample seque +nce in FASTA format MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHD +LSS EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPL +YHL VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTK +DED ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*" ); p %hash; my $filename = 'test.txt'; DumpFile( $filename, \%hash ); my $reconstituted = LoadFile($filename); # check it out in Data::Dumper style- print Dumper $reconstituted; __END__ $ perl test.pl { ';LCBO - Prolactin precursor - Bovine' "a sample sequence in FAS +TA format MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHD +LSS EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPL +YHL VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTK +DED ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*" } $VAR1 = { ';LCBO - Prolactin precursor - Bovine' => 'a sample sequence + in FASTA format MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHD +LSS EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPL +YHL VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTK +DED ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*' }; $ cat test.txt --- ';LCBO - Prolactin precursor - Bovine': |- a sample sequence in FASTA format MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYI +HDLSS EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWND +PLYHL VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQ +TKDED ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*

    I do not have sample of your data to test but it should be ok.

    Update: Adding sample of fasta data and output that I found here FASTA format. Play around with the hash and the data that you want to populate.

    Hope this helps, BR.

    Seeking for Perl wisdom...on the process of learning...not there...yet!

      Hi thanos1983,

      What is p %hash;? Typo?

      $ perl -c 1202669.pl Operator or semicolon missing before %hash at 1202669.pl line 13. Ambiguous use of % resolved as operator % at 1202669.pl line 13. Bareword "p" not allowed while "strict subs" in use at 1202669.pl line + 13. Bareword "hash" not allowed while "strict subs" in use at 1202669.pl l +ine 13. 1202669.pl had compilation errors


      The way forward always starts with a minimal test.

        Hello 1nickt,

        I was experimenting with Data::Printer as an alternative way.

        You can use it for example like:

        use Data::Printer; p %hash;

        This why you could see the printing part. Hope is more clear now. :)

        Update: Sample of code that I was experimenting, see below:

        #!/usr/bin/perl use strict; use warnings; use Data::Printer; my %hash = ( ';LCBO - Prolactin precursor - Bovine' => "a sample seque +nce in FASTA format MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHD +LSS EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPL +YHL VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTK +DED ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*" ); my $filename = 'output.txt'; open(my $fh, '>', $filename) or die "Could not open file '$filename' $!"; p(%hash, output => $fh); close $fh or warn "Could not open file '$filename' $!"; print "Done\n"; __END__ $ perl test.pl Done $ cat output.txt { ';LCBO - Prolactin precursor - Bovine' "a sample sequence in FAS +TA format MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHD +LSS EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPL +YHL VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTK +DED ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*" }

        BR / Thanos

        Seeking for Perl wisdom...on the process of learning...not there...yet!