newbie25 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm a newbie when it comes to Perl and am stuck on a simple script. I am trying to take two different text files (exact same size and alignment, line by line) and put them together into one file.
File 1 looks like this (it's a FASTA DNA sequence):
>GK9PVB108JH8SN rank=0000021 x=3781.0 y=885.0 length=137
TGATTATGAGTTAGATGTTCGCTCTGAGGTTTCAACGATGCTTCAAGATTCCTAATTCGC
GTTGCGACTCTCGAGTATGCGTTCTATTCACATTTCTGTTGTCGTACATATTTGACTCAC
GATCTTGATTTCTTATC
<br>
>GK9PVB108JYDQ5 rank=0000032 x=3965.0 y=143.0 length=53
GTTTTCAACGCTGGTTCGAGATTTCCTAATTTCACATTGCGACTCTCGAGTGC
File 2 is the complement of 1 and looks like this (it's a quality assessment of the above data):
>GK9PVB108JH8SN rank=0000021 x=3781.0 y=885.0 length=137
40 38 38 30 30 30 38 40 40 30 30 30 36 40 40 40 40 40 40 40 40 40 40 40 39 34 34 14 14 14 14 13 19 14 18 25 32 32 34 27
>GK9PVB108JYDQ5 rank=0000032 x=3965.0 y=143.0 length=53
25 21 21 23 23 37 31 31 31 37 37 39 38 40 40 40 40 40 40 40 40 39 39 39 39 35 23 25 25 37 34 35 36 37 37 37 37 37 39 39 40 40 40 40 39 38 35 33 33 33 32 31 19
What I want is a file that combines these, putting the quality data under the corresponding sequence, e.g.:
>GK9PVB108JYDQ5 rank=0000032 x=3965.0 y=143.0 length=53
GTTTTCAACGCTGGTTCGAGATTTCCTAATTTCACATTGCGACTCTCGAGTGC
>GK9PVB108JYDQ5 rank=0000032 x=3965.0 y=143.0 length=53
25 21 21 23 23 37 31 31 31 37 37 39 38 40 40 40 40 40 40 40 40 39 39 39 39 35 23 25 25 37 34 35 36 37 37 37 37 37 39 39 40 40 40 40 39 38 35 33 33 33 32 31 19
The script I have written is:
use warnings;
use strict;

open( F1, "subsettest.txt" ) or die "file1: $!";
open( F2, "subsettest1.txt" ) or die "file2: $!";

while ( my$s1 = <F1> ) {
    my$s2 = <F2>;
    if ( $s1 =~ /^>/ ) {
        print $s1;
    }
    else {
        chomp $s1;
        print "$s1\n$s2";
    }
}

close F1;
close F2;
exit;
The problem is that this interleaves the second file line by line into the first file instead of putting it underneath, e.g.:
>GK9PVB108I6QD3 rank=0000053 x=3650.5 y=393.0 length=71
ACACTTTAGCGGGACATTATTACAAGAAGGTACCTGAACCACATCGGGTTTCCTTGCTTC
40 40 38 34 21 21 21 30 36 40 35 33 33 36 32 32 33 29 28 26 26 28 31 26 26 31 24 24 23 23 34 25 27 27 32 32 22 22 22 23 32 32 34 34 34 19 19 19 27 26 26 16 16 14 14 29 29 27 27 31
TTCAACGGTAA
What am I missing?!?!!

Replies are listed 'Best First'.
Re: combine 2 fasta files into 1
by NetWallah (Canon) on Jul 30, 2010 at 20:25 UTC
    The formatting of the files is not clear.

    If you indicate every appearance of a newline by <br>, then the second file does not contain any newlines, so the single <F2> will suck in the entire file, giving you the results you seem to have.

    It would be a good idea to check if you actually got something in that read:

    defined( my $s2 = <F2> ) or die "I really need F2 to contain the same number of records as F1:$!";
    To improve readability, please add a space after each "my".
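A minimal sketch of that check, assuming both files really are supposed to have the same number of lines. The helper name read_pair is made up for illustration, and the in-memory strings only stand in for subsettest.txt and subsettest1.txt:

```perl
use strict;
use warnings;

# Hypothetical helper: read one line from each handle and insist that
# the second file has not run out before the first.
sub read_pair {
    my ( $fh1, $fh2 ) = @_;
    my $s1 = <$fh1>;
    return unless defined $s1;    # first file exhausted: stop cleanly
    my $s2 = <$fh2>;
    defined $s2 or die "second file has fewer lines than the first\n";
    return ( $s1, $s2 );
}

# In-memory stand-ins for the two real files (assumed data).
my $seq_text  = ">id1\nACGT\n";
my $qual_text = ">id1\n40 40 38 30\n";
open( my $f1, '<', \$seq_text )  or die "file1: $!";
open( my $f2, '<', \$qual_text ) or die "file2: $!";

# Print the lines side by side so a mismatch is easy to spot.
while ( my ( $s1, $s2 ) = read_pair( $f1, $f2 ) ) {
    print "F1: $s1", "F2: $s2";
}
close $f1;
close $f2;
```

If the two files do not line up one to one, this dies at the first missing line instead of quietly printing an undefined value.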

         Syntactic sugar causes cancer of the semicolon.        --Alan Perlis

Re: combine 2 fasta files into 1
by BioLion (Curate) on Jul 31, 2010 at 10:06 UTC

    Not actually a fix to your code, but a slightly different approach, would be to temporarily store each FASTA record in a hash using the complete header as a key. This way you can check for existence of a known header and print out the records together.

    I haven't tested it, but the problem you are getting looks like it could be a buffering issue? If you add ++$|; (which turns off output buffering), the issue might go away, but I still think I would go with the hash approach as it doesn't rely on the records being in the same order etc... HTH
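A rough sketch of the hash approach, assuming every record starts with a '>' header line and that the header text is identical in both files. The helper name read_records is made up for illustration, and the in-memory strings are invented sample data standing in for the real files:

```perl
use strict;
use warnings;

# Hypothetical helper: read a FASTA-style file and return
# ( \%lines_by_header, \@headers_in_file_order ).
sub read_records {
    my ($fh) = @_;
    my ( %body, @order, $header );
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^>/ ) {
            $header = $line;
            push @order, $header unless exists $body{$header};
            $body{$header} = [];
        }
        elsif ( defined $header ) {
            # A record may span any number of lines.
            push @{ $body{$header} }, $line;
        }
    }
    return ( \%body, \@order );
}

# Assumed sample data: the sequence wraps over two lines, the quality does not.
my $seq_text  = ">r1 length=6\nACG\nTTA\n>r2 length=3\nGGC\n";
my $qual_text = ">r1 length=6\n40 38 38 30 30 30\n>r2 length=3\n25 21 21\n";

open( my $f1, '<', \$seq_text )  or die "file1: $!";
open( my $f2, '<', \$qual_text ) or die "file2: $!";
my ( $seq, $order ) = read_records($f1);
my ($qual) = read_records($f2);
close $f1;
close $f2;

# Print each sequence record followed by its quality record.
for my $header (@$order) {
    exists $qual->{$header} or die "no quality record for $header\n";
    print "$header\n";
    print "$_\n" for @{ $seq->{$header} };
    print "$header\n";
    print "$_\n" for @{ $qual->{$header} };
}
```

Because records are matched by header rather than by line number, it does not matter that a sequence wraps over several lines while its quality values sit on one. Both files are held in memory, so this may not suit very large inputs.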

    Just a something something...
Re: combine 2 fasta files into 1
by Generoso (Prior) on Jul 30, 2010 at 21:21 UTC

    It looks like the F1 file has HTML in it. A <br> looks like a new line but it is not;
    it is just for formatting when you look at it with a browser like Internet Explorer.

      Maybe this is what you are looking for.

      use warnings;
      use strict;

      open( F1, "subsettest.txt" ) or die "file1: $!";
      open( F2, "subsettest1.txt" ) or die "file2: $!";

      while ( my $s1 = <F1> ) {
          my $s2 = <F2>;
          if ( $s1 =~ /^>/ ) {
              print $s1;
          }
          else {
              print $s2,$s1;
          }
      }

      close F1;
      close F2;
      exit;
      >GK9PVB108JH8SN rank=0000021 x=3781.0 y=885.0 length=137
      40 38 38 30 30 30 38 40 40 30 30 30 36 40 40 40 40 40 40 40 40 40 40 40 39 34 34 14 14 14 14 13 19 14 18 25 32 32 34 27
      TGATTATGAGTTAGATGTTCGCTCTGAGGTTTCAACGATGCTTCAAGATTCCTAATTCGCGTTGCGACTCTCGAGTATGCGTTCTATTCACATTTCTGTTGTCGTACATATTTGACTCACGATCTTGATTTCTTATC
      >GK9PVB108JYDQ5 rank=0000032 x=3965.0 y=143.0 length=53
      25 21 21 23 23 37 31 31 31 37 37 39 38 40 40 40 40 40 40 40 40 39 39 39 39 35 23 25 25 37 34 35 36 37 37 37 37 37 39 39 40 40 40 40 39 38 35 33 33 33 32 31 19
      GTTTTCAACGCTGGTTCGAGATTTCCTAATTTCACATTGCGACTCTCGAGTGC
        Thanks for your suggestion but removing the chomp function of my script doesn't solve my problem. It is still giving output that looks like this:
        >GK9PVB108JRQBH rank=0000011 x=3889.0 y=1131.0 length=82
        TCCATGTGTACAACTCATATGGAGCATCGATAGTATTAACAGTCTTGGTTGTGCGAGTTC
        19 19 19 32 32 32 32 23 23 15 15 15 19 19 24 30 29 30 30 27 19 19 20 30 35 35 33 31 31 31 31 32 32 30 23 16 15 15 15 24 28 28 25 22 17 17 17 17 23 21 21 21 24 28 24 19 18 17 16 19
        TTTGTTGTTTCCTTTAACTAAC
        instead of:
        >GK9PVB108JRQBH rank=0000011 x=3889.0 y=1131.0 length=82
        TCCATGTGTACAACTCATATGGAGCATCGATAGTATTAACAGTCTTGGTTGTGCGAGTTCTTTGTTGTTTCCTTTAACTAAC
        19 19 19 32 32 32 32 23 23 15 15 15 19 19 24 30 29 30 30 27 19 19 20 30 35 35 33 31 31 31 31 32 32 30 23 16 15 15 15 24 28 28 25 22 17 17 17 17 23 21 21 21 24 28 24 19 18 17 16 19

        OK, from what I can figure, you have several lines per record in F1 and you want to hold back the corresponding lines from F2 until the sequence has been printed.

        use warnings;
        use strict;

        open( F1, "subsettest.txt" ) or die "file1: $!";
        open( F2, "subsettest1.txt" ) or die "file2: $!";

        my @s2s = ();    # F2 lines held back for the current record
        while ( my $s1 = <F1> ) {
            my $s2 = <F2>;
            if ( $s1 =~ /^>/ ) {
                # New header: flush the previous record's F2 lines first
                foreach (@s2s) { print $_; }
                print $s1;
                @s2s = ();
            }
            else {
                push( @s2s, $s2 ) if defined $s2;    # F2 may run out first
                print $s1;
            }
        }
        foreach (@s2s) { print $_; }    # flush the last record

        close F1;
        close F2;
        exit;