Extract multiple lists od Identifiers from a FASTA file

joluito has asked for the wisdom of the Perl Monks concerning the following question:

Hello everybody,

I'm trying to extract a set of diferent ID lists from a FASTA file in a batch.

I have a code that allows me to do it but it only extracts the first list of the loop (even though it reads the other lists and tell how many od each list's ID are found in the FASTA file).

There must be an issue with the OUTFILE opening/closing in the for loop, but I seem unable to find what the problem is.

Any help would be appreciated.

Here is the code (and I can provide dummy test files if needed):

#!usr/bin/perl -w
use strict;
use diagnostics;
use Getopt::Long;

#usage example: perl /path_to_script/listret.pl -p /path_to_lists_dir 
+-f /path_to_fasta/FastaFile.faa

my ($path,$fasta);
GetOptions(
    'path=s'          => \$path,
    'fasta=s'          => \$fasta,
    );
#print "$path\n";
chdir $path or die "ERROR: Unable to enter $path: $!\n";
opendir (TEMP , ".");
my @files = readdir (TEMP);
closedir TEMP;
my $name;
my $found=0;
my $totalist=0;

for my $file (@files) {
     
   if($file=~/(\w+)\.fasta/){
            
    $name = "$1";
open (INFILE2, "$path/$name.txt") || die ("cannot open input file");
    chomp(my @lista = <INFILE2>);
    $/ = "\>";
open (INFILE, "$path/$name.fasta") || die ("cannot open input file");
    chomp(my @data = <INFILE>);
open  OUT,'>'."$path/$name.out" or die "ERROR: Unable to open $file $!
+\n";
             
    foreach my $li (@lista){
        chomp $li;
        print"$li\n";
        $totalist++;
        for(@data){
                        
            if(/$li/){
                $found++;        
                print ">"."$_\n"; 
                print OUT ">"."$_\n";
                }
            }
        }
    print "For $name we found $found of a total $totalist\n";
    }
}
[download]

Example of FASTA format:

>VFG000033(gb|WP_002208793) 
MPPAARLSLLQRSSTMSDFLPFALPDIGEAEIQAVTESMRSGWLTTGPNAREFEREFAAY
IGADVEAVAVNSATAGLHLALEAIGVGPGDEVITTTHTFTASAEVARYLGAEPVLVDIDP
ATLCISPAAIERAITPRTRAIVPVHYGGLSCDMDSILEIARKHGLKVIEDAAHALPASWQ
GRRIGSLESDLTVYSFYATKTLATGEGGMVVTRDPALAKRCRVMRLHGIDRDAFDRFTSK
KPAWYYEIVAPGFKYNMTDTAAAMGRVQLQRVQQMRDRRAQIAAAYDQAFADLPLTLPPG
PGRTPGVERVAHRDDDEHSWHLYAIRIHPQAPLKCDDFIVRMTENGIGCSVHYVPLHLQP
YWRDRYGLTPDMYPHSQAAFEGMASLPIYSRMTDADVQRVIASVRQLLRP
>VFG000036(gb|NP_490509) 
MQFIDLKTQYQALRDTINPRIQAVLDHGQFIMGPEVKELEAALCAYTGAKHCITVASGTE
ALLISLMALGVKAGDEVITTSFTFVATAEVIALLGAKPVFVDVEPDTCNIKVSEIEAKIT
PRTKAIIPVSLYGQCGDMDEV
[download]

Example of ID list:

WP_002208793
WP_002208792
WP_002211763
WP_002211762
NP_490508
NP_490509
NP_459538
NP_459540
[download]

I expect to get a new FASTA file (for each of the ID lists) containing only the sequences matching to the ID on the specific list. The code expects to, basically, be a kind of Sequences retriever in batch. I mean, I expect the script to:

-open the original FASTA file

-open each of the ID lists

-read and chomp it

-match & retrieve the correspondent sequences from the original FASTA

-write them to individual files (named like the lists but in FASTA format)

The output should ne in FASTA format (the same format in the above example).

Leaving aside the format specifications, I think the issue here is why it only works in the first "for" loop iteration and not in the following ones. Can you spot any error I can't, regarding the outfile opening or how I try to write my results in it inside the loop?

Thank you very much in advance.

Comment on Extract multiple lists od Identifiers from a FASTA file Select or Download Code

Replies are listed 'Best First'.
Re: Extract multiple lists od Identifiers from a FASTA file by Fletch (Bishop) on Aug 04, 2022 at 12:05 UTC
I think Bio::SeqIO from bioperl is probably still the recommended solution rather than rolling your own parser. That aside you’re missing parens on open when you create your output file so it’s not parsing the way you intend and you may not be seeing errors from there if you do have problems. Using the three arg open and parens would be better/current best practice. `my $outfile = qq{$path/$name.faa}; open( my $out_fh, q{>}, $outfile ) or die qq{problem writing to ‘$outf +ile’: $!};` [download] The cake is a lie. The cake is a lie. The cake is a lie.	[reply] [d/l]
Re^2: Extract multiple lists of Identifiers from a FASTA file by hippo (Bishop) on Aug 04, 2022 at 12:26 UTC
you’re missing parens on open when you create your output file so it’s not parsing the way you intend This isn't wrong but equally the parens are not needed if `or` is used for control flow instead of `\|\|`. All three of these work (my preference in the comments) `open( my $out_fh, q{>}, $outfile ) \|\| die qq{problem writing to '$outf +ile': $!}; # OK open( my $out_fh, q{>}, $outfile ) or die qq{problem writing to '$outf +ile': $!}; # Better open my $out_fh, q{>}, $outfile or die qq{problem writing to '$outfile +': $!}; # Best` [download] The last of these is an idiom I use all the time (albeit with actual quotation marks rather than q/qq). I find that absence of unnecessary punctuation makes for clearer reading but recognise that not everyone sees it the same way. There is also autodie if you don't want or need to write your own error messages. All the rest of Fletch's advice is spot on. I'd also point you (joluito) to the Basic debugging checklist for help in finding out where things are going wrong. 🦛	[reply] [d/l] [select]
Re^3: Extract multiple lists of Identifiers from a FASTA file by BillKSmith (Monsignor) on Aug 04, 2022 at 12:58 UTC
I agree with hippo's choice, but... To paraphrase the introduction to the book "Perl Best Practices": The style is not very important. What is important is that you chose one that works for you and always use it. Bill	[reply]
Re: Extract multiple lists od Identifiers from a FASTA file by kcott (Archbishop) on Aug 04, 2022 at 23:24 UTC
G'day joluito, Welcome to the Monastery. Yes, some dummy input and output files would have probably been useful. You should keep these small but still representative of your real data. I see a variety of comments on your code. I won't revisit these but I will make two points: When you change a special variable, such as `$/`, you should localise it and restrict the change to the smallest possible scope. See perlvar and local. Avoid reading all data into an array; then reading it all again from that array. There are occasions when this is useful, but generally it isn't: this is particularly true with biological data which, as I'm sure you're aware, can be huge. Reading and processing data one piece at a time with, for example, `while` and `for` loops, will reduce CPU cycles, memory requirements, and the overall time your program takes to run. The following code, `fasta_munge.pl`, shows how I might have handled your requirements. #!/usr/bin/env perl use strict; use warnings; use autodie; my $dir = '.'; opendir(my $dh, $dir); for (readdir $dh) { next unless /^(.+?)\.fasta/; my ($id_path, $fasta_path) = ("$dir/$1.txt", "$dir/$_"); next unless -e $id_path; my %ids = map +($_ => 1), split ' ', do { local $/; open my $fh, '<', $id_path; <$fh>; }; { open my $fh, '<', $fasta_path; local $/ = '>'; while (<$fh>) { chomp; if (/\A[^\|]+\\|([^)]+)/) { print { _get_out_fh($dir, $1) } "$/$_" if $ids{$1}; } } } } _close_all_out_fhs(); closedir $dh; { my %out_fh_for; sub _get_out_fh { my ($dir, $id) = @_; unless (exists $out_fh_for{$id}) { open $out_fh_for{$id}, '>', "$dir/$id.fasta"; } return $out_fh_for{$id}; } sub _close_all_out_fhs { close $_ for values %out_fh_for; return; } } [download] I put together some dummy data for testing. Here's a full run with all the details. ken@titan ~/tmp/pm_11145943 $ ls -l total 8 -rwxr-xr-x 1 ken None 976 Aug 5 08:26 fasta_munge.pl -rw-r--r-- 1 ken None 257 Aug 5 08:28 one.fasta -rw-r--r-- 1 ken None 23 Aug 5 05:32 one.txt -rw-r--r-- 1 ken None 402 Aug 5 08:28 two.fasta -rw-r--r-- 1 ken None 23 Aug 5 05:35 two.txt ken@titan ~/tmp/pm_11145943 $ cat one.fasta >one:VFG000033(gb\|WP_002208793) PGRTPGVERVAHRDDDEHSWHLYAIRIHPQAPLKCDDFIVRMTENGIGCSVHYVPLHLQP YWRDRYGLTPDMYPHSQAAFEGMASLPIYSRMTDADVQRVIASVRQLLRP >one:VFG000036(gb\|NP_490509) ALLISLMALGVKAGDEVITTSFTFVATAEVIALLGAKPVFVDVEPDTCNIKVSEIEAKIT PRTKAIIPVSLYGQCGDMDEV ken@titan ~/tmp/pm_11145943 $ cat one.txt WP_002208793 NP_490509 ken@titan ~/tmp/pm_11145943 $ cat two.fasta >two:VFG000033(gb\|WP_002208793) PGRTPGVERVAHRDDDEHSWHLYAIRIHPQAPLKCDDFIVRMTENGIGCSVHYVPLHLQP YWRDRYGLTPDMYPHSQAAFEGMASLPIYSRMTDADVQRVIASVRQLLRP >two:VFG000032(gb\|WP_002208792) PGRTPGVERVAHRDDDEHSWHLYAIRIHPQAPLKCDDFIVRMTENGIGCSVHYVPLHLQP YWRDRYGLTPDMYPHSQAAFEGMASLPIYSRMTDADVQRVIASVRQLLRP >two:VFG000036(gb\|NP_490509) ALLISLMALGVKAGDEVITTSFTFVATAEVIALLGAKPVFVDVEPDTCNIKVSEIEAKIT PRTKAIIPVSLYGQCGDMDEV ken@titan ~/tmp/pm_11145943 $ cat two.txt WP_002208792 NP_490509 ken@titan ~/tmp/pm_11145943 $ ./fasta_munge.pl ken@titan ~/tmp/pm_11145943 $ ls -l total 11 -rwxr-xr-x 1 ken None 976 Aug 5 08:26 fasta_munge.pl -rw-r--r-- 1 ken None 224 Aug 5 08:31 NP_490509.fasta -rw-r--r-- 1 ken None 257 Aug 5 08:28 one.fasta -rw-r--r-- 1 ken None 23 Aug 5 05:32 one.txt -rw-r--r-- 1 ken None 402 Aug 5 08:28 two.fasta -rw-r--r-- 1 ken None 23 Aug 5 05:35 two.txt -rw-r--r-- 1 ken None 145 Aug 5 08:31 WP_002208792.fasta -rw-r--r-- 1 ken None 145 Aug 5 08:31 WP_002208793.fasta ken@titan ~/tmp/pm_11145943 $ cat NP_490509.fasta >one:VFG000036(gb\|NP_490509) ALLISLMALGVKAGDEVITTSFTFVATAEVIALLGAKPVFVDVEPDTCNIKVSEIEAKIT PRTKAIIPVSLYGQCGDMDEV >two:VFG000036(gb\|NP_490509) ALLISLMALGVKAGDEVITTSFTFVATAEVIALLGAKPVFVDVEPDTCNIKVSEIEAKIT PRTKAIIPVSLYGQCGDMDEV ken@titan ~/tmp/pm_11145943 $ cat WP_002208792.fasta >two:VFG000032(gb\|WP_002208792) PGRTPGVERVAHRDDDEHSWHLYAIRIHPQAPLKCDDFIVRMTENGIGCSVHYVPLHLQP YWRDRYGLTPDMYPHSQAAFEGMASLPIYSRMTDADVQRVIASVRQLLRP ken@titan ~/tmp/pm_11145943 $ cat WP_002208793.fasta >one:VFG000033(gb\|WP_002208793) PGRTPGVERVAHRDDDEHSWHLYAIRIHPQAPLKCDDFIVRMTENGIGCSVHYVPLHLQP YWRDRYGLTPDMYPHSQAAFEGMASLPIYSRMTDADVQRVIASVRQLLRP ken@titan ~/tmp/pm_11145943 $ [download] — Ken	[reply] [d/l] [select]
Re^2: Extract multiple lists od Identifiers from a FASTA file by joluito (Novice) on Aug 05, 2022 at 07:32 UTC
Hi Ken, Thank you very much for your help! I totally agree with you! But in this case, since this particular data is not too big to handle easily, I didn't care very much about computational requirements... my fault.	[reply]
Re: Extract multiple lists od Identifiers from a FASTA file by joluito (Novice) on Aug 05, 2022 at 07:25 UTC
The rude Anonymous person nailed it about changing the field separator. Here's is the FUNCTIONAL code (may indentation not be that good, and the code not obfuscated enough as to win a contest...but we need it to be mantained by different people, some of who are new to programming). #!usr/bin/perl -w use strict; use Getopt::Long; #usage example: perl /path_to_script/bay_seqret.pl -p /path_to_lists - +f /path_to_fasta/FASTA_file.faa my ($path, $fasta); GetOptions( 'path=s' => \$path, 'fasta=s' => \$fasta, ); chdir $path or die "ERROR: Unable to enter $path: $!\n"; opendir (TEMP , "."); my @files = readdir (TEMP); closedir TEMP; my $name; my $found=0; my $totalist=0; $/ = "\>";###CHANGED HERE TO SPLIT FASTA by ">" open (INFILE, "$path/$fasta") \|\| die ("cannot open input file"); chomp(my @data = <INFILE>); print"@data\n"; $/ = "\n";####"REGENERATED" HERE NOT TO INTERFERE WITH THE REST OF THE + DATA PROCESSING for my $file (@files) { if($file=~/(\w+)\.txt/){ $name = "$1"; print"Lista: $path/$name.txt\n"; open (INFILE2, "$path/$name.txt") \|\| die ("cannot open input f +ile"); chomp(my @lista = <INFILE2>); open OUT,'>'."$path/$name.out" or die "ERROR: Unable to open +$file $!\n"; foreach my $li (@lista){ chomp $li; print"$li\n"; $totalist++; for(@data){ if(/$li/){ $found++; print ">"."$_\n"; print OUT ">"."$_\n"; } } } print "For $name we found $found of a total $totalist\n"; } } [download]	[reply] [d/l]
Re: Extract multiple lists od Identifiers from a FASTA file by mr_mischief (Monsignor) on Aug 04, 2022 at 20:01 UTC
Never mind any of the following. I didn't notice after reading your description that you're setting new output names for new files over and over. Are you sure you're getting the first sequence and not the last? `open OUT,'>'."$path/$name.out" or die "ERROR: Unable to open $file $! +\n";` [download] open says about the MODE: If MODE is ">", the file is opened for output, with existing files first being truncated ("clobbered") and nonexisting files newly created. If MODE is ">>", the file is opened for appending, again being created if necessary. I would guess after looking a little closer at that line and that documentation you'll realize what's happening in your loop. There are other issues people have mentioned and even hammered upon, but truncating a file over and over is not going to get you cumulative output.	[reply] [d/l]
Re^2: Extract multiple lists od Identifiers from a FASTA file by joluito (Novice) on Aug 05, 2022 at 07:28 UTC
Thank you for your insight mr.mischief!	[reply]
A reply falls below the community's threshold of quality. You may see it by logging in.


go ahead... be a heretic
	PerlMonks