Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Extract multiple lists od Identifiers from a FASTA file

by joluito (Novice)
on Aug 04, 2022 at 11:20 UTC ( [id://11145943]=perlquestion: print w/replies, xml ) Need Help??

joluito has asked for the wisdom of the Perl Monks concerning the following question:

Hello everybody,

I'm trying to extract a set of diferent ID lists from a FASTA file in a batch.

I have a code that allows me to do it but it only extracts the first list of the loop (even though it reads the other lists and tell how many od each list's ID are found in the FASTA file).

There must be an issue with the OUTFILE opening/closing in the for loop, but I seem unable to find what the problem is.

Any help would be appreciated.

Here is the code (and I can provide dummy test files if needed):

#!usr/bin/perl -w use strict; use diagnostics; use Getopt::Long; #usage example: perl /path_to_script/listret.pl -p /path_to_lists_dir +-f /path_to_fasta/FastaFile.faa my ($path,$fasta); GetOptions( 'path=s' => \$path, 'fasta=s' => \$fasta, ); #print "$path\n"; chdir $path or die "ERROR: Unable to enter $path: $!\n"; opendir (TEMP , "."); my @files = readdir (TEMP); closedir TEMP; my $name; my $found=0; my $totalist=0; for my $file (@files) { if($file=~/(\w+)\.fasta/){ $name = "$1"; open (INFILE2, "$path/$name.txt") || die ("cannot open input file"); chomp(my @lista = <INFILE2>); $/ = "\>"; open (INFILE, "$path/$name.fasta") || die ("cannot open input file"); chomp(my @data = <INFILE>); open OUT,'>'."$path/$name.out" or die "ERROR: Unable to open $file $! +\n"; foreach my $li (@lista){ chomp $li; print"$li\n"; $totalist++; for(@data){ if(/$li/){ $found++; print ">"."$_\n"; print OUT ">"."$_\n"; } } } print "For $name we found $found of a total $totalist\n"; } }

Example of FASTA format:

>VFG000033(gb|WP_002208793) MPPAARLSLLQRSSTMSDFLPFALPDIGEAEIQAVTESMRSGWLTTGPNAREFEREFAAY IGADVEAVAVNSATAGLHLALEAIGVGPGDEVITTTHTFTASAEVARYLGAEPVLVDIDP ATLCISPAAIERAITPRTRAIVPVHYGGLSCDMDSILEIARKHGLKVIEDAAHALPASWQ GRRIGSLESDLTVYSFYATKTLATGEGGMVVTRDPALAKRCRVMRLHGIDRDAFDRFTSK KPAWYYEIVAPGFKYNMTDTAAAMGRVQLQRVQQMRDRRAQIAAAYDQAFADLPLTLPPG PGRTPGVERVAHRDDDEHSWHLYAIRIHPQAPLKCDDFIVRMTENGIGCSVHYVPLHLQP YWRDRYGLTPDMYPHSQAAFEGMASLPIYSRMTDADVQRVIASVRQLLRP >VFG000036(gb|NP_490509) MQFIDLKTQYQALRDTINPRIQAVLDHGQFIMGPEVKELEAALCAYTGAKHCITVASGTE ALLISLMALGVKAGDEVITTSFTFVATAEVIALLGAKPVFVDVEPDTCNIKVSEIEAKIT PRTKAIIPVSLYGQCGDMDEV

Example of ID list:

WP_002208793 WP_002208792 WP_002211763 WP_002211762 NP_490508 NP_490509 NP_459538 NP_459540

I expect to get a new FASTA file (for each of the ID lists) containing only the sequences matching to the ID on the specific list. The code expects to, basically, be a kind of Sequences retriever in batch. I mean, I expect the script to:

-open the original FASTA file

-open each of the ID lists

-read and chomp it

-match & retrieve the correspondent sequences from the original FASTA

-write them to individual files (named like the lists but in FASTA format)

The output should ne in FASTA format (the same format in the above example).

Leaving aside the format specifications, I think the issue here is why it only works in the first "for" loop iteration and not in the following ones. Can you spot any error I can't, regarding the outfile opening or how I try to write my results in it inside the loop?

Thank you very much in advance.

Replies are listed 'Best First'.
Re: Extract multiple lists od Identifiers from a FASTA file
by Fletch (Bishop) on Aug 04, 2022 at 12:05 UTC

    I think Bio::SeqIO from bioperl is probably still the recommended solution rather than rolling your own parser. That aside you’re missing parens on open when you create your output file so it’s not parsing the way you intend and you may not be seeing errors from there if you do have problems. Using the three arg open and parens would be better/current best practice.

    my $outfile = qq{$path/$name.faa}; open( my $out_fh, q{>}, $outfile ) or die qq{problem writing to ‘$outf +ile’: $!};

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      you’re missing parens on open when you create your output file so it’s not parsing the way you intend

      This isn't wrong but equally the parens are not needed if or is used for control flow instead of ||. All three of these work (my preference in the comments)

      open( my $out_fh, q{>}, $outfile ) || die qq{problem writing to '$outf +ile': $!}; # OK open( my $out_fh, q{>}, $outfile ) or die qq{problem writing to '$outf +ile': $!}; # Better open my $out_fh, q{>}, $outfile or die qq{problem writing to '$outfile +': $!}; # Best

      The last of these is an idiom I use all the time (albeit with actual quotation marks rather than q/qq). I find that absence of unnecessary punctuation makes for clearer reading but recognise that not everyone sees it the same way. There is also autodie if you don't want or need to write your own error messages.

      All the rest of Fletch's advice is spot on. I'd also point you (joluito) to the Basic debugging checklist for help in finding out where things are going wrong.


      🦛

        I agree with hippo's choice, but... To paraphrase the introduction to the book "Perl Best Practices": The style is not very important. What is important is that you chose one that works for you and always use it.
        Bill
Re: Extract multiple lists od Identifiers from a FASTA file
by kcott (Archbishop) on Aug 04, 2022 at 23:24 UTC

    G'day joluito,

    Welcome to the Monastery.

    Yes, some dummy input and output files would have probably been useful. You should keep these small but still representative of your real data.

    I see a variety of comments on your code. I won't revisit these but I will make two points:

    • When you change a special variable, such as $/, you should localise it and restrict the change to the smallest possible scope. See perlvar and local.
    • Avoid reading all data into an array; then reading it all again from that array. There are occasions when this is useful, but generally it isn't: this is particularly true with biological data which, as I'm sure you're aware, can be huge. Reading and processing data one piece at a time with, for example, while and for loops, will reduce CPU cycles, memory requirements, and the overall time your program takes to run.

    The following code, fasta_munge.pl, shows how I might have handled your requirements.

    #!/usr/bin/env perl use strict; use warnings; use autodie; my $dir = '.'; opendir(my $dh, $dir); for (readdir $dh) { next unless /^(.+?)\.fasta/; my ($id_path, $fasta_path) = ("$dir/$1.txt", "$dir/$_"); next unless -e $id_path; my %ids = map +($_ => 1), split ' ', do { local $/; open my $fh, '<', $id_path; <$fh>; }; { open my $fh, '<', $fasta_path; local $/ = '>'; while (<$fh>) { chomp; if (/\A[^|]+\|([^)]+)/) { print { _get_out_fh($dir, $1) } "$/$_" if $ids{$1}; } } } } _close_all_out_fhs(); closedir $dh; { my %out_fh_for; sub _get_out_fh { my ($dir, $id) = @_; unless (exists $out_fh_for{$id}) { open $out_fh_for{$id}, '>', "$dir/$id.fasta"; } return $out_fh_for{$id}; } sub _close_all_out_fhs { close $_ for values %out_fh_for; return; } }

    I put together some dummy data for testing. Here's a full run with all the details.

    ken@titan ~/tmp/pm_11145943 $ ls -l total 8 -rwxr-xr-x 1 ken None 976 Aug 5 08:26 fasta_munge.pl -rw-r--r-- 1 ken None 257 Aug 5 08:28 one.fasta -rw-r--r-- 1 ken None 23 Aug 5 05:32 one.txt -rw-r--r-- 1 ken None 402 Aug 5 08:28 two.fasta -rw-r--r-- 1 ken None 23 Aug 5 05:35 two.txt ken@titan ~/tmp/pm_11145943 $ cat one.fasta >one:VFG000033(gb|WP_002208793) PGRTPGVERVAHRDDDEHSWHLYAIRIHPQAPLKCDDFIVRMTENGIGCSVHYVPLHLQP YWRDRYGLTPDMYPHSQAAFEGMASLPIYSRMTDADVQRVIASVRQLLRP >one:VFG000036(gb|NP_490509) ALLISLMALGVKAGDEVITTSFTFVATAEVIALLGAKPVFVDVEPDTCNIKVSEIEAKIT PRTKAIIPVSLYGQCGDMDEV ken@titan ~/tmp/pm_11145943 $ cat one.txt WP_002208793 NP_490509 ken@titan ~/tmp/pm_11145943 $ cat two.fasta >two:VFG000033(gb|WP_002208793) PGRTPGVERVAHRDDDEHSWHLYAIRIHPQAPLKCDDFIVRMTENGIGCSVHYVPLHLQP YWRDRYGLTPDMYPHSQAAFEGMASLPIYSRMTDADVQRVIASVRQLLRP >two:VFG000032(gb|WP_002208792) PGRTPGVERVAHRDDDEHSWHLYAIRIHPQAPLKCDDFIVRMTENGIGCSVHYVPLHLQP YWRDRYGLTPDMYPHSQAAFEGMASLPIYSRMTDADVQRVIASVRQLLRP >two:VFG000036(gb|NP_490509) ALLISLMALGVKAGDEVITTSFTFVATAEVIALLGAKPVFVDVEPDTCNIKVSEIEAKIT PRTKAIIPVSLYGQCGDMDEV ken@titan ~/tmp/pm_11145943 $ cat two.txt WP_002208792 NP_490509 ken@titan ~/tmp/pm_11145943 $ ./fasta_munge.pl ken@titan ~/tmp/pm_11145943 $ ls -l total 11 -rwxr-xr-x 1 ken None 976 Aug 5 08:26 fasta_munge.pl -rw-r--r-- 1 ken None 224 Aug 5 08:31 NP_490509.fasta -rw-r--r-- 1 ken None 257 Aug 5 08:28 one.fasta -rw-r--r-- 1 ken None 23 Aug 5 05:32 one.txt -rw-r--r-- 1 ken None 402 Aug 5 08:28 two.fasta -rw-r--r-- 1 ken None 23 Aug 5 05:35 two.txt -rw-r--r-- 1 ken None 145 Aug 5 08:31 WP_002208792.fasta -rw-r--r-- 1 ken None 145 Aug 5 08:31 WP_002208793.fasta ken@titan ~/tmp/pm_11145943 $ cat NP_490509.fasta >one:VFG000036(gb|NP_490509) ALLISLMALGVKAGDEVITTSFTFVATAEVIALLGAKPVFVDVEPDTCNIKVSEIEAKIT PRTKAIIPVSLYGQCGDMDEV >two:VFG000036(gb|NP_490509) ALLISLMALGVKAGDEVITTSFTFVATAEVIALLGAKPVFVDVEPDTCNIKVSEIEAKIT PRTKAIIPVSLYGQCGDMDEV ken@titan ~/tmp/pm_11145943 $ cat WP_002208792.fasta >two:VFG000032(gb|WP_002208792) PGRTPGVERVAHRDDDEHSWHLYAIRIHPQAPLKCDDFIVRMTENGIGCSVHYVPLHLQP YWRDRYGLTPDMYPHSQAAFEGMASLPIYSRMTDADVQRVIASVRQLLRP ken@titan ~/tmp/pm_11145943 $ cat WP_002208793.fasta >one:VFG000033(gb|WP_002208793) PGRTPGVERVAHRDDDEHSWHLYAIRIHPQAPLKCDDFIVRMTENGIGCSVHYVPLHLQP YWRDRYGLTPDMYPHSQAAFEGMASLPIYSRMTDADVQRVIASVRQLLRP ken@titan ~/tmp/pm_11145943 $

    — Ken

      Hi Ken,

      Thank you very much for your help!

      I totally agree with you! But in this case, since this particular data is not too big to handle easily, I didn't care very much about computational requirements... my fault.

Re: Extract multiple lists od Identifiers from a FASTA file
by joluito (Novice) on Aug 05, 2022 at 07:25 UTC

    The rude Anonymous person nailed it about changing the field separator.

    Here's is the FUNCTIONAL code (may indentation not be that good, and the code not obfuscated enough as to win a contest...but we need it to be mantained by different people, some of who are new to programming).

    #!usr/bin/perl -w use strict; use Getopt::Long; #usage example: perl /path_to_script/bay_seqret.pl -p /path_to_lists - +f /path_to_fasta/FASTA_file.faa my ($path, $fasta); GetOptions( 'path=s' => \$path, 'fasta=s' => \$fasta, ); chdir $path or die "ERROR: Unable to enter $path: $!\n"; opendir (TEMP , "."); my @files = readdir (TEMP); closedir TEMP; my $name; my $found=0; my $totalist=0; $/ = "\>";###CHANGED HERE TO SPLIT FASTA by ">" open (INFILE, "$path/$fasta") || die ("cannot open input file"); chomp(my @data = <INFILE>); print"@data\n"; $/ = "\n";####"REGENERATED" HERE NOT TO INTERFERE WITH THE REST OF THE + DATA PROCESSING for my $file (@files) { if($file=~/(\w+)\.txt/){ $name = "$1"; print"Lista: $path/$name.txt\n"; open (INFILE2, "$path/$name.txt") || die ("cannot open input f +ile"); chomp(my @lista = <INFILE2>); open OUT,'>'."$path/$name.out" or die "ERROR: Unable to open +$file $!\n"; foreach my $li (@lista){ chomp $li; print"$li\n"; $totalist++; for(@data){ if(/$li/){ $found++; print ">"."$_\n"; print OUT ">"."$_\n"; } } } print "For $name we found $found of a total $totalist\n"; } }
Re: Extract multiple lists od Identifiers from a FASTA file
by mr_mischief (Monsignor) on Aug 04, 2022 at 20:01 UTC

    Never mind any of the following. I didn't notice after reading your description that you're setting new output names for new files over and over.

    Are you sure you're getting the first sequence and not the last?

    open OUT,'>'."$path/$name.out" or die "ERROR: Unable to open $file $! +\n";

    open says about the MODE:

    If MODE is ">", the file is opened for output, with existing files first being truncated ("clobbered") and nonexisting files newly created. If MODE is ">>", the file is opened for appending, again being created if necessary.

    I would guess after looking a little closer at that line and that documentation you'll realize what's happening in your loop. There are other issues people have mentioned and even hammered upon, but truncating a file over and over is not going to get you cumulative output.

      Thank you for your insight mr.mischief!

A reply falls below the community's threshold of quality. You may see it by logging in.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11145943]
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (7)
As of 2024-04-24 07:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found