Re: Could not open

I remember trying to help you with an earlier post about this project of yours, but it looks like (a) my suggestions weren't clear enough to you, and/or (b) even if they were, you didn't pay attention to them. I'll try again.

First, a couple pointers that can help make your code easier to read (and write):

Stop reading parameters from the user via STDIN while the script is running, and start using @ARGV to get those params from the command line. Better for the programmer, and better for the user. Trust me on that.
Now that you're using "strict", start declaring your variables at the points (in the blocks) where they are actually used, instead of declaring them all at the start and giving them all global scope. And you can initialize a variable to some value at the same time you declare it with "my".

(I'll demonstrate those ideas below.)

Now the big issue seems to be that you don't really have an appropriate algorithm. Big red flags include:

you go into a complicated arrangement of nested "while" loops that all read from the same file
you try to rewind the file to overcome the problem of reading too far within a nested while loop (and it looks like your guessing about how far you should rewind)
you open two output files within the main while loop, and after writing strange patterns of text to those and leaving the while loop, you then open them for reading and load their full contents into arrays -- when you should just be storing these contents in arrays in the first place, and not writing this stuff to files at all.
you repeat lines of code that don't need to be repeated: "if this sequence is above threshold, write to ORG and write to INPUT; else write to ORG" -- bleah.
the first time you run fasta34, it seems like "$seqfile" (a value supplied by the user) refers to something that already exists and fasta34 uses it as input; later, when you run fasta34 inside the final while loop, you're passing a file named "querry.aa", which is still being written to in the same while loop that runs fasta, which is using that file as input -- this makes no sense.

So I'd like see if I can figure out what sort of process you're really trying to achieve here, and see if I can propose an algorithm that is appropriate. It would be a lot easier if you could supply samples for just a couple records from the fasta34 output -- they don't even need to be complete records: if each sequence is 50 lines long, it would be good enough to trim away the middle 45 lines.

Still, here's a pseudo-code description of what I think you should be trying to do. Based on my minimal knowledge about fasta files, I'll assume that a double angle bracket ">>" marks the start of a new sequence record, so let's use perl's "input record separator" variable to read a whole multiline record at once:

- get the user-specified sequence file, library file and cut-off perce
+ntage
- run fasta34.exe with the sequence file and library file
- open the file that was output by fasta34
- set INPUT_RECORD_SEPARATOR ($/) to ">>"
- reading each record of the fasta34 output file:
 -- the first read will contain just ">>", so skip it
 -- match /(\d+\.\d+)% identity/ and save the captured string as $per
 -- push the record onto an "org" array
 -- if $per > cut-off value, push the record onto an "input" array

# You now have all the original fasta output in @org, and
# all the "above-threshold" records in @input.  At this point,
# you want to run fasta again, but I can't figure out what
# input you want to give it, how many times you really need
# to run it, or how you should use the subsequent output.
[download]

Can you possibly explain this last stage in clear, basic steps? To get you started, here's that first set of pseudo-code, converted to untested perl-code:

#!/usr/bin/perl

use strict;
use warnings;

chdir "c:/perl/sam" or die "can't chdir to c:/perl/sam: $!";

# get the user-specified sequence file, library file
# and cut-off percentage

my ( $seqfile, $libfile, $cutoff ) = @ARGV;
die "Usage: $0 seq.file lib.file cutoff\n"
    unless ( @ARGV == 3 and -f $seqfile and -f $libfile
             and $cutoff =~ /^\d+\.?\d*$/ );

# run fasta34.exe with the sequence file and library file

my $main = "main.fasta";
my $result = system("fasta34.exe -O $main -Q $seqfile $library");
if ($result >> 8) {
    warn "fasta34.exe ended with non-zero exit status \n";
}

# open the file that was output by fasta34

open(FASTA,"$main") or die "cant open fasta34 output file: $!";

$/ = '>>';  # set INPUT_RECORD_SEPARATOR to ">>"

while (<FASTA>)
{
    chomp;  # this removes ">>" from the end of the string

    next if ( /^\s*$/ );  # skip the first read (update: decided not t
+o test length())

    ($per) = /(\d+\.\d+)% identity/;

    push @org, ">>$_";
    push @input, ">>$_" if ($per > $cutoff);
}
close FASTA;

# if you want to save @input and/or @org to a file and run fasta34 aga
+in:

open OUT, ">nextinput.fasta" or die $!;
print OUT join '', @input;
close OUT;

# and likewise for @org
[download]

Try that out as a start and see where it might take you as you figure out what the next step should really be.

update: I just looked back at your ealier post on this project, so let's see if I understand better:

By running fasta34 a second time on just the "above-threshold" records, you're hoping that it will output new records that were not produced by the original run, and you want to assemble all the distinct records into one set.

If that's the case, then the "@org" array should really be a hash; you can use the whole record string as the hash key, or (if each record is really large) you can "digest" it using Digest::MD5, then use the resulting fixed-width MD5 "signature" as the hash key and the record itself as the hash value.

Then after running fasta34 a second (third, ...) time, read the new output file one record at a time and assign each one to %org in the same way as you did on the first run; if a given record has been seen before, you're just re-assigning the same hash element, and when it's new, you're adding a new hash element. When you're done with the last run, %org will contain all the distinct records produced by all runs, and you just need to print it to a file.

(I hope that's clear.)

Comment on Re: Could not open Select or Download Code

Replies are listed 'Best First'.
Re^2: Could not open by FarTech (Novice) on Feb 04, 2005 at 22:35 UTC
Thanks graff for your valuable input My main problem is running fasta34 for the second and the consecutive time i extract the input of sequence each time thru the following code.The query.aa store just one sequence at a time. Read more... (2 kB)	[reply] [d/l] [select]
Re^3: Could not open by graff (Chancellor) on Feb 05, 2005 at 18:33 UTC
My main problem is running fasta34 for the second and the consecutive time This is probably because your perl script is creating the data file that is being used as input in the second and consecutive runs of fasta34, and that data file is being formatted incorrectly. In order to debug this, try modifying your script so that it does the first fast34 run, processes the output of that run, writes the input file for the next run, and then exits. Run that version of the script, then compare the original fasta input file to your script's output file (the one that would be used as input for the next fast34 run). If there are differences in how the data are formatted, you need to figure out why your script is producing a different format, and fix that. Try manually running fasta34 at the command line, using your script's output as the fasta input, and see what happens. Maybe it will give you a useful error message. If your initial fasta output file is okay, and you still have trouble doing the second and subsequent fasta runs, then the problem is probably the flow-control in the later part of your script -- the looping and/or conditional statements that guide the later fasta34 runs might not be doing what you think/want. For that, use the perl debugger (start the script using "perl -d script_name arg1 arg2 arg3") and step through that part of the logic. If you haven't used the perl debugger, it's easy to learn and quite valuable to use; see "perldoc perldebug". you also mentioned about hash, can u brief me out how we can store such file in hash and then compare similar files >GTM1_RAT GLUTATHIONE S-TRANSFERASE YB1 (EC 2.5.1.18) ( (217 aa) WFAGDKVTYVDFLAYDILDQYHIFEPKCLDAFPNLKDFLARFEGLKKISAYMKSSRYLST PIFSKLAQWSNK >GTMU_CRILO GLUTATHIONE S-TRANSFERASE Y1 (EC 2.5.1.18) (217 aa) FAGDKVTLCGFLAYDVLDQYQMFEPKCLDPFPNLKDFLARFEGLKKISAYMKTSRFLRRP IFSKMAQWSNK >GTM1_HUMAN GLUTATHIONE S-TRANSFERASE MU 1 (EC 2.5.1.18 (217 aa) LPEKLKLYSEFLGKRPWFAGNKITFVDFLVYDVLDLHRIFEPKCLDAFPNLKDFISRFEG LEKISAYMKSSRFLPRPVFSKMAVWGNK >GLNA_ANASP GLUTAMINE SYNTHETASE (EC 6.3.1.2) (GLUTAMAT (473 aa) SLELALEALENDHAFLTDTGVFTEDFIQNWIDYKLANEVKQMQLRPH-PYEFSIYYDV >GTM4_HUMAN GLUTATHIONE S-TRANSFERASE MU 4 (EC 2.5.1.18 (218 aa) LPTMMQHFSQFLGKRPWFVGDKITFVDFLAYDVLDLHRIFEPNCLDAFPNLKDFISRFEG LEKISAYMKSSRFLPKPLYTRVAVWGNK [download] If this represents a valid data file for fasta34, then I would guess that the file format could be described as follows: each record begins with ">", ends with a blank line, and contains one or more lines of data characters in between each line of data characters begins with a space followed upper case letters and hyphens (and possibly a final space) the initial line of each record contains identifying information; this information is unique to each distinct set of data characters, so if two records contain identical data characters then their initial lines should also be identical (and vice-versa). But I'm just guessing about that. If you have documentation about the file format, use that instead of relying on my guesses (or if you have no documentation, test the data to see if these conditions are true). Here's an example of a script that would test my assumptions about the data format; it shows how to use a hash to keep track of records as they are read in; if your store this script in a file called "test-format.pl", and have a set of fasta data files named ".fasta", you can run this script with the command line "perl test-format.pl .fasta": #!/usr/bin/perl # Check fasta file format, report cases that don't meet expectations use strict; use warnings; my %records; $/ = ''; # special Perl rule: setting $/ to empty string # means: treat each sequence of one or more blank lines # as a record separator while (<>) { my @lines = split /[\r\n]+/; # split on line-breaks my $key = shift @lines; # first line should be the identifier if ( $key !~ /^>/ ) { warn "record $. does not start with '>':\n$_\n"; } my $data = ''; for my $line ( @lines ) { warn "record $. contains odd data:\n$_\n" unless ( $line =~ /^ + [-A-Z]+\s$/ ); $data .= $line."\n"; # (remember to put linefeeds back) } if ( exists( $records{$key} )) # have we seen this identifyier b +efore? { warn "key $key found on different data:\n$data\n\n$records{$ke +y}\n\n" if ( $records{$key} ne $data ); } else # we haven't seen this value of $key before { $records{$key} = $data; } } printf STDERR "%d records checked\n", scalar keys %records; # just for fun, let's try to output some data: for my $chosen ( grep /^>GTM1_/, keys %records ) { print join "\n", $chosen, $records{$chosen}, ''; } [download] It prints warning messages to STDERR if my assumptions turn out to be wrong for a given set of input. Try it out in order to learn more about your data. You can also use redirection on the command line to save the sample output data: `perl test-format.pl .fasta > test.out` [download] Then see if test.out works as input to fasta34; if it doesn't, then I expect the sample file data you provided won't work either, and you'll need to figure out why.	[reply] [d/l] [select]