I remember trying to help you with an earlier post about this project of yours, but it looks like (a) my suggestions weren't clear enough to you, and/or (b) even if they were, you didn't pay attention to them. I'll try again.

First, a couple pointers that can help make your code easier to read (and write):

(I'll demonstrate those ideas below.)

Now the big issue seems to be that you don't really have an appropriate algorithm. Big red flags include:

So I'd like see if I can figure out what sort of process you're really trying to achieve here, and see if I can propose an algorithm that is appropriate. It would be a lot easier if you could supply samples for just a couple records from the fasta34 output -- they don't even need to be complete records: if each sequence is 50 lines long, it would be good enough to trim away the middle 45 lines.

Still, here's a pseudo-code description of what I think you should be trying to do. Based on my minimal knowledge about fasta files, I'll assume that a double angle bracket ">>" marks the start of a new sequence record, so let's use perl's "input record separator" variable to read a whole multiline record at once:

- get the user-specified sequence file, library file and cut-off perce +ntage - run fasta34.exe with the sequence file and library file - open the file that was output by fasta34 - set INPUT_RECORD_SEPARATOR ($/) to ">>" - reading each record of the fasta34 output file: -- the first read will contain just ">>", so skip it -- match /(\d+\.\d+)% identity/ and save the captured string as $per -- push the record onto an "org" array -- if $per > cut-off value, push the record onto an "input" array # You now have all the original fasta output in @org, and # all the "above-threshold" records in @input. At this point, # you want to run fasta again, but I can't figure out what # input you want to give it, how many times you really need # to run it, or how you should use the subsequent output.
Can you possibly explain this last stage in clear, basic steps? To get you started, here's that first set of pseudo-code, converted to untested perl-code:
#!/usr/bin/perl use strict; use warnings; chdir "c:/perl/sam" or die "can't chdir to c:/perl/sam: $!"; # get the user-specified sequence file, library file # and cut-off percentage my ( $seqfile, $libfile, $cutoff ) = @ARGV; die "Usage: $0 seq.file lib.file cutoff\n" unless ( @ARGV == 3 and -f $seqfile and -f $libfile and $cutoff =~ /^\d+\.?\d*$/ ); # run fasta34.exe with the sequence file and library file my $main = "main.fasta"; my $result = system("fasta34.exe -O $main -Q $seqfile $library"); if ($result >> 8) { warn "fasta34.exe ended with non-zero exit status \n"; } # open the file that was output by fasta34 open(FASTA,"$main") or die "cant open fasta34 output file: $!"; $/ = '>>'; # set INPUT_RECORD_SEPARATOR to ">>" while (<FASTA>) { chomp; # this removes ">>" from the end of the string next if ( /^\s*$/ ); # skip the first read (update: decided not t +o test length()) ($per) = /(\d+\.\d+)% identity/; push @org, ">>$_"; push @input, ">>$_" if ($per > $cutoff); } close FASTA; # if you want to save @input and/or @org to a file and run fasta34 aga +in: open OUT, ">nextinput.fasta" or die $!; print OUT join '', @input; close OUT; # and likewise for @org
Try that out as a start and see where it might take you as you figure out what the next step should really be.

update: I just looked back at your ealier post on this project, so let's see if I understand better:

By running fasta34 a second time on just the "above-threshold" records, you're hoping that it will output new records that were not produced by the original run, and you want to assemble all the distinct records into one set.

If that's the case, then the "@org" array should really be a hash; you can use the whole record string as the hash key, or (if each record is really large) you can "digest" it using Digest::MD5, then use the resulting fixed-width MD5 "signature" as the hash key and the record itself as the hash value.

Then after running fasta34 a second (third, ...) time, read the new output file one record at a time and assign each one to %org in the same way as you did on the first run; if a given record has been seen before, you're just re-assigning the same hash element, and when it's new, you're adding a new hash element. When you're done with the last run, %org will contain all the distinct records produced by all runs, and you just need to print it to a file.

(I hope that's clear.)


In reply to Re: Could not open by graff
in thread Could not open by FarTech

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.