twaddlac has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Monks!

I am trying to write a quality file and I am getting an error (as follows) and I'm not sure how to fix it. Along with the error report I have test data. Let me know if you have any suggestions! Thank you very much!!

# test data 37 37 37 37 37 37 37 37 37 37 37 40 40 40 40 40 40 40 40 40 40 40 40 4 +0 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 4 +0 40 40 40 40 39 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 3 +7 35 35 35 35 35 35 30 30 30 28 28 28 30 30 32 32 35 37 37 37 37 37 3 +7 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 3 +5 35 35 37 37 34 32 32 32 32 32 32 32 30 16 15 15 15 24 27 26 26 13 1 +3 13 23 23 13 13 13 17 21 21 21 21 21 19 21 16 18 17 19 18 # error report --------------------- WARNING --------------------- MSG: seq doesn't validate with [0-9A-Za-z\*\-\.=~\\/\?], mismatch is +, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , + , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , +, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , + , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , +, , , , , , , , , , , , , , , , , , --------------------------------------------------- ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: Attempting to set the sequence to [37 37 37 37 37 37 37 37 37 37 +37 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 +40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 37 37 37 37 +37 37 37 37 37 37 37 37 37 37 37 37 37 37 35 35 35 35 35 35 30 30 30 +28 28 28 30 30 32 32 35 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 +37 37 37 37 37 37 37 37 37 37 37 37 37 35 35 35 37 37 34 32 32 32 32 +32 32 32 30 16 15 15 15 24 27 26 26 13 13 13 23 23 13 13 13 17 21 21 +21 21 21 19 21 16 18 17 19 18] which does not look healthy STACK: Error::throw STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/ +Root.pm:357 STACK: Bio::PrimarySeq::seq /usr/lib/perl5/site_perl/5.8.8/Bio/Primary +Seq.pm:270 STACK: Bio::PrimarySeq::new /usr/lib/perl5/site_perl/5.8.8/Bio/Primary +Seq.pm:221 STACK: Bio::LocatableSeq::new /usr/lib/perl5/site_perl/5.8.8/Bio/Locat +ableSeq.pm:109 STACK: Bio::Seq::Meta::Array::new /usr/lib/perl5/site_perl/5.8.8/Bio/S +eq/Meta/Array.pm:167 STACK: Bio::Seq::Quality::new /usr/lib/perl5/site_perl/5.8.8/Bio/Seq/Q +uality.pm:191 STACK: tagfinder.pl:257

Replies are listed 'Best First'.
Re: .qual File Writing Complication
by ww (Archbishop) on Jul 08, 2010 at 16:50 UTC
    What is a .qual file? (not that we really need to know, but it would be a nice touch for non-bio monks)
    Where is your code?

      A .qual file is a numerical representation of the confidence of a nucleotide sequencer. Used by the modern, automated sequencers, they assign a 2 digit value to each nucleotide ACTGU (which are what all DNA is composed of) representing the accuracy of the sequencer guessing at that particular nucleotide (in my case, 40 being = 100% confidence). For example: if I were a DNA sequencer and I KNOW that the first three nucleotides are ACT but I'm only partially confident that the last nucleotide is a G. The .qual file would look something like this: 40 40 40 34. I hope this makes sense! Let me know if you need/would like further explanation!



      here's my code:
      #!/usr/bin/perl -w use strict; use warnings; use Bio::Seq; use Bio::SeqIO; use Bio::Seq::Quality; my $seq_count = 0; my $used_seqs; my $match; my $front; my $end; my $front_tag = 0; my $end_tag = 0; my $nfront = 0; my $nend = 0; my $no_match = 0; my $seq_file = "fastaTestData.fna"; my $qual_file = "qualTestData.qual"; my $seqIn_obj = Bio::SeqIO->new(-file => "$seq_file", -format => 'fast +a') or die "Can't access $seq_file\n"; #creates the fasta output file print "creating fasta output file\n"; my $fasta = Bio::SeqIO->new(-file => '>testFasta.fasta', -format => 'f +asta'); # creates new qual output file print "creating quality output file\n"; my $quality = Bio::SeqIO->new(-file => '>testQual.qual', -format => 'q +ual'); #reads in the tag file as a fasta file print "Reading in tag file\n"; my $tag_obj = Bio::SeqIO->new(-file => "mids.fa", -format => 'fasta') +or die "Can't access mids.fa\n"; #adds the tag id and sequence into a hash print "creating tag hash\n"; my %tag_hash = (); while(my $tag = $tag_obj->next_seq){ # instantiates the hash map, using the tag id as the key and the t +ag sequence as the value. $tag_hash{$tag->id} = $tag->seq; } # reads in the quality files associated with the fasta files. print "reading in qual file\n"; my $qual_obj = Bio::SeqIO->new(-file => "$qual_file", -format => "qual +") or die "Can't access $qual_file"; print "creating qual hash\n"; my %qual_hash = (); while(my $qual_seq = $qual_obj->_next_qual){ # next line is pushing a reference to an array onto an array of sc +alar array references $qual_hash{$qual_seq->id()} = $qual_seq->qual(); } # reads in the next sequence (next_seq) and stores it in $seqIn_obj th +en prints it out. print "Applying regexs\n"; while (my $seq = $seqIn_obj->next_seq){ print "getting new sequence\n"; $seq_count += 1; $match = 0; $front = 0; $end = 0; my $temp_seq = $seq->seq; while((my $id, my $tag_seq) = each(%tag_hash)){ print "cycling through tag hash\n"; # reverses the tag_seq my $rtag_seq = reverse $tag_seq; for(my $i = 0; $i <= 1; $i += 1){ if($temp_seq =~ s/^$tag_seq//i){ print "removing tag\n"; $match += 1; $front += 1; $front_tag = 1; } if($temp_seq =~ s/^$rtag_seq//i){ print "removing tag\n"; $match += 1; $front += 1; $front_tag = 1; } if($temp_seq =~ s/$tag_seq$//i){ print "removing tag\n"; $match += 1; $end += 1; $end_tag = 1; } if($temp_seq =~ s/$rtag_seq$//i){ print "removing tag\n"; $match += 1; $end += 1; $end_tag = 1; } } } if($front > 1){ $nfront += 1; } if($end > 1){ $nend += 1; } if($match == 0){ $no_match += 1; } if($front <= 1 && $end <= 1 && $match != 0){ print "removing 6 nts and amending qual sequence\n"; $used_seqs += 1; my @qual = @{$qual_hash{$seq->id()}}; print ("@qual\n"); my $qual_string = join(" ", @qual); print "$qual_string\n"; $qual_string =~ s///; print "$qual_string\n"; if($front_tag == 1){ $qual_string =~ s/^\w{32}//; $temp_seq =~ s/^\w{6}//; $front_tag = 0; } elsif($end_tag == 1){ $qual_string =~ s/\w{32}$//; $temp_seq =~ s/\w{6}$//; $end_tag = 0; } # creates the new sequence and passes the metadata to the sequ +ence object print "creating fasta seq and writing to file\n"; my $new_seq = Bio::Seq->new(-seq => $temp_seq, -display_id => +$seq->id, -length => $seq->length, -desc => $seq->desc, -alphabet => $seq->alphabet); # writes the new sequence object to the specified file $fasta->write_seq($new_seq); print "creating qual seq and writing to file\n"; my $new_qual = Bio::Seq::Quality->new(-seq => $qual_string, -i +d => $seq->id(), -desc => $seq->desc); $quality->write_seq($new_qual); } } print("nfront = $nfront\nnend = $nend\nno_match = $no_match\nseq_count + = $seq_count\nused_seqs = $used_seqs\n");


      Feel free to ask if anything doesn't make sense (for I may have left something out from the copy and paste), if there are any hints to make it more efficient etc.. most importantly, to fix the error stated previously! Thanks again!
        #!??? How does one leave "something out from the copy and paste?" :-)

        No syntax errors found. Warnings in OP appear to come from Bio:xxx; not from Perl.

        Based on the warning MSG: seq doesn't validate with [0-9A-Za-z\*\-\.=~\\/\?] and the fact I can't find any attempt in your code to "validate" a "$seq" with such a character class nor even a ($temp_seq I have to wonder: could your fastaTestData.fna be corrupt?

        Fixed unclosed code tag

        Perhaps it's just naivete on my part (I'm not at all familiar with "try/catch/throw exception" style error handling), but I find it odd that your error-report stack trace ends with this:
        STACK: tagfinder.pl:257
        which would suggest to me that the your script that uses those Bio modules and throws that stack trace (i.e. the code that you wrote yourself) should have at least 257 lines in it. The thing that's odd about it is that the code you posted here isn't that long -- it's only 172 lines.

        So I don't think it makes a lot of sense to look for the error in the code you posted.

        Apart from that, I wonder if maybe you are getting your files confused -- like maybe you are passing your file full of numbers to a function that is expecting a file full of nucleotide letters, or something like that.

        Anyway, ask yourself whether you can break things up into smaller components, identify the particular piece that is causing trouble, and present just a small snippet of code and data to demonstrate the problem -- something that can be run by anyone who has the necessary modules installed (which wouldn't include me, I'm afraid). Good luck with that...

        I looked up your modules on CPAN and I see you're using the BioPerl package. There are 860 modules, 84 utility scripts, and a dozen or more documents. I don't know about the rest of y'all, but I'm not going to install and search the files. :-)

        From what I can see, Bio::Root::Exception is designed to throw exceptions, and the stack trace seems to be saying that it began with tagfinder.pl. How does that relate to the code you posted?

        I would guess that one of the many BioPerl modules is attempting to validate your data and failing. But we don't know where in your script it's failing or how you're running the program. In fact, I notice you've got print statements throughout your code, but there are none in your original post; so I can only guess that you didn't post the entire output. Is that true? If so, then what do you expect us to do? Even if someone here knows BioPerl and decides to respond, you still haven't given them enough to go on.

        Care to try again?

        --marmot

        Just to note, as others have indicated putting the code here that generated that stack trace does help to trace the problem down. Looks like the problem arises when you are creating a new Bio::Seq::Quality, on line 165 of the code you submitted.

        According to the Bio::Seq::Quality you are using the wrong named attribute for the quality information:

        my $new_qual = Bio::Seq::Quality->new(-seq => $qual_string, -id => $seq->id(), -desc => $seq->desc);

        That should be:

        my $new_qual = Bio::Seq::Quality->new(-qual => $qual_string, -id => $seq->id(), -desc => $seq->desc);

        This class also accepts sequence data to go with those scores; not sure if it requires it, though.

Re: .qual File Writing Complication
by BrowserUk (Patriarch) on Jul 09, 2010 at 06:42 UTC
    seq doesn't validate with [0-9A-Za-z\*\-\.=~\\/\?], mismatch is , , , + , , +, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , + , , , , , + , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , +, , , , , +, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , + , , , , , + , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,

    Hm. Seems fairly obvious given the comprehensive feed back. You are attempting to write a ".qual file" with APIs intended for writing FASTA files. FASTA sequences do not contain spaces. (Apparently) ".qual" data does.

    That's what the regex defines and all the commas above are. They are delimiting the errant spaces between your qual data values.

    The solution: Don't use O'Woe engineered modules.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: .qual File Writing Complication
by cjfields (Novice) on Jul 11, 2010 at 20:49 UTC
    Questions for BioPerl should be directed to the BioPerl mailing list, which is indicated in the BioPerl documentation. Hard to say what's occurring there w/o code, though (can't tell how you're trying to create the Bio::Seq::Quality object).
      While this is an option it is not a 'should', there are many monks in here who know BioPerl so we are just opening another gate annexed to the mailing list you are speaking about...

      All Perl related questions are welcome and BioPerl is Perl wrapped up in modules....

        True, this is an option and isn't meant to slight Perlmonks. My point is the code documentation for this module, as well as all BioPerl modules, indicates where questions for BioPerl are normally directed. BioPerl developers (of which I am one) might be able to respond more directly there, and one might even reach the developer responsible for the code in this case.