perl task3

huzefa has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: perl task3 by ELISHEVA (Prior) on Mar 07, 2009 at 21:43 UTC
Providing the data makes a big difference and now it is much clearer why your script is having problems: you are tying yourself in knots trying to treat each line as a separate input record. This makes you think that a single record split between two lines is two records: a "header" record followed by a "data" record. In actual fact newlines do not define records: ">" does. The newline is just whitespace that can be removed from the record. Nor is there a header record - the data you are calling a header is just one of many fields. You are ignoring the labels for data provided in each field rather than using them to their fullest. You never check for undefined values. Rather you seem to assume that fields will always have defined values, when in fact, they sometimes don't. Then when you try to split an undefined sequence or include some other undefined field in a print statement, you get (surprise, surprise) -- warnings! There are a few other issues with regular expressions, variable name selection and your methods for formatting output, that deserve some attention. I've posted a revised version of your code with annotations - including how to choose an alternate record separator - see the documentation for `$/` in perldoc:://perlvar for more information. Hopefully, you can use this to avoid parsing problems in future code. use strict; use warnings; my %s2m = (A => 71.0371, C => 103.0092, D => 115.0269, E => 129.0426, F => 147.0684, G => 57.0215, H => 137.0589, I => 113.0841, K => 128.0950, L => 113.0841, M => 131.0405, N => 114.0429, P => 97.0528, Q => 128.0586, R => 156.1011, S => 87.0320, T => 101.0477, V => 99.0684, W => 186.0793, Y => 163.0633, '\s' => 0.0, "" => 0.0 ); #when you need to keep labels and columns lined up, using #printf with a format string is usually a better idea than #one long string - for the header %s is the labels and for #the rows %s are the values. Also you can use the format #string to set field widths - tabs aren't very reliable as #a way to line up columns because different users can have #different tab settings: my $OUTPUT_FMT= "%-35s%-8s%-15s%-3s%-3s%s\n"; printf $OUTPUT_FMT, 'Prot_name', 'peptide', 'mass-to-charge' , 'z', 'p', 'sequence'; #since we don't want complaints about undefined values, #we need to check for undefined values and provide a #string representation for the undefined value - see below #for how $MISSING is used my $MISSING='-'; #There is no law saying you have to split you input on a newline. #Since > marks the start of a new record, why not use that to #divide records. You can use $/ to choose a record separator $/='>'; #For complicated parsing it is usually best to use a named variable #to hold the record or line rather than rely on $_. while ( my $line = <DATA> ) { #when you use a named variable, you need to pass it to chomp #like this: chomp $line; #this is a more efficient way to skip empty lines than /^\s$/ #just look for at least one non-space character next unless $line =~ /\S/; #skip unless at least one non-space #unless white space around field and subfield separators is #an actual value, you should make whitespace part of #the separator - so /\s\\|\s/ instead of /\|/ my @aFields = split /\s\\|\s/, $line; #pipe is field separator #stripping whitespace from around separators will still leave #white space at start of first field and end of last field, #so strip it as a separate step if (scalar(@aFields)) { $aFields[0] =~ s/^\s(\S.)$/$1/; $aFields[-1] =~ s/^(.\S)\s$/$1/; } #each field consists of label:value - taking advantage of this #we can just put field values into a hash where the label is #the key my %hFields; foreach my $sField (@aFields) { my ($sLabel, $sValue) = split(/\s:\s/, $sField); $hFields{$sLabel} = $sValue; } #If a field is missing from a record, then attempts to pass #it to split, length, or printf will result in complaints #about undefined values, so before using anything either #check to see if it is defined and/or assign it a default #value #Note also, we check for "unless defined($hFields{Missed})" #and not just "unless $hFields{Missed}" #unless $var would also set $var=$MISSING if $var==0. $hFields{Protein}=$MISSING unless defined($hFields{Protein}); $hFields{Peptide}=$MISSING unless defined($hFields{Peptide}); $hFields{Missed}=$MISSING unless defined($hFields{Missed}); #To split a string into characters, use // not / /. #Also if you are calculating mass, why not use mass as a #variable name: it is much more self documenting #my $total = 0.0; #my @peptides = split (/ /, $sequence);#65 my $mass=0.0; if (defined($hFields{Seq})) { my @peptides = split(//, $hFields{Seq}); foreach my $peptide (@peptides) { $mass += ($s2m{$peptide}+18.0106) if defined $s2m{$peptide}; } } else { $hFields{Seq} = $MISSING; } printf $OUTPUT_FMT, $hFields{Protein}, $hFields{Peptide} , $mass, '1', $hFields{Missed}, $hFields{Seq}; } [download] Best, beth	[reply] [d/l] [select]
Re: perl task3 by CountZero (Bishop) on Mar 07, 2009 at 09:16 UTC
May be you could be so kind as to indicate which lines are "52,54,56,58.." so we do not have to count them ourselves? If it are the splits in the `foreach` loop, did you check if `$value` and `@elements` contain what you expect them to contain? As we do not know what is in `%sequences` it is difficult to guess, but perhaps `(my $key, my $value) = %sequences` [download] should be `(my $key, my $value) = each %sequences` [download] OP changed the code after I wrote my answer. CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply] [d/l] [select]
Re: perl task3 by ELISHEVA (Prior) on Mar 07, 2009 at 17:21 UTC
The first thing I notice looking at your code is that your method of populating the hash `%sequences` would appear to be inconsistent with your routines to parse the contents of the hash. When you read in the file, you count any line without spaces as a header and skip any lines composed entirely of spaces. That would mean that only lines with a mix of spaces and non-spaces would count as values. If you have no such lines, then your hash keys will always have undefined values. Consequently when you visit each key-value pair of the hash in the `foreach` statement) all of your `split` routines will be splitting an undefined value. Now I do note that you expect spaces within the peptide chains portion of non-header lines. But is each and every header row followed by a row with a non-empty value for the peptide chains with at least one space? Also it is a bit confusing that you are first splitting the non-header line (`$value`) using a colon ":" and then splitting the same non-header line using a pipe "\|". Which is the actual field separator? Or were you intending to split the non-header line into fields using a colon and then split each field into subfields using a pipe? If so you need to move the line splitting each field to inside the `foreach` loop. Of course, without any sample data it is hard to what non-header values are like. Perhaps you could provide some sample lines? A further side note: you would do well to use different names for `$value` in your inner and outer loops. This makes reading your code confusing. It is also possible that you yourself might sometimes intend the meaning in the outer loop (the entire non-header row terminated by ";") and sometimes the meaning in the inner loop (the subfields created by splitting the row using ":"). Best, beth	[reply] [d/l] [select]
Re: perl task3 by graff (Chancellor) on Mar 07, 2009 at 16:46 UTC
It looks like you updated the OP in response to the advice from CountZero -- you really should have added a comment to say that you made the update, so that it's easier for the rest of us to understand what is going on. You should have also made a comment about whether his suggestion made any difference in the problem. (Since you are now showing the line numbers in the code that cause the "uninitialized values" warnings, I guess you are still having a problem.) Another nice gesture would be to remove stuff from the OP code that has nothing to do with the problem -- e.g. we don't need to see those nonsense lines at the end that are commented out, and the "%s2a" hash that you declare is never used, so why have it there? Providing some sample data would help a lot, too. Instead of showing the file name that you open for reading (which doesn't help us at all, since we don't have that file), you could use the "DATA" file handle, and include just enough data at the end of your posted code, like this: `... while (<DATA>) ... __DATA__ >some:header\|text\|foo:bar\|blah WHATEVER:DATA\|WOULD:DEMONSTRATE\|THE:PROBLEM ...` [download] As for your lines that are causing the warnings (in the version of the OP that I saw), here are some hints: foreach $value (@e) { my @elements= split(/\\|/, $value); ... # next 2 lines assume that $value had at least 6 strings # separated by "\|" (but maybe it only had 2): my @missedc = split (/:/, $elements[2]);#60 ... my @s = split(/:/, $elements[5]);#62 my $sequence = $s[1]; # when the assignment to @s warns and fails (because # @elements < 6) the next line will also warn and fail: my @peptides = split (/ /, $sequence);#65 ... # and so will this line: my $output = "$headers[1]\t\t$peptide\t\t$total \t\t 1 \t\t$mi +ssed\t$sequence\n";#72 ... } [download] So the point is, the various warnings are all related, and the source of the problem is probably in your data (or in your untested assumptions about the data).	[reply] [d/l] [select]
Re: perl task3 by CountZero (Bishop) on Mar 07, 2009 at 16:35 UTC
huzefa, You have significantly changed your code since you posted your original question. Now the answer I gave does not make much sense anymore. But as you are a new member, we will forgive you for this. Note however that in order to get the most out of your experience in the Monastery, updating your question or code without indicating what you changed is generally frowned upon (and for good reason). Also try to limit your code to the smallest part that exhibits the problem and do it in such a way that it will run on its own without having to rely on external files. For all we know, your "P1GroupCExercise2_trypsin.txt" is empty or might contain one of Shakespeare's sonnets. Adding some sample data at the end of your code in a `__DATA__` section will be much better and adding a comment at the end of the lines where the errors / warnings occur will be considered "good style" and will get you much XP. If you can do that we will gladly look once more at your problem. CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply] [d/l]
Re: perl task3 by Bloodnok (Vicar) on Mar 07, 2009 at 12:33 UTC
CountZero++ - you're my hero for the day for even attempting to interpret the code:-D huzefa, I take did you didn't check the format of the submitted posting...try putting a </code> tag at the end of the code...you might then gain the benefit of the wisdom of a few more monks - once they can read the code. A user level that continues to overstate my experience :-))	[reply]