in reply to Re^3: Use of uninitialized value in string eq
in thread Use of uninitialized value in string eq

Thank you John and ikegami. As I was afraid, it seems that I do still have difficulties with the basics of Perl (or rather of programming).

So the problem seems to have started when I first read the sequence line as $info{$gi}=$line;.

And then @temp = $info{$humangi} behaves like $temp[0] = $info{$humangi}. I hope I have this right.

Thus, a possible solution is to store the sequence into an array in the beginning. To be frank, I do not know how to do this. Alternatively, there might be a way to convert a scalar value to an array list. (I would use split, but as I see from your last comment on the use of the split function, this is not a good way to do it.)

Can you help me on this one?
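
To make the scalar-versus-array point concrete, here is a minimal sketch. The hash contents are made-up stand-ins for one entry of %info, and it only illustrates the behaviour described above; it is not necessarily what the earlier replies had in mind. Assigning a scalar to an array gives a one-element array, which would explain why every index other than 0 is undefined. substr reads a single residue directly, and split with an empty pattern is shown only for comparison.

```perl
use strict;
use warnings;

# Hypothetical data standing in for one entry of %info.
my %info    = ( 269849759 => 'MEEPQSAQWCCTV' );
my $humangi = 269849759;

# Assigning a scalar to an array yields a one-element array.
my @temp = $info{$humangi};
print scalar(@temp), "\n";          # 1

# substr reads one residue without building an array.
# Variant positions are 1-based; substr offsets are 0-based.
my $pos     = 3;
my $residue = substr( $info{$humangi}, $pos - 1, 1 );
print "$residue\n";                 # E

# split with an empty pattern gives one element per character.
my @residues = split //, $info{$humangi};
print "$residues[ $pos - 1 ]\n";    # E
```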

Re^5: Use of uninitialized value in string eq
by chromatic (Archbishop) on Apr 22, 2010 at 07:40 UTC
    Thus, a possible solution is to store the sequence into an array in the beginning. To be frank, I do not know how to do this.

    If you show us some lines of data and explain how to identify individual components in a sequence (such that we know which is the 82nd, for example), we can explain how to write that code.

      Thanks guys. Sure I can provide some sample.

      A sample of the input file:

      >gi|269849759|sp|P04637.4|P53_HUMAN
      MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAA
      +....
      >gi|143235535|dbj|46784|P53_MUS
      ----MEEPQSDPSVEPPL---SQETFSDLWKLL---PENNV---LSPLPSQAMDDLMLS....

      Above is the sample of the input file which actually contains 20 different such sequences. The first sequence in the sample refers to the protein in human while the second refers to the same protein in mouse. I have another 18 species for which sequences for the same protein exist.

      A sample of the second input file:

      P82L R112S A116V A116T [...]

      This second input file contains a relatively long series of disease-associated variants. That is, P82L implies that if, within the protein sequence in human, there is a change at the 82nd position (first letter in the sequence refers to the first position) from P to L, we know that this substitution is associated with some disease, say breast cancer.
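
      Splitting a variant such as P82L into its three parts can be sketched with one regular expression. The helper name parse_variant is hypothetical, and the pattern assumes a single uppercase letter, a run of digits, and another single uppercase letter, as in the samples above.

```perl
use strict;
use warnings;

# parse_variant is a hypothetical helper name, not taken from the thread.
# It returns (source, position, sink), or an empty list when the text
# does not look like a variant.
sub parse_variant {
    my ($variant) = @_;
    my @parts = $variant =~ /^([A-Z])(\d+)([A-Z])$/;
    return @parts;
}

my ( $source, $position, $sink ) = parse_variant('P82L');
print "$source $position $sink\n";    # P 82 L
```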

      I am only interested in human diseases. So the positions in the variants refer to the protein sequence of human species. What I would like to do is to check whether the sequences of the same protein from other species have L (which is a pathogenic state to humans) at their respective 82nd position.

      GI numbers are the unique identifiers for this protein sequence in that particular species.
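
      Reading the alignment into a hash keyed by GI might look like the sketch below. read_alignment is a hypothetical helper. It assumes every header starts with >gi| followed by the number, that a sequence may continue over several lines, and that the human entry is the one whose header ends in _HUMAN, as in the sample headers. The data is the short three-sequence example from later in this post, read from an in-memory filehandle so the snippet runs without a file.

```perl
use strict;
use warnings;

# read_alignment is a hypothetical helper. It returns a hash reference of
# GI => sequence and the GI whose header ends in _HUMAN (an assumption
# based on the sample headers).
sub read_alignment {
    my ($fh) = @_;
    my ( %info, $gi, $humangi );
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;
        if ( $line =~ /^>gi\|(\d+)\|/ ) {
            $gi        = $1;
            $info{$gi} = '';
            $humangi   = $gi if $line =~ /_HUMAN\s*$/;
        }
        elsif ( defined $gi ) {
            $line =~ s/\s+//g;      # drop stray whitespace
            $info{$gi} .= $line;    # a sequence may span several lines
        }
    }
    return ( \%info, $humangi );
}

my $fasta = <<'END';
>gi|269849759|sp|P04637.4|P53_HUMAN
MEEPQSAQWCCTV
>gi|143235535|dbj|46784|P53_MUS
--FKQSARTYC-V
>gi|343556|emb|3432384|P53_EQUUS
--LKQSAPTY--S
END

open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my ( $info, $humangi ) = read_alignment($fh);
close $fh;

print "$humangi\n";             # 269849759
print "$info->{$humangi}\n";    # MEEPQSAQWCCTV
```

      Storing each sequence as a plain string keeps the hash simple; single residues can then be read with substr when they are needed.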

      Thus, the algorithm that I would like to have looks like this:

      1. Construct a hash with identifiers (i.e., GIs) and respective sequences.

      2. Figure out which sequence belongs to the human species

      3. Get a variant and split it into three parts -- source, position and sink.

      4. Check the human sequence for the confirmation of this variant -- check whether the human sequence indeed contains P (healthy state) at the 82nd position

      5. Check all the rest of sequences to end up with a list of letters that these sequences contain at their 82nd position

      6. Of interest, note down if any of these letters match the sink (i.e., L) which is a pathogenic state.
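
      Steps 3 to 6 could be sketched roughly as follows. check_variant is a hypothetical helper, and the sketch rests on several assumptions: the human sequence contains no gaps (so alignment columns equal human positions), gap characters and residues identical to the human one are not reported, and the other species are visited in ascending GI order rather than file order.

```perl
use strict;
use warnings;

# check_variant is a hypothetical helper covering steps 3 to 6.
# Assumption: the human sequence has no gaps, so alignment columns
# coincide with human positions.
sub check_variant {
    my ( $variant, $info, $humangi ) = @_;

    # Step 3: split the variant into source, position and sink.
    my ( $source, $pos, $sink ) = $variant =~ /^([A-Z])(\d+)([A-Z])$/
        or die "Malformed variant '$variant'\n";

    # Step 4: confirm that the human sequence has the source residue.
    return if $pos > length $info->{$humangi};
    return if substr( $info->{$humangi}, $pos - 1, 1 ) ne $source;

    # Steps 5 and 6: collect residues from the other species.
    my ( @deviations, @pathogenic );
    for my $gi ( sort { $a <=> $b } grep { $_ ne $humangi } keys %$info ) {
        next if $pos > length $info->{$gi};
        my $residue = substr( $info->{$gi}, $pos - 1, 1 );
        next if $residue eq '-' or $residue eq $source;    # gap or same as human
        if ( $residue eq $sink ) { push @pathogenic, "$residue ($gi)" }
        else                     { push @deviations, "$residue ($gi)" }
    }
    return ( \@deviations, \@pathogenic );
}

# The short example from this post.
my %info = (
    269849759 => 'MEEPQSAQWCCTV',
    143235535 => '--FKQSARTYC-V',
    343556    => '--LKQSAPTY--S',
);

for my $variant (qw(E3P Q8R T12C)) {
    my ( $dev, $path ) = check_variant( $variant, \%info, 269849759 );
    unless ($dev) {
        print "$variant\tnot confirmed in human\n";
        next;
    }
    printf "%s\t%s\t%s\n", $variant,
        ( @$dev  ? join( ', ', @$dev )  : 'None' ),
        ( @$path ? join( ', ', @$path ) : 'None' );
}
```

      If the real human sequence does contain gaps, the position would first have to be mapped to an alignment column by counting non-gap characters; that is not handled here.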

      Let me write down a short example.

      >gi|269849759|sp|P04637.4|P53_HUMAN
      MEEPQSAQWCCTV
      >gi|143235535|dbj|46784|P53_MUS
      --FKQSARTYC-V
      >gi|343556|emb|3432384|P53_EQUUS
      --LKQSAPTY--S

      and the variant list contains:

      E3P Q8R T12C

      Thus, the desirable output should look like the following:

      Variant  Deviations                  Pathogenic_Deviations
      E3P      F (143235535), L (343556)   None
      Q8R      P (343556)                  R (143235535)
      T12C     None                        None

      This explanation took a bit long, sorry for that.

Re^5: Use of uninitialized value in string eq
by ikegami (Patriarch) on Apr 22, 2010 at 17:06 UTC
    I did not present a solution because it's not clear what you are trying to do. Giving us input samples and the output you expect to get from those inputs would be useful.