in reply to Re^4: Use of uninitialized value in string eq
in thread Use of uninitialized value in string eq

Thus, a possible solution is to store the sequence into an array in the beginning. To be frank, I do not know how to do this.

If you show us some lines of data and explain how to identify individual components in a sequence (such that we know which is the 82nd, for example), we can explain how to write that code.

Re^6: Use of uninitialized value in string eq
by sophix (Sexton) on Apr 22, 2010 at 23:41 UTC
    Thanks, guys. Sure, I can provide a sample.

    A sample of the input file:

    >gi|269849759|sp|P04637.4|P53_HUMAN
    MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAA....
    >gi|143235535|dbj|46784|P53_MUS
    ----MEEPQSDPSVEPPL---SQETFSDLWKLL---PENNV---LSPLPSQAMDDLMLS....

    The real input file contains 20 such sequences. The first sequence in the sample is the protein in human, and the second is the same protein in mouse. The other 18 sequences are the same protein in 18 further species.

    A sample of the second input file:

    P82L
    R112S
    A116V
    A116T
    [...]

    This second input file contains a relatively long list of disease-associated variants. For example, P82L means that a change from P to L at the 82nd position of the human protein sequence (the first letter of the sequence is position 1) is known to be associated with some disease, say breast cancer.
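    A minimal sketch of splitting one variant into its three parts. The sub name parse_variant is hypothetical, and the pattern assumes every variant is one uppercase letter, a number, and one uppercase letter, as in the samples above.

```perl
use strict;
use warnings;

# Split a variant such as "P82L" into (source, position, sink).
# Returns an empty list when the string does not look like a variant,
# so the caller never ends up comparing undefined values.
sub parse_variant {
    my ($variant) = @_;
    return $variant =~ /^([A-Z])(\d+)([A-Z])$/ ? ($1, $2, $3) : ();
}

my ($source, $position, $sink) = parse_variant('P82L');
print "source=$source position=$position sink=$sink\n";
```

    Call it in list context. An empty result means the line should be skipped or reported rather than processed.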

    I am only interested in human diseases, so the positions in the variants refer to the human protein sequence. What I would like to do is check whether the sequences of the same protein from other species have L (the pathogenic state in humans) at their respective 82nd position.

    The GI number in each header is the unique identifier for the protein sequence in that particular species.
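    Because the GI numbers are unique, they can serve as hash keys. Below is a rough sketch of reading the alignment into a hash. It assumes the header layout of the sample (">gi|NUMBER|...") and that the human record can be recognised by "_HUMAN" in its header. The sub name read_alignment is hypothetical, and the data is the short example from this post, read through an in-memory filehandle instead of a real file.

```perl
use strict;
use warnings;

# Read aligned FASTA records from a filehandle.
# Returns a hash reference (GI number => sequence) and the GI of the
# record whose header contains "_HUMAN" (undef if there is none).
sub read_alignment {
    my ($fh) = @_;
    my (%seq_for, $gi, $human_gi);
    while (my $line = <$fh>) {
        $line =~ s/\s+$//;              # strip newline and trailing whitespace
        next unless length $line;       # skip blank lines
        if ($line =~ /^>gi\|(\d+)\|/) {
            $gi = $1;
            $seq_for{$gi} = '';
            $human_gi = $gi if $line =~ /_HUMAN\b/;
        }
        elsif (defined $gi) {
            $seq_for{$gi} .= $line;     # a sequence may span several lines
        }
    }
    return (\%seq_for, $human_gi);
}

my $fasta = <<'END';
>gi|269849759|sp|P04637.4|P53_HUMAN
MEEPQSAQWCCTV
>gi|143235535|dbj|46784|P53_MUS
--FKQSARTYC-V
END

open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my ($seq_for, $human_gi) = read_alignment($fh);
close $fh;

print "$_ => $seq_for->{$_}\n" for sort keys %$seq_for;
print "human GI: $human_gi\n";
```

    For a real file, replace the in-memory open with an open on the file name. If the order of the records matters for the output, the GIs would also have to be pushed onto an array as they are read.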

    Thus, the algorithm that I would like to have looks like this:

    1. Construct a hash with identifiers (i.e., GIs) and respective sequences.

    2. Figure out which sequence belongs to the human species

    3. Get a variant and split it into three parts -- source, position and sink.

    4. Confirm the variant against the human sequence, i.e., check that the human sequence indeed contains P (the healthy state) at the 82nd position.

    5. Check all the remaining sequences to get a list of the letters they contain at their 82nd position.

    6. In particular, note whether any of these letters match the sink (i.e., L), which is the pathogenic state.
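    Steps 3 to 6 could be sketched roughly as follows, using the short example from this post as data. The sub names column_for and check_variant are hypothetical. Three assumptions are mine and not stated above: positions count residues in the human sequence, so gaps in the human row (if there are any) are skipped when locating the column; a gap in another species is ignored; and a letter identical to the human one is not reported as a deviation.

```perl
use strict;
use warnings;

# Alignment from the short example, keyed by GI number, plus file order.
my %seq_for = (
    269849759 => 'MEEPQSAQWCCTV',    # human
    143235535 => '--FKQSARTYC-V',
    343556    => '--LKQSAPTY--S',
);
my @gi_order = (269849759, 143235535, 343556);
my $human_gi = 269849759;

# Map a 1-based residue position in the human sequence to a 0-based
# alignment column, skipping gaps. Returns undef if out of range.
sub column_for {
    my ($aligned, $position) = @_;
    my $seen = 0;
    for my $col (0 .. length($aligned) - 1) {
        next if substr($aligned, $col, 1) eq '-';
        return $col if ++$seen == $position;
    }
    return;
}

# Returns (\@deviations, \@pathogenic) for one variant such as "Q8R".
# Dies if the variant is malformed or does not match the human sequence.
sub check_variant {
    my ($variant, $seq_for, $order, $human_gi) = @_;
    my ($source, $position, $sink) = $variant =~ /^([A-Z])(\d+)([A-Z])$/
        or die "Malformed variant '$variant'\n";
    my $human = $seq_for->{$human_gi};
    my $col   = column_for($human, $position);
    die "Human sequence does not have $source at position $position\n"
        unless defined $col && substr($human, $col, 1) eq $source;

    my (@deviations, @pathogenic);
    for my $gi (grep { $_ ne $human_gi } @$order) {
        my $seq = $seq_for->{$gi};
        next if $col >= length $seq;    # short sequence: nothing to compare
        my $residue = substr($seq, $col, 1);
        next if $residue eq '-' || $residue eq $source;
        if ($residue eq $sink) { push @pathogenic, "$residue ($gi)" }
        else                   { push @deviations, "$residue ($gi)" }
    }
    return (\@deviations, \@pathogenic);
}

for my $variant (qw(E3P Q8R T12C)) {
    my ($dev, $path) = check_variant($variant, \%seq_for, \@gi_order, $human_gi);
    printf "%-8s %-28s %s\n", $variant,
        (@$dev  ? join(', ', @$dev)  : 'None'),
        (@$path ? join(', ', @$path) : 'None');
}
```

    The length check before substr is there so that a sequence shorter than the human one cannot produce an undefined value, which would bring back the "Use of uninitialized value in string eq" warning.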

    Let me write down a short example.

    >gi|269849759|sp|P04637.4|P53_HUMAN
    MEEPQSAQWCCTV
    >gi|143235535|dbj|46784|P53_MUS
    --FKQSARTYC-V
    >gi|343556|emb|3432384|P53_EQUUS
    --LKQSAPTY--S

    and the variant list contains:

    E3P
    Q8R
    T12C

    The desired output would then look like this:

    Variant  Deviations                  Pathogenic_Deviations
    E3P      F (143235535), L (343556)   None
    Q8R      P (343556)                  R (143235535)
    T12C     None                        None

    This explanation ran a bit long, sorry about that.