Re^5: Use of uninitialized value in string eq

Replies are listed 'Best First'.

Re^6: Use of uninitialized value in string eq
by sophix (Sexton) on Apr 22, 2010 at 23:41 UTC

Thanks guys. Sure, I can provide a sample.

A sample of the input file

>gi|269849759|sp|P04637.4|P53_HUMAN
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAA....
>gi|143235535|dbj|46784|P53_MUS
----MEEPQSDPSVEPPL---SQETFSDLWKLL---PENNV---LSPLPSQAMDDLMLS....

Above is a sample of the input file, which actually contains 20 such sequences. The first sequence in the sample is the protein in human, while the second is the same protein in mouse. I have sequences of the same protein for another 18 species.
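A minimal sketch of reading such a file into a hash keyed by GI number. It assumes every header has the shape `>gi|<number>|...` seen in the sample and that a sequence may span several lines; the name `read_fasta` and the in-memory sample (split over two lines purely for illustration) are made up for this example.

```perl
use strict;
use warnings;

# Parse FASTA-formatted text into a hash of GI number => sequence.
# Assumption: every header looks like ">gi|<number>|..." as in the sample.
sub read_fasta {
    my ($fh) = @_;
    my (%seq_for, $gi);
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;                  # drop newline and trailing blanks
        next if $line eq '';                 # skip empty lines
        if ($line =~ /^>/) {
            ($gi) = $line =~ /^>gi\|(\d+)\|/
                or die "Unexpected header: $line\n";
            $seq_for{$gi} = '';              # start a new record
        }
        elsif (defined $gi) {
            $seq_for{$gi} .= $line;          # append a continuation line
        }
    }
    return \%seq_for;
}

# An in-memory filehandle stands in for the real input file.
my $sample = <<'END';
>gi|269849759|sp|P04637.4|P53_HUMAN
MEEPQSAQ
WCCTV
>gi|143235535|dbj|46784|P53_MUS
--FKQSARTYC-V
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $seq_for = read_fasta($fh);
close $fh;

print "$_ => $seq_for->{$_}\n" for sort keys %$seq_for;
```

Keying on the GI alone discards the species name from the header, so a real script would probably want to keep that field as well.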

A sample of the second input file

P82L
R112S
A116V
A116T
[...]

This second input file contains a fairly long list of disease-associated variants. For example, P82L means that a change from P to L at the 82nd position of the human protein sequence (counting the first letter as position 1) is known to be associated with some disease, say breast cancer.
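Splitting a variant into its three parts can be sketched with a single regular expression. This assumes every variant is one uppercase letter, a number, and one uppercase letter, as in the sample; `parse_variant` is a made-up helper name.

```perl
use strict;
use warnings;

# Split a variant such as "P82L" into (source, position, sink).
# Returns an empty list when the string does not have that shape.
sub parse_variant {
    my ($variant) = @_;
    return unless defined $variant;
    return $variant =~ /^([A-Z])(\d+)([A-Z])$/;
}

for my $variant (qw(P82L R112S A116V bogus)) {
    if (my ($source, $position, $sink) = parse_variant($variant)) {
        print "$variant: $source -> $sink at position $position\n";
    }
    else {
        print "$variant: not a recognised variant\n";
    }
}
```

Anchoring the pattern with `^` and `$` means a malformed line is rejected rather than silently producing undefined parts.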

I am only interested in human diseases, so the positions in the variants refer to the human protein sequence. What I would like to do is check whether the sequences of the same protein from other species have L (which is the pathogenic state in humans) at their respective 82nd position.
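Because variant positions count from 1 while `substr` counts from 0, the lookup needs an offset of one. The sketch below also returns `undef` for a position outside the sequence, so the caller can test with `defined` before comparing with `eq`. Whether that has anything to do with the warning in the subject line is only a guess, and `residue_at` is a made-up name.

```perl
use strict;
use warnings;

# Return the residue at a 1-based position, or undef when the position
# falls outside the sequence.
sub residue_at {
    my ($sequence, $position) = @_;
    return undef unless defined $sequence;
    return undef if $position < 1 || $position > length($sequence);
    return substr($sequence, $position - 1, 1);
}

my $human = 'MEEPQSAQWCCTV';
for my $position (3, 13, 82) {
    my $residue = residue_at($human, $position);
    printf("position %d: %s\n", $position,
        defined $residue ? $residue : 'out of range');
}
```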

The GI number in each header is the unique identifier for the protein sequence in that particular species.

Thus, the algorithm that I would like to have looks like this:

1. Construct a hash with identifiers (i.e., GIs) and respective sequences.

2. Figure out which sequence belongs to the human species.

3. Get a variant and split it into three parts -- source, position and sink.

4. Confirm the variant against the human sequence, i.e., check whether the human sequence indeed contains P (healthy state) at the 82nd position.

5. Check all the remaining sequences to build a list of the letters they contain at their 82nd position.

6. In particular, note whether any of these letters match the sink (i.e., L), which is the pathogenic state.
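The steps above could be sketched roughly as follows. Several things here are assumptions rather than part of the description: the record layout (GI => name and sequence), picking the human record by a name ending in `_HUMAN`, treating `-` as a gap to skip, leaving out residues identical to the source, and the human sequence having no gaps so that a variant position equals the column in the alignment. The names `find_human_gi` and `check_variant` are made up.

```perl
use strict;
use warnings;

# Hypothetical record layout: GI => { name => last header field, seq => sequence }.
my %record_for = (
    269849759 => { name => 'P53_HUMAN', seq => 'MEEPQSAQWCCTV' },
    143235535 => { name => 'P53_MUS',   seq => '--FKQSARTYC-V' },
    343556    => { name => 'P53_EQUUS', seq => '--LKQSAPTY--S' },
);

# Step 2: assume the human record is the one whose name ends in _HUMAN.
sub find_human_gi {
    my ($records) = @_;
    my ($gi) = grep { $records->{$_}{name} =~ /_HUMAN$/ } sort keys %$records;
    return $gi;
}

# Steps 3-6 for one variant. Returns nothing when the variant is malformed
# or the human sequence does not carry the expected source residue.
sub check_variant {
    my ($records, $human_gi, $variant) = @_;
    my ($source, $position, $sink) = $variant =~ /^([A-Z])(\d+)([A-Z])$/
        or return;                                               # step 3

    my $human_seq = $records->{$human_gi}{seq};
    return if $position < 1 || $position > length($human_seq);
    return if substr($human_seq, $position - 1, 1) ne $source;   # step 4

    my (@deviations, @pathogenic);
    for my $gi (sort keys %$records) {
        next if $gi eq $human_gi;
        my $seq = $records->{$gi}{seq};
        next if $position > length($seq);
        my $residue = substr($seq, $position - 1, 1);            # step 5
        next if $residue eq '-' || $residue eq $source;          # gap or unchanged
        if ($residue eq $sink) { push @pathogenic, "$residue ($gi)" }   # step 6
        else                   { push @deviations, "$residue ($gi)" }
    }
    return { deviations => \@deviations, pathogenic => \@pathogenic };
}

my $human_gi = find_human_gi(\%record_for);
for my $variant (qw(E3P Q8R T12C)) {
    my $result = check_variant(\%record_for, $human_gi, $variant) or next;
    printf("%s: deviations [%s], pathogenic [%s]\n", $variant,
        join(', ', @{ $result->{deviations} }),
        join(', ', @{ $result->{pathogenic} }));
}
```

If the human sequence can contain gaps in the alignment, the position would first have to be mapped to an alignment column, which this sketch does not attempt.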

Let me write down a short example.

>gi|269849759|sp|P04637.4|P53_HUMAN
MEEPQSAQWCCTV
>gi|143235535|dbj|46784|P53_MUS
--FKQSARTYC-V
>gi|343556|emb|3432384|P53_EQUUS
--LKQSAPTY--S

and the variant list contains:

E3P
Q8R
T12C

The desired output would then look like the following:

Variant    Deviations                   Pathogenic_Deviations
E3P        F (143235535), L (343556)    None
Q8R        P (343556)                   R (143235535)
T12C       None                         None
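
A report in this layout could be printed with `printf` and fixed column widths. In this sketch the rows are simply copied from the expected output above; the column widths and the helper name `format_hits` are arbitrary choices for illustration.

```perl
use strict;
use warnings;

# Rows copied from the expected output; in a real script they would come
# from the comparison step.
my @rows = (
    { variant => 'E3P',  deviations => ['F (143235535)', 'L (343556)'], pathogenic => [] },
    { variant => 'Q8R',  deviations => ['P (343556)'], pathogenic => ['R (143235535)'] },
    { variant => 'T12C', deviations => [], pathogenic => [] },
);

# Join a list of hits, or say "None" when the list is empty.
sub format_hits {
    my ($hits) = @_;
    return @$hits ? join(', ', @$hits) : 'None';
}

my $format = "%-10s %-28s %s\n";
printf($format, 'Variant', 'Deviations', 'Pathogenic_Deviations');
for my $row (@rows) {
    printf($format, $row->{variant},
        format_hits($row->{deviations}),
        format_hits($row->{pathogenic}));
}
```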

This explanation got a bit long, sorry about that.
