Thanks guys. Sure, I can provide some samples.
A sample of the first input file:
>gi|269849759|sp|P04637.4|P53_HUMAN
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAA
+....
>gi|143235535|dbj|46784|P53_MUS
----MEEPQSDPSVEPPL---SQETFSDLWKLL---PENNV---LSPLPSQAMDDLMLS....
The actual input file contains 20 such sequences. The first sequence in the sample is the protein in human and the second is the same protein in mouse. I have sequences of the same protein for another 18 species.
A sample of the second input file:
P82L
R112S
A116V
A116T
[...]
This second input file contains a fairly long list of disease-associated variants. For example, P82L means that a change from P to L at position 82 of the human protein sequence (the first letter of the sequence is position 1) is known to be associated with some disease, say breast cancer.
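To make the variant format concrete, here is a minimal Python sketch of splitting a variant into its three parts. The names parse_variant and VARIANT_RE are made up for illustration, and it assumes every variant is one uppercase letter, a number, and one uppercase letter.

```python
import re

# Hypothetical pattern: one letter (source), digits (position), one letter (sink).
VARIANT_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_variant(text):
    """Split a variant such as "P82L" into (source, position, sink)."""
    match = VARIANT_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid variant: {text!r}")
    source, position, sink = match.groups()
    # The position is 1-based, as in the variant file.
    return source, int(position), sink
```

Lines that do not match the pattern (such as an empty line) raise a ValueError instead of being silently skipped.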
I am only interested in human diseases, so the positions in the variants refer to the human protein sequence. What I would like to do is check whether the sequences of the same protein from other species have L (the pathogenic state in humans) at their respective 82nd position.
The GI number is the unique identifier of the protein sequence in a particular species.
Thus, the algorithm that I would like to have looks like this:
1. Construct a hash with identifiers (i.e., GIs) and respective sequences.
2. Figure out which sequence belongs to the human species.
3. Get a variant and split it into three parts -- source, position and sink.
4. Check the human sequence for the confirmation of this variant -- check whether the human sequence indeed contains P (healthy state) at the 82nd position.
5. Check all the rest of the sequences to end up with a list of letters that these sequences contain at their 82nd position.
6. Of interest, note down if any of these letters match the sink (i.e., L), which is a pathogenic state.
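Steps 1 and 2 could be sketched roughly as follows in Python, where a dict plays the role of the hash. This is only an illustration: read_fasta and find_human are made-up names, headers are assumed to look like ">gi|<number>|...", and the human entry is assumed to be the one whose header ends in "_HUMAN", as in my sample.

```python
def read_fasta(lines):
    """Return {gi: (header, sequence)} built from an iterable of FASTA lines."""
    records = {}
    gi = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            # Assumption: the GI number is the second "|"-separated field.
            gi = header.split("|")[1]
            records[gi] = (header, "")
        elif gi is not None:
            # A sequence may be wrapped over several lines; join them.
            header, sequence = records[gi]
            records[gi] = (header, sequence + line)
    return records


def find_human(records):
    """Return the GI of the human sequence (assumed header suffix: _HUMAN)."""
    for gi, (header, _sequence) in records.items():
        if header.endswith("_HUMAN"):
            return gi
    raise LookupError("no header ending in _HUMAN found")
```

With a real file this would be called as read_fasta(open(path)), where path is whatever the first input file is named.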
Let me write down a short example.
>gi|269849759|sp|P04637.4|P53_HUMAN
MEEPQSAQWCCTV
>gi|143235535|dbj|46784|P53_MUS
--FKQSARTYC-V
>gi|343556|emb|3432384|P53_EQUUS
--LKQSAPTY--S
and the variant list contains:
E3P
Q8R
T12C
The desired output would then look like this:
Variant  Deviations                  Pathogenic_Deviations
E3P      F (143235535), L (343556)   None
Q8R      P (343556)                  R (143235535)
T12C     None                        None
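For reference, here is a rough Python sketch of steps 3 to 6 applied to the short example above. All names (ALIGNMENT, HUMAN_GI, column_for_position, check_variant) are invented for illustration. It assumes all sequences are aligned to the same length, that gaps ("-") in other species are ignored, and that gaps in the human sequence do not count towards the position.

```python
GAP = "-"

# The short example, keyed by GI number.
ALIGNMENT = {
    "269849759": "MEEPQSAQWCCTV",  # P53_HUMAN
    "143235535": "--FKQSARTYC-V",  # P53_MUS
    "343556": "--LKQSAPTY--S",     # P53_EQUUS
}
HUMAN_GI = "269849759"


def column_for_position(human_seq, position):
    """Map a 1-based human residue position to a 0-based alignment column."""
    seen = 0
    for column, letter in enumerate(human_seq):
        if letter != GAP:
            seen += 1
            if seen == position:
                return column
    raise ValueError(f"position {position} is beyond the human sequence")


def check_variant(variant, alignment, human_gi):
    """Return (deviations, pathogenic) as lists of (letter, gi) pairs."""
    source, position, sink = variant[0], int(variant[1:-1]), variant[-1]
    human_seq = alignment[human_gi]
    column = column_for_position(human_seq, position)
    # Step 4: confirm that the human sequence carries the healthy state.
    if human_seq[column] != source:
        raise ValueError(f"{variant}: human has {human_seq[column]}, not {source}")
    deviations, pathogenic = [], []
    for gi, seq in alignment.items():
        if gi == human_gi:
            continue
        letter = seq[column]
        if letter == GAP or letter == source:
            continue  # a gap or the healthy state is not a deviation
        if letter == sink:
            pathogenic.append((letter, gi))  # step 6: matches the sink
        else:
            deviations.append((letter, gi))  # step 5: any other letter
    return deviations, pathogenic


for variant in ["E3P", "Q8R", "T12C"]:
    print(variant, *check_variant(variant, ALIGNMENT, HUMAN_GI))
```

An empty list here corresponds to "None" in the table above.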
This explanation got a bit long, sorry about that.