Thanks, guys. Sure, I can provide a sample.

A sample of the input file:

>gi|269849759|sp|P04637.4|P53_HUMAN
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAA....
>gi|143235535|dbj|46784|P53_MUS
----MEEPQSDPSVEPPL---SQETFSDLWKLL---PENNV---LSPLPSQAMDDLMLS....

Above is a sample of the input file, which actually contains 20 such sequences. The first sequence in the sample is the protein in human, while the second is the same protein in mouse. I have another 18 species for which sequences of the same protein exist.
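For illustration, here is a minimal sketch of reading such a file into a hash keyed by GI number. It assumes every header has the pipe-separated shape shown in the sample (>gi|GI|db|accession|NAME) and that a sequence may span several lines. The sub name parse_fasta and the _HUMAN suffix test are assumptions made for this sketch, and the data is the shortened example from further down rather than the real file.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA text from a filehandle and return two
# hashrefs, GI => sequence and GI => name (the last header field).
sub parse_fasta {
    my ($fh) = @_;
    my (%seq, %name, $gi);
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;                  # drop newline and trailing blanks
        next unless length $line;
        if ($line =~ /^>gi\|(\d+)\|[^|]*\|[^|]*\|(\S+)/) {
            $gi        = $1;
            $name{$gi} = $2;
            $seq{$gi}  = '';
        }
        elsif ($line =~ /^>/) {
            undef $gi;                       # unrecognised header: skip its sequence
        }
        elsif (defined $gi) {
            $seq{$gi} .= $line;              # append continuation lines
        }
    }
    return (\%seq, \%name);
}

# Shortened sample data, read through an in-memory filehandle.
my $sample = <<'END';
>gi|269849759|sp|P04637.4|P53_HUMAN
MEEPQSAQWCCTV
>gi|143235535|dbj|46784|P53_MUS
--FKQSARTYC-V
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my ($seq, $name) = parse_fasta($fh);
close $fh;

# Assumption: the human entry is the one whose name ends in _HUMAN.
my ($human_gi) = grep { $name->{$_} =~ /_HUMAN\z/ } keys %$name;
print "Human GI: $human_gi\n";
```

Keeping the names in a second hash lets the human sequence be found without hard-coding its GI.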

A sample of the second input file:

P82L R112S A116V A116T [...]

This second input file contains a relatively long series of disease-associated variants. For example, P82L means that a change from P to L at the 82nd position of the human protein sequence (the first letter of the sequence is position 1) is known to be associated with some disease, say breast cancer.
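Splitting a variant into its three parts could be sketched as below. The sub name parse_variant is hypothetical, and the pattern assumes exactly one uppercase letter, then digits, then one uppercase letter, as in the samples.

```perl
use strict;
use warnings;

# Hypothetical helper: split a variant such as "P82L" into
# (source, position, sink). Returns an empty list if it does not parse.
sub parse_variant {
    my ($variant) = @_;
    return $variant =~ /\A([A-Z])(\d+)([A-Z])\z/ ? ($1, $2, $3) : ();
}

my ($source, $pos, $sink) = parse_variant('P82L');

# Positions are 1-based, so the residue sits at substr offset $pos - 1.
print "$source at position $pos -> $sink\n";
```

Returning an empty list for malformed input lets the caller skip bad entries before any string comparison is attempted on undefined values.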

I am only interested in human diseases, so the positions in the variants refer to the human protein sequence. What I would like to do is check whether the sequences of the same protein from other species have L (the state that is pathogenic in humans) at their respective 82nd position.

The GI number in each header is the unique identifier of this protein sequence in that particular species.

Thus, the algorithm that I would like to have looks like this:

1. Construct a hash with identifiers (i.e., GIs) as keys and the respective sequences as values.

2. Figure out which sequence belongs to the human species.

3. Get a variant and split it into three parts -- source, position and sink.

4. Check the human sequence to confirm the variant -- that is, check whether the human sequence indeed contains P (the healthy state) at the 82nd position.

5. Check all the remaining sequences to end up with a list of the letters that these sequences contain at their 82nd position.

6. Note which of these letters, if any, match the sink (i.e., L), which is the pathogenic state.

Let me write down a short example.

>gi|269849759|sp|P04637.4|P53_HUMAN
MEEPQSAQWCCTV
>gi|143235535|dbj|46784|P53_MUS
--FKQSARTYC-V
>gi|343556|emb|3432384|P53_EQUUS
--LKQSAPTY--S

and the variant list contains:

E3P Q8R T12C

Thus, the desirable output should look like the following:

Variant   Deviations                     Pathogenic_Deviations
E3P       F (143235535), L (343556)      None
Q8R       P (343556)                     R (143235535)
T12C      None                           None
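Steps 3 to 6, run against the short example above, might be sketched as follows. This is only a sketch under stated assumptions: the sub names (residue_at, check_variant, format_hits) are hypothetical, gap characters are ignored, and the human sequence is assumed to contain no gaps, so that a variant position is also an alignment column. If the real human sequence is gapped, positions would first have to be mapped to columns.

```perl
use strict;
use warnings;

# 1-based lookup; returns undef when the position lies outside the sequence,
# so callers never compare against an undefined value by accident.
sub residue_at {
    my ($sequence, $pos) = @_;
    return if !defined $sequence || $pos < 1 || $pos > length($sequence);
    return substr($sequence, $pos - 1, 1);
}

# Steps 3-6 for one variant. $seq is a hashref GI => aligned sequence,
# $order an arrayref of GIs in file order. Returns undef if the variant
# does not parse or the human residue is not the source (step 4).
sub check_variant {
    my ($seq, $order, $human_gi, $variant) = @_;
    my ($source, $pos, $sink) = $variant =~ /\A([A-Z])(\d+)([A-Z])\z/
        or return;

    my $human_res = residue_at($seq->{$human_gi}, $pos);
    return unless defined($human_res) && $human_res eq $source;

    my %result = (deviations => [], pathogenic => []);
    for my $gi (grep { $_ ne $human_gi } @$order) {
        my $res = residue_at($seq->{$gi}, $pos);
        # Skip missing residues, gaps, and residues identical to human.
        next if !defined($res) || $res eq '-' || $res eq $source;
        my $bucket = $res eq $sink ? 'pathogenic' : 'deviations';
        push @{ $result{$bucket} }, [$res, $gi];
    }
    return \%result;
}

# Render a list of [residue, GI] pairs, or "None" when the list is empty.
sub format_hits {
    my @hits = @_;
    return 'None' unless @hits;
    return join ', ', map { sprintf '%s (%s)', @$_ } @hits;
}

my @order = (269849759, 143235535, 343556);
my %seq   = (
    269849759 => 'MEEPQSAQWCCTV',    # P53_HUMAN
    143235535 => '--FKQSARTYC-V',    # P53_MUS
    343556    => '--LKQSAPTY--S',    # P53_EQUUS
);

print join("\t", qw(Variant Deviations Pathogenic_Deviations)), "\n";
for my $variant (qw(E3P Q8R T12C)) {
    my $r = check_variant(\%seq, \@order, 269849759, $variant);
    if (!$r) {
        warn "Skipping $variant: not confirmed in the human sequence\n";
        next;
    }
    print join("\t", $variant,
        format_hits(@{ $r->{deviations} }),
        format_hits(@{ $r->{pathogenic} })), "\n";
}
```

Checking that a residue is defined before comparing it with eq is one way to avoid the "Use of uninitialized value in string eq" warning when a variant position runs past the end of a sequence.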

This explanation took a bit long, sorry for that.


In reply to Re^6: Use of uninitialized value in string eq by sophix
in thread Use of uninitialized value in string eq by sophix
