
Okay, I didn't understand everything in the first two sentences; I assume those are genetics / bioinformatics terms :-)

Even without programming anything, it should be possible to formulate specific rules for which row is to be picked and to demonstrate, using examples, how those rules are applied. We could try to extract those rules from your descriptions (for example, Laurent_R asked if "lowest start and highest end" is a rule), but so far I don't think the descriptions have been enough. It's of course easier if you work out what those rules are and tell us (first in English, code later), and show us a couple of sample inputs and the expected output for each. Going from that to working code will then be much less difficult.

The more advanced solutions I mentioned will only be necessary if the simple, straightforward code is too slow on your input data ("premature optimization is the root of all evil").

Re^4: Making use of a hash of an array...
by Peter Keystrokes (Beadle) on Jul 19, 2017 at 20:52 UTC
    So basically I want to capture those low (start) and high (end) values, as you rightly point out. It may not be obvious, because the formatting on this forum doesn't permit it, but if you look at:
    #col_1 col_2 col_3                col_4 col_5
    GTCT   GC    TTCAGTGACTTCGAGGCGCG GC    GTCC
    This is, of course, a segment of a larger genetic sequence consisting of the nucleotides adenine (A), thymine (T), cytosine (C) and guanine (G).

    Where G binds to C

    and A binds to T

    It just so happens that in this position there is a potential for the sequence to bind in on itself, forming a hairpin. This is more popularly referred to as a 'genetic palindrome'.
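To make the pairing rule concrete, here is a minimal sketch of a strict check that two stem arms can bind to each other, i.e. that one arm is the reverse complement of the other. The function names `revcomp` and `arms_pair` are made up for illustration, and upper-case A/C/G/T input is assumed.

```perl
use strict;
use warnings;

# Reverse complement of a DNA string (A<->T, C<->G).
# Assumes upper-case input containing only A, C, G and T.
sub revcomp {
    my ($seq) = @_;
    my $rc = scalar reverse $seq;
    $rc =~ tr/ACGT/TGCA/;
    return $rc;
}

# True if every base of the left arm pairs with its partner in the right arm.
sub arms_pair {
    my ($left, $right) = @_;
    return revcomp($left) eq $right;
}

# The two stem columns (col_2 and col_4) from the example row above.
print arms_pair('GC', 'GC') ? "pairs\n" : "does not pair\n";
```

This is a strict test. Whatever produced the sample rows may well tolerate mismatches, so a failed check is better treated as a flag than as an error.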

    I'll try to explain.

    There are 5 columns.

    -Columns 2-4 represent the entire hairpin.

    -Columns 2 and 4 represent the stem

    -Column 3 represents the 'spacer' or 'gap' which forms the loop

    -Columns 1 and 5 represent the nucleotides flanking either side of the hairpin.

         ______
        /      \
       |        |    <--- The SPACER
        \      /
         \    /
         C---G       <--- The STEM
         G---C
    ____/     \_____ <--- The FLANK
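The column layout described above could be read into named fields roughly like this. It is only a sketch: it assumes each data row has the shape `start .. end flank stem spacer stem flank`, separated by whitespace, and the helper name `parse_row` and the hash keys are invented for the example.

```perl
use strict;
use warnings;

# Split one row of the form "start .. end flank stem spacer stem flank"
# into named fields. Returns undef if the line does not have that shape.
sub parse_row {
    my ($line) = @_;
    my @f = split ' ', $line;
    return unless @f == 8 && $f[1] eq '..';
    my %row;
    @row{qw(start end left_flank left_stem spacer right_stem right_flank)}
        = @f[0, 2 .. 7];
    return \%row;
}

my $row = parse_row('12 .. 35 TCT GC TTCAGTGACTTCGAGGCGCG GC AGCT');

# Columns 2-4 joined together form the whole hairpin.
my $hairpin = join '', @{$row}{qw(left_stem spacer right_stem)};

printf "%d..%d stem=%d spacer=%d hairpin=%d\n",
    $row->{start}, $row->{end},
    length($row->{left_stem}), length($row->{spacer}), length($hairpin);
# prints: 12..35 stem=2 spacer=20 hairpin=24
```

For this row, `end - start + 1` equals the hairpin length (24), which fits the reading that start and end are the coordinates of columns 2-4.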
    Now you may notice that in my data some of the rows basically represent the same sequence, except with an extended stem, such as:
    12 .. 35 TCT GC TTCAGTGACTTCGAGGCGCG GC AGCT
    11 .. 36 GTC TGC TTCAGTGACTTCGAGGCGCG GCA GCTG
    10 .. 37 GGT CTGC TTCAGTGACTTCGAGGCGCG GCAG CTGC
    9 .. 38 TGG TCTGC TTCAGTGACTTCGAGGCGCG GCAGC TGCT
    Now, of these 4 candidate hairpins, I want to choose the most extended one, which is:
    9 .. 38 TGG TCTGC TTCAGTGACTTCGAGGCGCG GCAGC TGCT
    Because its stem is more robust than the relatively flimsy stem of:
    12 .. 35 TCT GC TTCAGTGACTTCGAGGCGCG GC AGCT
    Which has a measly stem of 2 bases and probably won't maintain the hairpin structure long enough in the busyness of molecular processing to have any real molecular influence in terms of gene regulation or what have you... But then again nature/biology is full of surprises and exceptions as always... :S

    So what I wanted to do is write a script that reads in these start and end values and detects when several rows describe what is essentially one hairpin, keeping only the row with the most extended stem. As far as I know, the data will allow for this to be achieved.
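One possible way to express that selection in Perl is sketched below. It rests on an assumption that should be checked against the real data: rows count as "the same hairpin" when their spacer occupies the same position, computed as `start + length(left stem)` through `end - length(right stem)`. The function name `longest_stems` is invented for the example, and the grouping rule is a guess, not a confirmed requirement.

```perl
use strict;
use warnings;

# For each group of rows whose loop sits at the same position, keep the
# row with the longest stem (equivalently: lowest start, highest end).
# ASSUMPTION: start/end are the coordinates of columns 2-4.
sub longest_stems {
    my @lines = @_;
    my (%best, @order);
    for my $line (@lines) {
        my ($start, $dots, $end, $lflank, $lstem, $spacer, $rstem, $rflank)
            = split ' ', $line;
        next unless defined $rflank && $dots eq '..';   # skip malformed rows

        my $loop_start = $start + length($lstem);
        my $loop_end   = $end   - length($rstem);
        my $key        = "$loop_start-$loop_end";

        push @order, $key unless exists $best{$key};    # remember input order
        my $stem_len = length($lstem);
        if (!$best{$key} || $stem_len > $best{$key}{stem_len}) {
            $best{$key} = { stem_len => $stem_len, line => $line };
        }
    }
    return map { $best{$_}{line} } @order;
}

my @rows = (
    '12 .. 35 TCT GC TTCAGTGACTTCGAGGCGCG GC AGCT',
    '11 .. 36 GTC TGC TTCAGTGACTTCGAGGCGCG GCA GCTG',
    '10 .. 37 GGT CTGC TTCAGTGACTTCGAGGCGCG GCAG CTGC',
    '9 .. 38 TGG TCTGC TTCAGTGACTTCGAGGCGCG GCAGC TGCT',
);

my @picked = longest_stems(@rows);
print "$_\n" for @picked;
# prints: 9 .. 38 TGG TCTGC TTCAGTGACTTCGAGGCGCG GCAGC TGCT
```

The hash keyed on loop position holds one "best so far" record per hairpin, so the input is read once and does not need to be sorted first.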

    I hope this helps.

      Thanks very much for the explanation, I do understand it! I don't have much time right now to look at it more deeply, but I do have two more questions about the sample data you showed in the OP: What is the first column? You mentioned something about "the most energy stable hairpin", is this related and could you use the first column to select the hairpin?

      Second, do I understand correctly that the first ~13 lines of sample data in the OP contain what looks like four different hairpins? Do you need to select one out of each of the four? Or, if not and you only need to choose a single hairpin out of all of that data, by what criteria do you choose which one?