in reply to Making use of a hash of an array...

I'm not sure I understand your specification as to what the conditions are for "captures the hairpin best", but as I understand it, it has something to do with how the ranges overlap? In that case, here are some threads that might be of interest: Calculate the overlap length between two ranges, Range overlap (with a lot of ranges), or finding intermediate range values from two file columns (in the latter I mention Interval trees, which might be useful here). If your input data is large, that might mean you need a different data structure for efficient lookups.

As for reading the file into a data structure, that's probably easiest with a simple state machine type approach, in this example I'm keeping the state (the current key) as $curkey:

use warnings; use strict; use Data::Dump; my %hash; my $curkey; while (<DATA>) { chomp; if (/^>(\w+)$/) { $curkey = $1; } elsif ( my ($data,$low,$high) = /^\s*(\w+)\s+(\d+)\s*\.\.\s*(\d+)\s*$/ ) { die "value seen before header: '$_'" unless defined $curkey; push @{ $hash{$curkey} }, { data=>$data, low=>$low, high=>$high }; } else { die "don't know how to parse: '$_'" } } dd \%hash; __DATA__ >key1 Foo 10 .. 20 Bar 11 .. 21 >key2 Quz 30 .. 40 Baz 37 .. 45

Update: Minor tweaks to regex and text.

Replies are listed 'Best First'.
Re^2: Making use of a hash of an array...
by Peter Keystrokes (Beadle) on Jul 18, 2017 at 21:50 UTC

    When I say captures the hairpin best I am referring to the possible hairpin which encompasses the most residues because the end goal is to develop a general representation of any given sequence's secondary structure in terms of where hairpins pop up in the sequence. This will then be used to generate a kind of fingerprint that characterises a class of secondary structures. I have to disregard overlaps for the sake of simplicity, I am far too much of an amateur at programming to account for all the complexity in my data.

    I didn't quite understand the historic threads regarding ranges too well. I'm still trying to get to grips with some of these computer science concepts. Although I think the 'state machine' might be useful, I think that it would then just be a matter of looping through the newly created hash of arrays and calculating the ranges and feeding the highest range into a final hash.

    The thing is, is that I fear that all these loops are going to make it take a lifetime in the final analysis because I have about 140K sequences to go through with ~5+ hairpins, but for now, of course, I will test on small files.

    I will try to see if I can loop through the hash to calculate and pick out max ranges.

    Good luck me lol :/

    Thanks for your help

      Okay, I didn't understand everything in the first two sentences, I assume those are genetics / bioinformatics terms there :-)

      Even without programming anything, it should be possible to formulate specific rules for which row is to be picked, and using examples, demonstrate how those rules are applied. We could try to extract those rules from your descriptions (for example, Laurent_R asked if "lowest start and highest end" is a rule), but so far I don't think the descriptions have been enough, and of course it's easier if you work out what those rules are and tell us (first in English, code later), and show us a couple of sample inputs and the expected output for each. Then, going from that to working code will be much less difficult.

      The more advanced solutions I mentioned will only be necessary if the simple, straightforward code is too slow on your input data ("premature optimization is the root of all evil").

        So basically I want to capture those low (start) and high (end) values as you rightly point out. It's not clear enough to notice because the formatting on this forum doesn't permit it but if you look at:
        #col_1 col_2 col_3 col_4 col_5 GTCT GC TTCAGTGACTTCGAGGCGCG GC GTCC
        This of course, is a segment of a larger genetic sequence consisting of the nucleotides adenine(A), thymine(T), cytosine(C) and guanine(G).

        Where G binds to C

        and A binds to T

        It just so happens that in this position there is a potential for the sequence to bind in on itself forming a hairpin this is more popularly referred to as a 'genetic palindrome'.

        I'll try to explain.

        There are 5 columns.

        -Columns 2-4 represent the entire hairpin.

        -Columns 2 and 4 represent the stem

        -Column 3 represents the 'spacer' or 'gap' which forms the loop

        -Columns 1 and 5 represent the nucleotide flanking either side of the hairpin.

        ______ / \ | | <--- The SPACER \ / \ / C---G <--- The STEM G---C ____/ \_____ <--- The FLANK
        Now you may notice that in my data some of the rows basically represent the same sequence, except with an extended stem. Such as:
        12 .. 35 TCT GC TTCAGTGACTTCGAGGCGCG GC AGCT 11 .. 36 GTC TGC TTCAGTGACTTCGAGGCGCG GCA GCTG 10 .. 37 GGT CTGC TTCAGTGACTTCGAGGCGCG GCAG CTGC 9 .. 38 TGG TCTGC TTCAGTGACTTCGAGGCGCG GCAGC TGCT
        Now of the 4 options of a hairpin I want to choose the most extended hairpin which is:
        9 .. 38 TGG TCTGC TTCAGTGACTTCGAGGCGCG GCAGC TGCT
        Because its stem is more robust than the relatively flimsy stem of:
        12 .. 35 TCT GC TTCAGTGACTTCGAGGCGCG GC AGCT
        Which has a measly stem of 2 bases and probably won't maintain the hairpin structure long enough in the busyness of molecular processing to have any real molecular influence in terms of gene regulation or what have you... But then again nature/biology is full of surprises and exceptions as always... :S

        So what I wanted to do is write a script that reads in these start and end values and basically detects the presence of what is essentially one hairpin, by taking the hairpin with the most extended stem. As far as I know, the data will allow for this to be achieved.

        I hope this helps.

Re^2: Making use of a hash of an array...
by Peter Keystrokes (Beadle) on Jul 19, 2017 at 19:54 UTC
    Back to basics

    I'm trying to capture the start and end values. With a simple file such as:

    Regex used: /^$RE{num}{real}\s+(\d+)\s+\.\.\s+(\d+)\s*/ >hsa_circ_0075116|chr5:175956288-175956388-|NM_014901|RNF44 FORWARD -4.6 12 .. 35 xxxxGTGTGTGGTCT GC TTCAGTGACTTCGAGG +CGCG GC AGCTGCTCCGAGTCC -5.5 11 .. 36 xxxxxGTGTGTGGTC TGC TTCAGTGACTTCGAGG +CGCG GCA GCTGCTCCGAGTCCT

    I am able to capture the start and end values:

    Dumper: $VAR1 = 'hsa_circ_0075116|chr5:175956288-175956388-|NM_014901|RNF44 F +ORWARD'; $VAR2 = [ { 'end' => '35', 'start' => '12' }, { 'end' => '36', 'start' => '11' }

    But when I make a slight amendment in the regex to account for the lines which begin with a whitespace such as:

    New regex: /^(\s+)?$RE{num}{real}\s+(\d+)\s+\.\.\s+(\d+)\s*/ ## addition of (\s+)? to the beginning *\s*-5 56 .. 70 CTATGCCCCTTATTG TATCTG GGG C +AGATG ATCGTCAAGTGAAGA

    The start values become undefined:

    $VAR125 = 'hsa_circ_0067224|chr3:128345575-128345675-|NM_002950|RPN1 +FORWARD'; $VAR126 = [ { 'end' => '6', 'start' => undef }
    Are the brackets used for optional capture at the beginning of my regex confusing what is captured by my $start and $end variables?

    Whole script so far:

    #!/usr/bin/perl use strict; use warnings; use Data::Dumper; use Regexp::Common qw /number/; open my $hairpin_file, '<', "new_xt_spacer_results.hairpin", or die $! +; my %HoA_sequences; my $curkey; while (<$hairpin_file>){ chomp; if (/^>(\w+\d+\|\w+:\d+-\d+[-|+]\|\w+\|\w+\s+\w+$)/){ $curkey = $1; }elsif (my ($start, $end) = /^(\s+)?$RE{num}{real}\s+(\d+)\s+\.\.\s+(\d+)\s*/ ) { die "value seen before header: '$_'" unless defined $curkey; push @{ $HoA_sequences{$curkey}}, { start=>$start, end=>$end }; } else { die "don't know how to parse: '$_'" } } print Dumper(%HoA_sequences);

      Are the brackets used for optional capture at the beginning of my regex confusing what is captured by my $start and $end variables?

      Yep (you could have tested this yourself). That is capturing the leading space into $1 and shifting the other two captures down to $2 and $3.

      Since you don't need to capture the leading space, don't use parentheses (or, when you must use parentheses but don't want to capture the match, use, um ... non-capturing parentheses, like: (?:foo|bar)). Since you don't know if there will be any matches, use the zero-or-more quantifier. Maybe you want:

      /^ \s* $RE{num}{real} \s+ (\d+) \s+ \. \. \s+ (\d+) \s* /x


      The way forward always starts with a minimal test.
        Aha, thanks.