Making use of a hash of an array...

Peter Keystrokes has asked for the wisdom of the Perl Monks concerning the following question:

...which is populated through a loop and if statements
So I have some data:

 
>hsa_circ_0075116|chr5:175956288-175956388-|NM_014901|RNF44  FORWARD
-4.6      12 ..      35   xxxxGTGTGTGGTCT          GC TTCAGTGACTTCGAGG
+CGCG GC           AGCTGCTCCGAGTCC
-5.5      11 ..      36   xxxxxGTGTGTGGTC         TGC TTCAGTGACTTCGAGG
+CGCG GCA          GCTGCTCCGAGTCCT
-7.8      10 ..      37   xxxxxxGTGTGTGGT        CTGC TTCAGTGACTTCGAGG
+CGCG GCAG         CTGCTCCGAGTCCTC
-4.3       9 ..      38   xxxxxxxGTGTGTGG       TCTGC TTCAGTGACTTCGAGG
+CGCG GCAGC        TGCTCCGAGTCCTCC
-4.6      31 ..      41   CAGTGACTTCGAGGC                 GCGG CAG CTG
+C                 TCCGAGTCCTCCCCT
-5.7      28 ..      44   CTTCAGTGACTTCGA              GGCGCGG CAG CTG
+CTCC              GAGTCCTCCCCTGCA
-5.1      20 ..      49   GTGGTCTGCTTCAGT       GACTT CGAGGCGCGGCAGCTG
+CTCC GAGTC        CTCCCCTGCAACCAT
-4.3      27 ..      56   GCTTCAGTGACTTCG       AGGCG CGGCAGCTGCTCCGAG
+TCCT CCCCT        GCAACCATGAGTTCC
-5.6      31 ..      58   CAGTGACTTCGAGGC        GCGG CAGCTGCTCCGAGTCC
+TCCC CTGC         AACCATGAGTTCCAC
-5.4      72 ..      82   GCAACCATGAGTTCC                 ACAC CAA GTG
+T                 GTTGACAAGTGGTTG
-7.7      71 ..      83   TGCAACCATGAGTTC                CACAC CAA GTG
+TG                TTGACAAGTGGTTGA
-4.2      70 ..      84   CTGCAACCATGAGTT               CCACAC CAA GTG
+TGT               TGACAAGTGGTTGAA
>hsa_circ_0014931|chr1:160293220-160293320-|NM_001098398|COPA  FORWARD
-5.5      11 ..      36   xxxxxGGTCACGATC         GTG GAGTAAACTGGGCTGC
+CTTC CAC          CCCACTATGCCCCTT
-4.5      22 ..      40   GATCGTGGAGTAAAC             TGGGCTG CCTTC CA
+CCCCA             CTATGCCCCTTATTG
-4.1      11 ..      41   xxxxxGGTCACGATC      GTGGAG TAAACTGGGCTGCCTT
+CCAC C-CCAC       TATGCCCCTTATTGT
[download]

And I want to create a hash which has the name of the sequence 'hsa_circ...' as the key and the hash itself will be an array, containing a series of 'starts' and 'ends' which are chosen after going through some conditional if statements.

For example with respect to hsa_circ_0075116|chr5:175956288-175956388-|NM_014901|RNF44 FORWARD I want to capture this as the hash key. But then as I read the file in line by line I want to ask if statements regarding the start and end position values and then push them into the array of their respective hash.

So my problem is twofold:

(1) I want to be able to create a hash of arrays that is added to as the file is read

(2) I want to be able to read in line by line asking if statements of whether the start value and end value is within a positional vicinity of 6 such that it is essentially 1 hairpin and in which case I push into my array the start and end values which have the greatest range.

For example in:

-4.6      <b>12 ..      35</b>   xxxxGTGTGTGGTCT          GC TTCAGTGAC
+TTCGAGGCGCG GC           AGCTGCTCCGAGTCC
-5.5      <b>11 ..      36</b>   xxxxxGTGTGTGGTC         TGC TTCAGTGAC
+TTCGAGGCGCG GCA          GCTGCTCCGAGTCCT
-7.8      <b>10 ..      37</b>   xxxxxxGTGTGTGGT        CTGC TTCAGTGAC
+TTCGAGGCGCG GCAG         CTGCTCCGAGTCCTC
-4.3       <b>9 ..      38</b>   xxxxxxxGTGTGTGG       TCTGC TTCAGTGAC
+TTCGAGGCGCG GCAGC        TGCTCCGAGTCCTCC
-4.6      31 ..      41   CAGTGACTTCGAGGC                 GCGG CAG CTG
+C                 TCCGAGTCCTCCCCT
[download]

The start positions and end positions surrounded by bold tags are essentially the same hairpin, so all except 9 and 38 will be ignored since it is this one that captures the hairpin best.

I am struggling to programmatically articulate these if statements though. Essentially I am trying to read each line capturing the values and assessing whether the next line bears values which are $start-1 and $end+1 and if they are, then they become the $start and $end and this is repeated until -/+6 of the orginal $start and $end value is reached

Trying to populate the hash of arrays:

#!/usr/bin/perl 
use strict;
use warnings;

open my $hairpin_file, '<', "test.hairpin", or die $!;

my %HoA_sequences;

while (my $line = <$hairpin_file>){
    if ($line =~ /^>hsa/){
        $HoA_sequences{$line} = 
## Do I provide the key for the hash as soon as I read in the line bea
+ring the sequence name?
## And can I create a hash with a key but no hash? I'm guessing not. :
+/
    }

}
[download]

Trying to develop the if statements:



if ($line =~ /$RE{num}{real}\s+($RE{num}{real})\s..+($RE{num}{real})+/
+){
    my $start = $1 && my $end = $2;
## Once captured how do I continue to the next line to query if the 
## value is - or + of the start and end values
}
[download]

Comment on Making use of a hash of an array... Select or Download Code

Replies are listed 'Best First'.
Re: Making use of a hash of an array... by haukex (Archbishop) on Jul 18, 2017 at 16:26 UTC
I'm not sure I understand your specification as to what the conditions are for "captures the hairpin best", but as I understand it, it has something to do with how the ranges overlap? In that case, here are some threads that might be of interest: Calculate the overlap length between two ranges, Range overlap (with a lot of ranges), or finding intermediate range values from two file columns (in the latter I mention Interval trees, which might be useful here). If your input data is large, that might mean you need a different data structure for efficient lookups. As for reading the file into a data structure, that's probably easiest with a simple state machine type approach, in this example I'm keeping the state (the current key) as `$curkey`: `use warnings; use strict; use Data::Dump; my %hash; my $curkey; while (<DATA>) { chomp; if (/^>(\w+)$/) { $curkey = $1; } elsif ( my ($data,$low,$high) = /^\s(\w+)\s+(\d+)\s\.\.\s(\d+)\s$/ ) { die "value seen before header: '$_'" unless defined $curkey; push @{ $hash{$curkey} }, { data=>$data, low=>$low, high=>$high }; } else { die "don't know how to parse: '$_'" } } dd \%hash; __DATA__ >key1 Foo 10 .. 20 Bar 11 .. 21 >key2 Quz 30 .. 40 Baz 37 .. 45` [download] Update: Minor tweaks to regex and text.	[reply] [d/l] [select]
Re^2: Making use of a hash of an array... by Peter Keystrokes (Beadle) on Jul 18, 2017 at 21:50 UTC
When I say captures the hairpin best I am referring to the possible hairpin which encompasses the most residues because the end goal is to develop a general representation of any given sequence's secondary structure in terms of where hairpins pop up in the sequence. This will then be used to generate a kind of fingerprint that characterises a class of secondary structures. I have to disregard overlaps for the sake of simplicity, I am far too much of an amateur at programming to account for all the complexity in my data. I didn't quite understand the historic threads regarding ranges too well. I'm still trying to get to grips with some of these computer science concepts. Although I think the 'state machine' might be useful, I think that it would then just be a matter of looping through the newly created hash of arrays and calculating the ranges and feeding the highest range into a final hash. The thing is, is that I fear that all these loops are going to make it take a lifetime in the final analysis because I have about 140K sequences to go through with ~5+ hairpins, but for now, of course, I will test on small files. I will try to see if I can loop through the hash to calculate and pick out max ranges. Good luck me lol :/ Thanks for your help	[reply]
Re^3: Making use of a hash of an array... by haukex (Archbishop) on Jul 19, 2017 at 12:17 UTC
Okay, I didn't understand everything in the first two sentences, I assume those are genetics / bioinformatics terms there :-) Even without programming anything, it should be possible to formulate specific rules for which row is to be picked, and using examples, demonstrate how those rules are applied. We could try to extract those rules from your descriptions (for example, Laurent_R asked if "lowest start and highest end" is a rule), but so far I don't think the descriptions have been enough, and of course it's easier if you work out what those rules are and tell us (first in English, code later), and show us a couple of sample inputs and the expected output for each. Then, going from that to working code will be much less difficult. The more advanced solutions I mentioned will only be necessary if the simple, straightforward code is too slow on your input data ("premature optimization is the root of all evil").	[reply]
Re^4: Making use of a hash of an array... by Peter Keystrokes (Beadle) on Jul 19, 2017 at 20:52 UTC
Re^5: Making use of a hash of an array... by haukex (Archbishop) on Jul 23, 2017 at 20:44 UTC
Re^2: Making use of a hash of an array... by Peter Keystrokes (Beadle) on Jul 19, 2017 at 19:54 UTC
Back to basics I'm trying to capture the start and end values. With a simple file such as: `Regex used: /^$RE{num}{real}\s+(\d+)\s+\.\.\s+(\d+)\s/ >hsa_circ_0075116\|chr5:175956288-175956388-\|NM_014901\|RNF44 FORWARD -4.6 12 .. 35 xxxxGTGTGTGGTCT GC TTCAGTGACTTCGAGG +CGCG GC AGCTGCTCCGAGTCC -5.5 11 .. 36 xxxxxGTGTGTGGTC TGC TTCAGTGACTTCGAGG +CGCG GCA GCTGCTCCGAGTCCT` [download] I am able to capture the start and end values: `Dumper: $VAR1 = 'hsa_circ_0075116\|chr5:175956288-175956388-\|NM_014901\|RNF44 F +ORWARD'; $VAR2 = [ { 'end' => '35', 'start' => '12' }, { 'end' => '36', 'start' => '11' }` [download] But when I make a slight amendment in the regex to account for the lines which begin with a whitespace such as: `New regex: /^(\s+)?$RE{num}{real}\s+(\d+)\s+\.\.\s+(\d+)\s/ ## addition of (\s+)? to the beginning \s-5 56 .. 70 CTATGCCCCTTATTG TATCTG GGG C +AGATG ATCGTCAAGTGAAGA` [download] The start values become undefined: `$VAR125 = 'hsa_circ_0067224\|chr3:128345575-128345675-\|NM_002950\|RPN1 +FORWARD'; $VAR126 = [ { 'end' => '6', 'start' => undef }` [download] Are the brackets used for optional capture at the beginning of my regex confusing what is captured by my `$start` and `$end` variables? Whole script so far: #!/usr/bin/perl use strict; use warnings; use Data::Dumper; use Regexp::Common qw /number/; open my $hairpin_file, '<', "new_xt_spacer_results.hairpin", or die $! +; my %HoA_sequences; my $curkey; while (<$hairpin_file>){ chomp; if (/^>(\w+\d+\\|\w+:\d+-\d+[-\|+]\\|\w+\\|\w+\s+\w+$)/){ $curkey = $1; }elsif (my ($start, $end) = /^(\s+)?$RE{num}{real}\s+(\d+)\s+\.\.\s+(\d+)\s*/ ) { die "value seen before header: '$_'" unless defined $curkey; push @{ $HoA_sequences{$curkey}}, { start=>$start, end=>$end }; } else { die "don't know how to parse: '$_'" } } print Dumper(%HoA_sequences); [download]	[reply] [d/l] [select]
Re^3: Making use of a hash of an array... by 1nickt (Canon) on Jul 19, 2017 at 20:47 UTC
Are the brackets used for optional capture at the beginning of my regex confusing what is captured by my $start and $end variables? Yep (you could have tested this yourself). That is capturing the leading space into `$1` and shifting the other two captures down to `$2` and `$3`. Since you don't need to capture the leading space, don't use parentheses (or, when you must use parentheses but don't want to capture the match, use, um ... non-capturing parentheses, like: `(?:foo\|bar)`). Since you don't know if there will be any matches, use the zero-or-more quantifier. Maybe you want: `/^ \s* $RE{num}{real} \s+ (\d+) \s+ \. \. \s+ (\d+) \s* /x` [download] The way forward always starts with a minimal test.	[reply] [d/l] [select]
Re^4: Making use of a hash of an array... by Peter Keystrokes (Beadle) on Jul 19, 2017 at 20:57 UTC
Re: Making use of a hash of an array... by Laurent_R (Canon) on Jul 18, 2017 at 17:15 UTC
In your example, I understand that 9 .. 38 is the "best hairpin," since it is the lowest start and highest end. Fair enough. But what if the ranges overlap? Say, you have 9 .. 38 and 8 .. 37? Anyway, for populating your HOA, assuming the `hsa_circ_...` id is in $key, and the hairpin is delimited by $start and $end, you can use the following syntax: `$HoA_sequences{$key} = [$start, $end];` [download]	[reply] [d/l] [select]
Re^2: Making use of a hash of an array... by Peter Keystrokes (Beadle) on Jul 18, 2017 at 18:02 UTC
Okay I will try that. I'm sure it's possible for those overlaps to occur but if given multiple options, I have to settle for the most energy stable hairpin, which is why prior to this stage the energy scores were filtered for the most stable ones. So far as I can detect there aren't any overlaps of the variety you describe.	[reply]
Re: Making use of a hash of an array... by 1nickt (Canon) on Jul 18, 2017 at 16:26 UTC
## Do I provide the key for the hash as soon as I read in the line bearing the sequence name? No, you've already 'created' the key in the hash by assigning a value to it with `$HoA_sequences{$line} =` Are you using Data::Dumper or similar and liberally placed debug print statements to see what is going on with your data structure? I recommend creating a simple dummy data set and doing lots of trial and error and debug printing. Much more effective than any other method of learning this stuff. ## And can I create a hash with a key but no hash? I'm guessing not. :/ If you mean "a key with no value", sure you can. `my %hash = ( foo => 'bar', baz => '', qux => undef );` [download] Now print that with Data::Dumper and see what it looks like. ## Once captured how do I continue to the next line to query if ... See the flip-flop operator for one way. Update: added link and examples The way forward always starts with a minimal test.	[reply] [d/l] [select]