Peter Keystrokes has asked for the wisdom of the Perl Monks concerning the following question:

...which is populated through a loop and if statements

So I have some data:
>hsa_circ_0075116|chr5:175956288-175956388-|NM_014901|RNF44 FORWARD -4.6 12 .. 35 xxxxGTGTGTGGTCT GC TTCAGTGACTTCGAGG +CGCG GC AGCTGCTCCGAGTCC -5.5 11 .. 36 xxxxxGTGTGTGGTC TGC TTCAGTGACTTCGAGG +CGCG GCA GCTGCTCCGAGTCCT -7.8 10 .. 37 xxxxxxGTGTGTGGT CTGC TTCAGTGACTTCGAGG +CGCG GCAG CTGCTCCGAGTCCTC -4.3 9 .. 38 xxxxxxxGTGTGTGG TCTGC TTCAGTGACTTCGAGG +CGCG GCAGC TGCTCCGAGTCCTCC -4.6 31 .. 41 CAGTGACTTCGAGGC GCGG CAG CTG +C TCCGAGTCCTCCCCT -5.7 28 .. 44 CTTCAGTGACTTCGA GGCGCGG CAG CTG +CTCC GAGTCCTCCCCTGCA -5.1 20 .. 49 GTGGTCTGCTTCAGT GACTT CGAGGCGCGGCAGCTG +CTCC GAGTC CTCCCCTGCAACCAT -4.3 27 .. 56 GCTTCAGTGACTTCG AGGCG CGGCAGCTGCTCCGAG +TCCT CCCCT GCAACCATGAGTTCC -5.6 31 .. 58 CAGTGACTTCGAGGC GCGG CAGCTGCTCCGAGTCC +TCCC CTGC AACCATGAGTTCCAC -5.4 72 .. 82 GCAACCATGAGTTCC ACAC CAA GTG +T GTTGACAAGTGGTTG -7.7 71 .. 83 TGCAACCATGAGTTC CACAC CAA GTG +TG TTGACAAGTGGTTGA -4.2 70 .. 84 CTGCAACCATGAGTT CCACAC CAA GTG +TGT TGACAAGTGGTTGAA >hsa_circ_0014931|chr1:160293220-160293320-|NM_001098398|COPA FORWARD -5.5 11 .. 36 xxxxxGGTCACGATC GTG GAGTAAACTGGGCTGC +CTTC CAC CCCACTATGCCCCTT -4.5 22 .. 40 GATCGTGGAGTAAAC TGGGCTG CCTTC CA +CCCCA CTATGCCCCTTATTG -4.1 11 .. 41 xxxxxGGTCACGATC GTGGAG TAAACTGGGCTGCCTT +CCAC C-CCAC TATGCCCCTTATTGT

And I want to create a hash which has the name of the sequence 'hsa_circ...' as the key and the hash itself will be an array, containing a series of 'starts' and 'ends' which are chosen after going through some conditional if statements.

For example with respect to hsa_circ_0075116|chr5:175956288-175956388-|NM_014901|RNF44 FORWARD I want to capture this as the hash key. But then as I read the file in line by line I want to ask if statements regarding the start and end position values and then push them into the array of their respective hash.

So my problem is twofold:

(1) I want to be able to create a hash of arrays that is added to as the file is read

(2) I want to be able to read in line by line asking if statements of whether the start value and end value is within a positional vicinity of 6 such that it is essentially 1 hairpin and in which case I push into my array the start and end values which have the greatest range.

For example in:
-4.6 <b>12 .. 35</b> xxxxGTGTGTGGTCT GC TTCAGTGAC +TTCGAGGCGCG GC AGCTGCTCCGAGTCC -5.5 <b>11 .. 36</b> xxxxxGTGTGTGGTC TGC TTCAGTGAC +TTCGAGGCGCG GCA GCTGCTCCGAGTCCT -7.8 <b>10 .. 37</b> xxxxxxGTGTGTGGT CTGC TTCAGTGAC +TTCGAGGCGCG GCAG CTGCTCCGAGTCCTC -4.3 <b>9 .. 38</b> xxxxxxxGTGTGTGG TCTGC TTCAGTGAC +TTCGAGGCGCG GCAGC TGCTCCGAGTCCTCC -4.6 31 .. 41 CAGTGACTTCGAGGC GCGG CAG CTG +C TCCGAGTCCTCCCCT
The start positions and end positions surrounded by bold tags are essentially the same hairpin, so all except 9 and 38 will be ignored since it is this one that captures the hairpin best.

I am struggling to programmatically articulate these if statements though. Essentially I am trying to read each line capturing the values and assessing whether the next line bears values which are $start-1 and $end+1 and if they are, then they become the $start and $end and this is repeated until -/+6 of the orginal $start and $end value is reached

Trying to populate the hash of arrays:
#!/usr/bin/perl use strict; use warnings; open my $hairpin_file, '<', "test.hairpin", or die $!; my %HoA_sequences; while (my $line = <$hairpin_file>){ if ($line =~ /^>hsa/){ $HoA_sequences{$line} = ## Do I provide the key for the hash as soon as I read in the line bea +ring the sequence name? ## And can I create a hash with a key but no hash? I'm guessing not. : +/ } }
Trying to develop the if statements:
if ($line =~ /$RE{num}{real}\s+($RE{num}{real})\s..+($RE{num}{real})+/ +){ my $start = $1 && my $end = $2; ## Once captured how do I continue to the next line to query if the ## value is - or + of the start and end values }

Replies are listed 'Best First'.
Re: Making use of a hash of an array...
by haukex (Archbishop) on Jul 18, 2017 at 16:26 UTC

    I'm not sure I understand your specification as to what the conditions are for "captures the hairpin best", but as I understand it, it has something to do with how the ranges overlap? In that case, here are some threads that might be of interest: Calculate the overlap length between two ranges, Range overlap (with a lot of ranges), or finding intermediate range values from two file columns (in the latter I mention Interval trees, which might be useful here). If your input data is large, that might mean you need a different data structure for efficient lookups.

    As for reading the file into a data structure, that's probably easiest with a simple state machine type approach, in this example I'm keeping the state (the current key) as $curkey:

    use warnings; use strict; use Data::Dump; my %hash; my $curkey; while (<DATA>) { chomp; if (/^>(\w+)$/) { $curkey = $1; } elsif ( my ($data,$low,$high) = /^\s*(\w+)\s+(\d+)\s*\.\.\s*(\d+)\s*$/ ) { die "value seen before header: '$_'" unless defined $curkey; push @{ $hash{$curkey} }, { data=>$data, low=>$low, high=>$high }; } else { die "don't know how to parse: '$_'" } } dd \%hash; __DATA__ >key1 Foo 10 .. 20 Bar 11 .. 21 >key2 Quz 30 .. 40 Baz 37 .. 45

    Update: Minor tweaks to regex and text.

      When I say captures the hairpin best I am referring to the possible hairpin which encompasses the most residues because the end goal is to develop a general representation of any given sequence's secondary structure in terms of where hairpins pop up in the sequence. This will then be used to generate a kind of fingerprint that characterises a class of secondary structures. I have to disregard overlaps for the sake of simplicity, I am far too much of an amateur at programming to account for all the complexity in my data.

      I didn't quite understand the historic threads regarding ranges too well. I'm still trying to get to grips with some of these computer science concepts. Although I think the 'state machine' might be useful, I think that it would then just be a matter of looping through the newly created hash of arrays and calculating the ranges and feeding the highest range into a final hash.

      The thing is, is that I fear that all these loops are going to make it take a lifetime in the final analysis because I have about 140K sequences to go through with ~5+ hairpins, but for now, of course, I will test on small files.

      I will try to see if I can loop through the hash to calculate and pick out max ranges.

      Good luck me lol :/

      Thanks for your help

        Okay, I didn't understand everything in the first two sentences, I assume those are genetics / bioinformatics terms there :-)

        Even without programming anything, it should be possible to formulate specific rules for which row is to be picked, and using examples, demonstrate how those rules are applied. We could try to extract those rules from your descriptions (for example, Laurent_R asked if "lowest start and highest end" is a rule), but so far I don't think the descriptions have been enough, and of course it's easier if you work out what those rules are and tell us (first in English, code later), and show us a couple of sample inputs and the expected output for each. Then, going from that to working code will be much less difficult.

        The more advanced solutions I mentioned will only be necessary if the simple, straightforward code is too slow on your input data ("premature optimization is the root of all evil").

      Back to basics

      I'm trying to capture the start and end values. With a simple file such as:

      Regex used: /^$RE{num}{real}\s+(\d+)\s+\.\.\s+(\d+)\s*/ >hsa_circ_0075116|chr5:175956288-175956388-|NM_014901|RNF44 FORWARD -4.6 12 .. 35 xxxxGTGTGTGGTCT GC TTCAGTGACTTCGAGG +CGCG GC AGCTGCTCCGAGTCC -5.5 11 .. 36 xxxxxGTGTGTGGTC TGC TTCAGTGACTTCGAGG +CGCG GCA GCTGCTCCGAGTCCT

      I am able to capture the start and end values:

      Dumper: $VAR1 = 'hsa_circ_0075116|chr5:175956288-175956388-|NM_014901|RNF44 F +ORWARD'; $VAR2 = [ { 'end' => '35', 'start' => '12' }, { 'end' => '36', 'start' => '11' }

      But when I make a slight amendment in the regex to account for the lines which begin with a whitespace such as:

      New regex: /^(\s+)?$RE{num}{real}\s+(\d+)\s+\.\.\s+(\d+)\s*/ ## addition of (\s+)? to the beginning *\s*-5 56 .. 70 CTATGCCCCTTATTG TATCTG GGG C +AGATG ATCGTCAAGTGAAGA

      The start values become undefined:

      $VAR125 = 'hsa_circ_0067224|chr3:128345575-128345675-|NM_002950|RPN1 +FORWARD'; $VAR126 = [ { 'end' => '6', 'start' => undef }
      Are the brackets used for optional capture at the beginning of my regex confusing what is captured by my $start and $end variables?

      Whole script so far:

      #!/usr/bin/perl use strict; use warnings; use Data::Dumper; use Regexp::Common qw /number/; open my $hairpin_file, '<', "new_xt_spacer_results.hairpin", or die $! +; my %HoA_sequences; my $curkey; while (<$hairpin_file>){ chomp; if (/^>(\w+\d+\|\w+:\d+-\d+[-|+]\|\w+\|\w+\s+\w+$)/){ $curkey = $1; }elsif (my ($start, $end) = /^(\s+)?$RE{num}{real}\s+(\d+)\s+\.\.\s+(\d+)\s*/ ) { die "value seen before header: '$_'" unless defined $curkey; push @{ $HoA_sequences{$curkey}}, { start=>$start, end=>$end }; } else { die "don't know how to parse: '$_'" } } print Dumper(%HoA_sequences);

        Are the brackets used for optional capture at the beginning of my regex confusing what is captured by my $start and $end variables?

        Yep (you could have tested this yourself). That is capturing the leading space into $1 and shifting the other two captures down to $2 and $3.

        Since you don't need to capture the leading space, don't use parentheses (or, when you must use parentheses but don't want to capture the match, use, um ... non-capturing parentheses, like: (?:foo|bar)). Since you don't know if there will be any matches, use the zero-or-more quantifier. Maybe you want:

        /^ \s* $RE{num}{real} \s+ (\d+) \s+ \. \. \s+ (\d+) \s* /x


        The way forward always starts with a minimal test.
Re: Making use of a hash of an array...
by Laurent_R (Canon) on Jul 18, 2017 at 17:15 UTC
    In your example, I understand that 9 .. 38 is the "best hairpin," since it is the lowest start and highest end. Fair enough. But what if the ranges overlap? Say, you have 9 .. 38 and 8 .. 37?

    Anyway, for populating your HOA, assuming the hsa_circ_... id is in $key, and the hairpin is delimited by $start and $end, you can use the following syntax:

    $HoA_sequences{$key} = [$start, $end];
      Okay I will try that. I'm sure it's possible for those overlaps to occur but if given multiple options, I have to settle for the most energy stable hairpin, which is why prior to this stage the energy scores were filtered for the most stable ones. So far as I can detect there aren't any overlaps of the variety you describe.
Re: Making use of a hash of an array...
by 1nickt (Canon) on Jul 18, 2017 at 16:26 UTC

    ## Do I provide the key for the hash as soon as I read in the line bearing the sequence name?

    No, you've already 'created' the key in the hash by assigning a value to it with $HoA_sequences{$line} =

    Are you using Data::Dumper or similar and liberally placed debug print statements to see what is going on with your data structure? I recommend creating a simple dummy data set and doing lots of trial and error and debug printing. Much more effective than any other method of learning this stuff.

    ## And can I create a hash with a key but no hash? I'm guessing not. :/

    If you mean "a key with no value", sure you can.

    my %hash = ( foo => 'bar', baz => '', qux => undef );
    Now print that with Data::Dumper and see what it looks like.

    ## Once captured how do I continue to the next line to query if ...

    See the flip-flop operator for one way.

    Update: added link and examples


    The way forward always starts with a minimal test.