Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I am working with protein sequences and I have created a hash of arrays which holds the position and the respective amino acid, like:
1 M, 2 K, 3 L,
$HoA_sequence{$protein} = [$position, $aminoacid];
My problem is that, in some cases, there are missing amino acids, and I have a "jump" in the numbering, something like:
67 A, 68 S, 77 W, 78 P, 79 I

So, can you help me with how I can, by iterating over my hash of arrays, add "-" in the missing positions? So that the final HoA would be this:
67 A, 68 S, 69 -, 70 -, 71 -, 72 -, 73 -, 74 -, 75 -, 76 -, 77 W, 78 P, 79 I

Replies are listed 'Best First'.
Re: How to add missing part in a Hash of Array
by Athanasius (Archbishop) on May 01, 2014 at 11:48 UTC

    Here is one way:

    #! perl
    use strict;
    use warnings;
    use Data::Dump;

    my %HoA_sequence = (
        67 => 'A',
        68 => 'S',
        77 => 'W',
        78 => 'P',
        79 => 'I',
    );

    $HoA_sequence{$_} //= '-' for 67 .. 79;

    dd \%HoA_sequence;

    Output:

    21:40 >perl 904_SoPW.pl
    {
      67 => "A",
      68 => "S",
      69 => "-",
      70 => "-",
      71 => "-",
      72 => "-",
      73 => "-",
      74 => "-",
      75 => "-",
      76 => "-",
      77 => "W",
      78 => "P",
      79 => "I",
    }
    21:42 >

    Note: the line $HoA_sequence{$_} //= '-' for 67 .. 79; could be written more verbosely as:

    for my $protein (67 .. 79) {
        unless (exists $HoA_sequence{$protein}) {
            $HoA_sequence{$protein} = '-';
        }
    }

    which may be clearer if you’re not comfortable with Perl’s defined-or operator and statement modifiers.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      I have created the hash of arrays from a tab-separated file... Thank you for the suggestion, but the thing is that I can't know the range of missing positions beforehand... I was thinking of having a way to "check" that there is a gap in the positions array, but how?
        You might have read it too quickly, but Athanasius's solution does not rely in any way on finding the range of missing positions; it relies only on knowing the lowest and highest positions currently in the hash (i.e. the smallest and largest keys). These are easy to find, using the min and max functions of the List::Util module or rolling your own algorithm:
        my ($min, $max) = (1e10, -1e10);
        for my $value (keys %hash) {
            $min = $value if $value < $min;
            $max = $value if $value > $max;
        }
        # now $min has the smallest key and $max the highest one
        Or, if your data is small, you could possibly even use the sort function to do it in one single instruction:
        my ($min, $max) = (sort {$a<=>$b} keys %hash)[0,-1];
        But that quickly becomes rather inefficient as the data grows.

        Maybe this helps?

        use List::Util qw/min max/;
        # ...
        $HoA_sequence{$_} //= '-' for min(keys %HoA_sequence) .. max(keys %HoA_sequence);
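For the question of how to "check" for a gap, here is a minimal sketch that walks the positions in numeric order and records every jump larger than one before filling it. The hash name %residue_at, the @gaps list and the sample data are hypothetical, chosen only for illustration; the structure is a flat position => amino acid hash, as in the replies above.

```perl
use strict;
use warnings;

# Hypothetical sample data: position => amino acid, with a gap at 69..76.
my %residue_at = (67 => 'A', 68 => 'S', 77 => 'W', 78 => 'P', 79 => 'I');

# Walk the positions in numeric order and compare each one with its predecessor.
my @positions = sort { $a <=> $b } keys %residue_at;
my @gaps;    # each entry is [first_missing, last_missing]
for my $i (1 .. $#positions) {
    my ($prev, $curr) = @positions[$i - 1, $i];
    push @gaps, [$prev + 1, $curr - 1] if $curr - $prev > 1;
}

# Fill every gap that was found.
for my $gap (@gaps) {
    my ($first, $last) = @$gap;
    $residue_at{$_} = '-' for $first .. $last;
}

print join(', ', map { "$_ $residue_at{$_}" } sort { $a <=> $b } keys %residue_at), "\n";
```

Keeping the gaps in a list is optional. It is only useful if the missing ranges need to be reported as well as filled; otherwise the min/max one-liner above does the same job.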
Re: How to add missing part in a Hash of Array
by rnewsham (Curate) on May 01, 2014 at 11:52 UTC

    One simple approach is to determine the highest key. Then loop over every possible key adding '-' if there is no value for that key. I have a feeling there may be a better way to do this for large datasets but on a small simple scale it works.

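    use strict;
    use warnings;

    my %data = (
        1  => 'A',
        2  => 'B',
        3  => 'C',
        6  => 'D',
        9  => 'E',
        10 => 'F',
    );

    # Find the highest key.
    my $max_key = 0;
    for ( keys %data ) {
        $max_key = $_ if $_ > $max_key;
    }

    # Test with exists rather than truthiness, so a stored '0' or '' is never overwritten.
    for ( 1 .. $max_key ) {
        $data{$_} = '-' unless exists $data{$_};
    }

    print "$_ : $data{$_}\n" for keys %data;

    # Outputs (in hash order, so the lines are not sorted):
    # 6 : D
    # 3 : C
    # 7 : -
    # 9 : E
    # 2 : B
    # 8 : -
    # 1 : A
    # 4 : -
    # 10 : F
    # 5 : -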
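Because keys returns positions in no particular order, the filled hash still has to be sorted before it reads as a sequence. The following is only a sketch of that last step, reusing the sample data from this reply; the names @ordered and $sequence are assumptions added for illustration.

```perl
use strict;
use warnings;

my %data = (1 => 'A', 2 => 'B', 3 => 'C', 6 => 'D', 9 => 'E', 10 => 'F');

# Positions start at 1 here, as in the reply above; the upper bound is the largest key.
my ($max_key) = sort { $b <=> $a } keys %data;
$data{$_} //= '-' for 1 .. $max_key;

# Sort numerically; a plain sort would compare as strings and put 10 before 2.
my @ordered  = sort { $a <=> $b } keys %data;
my $sequence = join '', @data{@ordered};

print "$_ : $data{$_}\n" for @ordered;
print "$sequence\n";    # ABC--D--EF
```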
Re: How to add missing part in a Hash of Array
by BillKSmith (Monsignor) on May 01, 2014 at 16:15 UTC
    I suspect that we have all misunderstood your data structure. The only code that you provided does indeed reference a "hash of arrays". It associates an array consisting of one position and the name of one amino acid with each protein. (All the responses so far assume that you have an array of these arrays for each protein). It would be easier to work with a hash of hashes which associates an amino acid with its protein and its position.
    Bill
      Indeed, I fixed it with Hash of Hashes!
      Thank you all for your help guys!
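The original poster did not show the hash-of-hashes version, so the following is only a sketch of what it might look like. It assumes a tab-separated input with three columns (protein name, position, amino acid); that column layout, the sample records and the name %sequence_of are all hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical tab-separated records: protein name, position, amino acid.
my @lines = (
    "protA\t67\tA",
    "protA\t68\tS",
    "protA\t71\tW",
    "protB\t1\tM",
    "protB\t3\tL",
);

# $sequence_of{$protein}{$position} = $amino_acid
my %sequence_of;
for my $line (@lines) {
    my ($protein, $position, $amino_acid) = split /\t/, $line;
    $sequence_of{$protein}{$position} = $amino_acid;
}

# Fill the gaps separately for each protein, between its own lowest and highest position.
for my $protein (sort keys %sequence_of) {
    my $residues  = $sequence_of{$protein};
    my @positions = keys %$residues;
    $residues->{$_} //= '-' for min(@positions) .. max(@positions);

    my @ordered = sort { $a <=> $b } keys %$residues;
    print "$protein: ", join(', ', map { "$_ $residues->{$_}" } @ordered), "\n";
}
```

With the position as the inner key, each protein gets its own range, so a gap in one protein never affects another.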
Re: How to add missing part in a Hash of Array
by ww (Archbishop) on May 01, 2014 at 11:49 UTC
    What have you tried?

    This is not a code-writing service; it's a tutorial, in the sense used by the British U.

    Cf: On asking for help and How do I post a question effectively?

    Update: I asked AnomalousMonk for a suggestion for better phrasing of what even I knew (at the time) to be fair game for requests to distinguish the "tutorial" in para 2 from the sense of "tutorial" at Wossamotta U. He offered an excellent response:

    Maybe something like "... a tutorial, in the sense used by the British U, i.e., a collegiate effort at self-directed learning — and we, your colleagues, are here to help your effort."

    Please direct any further upvotes to Re^2: How to add missing part in a Hash of Array and its author!



    Quis custodiet ipsos custodes. Juvenal, Satires

      As opposed to the sense used by Wossamotta U?