taking a subset of HoHoA, retaining relative structure?

mdunnbass has asked for the wisdom of the Perl Monks concerning the following question:

Gretings Monks!

I am trying to search a Hash of Hashes of Arrays (%HoHoA), copying all matches to a new Hash of Arrays of Hashes of Arrays (%HoAoHoA). Something like the following:

$interval = <STDIN>; #a number, typically on the order of 500
%HoHoA = (
  'I' => {
     $a => [4,10,200,6000,7000,7100],
     $b => [75,350,5900,6402,7110],
     $c => [0,95,5990,6450] },
  'II' => {
     $a => [lots of numbers],
     $b => [lots of numbers],
     $c => [lots of numbers] },
  'III' => {
     $a => [lots of numbers],
     $b => [lots of numbers],
     $c => [lots of numbers] } );
[download]

I'd then have some nifty search routine (which is where I'm asking for help) that would for each key (I, II, III, etc), look at all the array elements for all it's sub-keys (a, b, c, etc), and from those, find all the groups that are within$interval of each other, and return these still in their respective organizational structure. so, thinking codally here,

for $RomanNumeral (keys %HoHoA) {   
# For each key RomanNumeral
  for $letter (keys sort {$a <=> $b } %{$HoHoA{$RomanNumeral}} { 
  # The sort will work here, because in this case, the $a, $b, etc are
+ indices
  # For each lettered array in that key
    $i = 0;
    for ($i < length($HoHoA{$RomanNumeral}{$letter})-1) {
      $lowerlimit = $HoHoA{$RomanNumeral}{$letter}[$i];
      $upperlimit = $interval + $lowerlimit;
      #Set a lower limit and an upper limit to the range of numbers we
+ are looking for
        if ($element >= $lowerlimit && $element <= $upperlimit) {
           save $element's value to a new array, keyed by both $RomanN
+umeral and $letter
          $element here corresponds to all values in the hash %{HoHoA{
+$RomanNumeral}}, regardless of their $letter key. 
        }
      $i++;
     }
  }
[download]

The desired output would then be a Hash whose keys are $RomanNumerals, and whose values would be Arrays. Each element of those arrays would correspond to a set of matches to all the numbers within $interval of each other, starting from the lowest element in the %{$HoHoA{$RomanNumeral}} hash, regardless of it's array of origin. The format of the elements of those arrays however, would be hashes, where the elements of the match would be stored as arrays which are the values corresponding to the $letter keys.

Given the first roman numeral in the top block of code I included, the matches would be:

$interval = 500;
%HoAoHoA = (
  'I' => [
     {$a => [4,10,200],
      $b => [75,350],
      $c => [0, 95]},

     {$a => [6000],
      $b =[5900],
      $c =>[5990]},

     {$a => [],
      $b =[6402],
      $c =>[6450]},

     {$a => [7000,7100],
      $b =[7110],
      $c =>[]},
[download]

I hope that makes sense to anyone other than me....

I'd appreciate any help anyone can offer me in figuring out the search mechanism in the middle block. I am sure that I could figure it out given enough time and Google searches, but if anyone can help me cut through it more quickly, I'd be greatly obliged.

Thanks!
Matt

Comment on taking a subset of HoHoA, retaining relative structure? Select or Download Code

Replies are listed 'Best First'.
Re: taking a subset of HoHoA, retaining relative structure? by jdporter (Paladin) on Oct 18, 2006 at 20:33 UTC
Probably the first thing I'd do is invert the HoA of the source structure, since the numbers are of primary importance; you won't need the letters ($a, etc.) until constructing the new structure. So `'I' => { $a => [4,10,200], $b => [75,350], . . .` [download] becomes `'I' => { 4 => $a, 10 => $a, 200 => $a, 75 => $b, 350 => $b, . . .` [download] But you didn't say what should happen if a number appears in more than one list, e.g. `'I' => { $a => [10,200], $b => [75,200,350], . . .` [download] Maybe that just won't happen. More importantly, what should be the behavior if a number is in two different "clusters", e.g. `$interval = 500; 'I' => { $a => [200,500,800], . . .` [download] In this case, 200 and 500 could be a cluster, and 500 and 800 could be a cluster. (Again, maybe this just won't happen.) We're building the house of the future together.	[reply] [d/l] [select]
Re^2: taking a subset of HoHoA, retaining relative structure? by mdunnbass (Monk) on Oct 18, 2006 at 20:40 UTC
Good questions, all. I didn't address them at first for 2 basic reasons. First, my post was already so long, and second, I haven't the foggiest notion of what I should do. I guess, in terms of the latter question `$interval = 500; 'I' => { $a => [200,500,800], ...` [download] I guess my thoughts are, it's not likely to happen, but can. In that case, I was sort of thinking that if they overlap by $interval/2 or more, then just lump them together. Although, this can be dealt with at a later time. As for a number appearing in more than 1 list, ideally, again, this wouldn't happen, but I have to assume it will. Always assume the user will f*** up your program somehow, right? Matt	[reply] [d/l]
Re: taking a subset of HoHoA, retaining relative structure? by Not_a_Number (Prior) on Oct 19, 2006 at 14:08 UTC
I've put together some not very elegant code that seems to do pretty much what you want. Concerning overlaps, as brought up by jdporter, your test data actually contains such an instance. Consider the sequence: `( 5900, 5990, 6000, 6402, 6450 )` If `$interval` is 500, how should this sequence be dealt with? For the purposes of my code, I interpreted your words "find all the groups that are within $interval of each other" literally, splitting this sequence into two "clusters", namely: `( 5900, 5990, 6000 )` and `( 5990, 6000, 6402, 6450 )` But the alternative solution offered in your follow-up post would be a lot easier to code. ;) The code also deals with numbers appearing in multiple lists. This is the Data::Dump output for the first roman numeral: `( "I", [ [1, [4, 10, 200], 3, [0, 95], 2, [75, 350]], [1, [6000], 3, [5990], 2, [5900]], [1, [6000], 3, [5990, 6450], 2, [6402]], [1, [7000, 7100], 2, [7110]], ], )` [download] Oh, and by the way, you shouldn't use `$a` and `$b` as variable names, or strictures (you do `use strict;` don't you?) will complain when you do a sort: `Can't use "my $a" in sort comparison at path/to/prog.pl line 30.` Anyway, here's the program: Read more... (2 kB)	[reply] [d/l] [select]
Re^2: taking a subset of HoHoA, retaining relative structure? by mdunnbass (Monk) on Oct 19, 2006 at 20:07 UTC
Thanks. I haven't had a chance yet to look over your code, but I'll take a look tomorrow. Thanks though. As for the variable names, I was only using `$a` and `$b` in the example. As it stands, in my program, the keys corresponding to the RomanNumerals, and the letters, are actually created on the fly via iterations. And yes, I do use `use strict` So, I think I actually figured out a way to do it myself. I haven't tried this or denugged it, but it logically seems to do what want, if inelegantly. I just use `for keys...` to go through each hash and sub-hash until I get to the elements, and then I compare that to each individual element with another set of `for keys...` calls. Here's the crux of it, sorry if all the variable names are hard to follow, I'm just in a bit of a rush and aren't up to fixing it for logical readability right now.... sub WEED { # this sets up that we will perform this for each fasta sequence in w +hatever # file. makes the rest of the sub one step simpler. FASTA: for my $fastaseq (keys %matches) { $setscounter = 0; SITE: for my $site (sort {$a <=> $b } keys %{$matches{$fastaseq}}) +{ ELEM: for my $i (@{$matches{$fastaseq}{$site}}) { $lowerlimit = $matches{$fastaseq}{$site}[$i]; $upperlimit = $span + $lowerlimit; SITEKEY: for my $sitekey (sort {$a <=> $b } keys %{$matches{$fa +staseq}}) { SET: for $setcount (@{$matches{$fastaseq}{$sitekey}}) { if ($setcount >= $lowerlimit \|\| $_ <= $upperlimit) { push @{$sets{$fastaseq}[$setscounter]{$sitekey}}, $setcou +nt; #closes If setcount } next SET; #closes For setcount } next SITEKEY; #closes SITEKEY } next ELEM; # closes for my i } $setscounter++; next SITE; #closes for my site } next FASTA; #closes for my fastaseq } return %sets; #closes subroutine } [download] Thanks all for the help so far, Matt	[reply] [d/l] [select]