Re: taking a subset of HoHoA, retaining relative structure?

I've put together some not very elegant code that seems to do pretty much what you want.

Concerning overlaps, as brought up by jdporter, your test data actually contains such an instance. Consider the sequence:

( 5900, 5990, 6000, 6402, 6450 )

If $interval is 500, how should this sequence be dealt with? For the purposes of my code, I interpreted your words "find all the groups that are within $interval of each other" literally, splitting this sequence into two "clusters", namely:

( 5900, 5990, 6000 ) and ( 5990, 6000, 6402, 6450 )

But the alternative solution offered in your follow-up post would be a lot easier to code. ;)

The code also deals with numbers appearing in multiple lists.

This is the Data::Dump output for the first roman numeral:

(
  "I",
  [
    [1, [4, 10, 200], 3, [0, 95], 2, [75, 350]],
    [1, [6000], 3, [5990], 2, [5900]],
    [1, [6000], 3, [5990, 6450], 2, [6402]],
    [1, [7000, 7100], 2, [7110]],
  ],
)
[download]

Oh, and by the way, you shouldn't use $a and $b as variable names, or strictures (you do use strict; don't you?) will complain when you do a sort:

Can't use "my $a" in sort comparison at path/to/prog.pl line 30.

Anyway, here's the program:

use strict;
use warnings;
use Data::Dump 'dump';

my $INTERVAL = 500;

# Don't know how you get your indices, so:
my ( $x, $y, $z ) = ( 1, 2, 3 );
my @indices = ( $x, $y, $z );

my %orig = (
  I => {
    $x => [ 4, 10, 200, 6000, 7000, 7100 ],
    $y => [ 75, 350, 5900, 6402, 7110 ],
    $z => [ 0, 95, 5990, 6450 ] },
  II => {
    $x => [ 0, 20, 501, 1000 ],
    $y => [ 22, 23, 24 ],
    $z => [ 22, 23, 500 ] },
);

my %outhash;

foreach my $key ( keys %orig ) {
  my ( @merged, @graded, @prev );

  foreach my $a_ref ( @indices ) {
    push @merged, map { $_ . '.' . $a_ref } @{ $orig{$key}{$a_ref} };
  }
  @merged = sort { $a <=> $b } @merged;

  foreach my $i ( 0 .. $#merged ) {
    my @cluster;
    foreach my $j ( $i .. $#merged ) {
      last if $merged[$j] - $merged[$i] > $INTERVAL;
      push @cluster, $merged[$j];
    }
    if ( not subset_of( \@prev, \@cluster ) ) {
      push @graded, [ @cluster ];
    }
    @prev = @cluster;
  }

  foreach my $a_ref ( @graded ) {
    my %HoA;
    for ( @{ $a_ref } ) {
      my ( $int, $idx ) = split /\./;
      push @{ $HoA{$idx} }, $int;
    }
    push @{ $outhash{$key} }, [ %HoA ];
  }
}

dump( %outhash );

sub subset_of {
  my %first = map { $_ => undef } @{ $_[0] };
  my @second = @{ $_[1] };
  for ( @second ) {
    return 0 unless exists $first{$_};
  }
  return 1;
}
[download]

Comment on Re: taking a subset of HoHoA, retaining relative structure? Select or Download Code

Replies are listed 'Best First'.
Re^2: taking a subset of HoHoA, retaining relative structure? by mdunnbass (Monk) on Oct 19, 2006 at 20:07 UTC
Thanks. I haven't had a chance yet to look over your code, but I'll take a look tomorrow. Thanks though. As for the variable names, I was only using `$a` and `$b` in the example. As it stands, in my program, the keys corresponding to the RomanNumerals, and the letters, are actually created on the fly via iterations. And yes, I do use `use strict` So, I think I actually figured out a way to do it myself. I haven't tried this or denugged it, but it logically seems to do what want, if inelegantly. I just use `for keys...` to go through each hash and sub-hash until I get to the elements, and then I compare that to each individual element with another set of `for keys...` calls. Here's the crux of it, sorry if all the variable names are hard to follow, I'm just in a bit of a rush and aren't up to fixing it for logical readability right now.... sub WEED { # this sets up that we will perform this for each fasta sequence in w +hatever # file. makes the rest of the sub one step simpler. FASTA: for my $fastaseq (keys %matches) { $setscounter = 0; SITE: for my $site (sort {$a <=> $b } keys %{$matches{$fastaseq}}) +{ ELEM: for my $i (@{$matches{$fastaseq}{$site}}) { $lowerlimit = $matches{$fastaseq}{$site}[$i]; $upperlimit = $span + $lowerlimit; SITEKEY: for my $sitekey (sort {$a <=> $b } keys %{$matches{$fa +staseq}}) { SET: for $setcount (@{$matches{$fastaseq}{$sitekey}}) { if ($setcount >= $lowerlimit \|\| $_ <= $upperlimit) { push @{$sets{$fastaseq}[$setscounter]{$sitekey}}, $setcou +nt; #closes If setcount } next SET; #closes For setcount } next SITEKEY; #closes SITEKEY } next ELEM; # closes for my i } $setscounter++; next SITE; #closes for my site } next FASTA; #closes for my fastaseq } return %sets; #closes subroutine } [download] Thanks all for the help so far, Matt	[reply] [d/l] [select]

Replies are listed 'Best First'.

Re^2: taking a subset of HoHoA, retaining relative structure?
by mdunnbass (Monk) on Oct 19, 2006 at 20:07 UTC

As for the variable names, I was only using $a and $b in the example. As it stands, in my program, the keys corresponding to the RomanNumerals, and the letters, are actually created on the fly via iterations. And yes, I do use use strict

So, I think I actually figured out a way to do it myself. I haven't tried this or denugged it, but it logically seems to do what want, if inelegantly. I just use for keys... to go through each hash and sub-hash until I get to the elements, and then I compare that to each individual element with another set of for keys... calls.

Here's the crux of it, sorry if all the variable names are hard to follow, I'm just in a bit of a rush and aren't up to fixing it for logical readability right now....

sub WEED {
 # this sets up that we will perform this for each fasta sequence in w
+hatever
 # file.  makes the rest of the sub one step simpler.
 FASTA: for my $fastaseq (keys %matches) {
   $setscounter = 0;
   SITE: for my $site (sort {$a <=> $b } keys %{$matches{$fastaseq}}) 
+{
     ELEM: for my $i (@{$matches{$fastaseq}{$site}}) {
       $lowerlimit = $matches{$fastaseq}{$site}[$i];
       $upperlimit = $span + $lowerlimit;
       SITEKEY: for my $sitekey (sort {$a <=> $b } keys %{$matches{$fa
+staseq}}) {
         SET: for $setcount (@{$matches{$fastaseq}{$sitekey}}) {
           if ($setcount >= $lowerlimit || $_ <= $upperlimit) {
             push @{$sets{$fastaseq}[$setscounter]{$sitekey}}, $setcou
+nt;
             #closes If setcount
             }
           next SET;
           #closes For setcount
           }
         next SITEKEY;  
         #closes SITEKEY  
         }    
       next ELEM;
       # closes for my i
       }
     $setscounter++;
     next SITE;
     #closes for my site
     }
   next FASTA;
   #closes for my fastaseq
   }
 return %sets;
 #closes subroutine
 }
[download]

[reply]
[d/l]
[select]