astroman has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Is there a PDL-threaded way to create a piddle of medians of portions of another piddle? Like this, but with an arbitrary "step" in the where statements:

pdl> $b = pdl (0,1,2,3,4,5,6,7,8,9,10,10.5,11,11.5,12,12.5,13,13.5,14,14.5)
pdl> p $b
[0 1 2 3 4 5 6 7 8 9 10 10.5 11 11.5 12 12.5 13 13.5 14 14.5]
pdl> $a = pdl (0,6,18,7,19,3,10,2,12,4,8,9,1,15,11,11,19,17,0,9)
pdl> p $a
[0 6 18 7 19 3 10 2 12 4 8 9 1 15 11 11 19 17 0 9]
pdl> $d = pdl($a(($b>0)*($b<=5);?)->medover, $a(($b>5)*($b<=10);?)->medover, $a(($b>10)*($b<=15);?)->medover)
pdl> p $d
[7 8 11]
That is, instead of writing a huge number of $a(($b>n*x)*($b<=(n+1)*x);?)->medover statements, could I just specify the step "x" to use? What I'm attempting to accomplish is a sort of "chunky" fit of data, condensing a large number of points into a series of non-overlapping representative medians.

Thanks for your help!

Re: Grouping one piddle based on ranges of another
by kevbot (Vicar) on Aug 27, 2015 at 06:47 UTC
    Hello astroman,

    I took some ideas from your reply to djerius to create this solution. I think this will work for you, and it keeps all the operations in piddles/PDL. Basically, I create masks for your greater than and less than or equal to conditions and apply them to "expanded" versions of your $a and $b piddles.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use PDL;

    # The number of intervals you require (your example shows 3)
    my $n = 3;

    my $a = pdl (0,6,18,7,19,3,10,2,12,4,8,9,1,15,11,11,19,17,0,9);
    my $b = pdl (0,1,2,3,4,5,6,7,8,9,10,10.5,11,11.5,12,12.5,13,13.5,14,14.5);

    my $gt = sequence($n)*5; # equivalent to pdl(0, 5, 10) for $n = 3
    my $ltoe = (sequence($n)+1)*5; # equivalent to pdl(5, 10, 15) for $n = 3

    my $expanded_a = $a->dummy(0,$n);
    my $expanded_b = $b->dummy(0,$n);

    my $gt_mask = $expanded_b > $gt;
    my $ltoe_mask = $expanded_b <= $ltoe;
    my $mask = $gt_mask & $ltoe_mask;
    my $mask_w_bad = $mask->setbadif($mask == 0);

    my $masked = $expanded_a * $mask_w_bad;
    my $medians = $masked->transpose->medover;

    print $medians, "\n";

    exit;
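    The script above hard-codes the step of 5. A minimal sketch of the same masking idea with the step as a parameter might look like the following. The helper name binned_medians and the $start and $step arguments are hypothetical, introduced here only for illustration; the PDL calls are the same ones used above. Note that it still builds intermediate piddles of shape [n, number of points], so memory and runtime grow with the number of intervals.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use PDL;

# Hypothetical helper (not from the thread): medians of $data over the bins
# ($start + i*$step, $start + (i+1)*$step] of $x, for i = 0 .. $n-1.
sub binned_medians {
    my ($data, $x, $start, $step, $n) = @_;

    my $lower = $start + sequence($n) * $step;   # exclusive lower edges, shape [n]
    my $upper = $lower + $step;                  # inclusive upper edges, shape [n]

    # dummy(0, $n) adds a leading dimension of size $n, giving shape
    # [n, npoints], so every point is compared against every bin edge.
    my $xx   = $x->dummy(0, $n);
    my $mask = ($xx > $lower) & ($xx <= $upper);

    # Mark everything outside a bin as BAD so that medover skips it.
    my $masked = $data->dummy(0, $n) * $mask->setbadif($mask == 0);

    # medover reduces over the first dimension, so move the points there.
    return $masked->transpose->medover;          # shape [n]
}

my $a = pdl(0,6,18,7,19,3,10,2,12,4,8,9,1,15,11,11,19,17,0,9);
my $b = pdl(0,1,2,3,4,5,6,7,8,9,10,10.5,11,11.5,12,12.5,13,13.5,14,14.5);

print binned_medians($a, $b, 0, 5, 3), "\n";
```

    With the example data and a step of 5 this should reproduce the [7 8 11] from your original post.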
      Cool! Thanks for that solution, it's exactly what I was going for. :)

      Sadly, testing reveals my original hypothesis (that I could obtain stupendous speed improvements by keeping everything as piddles instead of iterating through a "for" loop) to be incorrect. I think the extra computational expense of the second dimension kills my runtime: with $a and $b of size ~3 million, and n=100, it finishes in ~20 seconds vs ~10 seconds for the "for" loop I was trying to improve:

      sub test_medians {
          use strict;
          use warnings;
          use PDL;
          # Outside the pdl shell, the $a(...) slicing below also needs: use PDL::NiceSlice;
          $PDL::BIGPDL = 1;
          my $n = 100;
          my $step = 5;
          my $a = random(3000000)*100;
          my $b = random(3000000)*1000;
          $b = $b->qsort;
          my $d = zeroes($n);
          for (my $i=0;$i<$n;$i++) {
              $d(($i)) .= $a((($b>($i*$step))*($b<=($i+1)*$step));?)->medover;
          }
          return $d;
      }

      Nevertheless, thanks for your help! :)

Re: Grouping one piddle based on ranges of another
by djerius (Beadle) on Aug 26, 2015 at 19:25 UTC
    Update 1:

    Just noticed that you want a threaded way of doing this; this isn't it.

    Update 2:

    There's a fundamental problem with threading this. If your $b piddle is typical, your data subsets won't have the same number of matching elements in them, which makes it impossible to thread over.

    # calculate the number of elements in each chunk
    pdl> $idx = ($b/5)->floor
    pdl> $idx -= ($b - $idx * 5) == 0
    pdl> p [$idx->hist(-1,4,1)]->[1]
    [1 5 5 9 0]
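    The same chunk index can also be used to pull out each chunk's median, since where() is happy with chunks of different lengths. This is only a sketch under the same assumptions as above (step of 5, chunks 0 to 2), and it still loops over the chunks in Perl, so it is not a threaded solution either. The variable names $step and $medians are introduced here for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use PDL;

my $a = pdl(0,6,18,7,19,3,10,2,12,4,8,9,1,15,11,11,19,17,0,9);
my $b = pdl(0,1,2,3,4,5,6,7,8,9,10,10.5,11,11.5,12,12.5,13,13.5,14,14.5);

my $step = 5;

# Chunk index: points with $i*$step < $b <= ($i+1)*$step get index $i.
my $idx = ($b / $step)->floor;
$idx -= ($b - $idx * $step) == 0;   # exact multiples belong to the lower chunk

# One median per chunk; each where() call may return a different length.
my $medians = pdl( map { $a->where($idx == $_)->medover } 0 .. 2 );

print $medians, "\n";
```

    For the example data this should print [7 8 11].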

    Original Response:

    Hi,

    Note: Your updated code doesn't compile and, when fixed, doesn't give the stated results, e.g.:

    pdl> p pdl($a(($b>0)*($b<=5);?)->medover, $a(($b>5)*($b<=10);?)->medover, $a(($b>10)*($b<=15);?)->medover)
    [7 8 11]
    How about this:
    pdl> $start = 0;
    pdl> $step = 5;
    pdl> $d = pdl( map { $a(($b > $start + $step *$_ )*( $b <= $start + $step*($_+1) );?)->medover } 0..3 )
    pdl> p $d
    [7 8 11 13]

      My apologies; I had modified the appearance to try to make it more clear what I was attempting. I've fixed it now and the code runs as written.

      Yes, you are correct about the intermediate piddles (the ones of which I am finding the median) being different lengths. I was thinking perhaps there's some way to expand into extra dimensions, where each dimension would contain the values in a given $b range, and then take medover the entire array to condense it into the form of $d. Is this possible?

      Like this, only without having to manually type all of the $mask(..) .= 1 lines:

      pdl> $e = $a(,,*3)->copy
      pdl> p $e
      [
       [
        [0 6 18 7 19 3 10 2 12 4 8 9 1 15 11 11 19 17 0 9]
       ]
       [
        [0 6 18 7 19 3 10 2 12 4 8 9 1 15 11 11 19 17 0 9]
       ]
       [
        [0 6 18 7 19 3 10 2 12 4 8 9 1 15 11 11 19 17 0 9]
       ]
      ]
      pdl> $mask = $e->zeroes
      pdl> $mask(,,0)->where(($b>0)*($b<=5)) .= 1
      pdl> $mask(,,1)->where(($b>5)*($b<=10)) .= 1
      pdl> $mask(,,2)->where(($b>10)*($b<=15)) .= 1
      pdl> p $mask
      [
       [
        [0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
       ]
       [
        [0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0]
       ]
       [
        [0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
       ]
      ]
      pdl> $mask = $mask->setbadif($mask==0)
      pdl> p $mask
      [
       [
        [BAD 1 1 1 1 1 BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD]
       ]
       [
        [BAD BAD BAD BAD BAD BAD 1 1 1 1 1 BAD BAD BAD BAD BAD BAD BAD BAD BAD]
       ]
       [
        [BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD 1 1 1 1 1 1 1 1 1]
       ]
      ]
      pdl> $f = $e*$mask
      pdl> p $f
      [
       [
        [BAD 6 18 7 19 3 BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD]
       ]
       [
        [BAD BAD BAD BAD BAD BAD 10 2 12 4 8 BAD BAD BAD BAD BAD BAD BAD BAD BAD]
       ]
       [
        [BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD 9 1 15 11 11 19 17 0 9]
       ]
      ]
      pdl> $g = $f->medover
      pdl> p $g
      [
       [ 7]
       [ 8]
       [11]
      ]

      Thank you very much for your help!