in reply to Re^6: Better mousetrap (getting top N values from list X)
in thread Better mousetrap (getting top N values from list X)

Understood. The reason for my reluctance to do things one at a time is because I saw the effects of doing a C to perl callback for the comparator. Just calling the comparator (without actually using it's return value) drops my topN() routine from first, to second last place in the benchmark.

[ 1:01:43.25] P:\test>427029 -MAX=100 -N=10 10 from 100 (1 seconds) 100|99|98|97|96|95|94|93|92|91 Rate Ari Buk LR BukLR perl C Ari 464/s -- -74% -83% -93% -98% -99% Buk 1768/s 281% -- -35% -74% -93% -97% LR 2731/s 488% 54% -- -60% -89% -95% BukLR 6876/s 1381% 289% 152% -- -72% -88% perl 24678/s 5217% 1296% 804% 259% -- -57% C 57443/s 12276% 3149% 2004% 735% 133% -- [ 1:02:07.47] P:\test>427029 -MAX=100 -N=10 10 from 100 (1 seconds) 100|99|98|97|96|95|94|93|92|91 Rate Ari C Buk LR BukLR perl Ari 477/s -- -32% -72% -79% -92% -98% C 703/s 47% -- -59% -69% -88% -97% Buk 1729/s 262% 146% -- -24% -71% -93% LR 2266/s 375% 222% 31% -- -61% -91% BukLR 5881/s 1132% 737% 240% 159% -- -77% perl 25039/s 5146% 3463% 1348% 1005% 326% --

There may be some fat that could be trimmed as I put the callback code in a separate C function (per the examples) in order to isolate all the stack manipulations.

That could probably be inlined with care, but I also tried calling a dummy C-function and it is very clear that it is the callback to perl, not the call to the C function that is chewing all the time. Once you get into the C code, it's better to stay and do as much work as possible in one go.

That said, I guess you could alway do the batching within the wrapping code also, which would make sense.


Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.

Replies are listed 'Best First'.
Re^8: Better mousetrap (getting top N values from list X)
by tall_man (Parson) on Feb 04, 2005 at 18:42 UTC
    Out of curiousity, was your comparator of the {$a <=> $b} type, or did you pass parameters to it? It seems as if we could tap into sort's efficient comparator speed if we could use $a and $b from C code. Generic comparisons might be easier that way.
      Out of curiousity, was your comparator of the {$a <=> $b} type, or did you pass parameters to it?

      I haven't worked out how to set $a and $b yet. This was the C wrapper function I used for calling the comparator.

      int callComp( SV* cmp, int a, int b ) { int count, rv; dSP; ENTER; SAVETMPS; PUSHMARK(SP); XPUSHs( sv_2mortal( newSViv( a ) ) ); XPUSHs( sv_2mortal( newSViv( b ) ) ); PUTBACK; if( ( count = call_sv( cmp, G_SCALAR ) ) != 1 ) croak( "Bad comparator" ); SPAGAIN; rv = POPi; FREETMPS; LEAVE; }

      That said, I am not sure how much difference it would make to the performance? You get the best performance from sort when you don't actually do anything in the comparator sub, which allows perl to completely skip the setting of parameters and the callback to perl.

      But I think that the decision that allows perl to do that is embedded deep within the parser, a very specific special case, not something that could easily be transferred to extensions. I don't think it uses the hints mechanism?

      That's what I suggested that when the programmer codes @s = sort{$a<=>$b} @u;, the interpreter effectively reads that as @s = sortNumericAscending @u; and sort{$bcmp$a} is effectively translated to 'sortAlphaDescending'.

      Indeed, in the latest versions, @s = reverse sort @u; is also interpreted as 'sortAlphaDescending', but I don't think there is any easy way to apply that type of reinterpretation to extensions.

      Actually, I think that 4 named variants would be cleaner both from the point of view of understandability: topN() or Ngreatest() doesn't make a great deal of sense in the case of wanting the N lowest values; and efficiency: avoiding the conditional code to decide which way we are ordering.

      To fit in with the existing naming conventions of List::Util, they should probably be called: maxN(), minN(), maxNstr() & minNstr(). I don't really like the "defaulting" of min() and max() to numeric--which is the opposite to sort--or the "ommision" of 'num' from the names, but that's what now exists, and consistancy would be a good thing.


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.
        I read once that the $a,$b trick was done for performance reasons. I know the optimimum use of sort is for perl to call predefined routines. However, sort achieves reasonable performance even for nonstandard user-defined functions. Here is what I tried, which seg-faulted. I need more in-depth perlguts knowledge to make it work. Probably I should give up and write the individual sorts.
        #!/usr/bin/perl use strict; use warnings; use Algorithm::Numerical::Shuffle qw( shuffle ); use Inline C => 'DATA'; my $max = 10; my @values = 1 .. $max; my @values_mixed = shuffle(@values); my @top = topNbs(5, sub { $a <=> $b }, \@values_mixed); print join(" ",sort { $a <=> $b } @top),"\n"; __END__ __C__ int callComp( SV* cmp, GV *first, GV *second, SV *a, SV *b) { int count, rv; dSP; ENTER; SAVETMPS; PUSHMARK(SP); PUTBACK; GvSV(first) = a; GvSV(second) = b; if( ( count = call_sv( cmp, G_SCALAR|G_NOARGS) ) != 1 ) croak( "Bad comparator" ); SPAGAIN; rv = POPi; FREETMPS; LEAVE; } void topNbs( int n, SV *cmp, AV*data) { SV **topN; GV *first, *second; int len = av_len( data ); int i, j, k; int left, right; Inline_Stack_Vars; Newz( 1, topN, n + 1, SV* ); for (i = 0; i < n+1; i++) { topN[i] = newSViv( 0 ); } first = gv_fetchpv("a", TRUE, SVt_PV); second = gv_fetchpv("b", TRUE, SVt_PV); for( i = 0; i <= len; i++ ) { SV *val = av_fetch( data, i, 0 ); int comp = callComp(cmp, first, second, val, topN[ n - 1]); if (comp <= 0) continue; left = 0; right = n - 1; while (left < right) { int middle = (left + right) >> 1; if (callComp(cmp, first, second, val, topN[ n - 1]) <= 0) { left = middle + 1; } else { right = middle; } } if( callComp(cmp, first, second, topN[ left ], val) < 0) { for( k = n; k > left; k-- ) topN[ k ] = topN[ k-1 ]; topN[ left ] = val; } } Inline_Stack_Reset; for( i = 0; i < n; i++ ) Inline_Stack_Push( sv_2mortal( newSVsv( topN[ i ] ) ) ); Safefree( topN ); Inline_Stack_Done; }
        Update: I must have been thinking in C++. I have moved the GV* declarations up. It still doesn't do what I want, (now it tells me the '<=>' operations are being applied to undefined values sometimes, then seg faults).

        Here is some sample output:

        Use of uninitialized value in numeric comparison (<=>) at top2.pl line + 11. Use of uninitialized value in numeric comparison (<=>) at top2.pl line + 11. Use of uninitialized value in numeric comparison (<=>) at top2.pl line + 11. Use of uninitialized value in numeric comparison (<=>) at top2.pl line + 11. Use of uninitialized value in numeric comparison (<=>) at top2.pl line + 11. 0 0 0 0 0 Attempt to free unreferenced scalar.