in reply to Re: Weird performance issue with Strawberries and Inline::C
in thread Weird performance issue with Strawberries and Inline::C

I can only assume that memory for arrays declared within a subroutine is allocated per call (I don't actually know); of course this doesn't apply to dummy arrays, i.e. sub arguments (passed by reference). The scratch-pads, i.e. "a" and "cum", were declared with size "n" rather than an explicit constant; does that matter at all?

At least it's easy to observe that memory is returned to the OS upon return from a subroutine where a VLA (variable length array) is declared (and used) -- _if_ the array gets large enough, e.g. ~130K 64-bit integers -- and is _not_ returned to the OS for smaller arrays. I don't know if or when other factors come into play, and I doubt this factoid has much value, but it explains why the original "f" subroutine was slower for inputs of 1e6 and longer.
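If it helps to picture the effect, here's a minimal C sketch of what I suspect happens, with two loud assumptions: that the compiler implements large automatic (per-call) arrays via malloc/free under the hood, and that the allocator serves big blocks via mmap and unmaps them on free -- so per-call scratch memory really does bounce back and forth to the OS:

/* Sketch only, under the assumptions stated above. */
#include <stdint.h>
#include <stdlib.h>

/* Fresh scratch-pad on every call, like an automatic array sized "n". */
int64_t cumsum_per_call( const int8_t* s, size_t n ) {   /* n >= 1 */
    int64_t* cum = malloc( n * sizeof( int64_t ));
    cum[ 0 ] = s[ 0 ];
    for ( size_t i = 1; i < n; i ++ )
        cum[ i ] = cum[ i - 1 ] + s[ i ];
    int64_t last = cum[ n - 1 ];
    free( cum );        /* a large block may be munmap'ed right here */
    return last;
}

/* Scratch-pad kept across calls: no per-call OS traffic. */
int64_t cumsum_reused( const int8_t* s, size_t n ) {     /* n >= 1 */
    static int64_t* cum = NULL;
    static size_t cap = 0;
    if ( cap < n ) {
        cum = realloc( cum, n * sizeof( int64_t ));
        cap = n;
    }
    cum[ 0 ] = s[ 0 ];
    for ( size_t i = 1; i < n; i ++ )
        cum[ i ] = cum[ i - 1 ] + s[ i ];
    return cum[ n - 1 ];
}

The second variant is how the round trip would be avoided, at the cost of keeping the buffer alive between calls.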

Let's aim higher than ~40% gain through tweaks and compiler voodoo -- let's get parallel!

use strict;
use warnings;
use feature 'say';
use Config;
use Benchmark qw/ timeit :hireswallclock /;
use lib '.';
use Max_score_subs ':all';

say "$^V / $Config{ archname } / $Config{ gccversion }";

my $str;
my %tests = (
    # C:
    c        => sub { mscore_c( $str )},        # - dumb
    c_better => sub { mscore_c_better( $str )}, # - tweaked
    c_omp    => sub { mscore_c_omp( $str )},    # - "c", parallel
    #
    # Fortran:
    c2f      => sub { mscore_c2f( $str )},      # - "c_better" (in F)
    f        => sub { mscore_f( $str )},        # - "PDL" (in F)
    f_omp    => sub { mscore_f_omp( $str )},    # - "f", parallel
);

my $iters = 2e5;
my $fmt = '%8.0f';

for my $L ( 1e4, 1e5, 1e6, 1e7, 1e8 ) {
    say "\nString length: " . $L =~ s/(\d)(?=(\d{3})+$)/$1,/gr;
    $str = '1' x $L;
    substr $str, rand $L, 1, '0' for 1 .. $L;

    my $any;
    my %ret;
    for my $key ( keys %tests ) {
        my $result = $tests{ $key }-> ();
        die "<<< $key!!! >>>" unless $result == ( $any //= $result );
        my $t = timeit( $iters, $tests{ $key });
        $ret{ $key } = $t-> iters / $t-> real
    }

    print "              Rate/s     %\n";
    $fmt = '%8.1f' if $L > 1e6;
    for my $key ( sort { $ret{ $a } <=> $ret{ $b }} keys %ret ) {
        printf "    %-9s $fmt %5.0f\n",
            $key, $ret{ $key }, 100 * $ret{ $key } / $ret{ c }
    }
    $iters /= 10
}

__END__

v5.42.0 / MSWin32-x64-multi-thread / 13.2.0

String length: 10,000
              Rate/s     %
    f_omp      23804    28
    c_omp      25362    29
    c          86100   100
    c_better   91470   106
    c2f       105239   122
    f         120134   140

String length: 100,000
              Rate/s     %
    c           8724   100
    c_better    9324   107
    c2f        10686   122
    f          12139   139
    c_omp      15369   176
    f_omp      15964   183

String length: 1,000,000
              Rate/s     %
    c            875   100
    c_better     931   106
    c2f         1068   122
    f           1213   139
    c_omp       2306   264
    f_omp       2560   293

String length: 10,000,000
              Rate/s     %
    c           86.8   100
    c_better    92.2   106
    c2f        104.5   120
    f          115.6   133
    c_omp      349.6   403
    f_omp      426.9   492

String length: 100,000,000
              Rate/s     %
    c            8.6   100
    c_better     9.2   106
    c2f         10.4   120
    f           11.4   131
    c_omp       34.7   401
    f_omp       42.4   491

What a pity I only have 4 cores and can't achieve more than a 4x increase -- it would be fun to watch. In hindsight, the improved "c_better" and its translation "c2f" were a trap: with their loops, it's not obvious how to split the work into independent parts. OTOH, the "dumb" C version is not too difficult to split:

int mscore_c_omp( SV* str ) {
    STRLEN len;
    char* buf = SvPVbyte( str, len );
    len --;

    int nparts = omp_get_num_procs();
    if ( nparts > 4 ) nparts = 4;
    int psize = len / nparts;
    int acc_a[ nparts ];
    int cum_a[ nparts ];
    int max_a[ nparts ];

    #pragma omp parallel for schedule( static, 1 ) \
        num_threads( nparts )
    for ( int j = 0; j < nparts; j ++ ) {
        int lo = j * psize;
        int hi = ( j == nparts - 1 ) ? len : lo + psize;
        int acc = 0;
        int cum = 0;
        int max = -1;
        for ( int i = lo; i < hi; i ++ ) {
            int one = buf[ i ] - '0';
            acc += one;
            cum += one ? -1 : 1;
            if ( cum > max ) max = cum;
        }
        acc_a[ j ] = acc;
        cum_a[ j ] = cum;
        max_a[ j ] = max;
    }

    int acc = acc_a[ 0 ];
    int max = max_a[ 0 ];
    for ( int j = 1; j < nparts; j ++ ) {
        acc += acc_a[ j ];
        cum_a[ j ] += cum_a[ j - 1 ];
        max_a[ j ] += cum_a[ j - 1 ];
        if ( max_a[ j ] > max ) max = max_a[ j ];
    }
    return acc + max + ( buf[ len ] == '1' );
}
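The only non-obvious bit is the merge at the end: each part computes its "cum" and running "max" as if it started from zero, so a part's local maximum must be shifted by the total "cum" of all preceding parts before comparison -- which is what the final loop above does. A tiny self-contained check of that identity (my own illustration, same '1' -> -1, '0' -> +1 convention as above):

#include <stdio.h>

int main( void ) {
    /* "1101" split into "11" + "01"; per-part scan as in mscore_c_omp:
       '1' -> cum -1, '0' -> cum +1, max = running maximum of cum,
       initialised to -1 */
    int cum0 = -2, max0 = -1;   /* part "11": cum -1, -2          */
    int cum1 =  0, max1 =  1;   /* part "01": cum +1, 0; max = +1 */

    int max = max0;
    int m1  = max1 + cum0;      /* shift part 1 by part 0's cum */
    if ( m1 > max ) max = m1;

    printf( "combined max = %d\n", max );   /* -1: same answer as a
                                               serial scan of "1101" */
    return 0;
}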

In my tests, a string length of ~70k is the cut-off where the OMP versions begin to outrun the originals. Below that, it's better to just switch to the originals rather than add an "if" clause to the omp pragma (a sketch of that clause follows the Fortran sub below) -- the above function is somewhat cumbersome and slow. Speaking of cumbersome, the OMP-enabled Fortran sub is a long way from the original (elegant) PDL version. At least there remains a ~25% gain over "c_omp", through the best of both worlds: parallel with OMP, and compiler voodoo with vectorization.

! CHUNK and A97 ( presumably 97 = iachar('0') + iachar('1'), so that
! 2*s - A97 maps '0'/'1' to -1/+1 ) are parameters defined elsewhere
subroutine f_omp( n, s, ret )
    integer, intent( in ) :: n
    integer * 1, intent( in ) :: s( n )
    integer, intent( out ) :: ret
    integer :: j, nparts, psize
    integer, allocatable :: acc_a( : ), cum_a( : ), min_a( : )

    nparts = omp_get_num_procs()
    if ( nparts > 4 ) nparts = 4
    psize = n / nparts

    allocate( acc_a( nparts ))
    allocate( cum_a( nparts ))
    allocate( min_a( nparts ))

    !$omp parallel do schedule( static, 1 ) num_threads( nparts )
    do j = 1, nparts ; block
        integer * 1, allocatable :: a( : )
        integer, allocatable :: cum( : )
        integer :: lo, hi, d, nchunks
        integer :: pre, L, i, k
        integer :: acc_, cum_, min_

        lo = ( j - 1 ) * psize
        if ( j == nparts ) then
            hi = n - 1
        else
            hi = j * psize
        end if

        d = hi - lo
        nchunks = d / CHUNK
        if ( mod( d, CHUNK ) .ne. 0 ) nchunks = nchunks + 1

        acc_ = 0; cum_ = 0; min_ = 2

        allocate( a( CHUNK ))
        allocate( cum( CHUNK ))

        do k = 1, nchunks
            pre = lo + ( k - 1 ) * CHUNK
            L = min( CHUNK, hi - pre )
            a( 1 : L ) = 2 * s( pre + 1 : pre + L ) - A97
            cum( 1 ) = cum_ + a( 1 )
            do i = 2, L
                cum( i ) = cum( i - 1 ) + a( i )
            end do
            acc_ = acc_ + count( s( pre + 1 : pre + L ) &
                == iachar( '1', 1 ))
            cum_ = cum( L )
            min_ = min( min_, minval( cum( 1 : L )))
        end do

        acc_a( j ) = acc_
        cum_a( j ) = cum_
        min_a( j ) = min_
    end block ; end do
    !$omp end parallel do

    do j = 2, nparts
        cum_a( j ) = cum_a( j ) + cum_a( j - 1 )
        min_a( j ) = min_a( j ) + cum_a( j - 1 )
    end do

    ret = sum( acc_a ) + ( s( n ) - iachar( '0', 1 )) &
        - minval( min_a )
end subroutine f_omp
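And for reference, the "if" clause mentioned earlier would be a one-line change to "mscore_c_omp" -- sketched below, not benchmarked; the 70000 threshold is the cut-off from my tests. Note that the part-splitting bookkeeping of the function still runs below the threshold (just on one thread), which is why simply dispatching to the original functions from Perl worked out better for me:

/* Fragment of mscore_c_omp with a run-time "if" clause; below the
 * threshold OpenMP executes the region on a single thread. */
#pragma omp parallel for schedule( static, 1 ) \
    num_threads( nparts ) if( len >= 70000 )
for ( int j = 0; j < nparts; j ++ ) {
    /* ... same per-part scan as in mscore_c_omp above ... */
}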

Re^3: Weird performance issue with Strawberries and Inline::C
by Anonymous Monk on Nov 06, 2025 at 11:39 UTC

    (OP here again, with long-forgotten PWC 342-2, exploring depths in muddy waters around 4-5 lines of trivial C)

    What a pity I only have 4 cores and can't achieve more than a 4x increase -- it would be fun to watch

    Fun should never be denied:

v5.42.0 / MSWin32-x64-multi-thread / 13.2.0

String length: 10,000
              Rate/s     %
    c          87663   100
    va        622005   710

String length: 100,000
              Rate/s     %
    c           8875   100
    c_omp      15439   174
    va         71646   807

String length: 1,000,000
              Rate/s     %
    c            891   100
    c_omp       2307   259
    va         12770  1434

String length: 10,000,000
              Rate/s     %
    c           88.3   100
    c_omp      361.2   409
    va        1615.6  1829

String length: 100,000,000
              Rate/s     %
    c            8.8   100
    c_omp       36.2   410
    va         167.7  1900

    'c' and 'c_omp' are 'mscore_c' (the straightforward, i.e. "dumb", version) and 'mscore_c_omp', both found in this thread. The 'va' stands for 'vectors, assembly'; OMP (I have 4 cores) is "automatically" switched on for strings longer than 4e5. The CPU is supposed to support AVX2, i.e. be 2017+.

    There's a "cheat": sequential runs of either '0's or '1's in the target string shouldn't be longer than 2**31, because the current (running) scores are maintained (dragged along) as 8 32-bit values per 256-bit register. Perhaps there is a more optimal way to find the (horizontal) cumulative sums along 16 bytes than 4 shifts/shuffles and 4 additions. What's really funny is that to find (rather, maintain) the running maximum over 32 numbers I'm only doing 3 comparisons. But maybe all of the above (i.e. below) looks not funny but ugly to "people who know how".
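    If someone wants to play with the scan part outside of assembly: below is a minimal SSE2 intrinsics sketch (my own illustration, not lifted from the code that follows) of those "4 shifts and 4 additions" -- an inclusive prefix sum over 16 signed bytes in log2(16) = 4 lag-doubling steps. VPSLLDQ/VPADDSB in the listing are the same instructions spelled _mm_slli_si128/_mm_adds_epi8; on 256-bit registers they operate per 128-bit lane, hence all the VPERMQ/VPSHUFB stitching afterwards.

#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#include <stdio.h>

int main( void ) {
    int8_t v[ 16 ] = { 1, 1, -1, 1, -1, -1, 1, 1,
                       1, -1, 1, -1, -1, -1, 1, 1 };
    __m128i x = _mm_loadu_si128(( __m128i* ) v );
    x = _mm_adds_epi8( x, _mm_slli_si128( x, 1 ));   /* add lag-1 */
    x = _mm_adds_epi8( x, _mm_slli_si128( x, 2 ));   /* add lag-2 */
    x = _mm_adds_epi8( x, _mm_slli_si128( x, 4 ));   /* add lag-4 */
    x = _mm_adds_epi8( x, _mm_slli_si128( x, 8 ));   /* add lag-8 */
    _mm_storeu_si128(( __m128i* ) v, x );
    for ( int i = 0; i < 16; i ++ ) printf( "%d ", v[ i ]);
    /* prints the running sums: 1 2 1 2 1 0 1 2 3 2 3 2 1 0 1 2 */
    return 0;
}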

use Inline C => Config =>
    ccflagsex => '-fopenmp',
    libs => '-lgomp -ldl';
use Inline C => << 'END_OF_C';

#include <omp.h>

#define MANY 400000
#define MAX_CORES 4

void _helper_a( char* buf, size_t len,
    int32_t* acc, int32_t* cum, int32_t* max );
void _helper_c( char* buf, size_t len,
    int32_t* acc, int32_t* cum, int32_t* max );

int mscore_va( SV* str ) {
    STRLEN len;
    char* buf = SvPVbyte( str, len );

    int ncores = 1;
    if ( len >= MANY ) {
        int n = omp_get_num_procs();
        ncores = n > MAX_CORES ? MAX_CORES : n;
    }

    int nparts = 2 + ncores;
    int32_t acc_a[ nparts ];
    int32_t cum_a[ nparts ];
    int32_t max_a[ nparts ];
    memset( acc_a, 0x00, sizeof( acc_a ));
    memset( cum_a, 0x00, sizeof( cum_a ));
    memset( max_a, 0xFF, sizeof( max_a ));

    len --;
    int32_t prefix_len = ( size_t ) buf % 32;
    if ( len < prefix_len ) prefix_len = len;
    int32_t suffix_len = ( len - prefix_len ) % 32;
    int32_t aligned_len = len - prefix_len - suffix_len;
    char* prefix_start = buf;
    char* aligned_start = buf + prefix_len;
    char* suffix_start = aligned_start + aligned_len;

    if ( prefix_len > 0 ) {
        _helper_c( prefix_start, prefix_len,
            acc_a, cum_a, max_a );
    }
    if ( suffix_len > 0 ) {
        _helper_c( suffix_start, suffix_len,
            &( acc_a[ nparts - 1 ]),
            &( cum_a[ nparts - 1 ]),
            &( max_a[ nparts - 1 ]));
    }
    if ( aligned_len > 0 ) {
        if ( ncores == 1 ) {
            _helper_a( aligned_start, aligned_len,
                &( acc_a[ 1 ]), &( cum_a[ 1 ]), &( max_a[ 1 ]));
        }
        else {
            size_t psize = len / 32 / ncores * 32;
            #pragma omp parallel for schedule( static, 1 ) \
                num_threads( ncores )
            for ( int j = 0; j < ncores; j ++ ) {
                char* start = aligned_start + j * psize;
                size_t len_ = ( j < ncores - 1 )
                    ? psize
                    : aligned_len - j * psize;
                _helper_a( start, len_,
                    &( acc_a[ j + 1 ]),
                    &( cum_a[ j + 1 ]),
                    &( max_a[ j + 1 ]));
            }
        }
    }

    int32_t acc = acc_a[ 0 ];
    int32_t max = max_a[ 0 ];
    for ( int j = 1; j < nparts; j ++ ) {
        acc += acc_a[ j ];
        cum_a[ j ] += cum_a[ j - 1 ];
        max_a[ j ] += cum_a[ j - 1 ];
        if ( max_a[ j ] > max ) max = max_a[ j ];
    }
    return acc + max + ( buf[ len ] == '1' );
}

void _helper_c( char* buf, size_t len,           // in
    int32_t* acc, int32_t* cum, int32_t* max ) { // out
    for( int i = 0; i < len; i ++ ) {
        int one = buf[ i ] - '0';
        *acc += one;
        *cum += one ? -1 : 1;
        if ( *cum > *max ) *max = *cum;
    }
}

void _helper_a( char* buf, size_t len,           // in
    int32_t* acc, int32_t* cum, int32_t* max ) { // out
    void* fin = buf + len;
    int32_t acc_, cum_, max_;

    __asm__ (
//      " IDDQD                                  \n"
        " XOR        %[acc], %[acc]              \n"

        " MOV        $0x00, %%r11                \n"
        " MOVQ       %%r11, %%xmm0               \n"
        " VPBROADCASTB %%xmm0, %%ymm0            \n" // ymm0 = cum
        " MOV        $0xFF, %%r11                \n"
        " MOVQ       %%r11, %%xmm1               \n"
        " VPBROADCASTB %%xmm1, %%ymm1            \n" // ymm1 = max
        " MOV        $0x61, %%r11                \n"
        " MOVQ       %%r11, %%xmm2               \n"
        " VPBROADCASTB %%xmm2, %%ymm2            \n" // ymm2 = 97 x 32
        " MOV        $0x0F, %%r11                \n"
        " MOVQ       %%r11, %%xmm6               \n"
        " VPBROADCASTB %%xmm6, %%ymm6            \n" // ymm6 = 15 x 32

        " MOV        %[buf], %%r8                \n"
        " MOV        %[fin], %%r9                \n"
        " MOV        $32, %%r10                  \n"

        ".L2_:                                   \n"
        " VMOVDQA    (%%r8), %%ymm3              \n"
        " VPADDB     %%ymm3, %%ymm3, %%ymm3      \n"
        " VPSUBB     %%ymm3, %%ymm2, %%ymm3      \n"

        " VPMOVMSKB  %%ymm3, %%ecx               \n"
        " POPCNT     %%ecx, %%ecx                \n"
        " ADD        %%ecx, %[acc]               \n"

        " VPSLLDQ    $1, %%ymm3, %%ymm4          \n"
        " VPADDSB    %%ymm3, %%ymm4, %%ymm3      \n"
        " VPSLLDQ    $2, %%ymm3, %%ymm4          \n"
        " VPADDSB    %%ymm3, %%ymm4, %%ymm3      \n"
        " VPSLLDQ    $4, %%ymm3, %%ymm4          \n"
        " VPADDSB    %%ymm3, %%ymm4, %%ymm3      \n"
        " VPSLLDQ    $8, %%ymm3, %%ymm4          \n"
        " VPADDSB    %%ymm3, %%ymm4, %%ymm3      \n"

        " VPSHUFB    %%ymm6, %%ymm3, %%ymm4      \n"
        " VPSHUFD    $0b01001110, %%ymm3, %%ymm5 \n"
        " VPMAXSB    %%ymm3, %%ymm5, %%ymm3      \n"

        " VPMOVSXBD  %%xmm3, %%ymm5              \n"
        " VPADDD     %%ymm0, %%ymm5, %%ymm5      \n"
        " VPMAXSD    %%ymm1, %%ymm5, %%ymm1      \n"
        " VPMOVSXBD  %%xmm4, %%ymm5              \n"
        " VPADDD     %%ymm0, %%ymm5, %%ymm0      \n"

        " VPERMQ     $0b01001110, %%ymm3, %%ymm3 \n"
        " VPERMQ     $0b01001110, %%ymm4, %%ymm4 \n"

        " VPMOVSXBD  %%xmm3, %%ymm5              \n"
        " VPADDD     %%ymm0, %%ymm5, %%ymm5      \n"
        " VPMAXSD    %%ymm1, %%ymm5, %%ymm1      \n"
        " VPMOVSXBD  %%xmm4, %%ymm5              \n"
        " VPADDD     %%ymm0, %%ymm5, %%ymm0      \n"

        " ADD        %%r10, %%r8                 \n"
        " CMP        %%r9, %%r8                  \n"
        " JL         .L2_                        \n"

        " VPERMQ     $0b00111001, %%ymm1, %%ymm4 \n"
        " VPMAXSD    %%ymm1, %%ymm4, %%ymm1      \n"
        " VPERMQ     $0b01001110, %%ymm1, %%ymm4 \n"
        " VPMAXSD    %%ymm1, %%ymm4, %%ymm1      \n"
        " PSHUFD     $0b00111001, %%xmm1, %%xmm4 \n"
        " PMAXSD     %%xmm4, %%xmm1              \n"

        " MOVD       %%xmm1, %[max]              \n"
        " MOVD       %%xmm0, %[cum]              \n"

        : [acc] "=r" ( acc_ ),
          [cum] "=rm" ( cum_ ),
          [max] "=rm" ( max_ )
        : [buf] "m" ( buf ),
          [fin] "m" ( fin )
        : "cc", "rcx", "r8", "r9", "r10", "r11",
          "ymm0", "ymm1", "ymm2", "ymm3", "ymm4", "ymm5", "ymm6"
    );
    *acc = acc_;
    *cum = cum_;
    *max = max_;
}
END_OF_C

      Sorry, hopefully this will be the last instalment on that same PWC 342-2 -- because mission (kind of) accomplished, with a symbolic and "all-important" order-of-magnitude speed gain over "plain" C on a single core, through a little re-arrangement, partial loop unrolling (stepping in 96 bytes instead of 32: sums of "-1"s and "1"s won't overflow and can be kept as bytes a little longer) and a different choice for some instructions (TIMTOWTDI, there's a real zoo of them). Apologies also (can't edit the parent) for calling AVX2 "2017+" -- of course it's older; and for using "len" in place of "aligned_len" above, once: it doesn't affect speed nor result, but is just unclean.
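      The arithmetic behind the 96-byte step, spelled out (my sketch of the reasoning): every byte maps to -1 or +1, so a partial sum over k bytes lies in [-k, +k]; that stays exact in a signed 8-bit lane for k <= 127, hence 3 x 32 = 96 is safe, while a fourth 32-byte block (k = 128) could already clip:

#include <assert.h>

int main( void ) {
    enum { INT8_TOP = 127, BLOCK = 32 };
    static_assert( 3 * BLOCK <= INT8_TOP, "96-byte sums fit in bytes" );
    static_assert( 4 * BLOCK >  INT8_TOP, "a 128-byte step would not" );
    return 0;
}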

String length: 10,000
              Rate/s     %
    c          88080   100
    va_single 811975   922

String length: 100,000
              Rate/s     %
    c           8945   100
    va_single 108085  1208

String length: 1,000,000
              Rate/s     %
    c            894   100
    va_single  10802  1208

String length: 10,000,000
              Rate/s     %
    c           88.4   100
    va_single  913.5  1033

String length: 100,000,000
              Rate/s     %
    c            8.8   100
    va_single   83.4   946

      However, there's no gain (compared to the "unoptimised" version in the parent node) with OMP (same CPU with 4 cores):

String length: 100,000,000
              Rate/s     %
    c            8.8   100
    va_omp     168.7  1910

      That, and the relative decrease in advantage for very long input (no additional memory allocation occurs, just the same simple processing of string chunks, sequentially), I can only speculate is the result of throttling of some kind.

      And by the way, I also tried placing the "mscore_c" C function in a separate file and compiling it with "-O3 -march=native" (can't do that with Inline::C, can I?), which googling suggests would optimise/vectorise to the best of the compiler's "voodoo"; then linking the library and calling the wrapped function from Inline::C. Well, for a uniform '"1"s only' string it gave a ~25% gain, but for a "random" string it is 3 times slower (than "-O2", i.e. compiled directly from within Inline::C). This is ridiculous -- "voodoo", "A.I.", "free" optimisations, and what not.

        can't do it with Inline::C, can I?

        Yes I can, no need for a separate source file; here's proof of my claims from the parent node:

use strict;
use warnings;
use feature 'say';
use Benchmark 'cmpthese';
use Config;

say "$^V / $Config{ archname } / $Config{ gccversion }";

my $L = 1e6;
my $str = '1' x $L;

BEGIN { our $c = << 'END_OF_C';

int %s( SV* str ) {
    STRLEN len;
    char* buf = SvPVbyte( str, len );
    len --;
    int acc = 0;
    int cum = 0;
    int max = -1;
    for( int i = 0; i < len; i ++ ) {
        int one = buf[ i ] - '0';
        acc += one;
        cum += one ? -1 : 1;
        if ( cum > max ) max = cum;
    }
    return acc + max + ( buf[ len ] == '1' );
}
END_OF_C
}

use Inline C => '
    #pragma GCC target( "avx2" )
    #pragma GCC optimize( "O3" )
' . sprintf our $c, 'mscore_optimised';

use Inline C => sprintf our $c, 'mscore_default';

my %tests = (
    default   => sub { mscore_default( $str )},
    optimised => sub { mscore_optimised( $str )},
);

say 'uniform string:';
cmpthese -1, \%tests;

substr $str, rand $L, 1, '0' for 1 .. $L;

say 'random string:';
cmpthese -1, \%tests;

__END__

v5.42.0 / MSWin32-x64-multi-thread / 13.2.0
uniform string:
             Rate   default optimised
default     897/s        --      -19%
optimised  1111/s       24%        --
random string:
             Rate optimised   default
optimised   245/s        --      -73%
default     897/s      266%        --

        OT:

        Cumulative sums are fun. Here's an application (sum "-1"s and "1"s, but with "saturation": don't fall below zero) to another task, PWC 346-1 "Longest Parenthesis". There are at least half a dozen identical (i.e. verbatim) Perl solutions; it doesn't matter whether that's because the task is trivial, or "AI" was involved, or the subroutine was copied. In fact, that Perl solution is significantly faster for medium and long inputs than a few others based on nested loops or regular expressions. Below is my translation of it to C. (I'm not fluent in C -- can it be written better? Hopefully I haven't put it at much of a disadvantage because of that.)
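        The "saturation" idea in one tiny, self-contained picture (a generic form of it; the actual pad[] encoding in the code further below differs in details): map '(' to +1 and ')' to -1, and never let the running sum drop below zero, so an unmatched ')' simply resets the count:

#include <stdio.h>
#include <string.h>

int main( void ) {
    const char* s = "())((())";
    int sum = 0;
    for ( size_t i = 0; i < strlen( s ); i ++ ) {
        sum += s[ i ] == '(' ? 1 : -1;
        if ( sum < 0 ) sum = 0;   /* saturate at zero */
        printf( "%d ", sum );     /* prints: 1 0 0 1 2 3 2 1 */
    }
    return 0;
}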

        Surprisingly, my "cumulative sum" solution is faster, though it involves a second sweep over the input and seemingly more arithmetic. For the shorter set -- 1000 random strings, each 1000 chars long:

v5.42.0 / MSWin32-x64-multi-thread / 13.2.0
       Rate  pwc  sum
pwc   199/s   -- -46%
sum   365/s  84%   --

        And for a set of 100 longer strings (100_000 chars each):

        Rate   pwc   sum
pwc  21.4/s    --  -56%
sum  48.2/s  125%    --

        Why C and, hence, OT? I had a Perl prototype, somewhat faster than the mentioned Perl PWC solution; then did a PDL prototype (~10x..20x faster, of course), but had to write a small PP fragment for it; and thought, why not then do it completely in C, anyway. I had no intention of translating the PWC Perl to C for competition, but I found a C++ solution in the GH PWC directory. Something to compile to machine code, hurray. Well, for a single string 1e6 chars long it took twenty seconds, no joking. (The C functions below are in milliseconds.)

        Perl code for the results above:

use strict;
use warnings;
use feature 'say';
use Config;
use Benchmark 'cmpthese';

say "$^V / $Config{ archname } / $Config{ gccversion }";

my $L = 1e5;
my $s = '(' x $L;
my @samples;

for ( 0 .. 99 ) {
    my $s1 = $s;
    for ( 0 .. $L - 1 ) {
        substr $s1, $_, 1, ')' if int rand 2
    }
    push @samples, $s1;
    die unless longest_p_pwc( $s1 ) == longest_p_sum( $s1 )
}

cmpthese -3, {
    pwc => sub { longest_p_pwc( $_ ) for @samples },
    sum => sub { longest_p_sum( $_ ) for @samples },
};

# The verbatim PWC Perl solution that "longest_p_pwc" below was
# translated from; kept for reference, not benchmarked here:
sub valid_longest_parenthesis {
    my $s = shift;
    my @stack = (-1);
    my $max_len = 0;

    for my $i (0 .. length($s) - 1) {
        if (substr($s, $i, 1) eq "(") {
            push @stack, $i;
        } else {
            pop @stack;
            if (@stack) {
                $max_len = $max_len > ($i - $stack[-1])
                    ? $max_len : ($i - $stack[-1]);
            } else {
                push @stack, $i;    # New starting point
            }
        }
    }
    return $max_len;
}

use Inline C => << 'END_OF_C';

int longest_p_pwc( SV* str ) {
    STRLEN len;
    char* buf = SvPVbyte( str, len );
    /* len + 1: room for the -1 sentinel plus up to len pushed indices
       (a plain "len" allocation would overflow by one on an all-'('
       input) */
    int* stack = malloc(( len + 1 ) * sizeof( int ));
    int stack_i = 0;
    stack[ 0 ] = -1;
    int max = 0;

    for ( int i = 0; i < len; i ++ ) {
        if ( buf[ i ] == '(' )
            stack[ ++ stack_i ] = i;
        else {
            if ( stack_i > 0 ) {
                stack_i --;
                int cur = i - stack[ stack_i ];
                if ( cur > max ) max = cur;
            }
            else stack[ 0 ] = i;
        }
    }
    free( stack );
    return max;
}

int longest_p_sum( SV* str ) {
    STRLEN len;
    char* buf = SvPVbyte( str, len );
    int* pad = malloc( len * sizeof( int ));
    int sum = 0;

    for ( int i = 0; i < len; i ++ )
        pad[ i ] = sum = sum + !sum + ( buf[ i ] == '(' ? 1 : -1 );

    int max = 0;
    int cur = 0;
    int counting = 0;
    int zeroless = 1;
    sum = 0;

    for ( int i = len - 1; i >= 0; i -- ) {
        zeroless = zeroless && pad[ i ];
        sum = zeroless
            ? sum + !sum + ( buf[ i ] == ')' ? 1 : -1 )
            : 1;
        if ( pad[ i ] && sum ) {
            cur ++;
            counting = 1;
        }
        else if ( counting ) {
            if ( cur > max ) max = cur;
            counting = 0;
            cur = 0;
        }
    }
    if ( cur > max ) max = cur;
    free( pad );
    return max;
}
END_OF_C