Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Re: Re: Re: Efficient run determination.

by BrowserUk (Patriarch)
on Nov 15, 2002 at 10:14 UTC ( [id://213104]=note: print w/replies, xml ) Need Help??


in reply to Re: Re: Efficient run determination.
in thread Efficient run determination.

Nice one. Now send me a version that I don't have to compile:)

Seriously, thanks for doing the benchmark. I am doing the same right now, but am having trouble ensuring that I'm comparing apples with apples. I have one (perl) solution that hasn't been posted yet which might do better, but it's very difficult to benchmark as it relies on @- to get the positions and lengths and this system var goes out of scope very quickly.

Here is my simplistic test program run on a 1920 char string. One run only and a different string from yours so direct comparison is dodgey, but it finds 351 triplets in the 1920 chars string in a little under 0.4 of a second, which may mean it isn't doing so bad by comparison with your C version. Maybe you could include it into you benchmark (HARD!) or at least time one run of it using the same data as you did for your other tests for comparison.

#! perl -slw use strict; use re 'eval'; #! Took me ages to find this. use Benchmark::Timer; my $t = new Benchmark::Timer; #! Set up big regex. 1-time hit. my $re ='(?:(.)(??{"$+*"}))?' x 500; $re = qr/$re/o; # Test data(+2) x 32 to make it more comparable $_ = ' aaaa bbbbccccccccbbbb aaaabbbbbcdddddddddddddddddddddd' x +32; #! All the data is gathered here in one shot. #! Chars in @c - start positions in @-, lengths derived: length N = $- +[N+2] - $-[N+1] #! Caveat. lengths must be used immediately (or stored elsewhere which + costs) else the go stale:) $t->start('bigre'); my @c = m/$re/; #! THIS LINE DOES ALL THE WORK. #! This truncates the list to exclude null matches returned from regex +. $#c = $#- -1; $t->stop('bigre'); #! data accessed here. printf "('%1s', %3d, %3d)\n", $c[$_], $-[$_+1], ( $-[$_+2] || $+[0] ) +- $-[$_+1] for 0..$#c; print $#c, ' triplets found in '; $t->report; __END__ C:\test>212796-2 (' ', 0, 3) ('a', 3, 4) (' ', 7, 3) .... 345 values omitted .... ('b', 1892, 5) ('c', 1897, 1) ('d', 1898, 22) 351 triplets found in 1 trial of bigre (371ms total) C:\test>

The basic idea was to push the loop into the regex engine instead of external to it, and use the regex engines well tuned optimisations to do the data gathering for me. I originally though that using the /g modifier would do the trick, but it only returns the positions for the last match made. Then I thought of using a repeated regex ({1..n}) but again this is treated such that only the positions for the last capture group are available.

The answer was to construct a single large regex with enough capture groups to gather the data I needed. This was made harder by the need to only capture the first char of the group, but also to know its end position. That's when I discovered the (??{}) re and combined this with $+ in order to achieve my aims.

I'm not sure that this is good or maintainable Perl, but it seems to work and appears on preliminary tests to be very fast, which was and is the only criteria by which I am judging it.


Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
Pick up your cloud down the end and "Yes" if you get allocated a grey one they are a bit damp under foot, but someone has to get them.
Get used to the wings fast cos its an 8 hour day...unless the Govenor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

Replies are listed 'Best First'.
Re: Re: Re: Re: Efficient run determination.
by PhiRatE (Monk) on Nov 15, 2002 at 11:03 UTC
    As I noted, C is the language for the job. It doesn't really matter how efficient the regex solution, C can work on the data in-place without incurring much in the way of dynamic allocation costs. You have mitigated this disadvantage to an extent in your version by placing certain limitations on the algorithm, notably the 500 unit capture max. This changes the situation considerably. However the design you chose isn't competitive unfortunately :/ primarily I suspect because its not an optimised line within the re engine to be used like this, although that is conjecture sicne i'm not familiar with the internals.

    Length: 1920
    Enlil 2:    0.93203s
    Dingus 1:   0.530455s
    Dingus 2:   0.537765s
    Rasta 1:    1.973259s
    TommyW 1:   0.87257s
    Robartes 1: 0.996671s
    PhiRatE:    0.232084s
    BrowserUk:  2.764623s
    

    edit: note that this is for 100 iterations

    With the code done like this:

    # BrowserUk $t0 = [gettimeofday]; #! Set up big regex. 1-time hit. my $re ='(?:(.)(??{"$+*"}))?' x 500; $re = qr/$re/o; for (1..100) { @res = browseruk($stn); } print "BrowserUk: ".tv_interval( $t0 )."\n"; sub browseruk { $_ = shift; my @c = m/$re/; #! THIS LINE DOES ALL THE WORK. #! This truncates the list to exclude null matches returned from r +egex. $#c = $#- -1; return \@c; }

    I anticipated that perhaps the startup cost of the regex generation might be causing the performance problem, so I ran a 1000 unit test as well, against only my entry.

    Length: 1920
    PhiRatE:    2.241276s
    BrowserUk: 28.384692s
    

    Note that you got similar results, I'm running 100-1000 runs of the same line, you only did one :)

    My recommendation is either go with the Inline C one if you really need the speed, or Dingus' 2nd variant, which is the cleanest, closest perl variant.

      I don't agree with your timing method. Including the regex generation into the timing is wrong. I the real application, the regex would be set up once at the beginning of the program or even pregenerated and read from a file or even embedded in the source. It can then be reused over and over.

      In the application for which this is destined for use, it would be used on each of 500 x 500 char strings, and this process it repeated as fast as the data buffer can be filled with the aim of getting the whole down to less than 1 second, hence the need for speed.

      Would you consider timing the repetition loop without the regex setup which is basically a compile time cost not a runtime one? Without the start up costs, I think you'll see a different picture.

      As for the limitation on the number of matches, I used 500 as that is the limit of the string length for my application, but again this would be adjusted to need at startup.


      Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
      Pick up your cloud down the end and "Yes" if you get allocated a grey one they are a bit damp under foot, but someone has to get them.
      Get used to the wings fast cos its an 8 hour day...unless the Govenor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
      Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

        The cost of the startup thing is only included once, and even without it it makes no difference, the timing is the same, the startup cost is miniscule in comparison to the cost of the rest of the process.

        If you don't agree with me, feel free to run your own benchmarks, my program is included at the end of this message. Your algorithm, while interesting, is one of the slowest.

        Iterations: 100
        Length: 1920
        PhiRatE 1:  0.236097s  Perl/C
        PhiRatE 3:  0.234754s  Perl/C
        Dingus 1:   0.541398s  Perl
        PhiRatE 2:  0.543576s  Perl
        Dingus 2:   0.580746s  Perl + RE
        TommyW 1:   0.897865s  Perl + RE
        Enlil 2:    0.964746s  Perl + RE
        Robartes 1: 1.021243s  Perl
        Rasta 1:    2.015298s  Perl + RE
        BrowserUk:  2.764815s  Perl + RE
        

        Code for my benchmarking is here. Feel free to fiddle around to your liking.

        use Data::Dumper; use Time::HiRes qw( usleep ualarm gettimeofday tv_interval ); use re 'eval'; $stn = "aaaaaaammm38fdkkkkkkkk3,,,,,,,,,,sad909999999994lkllllllllllll +lz,,,,,,,,,dd888888882jk2kkd8d888d8djkjkjkjkkk3kk4k5kkkk65"; $iterations = 500; for (1..4) { $stn.=$stn; } print "Iterations: $iterations\n";~ print "Length: ".length($stn)."\n"; # Enlil 2 $t0 = [gettimeofday]; for (1..$iterations) { @res = enlil_2($stn); } print "Enlil 2: ".tv_interval( $t0 )."\n"; # Dingus 1 $t0 = [gettimeofday]; for (1..$iterations) { @res = dingus_1($stn); } print "Dingus 1: ".tv_interval( $t0 )."\n"; # Rasta 1 $t0 = [gettimeofday]; for (1..$iterations) { @res = rasta_1($stn); } print "Rasta 1: ".tv_interval( $t0 )."\n"; # TommyW 1 $t0 = [gettimeofday]; for (1..$iterations) { @res = tommyw_1($stn); } print "TommyW 1: ".tv_interval( $t0 )."\n"; # Robartes 1 $t0 = [gettimeofday]; for (1..$iterations) { @res = robartes_1($stn); } print "Robartes 1: ".tv_interval( $t0 )."\n"; # PhiRatE 1 $t0 = [gettimeofday]; for (1..$iterations) { @res = p_process($stn); } print "PhiRatE 1: ".tv_interval( $t0 )."\n"; # BrowserUk $t0 = [gettimeofday]; #! Set up big regex. 1-time hit. my $re ='(?:(.)(??{"$+*"}))?' x 500; $re = qr/$re/o; for (1..$iterations) { @res = browseruk($stn); } print "BrowserUk: ".tv_interval( $t0 )."\n"; # Dingus 2 $t0 = [gettimeofday]; for (1..$iterations) { @res = dingus_2($stn); } print "Dingus 2: ".tv_interval( $t0 )."\n"; # PhiRatE 2 $t0 = [gettimeofday]; for (1..$iterations) { @res = phirate_2($stn); } print "PhiRatE 2: ".tv_interval( $t0 )."\n"; # PhiRatE 3 $t0 = [gettimeofday]; for (1..$iterations) { @res = p_process_2($stn); } print "PhiRatE 3: ".tv_interval( $t0 )."\n"; sub browseruk { $_ = shift; my @c = m/$re/; #! THIS LINE DOES ALL THE WORK. #! This truncates the list to exclude null matches returned from r +egex. $#c = $#- -1; return \@c; } sub enlil_2 { my $string = shift; my @bah; while ($string =~ /((.)\2*)/g) { push (@bah, [$2,$-[1],$+[1] - $-[1]]); } return \@bah; } sub dingus_1 { my $string = shift; my (@res, $c, $p, $i); $p = 0; $c = substr($string,$p,1); for ($i=1; $i<length($string); $i++) { next if ($c eq substr($string,$i,1)); push (@res, [$c,$p,($i-$p)]); $c = substr($string,$i,1); $p = $i; } push (@res, [$c,$p,($i-$p)]); return \@res; } sub dingus_2 { my $string = shift; my (@res, $i); $i = 0; while ($string =~ /(.)\1*/g) { push (@res, [$1, $i, pos($string)-$i]); $i = pos($string); } return \@res; } sub rasta_1 { my $string = shift; my ($pp, $l, @res); $l = length($string); $pp = 0; while ($pp < $l) { $c = substr $string, $pp, 1; if ($string =~ /\G\Q$c\E+/gc) { push @res,[$c,$pp,pos($string) - $pp]; $pp = pos($string); } } return \@res; } sub tommyw_1 { my $string = shift; my $pos=0; my @triples=(); my @reps=$string=~/((.)\2*)/g; while (@reps) { my $hits=shift @reps; my $char=shift @reps; push @triples, [$char, $pos, length $hits]; $pos+=length $hits; } return \@triples; } sub robartes_1 { my $string = shift; my @res; my @listedstring= split//,$string; my $prev=shift @listedstring; my $currstart=my $index=0; for (@listedstring) { if ($_ eq $prev) { $index++; } else { push @res, [$prev, $currstart, $index-$currstart+1]; $currstart=++$index; $prev=$_; } } push @res, [$prev, $currstart, $index-$currstart+1]; return \@res; } sub phirate_2 { $_ = shift; my @res; my $count=0; my ($prev, $next); my $i=0; $prev = $next = chop($_); while ($next || $prev) { if ($prev eq $next) { $count++; } else { push @res,[$prev, $i=$count, $count]; $prev = $next; $count = 1; } $i++; $next = chop; } return \@res; } use Inline C => <<'END_OF_C_CODE'; void p_process(char *s) { char prev = 0; long count = 0; long pos = 0; long i=0; AV *array; Inline_Stack_Vars; Inline_Stack_Reset; while((*s != 0) || (prev != 0)) { if (count==0) { pos = i; prev = *s; count = 1; } else if (prev == *s) { count++; } else { array = newAV(); av_push(array,newSVpvn(&prev,1)); av_push(array,newSViv(pos)); av_push(array,newSViv(count)); Inline_Stack_Push(newRV_inc(array)); pos=i; prev = *s; count=1; } i++; s++; } Inline_Stack_Done; } void p_process_2(char *s) { char prev = 0; long count = 0; long i=0; AV *array; Inline_Stack_Vars; Inline_Stack_Reset; prev = *s; while((*s != 0) || (prev != 0)) { if (prev == *s) { count++; } else { array = newAV(); av_push(array,newSVpvn(&prev,1)); av_push(array,newSViv(i-count)); av_push(array,newSViv(count)); Inline_Stack_Push(newRV_inc(array)); prev = *s; count=1; } i++; s++; } Inline_Stack_Done; } END_OF_C_CODE

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://213104]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (4)
As of 2024-04-25 15:57 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found