Re^2: Remove all non alphanumeric characters excluding space, underscore and minus sign

Use Benchmark when in doubt. YMMV. The first thing I would do to speed that up is add a + after the character class, but the results below prove me wrong.

$ cat test.pl
#!/pro/bin/perl

use strict;
use warnings;

use Benchmark "cmpthese";

my $string  = pack "A*" => map { chr (32 + int rand 95) } 0..1280;

cmpthese (-1, {
    subsingle   => '(my $s = $string) =~ s/[^a-zA-Z0-9 _-]//g',
    subplus     => '(my $s = $string) =~ s/[^a-zA-Z0-9 _-]+//g',
    tran        => '(my $s = $string) =~ tr/a-zA-Z0-9 _-//cd',
    });
$ perl5.8.8 test.pl
               Rate      tran   subplus subsingle
tran      5938537/s        --       -1%       -2%
subplus   5983722/s        1%        --       -1%
subsingle 6052802/s        2%        1%        --
$ perl5.10.1 test.pl
               Rate      tran   subplus subsingle
tran      5242880/s        --       -2%       -5%
subplus   5344683/s        2%        --       -3%
subsingle 5531478/s        6%        3%        --
$ perl5.12.3 test.pl
               Rate      tran   subplus subsingle
tran      5044286/s        --       -2%       -4%
subplus   5144881/s        2%        --       -2%
subsingle 5242880/s        4%        2%        --
$ perl5.14.1 test.pl
               Rate      tran   subplus subsingle
tran      5144881/s        --       -9%      -10%
subplus   5663605/s       10%        --       -1%
subsingle 5716536/s       11%        1%        --
$

Enjoy, Have FUN! H.Merijn


Re^3: Remove all non alphanumeric characters excluding space, underscore and minus sign (Benchmark--) by tye (Sage) on Feb 13, 2012 at 17:57 UTC
'use Benchmark' advice usually makes me shudder. As so often is the case, you have a tiny mistake in your code and so are benchmarking nearly identical do-nothing chunks of code.

#!/usr/bin/perl -w
use strict;
use Benchmark "cmpthese";

my $string = pack "A" => map { chr (32 + int rand 95) } 0..1024000;

cmpthese( -1, {
    subsingle   => '(my $s = $string) =~ s/[^a-zA-Z0-9 _-]//g',
    subplus     => '(my $s = $string) =~ s/[^a-zA-Z0-9 _-]+//g',
    tran        => '(my $s = $string) =~ tr/a-zA-Z0-9 _-//cd',
} );

warn "Second version\n";

cmpthese( -1, {
    subsingle   => sub { (my $s = $string) =~ s/[^a-zA-Z0-9 _-]//g },
    subplus     => sub { (my $s = $string) =~ s/[^a-zA-Z0-9 _-]+//g },
    tran        => sub { (my $s = $string) =~ tr/a-zA-Z0-9 _-//cd },
} );
__END__
Use of uninitialized value in transliteration (tr///) at (eval 14) line 1.
[about a million warnings]
Use of uninitialized value in transliteration (tr///) at (eval 140) line 1.
Second version
               Rate   subplus subsingle      tran
subplus    114916/s        --       -4%      -14%
subsingle  119259/s        4%        --      -10%
tran       133187/s       16%       12%        --
               Rate   subplus subsingle      tran
subplus   2171607/s        --       -4%      -70%
subsingle 2256550/s        4%        --      -68%
tran      7143583/s      229%      217%        --

The most important take-away from this should be that, even with Benchmark.pm going to extraordinary efforts to try to subtract out the "overhead", I had to resort to ridiculously long strings before it could really tell a difference between the three choices.

So you* are not going to notice a difference. When something takes 0.0000004 seconds for an extraordinarily long string, making it take only 0.0000001 seconds rarely actually matters (especially when you don't have extraordinarily long strings), especially since, outside of Benchmark.pm's imagined view of things, the overhead of actually getting to the point of running the regex or tr/// is going to swamp that 0.0000001-second fiction.

- tye
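The "uninitialized value" warnings above suggest what the tiny mistake is: a code string handed to Benchmark is compiled inside Benchmark.pm, so it cannot see a file-scoped `my $string` and silently works on an undefined package variable instead. Here is a minimal sketch of that effect, assuming this reading of the warnings is right. The helper `sees_string` and the flag `$main::seen` are made up for illustration and appear nowhere in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(timeit);

# File-scoped lexical, declared the same way as in the scripts above.
my $string = "some-data";

# Hypothetical helper: run a snippet once through Benchmark and report
# what the snippet stored in the package variable $main::seen.
sub sees_string {
    my ($code) = @_;
    $main::seen = undef;
    timeit(1, $code);
    return $main::seen;
}

# String form: $string here resolves to an (undefined) package variable,
# not to the lexical declared above.
my $as_text = sees_string('$main::seen = defined($string) ? 1 : 0');

# Closure form: the sub captures the lexical $string.
my $as_closure = sees_string(sub { $main::seen = defined($string) ? 1 : 0 });

print "string form sees \$string:  $as_text\n";
print "closure form sees \$string: $as_closure\n";
```

Declaring the variable with `our $string` would be another way to make it visible to the string form, at the cost of using a package global.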
Re^4: Remove all non alphanumeric characters excluding space, underscore and minus sign (Benchmark--) by Eliya (Vicar) on Feb 13, 2012 at 19:27 UTC
There's still another "tiny mistake" with your code, which is that the following snippet doesn't generate a string of length 1024001, but a string of length 1:

my $string = pack "A" => map { chr (32 + int rand 95) } 0..1024000;
print length($string);

Done properly, i.e. either using `pack "(A)*", ...`, or `join '', ...`, or the much more memory-friendly

my $string;
$string .= chr(32 + int rand 95) for 0..1024000;

I get the following quite different results:

            Rate subsingle   subplus      tran
subsingle 18.5/s        --      -15%      -93%
subplus   21.9/s       18%        --      -92%
tran       268/s     1345%     1121%        --
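The length problem is easy to check on a small list. The sketch below compares the string lengths produced by the different ways of building the test string; the helper `built_lengths` and its hash keys are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a string from @chars in several ways and
# return the resulting lengths, keyed by a short label.
sub built_lengths {
    my @chars = @_;

    my $appended = '';
    $appended .= $_ for @chars;

    return (
        pack_A      => length(pack("A",    @chars)),   # first item only, 1 char
        pack_A_star => length(pack("A*",   @chars)),   # first item only, whole item
        pack_group  => length(pack("(A)*", @chars)),   # one "A" per list item
        join        => length(join('', @chars)),
        append      => length($appended),
    );
}

my @chars = map { chr(32 + int rand 95) } 1 .. 10;
my %len   = built_lengths(@chars);

printf "%-12s %d\n", $_, $len{$_} for sort keys %len;
```

With ten single-character items, only the grouped template, `join` and the append loop should report a length of 10; both `"A"` and `"A*"` consume just the first list item.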
Re^5: Remove all non alphanumeric characters excluding space, underscore and minus sign (Benchmark--) by tye (Sage) on Feb 13, 2012 at 20:01 UTC
Thanks. I was short on time and posted with code I wouldn't use myself because it seemed to demonstrate the problem with the prior code. When I use Benchmark myself, I arrange for a way to test the code being benchmarked for exactly these types of reasons. I was fooled by seeing a 200% difference but I still should've rejected the code when the run time per operation was that low. Sorry for posting "in a hurry". Thanks for the correction. :) - tye
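One way to arrange such a test is to write each candidate as a sub that returns its result, check every candidate against a known input, and only then hand the subs to `cmpthese`. This is a rough sketch of that idea, not tye's actual setup; the names `%candidate` and `check_candidates` and the 10,000-character test string are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Test data built with join, so the length is what we expect.
my $string = join '', map { chr(32 + int rand 95) } 1 .. 10_000;

# Each candidate takes the input as an argument and returns the cleaned copy.
my %candidate = (
    subsingle => sub { (my $s = $_[0]) =~ s/[^a-zA-Z0-9 _-]//g;  $s },
    subplus   => sub { (my $s = $_[0]) =~ s/[^a-zA-Z0-9 _-]+//g; $s },
    tran      => sub { (my $s = $_[0]) =~ tr/a-zA-Z0-9 _-//cd;   $s },
);

# Hypothetical helper: die unless every candidate maps $input to $expected.
sub check_candidates {
    my ($input, $expected) = @_;
    for my $name (sort keys %candidate) {
        my $got = $candidate{$name}->($input);
        die "$name returned '$got', expected '$expected'\n"
            unless $got eq $expected;
    }
    return 1;
}

# Refuse to benchmark code or data that does not do what we think it does.
check_candidates('a!b@c #d_e-f', 'abc d_e-f');
die "test string has wrong length\n" unless length($string) == 10_000;

# Wrap each candidate in a closure over the real test string.
my %timed;
for my $name (keys %candidate) {
    my $code = $candidate{$name};
    $timed{$name} = sub { $code->($string) };
}

cmpthese(-1, \%timed);
```

The extra sub call adds the same overhead to every candidate, so it narrows the relative differences somewhat; that seems an acceptable price for knowing the timed code actually does the work.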
Re^6: Remove all non alphanumeric characters excluding space, underscore and minus sign (Benchmark--) by Tux (Canon) on Feb 13, 2012 at 22:22 UTC