Remove all non alphanumeric characters excluding space, underscore and minus sign

ikkeniet has asked for the wisdom of the Perl Monks concerning the following question:

Re: Remove all non alphanumeric characters excluding space, underscore and minus sign by moritz (Cardinal) on Feb 13, 2012 at 12:35 UTC
`s/[^a-zA-Z0-9 _-]//g`

You can also use `tr` for such removals; that should be faster.

See also: perlretut

Perl 6 - second systems done right
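As a minimal sketch of the two approaches mentioned above (the sample string and the helper names `strip_with_subst` and `strip_with_tr` are made up for illustration and are not from the thread):

```perl
use strict;
use warnings;

# Hypothetical helper: delete every character that is not an ASCII
# letter, digit, space, underscore, or minus sign, using s///.
sub strip_with_subst {
    my ($text) = @_;
    $text =~ s/[^a-zA-Z0-9 _-]//g;
    return $text;
}

# Hypothetical helper: same result with tr///. The /c flag complements
# the listed set and /d deletes everything in that complement.
sub strip_with_tr {
    my ($text) = @_;
    $text =~ tr/a-zA-Z0-9 _-//cd;
    return $text;
}

my $sample = 'foo_bar-baz 42! (test) #1';
print strip_with_subst($sample), "\n";
print strip_with_tr($sample), "\n";
```

Both calls should leave only the letters, digits, spaces, underscore, and minus sign of the sample. Note that the character set is ASCII only, so accented letters are deleted as well.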
Re^2: Remove all non alphanumeric characters excluding space, underscore and minus sign by Tux (Canon) on Feb 13, 2012 at 15:20 UTC
Use Benchmark when in doubt. YMMV. The first thing I would do for speeding that up is adding a `+` after the character class, but the results below prove me wrong.

    $ cat test.pl
    #!/pro/bin/perl

    use strict;
    use warnings;

    use Benchmark "cmpthese";

    my $string = pack "A*" => map { chr (32 + int rand 95) } 0..1280;

    cmpthese (-1, {
        subsingle => '(my $s = $string) =~ s/[^a-zA-Z0-9 _-]//g',
        subplus   => '(my $s = $string) =~ s/[^a-zA-Z0-9 _-]+//g',
        tran      => '(my $s = $string) =~ tr/a-zA-Z0-9 _-//cd',
        });

    $ perl5.8.8 test.pl
                   Rate      tran   subplus subsingle
    tran      5938537/s        --       -1%       -2%
    subplus   5983722/s        1%        --       -1%
    subsingle 6052802/s        2%        1%        --

    $ perl5.10.1 test.pl
                   Rate      tran   subplus subsingle
    tran      5242880/s        --       -2%       -5%
    subplus   5344683/s        2%        --       -3%
    subsingle 5531478/s        6%        3%        --

    $ perl5.12.3 test.pl
                   Rate      tran   subplus subsingle
    tran      5044286/s        --       -2%       -4%
    subplus   5144881/s        2%        --       -2%
    subsingle 5242880/s        4%        2%        --

    $ perl5.14.1 test.pl
                   Rate      tran   subplus subsingle
    tran      5144881/s        --       -9%      -10%
    subplus   5663605/s       10%        --       -1%
    subsingle 5716536/s       11%        1%        --

Enjoy, Have FUN! H.Merijn
Re^3: Remove all non alphanumeric characters excluding space, underscore and minus sign (Benchmark--) by tye (Sage) on Feb 13, 2012 at 17:57 UTC
'use Benchmark' advice usually makes me shudder. As so often is the case, you have a tiny mistake in your code and so are benchmarking nearly identical do-nothing chunks of code.

    #!/usr/bin/perl -w
    use strict;
    use Benchmark "cmpthese";

    my $string = pack "A" => map { chr (32 + int rand 95) } 0..1024000;

    cmpthese( -1, {
        subsingle => '(my $s = $string) =~ s/[^a-zA-Z0-9 _-]//g',
        subplus   => '(my $s = $string) =~ s/[^a-zA-Z0-9 _-]+//g',
        tran      => '(my $s = $string) =~ tr/a-zA-Z0-9 _-//cd',
    } );

    warn "Second version\n";

    cmpthese( -1, {
        subsingle => sub { (my $s = $string) =~ s/[^a-zA-Z0-9 _-]//g },
        subplus   => sub { (my $s = $string) =~ s/[^a-zA-Z0-9 _-]+//g },
        tran      => sub { (my $s = $string) =~ tr/a-zA-Z0-9 _-//cd },
    } );

    __END__
    Use of uninitialized value in transliteration (tr///) at (eval 14) line 1.
    [about a million warnings]
    Use of uninitialized value in transliteration (tr///) at (eval 140) line 1.
    Second version
                  Rate   subplus subsingle      tran
    subplus   114916/s        --       -4%      -14%
    subsingle 119259/s        4%        --      -10%
    tran      133187/s       16%       12%        --
                   Rate   subplus subsingle      tran
    subplus   2171607/s        --       -4%      -70%
    subsingle 2256550/s        4%        --      -68%
    tran      7143583/s      229%      217%        --

The most important take-away from this should be that, even with Benchmark.pm going to extraordinary efforts to try to subtract out the "overhead", I had to resort to ridiculously long strings before it could really tell a difference between the three choices.

So you* are not going to notice a difference. When something takes 0.0000004 seconds for an extraordinarily long string, making it take only 0.0000001 seconds rarely actually matters (especially when you don't have extraordinarily long strings), especially since, outside of Benchmark.pm's imagined view of things, the overhead of actually getting to the point of running the regex or tr/// is going to swamp that 0.0000001-second fiction.

- tye
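For anyone rerunning the comparison, here is a hedged sketch of one way to set it up. It assumes the intent was a long random string: `pack "A"` and `pack "A*"` each consume only the first item of the list they are given, so `join` is used here instead. Code references are used so that the benchmarked code can see the lexical `$string`. The string length, the iteration count, and the `%variants` name are arbitrary choices for this example, and timings will vary by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark 'cmpthese';

# Build the test string with join; the length (100_000) is an arbitrary
# choice for this sketch.
my $string = join '', map { chr(32 + int rand 95) } 1 .. 100_000;

# Code references close over the lexical $string, unlike code passed
# to Benchmark as strings. Each variant returns its cleaned copy so the
# results can be compared.
my %variants = (
    subsingle => sub { (my $s = $string) =~ s/[^a-zA-Z0-9 _-]//g;  $s },
    subplus   => sub { (my $s = $string) =~ s/[^a-zA-Z0-9 _-]+//g; $s },
    tran      => sub { (my $s = $string) =~ tr/a-zA-Z0-9 _-//cd;   $s },
);

# 20 iterations per variant; raise the count, or pass a negative number
# (minimum CPU seconds), for steadier figures.
cmpthese(20, \%variants);
```

Checking that all three variants return the same string before trusting the timings is a cheap way to catch the kind of do-nothing benchmark described above.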
Re^4: Remove all non alphanumeric characters excluding space, underscore and minus sign (Benchmark--) by Eliya (Vicar) on Feb 13, 2012 at 19:27 UTC
Re^5: Remove all non alphanumeric characters excluding space, underscore and minus sign (Benchmark--) by tye (Sage) on Feb 13, 2012 at 20:01 UTC
Some notes below your chosen depth have not been shown here
Re: Remove all non alphanumeric characters excluding space, underscore and minus sign by BrowserUk (Patriarch) on Feb 13, 2012 at 12:43 UTC
> All accents should be removed too

You mean that you want the accented characters removed? Or just the accents from the accented characters?

If the latter, you'll need to explain the source of the strings and probably get into messing with encodings as a regex won't (easily) do that.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
The start of some sanity?
Re^2: Remove all non alphanumeric characters excluding space, underscore and minus sign by CountZero (Bishop) on Feb 13, 2012 at 13:25 UTC
Text::Unidecode can remove the accents, but leave the basic character in place.

First run the ~~`undidecode`~~ `unidecode` function on your string and then apply the regex.

Update: fixed a typo. Thanks BrowserUK.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
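The order of operations described above (strip accents first, then apply the regex) can be sketched as follows. This is an illustration under assumptions, not a definitive implementation: it calls `unidecode` from Text::Unidecode when that CPAN module is installed, and otherwise falls back to the core module Unicode::Normalize (decompose, then drop combining marks). The fallback is a swapped-in technique that is narrower than Text::Unidecode, since it only handles letters that decompose into a base letter plus accent. The helper names `strip_accents` and `clean` are made up.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Use Text::Unidecode (CPAN) if it is available on this system.
my $have_unidecode = eval { require Text::Unidecode; 1 };

# Hypothetical helper: turn accented letters into their base letters.
sub strip_accents {
    my ($text) = @_;
    return Text::Unidecode::unidecode($text) if $have_unidecode;

    # Fallback (assumption of this sketch): decompose each character,
    # then delete the combining marks (Unicode category Mn).
    (my $plain = NFD($text)) =~ s/\p{Mn}+//g;
    return $plain;
}

# Hypothetical helper: strip accents first, then remove everything that
# is not an ASCII letter, digit, space, underscore, or minus sign.
sub clean {
    my ($text) = @_;
    my $plain = strip_accents($text);
    $plain =~ s/[^a-zA-Z0-9 _-]//g;
    return $plain;
}

# "Café déjà-vu!" written with escapes so the source file stays ASCII.
my $sample = "Caf\x{e9} d\x{e9}j\x{e0}-vu!";
print clean($sample), "\n";
```

The input is assumed to be a decoded character string; bytes read from a file or socket would need decoding from their source encoding first.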
Re^3: Remove all non alphanumeric characters excluding space, underscore and minus sign by BrowserUk (Patriarch) on Feb 13, 2012 at 13:49 UTC
Will it work on Extended ANSI codepages? Or only Unicoded input?

Plus, it might be better to tell the OP rather than me since he's the one looking for it.

(ps. Is undidecoding, extracting that which makes Diddy men what they are, from their DNA? :)
Re^4: Remove all non alphanumeric characters excluding space, underscore and minus sign by CountZero (Bishop) on Feb 13, 2012 at 14:17 UTC
Re^5: Remove all non alphanumeric characters excluding space, underscore and minus sign by BrowserUk (Patriarch) on Feb 13, 2012 at 14:24 UTC
Some notes below your chosen depth have not been shown here
Re^3: Remove all non alphanumeric characters excluding space, underscore and minus sign by ikkeniet (Acolyte) on Feb 13, 2012 at 13:49 UTC
CountZero, thanks a lot! You made my day :-)