Re: Regex: get first N characters but break at whitespace

Re: Re: Regex: get first N characters but break at whitespace by George_Sherston (Vicar) on Jan 14, 2002 at 21:11 UTC
This appeals to the thug in me :) Q: is there a processor saving in using substr rather than a regex? § George Sherston
Re: Re: Re: Regex: get first N characters but break at whitespace by gav^ (Curate) on Jan 14, 2002 at 21:31 UTC
A massive amount, substr has a hell of a lot simpler job to do.
`Benchmark: timing 500000 iterations of regexp, substr...
    regexp: 11 wallclock secs (10.16 usr + 0.00 sys = 10.16 CPU) @ 49236.83/s (n=500000)
    substr: 1 wallclock secs ( 0.90 usr + 0.00 sys = 0.90 CPU) @ 554938.96/s (n=500000)`
gav^
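For anyone who wants to reproduce this kind of comparison, here is a minimal sketch. It assumes the two candidates simply take the first 200 characters with no word-boundary handling. The sample string, the sub names `by_substr` and `by_regexp`, and the iteration count are placeholders rather than the code that produced the numbers above, and timings will differ by machine and Perl version.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample input; any string longer than 200 characters will do.
my $string = 'Some text repeated ' x 50;

# Two ways to take the first 200 characters, with no word-boundary logic.
sub by_substr { return substr($string, 0, 200) }
sub by_regexp { my ($chunk) = $string =~ /^(.{0,200})/s; return $chunk }

# Sanity check: both must agree before the timings mean anything.
die "results differ\n" unless by_substr() eq by_regexp();

# Run each sub a fixed number of times and print a comparison table.
cmpthese(500_000, {
    substr => \&by_substr,
    regexp => \&by_regexp,
});
```

Checking that both subs return the same string before timing them is cheap, and it avoids benchmarking two pieces of code that do different things.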
Re3: Regex: get first N characters but break at whitespace by Hofmator (Curate) on Jan 15, 2002 at 15:57 UTC
Q: is there a processor saving in using substr rather than a regex?
In general yes, but not if you have to combine the substr with a regex (as MZSanford does here). The benchmark shows that the pure regex approach suggested by tye is quickest for your problem, closely followed by japhy's version using fancier regex constructs. MZSanford's substr/substitute is slow (and a bit buggy, fixed that below :) because it tries to start the match at every interior whitespace. But you can improve on it:
`($chunk) = substr($string,0,201) =~ /(.*)\s+\w*$/;`
Here are the results of the benchmark:
`Benchmark: running Hofmator, MZSanford, japhy, tye, each for at least 3 CPU seconds...
  Hofmator: 3 wallclock secs ( 2.99 usr + 0.01 sys = 3.00 CPU) @ 206100.67/s (n=618302)
 MZSanford: 4 wallclock secs ( 3.03 usr + 0.00 sys = 3.03 CPU) @ 55936.63/s (n=169488)
     japhy: 4 wallclock secs ( 3.00 usr + 0.00 sys = 3.00 CPU) @ 256036.67/s (n=768110)
       tye: 4 wallclock secs ( 3.00 usr + 0.00 sys = 3.00 CPU) @ 292146.67/s (n=876440)`
generated by this code:
`#!/usr/bin/perl
use Benchmark qw/timethese/;
$string = q/Some text repeated / x 50;
timethese(-3, {
    MZSanford => '$chunk = substr($string,0,201);$chunk =~ s/\s+\w*$//',
    Hofmator  => '($chunk) = substr($string,0,201) =~ /(.*)\s+\w*$/',
    japhy     => '($chunk) = $string =~ /^(.{1,200})(?<!\s)(?!\w)/;',
    tye       => '($chunk) = $string =~ /^(.{0,199}\S)\s/',
});`
-- Hofmator
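Before trusting the timings, it is worth confirming that the four candidates return the same chunk. The sketch below does that for the benchmark's sample string. It assumes the substr-based patterns are meant to read `\w*` and `(.*)` (with the star quantifiers), and it wraps each expression in an anonymous sub, which is my own arrangement and not part of the benchmark script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same sample string as the benchmark.
my $string = q/Some text repeated / x 50;

# Each candidate wrapped in a sub so the results can be compared.
my %extract = (
    MZSanford => sub { my $chunk = substr($string, 0, 201); $chunk =~ s/\s+\w*$//; $chunk },
    Hofmator  => sub { my ($chunk) = substr($string, 0, 201) =~ /(.*)\s+\w*$/; $chunk },
    japhy     => sub { my ($chunk) = $string =~ /^(.{1,200})(?<!\s)(?!\w)/; $chunk },
    tye       => sub { my ($chunk) = $string =~ /^(.{0,199}\S)\s/; $chunk },
);

# Print the length and the tail of each result for a quick visual comparison.
for my $name (sort keys %extract) {
    my $chunk = $extract{$name}->();
    printf "%-10s [%d] ...%s\n", $name, length($chunk), substr($chunk, -20);
}
```

This only proves agreement on one single-line input. Text with embedded newlines or runs of whitespace can make the candidates diverge.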
Re: Re3: Regex: get first N characters but break at whitespace by MZSanford (Curate) on Jan 15, 2002 at 16:35 UTC
A very good point. Mine did not save a lot. I did make one more stab, though I doubt it will outrun tye's regexp (look ma, no regexp):
`# Code by itself :
sub mz2 {
    $c = substr($string,0,201);
    $a = rindex($c,' ');
    ( $a == 201 ? $string : substr($string,0,$a));
}
## Bench addition :
MZS2 => '($chunk) = &mz2()',
## Bench output :
C:\WINDOWS\DESKTOP>perl index
Benchmark: running Hofmator, MZS2, MZSanford, japhy, tye, each for at least 3 CPU seconds...
  Hofmator: 4 wallclock secs ( 3.07 usr + 0.00 sys = 3.07 CPU) @ 37802.61/s (n=116054)
      MZS2: 4 wallclock secs ( 3.13 usr + 0.00 sys = 3.13 CPU) @ 44232.59/s (n=138448)
 MZSanford: 3 wallclock secs ( 3.19 usr + 0.00 sys = 3.19 CPU) @ 13159.87/s (n=41980)
     japhy: 3 wallclock secs ( 3.13 usr + 0.00 sys = 3.13 CPU) @ 49181.79/s (n=153939)
       tye: 4 wallclock secs ( 3.02 usr + 0.00 sys = 3.02 CPU) @ 51403.64/s (n=155239)`
As expected, no better. But worth the exercise. I did a quick test to make sure the output was the same, though I would normally use a regexp, as it would be clearer in most cases than all this oddness. I do not know if this will work on all input cases, your mileage may vary, etc, etc, etc ...
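One input case worth guarding against: `rindex` returns -1 when the substring is not found, and inside a 201-character chunk it can never return 201. A hedged variant of the same substr/rindex idea that handles short strings and strings with no space might look like this. The name `first_words`, its parameters, and the hard-cut fallback are my assumptions, not part of the posted code, and only literal spaces are treated as break points.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: longest prefix of $string, at most $max characters,
# that ends just before a space. Only literal spaces count as breaks here.
sub first_words {
    my ($string, $max) = @_;
    return $string if length($string) <= $max;

    # Search one character past the limit so a space sitting exactly
    # at position $max still counts as a valid break.
    my $cut = rindex(substr($string, 0, $max + 1), ' ');

    # rindex returns -1 when there is no space at all; fall back to a hard cut.
    return $cut < 0 ? substr($string, 0, $max) : substr($string, 0, $cut);
}

my $string = q/Some text repeated / x 50;
my $chunk  = first_words($string, 200);
print length($chunk), ": ...", substr($chunk, -20), "\n";
```

Unlike the regex versions, this does not trim a run of several spaces before the cut, so the result can end in whitespace on such input.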
Re5: Regex: get first N characters but break at whitespace by Hofmator (Curate) on Jan 16, 2002 at 16:51 UTC
Re: Re3: Regex: get first N characters but break at whitespace by gav^ (Curate) on Jan 15, 2002 at 20:27 UTC
Sorry if I was misleading, I was timing using a regexp vs substr to get the first 200 characters of a string. The thing is, it doesn't matter how fast you are if you are giving the incorrect result. I was taking into account that a paragraph would have a newline on the end of each line.
`original: [241] 'I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance method is: blah some more text here etc etc'
japhy: [66] 'I need to extract the first several words from a paragraph of text'
george: [241] 'I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance method is: blah some more text here etc etc'
gav^: [194] 'I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance'
Hofmator: [53] 'equal to 200 characters. My brute-force-and-ignorance'`
But as it turns out my code is 30 times slower than japhy's :)
gav^
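The short results above are what you would expect when `.` refuses to match a newline. Here is a small sketch of that effect, using a made-up three-line paragraph and a limit of 30 instead of 200 so the output stays readable. Adding the `/s` modifier is one plausible way to let the pattern cross line breaks; it is an assumption on my part and not necessarily the fix used in the code timed above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical three-line paragraph; the limit of 30 keeps the output short.
my $text = "first line of text\nsecond line of text\nthird line here";

# japhy-style pattern without /s: '.' stops at the first newline.
my ($plain) = $text =~ /^(.{1,30})(?<!\s)(?!\w)/;

# Same pattern with /s, so '.' may cross line breaks.
my ($dotall) = $text =~ /^(.{1,30})(?<!\s)(?!\w)/s;

# substr-plus-regex style without /s: only the last line of the chunk is captured.
my ($tail) = substr($text, 0, 31) =~ /(.*)\s+\w*$/;

# Show the length and content of each result.
printf "%-8s [%d] '%s'\n", @$_ for
    [ 'plain',  length($plain),  $plain  ],
    [ 'dotall', length($dotall), $dotall ],
    [ 'tail',   length($tail),   $tail   ];
```

Without `/s` the first pattern returns only the first line and the substr variant returns only the last line of the chunk, which mirrors the [66] and [53] results in the listing.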
Re5: Regex: get first N characters but break at whitespace by Hofmator (Curate) on Jan 16, 2002 at 16:55 UTC