Re: Re: Regex: get first N characters but break at whitespace

Re: Re: Re: Regex: get first N characters but break at whitespace by gav^ (Curate) on Jan 14, 2002 at 21:31 UTC
A massive amount, substr has a hell of a lot simpler job to do.

    Benchmark: timing 500000 iterations of regexp, substr...
    regexp: 11 wallclock secs (10.16 usr + 0.00 sys = 10.16 CPU) @ 49236.83/s (n=500000)
    substr: 1 wallclock secs ( 0.90 usr + 0.00 sys = 0.90 CPU) @ 554938.96/s (n=500000)

gav^
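A minimal sketch of how a timing like this might be reproduced. The post does not show the code that was timed, so the two subs (`first_200_substr`, `first_200_regexp`), the pattern `/^(.{0,200})/s`, and the test string are assumptions for illustration. Only the idea of comparing substr against a regexp for grabbing the first 200 characters comes from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Hypothetical test data: a string comfortably longer than 200 characters.
my $string = 'Some text repeated ' x 50;

# Two ways to take the first 200 characters, ignoring word boundaries.
sub first_200_substr { return substr($_[0], 0, 200) }
sub first_200_regexp { my ($head) = $_[0] =~ /^(.{0,200})/s; return $head }

# Sanity check before timing: both candidates must agree.
die "results differ"
    unless first_200_substr($string) eq first_200_regexp($string);

# Run each candidate for at least 1 CPU second.
timethese(-1, {
    substr => sub { my $c = first_200_substr($string) },
    regexp => sub { my $c = first_200_regexp($string) },
});
```

The absolute rates will differ from machine to machine; only the ratio between the two is of interest.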
Re3: Regex: get first N characters but break at whitespace by Hofmator (Curate) on Jan 15, 2002 at 15:57 UTC
Q: is there a processor saving in using substr rather than a regex?

In general yes, but not if you have to combine the substr with a regex (as MZSanford does here). The benchmark shows that the pure regex approach suggested by tye is quickest for your problem, closely followed by japhy's version using fancier regex constructs. MZSanford's substr/substitute is slow (and a bit buggy, fixed that below :) because it tries to start the match at every interior whitespace. But you can improve on it:

`($chunk) = substr($string,0,201) =~ /(.*)\s+\w*$/;`

Here are the results of the benchmark:

    Benchmark: running Hofmator, MZSanford, japhy, tye, each for at least 3 CPU seconds...
    Hofmator: 3 wallclock secs ( 2.99 usr + 0.01 sys = 3.00 CPU) @ 206100.67/s (n=618302)
    MZSanford: 4 wallclock secs ( 3.03 usr + 0.00 sys = 3.03 CPU) @ 55936.63/s (n=169488)
    japhy: 4 wallclock secs ( 3.00 usr + 0.00 sys = 3.00 CPU) @ 256036.67/s (n=768110)
    tye: 4 wallclock secs ( 3.00 usr + 0.00 sys = 3.00 CPU) @ 292146.67/s (n=876440)

generated by this code:

    #!/usr/bin/perl
    use Benchmark qw/timethese/;

    # Package variables on purpose: timethese() evals the code strings
    # outside this file's lexical scope, so "my" variables would not be seen.
    $string = q/Some text repeated / x 50;

    # -3 means: run each candidate for at least 3 CPU seconds.
    timethese(-3, {
        MZSanford => '$chunk = substr($string,0,201);$chunk =~ s/\s+\w*$//',
        Hofmator  => '($chunk) = substr($string,0,201) =~ /(.*)\s+\w*$/',
        japhy     => '($chunk) = $string =~ /^(.{1,200})(?<!\s)(?!\w)/;',
        tye       => '($chunk) = $string =~ /^(.{0,199}\S)\s/',
    });

-- Hofmator
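Speed aside, it may be worth checking that the candidates actually agree on the result. The following is a small sketch, not part of the original benchmark: the `%extract` table and the output format are made up for illustration, and the Hofmator pattern is written as `/(.*)\s+\w*$/`, which is my reading of the snippet in this post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same test string as the benchmark in this post.
my $string = 'Some text repeated ' x 50;

# Each entry returns the chunk that approach extracts (undef if no match).
my %extract = (
    tye      => sub { my ($c) = $_[0] =~ /^(.{0,199}\S)\s/;            $c },
    japhy    => sub { my ($c) = $_[0] =~ /^(.{1,200})(?<!\s)(?!\w)/;   $c },
    Hofmator => sub { my ($c) = substr($_[0], 0, 201) =~ /(.*)\s+\w*$/; $c },
);

# Print the length and the tail of each chunk so differences stand out.
for my $name (sort keys %extract) {
    my $chunk = $extract{$name}->($string);
    printf "%-9s [%d] ...%s\n", $name, length($chunk), substr($chunk, -20);
}
```

On this particular input all three should report a 199-character chunk ending in "Some text". Inputs containing newlines or runs of whitespace are where they can start to disagree.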
Re: Re3: Regex: get first N characters but break at whitespace by MZSanford (Curate) on Jan 15, 2002 at 16:35 UTC
A very good point. Mine did not save a lot. I did make one more stab, though I doubt it will outrun tye's regexp (look ma, no regexp):

    # Code by itself:
    sub mz2 {
        $c = substr($string,0,201);
        $a = rindex($c,' ');
        # NB: rindex returns -1, never 201, when no space is found
        ( $a == 201 ? $string : substr($string,0,$a));
    }

    ## Bench addition:
    MZS2 => '($chunk) = &mz2()',

    ## Bench output:
    C:\WINDOWS\DESKTOP>perl index
    Benchmark: running Hofmator, MZS2, MZSanford, japhy, tye, each for at least 3 CPU seconds...
    Hofmator: 4 wallclock secs ( 3.07 usr + 0.00 sys = 3.07 CPU) @ 37802.61/s (n=116054)
    MZS2: 4 wallclock secs ( 3.13 usr + 0.00 sys = 3.13 CPU) @ 44232.59/s (n=138448)
    MZSanford: 3 wallclock secs ( 3.19 usr + 0.00 sys = 3.19 CPU) @ 13159.87/s (n=41980)
    japhy: 3 wallclock secs ( 3.13 usr + 0.00 sys = 3.13 CPU) @ 49181.79/s (n=153939)
    tye: 4 wallclock secs ( 3.02 usr + 0.00 sys = 3.02 CPU) @ 51403.64/s (n=155239)

As expected, no better. But worth the exercise. I did a quick test to make sure the output was the same, though I would normally use a regexp, as it would be clearer in most cases than all this oddness. I do not know if this will work on all input cases, your mileage may vary, etc, etc, etc ...
Re5: Regex: get first N characters but break at whitespace by Hofmator (Curate) on Jan 16, 2002 at 16:51 UTC
Your idea relaxes the requirements somewhat, as it uses a literal space instead of whitespace. Furthermore, you might be left with extra spaces at the end of your string. If neither of these things matters, then you are easily fastest like this (rindex can take a starting index as its 3rd argument!):

`$chunk = substr($string,0,rindex($string,' ',200));`

The benchmark shows a huge gain - but of course it's a bit unfair :) considering the different requirements ...

    Benchmark: running japhy, rindex, tye, each for at least 3 CPU seconds...
    japhy: 2 wallclock secs ( 3.00 usr + 0.00 sys = 3.00 CPU) @ 244478.67/s (n=733436)
    rindex: 3 wallclock secs ( 3.00 usr + 0.00 sys = 3.00 CPU) @ 842494.33/s (n=2527483)
    tye: 4 wallclock secs ( 3.39 usr + 0.00 sys = 3.39 CPU) @ 305243.95/s (n=1034777)

-- Hofmator
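One thing to watch with the one-liner: rindex returns -1 when there is no space at or before the starting index, and `substr($string,0,-1)` then drops only the last character. Below is a hedged sketch of a guarded variant. The helper name `chunk_rindex`, the choice to return short strings unchanged, and the hard cut when no space exists are all assumptions of mine, not something the thread settles.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: cut at the last literal space at or before index $max.
sub chunk_rindex {
    my ($string, $max) = @_;

    # Assumption: a string that already fits is returned unchanged.
    return $string if length($string) <= $max;

    # Third argument: start searching backwards from index $max.
    my $pos = rindex($string, ' ', $max);

    # Assumption: with no space to break at, fall back to a hard cut.
    return $pos < 0 ? substr($string, 0, $max) : substr($string, 0, $pos);
}

my $string = 'Some text repeated ' x 50;
printf "[%d]\n", length chunk_rindex($string, 200);

# A run of two spaces leaves one trailing space behind,
# which is the caveat mentioned in this post.
printf "'%s'\n", chunk_rindex('aaa  bbb', 5);
```

The second call returns `'aaa '` with a trailing space, where the whitespace-aware regexes would stop at `'aaa'`.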
Re: Re3: Regex: get first N characters but break at whitespace by gav^ (Curate) on Jan 15, 2002 at 20:27 UTC
Sorry if I was misleading; I was timing a regexp against substr for getting the first 200 characters of a string.

The thing is, it doesn't matter how fast you are if you are giving the incorrect result. I was taking into account that a paragraph would have a newline on the end of each line.

    original: [241] 'I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance method is: blah some more text here etc etc'
    japhy: [66] 'I need to extract the first several words from a paragraph of text'
    george: [241] 'I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance method is: blah some more text here etc etc'
    gav^: [194] 'I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance'
    Hofmator: [53] 'equal to 200 characters. My brute-force-and-ignorance'

But as it turns out my code is 30 times slower than japhy's :)

gav^
Re5: Regex: get first N characters but break at whitespace by Hofmator (Curate) on Jan 16, 2002 at 16:55 UTC
"The thing is, it doesn't matter how fast you are if you are giving the incorrect result. I was taking into account that a paragraph would have a newline on the end of each line."

This can be easily incorporated into the regex approaches. The problem so far is that `.` doesn't match a newline by default. If you want that, just add the /s modifier:

    timethese(-3, {
        Hofmator => '($chunk) = substr($string,0,201) =~ /(.*\S)\s/s',
        japhy    => '($chunk) = $string =~ /^(.{1,200})(?<!\s)(?!\w)/s;',
        tye      => '($chunk) = $string =~ /^(.{0,199}\S)\s/s',
    });

-- Hofmator
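To see what /s changes in practice, here is a small sketch using tye's pattern on a multi-line paragraph. The sample text is adapted from the output quoted earlier in the thread; exactly where its line breaks fall is my assumption, as are the variable names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample paragraph with a newline at the end of each line (assumed layout).
my $text = join "\n",
    'I need to extract the first several words from a paragraph of text',
    q{contained in a $var, so as to get the longest extract that's less than or},
    'equal to 200 characters. My brute-force-and-ignorance',
    'method is:',
    '';

# Without /s the dot cannot cross a newline, so the match stops at line one.
my ($without_s) = $text =~ /^(.{0,199}\S)\s/;

# With /s the dot matches newlines too, so the chunk can span several lines.
my ($with_s) = $text =~ /^(.{0,199}\S)\s/s;

printf "without /s: [%d]\n", length $without_s;   # prints [66]
printf "with /s:    [%d]\n", length $with_s;      # prints [194]
```

The two lengths line up with the [66] and [194] figures reported earlier in the thread, which suggests the missing /s was the whole difference.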