in reply to Re: Re: Regex: get first N characters but break at whitespace
in thread Regex: get first N characters but break at whitespace

Q: is there a processor saving in using substr rather than a regex?

In general yes, but not if you have to combine the substr with a regex (as MZSanford does here).

The benchmark shows that the pure regex approach suggested by tye is quickest for your problem, closely followed by japhy's version using fancier regex constructs. MZSanford's substr/substitute is slow (and a bit buggy; I fixed that below :) because the regex engine tries to start the match at every interior whitespace. But you can improve on it: ($chunk) = substr($string,0,201) =~ /(.*)\s+\w*$/
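To make the comparison concrete, here is a minimal sketch that runs tye's pattern and the improved substr variant on the same input as the benchmark and prints what each one keeps. The variable names $by_regex and $by_substr are made up for illustration and are not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same test input as the benchmark: a 19-character phrase repeated 50 times.
my $string = q/Some text repeated / x 50;

# tye's pattern: up to 199 arbitrary characters, then one non-whitespace
# character, and the character after the capture must be whitespace.
my ($by_regex) = $string =~ /^(.{0,199}\S)\s/;

# Improved substr variant: cut to 201 characters first, then let the greedy
# (.*) back off until only whitespace plus a partial word remains.
my ($by_substr) = substr($string, 0, 201) =~ /(.*)\s+\w*$/;

# Show the length and the tail of each result.
printf "regex:  [%d] ...%s\n", length $by_regex,  substr($by_regex,  -20);
printf "substr: [%d] ...%s\n", length $by_substr, substr($by_substr, -20);
```

On this input both approaches should return the same chunk, cut at the last whitespace that still fits within 200 characters.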

Here are the results of the benchmark:

Benchmark: running Hofmator, MZSanford, japhy, tye, each for at least 3 CPU seconds...
  Hofmator:  3 wallclock secs ( 2.99 usr +  0.01 sys =  3.00 CPU) @ 206100.67/s (n=618302)
 MZSanford:  4 wallclock secs ( 3.03 usr +  0.00 sys =  3.03 CPU) @ 55936.63/s (n=169488)
     japhy:  4 wallclock secs ( 3.00 usr +  0.00 sys =  3.00 CPU) @ 256036.67/s (n=768110)
       tye:  4 wallclock secs ( 3.00 usr +  0.00 sys =  3.00 CPU) @ 292146.67/s (n=876440)

generated by this code:

#!/usr/bin/perl
use Benchmark qw/timethese/;

$string = q/Some text repeated / x 50;

timethese(-3, {
    MZSanford => '$chunk = substr($string,0,201);$chunk =~ s/\s+\w*$//',
    Hofmator  => '($chunk) = substr($string,0,201) =~ /(.*)\s+\w*$/',
    japhy     => '($chunk) = $string =~ /^(.{1,200})(?<!\s)(?!\w)/;',
    tye       => '($chunk) = $string =~ /^(.{0,199}\S)\s/',
});

-- Hofmator

Re: Re3: Regex: get first N characters but break at whitespace
by MZSanford (Curate) on Jan 15, 2002 at 16:35 UTC
    A very good point. Mine did not save a lot. I did make one more stab, though I doubt it will outrun tye's regexp:
    (look ma, no regexp)
    # Code by itself :
    sub mz2 {
        $c = substr($string,0,201);
        $a = rindex($c,' ');
        ( $a == 201 ? $string : substr($string,0,$a));
    }

    ## Bench addition :
    MZS2 => '($chunk) = &mz2()',

    ## Bench output :
    C:\WINDOWS\DESKTOP>perl index
    Benchmark: running Hofmator, MZS2, MZSanford, japhy, tye, each for at least 3 CPU seconds...
      Hofmator:  4 wallclock secs ( 3.07 usr +  0.00 sys =  3.07 CPU) @ 37802.61/s (n=116054)
          MZS2:  4 wallclock secs ( 3.13 usr +  0.00 sys =  3.13 CPU) @ 44232.59/s (n=138448)
     MZSanford:  3 wallclock secs ( 3.19 usr +  0.00 sys =  3.19 CPU) @ 13159.87/s (n=41980)
         japhy:  3 wallclock secs ( 3.13 usr +  0.00 sys =  3.13 CPU) @ 49181.79/s (n=153939)
           tye:  4 wallclock secs ( 3.02 usr +  0.00 sys =  3.02 CPU) @ 51403.64/s (n=155239)

    As expected, no better. But worth the exercise. I did a quick test to make sure the output was the same, though I would normally use a regexp, as it would be clearer in most cases than all this oddness. I do not know if this will work on all input cases, your mileage may vary, etc, etc, etc ...

      Your idea relaxes the requirements somewhat as it uses literal space instead of whitespace. Furthermore you might be left with extra spaces at the end of your string.

      If both of these things don't matter then you are easily fastest like this (rindex can take a starting index as its 3rd argument!): $chunk = substr($string,0,rindex($string,' ',200));
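The one-liner above assumes a space is always found. A minimal sketch of a guarded version follows; the helper name chunk_by_rindex, its $max parameter, and the fallback behaviour for short strings and for strings without any space are assumptions added for illustration, not part of the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): longest prefix of at most $max
# characters that ends just before a literal space.
sub chunk_by_rindex {
    my ($string, $max) = @_;

    # Assumption: a string that already fits is returned unchanged.
    return $string if length($string) <= $max;

    # Search backwards for a space, starting at offset $max.
    my $pos = rindex($string, ' ', $max);

    # rindex returns -1 when no space is found; fall back to a hard cut.
    return substr($string, 0, $max) if $pos < 0;

    # Everything before the space. Earlier spaces are not trimmed, so the
    # result can still end in whitespace.
    return substr($string, 0, $pos);
}

my $chunk = chunk_by_rindex(q/Some text repeated / x 50, 200);
printf "[%d] ...%s\n", length $chunk, substr($chunk, -20);
```

As in the one-liner, only literal spaces count as break points here, so tabs and newlines are treated like any other character.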

      The benchmark shows a huge gain - but of course it's a bit unfair :) considering the different requirements ...

      Benchmark: running japhy, rindex, tye, each for at least 3 CPU seconds...
           japhy:  2 wallclock secs ( 3.00 usr +  0.00 sys =  3.00 CPU) @ 244478.67/s (n=733436)
          rindex:  3 wallclock secs ( 3.00 usr +  0.00 sys =  3.00 CPU) @ 842494.33/s (n=2527483)
             tye:  4 wallclock secs ( 3.39 usr +  0.00 sys =  3.39 CPU) @ 305243.95/s (n=1034777)

      -- Hofmator

Re: Re3: Regex: get first N characters but break at whitespace
by gav^ (Curate) on Jan 15, 2002 at 20:27 UTC
    Sorry if I was misleading: I was timing a regexp against substr for getting the first 200 characters of a string.

    The thing is, it doesn't matter how fast you are if you are giving the incorrect result. I was taking into account that a paragraph would have a newline on the end of each line.

    original: [241] 'I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance method is: blah some more text here etc etc'
    japhy: [66] 'I need to extract the first several words from a paragraph of text'
    george: [241] 'I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance method is: blah some more text here etc etc'
    gav^: [194] 'I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance'
    Hofmator: [53] 'equal to 200 characters. My brute-force-and-ignorance'

    But as it turns out my code is 30 times slower than japhy's :)

    gav^

      The thing is, it doesn't matter how fast you are if you are giving the incorrect result. I was taking into account that a paragraph would have a newline on the end of each line.

      This can be easily incorporated into the regex approaches. The problem so far is that .* doesn't match a newline. If you want that just add the /s modifier:

      timethese(-3, {
          Hofmator => '($chunk) = substr($string,0,201) =~ /(.*\S)\s/s',
          japhy    => '($chunk) = $string =~ /^(.{1,200})(?<!\s)(?!\w)/s;',
          tye      => '($chunk) = $string =~ /^(.{0,199}\S)\s/s',
      });
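The effect of /s can be seen in a minimal sketch like the one below, which applies tye's pattern with and without the modifier. The two-line sample text and the variable names are hypothetical, chosen only so that a newline falls well inside the first 200 characters.

```perl
use strict;
use warnings;

# Hypothetical paragraph: a short first line, then enough words to run
# well past 200 characters.
my $text = "first line of text\n" . ("word " x 60);

# Without /s, '.' cannot cross the newline, so the capture is confined
# to the first line.
my ($without_s) = $text =~ /^(.{0,199}\S)\s/;

# With /s, '.' also matches "\n", so the capture can span several lines.
my ($with_s) = $text =~ /^(.{0,199}\S)\s/s;

printf "without /s: [%d] '%s'\n", length $without_s, $without_s;
printf "with /s:    [%d]\n", length $with_s;
```

Note that with /s the captured chunk keeps its embedded newlines; whether to collapse them afterwards depends on how the extract is displayed.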

      -- Hofmator