in reply to Regex: get first N characters but break at whitespace

I use force, but slightly less brute ;)
(untested code ahead)
my $tmp = substr($_,0,200); $tmp =~ s/\s+\w*$//g;

$ perl -e 'do() || ! do() ;' Undefined subroutine &main::try
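A minimal sketch of the substr-then-trim idea as a function. The name `truncate_at_space` and the default limit are illustrative and not from the thread. It takes one character more than the limit (as the later benchmark code does with 201) so that a word ending exactly at the limit is kept. Using `\S*` instead of `\w*`, so that a partial word containing punctuation is also dropped, is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper; name and default limit are illustrative.
sub truncate_at_space {
    my ($text, $limit) = @_;
    $limit = 200 unless defined $limit;

    # Short input needs no trimming.
    return $text if length($text) <= $limit;

    # Take one character past the limit, so a word that ends exactly
    # at the limit is followed by its whitespace and survives.
    my $tmp = substr($text, 0, $limit + 1);

    # Drop the last whitespace run and any partial word after it.
    # If there is no whitespace at all, fall back to a hard cut.
    $tmp =~ s/\s+\S*\z// or $tmp = substr($text, 0, $limit);
    return $tmp;
}

print truncate_at_space("the quick brown fox jumps", 10), "\n";
```

With a limit of 10 this prints `the quick`. A limit of 9 gives the same result, because the tenth character is the space that follows the word.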

Re: Re: Regex: get first N characters but break at whitespace
by George_Sherston (Vicar) on Jan 14, 2002 at 21:11 UTC
    This appeals to the thug in me :) Q: is there a processor saving in using substr rather than a regex?

    § George Sherston
      A massive amount, substr has a hell of a lot simpler job to do.
      Benchmark: timing 500000 iterations of regexp, substr...
          regexp: 11 wallclock secs (10.16 usr + 0.00 sys = 10.16 CPU) @ 49236.83/s (n=500000)
          substr:  1 wallclock secs ( 0.90 usr + 0.00 sys =  0.90 CPU) @ 554938.96/s (n=500000)

      gav^

      Q: is there a processor saving in using substr rather than a regex?

      In general yes, but not if you have to combine the substr with a regex (as MZSanford does here).

      The benchmark shows that the pure regex approach suggested by tye is quickest for your problem, closely followed by japhy's version using fancier regex constructs. MZSanford's substr/substitute is slow (and a bit buggy, fixed that below :) because the substitution tries to start the match at every interior whitespace. But you can improve on it: ($chunk) = substr($string,0,201) =~ /(.*)\s+\w*$/;

      Here are the results of the benchmark:

      Benchmark: running Hofmator, MZSanford, japhy, tye, each for at least 3 CPU seconds...
        Hofmator:  3 wallclock secs ( 2.99 usr + 0.01 sys = 3.00 CPU) @ 206100.67/s (n=618302)
       MZSanford:  4 wallclock secs ( 3.03 usr + 0.00 sys = 3.03 CPU) @ 55936.63/s (n=169488)
           japhy:  4 wallclock secs ( 3.00 usr + 0.00 sys = 3.00 CPU) @ 256036.67/s (n=768110)
             tye:  4 wallclock secs ( 3.00 usr + 0.00 sys = 3.00 CPU) @ 292146.67/s (n=876440)

      generated by this code:

      #!/usr/bin/perl
      use Benchmark qw/timethese/;

      $string = q/Some text repeated / x 50;

      timethese(-3, {
          MZSanford => '$chunk = substr($string,0,201);$chunk =~ s/\s+\w*$//',
          Hofmator  => '($chunk) = substr($string,0,201) =~ /(.*)\s+\w*$/',
          japhy     => '($chunk) = $string =~ /^(.{1,200})(?<!\s)(?!\w)/;',
          tye       => '($chunk) = $string =~ /^(.{0,199}\S)\s/',
      });

      -- Hofmator
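The benchmark above passes its code as strings, which Benchmark evals, so `$string` has to be a package global. A hedged variant, assuming code references under `use strict` are acceptable: it wraps the same four candidates as closures, checks that they agree on the benchmark string, and only then times them. The `%candidate` hash and the 1 CPU second run time are choices of this sketch. The extra sub call adds overhead, so absolute rates will not match the figures posted in the thread.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $string = q/Some text repeated / x 50;

# The four candidates from the thread, wrapped as closures over $string.
my %candidate = (
    MZSanford => sub { my $c = substr($string, 0, 201); $c =~ s/\s+\w*$//; $c },
    Hofmator  => sub { my ($c) = substr($string, 0, 201) =~ /(.*)\s+\w*$/; $c },
    japhy     => sub { my ($c) = $string =~ /^(.{1,200})(?<!\s)(?!\w)/; $c },
    tye       => sub { my ($c) = $string =~ /^(.{0,199}\S)\s/; $c },
);

# Check that every candidate returns the same chunk before timing anything.
my %seen;
$seen{ $candidate{$_}->() }++ for keys %candidate;
die "candidates disagree\n" if keys(%seen) != 1;

# Run each candidate for at least 1 CPU second and print a comparison table.
cmpthese(-1, \%candidate);
```

On this single-line input all four return the same 199-character chunk, so the agreement check passes and the table is printed.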

        A very good point. Mine did not save a lot. I did make one more stab, though I doubt it will outrun tye's regexp:
        (look ma, no regexp)
        # Code by itself :
        sub mz2 {
            $c = substr($string,0,201);
            $a = rindex($c,' ');
            ( $a == 201 ? $string : substr($string,0,$a));
        }

        ## Bench addition :
        MZS2 => '($chunk) = &mz2()',

        ## Bench output :
        C:\WINDOWS\DESKTOP>perl index
        Benchmark: running Hofmator, MZS2, MZSanford, japhy, tye, each for at least 3 CPU seconds...
          Hofmator:  4 wallclock secs ( 3.07 usr + 0.00 sys = 3.07 CPU) @ 37802.61/s (n=116054)
              MZS2:  4 wallclock secs ( 3.13 usr + 0.00 sys = 3.13 CPU) @ 44232.59/s (n=138448)
         MZSanford:  3 wallclock secs ( 3.19 usr + 0.00 sys = 3.19 CPU) @ 13159.87/s (n=41980)
             japhy:  3 wallclock secs ( 3.13 usr + 0.00 sys = 3.13 CPU) @ 49181.79/s (n=153939)
               tye:  4 wallclock secs ( 3.02 usr + 0.00 sys = 3.02 CPU) @ 51403.64/s (n=155239)

        As expected, no better. But worth the exercise. I did a quick test to make sure the output was the same, though I would normally use a regexp, as it would be clearer in most cases than all this oddness. I do not know if this will work on all input cases, your mileage may vary, etc, etc, etc ...
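One input case the rindex version does not cover: `rindex` returns -1 when no space is found, never 201, so the `$a == 201` test cannot succeed. A minimal sketch of a guarded variant follows. The name `chunk_by_rindex` and its arguments are hypothetical, and like mz2 it only looks for a literal space, not tabs or newlines.

```perl
use strict;
use warnings;

# Hypothetical rewrite of mz2 that takes its input as arguments.
sub chunk_by_rindex {
    my ($string, $limit) = @_;

    # Nothing to cut if the text already fits.
    return $string if length($string) <= $limit;

    # Look one character past the limit, as the benchmark code does.
    my $head = substr($string, 0, $limit + 1);
    my $pos  = rindex($head, ' ');

    # rindex returns -1 when there is no space at all: hard cut.
    return substr($string, 0, $limit) if $pos < 0;

    # Otherwise keep everything before the last space.
    return substr($string, 0, $pos);
}

my $string = q/Some text repeated / x 50;
print length(chunk_by_rindex($string, 200)), "\n";
```

On the benchmark string this prints 199, the same length the regex candidates produce for that input.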
        Sorry if I was misleading: I was timing a regexp against substr for getting the first 200 characters of a string.

        The thing is, it doesn't matter how fast you are if you are giving the incorrect result. I was taking into account that a paragraph would have a newline on the end of each line.

        original: [241] 'I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance method is: blah some more text here etc etc'
        japhy: [66] 'I need to extract the first several words from a paragraph of text'
        george: [241] 'I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance method is: blah some more text here etc etc'
        gav^: [194] 'I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance'
        Hofmator: [53] 'equal to 200 characters. My brute-force-and-ignorance'
        But as it turns out my code is 30 times slower than japhy's :)

        gav^
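The short results for japhy and Hofmator above are consistent with `.` not matching a newline by default. A sketch under that assumption, using made-up multi-line input rather than the paragraph from the thread: the same two patterns are run with and without the `/s` modifier, which lets `.` cross line boundaries.

```perl
use strict;
use warnings;

# Illustrative input: 20 lines of 20 characters, joined by newlines.
my $string = join "\n", ("some words on a line") x 20;

# Patterns as posted: '.' does not match "\n".
my ($japhy_plain) = $string =~ /^(.{1,200})(?<!\s)(?!\w)/;
my ($hof_plain)   = substr($string, 0, 201) =~ /(.*)\s+\w*$/;

# Same patterns with /s, so '.' may cross line boundaries.
my ($japhy_s) = $string =~ /^(.{1,200})(?<!\s)(?!\w)/s;
my ($hof_s)   = substr($string, 0, 201) =~ /(.*)\s+\w*$/s;

printf "%-12s [%d]\n", 'japhy',       length($japhy_plain);
printf "%-12s [%d]\n", 'japhy /s',    length($japhy_s);
printf "%-12s [%d]\n", 'Hofmator',    length($hof_plain);
printf "%-12s [%d]\n", 'Hofmator /s', length($hof_s);
```

Without `/s`, japhy's pattern stops at the end of the first line and Hofmator's returns only a fragment of the last line inside the 201-character window. With `/s`, both return the same chunk of just under 200 characters.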