in reply to Re3: Regex: get first N characters but break at whitespace
in thread Regex: get first N characters but break at whitespace

Sorry if I was misleading: I was timing a regexp against substr for getting the first 200 characters of a string.

The thing is, it doesn't matter how fast you are if you are giving the incorrect result. I was taking into account that a paragraph would have a newline on the end of each line.

original: [241] 'I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance method is: blah some more text here etc etc'
japhy: [66] 'I need to extract the first several words from a paragraph of text'
george: [241] 'I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance method is: blah some more text here etc etc'
gav^: [194] 'I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance'
Hofmator: [53] 'equal to 200 characters. My brute-force-and-ignorance'

But as it turns out my code is 30 times slower than japhy's :)

gav^

Re5: Regex: get first N characters but break at whitespace
by Hofmator (Curate) on Jan 16, 2002 at 16:55 UTC
    The thing is, it doesn't matter how fast you are if you are giving the incorrect result. I was taking into account that a paragraph would have a newline on the end of each line.

    This is easy to incorporate into the regex approaches. The problem so far is that . doesn't match a newline by default, so .* stops at the end of the first line. If you want it to match newlines as well, just add the /s modifier:

    timethese(-3, {
      Hofmator => '($chunk) = substr($string,0,201) =~ /(.*\S)\s/s',
      japhy    => '($chunk) = $string =~ /^(.{1,200})(?<!\s)(?!\w)/s;',
      tye      => '($chunk) = $string =~ /^(.{0,199}\S)\s/s',
    });
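
    To see the effect of /s in isolation, here is a minimal sketch. The sample text and the variable names $without_s and $with_s are made up for illustration (a paragraph with a newline at the end of each line, which is an assumption about the test data), and it uses tye's pattern from above:

```perl
use strict;
use warnings;

# Hypothetical sample: a paragraph with a newline between lines.
# q{} is used so that $var and the apostrophe stay literal.
my $string = join "\n",
    q{I need to extract the first several words from a paragraph of text},
    q{contained in a $var, so as to get the longest extract that's less than},
    q{or equal to 200 characters. My brute-force-and-ignorance method is:},
    q{blah some more text here etc etc};

# Without /s, "." cannot match "\n", so the capture cannot
# extend past the end of the first line.
my ($without_s) = $string =~ /^(.{0,199}\S)\s/;

# With /s, "." matches "\n" too, so the capture can span lines
# and is limited only by the 200-character cap.
my ($with_s) = $string =~ /^(.{0,199}\S)\s/s;

printf "without /s: [%d] '%s'\n", length $without_s, $without_s;
printf "with /s:    [%d] '%s'\n", length $with_s,    $with_s;
```

    With this sample, the first match should end at the first newline, while the second should run across lines and stop at the last word boundary that fits within 200 characters.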

    -- Hofmator