Regex: get first N characters but break at whitespace

George_Sherston has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regex: get first N characters but break at whitespace by japhy (Canon) on Jan 14, 2002 at 20:14 UTC
Look-ahead is your friend. `($chunk) = /^(.{1,200})(?<!\s)(?!\w)/;` That matches as many characters as possible (up to 200), such that the last character is NOT whitespace and the next character is NOT a word character. Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker. `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`
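For anyone who wants to watch the backtracking happen, here is a minimal sketch of the same pattern wrapped in a sub. The name `first_chunk` and the `$limit` parameter are illustrative assumptions, not part of the post above; a small limit is used so the result is easy to check by eye.

```perl
use strict;
use warnings;

# Hypothetical helper: the name and the $limit parameter are assumptions
# made for this sketch. It returns the longest prefix of $text, at most
# $limit characters long, whose last character is not whitespace and which
# is not followed by a word character.
sub first_chunk {
    my ($text, $limit) = @_;
    my ($chunk) = $text =~ /^(.{1,${limit}})(?<!\s)(?!\w)/;
    return $chunk;    # undef if the pattern did not match
}

# 12 characters would end inside "brown", so the regex backs off to the
# end of the previous word.
print first_chunk("The quick brown fox jumps", 12), "\n";    # prints "The quick"
```

Note that `(?!\w)` only rules out a following word character, so the chunk can also end just before punctuation (an apostrophe, say), not only before whitespace.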
(tye)Re: Regex: get first N characters but break at whitespace by tye (Sage) on Jan 14, 2002 at 20:34 UTC
`my( $start )= /^(.{0,199}\S)\s/;` - tye (but my friends call me "Tye")
Re: Regex: get first N characters but break at whitespace by MZSanford (Curate) on Jan 14, 2002 at 20:23 UTC
I use force, but slightly less brute ;) (untested code ahead) `my $tmp = substr($_,0,200); $tmp =~ s/\s+\w*$//g;` `$ perl -e 'do() || ! do() ;' Undefined subroutine &main::try`
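The substr-then-trim idea above can be sketched as a small sub. This is a sketch under assumptions, not the poster's code: the name `chunk_substr`, the `$limit` parameter, grabbing one extra character, using `\S*` rather than `\w*`, and the hard-cut fallback are all choices made here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: name, $limit parameter and fallback behaviour are
# assumptions of this sketch.
sub chunk_substr {
    my ($text, $limit) = @_;
    return $text if length($text) <= $limit;    # already short enough

    # Take one character more than the limit, so that a cut landing exactly
    # at the end of a word can be told apart from a cut inside a word.
    my $chunk = substr($text, 0, $limit + 1);

    # Drop the last run of whitespace and whatever partial word follows it.
    unless ($chunk =~ s/\s+\S*$//) {
        # No whitespace at all in the window: fall back to a hard cut.
        $chunk = substr($text, 0, $limit);
    }
    return $chunk;
}

print chunk_substr("The quick brown fox jumps", 12), "\n";    # prints "The quick"
```

Taking `$limit + 1` characters means a word that ends exactly at the limit is kept rather than thrown away with the trim.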
Re: Re: Regex: get first N characters but break at whitespace by George_Sherston (Vicar) on Jan 14, 2002 at 21:11 UTC
This appeals to the thug in me :) Q: is there a processor saving in using substr rather than a regex? § George Sherston
Re: Re: Re: Regex: get first N characters but break at whitespace by gav^ (Curate) on Jan 14, 2002 at 21:31 UTC
A massive amount, substr has a hell of a lot simpler job to do.
`Benchmark: timing 500000 iterations of regexp, substr...
    regexp: 11 wallclock secs (10.16 usr +  0.00 sys = 10.16 CPU) @ 49236.83/s (n=500000)
    substr:  1 wallclock secs ( 0.90 usr +  0.00 sys =  0.90 CPU) @ 554938.96/s (n=500000)`
gav^
Re3: Regex: get first N characters but break at whitespace by Hofmator (Curate) on Jan 15, 2002 at 15:57 UTC
"Q: is there a processor saving in using substr rather than a regex?" In general yes, but not if you have to combine the substr with a regex (as MZSanford does here). The benchmark shows that the pure regex approach suggested by tye is quickest for your problem, closely followed by japhy's version using fancier regex constructs. MZSanford's substr/substitute is slow (and a bit buggy, fixed that below :) because it tries to start the match at every interior whitespace. But you can improve on it: `($chunk) = substr($string,0,201) =~ /(.*)\s+\w*$/;`
Here are the results of the benchmark:
`Benchmark: running Hofmator, MZSanford, japhy, tye, each for at least 3 CPU seconds...
  Hofmator:  3 wallclock secs ( 2.99 usr +  0.01 sys =  3.00 CPU) @ 206100.67/s (n=618302)
 MZSanford:  4 wallclock secs ( 3.03 usr +  0.00 sys =  3.03 CPU) @ 55936.63/s (n=169488)
     japhy:  4 wallclock secs ( 3.00 usr +  0.00 sys =  3.00 CPU) @ 256036.67/s (n=768110)
       tye:  4 wallclock secs ( 3.00 usr +  0.00 sys =  3.00 CPU) @ 292146.67/s (n=876440)`
generated by this code:
`#!/usr/bin/perl
use Benchmark qw/timethese/;
$string = q/Some text repeated / x 50;
timethese(-3, {
    MZSanford => '$chunk = substr($string,0,201);$chunk =~ s/\s+\w*$//',
    Hofmator  => '($chunk) = substr($string,0,201) =~ /(.*)\s+\w*$/',
    japhy     => '($chunk) = $string =~ /^(.{1,200})(?<!\s)(?!\w)/;',
    tye       => '($chunk) = $string =~ /^(.{0,199}\S)\s/',
});`
-- Hofmator
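A speed comparison only means something if the candidates return the same thing, so here is a small sanity check on the benchmark's test string. It is a sketch under one assumption: the substr-based patterns are written with their quantifiers spelled out (`.*` and `\w*`), which is my reading of what was intended.

```perl
use strict;
use warnings;

my $string = q/Some text repeated / x 50;    # same test string as the benchmark

# Run each approach once and keep the chunk it produces.
my %result;
($result{tye})      = $string =~ /^(.{0,199}\S)\s/;
($result{japhy})    = $string =~ /^(.{1,200})(?<!\s)(?!\w)/;
($result{Hofmator}) = substr($string, 0, 201) =~ /(.*)\s+\w*$/;
($result{MZSanford} = substr($string, 0, 201)) =~ s/\s+\w*$//;

# Print the length of each chunk so the four can be compared.
printf "%-10s %d\n", $_, length($result{$_}) for sort keys %result;
```

On this input all four produce the same 199-character chunk ending in "text". That agreement is specific to this string; inputs with punctuation or with no whitespace at all can make the approaches diverge.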
Re: Re3: Regex: get first N characters but break at whitespace by MZSanford (Curate) on Jan 15, 2002 at 16:35 UTC
Re5: Regex: get first N characters but break at whitespace by Hofmator (Curate) on Jan 16, 2002 at 16:51 UTC
Re: Re3: Regex: get first N characters but break at whitespace by gav^ (Curate) on Jan 15, 2002 at 20:27 UTC
Re5: Regex: get first N characters but break at whitespace by Hofmator (Curate) on Jan 16, 2002 at 16:55 UTC
Re: Regex: get first N characters but break at whitespace by AidanLee (Chaplain) on Jan 14, 2002 at 21:06 UTC
gratuitous Super Search result: How can I split a line on word boundaries closest to a certain length?
Re: Regex: get first N characters but break at whitespace by broquaint (Abbot) on Jan 14, 2002 at 20:23 UTC
If you want the first 200 characters, stopping at whitespace, something like this should do the job... `($text) = $var =~ /^(.{200})(?:\s+\w)?/s; print $text.$/;` This saves what is captured in the first set of parentheses into `$text`, and the match is the first 200 characters until it hits some whitespace followed by what looks like a word. I'm sure you could probably use some of perl's extended regexp capabilities, but that seems to do the job. HTH broquaint Update: apparently you can ;o)
Re: Regex: get first N characters but break at whitespace by gav^ (Curate) on Jan 14, 2002 at 20:50 UTC
I doubt this counts as more elegant, but...
`use strict;
use warnings;
my $text = q{
I need to extract the first several words from a paragraph of text contained in a $var,
so as to get the longest extract that's less than or equal to 200 characters.
My brute-force-and-ignorance method is: blah some more text here etc etc
};
my $chunk;
($chunk) = $text =~ /^(.{1,200})(?<!\s)(?!\w)/;
printf "[%d] %s\n", length($chunk), $chunk;
$chunk = $text;
$chunk =~ s/^(.{200}).*$/$1/;
$chunk =~ s/^(.*)\s+\w*$/$1/;
printf "[%d] %s\n", length($chunk), $chunk;
$chunk = "";
$text =~ s/^\s+//;
foreach (split /\s+/, $text) {
    if (length($chunk) + length($_) <= 200) {
        $chunk .= $_ . " ";
    } else {
        last;
    }
}
chop $chunk;
printf "[%d] %s\n", length($chunk), $chunk;`
Output:
`Use of uninitialized value in length at C:\temp\ws.pl line 15.
Use of uninitialized value in printf at C:\temp\ws.pl line 15.
[0] ''
[242] '
I need to extract the first several words from a paragraph of text contained in a $var,
so as to get the longest extract that's less than or equal to 200 characters.
My brute-force-and-ignorance method is: blah some more text here etc etc
'
[194] 'I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance'`
gav^
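The uninitialized-value warnings in that output come from the sample text starting with a newline: without the `/s` modifier, `.` will not match a newline, so a pattern anchored with `^` cannot get started. A minimal sketch of one way around it follows. The sample string, the limit of 30 and the step of trimming leading whitespace first are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Short multi-line sample that, like the text above, begins with a newline.
my $text = "\nfirst line of text\nsecond line of text\n";

# Without /s, '.' cannot match the leading newline, so the match fails
# and $plain is left undefined.
my ($plain) = $text =~ /^(.{1,30})(?<!\s)(?!\w)/;

# Trim leading whitespace (an assumption about what is wanted), then let
# '.' match newlines by adding /s.
(my $trimmed = $text) =~ s/^\s+//;
my ($chunk) = $trimmed =~ /^(.{1,30})(?<!\s)(?!\w)/s;

print defined $plain ? "plain matched\n" : "plain failed\n";    # prints "plain failed"
print "[", length($chunk), "] $chunk\n";
```

With `/s` the chunk can contain embedded newlines; collapsing them to spaces, as the split/join loop above effectively does, is a separate decision.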