George_Sherston has asked for the wisdom of the Perl Monks concerning the following question:

I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance method is:
s/^(.{200}).*$/$1/; s/^(.*)\s+\w*$/$1/;
... but I'd love to know a more elegant way, if some kind monk can oblige me.

§ George Sherston

Re: Regex: get first N characters but break at whitespace
by japhy (Canon) on Jan 14, 2002 at 20:14 UTC
    Look-ahead is your friend.
        ($chunk) = /^(.{1,200})(?<!\s)(?!\w)/;
    That matches as many characters as possible (up to 200), such that the last character is NOT whitespace, and the next character is NOT a word character.
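A minimal sketch of how the two look-around assertions pick the break point, using a short limit so the backtracking is easy to follow. The helper name first_words and its $limit parameter are made up for illustration; the post itself hard-codes 200 and matches against $_.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the look-around pattern from the post.
# $limit is an assumed parameter; the original uses a literal 200.
sub first_words {
    my ($text, $limit) = @_;
    # .{1,$limit} grabs as much as it can, then backtracks until:
    #   (?<!\s)  the captured chunk does not end in whitespace, and
    #   (?!\w)   the next character is not a word character
    #            (so the chunk does not stop in the middle of a word).
    my ($chunk) = $text =~ /^(.{1,$limit})(?<!\s)(?!\w)/;
    return $chunk;    # undef when nothing matches
}

my $sample = "The quick brown fox jumps over the lazy dog";
print first_words($sample, 12), "\n";    # prints "The quick"
```

With a limit of 12 the engine first tries "The quick br", rejects it because "o" follows, and keeps giving characters back until it reaches "The quick". Note that "." does not match a newline unless the /s modifier is added.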

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

(tye)Re: Regex: get first N characters but break at whitespace
by tye (Sage) on Jan 14, 2002 at 20:34 UTC

    my( $start )= /^(.{0,199}\S)\s/;

            - tye (but my friends call me "Tye")
Re: Regex: get first N characters but break at whitespace
by MZSanford (Curate) on Jan 14, 2002 at 20:23 UTC
    I use force, but slightly less brute ;)
    (untested code ahead)
    my $tmp = substr($_,0,200); $tmp =~ s/\s+\w*$//g;

    $ perl -e 'do() || ! do() ;' Undefined subroutine &main::try
      This appeals to the thug in me :) Q: is there a processor saving in using substr rather than a regex?

      § George Sherston
        A massive amount, substr has a hell of a lot simpler job to do.
        Benchmark: timing 500000 iterations of regexp, substr...
            regexp: 11 wallclock secs (10.16 usr +  0.00 sys = 10.16 CPU) @ 49236.83/s (n=500000)
            substr:  1 wallclock secs ( 0.90 usr +  0.00 sys =  0.90 CPU) @ 554938.96/s (n=500000)

        gav^

        Q: is there a processor saving in using substr rather than a regex?

        In general yes, but not if you have to combine the substr with a regex (as MZSanford does here).

        The benchmark shows that the pure regex approach suggested by tye is quickest for your problem, closely followed by japhy's version using fancier regex constructs. MZSanford's substr/substitute is slow (and a bit buggy, fixed that below :) because it tries to start the match at every interior whitespace. But you can improve on it:
            ($chunk) = substr($string,0,201) =~ /(.*)\s+\w*$/;
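A small sketch of what taking one character past the limit changes, which is presumably the fix referred to above (the benchmark code uses 201 rather than 200). The helper names cut_at_limit and cut_one_past_limit and the $limit parameter are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: cut exactly at $limit, then strip the trailing word.
sub cut_at_limit {
    my ($text, $limit) = @_;
    my $chunk = substr($text, 0, $limit);
    $chunk =~ s/\s+\w*$//;
    return $chunk;
}

# Hypothetical helper: cut one character past $limit, so that a word
# ending exactly at the limit is followed by its whitespace and survives.
sub cut_one_past_limit {
    my ($text, $limit) = @_;
    my $chunk = substr($text, 0, $limit + 1);
    $chunk =~ s/\s+\w*$//;
    return $chunk;
}

my $text = "one two three four";
print cut_at_limit($text, 7), "\n";          # prints "one"
print cut_one_past_limit($text, 7), "\n";    # prints "one two"
```

"one two" is exactly 7 characters, yet the first version throws "two" away because it cannot tell a complete word from a truncated one. Both versions still drop the last word when the whole input is shorter than the limit, so a length check before cutting may be worth adding.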

        Here are the results of the benchmark:

        Benchmark: running Hofmator, MZSanford, japhy, tye, each for at least 3 CPU seconds...
             Hofmator:  3 wallclock secs ( 2.99 usr +  0.01 sys =  3.00 CPU) @ 206100.67/s (n=618302)
            MZSanford:  4 wallclock secs ( 3.03 usr +  0.00 sys =  3.03 CPU) @ 55936.63/s (n=169488)
                japhy:  4 wallclock secs ( 3.00 usr +  0.00 sys =  3.00 CPU) @ 256036.67/s (n=768110)
                  tye:  4 wallclock secs ( 3.00 usr +  0.00 sys =  3.00 CPU) @ 292146.67/s (n=876440)

        generated by this code:

        #!/usr/bin/perl
        use Benchmark qw/timethese/;
        $string = q/Some text repeated / x 50;
        timethese(-3, {
            MZSanford => '$chunk = substr($string,0,201);$chunk =~ s/\s+\w*$//',
            Hofmator  => '($chunk) = substr($string,0,201) =~ /(.*)\s+\w*$/',
            japhy     => '($chunk) = $string =~ /^(.{1,200})(?<!\s)(?!\w)/;',
            tye       => '($chunk) = $string =~ /^(.{0,199}\S)\s/',
        });

        -- Hofmator

Re: Regex: get first N characters but break at whitespace
by AidanLee (Chaplain) on Jan 14, 2002 at 21:06 UTC
Re: Regex: get first N characters but break at whitespace
by broquaint (Abbot) on Jan 14, 2002 at 20:23 UTC
    If you want the first 200 characters, stopping at whitespace, something like this should do the job...
    ($text) = $var =~ /^(.{200})(?:\s+\w*)?/s; print $text.$/;
    This saves what is captured in the first set of parentheses into $text, and the match is the first 200 characters until it hits some whitespace followed by what looks like a word. I'm sure you could probably use some of perl's extended regexp capabilities, but that seems to do the job.
    HTH

    broquaint

    Update: apparently you can ;o)

Re: Regex: get first N characters but break at whitespace
by gav^ (Curate) on Jan 14, 2002 at 20:50 UTC
    I doubt this counts as more elegant, but...
    use strict;
    use warnings;

    my $text = q{
    I need to extract the first several words from a paragraph of text contained in a $var,
    so as to get the longest extract that's less than or equal to 200 characters.
    My brute-force-and-ignorance method is: blah some more text here etc etc
    };

    my $chunk;

    ($chunk) = $text =~ /^(.{1,200})(?<!\s)(?!\w)/;
    printf "[%d] %s\n", length($chunk), $chunk;

    $chunk = $text;
    $chunk =~ s/^(.{200}).*$/$1/;
    $chunk =~ s/^(.*)\s+\w*$/$1/;
    printf "[%d] %s\n", length($chunk), $chunk;

    $chunk = "";
    $text =~ s/^\s+//;
    foreach (split /\s+/, $text) {
        if (length($chunk) + length($_) <= 200) {
            $chunk .= $_ . " ";
        } else {
            last;
        }
    }
    chop $chunk;
    printf "[%d] %s\n", length($chunk), $chunk;
    Output:
    Use of uninitialized value in length at C:\temp\ws.pl line 15.
    Use of uninitialized value in printf at C:\temp\ws.pl line 15.
    [0] ''
    [242] '
    I need to extract the first several words from a paragraph of text contained in a $var,
    so as to get the longest extract that's less than or equal to 200 characters.
    My brute-force-and-ignorance method is: blah some more text here etc etc
    '
    [194] 'I need to extract the first several words from a paragraph of text contained in a $var, so as to get the longest extract that's less than or equal to 200 characters. My brute-force-and-ignorance'
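A likely reading of the [0] and [242] results, assuming the sample text really does contain newlines (including one right after the opening q{): by default "." does not match "\n", so both regex variants fail at the very first character and leave their input untouched or undefined. The sketch below shows the effect of the /s modifier on a shorter, made-up string with a limit of 25.

```perl
use strict;
use warnings;

# Made-up sample that starts with a newline, like the q{...} text above.
my $text = "\nfirst line of text\nsecond line of text\n";

# Without /s, "." cannot match the leading newline, so the match fails
# and $plain stays undefined.
my ($plain) = $text =~ /^(.{1,25})(?<!\s)(?!\w)/;

# With /s, "." also matches newlines and the pattern can cross lines.
my ($dotall) = $text =~ /^(.{1,25})(?<!\s)(?!\w)/s;

print defined $plain ? "matched\n" : "no match\n";    # prints "no match"
printf "[%d]\n", length $dotall;                      # prints "[19]"
```

The /s result still begins with the newline; stripping leading whitespace first (as the split-based version does with s/^\s+//) avoids spending part of the limit on it.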

    gav^