AgentM has asked for the wisdom of the Perl Monks concerning the following question:

Whoa! I need a quick way to grab the first 30 space-divided words in a simple string. I was thinking of something along the lines of
$desc=~tr/$(*.?) {1,30}/???????/i;
I think I'm along the right lines here...
Yes, I have thought of split, but the way I see it, I would split and unnecessarily join again. Perhaps a simple for loop followed by a substr is the solution? After a brief discussion in the CBox, I decided that there must be a better solution, preferably a regex... any ideas?
AgentM Systems or Nasca Enterprises is not responsible for the comments made by AgentM- anywhere.

Replies are listed 'Best First'.
Re: 30 Spaces- 1 question
by extremely (Priest) on Oct 09, 2000 at 06:55 UTC

    My recommendation:

    my @w = split /\s+/,$text,31; pop @w; my $words30= join " ", @w;

    dchetlin's regex I like too: (I modified it a little)

    $text =~ m/^((?:\S+\s+){1,30})/; my $words30 = $1;

    Tho with that, the usual warnings about $1 still being set from a prior match (if this one fails) may need dealing with.

    If ABSOLUTELY sure there are no double spaces or other nastiness, you could use index() in a for loop too. But it's ugly so someone else can do that...

    --
    $you = new YOU;
    honk() if $you->love(perl)

      I appreciate your liking my REx, but be careful; that \s* is a `*' for a reason. Try yours out:

      [~] $ perl -wnle'/^((?:\S+\s+){1,30})/;print $1'
      one two three four five six seven
      one two three four five six

      versus mine:

      [~] $ perl -wnle'/^((?:\S+\s*){1,30})/;print $1'
      one two three four five six seven
      one two three four five six seven

      Also, your split solution has a minor problem:

      [~] $ perl -wnle'@w=split/\s+/,$_,31;pop@w;print join " ",@w'
      one two three four five
      one two three four

      (Granted, the problem specs weren't that great, but it seems reasonable to assume that if we have a line with less than 30 words, we don't want to throw away the last.)
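
A minimal sketch of one way to keep the split approach without losing the last word on short input: pop only when the limit was actually reached. The helper name `first_words` is hypothetical (not from the thread), and the sketch assumes the text has no leading or trailing whitespace.

```perl
use strict;
use warnings;

# first_words is a hypothetical helper name used only for illustration.
sub first_words {
    my ($text, $n) = @_;
    # Ask for one field more than needed; the extra field holds the remainder.
    my @w = split /\s+/, $text, $n + 1;
    # Drop the remainder only when the limit was actually reached.
    pop @w if @w > $n;
    return join " ", @w;
}

print first_words("one two three four five", 30), "\n";  # one two three four five
print first_words("one two three four five", 3),  "\n";  # one two three
```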

      -dlc

        Actually, I had rather taken it as gospel that the string was more than 30 chunks. =) good spot on that.

        I still hate throwing the regex engine at this problem tho. Maybe I'll Benchmark em all and post that. Make me feel better for being a goof.

        --
        $you = new YOU;
        honk() if $you->love(perl)

Re: 30 Spaces- 1 question
by extremely (Priest) on Oct 09, 2000 at 13:18 UTC
    Benchmarks, first the code: (GAK! See post from dchetlin)
    use strict;
    use Benchmark qw(cmpthese);

    my @x;
    $x[0]= "mark " x 35;
    $x[1]= "asdfasdfasdfasdfasdf " x 35;
    $x[2]= "as " x 31;
    $x[3]= "asdfasdfasdfasdfasdfasdf " x 100;
    $x[4]= "asdfasdfasdfasdfasdfasdf " x 10;

    # http://www.cpan.org/doc/manual/html/pod/perlfunc/index.html
    #
    # Well, that was no help... fricking http conventions
    #
    cmpthese (1000000, {
        'regexp' => '
            foreach (@x) {
                /^((?:\S+\s*){1,30})/;
                print $1;
            }
        ',
        'split_join' => '
            foreach (@x) {
                print join ( " ", (split " ", $_,31)[0..29] );
            }
        ',
        'for_index_substr' => '
            foreach (@x) {
                my $ind = index ($_, " ");
                for (0..29) {
                    last if $ind == -1;
                    $ind = index $_, " ", $ind;
                }
                print substr($_,0,$ind);
            }
        ',
    } );

    and now the results (basically, a huge fricking waste of time, we are talking a MILLION reps on p166 here.)

                        Rate       split_join for_index_substr           regexp
    split_join       69061/s               --              -7%              -7%
    for_index_substr 73910/s               7%               --              -1%
    regexp           74627/s               8%               1%               --
    ### second run, no changes... hmmm....
                        Rate for_index_substr       split_join           regexp
    for_index_substr 68027/s               --              -9%              -9%
    split_join       74349/s               9%               --              -0%
    regexp           74516/s              10%               0%               --
    

    At those rates, any of those would likely be waiting on the hard drive to feed them data. My dataset is about 3.5kB so that means 232MB/s feedrate on the WORST run there and 255MB/s on the best. =) Do it any way you want. Not going to matter, not one iota, in the long run.

    --
    $you = new YOU;
    honk() if $you->love(perl)

      Hmm. Sorry to be the parade-rainer, but the benchmark isn't actually measuring what it should be. @x is a lexical here, and when Benchmark takes those strings and evaluates them, @x is out of scope. That's why they're all going so fast; there's nothing to loop over.
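
The scoping point can be sketched in isolation. `run_snippet` below is a hypothetical stand-in for a module that compiles a code string in its own scope; it is not Benchmark's actual implementation. A string compiled there cannot see the caller's `my @x` and falls back to the (empty) package variable, while a code reference closes over the lexical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a module that runs either a code string or a
# code reference. It is defined before "my @x", so a string compiled in
# here has no view of that lexical.
sub run_snippet {
    my ($code) = @_;
    return $code->() if ref $code eq 'CODE';
    return eval "no strict 'vars'; $code";
}

my @x = ("a b c", "d e f");

# The string sees the empty package variable @main::x, not the lexical.
my $from_string = run_snippet('scalar @x');

# The closure captures the lexical @x where it was written.
my $from_closure = run_snippet(sub { scalar @x });

print "string sees $from_string element(s), closure sees $from_closure\n";
```

This is also why passing `sub { ... }` blocks to cmpthese, as in the corrected version, sidesteps the problem.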

      Also, and more minorly, your for_index_substr routine isn't working -- $_ is being re-aliased on the second for loop and the actual target string is being lost.
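
A small sketch of the re-aliasing issue, using made-up data: an inner `for` without a loop variable takes over `$_`, so the outer string is hidden until the inner loop ends. Naming the inner loop variable leaves `$_` alone.

```perl
use strict;
use warnings;

my @seen;
for ("one two three") {
    for (0 .. 1) {
        push @seen, $_;    # $_ is 0, then 1: the outer string is hidden here
    }
    for my $n (0 .. 1) {
        push @seen, $_;    # named loop variable, so $_ is still the string
    }
}

print join(", ", @seen), "\n";   # 0, 1, one two three, one two three
```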

      I took the liberty of making a couple of changes and re-running:

      use strict;
      use Benchmark qw(cmpthese);

      my @x;
      my @result;
      $x[0]= "mark " x 35;
      $x[1]= "asdfasdfasdfasdfasdf " x 35;
      $x[2]= "as " x 31;
      $x[3]= "asdfasdfasdfasdfasdfasdf " x 100;
      $x[4]= "asdfasdfasdfasdfasdfasdf " x 10;

      cmpthese (-5, {
          'regexp' => sub {
              my $i;
              foreach (@x) {
                  /^((?:\S+\s*){1,30})/;
                  $result[$i++]{REx} = $1;
              }
          },
          'split_join' => sub {
              my $i;
              foreach (@x) {
                  $result[$i++]{split} = join " ", (split " ", $_,31)[0..29];
              }
          },
          'for_index_substr' => sub {
              my $i;
              foreach (@x) {
                  my $ind = index ($_, " ");
                  for my $foo (0..28) {
                      last if $ind == -1;
                      $ind = index $_, " ", $ind + 1;
                  }
                  $result[$i++]{index} = substr($_,0,$ind);
              }
          },
      } );

      for (@result) {
          print "bad!" unless ($_{REx} eq $_{split} and $_{split} eq $_{index});
      }

      This produces:

                         Rate       split_join for_index_substr           regexp
      split_join       4294/s               --              -7%             -27%
      for_index_substr 4618/s               8%               --             -21%
      regexp           5849/s              36%              27%               --

      -dlc

        *sigh* That will teach me, eh? I should have run it with it not printing to STDERR and that piped to /dev/null =)

        Also, in my defense, it was after my bedtime =) Don't apologize for being sharp...

        What is worse is that I tested a working version of the for_index_substr but "cleaned it up" for the final run.

        I normally use vars ... to avoid the scope issue. =( *woe is me*

        OTOH, I don't get your numbers when running your code. Number one, your results check should be:

        foreach (@result) { print "bad!" unless ($_->{REx} eq $_->{split} and $_->{split} eq $_->{index}); }

        Those arrows are important. Second, the comparison will always fail because your regex includes the terminal space and the split version doesn't. Also, if there were multiple spaces or other types of whitespace they would fail but we both agreed to ignore that. =)
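
The mismatch is easy to see on a short made-up string, with a limit of 2 words standing in for 30. The regex capture keeps the separators verbatim, trailing ones included, while split/join normalizes them; trimming the capture is one possible way to make the two comparable when the input has only single spaces.

```perl
use strict;
use warnings;

my $text = "one two  three four";   # note the double space after "two"

# Regex version: the capture includes whatever whitespace follows the words.
my ($by_regex) = $text =~ /^((?:\S+\s*){1,2})/;

# Split version: words rejoined with exactly one space, nothing trailing.
my $by_split = join " ", (split " ", $text, 3)[0 .. 1];

# Strip trailing whitespace from the capture before comparing.
(my $trimmed = $by_regex) =~ s/\s+\z//;

print "[$by_regex] [$by_split] [$trimmed]\n";   # [one two  ] [one two] [one two]
```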

        Third, I still get equivalent results. (split_join_2 has the non-magical /\s+/ regex, just for fun.) I get this with cmpthese 20,000 (using cmpthese -10 gives equivalent results):

                           Rate split_join_2   split_join        regexp for_index_substr
        split_join_2     3205/s           --          -0%           -4%              -6%
        split_join       3205/s           0%           --           -4%              -6%
        regexp           3344/s           4%           4%            --              -2%
        for_index_substr 3413/s           6%           6%            2%               --

        Worse, the array slice hack on the end of (split...)[0..29] throws warnings on -w if there are too few items in the slice to join. Join doesn't like undef it appears =)
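
One way to keep the slice without the undef warnings, sketched with a made-up short string: clamp the upper index to the last element that actually exists before slicing.

```perl
use strict;
use warnings;

my $text  = "only four words here";
my @words = split " ", $text, 31;

# Clamp the upper slice index to the last real element, so the slice
# never produces undef for join to complain about.
my $last    = $#words < 29 ? $#words : 29;
my $first30 = join " ", @words[0 .. $last];

print "$first30\n";   # only four words here
```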

        All in all, I'm less enlightened now. Did I cut-n-paste wrong? I print debugged to verify that bits were working all thru it. *sigh*

        --
        $you = new YOU;
        honk() if $you->love(perl)

Re: 30 Spaces- 1 question
by dchetlin (Friar) on Oct 09, 2000 at 06:54 UTC
    tr is not the right tool here.

    if ($desc =~ m#((?:\S+\s*){1,30})#) {
        $first_30 = $1;
    } else {
        # Didn't match here
    }
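
For contrast, a short sketch of what tr/// does do, on a made-up string: it works on lists of individual characters, counting or mapping them one for one, and has no notion of patterns or quantifiers such as {1,30}.

```perl
use strict;
use warnings;

my $desc = "one two three";

# tr/// returns the number of characters matched; with an empty replacement
# list and no /d modifier the string itself is left unchanged.
my $spaces = ($desc =~ tr/ //);

# It maps characters one for one; it cannot express "a run of non-spaces".
(my $upper = $desc) =~ tr/a-z/A-Z/;

print "$spaces spaces, upper-cased: $upper\n";   # 2 spaces, upper-cased: ONE TWO THREE
```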

    -dlc

RE: 30 Spaces- 1 question
by Zarathustra (Beadle) on Oct 09, 2000 at 07:00 UTC
    for ( 1..90) { $line .= "word "; }
    ( $thirty = $line ) =~ s/^((\w+\s){1,30}).*$/$1/;
    print $thirty;
    That's my shoddy solution at any rate.
Re: 30 Spaces- 1 question
by Trimbach (Curate) on Oct 09, 2000 at 09:18 UTC
    How about this?
    @first_thirty[0..29] = split " ", $string;
    or, if you want the first 30 words as a single string you can:
    $first_thirty = join " ", (split " ", $string)[0..29];
    ...which doesn't avoid the split/join thing, but by specifying an array slice you might avoid the memory hit of splitting the entire string when all you want is the first 30 words.

    Gary Blackburn
    Trained Killer

      split has a form for avoiding oversplitting: you can give it a count of how many parts you want; see my first post.

      @array = split / /, $string, 31;

      OTOH, your array slice will fix my dropping the last item with pop foolishness...
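
A small sketch of how the limit and the slice combine, using a made-up string and a limit of 4 in place of 31: split stops after the requested number of fields and leaves the unsplit remainder in the last one, and the slice then discards that remainder. It assumes the string has at least three words.

```perl
use strict;
use warnings;

my $string = "a b c d e f";

# With a limit of 4, split produces at most 4 fields; the last field
# holds the rest of the string, unsplit.
my @parts = split / /, $string, 4;

# The slice keeps the first three words and drops the remainder.
my @first3 = @parts[0 .. 2];

print scalar(@parts), " parts, last is '$parts[-1]'\n";   # 4 parts, last is 'd e f'
print "@first3\n";                                        # a b c
```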

      --
      $you = new YOU;
      honk() if $you->love(perl)

Re: 30 Spaces- 1 question
by merlyn (Sage) on Oct 09, 2000 at 07:16 UTC