dstar has asked for the wisdom of the Perl Monks concerning the following question:

I have a need for a substr() like function that works on words, rather than characters; for purposes of this discussion, we'll assume a word is a sequence of non-whitespace characters separated by whitespace.

Tilly gave me a code snippet which I have hacked into the following:

if ($window_start > 0) {
    $text_body =~ /\S+/g foreach 1..$window_start;
    $start_index = pos($text_body);
} else {
    $start_index = 0;
}
$text_body =~ /\S+/g foreach 1..$window_size;
$end_index = pos($text_body);
$windowed_text = substr($text_body, $start_index, ($end_index - $start_index));

Here's my problem: The first match works perfectly. $start_index gets set appropriately for the $window_start'th word. The second match doesn't work at all, even if $window_start is 0 and thus the second match is the only match.

All I can think is that either I don't understand something, or I'm seeing what I know I wrote rather than what I wrote -- but I've tried to rule out the latter.

Replies are listed 'Best First'.
Re: Slicing a string on words
by George_Sherston (Vicar) on Aug 28, 2001 at 02:41 UTC
    Err... I don't know what to suggest, because as far as I can make out it does work:
    $window_start = 5;
    $window_size = 5;
    $text_body = "Please ignore all this text, this is the text wanted, please none of this text";
    if ($window_start > 0) {
        $text_body =~ /\S+/g foreach 1..$window_start;
        $start_index = pos($text_body);
    } else {
        $start_index = 0;
    }
    $text_body =~ /\S+/g foreach 1..$window_size;
    $end_index = pos($text_body);
    $windowed_text = substr($text_body, $start_index, ($end_index - $start_index));
    print $windowed_text; # prints " this is the text wanted,"
    You just gotta believe! If faith is not enough, then may I suggest you post the contents of your $vars?

    By the way, you could alter your last line slightly to $windowed_text = substr($text_body, $start_index + 1, ($end_index - $start_index - 1)) and then you lose the leading space.

    § George Sherston
      See this? This is my head banging into the wall. Wall, meet head. Head, wall.

      Here's what the problem seems to be: when I tested it, it was with a short string and a window large enough that the string ended well before the end of the window, i.e., a 20 word string with a 100 word window.

      Argh. I'm not sure where to go from here. Perhaps coffee is a good idea.

        The while loops in my update should bail out when they run out of matches.

        You may want to test on your own input, but it seems to handle text with as many words as the window, or fewer, just fine.
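A minimal sketch of the failure mode described above, assuming the cause is Perl's documented behaviour that a failed //g match resets pos() to undef unless /c is also given. The sample string is the one quoted later in the thread; the variable names $after_plain and $after_keep are made up for illustration.

```perl
use strict;
use warnings;

my $text = 'Testing news submission.';   # three words, 24 characters

# Ask for four words. The fourth match fails, and a failed //g match
# resets pos() to undef (a fifth attempt would restart from offset 0).
$text =~ /\S+/g foreach 1 .. 4;
my $after_plain = pos($text);

# With /c a failed match leaves pos() where the last success ended.
pos($text) = 0;
$text =~ /\S+/gc foreach 1 .. 4;
my $after_keep = pos($text);

print defined $after_plain ? $after_plain : 'undef', "\n";
print "$after_keep\n";
```

So with a window larger than the text, $end_index in the original code ends up either undef or pointing somewhere in a second pass over the string, depending on how the counts divide.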

Re: Slicing a string on words
by dga (Hermit) on Aug 28, 2001 at 02:34 UTC

    It seems you want a window x words big?

    @words=$text_body =~ /(\S+)/g;
    #compute $start and $end
    $newtext=join(" ",@words[$start..$end]);

    Update: Without copying. Note: I could not get the foreach version to work at all, so I had to use while with a counter.

    # $start and $window
    $i++ while($i<$start && /(\S+\s+)/g);
    $i=0;
    while(/\G(\S+\s+)/g && $i<$window_length) {
        $newtext .= "$1";
        $i++;
    }

    Note also: there are two styles of while here. You should probably pick the one you like best and standardize on it.

    This skips the first $start words and then assigns $window_length words to $newtext, preserving whitespace.

    Another Update: $start of 0 was not working but reversing the tests in the first while fixes that.

      Ah, left out a relevant bit of info: I need to preserve whitespace. So what I *really* need is the index of the end of the $window_start'th word, and the index of the ($window_start + $window_size)'th word.

      And these are potentially 1 meg strings, so I'd like to avoid copies if possible.

      Doesn't seem to work: Given $window_start of 0, $window_size of 100, and $text_body of 'Testing news submission.', it gives 'news'.

      It also seems to make....wait. Ok, I know where Testing is going and can fix that. It's not working on the last word because there's no whitespace after it. Would changing \s+ to \s* work?

        I think it would be safest (and safe to assume) that a 1 Meg string will be stored in a file and terminated with some sort of CR and/or LF, and in fact the pattern match I have requires that to be the case. The pattern would get a lot more complex if the string could end with a \S character.

        However you have noticed a problem with the original code with a start of 0. I will update that bit.
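One way to get the two indices without copying and without depending on trailing whitespace is to match \S+ alone and read pos() after each pass. This is only a sketch: word_window_offsets is a hypothetical helper, not code from the thread. It takes a reference so a 1 meg string is not copied into the sub, and it uses /gc so that running out of words leaves pos() at the end of the last word instead of resetting it.

```perl
use strict;
use warnings;

# Hypothetical helper: returns ($start_index, $end_index) for a window of
# $window_size words that begins after the first $window_start words.
# Side effect: it moves pos() on the caller's string.
sub word_window_offsets {
    my ($text_ref, $window_start, $window_size) = @_;

    pos($$text_ref) = 0;
    my $n = 0;
    $n++ while $n < $window_start && $$text_ref =~ /\S+/gc;
    my $start_index = pos($$text_ref);

    $n = 0;
    $n++ while $n < $window_size && $$text_ref =~ /\S+/gc;
    my $end_index = pos($$text_ref);

    return ($start_index, $end_index);
}

my $text_body = 'Testing news submission.';
my ($s, $e) = word_window_offsets(\$text_body, 0, 100);
print substr($text_body, $s, $e - $s), "\n";
```

If the text has fewer than $window_start words, both indices land at the end of the last word and the window is empty.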

(tye)Re: Slicing a string on words
by tye (Sage) on Aug 28, 2001 at 20:54 UTC

    Would something like this be simpler?

    $window_size--;
    my( $windowed_text )= $text_body =~ /^\s*(?:\S+\s+){$window_start}(\S+(?:(\s+\S+){0,$window_size})/;
    You might need some code to handle edge cases like 0 for $window_start or $window_size. Note that I used {0,$window_size} so that asking for too big of a window just matches through the last word.

    Note that my technique can also give you the same index information via @- and @+ if you have Perl v5.6 or higher.

            - tye (but my friends call me "Tye")
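A sketch of the @- and @+ idea, assuming Perl 5.6 or later. After a successful match, $-[1] and $+[1] hold the start and end offsets of the first capture group. The pattern below is written with balanced parentheses, the sample text is borrowed from elsewhere in the thread, and $extra, $from and $to are names invented for this example.

```perl
use strict;
use warnings;

my $text_body    = "Please ignore all this text, this is the text wanted, please none of this text";
my $window_start = 5;
my $window_size  = 5;

my $extra = $window_size - 1;   # words in the window after the first one
my ($from, $to);
if ($text_body =~ /^\s*(?:\S+\s+){$window_start}(\S+(?:\s+\S+){0,$extra})/) {
    ($from, $to) = ($-[1], $+[1]);   # offsets of capture group 1
    print "$from $to\n";                                   # 29 53
    print substr($text_body, $from, $to - $from), "\n";    # this is the text wanted,
}
```

The offsets let the caller work with the original string in place, for example with four-argument substr, rather than with the copy in $1.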

      Very nice solution, tye!!

      There's a small typo in the regex (a surplus parenthesis), so here goes the code again, corrected

      $window_size--;
      my( $windowed_text )= $text_body =~ /^\s*(?:\S+\s+){$window_start}(\S+(?:\s+\S+){0,$window_size})/;
      What I wanted to add is that Perl handles the boundary cases very nicely, so no extra handling required for $window_start = 0 or $window_size = 0. This means
      $_ = q/01234/;
      /^..{0}/;     # matches '0'
      /^..{0,0}/;   # matches '0'
      /^..{0,-1}/;  # doesn't match at all
      which is exactly what we need for the code to work fine.

      -- Hofmator

        Thanks. Note that $window_size of 0 will probably behave the same as $window_size of 1, though.

        Update: Sorry, wrong. I misunderstood your examples. /..{0,-1}/ matches any two characters followed by the literal string "{0,-1}". This could still be a problem for certain input values, but such seem pretty unlikely. So you might want some special code for the boundary case, depending on how varied your inputs might be.

                - tye (but my friends call me "Tye")
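The special code for the boundary case could be as small as an early return, so that a pattern containing {0,-1} is never built. This is a sketch under that assumption: word_window is a hypothetical wrapper, and returning an empty string for an empty or unmatched window is a choice made here, not something the thread specifies.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regex approach with explicit edge handling.
sub word_window {
    my ($text_body, $window_start, $window_size) = @_;

    # A window of zero words (or nonsense input) is empty by definition;
    # bail out before $window_size - 1 could become -1 inside the braces.
    return '' if $window_size < 1 || $window_start < 0;

    my $extra = $window_size - 1;
    my ($windowed_text) =
        $text_body =~ /^\s*(?:\S+\s+){$window_start}(\S+(?:\s+\S+){0,$extra})/;

    # No match means the text has $window_start words or fewer.
    return defined $windowed_text ? $windowed_text : '';
}

print word_window('Testing news submission.', 1, 1), "\n";
```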

      What if $window_start > 32767 ?

      It would seem with an average word length of say 5 that you could get more than 174,000 words per Megabyte of string input.
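      If the quantifier limit is the worry, one possible workaround is to split a large count into several smaller quantifiers, since (?:X){a}(?:X){b} matches the same text as (?:X){a+b}. The sketch below assumes that; skip_words_pattern is a hypothetical helper, and the default chunk of 30_000 is a guess at a value below the limit, which depends on the Perl version. A pos()-based loop avoids the quantifier entirely and may be the simpler choice for very large offsets.

```perl
use strict;
use warnings;

# Hypothetical helper: builds a pattern fragment that skips $count words
# using quantifiers no larger than $chunk each.
sub skip_words_pattern {
    my ($count, $chunk) = @_;
    $chunk ||= 30_000;   # assumed to be below the quantifier limit

    my $full = int($count / $chunk);   # number of full-size chunks
    my $rest = $count % $chunk;        # leftover words

    return ("(?:\\S+\\s+){$chunk}" x $full) . "(?:\\S+\\s+){$rest}";
}

# Small demonstration with an artificially tiny chunk size.
my $skip = skip_words_pattern(5, 2);
my ($word) = "a b c d e f g" =~ /^\s*$skip(\S+)/;
print "$word\n";
```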