Re: Slicing a string on words
by George_Sherston (Vicar) on Aug 28, 2001 at 02:41 UTC
Err... I don't know what to suggest, because as far as I can make out it does work:
$window_start = 5;
$window_size = 5;
$text_body = "Please ignore all this text, this is the text wanted, please none of this text";
if ($window_start > 0) {
    $text_body =~ /\S+/g foreach 1..$window_start;
    $start_index = pos($text_body);
} else {
    $start_index = 0;
}
$text_body =~ /\S+/g foreach 1..$window_size;
$end_index = pos($text_body);
$windowed_text = substr($text_body, $start_index, ($end_index - $start_index));
print $windowed_text;
#prints " this is the text wanted,"
You just gotta believe! If faith is not enough, then may I suggest you post the contents of your $vars?
By the way, when $window_start is greater than 0 you could alter your last line slightly to $windowed_text = substr($text_body, $start_index + 1, ($end_index - $start_index - 1)); then you lose the leading space.
§ George Sherston
See this? This is my head banging into the wall. Wall, meet head. Head, wall.
Here's what the problem seems to be: I tested it with a short string and a window large enough that the string ended well before the end of the window. For example, a 20-word string with a 100-word window.
Argh. I'm not sure where to go from here. Perhaps coffee is a good idea.
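One possible explanation, offered as a sketch rather than a diagnosis: in scalar context a failed m//g resets pos(), so once the text runs out the next match starts over from the beginning of the string and the saved indexes stop meaning what the code expects. The helper below (window_by_pos is a made-up name, not something from this thread) stops as soon as a match fails and remembers the last good position.

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $size words that follow the first
# $start words. Stops cleanly if the text runs out inside the window.
sub window_by_pos {
    my ($text, $start, $size) = @_;

    pos($text) = 0;
    for (1 .. $start) {
        # A failed //g match resets pos(), so give up instead of continuing.
        return '' unless $text =~ /\S+/g;
    }
    my $start_index = pos($text);

    my $end_index = $start_index;
    for (1 .. $size) {
        last unless $text =~ /\S+/g;    # text ended inside the window
        $end_index = pos($text);        # remember the last good position
    }

    my $window = substr($text, $start_index, $end_index - $start_index);
    $window =~ s/^\s+//;                # drop the separator before the first word
    return $window;
}
```

With a 3-word string and a 100-word window this returns whatever words exist after the start instead of wrapping around.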
Re: Slicing a string on words
by dga (Hermit) on Aug 28, 2001 at 02:34 UTC
@words=$text_body =~ /(\S+)/g;
#compute $start and $end
$newtext=join(" ",@words[$start..$end]);
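The "compute $start and $end" step is where an oversized window can bite, since a slice past the end of @words yields undefined elements. A minimal sketch, assuming 0-based $start and a word count in $size (window_by_words is an invented name):

```perl
use strict;
use warnings;

# Hypothetical helper: word-array version with the slice clamped to the
# words that actually exist. Runs of whitespace collapse to single spaces.
sub window_by_words {
    my ($text, $start, $size) = @_;
    my @words = $text =~ /(\S+)/g;

    # Nothing to return for an empty window or a start past the last word.
    return '' if $size < 1 || $start > $#words;

    my $end = $start + $size - 1;
    $end = $#words if $end > $#words;   # clamp to the last word

    return join ' ', @words[$start .. $end];
}
```

This copies every word into an array, so it trades memory for simplicity on very large strings.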
Update: Without copying. Note: I could not get the foreach version to work at all, so I had to use while with a counter.
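# expects the text in $_, plus $start and $window_length
$i++ while($i<$start && /(\S+\s+)/g);   # skip the first $start words
$i=0;
while(/\G(\S+\s+)/g && $i<$window_length)
{
    $newtext .= $1;   # keep each word together with its trailing whitespace
    $i++;
}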
Note also: this uses two forms of while (statement modifier and block). You should probably pick the one you like best and standardize on it.
This skips $start words and then assigns $window_length words to $newtext, preserving whitespace.
Another Update: $start of 0 was not working but reversing the tests in the first while fixes that.
I think it would be safest (and safe to assume) that a 1 Meg string will be stored in a file and terminated with some sort of CR and/or LF; in fact, the pattern match I have requires that, because every word must be followed by whitespace (\S+\s+).
The pattern would get a lot more complex if you could end a string with a \S type of entity.
However, you have noticed a problem with the original code with a start of 0. I will update that bit.
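One possible way to relax the trailing-whitespace requirement, as a sketch only: make the whitespace optional (\S+\s*) and anchor every step with \G, using /c so a failed match does not reset pos(). The name window_scan and its argument order are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: skip $start words, then collect up to $length words,
# keeping the original whitespace between them. The text does not need to
# end in whitespace.
sub window_scan {
    my ($text, $start, $length) = @_;
    my $newtext = '';

    $text =~ s/^\s+//;      # work on a copy without leading whitespace
    pos($text) = 0;

    # Count test first, so a $start of 0 skips nothing.
    my $i = 0;
    $i++ while $i < $start && $text =~ /\G\S+\s*/gc;
    return '' if $i < $start;           # fewer than $start words available

    $i = 0;
    while ($i < $length && $text =~ /\G(\S+\s*)/gc) {
        $newtext .= $1;                 # word plus any whitespace after it
        $i++;
    }
    return $newtext;
}
```

Because \S+ is greedy and each match resumes at \G, a word is never split; the result may carry the whitespace that followed the last word.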
(tye)Re: Slicing a string on words
by tye (Sage) on Aug 28, 2001 at 20:54 UTC
$window_size--;
my( $windowed_text )= $text_body =~
/^\s*(?:\S+\s+){$window_start}(\S+(?:(\s+\S+){0,$window_size})/;
You might need some code to handle edge cases like 0 for $window_start or $window_size. Note that I used {0,$window_size} so that asking for too big of a window just matches through the last word.
Note that my technique can also give you the same index information via @- and @+ if you have Perl v5.6 or higher.
-
tye
(but my friends call me "Tye")
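To illustrate the @- and @+ remark, here is a minimal sketch using a regex of the same shape, written with balanced parentheses. The sub name and return convention are assumptions for illustration; $-[1] and $+[1] hold the start and end offsets of the first capture group.

```perl
use strict;
use warnings;

# Hypothetical helper: return (window text, start offset, end offset), or an
# empty list when the text has too few words.
# Assumes $window_start >= 0 and $window_size >= 1, both integers.
sub window_with_offsets {
    my ($text_body, $window_start, $window_size) = @_;
    my $extra = $window_size - 1;       # words after the first captured one

    if ($text_body =~ /^\s*(?:\S+\s+){$window_start}(\S+(?:\s+\S+){0,$extra})/) {
        # Offsets of capture group 1 within $text_body.
        return ($1, $-[1], $+[1]);
    }
    return;
}
```

The offsets can be fed straight to substr($text_body, $start, $end - $start) to recover the same window without copying it out of the match.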
Very nice solution, tye!!
There's a small typo in the regex (a surplus parenthesis), so here is the code again, corrected:
$window_size--;
my( $windowed_text )= $text_body =~
/^\s*(?:\S+\s+){$window_start}(\S+(?:\s+\S+){0,$window_size})/;
What I wanted to add is that Perl handles the boundary cases very nicely, so no extra handling is required for $window_start = 0 or $window_size = 0. This means
$_ = q/01234/;
/^..{0}/; # matches '0'
/^..{0,0}/; # matches '0'
/^..{0,-1}/; # doesn't match at all
which is exactly what we need for the code to work fine.
-- Hofmator
Thanks. Note that $window_size of 0 will probably behave the same as $window_size of 1, though.
Update: Sorry, wrong. I misunderstood your examples. /..{0,-1}/ matches any two characters followed by the literal string "{0,-1}". This could still be a problem for certain input values, but such seem pretty unlikely. So you might want some special code for the boundary case, depending on how varied your inputs might be.
-
tye
(but my friends call me "Tye")
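If the inputs might be that varied, one way to add the special code is to check the numbers before they are interpolated into the quantifiers. This is only a sketch: safe_window is an invented name, and returning an empty string for a zero-word window is an assumption about the desired behaviour.

```perl
use strict;
use warnings;

# Hypothetical guard: build the quantifiers only from non-negative integers,
# and treat a window of zero words as an empty result.
sub safe_window {
    my ($text_body, $window_start, $window_size) = @_;

    for my $n ($window_start, $window_size) {
        die "window arguments must be non-negative integers\n"
            unless defined $n && $n =~ /\A\d+\z/;
    }
    return '' if $window_size == 0;     # avoids interpolating {0,-1}

    my $extra = $window_size - 1;
    my ($windowed_text) = $text_body =~
        /^\s*(?:\S+\s+){$window_start}(\S+(?:\s+\S+){0,$extra})/;

    return defined $windowed_text ? $windowed_text : '';
}
```

With the check in place, a negative or non-numeric value raises an error instead of silently turning the quantifier into literal text.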
What if $window_start > 32767?
With an average word length of, say, 5 characters plus one separating space (about 6 bytes per word), you could get more than 174,000 words per megabyte of string input.
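The arithmetic is 1,048,576 / 6, or roughly 174,762 words. The ceiling on a {n} quantifier depends on how perl was built, so the sketch below tries the counted regex first and falls back to counting words one at a time if compiling it fails. The name offset_after_words is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: offset of the word that follows the first $n words,
# or undef if the text does not contain $n words each followed by whitespace.
sub offset_after_words {
    my ($text, $n) = @_;

    # Compiling the counted quantifier dies when $n exceeds this perl's limit.
    my $re = eval { qr/^\s*(?:\S+\s+){$n}/ };
    if ($re) {
        return $text =~ $re ? $+[0] : undef;
    }

    # Fallback: step over the words one by one, which has no such limit.
    my $offset = $text =~ /^\s+/ ? $+[0] : 0;
    pos($text) = $offset;
    for (1 .. $n) {
        return undef unless $text =~ /\G\S+\s+/gc;
    }
    return pos($text);
}
```

Either path returns the same offset, which can then serve as the starting index for substr or for a \G-anchored match.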