in reply to Re: Re: Extracting a substring of N chars ignoring embedded HTML
in thread Extracting a substring of N chars ignoring embedded HTML

Ah yes! Good catch, graff. The big problem, as you noted, was that I was parsing the whole text; simply adding another last after the chop(...) fixes that. The other problem was splitting on a single space, which normalizing whitespace like you've done fixes. Lastly, calling return_text all those times is really bad, as you noted, so I just assigned it to a variable (and used as_is). Now, while I still like yours better, here's an updated version of mine :)
    while ( my $token = $p->get_token ) {
        if ($token->is_text) {
            my $text = $token->as_is;
            $text =~ s/\s+/ /g;    # normalize whitespace
            if (length($text) + $total <= 200) {
                $doc2 .= $text;
                $total += length($text);
            }
            else {
                # add whole words until the 200-char budget runs out
                for (split / /, $text) {
                    if ($total + length($_) <= 200) {
                        $doc2 .= $_ . ' ';
                        $total += length($_) + 1;
                    }
                    else {
                        last;
                    }
                }
                chop($doc2) if $doc2 =~ /\s$/;
                last;    # stop parsing once the limit is reached
            }
        }
        else {
            $doc2 .= $token->as_is;    # tags don't count toward the total
        }
    }
Thanks for the feedback!
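
For anyone who wants to poke at the counting logic without installing a parser, here is a rough, core-Perl-only sketch of the same idea. The token list is a hypothetical stand-in for what $p->get_token would hand back (each entry is just [ kind, raw text ]), and truncate_tokens and its $limit parameter are names I made up for illustration, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parser: each token is [ kind, raw text ].
sub truncate_tokens {
    my ($tokens, $limit) = @_;
    my ($out, $total) = ('', 0);
    for my $token (@$tokens) {
        my ($kind, $raw) = @$token;
        if ($kind ne 'text') {
            $out .= $raw;                  # tags pass through and cost nothing
            next;
        }
        (my $text = $raw) =~ s/\s+/ /g;    # normalize whitespace runs
        if ($total + length($text) <= $limit) {
            $out   .= $text;
            $total += length $text;
            next;
        }
        # This chunk does not fit: add whole words while they still fit.
        for my $word (split / /, $text) {
            last if $total + length($word) > $limit;
            $out   .= "$word ";
            $total += length($word) + 1;
        }
        $out =~ s/\s$//;                   # drop the trailing space, if any
        last;                              # budget used up: stop reading tokens
    }
    return $out;
}

my @tokens = (
    [ tag  => '<p>' ],
    [ text => "Hello   brave" ],
    [ tag  => '<b>' ],
    [ text => " new world" ],
);
print truncate_tokens(\@tokens, 15), "\n";    # prints: <p>Hello brave<b> new
```

Note that, like the loop above, this stops dead at the limit, so any tags still open at that point stay unclosed.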

Minor fix to yours:
$doc2 .= substr( $tkntext, 0, rindex( $tkntext, ' ', $maxlen );
Should be:
$doc2 .= substr( $tkntext, 0, rindex( $tkntext, ' ', $maxlen ) );
It was missing a closing parenthesis. =)
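
In case it helps anyone reading along, here is a small standalone sketch of what that corrected substr/rindex line does: rindex finds the last space at or before $maxlen, and substr keeps everything before it. The cut_at_space wrapper and the sample string are my own for illustration. The -1 guard is also my addition, on the assumption that a chunk might contain no space before the limit; without it, substr would receive a length of -1 and return everything except the last character.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the corrected line.
sub cut_at_space {
    my ($tkntext, $maxlen) = @_;
    # Position of the last ' ' starting at or before $maxlen, or -1 if none.
    my $cut = rindex( $tkntext, ' ', $maxlen );
    return '' if $cut < 0;                 # no space to break on
    return substr( $tkntext, 0, $cut );    # text up to, not including, that space
}

print cut_at_space( 'The quick brown fox jumps', 12 ), "\n";    # prints: The quick
```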

--
"To err is human, but to really foul things up you need a computer." --Paul Ehrlich