
Given the two updates, LTjake certainly has the approach I would take, but I think the version I saw would end up scanning the entire input post, rather than quitting as soon as the output string is done. Here's the way the while loop was written when I first saw it (with my commentary added):
while ( my $token = $p->get_token ) {
    if ($token->is_text) {
        if (length($token->return_text) + $total <= 200) {
            $doc2 .= $token->return_text;
            $total += length($token->return_text);
        }
        else {
            for (split / /, $token->return_text) {
                if ($total + length($_) <= 200) {
                    $doc2 .= $_ . ' ';
                    $total += length($_) + 1;
                }
                else {
                    last;  ## THIS ONLY EXITS THE FOR LOOP
                }          ## So this block runs over the
            }              ## entire remainder of the post
            chop($doc2) if $doc2 =~ /\s$/;
        }
    }
    else {
        $doc2 .= $token->as_is;
    }
}
The solution would be to add the length test to the while loop condition, or else figure out a way to avoid the inner for loop, so that "last" really does finish things off. And some other nit-picks:

So here's my version of LTjake's while loop (not tested):

while ( my $token = $p->get_token ) {
    my $tkntext = $token->as_is;
    $tkntext =~ s/\s+/ /g;  # normalize all whitespace
    if ($token->is_text) {
        if (length($tkntext) + $total <= 200) {
            $doc2 .= $tkntext;
            $total += length($tkntext);
        }
        else {
            my $maxlen = 200 - $total;
            $doc2 .= substr( $tkntext, 0, rindex( $tkntext, ' ', $maxlen );
            last;  # this finishes the while loop
        }
    }
    else {
        $doc2 .= " $tkntext ";
    }
}
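
One edge case in the substr/rindex step may be worth a look: rindex returns -1 when there is no space at or before the given position, and substr treats a length of -1 as "everything but the last character", so a long unbroken word could slip past the limit. Here is a rough sketch of a guard, run on a plain string rather than on parser tokens. The helper name cut_at_word is made up for illustration, and returning an empty string in the no-space case is just one possible choice.

```perl
use strict;
use warnings;

# Return at most $maxlen characters of $text, cut at the last space
# at or before position $maxlen. Hypothetical helper, for illustration only.
sub cut_at_word {
    my ( $text, $maxlen ) = @_;

    # Nothing to trim if the text already fits.
    return $text if length($text) <= $maxlen;

    my $cut = rindex( $text, ' ', $maxlen );

    # No space in range: return nothing rather than let substr()
    # read -1 as "all but the last character".
    return '' if $cut < 0;

    return substr( $text, 0, $cut );
}

print cut_at_word( 'one two three four', 10 ), "\n";      # prints "one two"
print cut_at_word( 'supercalifragilistic', 10 ), "|\n";   # prints "|"
```

In the loop above, the same check would go around the rindex call, with $maxlen being 200 - $total.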

Re: Re: Re: Extracting a substring of N chars ignoring embedded HTML
by LTjake (Prior) on Jan 12, 2003 at 14:25 UTC
    Ah yes! Good catch, graff. The big problem, as you noted, was that I was parsing the whole text. Simply adding another last after the chop(...) fixes that. The other was splitting on a single space; normalizing whitespace like you've done fixes that. Lastly, calling return_text all those times is really bad, like you noted, so I just assigned it to a variable (and used as_is). Now, while I still like yours better, here's an updated version of mine :)
    while ( my $token = $p->get_token ) {
        if ($token->is_text) {
            my $text = $token->as_is;
            $text =~ s/\s+/ /g;
            if (length($text) + $total <= 200) {
                $doc2 .= $text;
                $total += length($text);
            }
            else {
                for (split / /, $text) {
                    if ($total + length($_) <= 200) {
                        $doc2 .= $_ . ' ';
                        $total += length($_) + 1;
                    }
                    else {
                        last;
                    }
                }
                chop($doc2) if $doc2 =~ /\s$/;
                last;
            }
        }
        else {
            $doc2 .= $token->as_is;
        }
    }
    Thanks for the feedback!
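
Neither version in this thread uses it, but a loop label is another way to make "last" leave the outer loop directly from inside the inner for. A minimal sketch, assuming a plain array of strings stands in for the parser's text tokens. The names @chunks and $limit and the TOKEN label are invented for the example, and the limit is 30 rather than 200 so the cut-off is easy to see.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the text tokens the parser would return.
my @chunks = (
    'The quick brown fox',
    'jumps over the lazy dog',
    'and keeps on running',
);
my $limit = 30;    # 200 in the thread

my ( $out, $total ) = ( '', 0 );

TOKEN: for my $text (@chunks) {
    for my $word ( split ' ', $text ) {
        # "last TOKEN" leaves both loops, not just the inner one,
        # so no further chunks are looked at once the limit is hit.
        last TOKEN if $total + length($word) > $limit;
        $out   .= "$word ";
        $total += length($word) + 1;
    }
}
$out =~ s/\s+$//;    # drop the trailing space

print "$out\n";      # prints "The quick brown fox jumps over"
```

With the real parser, the label would sit on the while ( my $token = $p->get_token ) line instead of on a for loop.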

    Minor fix to yours:
    $doc2 .= substr( $tkntext, 0, rindex( $tkntext, ' ', $maxlen );
    Should be:
    $doc2 .= substr( $tkntext, 0, rindex( $tkntext, ' ', $maxlen ) );
    it was missing a bracket. =)

    --
    "To err is human, but to really foul things up you need a computer." --Paul Ehrlich