Extracting a substring of N chars ignoring embedded HTML

FamousLongAgo has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a weblog search engine, which will display a portion of each post in the main results list. I want to show only the first 200 characters of the post, but still make sure that no word at the end gets cut off abruptly. If the string in question were just plain text, this would be easy to do in a regular expression:

my ($fragment) = $full_text =~ /^(.{1,200}[^\s]*)/s;
[download]

Unless I am wrong, this returns up to two hundred characters, followed by any non-whitespace characters immediately following the excerpt.

Unfortunately, my data is not plain text, but also contains HTML tags. I don't want those tags (which include lengthy URLs) to count towards the 200 limit, since I am only interested in what gets displayed as text in the browser. But I also don't want to strip the tags out.

So far the only solution in my mind is this:

my @chars = split //, $full_text;
my @excerpt;
my $in_tag = 0;
my $limit=200;


# Get exactly $limit characters of text
while ( my $char = shift @chars ) {
    push @excerpt, $char;
    $in_tag++ if $char eq '<';
    $in_tag-- if $char eq '>';
    $count++ unless $in_tag;
    last if $count > $limit and !$in_tag;
}

#Now make sure we get the last word until boundary
$in_tag = 0; # just in case
while ( my $char = shift @chars ){
    $in_tag++ if $char eq '<';
    $in_tag-- if $char eq '>';
    last if $char =~/\s/ and !$in_tag;
    push @excerpt, $char;
}

my $excerpt = join ('', @excerpt);
[download]

I don't like this, it feels un-Perlish. I've tried a search on CPAN without good results, partly because I don't know what to call my problem. HTML::Highlight just seems to ignore the problem of tags altogether by getting rid of them.

I wait humbly for advice. Sample post follows.

__DATA__
A story on GettingIt about <a href="http://ss.gettingit.com/cgi-bin/gx
+.cgi/AppLogic+FTContentServer?GXHC_gx_session_id_FutureTenseContentSe
+rver=7f12a816fa48a5b9&pagename=FutureTense/Demos/GI/Templates/Article
+_View&parm1=A1545-1999Oct12&topframe=true">hacking polls</a>. Contrar
+y to what the article says, Time is <b>not</b> checking for multiple 
+votes on
 <a href="javascript:document.timedigital.submit();">their 
poll</a>. And I'm happy to report that despite the fact that
 my cheater scripts aren't running, I'm still beating Bill Gates.
[download]

Comment on Extracting a substring of N chars ignoring embedded HTML Select or Download Code

Replies are listed 'Best First'.
Re: Extracting a substring of N chars ignoring embedded HTML by LTjake (Prior) on Jan 11, 2003 at 23:18 UTC
I'd use a module to strip the HTML. Here's a quick try: use HTML::TokeParser::Simple; my $doc = qq(story on GettingIt about <a href="http://ss.gettingit.com +/cgi-bin/gx.cgi/AppLogic+FTContentServer?GXHC_gx_session_id_FutureTen +seContentServer=7f12a816fa48a5b9&pagename=FutureTense/Demos/GI/Templa +tes/Article_View&parm1=A1545-1999Oct12&topframe=true">hacking polls</ +a>. Contrary to what the article says, Time is <b>not</b> checking fo +r multiple votes on <a href="javascript:document.timedigital.submit() +;">their poll</a>. And I'm happy to report that despite the fact that + my cheater scripts aren't running, I'm still beating Bill Gates.); my $doc2; my $total = 0; my $p = HTML::TokeParser::Simple->new( \$doc ); while ( my $token = $p->get_token ) { if ($token->is_text) { if (length($token->return_text) + $total <= 200) { $doc2 .= $token->return_text; $total += length($token->return_text); } else { for (split / /, $token->return_text) { if ($total + length($_) <= 200) { $doc2 .= $_ . ' '; $total += length($_) + 1; } else { last; } } chop($doc2) if $doc2 =~ /\s$/; } } else { $doc2 .= $token->as_is; } } print $doc2; [download] Prints story on GettingIt about <a href="http://ss.gettingit.com/cgi-bin/gx.c +gi/AppLogi c+FTContentServer?GXHC_gx_session_id_FutureTenseContentServer=7f12a816 +fa48a5b9&p agename=FutureTense/Demos/GI/Templates/Article_View&parm1=A1545-1999Oc +t12&topfra me=true">hacking polls</a>. Contrary to what the article says, Time is + <b>not</b > checking for multiple votes on <a href="javascript:document.timedigi +tal.submit ();">their poll</a>. And I'm happy to report that despite the fact tha +t my cheat er scripts [download] HTH Update: changed < limit to <= limit Update: strike that last one -- misunderstood the question. here's try #2! :) fixes welcome! -- "To err is human, but to really foul things up you need a computer." --Paul Ehrlich	[reply] [d/l] [select]
Re: Re: Extracting a substring of N chars ignoring embedded HTML by graff (Chancellor) on Jan 12, 2003 at 04:29 UTC
Given the two updates, LTjake certainly has the approach I would take, but I think the version I saw would end up scanning the entire input post, rather than quiting as soon as the output string is done. Here's the way the while loop was written when I first saw it (with my commentary added): `while ( my $token = $p->get_token ) { if ($token->is_text) { if (length($token->return_text) + $total <= 200) { $doc2 .= $token->return_text; $total += length($token->return_text); } else { for (split / /, $token->return_text) { if ($total + length($_) <= 200) { $doc2 .= $_ . ' '; $total += length($_) + 1; } else { last; ## THIS ONLY EXITS THE FOR LOOP } ## So this block runs over the } ## entire remainder of the post chop($doc2) if $doc2 =~ /\s$/; } } else { $doc2 .= $token->as_is; } }` [download] The solution would be to add the length test to the while loop condition, or else figure a way to avoid an inner for loop, so that "last" will really finish things off. And some other nit-picks: Rather than calling the same "return_text" method so many times, I'd rather use it once to set a local variable (and having just read the TokeParser::Simple docs, it seems that "return_text" is depricated anyway -- use "as_is" in all cases). The for loop uses a split on single space -- not what the OP intended -- but this leads to a question about the OP's stated goals: Would you want every input string of `/\s{1,}/` to be represented (and counted) as a single space in the output string? (Myself, I would think "yes".) So here's my version of LTjake's while loop (not tested): `while ( my $token = $p->get_token ) { my $tkntext = $token->as_is; $tkntext =~ s/\s+/ /g; # normalize all whitespace if ($token->is_text) { if (length($tkntext) + $total <= 200) { $doc2 .= $tkntext; $total += length($tkntext); } else { my $maxlen = 200 - $total; $doc2 .= substr( $tkntext, 0, rindex( $tkntext, ' ', $maxl +en ); last; # this finishes the while loop } } else { $doc2 .= " $tkntext "; } }` [download]	[reply] [d/l] [select]
Re: Re: Re: Extracting a substring of N chars ignoring embedded HTML by LTjake (Prior) on Jan 12, 2003 at 14:25 UTC
Ah yes! Good catch graff. The big problem, as you noted was that I was parsing the whole text. Simply adding another `last` after the `chop(...)` fixes that. The other was splitting on one space -- normalizing whitespace like you've done fixes that. Lastly, calling return text all those times is really bad like you've noted, so i just assigned it to a var (and used `as_is`). Now, while i still like yours better, here's an updated version of mine :) `while ( my $token = $p->get_token ) { if ($token->is_text) { my $text = $token->as_is; $text =~ s/\s+/ /g; if (length($text) + $total <= 200) { $doc2 .= $text; $total += length($text); } else { for (split / /, $text) { if ($total + length($_) <= 200) { $doc2 .= $_ . ' '; $total += length($_) + 1; } else { last; } } chop($doc2) if $doc2 =~ /\s$/; last; } } else { $doc2 .= $token->as_is; } }` [download] Thanks for the feedback! Minor fix to yours: `$doc2 .= substr( $tkntext, 0, rindex( $tkntext, ' ', $maxlen );` [download] Should be: `$doc2 .= substr( $tkntext, 0, rindex( $tkntext, ' ', $maxlen ) );` [download] it was missing a bracket. =) -- "To err is human, but to really foul things up you need a computer." --Paul Ehrlich	[reply] [d/l] [select]
Re: Re: Extracting a substring of N chars ignoring embedded HTML by FamousLongAgo (Friar) on Jan 11, 2003 at 23:46 UTC
You misunderstood my question. I am not interested in stripping the HTML, I want to keep it in the string. I just don't want the HTML tags to be counted as characters when I try to create a substring of length N.	[reply]
Re: Extracting a substring of N chars ignoring embedded HTML by Aristotle (Chancellor) on Jan 12, 2003 at 17:39 UTC
The others gave solutions with unnecessarily complicated loops. To get the exact same functionality is as simple as the following: `#!/usr/bin/perl -w use strict; use HTML::TokeParser::Simple; my $total = 0; my $p = HTML::TokeParser::Simple->new(DATA); my $length_left = shift \|\| 40; my $abstract = ""; while (my $t = $p->get_token) { $_ = $t->as_is; if ($t->is_text) { s/\s+/ /g; s/^(.{1,$length_left}\S)./$1/s; $length_left -= length; } $abstract .= $_; last unless $length_left > 0; } print $abstract, $/;` [download] However, this is broken: it will leave unbalanced open tags. A first step to fix this is to move the `last` inside the `if` so that any trailing closing tags right after the last piece of counted plaintext will be accepted. However, that can still leave us with unbalanced tags. So we need to have a stack: #!/usr/bin/perl -w use strict; use HTML::TokeParser::Simple; my $total = 0; my $p = HTML::TokeParser::Simple->new(DATA); my $length_left = shift \|\| 40; my $abstract = ""; my @stack; while (my $t = $p->get_token) { $_ = $t->as_is; if ($t->is_text) { s/\s+/ /g; s/^(.{1,$length_left}\S)./$1/s; $length_left -= length; } elsif($t->is_start_tag) { push @stack, $t->return_tag; } elsif($t->is_end_tag) { pop @stack if $stack[-1] eq $t->return_tag; } $abstract .= $_; last unless $length_left > 0; } $abstract .= join '', map "</$_>", reverse @stack; print $abstract, $/; [download] Note this only takes minimal provisions for dealing with HTML with invalidly nested tags. The way it is, it may produce extraneous closing tags, which is usually the less harmful alternative. Except for this corner case, it does exactly what you need. You can bulletproof it against those cases if you spend significantly more time on the `elsif($t->is_end_tag)` branch. Makeshifts last the longest.	[reply] [d/l] [select]
Re: Extracting a substring of N chars ignoring embedded HTML by BrowserUk (Patriarch) on Jan 12, 2003 at 05:35 UTC
Probably won't win any prizes as it doesn't use any modules and uses regexes for parsing HTML, but it seems to deal with most things I've thrown at it, including both types of quoted attributes even when they contain embedded '>' chars. #! perl -slw use vars qw[$required]; use strict; sub abstract (\$$) { my ($data, $req) = @_; my ($s, $p) = (0, 0); while ($p < $req) { $$data =~ /[^<]{1,${\($req - $p)}}/gc; last if ($p += pos($$data) - $s) >= $req; my ($q) = $$data =~ /\G[^"'>]+(.)/gc; #!" $$data =~ /\G[^$q]+/gc if $q =~ /["']/; #!" $$data =~ /\G[^>]+/gc; $s = ++pos($$data); } $$data =~ /\G[\w]+/gc; return substr( $$data, 0, pos($$data) ); } my $data = join '', <DATA>; $data =~ tr/\n/ /d; print abstract( $data, $required\|\|40 ); __DATA__ A story on GettingIt about <a href="http://ss.gettingit.com/cgi-bin/gx.cgi/AppLogic+FTContentServ +er?GXHC_gx_session_id_FutureTenseContentServer=7f12a816fa48a5b9&pagen +ame=FutureTense/Demos/GI/Templates/Article_View&parm1=A1545-1999Oct12 +&topframe=true">hacking polls</a>. Contrary to what the article says, Time is <b>not</b> checking for mul +tiple votes on <a href="javascript:document.timedigital.submit();">their poll</a>. And I'm happy to report that despite the fact that my cheater scripts aren't running, I'm still beating Bill Gates. Some <b>more</b> test data. <table name='a probably invalid> name' width='%80'> <TR align=right><TH>Things</TH><TH>Values</TH></TR> <TR align=center><TD>A thing</TD><TD>A value</TD></TR> </table> [download] Some sample output Read more... (5 kB) Examine what is said, not who speaks. The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.	[reply] [d/l] [select]
Re^2: Extracting a substring of N chars ignoring embedded HTML by Aristotle (Chancellor) on Jan 12, 2003 at 15:03 UTC
It leaves cut off tags at the end of the output. I wouldn't want to print that verbatim. The parser solutions have no such problem. Makeshifts last the longest.	[reply]