FamousLongAgo has asked for the wisdom of the Perl Monks concerning the following question:
I am writing a weblog search engine, which will display a portion of each post in the main results list. I want to show only the first 200 characters of the post, but still make sure that no word at the end gets cut off abruptly. If the string in question were just plain text, this would be easy to do in a regular expression:
    my ($fragment) = $full_text =~ /^(.{1,200}[^\s]*)/s;

Unless I am wrong, this returns up to two hundred characters, followed by any non-whitespace characters immediately following the excerpt.
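To make the behavior of that regular expression concrete, here is a minimal sketch on plain text. The helper name `plain_excerpt` and the `$limit` parameter are assumptions for illustration (the post hard-codes 200), and a small limit is used so the result is easy to check by eye.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the regex from the post;
# $limit stands in for the hard-coded 200.
sub plain_excerpt {
    my ($text, $limit) = @_;

    # Up to $limit characters, then the rest of any word cut at the limit.
    my ($fragment) = $text =~ /^(.{1,$limit}\S*)/s;
    return $fragment;    # undef if $text is empty
}

my $text = "The quick brown fox jumps over the lazy dog";
print plain_excerpt($text, 12), "\n";    # prints "The quick brown"
```

Note that when the limit falls exactly on whitespace, the trailing `\S*` still pulls in the whole next word, so the excerpt can run one word past the limit.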
Unfortunately, my data is not plain text, but also contains HTML tags. I don't want those tags (which include lengthy URLs) to count towards the 200 limit, since I am only interested in what gets displayed as text in the browser. But I also don't want to strip the tags out.
So far the only solution in my mind is this:
    my @chars = split //, $full_text;
    my @excerpt;
    my $in_tag = 0;
    my $count  = 0;
    my $limit  = 200;

    # Get exactly $limit characters of text
    while ( defined( my $char = shift @chars ) ) {
        push @excerpt, $char;
        if    ( $char eq '<' ) { $in_tag++ }
        elsif ( $char eq '>' ) { $in_tag-- }
        elsif ( !$in_tag )     { $count++ }    # count only text outside tags
        last if $count >= $limit and !$in_tag;
    }

    # Now make sure we get the last word until boundary
    $in_tag = 0;    # just in case
    while ( defined( my $char = shift @chars ) ) {
        $in_tag++ if $char eq '<';
        $in_tag-- if $char eq '>';
        last if $char =~ /\s/ and !$in_tag;
        push @excerpt, $char;
    }

    my $excerpt = join( '', @excerpt );

I don't like this; it feels un-Perlish. I've tried a search on CPAN without good results, partly because I don't know what to call my problem. HTML::Highlight just seems to ignore the problem of tags altogether by getting rid of them.
I wait humbly for advice. Sample post follows.
    __DATA__
    A story on GettingIt about <a href="http://ss.gettingit.com/cgi-bin/gx.cgi/AppLogic+FTContentServer?GXHC_gx_session_id_FutureTenseContentServer=7f12a816fa48a5b9&pagename=FutureTense/Demos/GI/Templates/Article_View&parm1=A1545-1999Oct12&topframe=true">hacking polls</a>.
    Contrary to what the article says, Time is <b>not</b> checking for multiple votes on
    <a href="javascript:document.timedigital.submit();">their poll</a>. And I'm happy to report that despite the fact that my cheater scripts aren't running, I'm still beating Bill Gates.
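One possible way to frame the problem is to split the post into tags and runs of text, copy tags through untouched, and count only the text. The sketch below is not taken from the thread; the name `html_excerpt` is hypothetical, the example URL is a placeholder, and the code assumes reasonably well-formed markup with no literal `<` or `>` inside text or attribute values.

```perl
use strict;
use warnings;

# Hypothetical helper: keep tags intact, count only text characters.
sub html_excerpt {
    my ($html, $limit) = @_;
    my ($out, $count) = ('', 0);

    # Tokenize into tags (<...>) and runs of text between tags.
    for my $token ( $html =~ /(<[^>]*>|[^<]+)/g ) {
        if ( $token =~ /^</ ) {    # a tag: copy it, do not count it
            $out .= $token;
            next;
        }

        my $room = $limit - $count;    # text characters still allowed
        if ( length($token) <= $room ) {
            $out   .= $token;
            $count += length $token;
            next;
        }

        # This run crosses the limit: take $room characters,
        # then finish the word that was cut.
        my ($piece) = $token =~ /^(.{0,$room}\S*)/s;
        $out .= $piece;
        last;
    }
    return $out;
}

my $post = 'A story about <a href="http://example.com/x">hacking polls</a>. Contrary';
print html_excerpt( $post, 20 ), "\n";
```

Like the character-by-character loop in the post, this can stop inside an element and leave a tag such as `<a>` unclosed, so the result may still need its open tags balanced before display.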
Re: Extracting a substring of N chars ignoring embedded HTML
by LTjake (Prior) on Jan 11, 2003 at 23:18 UTC
by graff (Chancellor) on Jan 12, 2003 at 04:29 UTC
by LTjake (Prior) on Jan 12, 2003 at 14:25 UTC
by FamousLongAgo (Friar) on Jan 11, 2003 at 23:46 UTC

Re: Extracting a substring of N chars ignoring embedded HTML
by Aristotle (Chancellor) on Jan 12, 2003 at 17:39 UTC

Re: Extracting a substring of N chars ignoring embedded HTML
by BrowserUk (Patriarch) on Jan 12, 2003 at 05:35 UTC
by Aristotle (Chancellor) on Jan 12, 2003 at 15:03 UTC