I am writing a weblog search engine, which will display a portion of each post in the main results list. I want to show only the first 200 characters of the post, but still make sure that the last word does not get cut off abruptly. If the string in question were just plain text, this would be easy to do with a regular expression:

my ($fragment) = $full_text =~ /^(.{1,200}[^\s]*)/s;
Unless I am wrong, this returns up to two hundred characters, followed by any non-whitespace characters immediately following the excerpt.
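To check my reading of that regex, here is a minimal sketch on plain text. The helper name excerpt_plain and the sample sentence are made up for illustration, and the limit is a parameter so that a small value can stand in for 200.

```perl
use strict;
use warnings;

# excerpt_plain() is a hypothetical helper wrapping the regex above;
# the limit is a parameter so the demo can use a small value.
sub excerpt_plain {
    my ($text, $limit) = @_;
    my ($fragment) = $text =~ /^(.{1,$limit}[^\s]*)/s;
    return $fragment;
}

# The 12th character falls inside "brown", so the whole word is kept.
print excerpt_plain('The quick brown fox jumps', 12), "\n";   # The quick brown
```

Note that if the last counted character happens to be whitespace, the regex goes on to take the whole following word, so the excerpt can run slightly longer than expected.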

Unfortunately, my data is not plain text, but also contains HTML tags. I don't want those tags (which include lengthy URLs) to count towards the 200 limit, since I am only interested in what gets displayed as text in the browser. But I also don't want to strip the tags out.
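To illustrate the problem, here is a small sketch of the plain-text regex applied to HTML. The sample string and its URL are invented for the demo, and the tag stripping at the end is a crude approximation used only to count what a browser would display.

```perl
use strict;
use warnings;

# Hypothetical sample; the URL is made up for illustration.
my $html = 'See <a href="http://example.com/a/very/long/path">this</a> poll';

# Same regex as for plain text, with a small limit for the demo.
my ($fragment) = $html =~ /^(.{1,6}[^\s]*)/s;
print "$fragment\n";    # See <a

# Crude tag stripping, only to count what a browser would display.
(my $visible = $fragment) =~ s/<[^>]*>?//g;
print length($visible), "\n";    # 4
```

The tag characters use up the budget, and the excerpt ends in the middle of a tag, which is exactly what I want to avoid.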

So far, the only solution I can think of is this:

my @chars = split //, $full_text;
my @excerpt;
my $in_tag = 0;
my $limit  = 200;
my $count  = 0;

# Get exactly $limit characters of text
# (defined() keeps a literal '0' character from ending the loop early)
while ( defined( my $char = shift @chars ) ) {
    push @excerpt, $char;
    $in_tag++ if $char eq '<';
    $in_tag-- if $char eq '>';
    # Count only displayed text: skip tag contents and the closing '>'
    $count++ unless $in_tag or $char eq '>';
    last if $count >= $limit and !$in_tag;
}

# Now make sure we get the last word until boundary
$in_tag = 0;    # just in case
while ( defined( my $char = shift @chars ) ) {
    $in_tag++ if $char eq '<';
    $in_tag-- if $char eq '>';
    last if $char =~ /\s/ and !$in_tag;
    push @excerpt, $char;
}
my $excerpt = join( '', @excerpt );
I don't like this; it feels un-Perlish. I've searched CPAN without good results, partly because I don't know what to call my problem. HTML::Highlight just seems to ignore the problem of tags altogether by getting rid of them.

I wait humbly for advice. Sample post follows.

__DATA__
A story on GettingIt about <a href="http://ss.gettingit.com/cgi-bin/gx.cgi/AppLogic+FTContentServer?GXHC_gx_session_id_FutureTenseContentServer=7f12a816fa48a5b9&pagename=FutureTense/Demos/GI/Templates/Article_View&parm1=A1545-1999Oct12&topframe=true">hacking polls</a>. Contrary to what the article says, Time is <b>not</b> checking for multiple votes on <a href="javascript:document.timedigital.submit();">their poll</a>. And I'm happy to report that despite the fact that my cheater scripts aren't running, I'm still beating Bill Gates.

In reply to Extracting a substring of N chars ignoring embedded HTML by FamousLongAgo
