I think your basic idea, using a stack of HTML tags so you can close out any open tags after truncating the text, is sound, and it can be combined pretty easily with a good HTML parsing module.

Here's a crude example that seems to work on some relatively simple HTML data that I tried. There is certainly room for improvement and there are bound to be situations in HTML that will cause it to go wrong, but it's a start...

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

my $src;
{
    # read the entire HTML input stream as one contiguous string:
    local $/ = undef;
    $src = <>;
}
my $htm = HTML::TokeParser::Simple->new( \$src );

# this is a flawed attempt to select 15% of original content:
# length( $src ) counts the markup as well as the text
my $targetlen = int( 0.15 * length( $src ));
my @tagstack;

while ( my $tkn = $htm->get_token ) {
    if ( $tkn->is_start_tag ) {   # this is a start tag
        print $tkn->as_is;
        my $tagname = $tkn->return_tag;
        # img, hr, meta, link, br tags don't span text content
        next if ( $tagname =~ /^(img|hr|meta|link|br)$/i );
        # an unclosed <p> followed by another <p> is stacked only once
        push @tagstack, $tagname
            unless ( $tagname =~ /^p$/i and @tagstack and $tagstack[-1] =~ /^p$/i );
    }
    elsif ( $tkn->is_end_tag ) {  # this is an end tag
        print $tkn->as_is;
        my $tagname = $tkn->return_tag;
        if ( grep /^$tagname$/i, @tagstack ) {
            # discard unclosed inner tags, then the matching tag itself
            while ( $tagstack[-1] !~ /^$tagname$/i ) {
                pop @tagstack;
            }
            pop @tagstack;
        }
    }
    elsif ( $tkn->is_text ) {     # this is text content
        my $txttkn = $tkn->as_is;
        $txttkn =~ s/\s+/ /g;
        my $txtlen = length( $txttkn );
        if ( $txtlen > $targetlen ) {
            # cut at the last space within the remaining budget;
            # if there is no space, cut at the budget itself
            my $cut = rindex( $txttkn, ' ', $targetlen );
            $cut = $targetlen if ( $cut < 0 );
            $txttkn = substr( $txttkn, 0, $cut );
            print "\n$txttkn\n";
            last;
        }
        print "\n$txttkn\n";
        $targetlen -= $txtlen;
    }
}
# close whatever is still open, innermost tag first
while ( @tagstack ) {
    printf "</%s>\n", pop @tagstack;
}
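
If it helps to see the stack bookkeeping on its own, here is a minimal sketch with the parser taken out, so it runs with core Perl only. The sub name closing_tags, the %void hash and the [ type, tagname ] event format are all made up for this illustration and are not part of HTML::TokeParser::Simple. It also leaves out the special handling of nested p tags that the script above does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: tags that never enclose
# text content, so they are never left open on the stack.
my %void = map { $_ => 1 } qw( img hr meta link br );

# Given [ type, tagname ] events in document order (type is 'start' or
# 'end'), return the closing tags needed to balance whatever is still
# open after the last event.
sub closing_tags {
    my @events = @_;
    my @stack;
    for my $ev ( @events ) {
        my ( $type, $tag ) = ( $ev->[0], lc $ev->[1] );
        if ( $type eq 'start' ) {
            next if $void{$tag};
            push @stack, $tag;
        }
        elsif ( $type eq 'end' ) {
            # ignore a stray end tag that matches nothing on the stack
            next unless grep { $_ eq $tag } @stack;
            # drop unclosed inner tags, then the matching tag itself
            pop @stack while $stack[-1] ne $tag;
            pop @stack;
        }
    }
    # the innermost open tag is closed first
    return map { "</$_>" } reverse @stack;
}

# events seen before the cut in: <div><p>some <b>bold<br> te...
my @seen = ( [ start => 'div' ], [ start => 'p' ],
             [ start => 'b' ],   [ start => 'br' ] );
print join( '', closing_tags( @seen ) ), "\n";   # prints </b></p></div>
```

Keeping the stack logic in a sub like this makes it easier to try out odd tag sequences (stray end tags, unclosed inline tags) without having to build a whole HTML file for each case.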

In reply to Re: substr(ingifying) htmlized text by graff
in thread substr(ingifying) htmlized text by punkish
