Re: substr(ingifying) htmlized text

I think your basic idea, of using a stack of html tags so you can close out open tags after truncating the text, is basically sound, and can be combined pretty easily with a good HTML parsing module.

Here's a crude example that seems to work on some relatively simple HTML data that I tried. There is certainly room for improvement and there are bound to be situations in HTML that will cause it to go wrong, but it's a start...

#!/usr/bin/perl

use strict;
use HTML::TokeParser::Simple;

my $src;
{  # read the entire HTML input stream as one contiguous string:
    local $/ = undef;
    $src = <>;
}

my $htm = HTML::TokeParser::Simple->new( \$src );

my $targetlen = int( 0.15 * length( $src ));
   # this is a flawed attempt to select 15% of original content

my $outtext = '';
my $outlen = 0;
my @tagstack;

while ( my $tkn = $htm->get_token )
{
    if ( $tkn->is_start_tag ) { # this is a start tag
        print $tkn->as_is;
        next if ( $$tkn[1] =~ /^(img|hr|meta|link|br)$/ );
             # img,hr,meta,link tags don't span text content
        my $tagname = $tkn->return_tag;
        push @tagstack, $tagname
            unless ( $tagname =~ /^p$/i and $tagstack[$#tagstack] =~ /
+^p$/i );
    }
    elsif ( $tkn->is_end_tag ) { # this is an end tag
        print $tkn->as_is;
        my $tagname = $tkn->return_tag;
        if ( grep /^$tagname$/i, @tagstack ) {
            while ( $tagstack[$#tagstack] !~ /^$tagname$/i ) {
                pop @tagstack;
            }
            pop @tagstack;
        }
    }
    elsif ( $tkn->is_text ) {   # this is text content
        my $txttkn = $tkn->as_is;
        $txttkn =~ s/\s+/ /g;
        my $txtlen = length( $txttkn );
        if ( $txtlen > $targetlen ) {
            my $cut = rindex( $txttkn, ' ', $targetlen );
            $txttkn = substr( $txttkn, 0, $cut );
            print "\n$txttkn\n";
            last;
        }
        print "\n$txttkn\n";
        $targetlen -= $txtlen;
    }
}

while ( @tagstack ) {
    printf "</%s>\n", pop @tagstack;
}
[download]

Comment on Re: substr(ingifying) htmlized text Download Code

Replies are listed 'Best First'.
Re^2: substr(ingifying) htmlized text by punkish (Priest) on Sep 24, 2005 at 16:59 UTC
graff++ While I was not expecting working code off my pseudo... you gave, and it works. I made the following mods -- Substituted `grep` with a straightforward loop through the array and `return 1` when compare is successful. This significantly improved the performance. Removed ::Simple to get to HTML::TokeParser directly (my webhost doesn't have H::T::S installed, and I didn't want to bother them... besides, perhaps getting to the base module directly perhaps squeezes out a little bit more performance. It works, and all thanks and credit to you. -- when small people start casting long shadows, it is time to go to bed	[reply] [d/l] [select]

Replies are listed 'Best First'.

Re^2: substr(ingifying) htmlized text
by punkish (Priest) on Sep 24, 2005 at 16:59 UTC

graff

While I was not expecting working code off my pseudo... you gave, and it works.

I made the following mods --

Substituted grep with a straightforward loop through the array and return 1 when compare is successful. This significantly improved the performance.

Removed ::Simple to get to HTML::TokeParser directly (my webhost doesn't have H::T::S installed, and I didn't want to bother them... besides, perhaps getting to the base module directly perhaps squeezes out a little bit more performance.

It works, and all thanks and credit to you.

--

when small people start casting long shadows, it is time to go to bed

[reply]
[d/l]
[select]