eternius has asked for the wisdom of the Perl Monks concerning the following question:

Hi there,

I have a question.
Say I have some HTML content.
How do I take a substring of that content to generate a preview of x characters of the plain text while keeping the markup well-formed?

I tried stripping the tags and then taking a substr of the length I wanted. I also tried to come up with a regexp that does the same to the HTML itself, but of course that didn't work.

Thanks for your help.
You're right, sorry. Okay:
E.g., I have got this:
<b>bla blub</b><i>this <br>is the content I want to extract</i> but only some <span style="background:#ccc">part of it</span>


Now I want to provide a preview like this:
<b>bla blub</b><i>this <br>is the content I want to extract</i>but -> more?


If I just did substr($content,0,40), I would get something like:
<b>bla blub</b><i>this <br>is the content I want t


As you can see, the HTML markup is then broken, which is what I want to avoid.

Replies are listed 'Best First'.
Re: Substringing HTML content
by William G. Davis (Friar) on Jan 07, 2005 at 17:56 UTC
Re: Substringing HTML content
by fuzzysteve (Beadle) on Jan 07, 2005 at 18:40 UTC
    While it's messy as hell, as an interim solution you could write a subroutine that goes through the content character by character and records the following things:

    total characters so far
    whether you are inside a tag (i.e. one has started with a <)
    what the last character was
    how many tags you are in
    how many open tags there are (img could be problematic here, as could any other atomic tag; XHTML would be easier)
    what the open tags are and what order they are in

    From that you should be able to run through the characters, taking note of all the tags that you'll need to close at the end.
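
    The bookkeeping listed above can be sketched in a few lines of core Perl. This is only a rough illustration under some assumptions, not a finished solution: truncate_html and %VOID are names made up for this example, the tokenizer is a simple regex (so a > inside an attribute value, or script/style content, will confuse it), and an entity such as &amp; is counted as five characters. For real-world HTML, a proper parser such as HTML::Parser would probably be the safer base.

```perl
use strict;
use warnings;

# Void ("atomic") elements never get a closing tag, so they stay off the stack.
my %VOID = map { $_ => 1 }
    qw(area base br col embed hr img input link meta source track wbr);

# truncate_html($html, $limit): hypothetical helper, not from any module.
# Keeps at most $limit characters of visible text, copies tags through
# untouched, and closes whatever is still open at the cut point.
sub truncate_html {
    my ($html, $limit) = @_;
    my ($out, $count, @open) = ('', 0);

    # Split into tags and text runs; the capture group keeps the tags.
    for my $token (split /(<[^>]*>)/, $html) {
        next if $token eq '';
        if ($token =~ m{^<(/?)([A-Za-z][A-Za-z0-9]*)}) {
            my ($closing, $name) = ($1, lc $2);
            if ($closing) {
                # Drop a stray closing tag that matches nothing open.
                next unless grep { $_ eq $name } @open;
                # Pop back to (and including) the matching open tag.
                while (@open) { last if pop(@open) eq $name }
            }
            elsif (!$VOID{$name} && $token !~ m{/>$}) {
                push @open, $name;    # remember what must be closed later
            }
            $out .= $token;
        }
        elsif ($token =~ /^</) {
            $out .= $token;           # comment, doctype, etc.: copy through
        }
        else {
            my $room = $limit - $count;
            if (length($token) >= $room) {
                # This text run uses up the budget: cut it and stop.
                $out .= substr($token, 0, $room);
                last;
            }
            $out   .= $token;
            $count += length $token;
        }
    }

    # Close the remaining tags, innermost first.
    $out .= "</$_>" for reverse @open;
    return $out;
}
```

    Called with the example from the question and a limit of 40, this returns <b>bla blub</b><i>this <br>is the content I want to ex</i>, to which a "-> more?" link can then be appended.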

    Not fun, and hopefully there's a better solution.
      Thanks :)
      Ah crap, I dropped the whole idea.
      I now use a div with a fixed height and overflow:hidden.

      Have a good day.
Re: Substringing HTML content
by dimar (Curate) on Jan 07, 2005 at 17:54 UTC

    It would help if you could explain a little more about what you are trying to do. For example, are you attempting to show Keyword In Context (KWIC) results from an HTML search engine? If so, you may want to look at a premade tool like GLIMPSE.