punkish has asked for the wisdom of the Perl Monks concerning the following question:

I want to write a function that will take in htmlized text and spit out the first n chars, keeping the markup well-formed. The only way I can think of doing this is:

1. init an array (@html) to hold open html tags
2. read $in (the htmlized text) char by char
3. $n++ for every non-html char (that is, any char not inside a <...> or </...> tag)
4. add the char (html or otherwise) to $out
5. push each html open tag (<...>) onto @html
6. on encountering a close tag (</...>):
   6.1. search @html for its corresponding open tag
   6.2. and delete it from the array
7. stop when $n reaches the limit
8. add closing tags for all remaining open tags in @html, in reverse order, to the end of $out
9. spit out $out

Is that a reasonable approach? Is it too cumbersome? What are the pitfalls?
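
For what it's worth, the steps above can be sketched in plain Perl with a regex tokenizer and a tag stack. This is my own sketch of the pseudo-code, not a robust parser: the function name is made up, and the tag regex is as naive as the pseudo-code itself (it ignores comments, attributes containing ">", and void tags like <br>, which would get a spurious close tag appended).

```perl
use strict;
use warnings;

# Naive sketch of the pseudo-code above: split the input into tags and
# single text characters, count only the text characters, and close
# whatever tags are left open when the limit is reached.
sub truncate_html {
    my ($in, $limit) = @_;
    my @html;            # stack of open tag names
    my $out   = '';
    my $n     = 0;
    # Grab either one whole tag or one non-tag character at a time.
    while ($in =~ /\G(<[^>]*>|.)/gs) {
        my $tok = $1;
        if ($tok =~ m{^</(\w+)}) {        # close tag: pop its open tag
            for my $i (reverse 0 .. $#html) {
                if ($html[$i] eq lc $1) { splice @html, $i, 1; last }
            }
        }
        elsif ($tok =~ m{^<(\w+)}) {      # open tag: remember it
            push @html, lc $1;
        }
        else {                            # plain text: count it
            $n++;
        }
        $out .= $tok;
        last if $n >= $limit;
    }
    # Close remaining open tags in reverse order.
    $out .= join '', map { "</$_>" } reverse @html;
    return $out;
}

# e.g. truncate_html('<ul><li>abcdef</li></ul>', 3)
#      gives '<ul><li>abc</li></ul>'
```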

Update: Ok. This is the point at which I realize that I should really have stated the actual problem instead of reducing it to pseudo-code. Here goes -- I wrote a wiki+blog+forums+PIM (that works very well for me, and I am quite proud of it ;-)). I enter wiki-formatted text and store it.

Then I set about building an RSS feed generator for it. I want to show only the beginning x% of each entry; however, that entry has all the wiki markup in it (the *s and the /s and the =s, etc.). So either I write something that strips all that out, which mangles the sense of what the entry is about, or I format it with the html formatter that I wrote and then substringify the initial x%, in which case I am left with malformed html (usually unclosed list or pre or map tags wreak havoc). Hence, the above problem.

The "summarized" text is not going to stand on its own -- it will be embedded in an otherwise well-formed page.

To summarize my problem,

Btw, I know that my solution already has pitfalls in it... even minor ones, in that I can't really read char by char, because I really have to read in entire html tags.

Yup, I know parsing html is hairy... anyone who thinks it isn't should set about to build one. It is fun, but very frustrating fun.

--

when small people start casting long shadows, it is time to go to bed

Replies are listed 'Best First'.
Re: substr(ingifying) htmlized text
by sauoq (Abbot) on Sep 23, 2005 at 19:22 UTC
    Is that a reasonable approach? Is it too cumbersome? What are the pitfalls?

    No. Yes. Many and various.

    Parsing HTML is not as easy as using a few regular expressions, so you should be starting with an HTML parser. There are plenty on CPAN. The solution after that point will depend on the parser you choose. You might build the data structure first, or do it with callbacks as you go along... But, essentially, you'll need to walk your tree, count your characters, and toss out the remaining branches you don't need.

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: substr(ingifying) htmlized text
by bprew (Monk) on Sep 23, 2005 at 19:55 UTC

    It sounds like you're more interested in writing a function that acts as a limited HTML parser/validator. Is there a reason you don't want to use an existing HTML parser/validator?

    Barring being able to use existing code... your pseudo-code sounds reasonable, if you're not looking for a >90% solution.

    Although, parsing HTML is not easy, as the many HTML-parser modules on CPAN attest, so depending on how loose you want your outgoing HTML to be, it might be possible to write it in a function.

    Also, the hard part with HTML is always the edge cases, and you have to work under the assumption that no one else knows how to write HTML and all their HTML is "sketchy" at best.

    For example, if you were given this piece of HTML:

    <font>some stuff <h3>some more</font> stuff</h3>

    Your function would have no tags left over, at least according to the pseudo-code, even though this may not be valid HTML.

    It depends on how close to valid HTML you want to get. If you have a specific need, then rolling your own is probably a good idea. However, if you are just looking to try and make HTML more valid... there are probably solutions out there for you.
    See HTML::Validator or HTML::Tidy or even HTML::TokeParser::Simple.

    Also, I would be more explicit about what happens when you find a closing tag with no corresponding opening tag. My guess is that you'll just throw it away, but it's something to think about.

      For example, if you were given this piece of HTML:
      <font>some stuff <h3>some more</font> stuff</h3>
      You are absolutely correct. However, I won't face that problem because I am creating the html in the first place (see my update to the OP).
      --

      when small people start casting long shadows, it is time to go to bed
Re: substr(ingifying) htmlized text
by sk (Curate) on Sep 23, 2005 at 19:24 UTC
    Is it reasonable to assume your question has more to do with whether we can fix an HTML file that is not well-formed?

    I am not sure if that is possible. Too many things to worry about.

    take this for example -

    <HTML> <HEAD> </BODY> </HEAD> </HTML>
    Now </BODY> will close <HEAD>. Well, you could scan backwards and forwards and pick the interpretation that gives valid HTML, but I can surely come up with two errors that will make the program think invalid HTML is valid. Unless you are going to check against keywords, it is going to be hard to do this. And even if you check against keywords, when someone misses a tag, where will you put it?

    Sorry, not much help on the code front; just listing out issues. -SK

Re: substr(ingifying) htmlized text
by graff (Chancellor) on Sep 23, 2005 at 23:40 UTC
    I think your basic idea, of using a stack of html tags so you can close out open tags after truncating the text, is basically sound, and can be combined pretty easily with a good HTML parsing module.

    Here's a crude example that seems to work on some relatively simple HTML data that I tried. There is certainly room for improvement and there are bound to be situations in HTML that will cause it to go wrong, but it's a start...
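
    graff's actual code is not preserved in this copy of the thread, but from the follow-up it evidently combined HTML::TokeParser (of the CPAN HTML-Parser distribution) with a tag stack. The following is my reconstruction of that idea, not graff's original: the function name is made up, and a real version would substr the text token that crosses the limit instead of emitting it whole.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # from the HTML-Parser distribution on CPAN

# Let a real tokenizer find the tag boundaries, keep a stack of open
# tags, and stop once enough text characters have been emitted.
sub trim_html {
    my ($html, $limit) = @_;
    my $p = HTML::TokeParser->new(\$html);
    my @open;            # stack of open tag names
    my $out   = '';
    my $count = 0;
    while (my $t = $p->get_token) {
        my $type = $t->[0];
        if ($type eq 'S') {                # start tag: remember it
            push @open, $t->[1];
            $out .= $t->[4];               # raw text of the tag
        }
        elsif ($type eq 'E') {             # end tag: pop matching open tag
            for my $i (reverse 0 .. $#open) {
                if ($open[$i] eq $t->[1]) { splice @open, $i, 1; last }
            }
            $out .= $t->[2];
        }
        elsif ($type eq 'T') {             # text: count its characters
            $count += length $t->[1];
            $out .= $t->[1];
            last if $count >= $limit;
        }
    }
    # Close whatever is still open, innermost first.
    $out .= join '', map { "</$_>" } reverse @open;
    return $out;
}
```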

      graff++

      While I was not expecting working code off my pseudo... you gave, and it works.

      I made the following mods --

      Replaced the grep with a straightforward loop through the array that returns 1 as soon as the comparison succeeds. This significantly improved the performance.

      Removed ::Simple to get to HTML::TokeParser directly (my webhost doesn't have H::T::S installed, and I didn't want to bother them... besides, getting to the base module directly perhaps squeezes out a little bit more performance).

      It works, and all thanks and credit to you.

      --

      when small people start casting long shadows, it is time to go to bed
Re: substr(ingifying) htmlized text
by Moron (Curate) on Sep 24, 2005 at 14:29 UTC
    There may be a learning curve if you are unfamiliar with tree structures, but HTML::Tree contains a wealth of methods for loading html into a suitable memory structure and extracting the bits you want.

    -M

    Free your mind

Use HTML::Tidy (Re: substr(ingifying) htmlized text)
by Anonymous Monk on Sep 23, 2005 at 23:22 UTC

    The way I'd do it is to parse out the first ~1000 characters, tags and all, from the page, execute an s/<[^>]*$// to remove any truncated tag at the end, and then feed the result to HTML::Tidy.

    It should close any open tags and give you back a shiny, happy, valid HTML document fragment.
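
    A minimal sketch of the first two steps (the substr and the cleanup regex are straight from the description above; the function name is mine, and the HTML::Tidy handoff assumes that module and libtidy are installed, so it is shown only as a comment):

```perl
use strict;
use warnings;

# Take roughly the first $len raw characters, tags and all, then strip
# any tag that got chopped in half at the end.
sub rough_cut {
    my ($html, $len) = @_;
    my $cut = substr $html, 0, $len;
    $cut =~ s/<[^>]*$//;    # remove a truncated trailing tag, if any
    return $cut;
}

# e.g. rough_cut('<p>hello <a href="x', 100) gives '<p>hello '

# Then hand the fragment to HTML::Tidy to close any open tags:
#   use HTML::Tidy;
#   my $tidy  = HTML::Tidy->new;
#   my $clean = $tidy->clean( rough_cut($page_html, 1000) );
```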

    2005-09-27 Retitled by g0n, as per Monastery guidelines
    Original title: 'HTML::Tidy'