punkish has asked for the wisdom of the Perl Monks concerning the following question:

I want to write a function that will take in htmlized text and spit out the first n chars, keeping the markup well-formed. The only way I can think of doing this is:

1. init an array (@html) to hold open html tags
2. read $in (the htmlized text) char by char
3. $n++ for every non-html char (that is, any char not inside a <...> or </...> tag)
4. add the char (html or otherwise) to $out
5. push each html open tag (<...>) onto @html
6. on encountering a close tag (</...>):
   6.1. search @html for its corresponding open tag
   6.2. and delete it from the array
7. stop when $n reaches the limit
8. add closing tags for all remaining open tags in @html, in reverse order, to the end of $out
9. spit out $out

Is that a reasonable approach? Is it too cumbersome? What are the pitfalls?
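
For what it's worth, the steps above can be sketched in plain Perl with a regex tokenizer and a tag stack. This is my own sketch of the pseudo-code, not a robust parser: the function name is made up, and the tag regex is as naive as the pseudo-code itself (it ignores comments, attributes containing ">", and void tags like <br>, which would get a spurious close tag appended).

```perl
use strict;
use warnings;

# Naive sketch of the pseudo-code above: split the input into tags and
# single text characters, count only the text characters, and close
# whatever tags are left open when the limit is reached.
sub truncate_html {
    my ($in, $limit) = @_;
    my @html;            # stack of open tag names
    my $out   = '';
    my $n     = 0;
    # Grab either one whole tag or one non-tag character at a time.
    while ($in =~ /\G(<[^>]*>|.)/gs) {
        my $tok = $1;
        if ($tok =~ m{^</(\w+)}) {        # close tag: pop its open tag
            for my $i (reverse 0 .. $#html) {
                if ($html[$i] eq lc $1) { splice @html, $i, 1; last }
            }
        }
        elsif ($tok =~ m{^<(\w+)}) {      # open tag: remember it
            push @html, lc $1;
        }
        else {                            # plain text: count it
            $n++;
        }
        $out .= $tok;
        last if $n >= $limit;
    }
    # Close remaining open tags in reverse order.
    $out .= join '', map { "</$_>" } reverse @html;
    return $out;
}

# e.g. truncate_html('<ul><li>abcdef</li></ul>', 3)
#      gives '<ul><li>abc</li></ul>'
```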

Update: Ok. This is the point at which I realize that I should really have stated the actual problem instead of reducing it to pseudo-code. Here goes -- I wrote a wiki+blog+forums+PIM (that works very well for me, and I am quite proud of it ;-)). I enter wiki-formatted text and store it.

Then I set about building an RSS feed generator for it. I want to show only the beginning x% of each entry; however, that entry has all the wiki markup in it (the *s and the /s and the =s, etc.). So either I write something that strips all that out, which mangles the sense of what the entry is about, or I format it with the html formatter that I wrote and then substringify the initial x%, in which case I am left with malformed html (usually unclosed list or pre or map tags wreak havoc). Hence, the above problem.

The "summarized" text is not going to stand on its own -- it will be embedded in an otherwise well-formed page.

To summarize my problem,

Btw, I know that my solution already has pitfalls in it... even minor ones, in that I can't really read char by char, because I really have to read in entire html tags.

Yup, I know parsing html is hairy... anyone who thinks it isn't should set about to build one. It is fun, but very frustrating fun.

--

when small people start casting long shadows, it is time to go to bed

Replies are listed 'Best First'.
Re: substr(ingifying) htmlized text
by sauoq (Abbot) on Sep 23, 2005 at 19:22 UTC
    Is that a reasonable approach? Is it too cumbersome? What are the pitfalls?

    No. Yes. Many and various.

    Parsing HTML is not as easy as using a few regular expressions, so you should be starting with an HTML parser. There are plenty on CPAN. The solution after that point will depend on the parser you choose. You might build the data structure first, or do it with callbacks as you go along... But, essentially, you'll need to walk your tree, count your characters, and toss out the remaining branches you don't need.

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: substr(ingifying) htmlized text
by bprew (Monk) on Sep 23, 2005 at 19:55 UTC

    It sounds like you're more interested in writing a function that acts as a limited HTML parser/validator. Is there a reason you don't want to use an existing HTML parser/validator?

    Barring being able to use existing code... your pseudo-code sounds reasonable, if you're not looking for a >90% solution.

    Although, parsing HTML is not easy, as the many HTML-parser modules on CPAN attest, so depending on how loose you want your outgoing HTML to be, it might be possible to write it in a function.

    Also, the hard part with HTML is always the edge cases, and you have to work under the assumption that no one else knows how to write HTML and all their HTML is "sketchy" at best.

    For example, if you were given this piece of HTML:

    <font>some stuff <h3>some more</font> stuff</h3>

    Your function would have no tags left over, at least according to the pseudo-code, even though this may not be valid HTML.

    It depends on how close to valid HTML you want to get. If you have a specific need, then rolling your own is probably a good idea. However, if you are just looking to try and make HTML more valid... there are probably solutions out there for you.
    See HTML::Validator or HTML::Tidy or even HTML::TokeParser::Simple.

    Also, I would be more explicit about what happens when you find a closing tag with no corresponding opening tag. My guess is that you'll just throw it away, but it's something to think about.

      For example, if you were given this piece of HTML:
      <font>some stuff <h3>some more</font> stuff</h3>
      You are absolutely correct. However, I won't face that problem because I am creating the html in the first place (see my update to the OP).
      --

      when small people start casting long shadows, it is time to go to bed
Re: substr(ingifying) htmlized text
by sk (Curate) on Sep 23, 2005 at 19:24 UTC
    Is it reasonable to assume your question has more to do with whether we can fix an HTML file that is not well-formed?

    I am not sure if that is possible. Too many things to worry about.

    take this for example -

    <HTML> <HEAD> </BODY> </HEAD> </HTML>
    Now </BODY> will close <HEAD>. Well, you could scan backwards and forwards and pick the interpretation that gives valid HTML, but I can surely come up with two errors that will make the program think invalid HTML is valid. Unless you are going to check against keywords, it is going to be hard to do this. And even if you check against keywords, when someone misses a tag, where will you put it?

    Sorry, not much help on the code front; just listing out issues. -SK

Re: substr(ingifying) htmlized text
by graff (Chancellor) on Sep 23, 2005 at 23:40 UTC
    I think your basic idea, of using a stack of html tags so you can close out open tags after truncating the text, is basically sound, and can be combined pretty easily with a good HTML parsing module.

    Here's a crude example that seems to work on some relatively simple HTML data that I tried. There is certainly room for improvement and there are bound to be situations in HTML that will cause it to go wrong, but it's a start...
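
    graff's actual code is not preserved in this copy of the thread, but from the follow-up it evidently combined HTML::TokeParser (of the CPAN HTML-Parser distribution) with a tag stack. The following is my reconstruction of that idea, not graff's original: the function name is made up, and a real version would substr the text token that crosses the limit instead of emitting it whole.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # from the HTML-Parser distribution on CPAN

# Let a real tokenizer find the tag boundaries, keep a stack of open
# tags, and stop once enough text characters have been emitted.
sub trim_html {
    my ($html, $limit) = @_;
    my $p = HTML::TokeParser->new(\$html);
    my @open;            # stack of open tag names
    my $out   = '';
    my $count = 0;
    while (my $t = $p->get_token) {
        my $type = $t->[0];
        if ($type eq 'S') {                # start tag: remember it
            push @open, $t->[1];
            $out .= $t->[4];               # raw text of the tag
        }
        elsif ($type eq 'E') {             # end tag: pop matching open tag
            for my $i (reverse 0 .. $#open) {
                if ($open[$i] eq $t->[1]) { splice @open, $i, 1; last }
            }
            $out .= $t->[2];
        }
        elsif ($type eq 'T') {             # text: count its characters
            $count += length $t->[1];
            $out .= $t->[1];
            last if $count >= $limit;
        }
    }
    # Close whatever is still open, innermost first.
    $out .= join '', map { "</$_>" } reverse @open;
    return $out;
}
```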

      graff++

      While I was not expecting working code off my pseudo... you gave, and it works.

      I made the following mods --

      Replaced the grep with a straightforward loop through the array that returns 1 as soon as the comparison succeeds. This significantly improved the performance.

      Removed ::Simple to get to HTML::TokeParser directly (my webhost doesn't have H::T::S installed, and I didn't want to bother them... besides, getting to the base module directly perhaps squeezes out a little bit more performance).

      It works, and all thanks and credit to you.

      --

      when small people start casting long shadows, it is time to go to bed
Re: substr(ingifying) htmlized text
by Moron (Curate) on Sep 24, 2005 at 14:29 UTC
    There may be a learning curve if you are unfamiliar with tree structures, but HTML::Tree contains a wealth of methods for loading html into a suitable memory structure and extracting the bits you want.

    -M

    Free your mind

Use HTML::Tidy (Re: substr(ingifying) htmlized text)
by Anonymous Monk on Sep 23, 2005 at 23:22 UTC

    The way I'd do it is to parse out the first ~1000 characters, tags and all, from the page, execute an s/<[^>]*$// to remove any truncated tag at the end, and then feed the result to HTML::Tidy.

    It should close any open tags and give you back a shiny, happy, valid HTML document fragment.
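
    A minimal sketch of the first two steps (the substr and the cleanup regex are straight from the description above; the function name is mine, and the HTML::Tidy handoff assumes that module and libtidy are installed, so it is shown only as a comment):

```perl
use strict;
use warnings;

# Take roughly the first $len raw characters, tags and all, then strip
# any tag that got chopped in half at the end.
sub rough_cut {
    my ($html, $len) = @_;
    my $cut = substr $html, 0, $len;
    $cut =~ s/<[^>]*$//;    # remove a truncated trailing tag, if any
    return $cut;
}

# e.g. rough_cut('<p>hello <a href="x', 100) gives '<p>hello '

# Then hand the fragment to HTML::Tidy to close any open tags:
#   use HTML::Tidy;
#   my $tidy  = HTML::Tidy->new;
#   my $clean = $tidy->clean( rough_cut($page_html, 1000) );
```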

    2005-09-27 Retitled by g0n, as per Monastery guidelines
    Original title: 'HTML::Tidy'