punkish has asked for the wisdom of the Perl Monks concerning the following question:
1. init an array to hold html tags (@html)
2. read $in (the htmlized text char by char)
3. $n++ for every non-html char (that is, !~ <.*> or <\/.*>)
4. add the char (html or otherwise) to $out
5. push each html open tag (<.*>) in @html
6. on encountering a close tag (<\/.*>),
   6.1. search @html for its corresponding open tag
   6.2. and delete it from the array
7. stop when $n reaches the limit
8. add closing tags for all remaining open tags in @html in reverse order to the end of $out
9. spit $out
Is that a reasonable approach? Is it too cumbersome? What are the pitfalls?
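For concreteness, the numbered steps might be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested solution: the name truncate_html and the list of void tags are made up for the example, and it assumes simple, properly nested markup with no '>' inside attribute values, no comments, and no script or style blocks. It also counts an entity such as &amp; as several characters, so a cut can land in the middle of one.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): copy $in to the output
# until $limit non-tag characters have been copied, then close open tags.
sub truncate_html {
    my ($in, $limit) = @_;
    # Assumed list of tags that never take a closing tag.
    my %void = map { $_ => 1 } qw(br hr img input meta link area);
    my @html;        # step 1: stack of currently open tag names
    my $out = '';
    my $n   = 0;     # non-tag characters copied so far

    # Step 2: walk the input one token at a time. A token is either a
    # whole tag or a single non-tag character.
    while ($in =~ m{\G(?:(<[^>]*>)|(.))}gcs) {
        my ($tag, $char) = ($1, $2);
        last if $n >= $limit;                        # step 7
        if (defined $tag) {
            $out .= $tag;                            # step 4
            if ($tag =~ m{^</\s*([A-Za-z][A-Za-z0-9]*)}) {
                # Step 6: close tag, so drop the most recent matching open tag.
                my $name = lc $1;
                for my $i (reverse 0 .. $#html) {
                    if ($html[$i] eq $name) { splice @html, $i, 1; last }
                }
            }
            elsif ($tag =~ m{^<\s*([A-Za-z][A-Za-z0-9]*)}) {
                # Step 5: open tag, unless it is void or self-closing.
                my $name = lc $1;
                push @html, $name unless $void{$name} || $tag =~ m{/>\z};
            }
        }
        else {
            $out .= $char;                           # step 4
            $n++;                                    # step 3
        }
    }

    # Step 8: close whatever is still open, innermost first.
    $out .= "</$_>" for reverse @html;
    return $out;                                     # step 9
}

print truncate_html('<ul><li>one</li><li><b>two</b> three</li></ul>', 5), "\n";
# prints: <ul><li>one</li><li><b>tw</b></li></ul>
```

Matching whole tags with a regex instead of reading strictly char by char keeps a tag from being split in half, which is one of the ways the plain substr approach goes wrong.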
Update: Ok. This is the point at which I realize that I should really have stated the actual problem instead of pseudofying it. Here goes -- I wrote a wiki+blog+forums+PIM (that works very well for me, and I am quite proud of it ;-)). I enter wiki-formatted text and store it.
Then I set about building an RSS feed generator for it. I want to show only the beginning x% of each entry; however, that entry has all the wiki markup in it (the *s and the /s and the =s, etc.). So, either I write something that strips all that out, which mangles the sense of what the entry is about, or I format it with the html formatter that I wrote and then substringify the initial x%, in which case I am left with malformed html (usually unclosed list or pre or map tags wreak havoc). Hence, the above problem.
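A small illustration of how the plain substr cut goes wrong. The sample entry and the 50% figure are invented for the example; they are not taken from the actual wiki.

```perl
use strict;
use warnings;

# Hypothetical sample entry, already run through an html formatter.
my $html    = '<ul><li>first item</li><li>second item</li></ul>';
my $percent = 50;   # assumed value for "x%"

# Keep the first x% of the formatted string, counting tags as characters.
my $cut = substr $html, 0, int(length($html) * $percent / 100);
print "$cut\n";
# prints: <ul><li>first item</li><
```

The cut lands inside the second li tag and leaves the ul unclosed, which is the kind of fragment that then breaks the page it is embedded in.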
The "summarized" text is not going to stand on its own -- it will be embedded in an otherwise well-formed page.
To summarize my problem,
Yup, I know parsing html is hairy... anyone who thinks it isn't should set about building a parser. It is fun, but very frustrating fun.
Replies are listed 'Best First'.
Re: substr(ingifying) htmlized text
by sauoq (Abbot) on Sep 23, 2005 at 19:22 UTC
Re: substr(ingifying) htmlized text
by bprew (Monk) on Sep 23, 2005 at 19:55 UTC
by punkish (Priest) on Sep 23, 2005 at 21:09 UTC
Re: substr(ingifying) htmlized text
by sk (Curate) on Sep 23, 2005 at 19:24 UTC
Re: substr(ingifying) htmlized text
by graff (Chancellor) on Sep 23, 2005 at 23:40 UTC
by punkish (Priest) on Sep 24, 2005 at 16:59 UTC
Re: substr(ingifying) htmlized text
by Moron (Curate) on Sep 24, 2005 at 14:29 UTC
Use HTML::Tidy (Re: substr(ingifying) htmlized text)
by Anonymous Monk on Sep 23, 2005 at 23:22 UTC