in reply to substr(ingifying) htmlized text

Is that a reasonable approach? Is it too cumbersome? What are the pitfalls?

No. Yes. Many and various.

Parsing HTML is not as easy as applying a few regular expressions, so you should start with an HTML parser. There are plenty on CPAN. The solution from that point on will depend on the parser you choose. You might build the data structure first, or do it with callbacks as you go along... But, essentially, you'll need to walk your tree, count your characters, and toss out the remaining branches you don't need.
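
To make the "callbacks as you go along" variant concrete, here is a rough sketch. It is written in Python against the standard library's html.parser only so that it runs without installing anything; a callback-style parser from CPAN would have the same shape. The names Truncator, truncate_html and VOID are made up for illustration, and I am assuming you want to count only the visible text characters (entities count as one character each), not the markup.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (hypothetical helper set for this sketch).
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class Truncator(HTMLParser):
    """Copy HTML through until `limit` text characters have been emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)  # entities arrive as plain characters
        self.remaining = limit
        self.out = []          # pieces of the output document
        self.open_tags = []    # stack of elements still waiting for a close tag
        self.done = limit <= 0

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done:
            return
        # Only honour a close tag that matches the innermost open element;
        # stray or misnested close tags are silently dropped in this sketch.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.done:
            return
        piece = data[:self.remaining]
        self.out.append(escape(piece, quote=False))
        self.remaining -= len(piece)
        if self.remaining == 0:
            self.done = True   # everything after this point is tossed out

    def result(self):
        # Close whatever was still open when we stopped copying.
        closing = "".join(f"</{t}>" for t in reversed(self.open_tags))
        return "".join(self.out) + closing


def truncate_html(html, limit):
    parser = Truncator(limit)
    parser.feed(html)
    parser.close()             # flush any text the parser was still buffering
    return parser.result()
```

With that, truncate_html('<p>Hello <b>big wide</b> world</p>', 9) gives '<p>Hello <b>big</b></p>'. Comments, doctype declarations and badly nested markup are simply ignored here, which is exactly the sort of pitfall a real parser-based solution has to decide how to handle.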

-sauoq
"My two cents aren't worth a dime.";