moseley has asked for the wisdom of the Perl Monks concerning the following question:
OK, this is the standard type of problem that Perl should be good at. I'm trying to find a way to solve it without using HTML::Parser. I've got a string that may or may not include some HTML formatting, and the string is broken up into sections by double dashes (think of breadcrumb links):
This is a -- string of -- words
<b>This is a -- string of -- words</b>
This <b>is a -- string</b> of -- words
So, the first one is plain text, the second is wrapped completely in a tag, and the third has a few words wrapped in the bold tag which spans the first and second sections.
Note that the bold tag is just an example; it might be some other type of formatting, such as a style tag or a font change. All I know is that it has a start tag and an end tag.
Also, I don't know how many dashes, if any, will be there.
Now, I need to break that into three links (split by the double dashed line), in a kind of "breadcrumb" type of link. So, looking at the third example, and just the first section, by splitting on the double-dash I'm also splitting up the start and end tags, so I'll need to supply the closing tag after the word "a":
This <b>is a -- string</b> of -- words
Would result in (for the first segment):
This <b>is a</b>
I can do this by tokenizing and keeping track of what words are in the tag and what words are not, but I'm wondering how others would attack this problem. Thanks,
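For illustration, the tokenize-and-track idea described above might be sketched as follows. This is only a rough sketch under several assumptions that are not in the original question: the separator is exactly two dashes with optional surrounding whitespace, tags are simple and properly nested, and void elements such as `<br>` (written without a trailing slash) do not occur. The name `split_balanced` is made up for this example.

```perl
use strict;
use warnings;

# Split $html on "--" separators and return a list of segments in which
# every tag opened inside a segment is also closed inside it.
# (split_balanced is a hypothetical helper, not part of any module.)
sub split_balanced {
    my ($html) = @_;
    my (@segments, @open);   # @open: stack of [tag name, opening tag text]
    my $current = '';

    # Tokens: a tag, a separator with its surrounding whitespace,
    # a run of ordinary text, or a single stray "<" or "-".
    for my $token ($html =~ /(<[^>]*>|\s*--\s*|[^<-]+|.)/gs) {
        if ($token =~ /\A\s*--\s*\z/) {
            # Separator: close whatever is open, store the segment,
            # then reopen the same tags at the start of the next one.
            $current =~ s/\s+\z//;
            $current .= join '', map { "</$_->[0]>" } reverse @open;
            push @segments, $current;
            $current = join '', map { $_->[1] } @open;
            next;
        }
        if ($token =~ m{\A</\s*(\w+)}) {
            # Closing tag: pop the opener only if it is on top of the stack.
            pop @open if @open && lc($open[-1][0]) eq lc($1);
        }
        elsif ($token =~ m{\A<(\w+)[^>]*(?<!/)>\z}) {
            # Opening tag that is not self-closing: remember it.
            push @open, [$1, $token];
        }
        $current .= $token;
    }

    $current =~ s/\s+\z//;
    push @segments, $current;
    return @segments;
}

# Print each segment of the third example on its own line.
print "$_\n" for split_balanced('This <b>is a -- string</b> of -- words');
```

For the third example this should give `This <b>is a</b>`, `<b>string</b> of`, and `words`. Because the stack stores the full text of each opening tag, attributes (a style or font setting, say) are carried over when a tag is reopened in the next segment.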
Re: A nice text processing question
by belden (Friar) on Jan 05, 2002 at 15:07 UTC
by moseley (Acolyte) on Jan 05, 2002 at 20:05 UTC
by belden (Friar) on Jan 06, 2002 at 00:25 UTC
by dragonchild (Archbishop) on Jan 07, 2002 at 19:22 UTC