sartzava has asked for the wisdom of the Perl Monks concerning the following question:
Okay, I'm new to Perl, so this is probably a simple question...
I am trying to edit an XML/XHTML document that was generated by a Quark extraction utility. The paragraphs in the document use nested span tags to apply formatting, and I want to clean up the redundant mark-up that results.
For instance:
<p><span class="type1"><span class="type2">text </span>italictext<span class="type2"> text</span></span></p>
In that example, the span with class="type1" applies an italic style to the entire paragraph. The type2 spans then apply a non-italic style to large sections of the paragraph, leaving only individual words italicized.
Now, instances like this are easy to catch with a regex, but they can also be more involved:
<p><span class="type1"><span class="type2">text </span></span><span class="SmallCaps">text</span><span class="type1"><span class="type2"> text </span>italictext<span class="type2"> text </span>italictext<span class="type2"> text</span>italictext<span class="type2"> text </span>italictext<span class="type2"> text </span>italictext<span class="type2"> text </span>italictext<span class="type2"> text.</span></span></p>
Notice that the "SmallCaps" span is added in the middle of the paragraph and that there are multiple instances of the type2 tags.
Of course, I also have to deal with the possibility of the type2 tags being used to apply an italic style, as in this example I found:
<p>text <span class="type2">italictext</span>text<span class="type2">italictext</span>text</p>
What I would like to do is match the opening and closing tags to each other and make adjustments as necessary to remove the extraneous mark-up. For instance, I want the first instance above to look like this:
<p>text <i>italictext</i> text</p>
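The tag matching described above can be pictured as a stack: every opening span pushes a frame, and every closing span pops the frame of the tag it belongs to. The following is only a rough sketch in plain Perl, not a tested solution for the real Quark output. It rests on two assumptions that the examples suggest but do not confirm: class "type1" switches italics on, and class "type2" inverts the italic state of whatever encloses it. The function name simplify_spans is made up for illustration, and the regex split is a stand-in for the start/end/text events a real parser module would deliver.

```perl
use strict;
use warnings;

# ASSUMPTION (not confirmed in the post): class "type1" switches italics on,
# and class "type2" inverts the italic state of whatever encloses it.
# simplify_spans is a hypothetical name used only for this sketch.
sub simplify_spans {
    my ($xhtml) = @_;

    # One frame per open <span>; the bottom frame means "outside any span".
    my @stack = ({ italic => 0, close => '' });
    my $out   = '';

    # Split into tags and text runs; the capturing group keeps the tags.
    for my $token (grep { length } split /(<[^>]+>)/, $xhtml) {
        if ($token =~ /^<span\b/) {
            my ($class) = $token =~ /\bclass="([^"]*)"/;
            $class = '' unless defined $class;
            my $outer = $stack[-1]{italic};
            if ($class eq 'type1') {
                # Formatting-only span: drop the tag, remember the state.
                push @stack, { italic => 1, close => '' };
            }
            elsif ($class eq 'type2') {
                push @stack, { italic => !$outer, close => '' };
            }
            else {
                # Any other span (e.g. SmallCaps) is kept as it is.
                $out .= $token;
                push @stack, { italic => $outer, close => '</span>' };
            }
        }
        elsif ($token eq '</span>') {
            # The top frame belongs to the matching opening tag.
            my $frame = @stack > 1 ? pop @stack : { close => $token };
            $out .= $frame->{close};
        }
        elsif ($token =~ /^</) {
            $out .= $token;    # <p>, </p> and other tags pass through
        }
        else {
            # Text run: wrap it only if the current state is italic.
            my $italic = $stack[-1]{italic} && $token =~ /\S/;
            $out .= $italic ? "<i>$token</i>" : $token;
        }
    }
    return $out;
}

print simplify_spans(
    '<p><span class="type1"><span class="type2">text </span>italictext'
  . '<span class="type2"> text</span></span></p>'
), "\n";
```

Under those assumptions the first example comes out in the form wanted above. The sketch does not merge adjacent italic runs and does not handle self-closing spans or attributes in single quotes, so it is a way to reason about the stack rather than something to run on production files.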
I need to know if anyone can help me or direct me to an extremely simple example of/tutorial on the XML::Parser or HTML::Parser module, since I am sure that one of those does what I need to do. Again, I am very new at this, so any help will be greatly appreciated.
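For reference, the kind of minimal HTML::Parser example being asked for might look roughly like the sketch below. It assumes the HTML::Parser module from CPAN is installed and uses its version 3 handler interface. It only records the start, end and text events with their nesting depth; the @events array and $depth counter are illustrative names, and none of the span clean-up logic is included.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @events;       # collected here so the result can be inspected afterwards
my $depth = 0;    # current nesting level, used only for indentation

my $parser = HTML::Parser->new(
    api_version => 3,
    # Each handler is [ callback, argspec ]; the argspec string names
    # the values that are passed to the callback.
    start_h => [ sub {
        my ($tag, $attr) = @_;
        my $class = defined $attr->{class} ? $attr->{class} : '';
        push @events, sprintf '%sstart %s class=%s', '  ' x $depth++, $tag, $class;
    }, 'tagname, attr' ],
    end_h => [ sub {
        my ($tag) = @_;
        push @events, sprintf '%send %s', '  ' x --$depth, $tag;
    }, 'tagname' ],
    text_h => [ sub {
        my ($text) = @_;
        push @events, sprintf '%stext "%s"', '  ' x $depth, $text;
    }, 'dtext' ],
);

# Report each run of text in one piece instead of possibly several chunks.
$parser->unbroken_text(1);

$parser->parse('<p>text <span class="type2">italictext</span>text</p>');
$parser->eof;     # signal end of input so buffered text is flushed

print "$_\n" for @events;
```

The start and end handlers are the natural place to keep a stack of open spans, so that each closing tag can be paired with the opening tag it belongs to. For XHTML input the module's xml_mode option may also be worth a look.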
Re: Parsing XML/HTML
by satchm0h (Beadle) on Apr 08, 2005 at 16:06 UTC
Re: Parsing XML/HTML
by rg0now (Chaplain) on Apr 08, 2005 at 15:41 UTC