Okay, I'm new to Perl, so this is probably a simple question...

I am attempting to edit an XML/XHTML document that was generated by a Quark extraction utility. The paragraphs in the document use nested span tags to apply formatting, and I am attempting to fix the issues associated with that.

For instance:

<p><span class="type1"><span class="type2">text </span>italictext<span class="type2"> text</span></span></p>

In that example, the type1 span applies an italic style to the entire paragraph. The type2 spans then apply a non-italic style to large sections of the paragraph, leaving only individual words italicized.

Now, instances like this are easy to catch with a regex, but they can also be more involved:

<p><span class="type1"><span class="type2">text </span></span><span class="SmallCaps">text</span><span class="type1"><span class="type2"> text </span>italictext<span class="type2"> text </span>italictext<span class="type2"> text</span>italictext<span class="type2"> text </span>italictext<span class="type2"> text </span>italictext<span class="type2"> text </span>italictext<span class="type2"> text.</span></span></p>

Notice that the "SmallCaps" span is added in the middle of the paragraph and that there are multiple instances of the type2 tags.

Of course, I also have to deal with the possibility of the type2 tags being used to apply an italic style, as in this example I found:

<p>text <span class="type2">italictext</span>text<span class="type2">italictext</span>text</p>

What I would like to do is match the opening and closing tags to each other and make adjustments as necessary to remove the extraneous mark-up. For instance, I want the first instance above to look like this:

<p>text <i>italictext</i> text</p>

I need to know if anyone can help me or direct me to an extremely simple example of/tutorial on the XML::Parser or HTML::Parser module, since I am sure that one of those does what I need to do. Again, I am very new at this, so any help will be greatly appreciated.
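
The tag matching described above might be sketched with HTML::Parser's event handlers by keeping a stack of the classes of the spans that are currently open, so that each closing span tag pairs with the most recently opened one. This is only a rough sketch under assumptions the real Quark output may not satisfy: that type1 always means italic, that a type2 nested inside type1 switches italics back off, and that every non-span tag can pass through unchanged. The names @open and $out are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# The first example from above.
my $html = '<p><span class="type1"><span class="type2">text </span>'
         . 'italictext<span class="type2"> text</span></span></p>';

my @open;       # classes of the spans currently open, innermost last
my $out = '';   # rewritten markup is collected here

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr, $text) = @_;
        if ($tag eq 'span') {
            # Remember the class instead of copying the tag to the output.
            push @open, defined $attr->{class} ? $attr->{class} : '';
        }
        else {
            $out .= $text;   # any other tag passes through untouched
        }
    }, 'tagname, attr, text' ],
    end_h => [ sub {
        my ($tag, $text) = @_;
        if ($tag eq 'span') {
            pop @open;       # this </span> closes the innermost open span
        }
        else {
            $out .= $text;
        }
    }, 'tagname, text' ],
    text_h => [ sub {
        my ($text) = @_;
        # Assumption: text is italic only when the innermost span is type1.
        my $italic = @open && $open[-1] eq 'type1';
        $out .= $italic ? "<i>$text</i>" : $text;
    }, 'text' ],
);
$p->unbroken_text(1);   # deliver each run of text in one piece
$p->parse($html);
$p->eof;

print "$out\n";   # prints <p>text <i>italictext</i> text</p>
```

This does not cover the case where type2 on its own applies italics, and SmallCaps spans are simply dropped. Both would need extra rules in the text handler once it is clear what each class really means in a given document.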


In reply to Parsing XML/HTML by sartzava
