Okay, I'm new to Perl, so this is probably a simple question...

I am attempting to edit an XML/XHTML document that was generated by a Quark extraction utility. The paragraphs in the document use nested span tags to apply formatting, and I want to clean up the redundant mark-up that results.

For instance:

<p><span class="type1"><span class="type2">text </span>italictext<span class="type2"> text</span></span></p>

In that example, the span with class="type1" applies an italic style to the entire paragraph. The nested type2 spans then apply a non-italic style to large sections of the paragraph, leaving individual words italicized.

Now, instances like this are easy to catch with a regex, but they can also be more involved:

<p><span class="type1"><span class="type2">text </span></span><span class="SmallCaps">text</span><span class="type1"><span class="type2"> text </span>italictext<span class="type2"> text </span>italictext<span class="type2"> text</span>italictext<span class="type2"> text </span>italictext<span class="type2"> text </span>italictext<span class="type2"> text </span>italictext<span class="type2"> text.</span></span></p>

Notice that the "SmallCaps" span is added in the middle of the paragraph and that there are multiple instances of the type2 tags.

Of course, I also have to deal with the possibility of the type2 tags being used to apply an italic style, as in this example I found:

<p>text <span class="type2">italictext</span>text<span class="type2">italictext</span>text</p>

What I would like to do is match the opening and closing tags to each other and make adjustments as necessary to remove the extraneous mark-up. For instance, I want the first instance above to look like this:

<p>text <i>italictext</i> text</p>

I need to know if anyone can help me or direct me to an extremely simple example of/tutorial on the XML::Parser or HTML::Parser module, since I am sure that one of those does what I need to do. Again, I am very new at this, so any help will be greatly appreciated.
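
To make the idea of matching opening and closing tags concrete, here is a minimal sketch of the stack-based approach that an event-driven parser allows. It is written in Python with the standard-library html.parser, purely so that it runs without installing anything; HTML::Parser offers comparable start, end and text handlers, so the same shape should carry over to Perl. The names SpanFlattener, flatten_spans and TOGGLE_CLASSES are hypothetical, and the rule that type1 and type2 each flip italics relative to the enclosing text is an assumption inferred from the three samples above, not something the Quark export guarantees.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: each of these classes flips italics relative to its parent.
# Any other span class (e.g. "SmallCaps") is passed through untouched.
TOGGLE_CLASSES = {"type1", "type2"}


class SpanFlattener(HTMLParser):
    """Drops type1/type2 spans and wraps italic text runs in <i>...</i>."""

    def __init__(self):
        super().__init__()  # character references are decoded by default
        self.stack = []     # one bool per open <span>: True if it is a toggle span
        self.out = []       # output fragments, joined at the end

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            is_toggle = dict(attrs).get("class") in TOGGLE_CLASSES
            self.stack.append(is_toggle)
            if is_toggle:
                return      # swallow the opening tag
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through as written.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Popping the stack is what pairs this </span> with its opening tag.
        if tag == "span" and self.stack and self.stack.pop():
            return          # swallow the matching closing tag
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # An odd number of open toggle spans means the text is italic.
        italic = sum(self.stack) % 2 == 1
        text = escape(data, quote=False)  # re-escape &, < and >
        self.out.append(f"<i>{text}</i>" if italic else text)


def flatten_spans(markup: str) -> str:
    parser = SpanFlattener()
    parser.feed(markup)
    parser.close()          # flush any text still buffered
    return "".join(parser.out)


sample = ('<p><span class="type1"><span class="type2">text </span>'
          'italictext<span class="type2"> text</span></span></p>')
print(flatten_spans(sample))  # <p>text <i>italictext</i> text</p>
```

The stack is the part that a regex cannot easily imitate: every closing span is paired with whichever span was opened most recently, however deeply the tags nest. A real clean-up would probably also want to drop italic runs that contain only whitespace and merge adjacent italic runs, which this sketch does not attempt.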

In reply to Parsing XML/HTML by sartzava
