
Using an XML parser is generally a fairly simple task. Consider this code, which extracts the data as you've described:

#!/usr/bin/env perl -l

use strict;
use warnings;

use XML::LibXML;

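my $xml_file = 'pm_1149767_xml_parse.xml';
my $parser = XML::LibXML::->new();
my $doc = $parser->load_xml(location => $xml_file);

# Capture whatever sits between a pair of percent signs.
my $re = qr{%([^%]+)%};

# Visit only the text nodes directly inside <span> elements, so attribute
# values such as whatever="%DoNotMatch%" are never examined.
for ($doc->findnodes('//span/text()')) {
    print $1 while /$re/g;
}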

Opening a file and reading it line by line would probably take about the same amount of code. However, that approach doesn't account for <span> elements spread over multiple lines. You show the ideal situation:

<span ...>%var%</span>

However, what about the equally valid XML:

<span ...>
    %var%
</span>

The XML parser already has the code to handle both forms. There's little point in attempting to reinvent this wheel; in fact, your chances of getting it completely right (before you've pulled out all of your hair) are small to none.
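To make the problem concrete, here's a minimal sketch of the line-by-line approach using core Perl only. The sample lines and the sub name names_line_by_line are made up for illustration; the pattern is just one plausible way someone might try to match a whole element on a single line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample data: the same kind of placeholder written once on a
# single line and once spread over three lines.
my @lines = (
    '<span color="#231f20">%PN1%</span>',
    '<span color="#231f20">',
    '    %PN2%',
    '</span>',
);

# Naive approach: assume the whole <span> element sits on one line.
sub names_line_by_line {
    my @found;
    for my $line (@_) {
        push @found, $1 while $line =~ m{<span[^>]*>%([^%]+)%</span>}g;
    }
    return @found;
}

my @found = names_line_by_line(@lines);

# Only the single-line element is reported; the multi-line one is silently missed.
print "Line-by-line found: @found\n";
```

Patching the pattern to cope with line breaks is possible, but each patch tends to expose another case (attributes containing %...%, nested markup, comments), which is exactly the work the parser has already done.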

I've indicated 'pm_1149767_xml_parse.xml' in the code above. That's an XML file I've dummied up which contains your <span> elements at different levels of the XML hierarchy as well as a number of edge cases. Here it is:

<root>
    <A>
        <span color="#231f20" whatever="%DoNotMatch%" textOverprint="false">%PN1%</span>
        <span color="#231f20" whatever="%DoNotMatch%" textOverprint="false">
            %PN2%
        </span>
    </A>
    <B>
        <C>
            <span color="#231f20" textOverprint="false">%DIMMM%%DIMINCH%</span>
            <span color="#231f20" textOverprint="false">
                %DIMMM%
                %DIMINCH%
            </span>
            <span color="#231f20" textOverprint="false">%DIMMM%garbage%DIMINCH%</span>
            <span color="#231f20" textOverprint="false">%DIMMM%%%DIMINCH%</span>
            <span color="#231f20" textOverprint="false">%DIMMM%%%%DIMINCH%</span>
        </C>
    </B>
</root>

Here's the output from the script I've shown:

PN1
PN2
DIMMM
DIMINCH
DIMMM
DIMINCH
DIMMM
DIMINCH
DIMMM
DIMINCH
DIMMM
DIMINCH

It's possible you'll need more information than that for your report. In the spoiler below, you'll find a more involved for loop and more verbose output.
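As a rough idea of the kind of extra detail a report could include, here's a sketch that records where each placeholder was found. It is an illustration only, not the spoiler code: the inline sample XML, the @report structure and the output layout are all assumptions. It uses nodePath() from XML::LibXML to show the location of each <span>, and textContent() rather than the text() nodes used earlier, so text inside any nested child elements would also be searched.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use XML::LibXML;

# Hypothetical inline sample; real data would be loaded from the XML file.
my $xml = <<'XML';
<root>
    <A>
        <span color="#231f20">%PN1%</span>
        <span color="#231f20">
            %PN2%
        </span>
    </A>
    <B>
        <span color="#231f20">%DIMMM%%DIMINCH%</span>
    </B>
</root>
XML

my $doc = XML::LibXML->load_xml(string => $xml);
my $re  = qr{%([^%]+)%};

# Collect one record per placeholder: its name and the path of its <span>.
my @report;
for my $span ($doc->findnodes('//span')) {
    my $path    = $span->nodePath;
    my $content = $span->textContent;    # copy to a plain scalar before using /g
    while ($content =~ /$re/g) {
        push @report, { name => $1, path => $path };
    }
}

printf "%-10s %s\n", $_->{name}, $_->{path} for @report;
```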

The XML parser I've used is XML::LibXML. I like this one because it's both handy for small demo scripts, such as I have here, and also suited to full-blown, commercial applications, where I've used it often. There are plenty of others available on CPAN; pick one that suits you.

You'll probably also want to look at "XML Path Language (XPath) 3.1". That's a lengthy W3C specification; I rarely need to reference more than the "3.3.5 Abbreviated Syntax" section.
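To show what the abbreviated syntax buys you, here's a small sketch comparing the '//span/text()' expression used above with its unabbreviated equivalent. The one-line sample document is made up for the demonstration. Note that XML::LibXML is built on libxml2, which implements XPath 1.0; the abbreviations used here are the same in both versions of the specification.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use XML::LibXML;

# Hypothetical sample with <span> elements at two different depths.
my $doc = XML::LibXML->load_xml(
    string => '<root><A><span>%PN1%</span></A><span>%PN2%</span></root>',
);

# '//' abbreviates '/descendant-or-self::node()/', and a step with no axis
# named uses the child:: axis.
my $abbreviated = '//span/text()';
my $expanded    = '/descendant-or-self::node()/child::span/child::text()';

my @short = map { $_->nodeValue } $doc->findnodes($abbreviated);
my @long  = map { $_->nodeValue } $doc->findnodes($expanded);

print "abbreviated: @short\n";
print "expanded:    @long\n";
```

Both expressions select the same text nodes, in document order.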

— Ken

In reply to Re^2: Trouble capturing multiple groupings in regex by kcott
in thread Trouble capturing multiple groupings in regex by reverendphil
