Using an XML parser is generally a fairly simple task. Consider this code which extracts the data as you've described:
#!/usr/bin/env perl -l
use strict;
use warnings;
use XML::LibXML;
my $xml_file = 'pm_1149767_xml_parse.xml';
my $parser = XML::LibXML::->new();
my $doc = $parser->load_xml(location => $xml_file);
my $re = qr{%([^%]+)%};
for ($doc->findnodes('//span/text()')) {
    print $1 while /$re/g;
}
Opening and reading a file line-by-line is probably an equivalent amount of code.
However, that doesn't take into account <span> elements spread over multiple lines.
You show an ideal situation of:
<span ...>%var%</span>
However, what about the equally valid XML:
<span ...>
%var%
</span>
The XML parser already has the code to handle both forms.
There's little point in attempting to reinvent this wheel;
in fact, your chances of getting it completely right (before you've pulled out all of your hair) are small to none.
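To make the multi-line problem concrete, here's a minimal sketch of the kind of line-by-line approach I'm warning against. The sample lines and the regex are my own assumptions for illustration (they're not from your code); the pattern expects the opening tag, the placeholder and the closing tag to all sit on one line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample data: the same kind of placeholder written on
# one line, and then spread over three lines.
my @lines = (
    '<span color="#231f20">%PN1%</span>',
    '<span color="#231f20">',
    '%PN2%',
    '</span>',
);

# Naive approach: assume the whole element is on a single line.
my @found;
for my $line (@lines) {
    push @found, $1 while $line =~ m{<span[^>]*>%([^%]+)%</span>}g;
}

# Only the single-line element is picked up; PN2 is silently missed.
print "Found: @found\n";
```

This prints "Found: PN1". You could keep patching the regex to cope with line breaks, attributes containing '>', and so on, but that's exactly the wheel the parser has already invented.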
I've indicated 'pm_1149767_xml_parse.xml' in the code above.
That's an XML file I've dummied up which contains your <span> elements at different levels of the XML hierarchy as well as a number of edge cases. Here it is:
<root>
<A>
<span color="#231f20" whatever="%DoNotMatch%" textOverprint="false">%PN1%</span>
<span color="#231f20" whatever="%DoNotMatch%" textOverprint="false">
%PN2%
</span>
</A>
<B>
<C>
<span color="#231f20" textOverprint="false">%DIMMM%%DIMINCH%</span>
<span color="#231f20" textOverprint="false">
%DIMMM%
%DIMINCH%
</span>
<span color="#231f20" textOverprint="false">%DIMMM%garbage%DIMINCH%</span>
<span color="#231f20" textOverprint="false">%DIMMM%%%DIMINCH%</span>
<span color="#231f20" textOverprint="false">%DIMMM%%%%DIMINCH%</span>
</C>
</B>
</root>
Here's the output from the script I've shown:
PN1
PN2
DIMMM
DIMINCH
DIMMM
DIMINCH
DIMMM
DIMINCH
DIMMM
DIMINCH
DIMMM
DIMINCH
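Two things in that output are worth spelling out. The %DoNotMatch% attribute values never appear because '//span/text()' selects only the text-node children of each <span>, so attributes never reach the regex. The 'garbage' and doubled-percent cases come down to how the /g matches consume the string. Here's a small sketch, using only core Perl, that applies the same regex to the text of the three single-line edge cases; the @report array is just a name I've made up for the demonstration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $re = qr{%([^%]+)%};    # same pattern as in the script above

# The text content of the three single-line edge cases in the sample XML.
my @edge_cases = (
    '%DIMMM%garbage%DIMINCH%',
    '%DIMMM%%%DIMINCH%',
    '%DIMMM%%%%DIMINCH%',
);

my @report;
for my $text (@edge_cases) {
    # In list context, /g returns every capture in one go.
    my @names = $text =~ /$re/g;
    push @report, "$text => @names";
}

print "$_\n" for @report;
```

Each line of the report ends in "DIMMM DIMINCH". Once '%DIMMM%' has matched, its closing '%' is used up, so 'garbage' has no opening '%' left to pair with; likewise, a '%' followed directly by another '%' can't start a match because [^%]+ needs at least one non-percent character.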
It's possible you'll need more information than that for your report.
Below, you'll find a more involved for loop and more verbose output.
With this for loop:
for my $context ($doc->findnodes('//span')) {
    print $context;
    for my $text ($context->findnodes('text()')) {
        print $text;
        while ($text =~ /$re/g) {
            print $1;
        }
    }
}
You'll get this output:
<span color="#231f20" whatever="%DoNotMatch%" textOverprint="false">%PN1%</span>
%PN1%
PN1
<span color="#231f20" whatever="%DoNotMatch%" textOverprint="false">
%PN2%
</span>
%PN2%
PN2
<span color="#231f20" textOverprint="false">%DIMMM%%DIMINCH%</span>
%DIMMM%%DIMINCH%
DIMMM
DIMINCH
<span color="#231f20" textOverprint="false">
%DIMMM%
%DIMINCH%
</span>
%DIMMM%
%DIMINCH%
DIMMM
DIMINCH
<span color="#231f20" textOverprint="false">%DIMMM%garbage%DIMINCH%</span>
%DIMMM%garbage%DIMINCH%
DIMMM
DIMINCH
<span color="#231f20" textOverprint="false">%DIMMM%%%DIMINCH%</span>
%DIMMM%%%DIMINCH%
DIMMM
DIMINCH
<span color="#231f20" textOverprint="false">%DIMMM%%%%DIMINCH%</span>
%DIMMM%%%%DIMINCH%
DIMMM
DIMINCH
The XML parser I've used is XML::LibXML.
I like this one because it's both handy for small demo scripts, such as I have here, and also suited to full-blown, commercial applications, where I've used it often.
There are plenty of others available on CPAN: pick one that suits you.
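You'll probably also want to look at "XML Path Language (XPath) 3.1".
That's a lengthy W3C specification: I rarely need to reference more than the "3.3.5 Abbreviated Syntax" section.
Bear in mind that libxml2, which XML::LibXML wraps, implements XPath 1.0, so features introduced in later versions won't be available.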
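As a taste of what the abbreviated syntax buys you, here's a short sketch comparing the '//span/text()' expression I used with its unabbreviated equivalent. Both forms are plain XPath 1.0. The miniature document and the %result hash are made up purely for this illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical miniature document, just for this illustration.
my $doc = XML::LibXML->load_xml(string => <<'XML');
<root><A><span id="%no%">%PN1%</span></A><B><C><span>%DIMMM%</span></C></B></root>
XML

# '//' abbreviates '/descendant-or-self::node()/', and a bare name
# test or text() is taken along the child:: axis.
my $abbreviated = '//span/text()';
my $expanded    = '/descendant-or-self::node()/child::span/child::text()';

my %result;
for my $xpath ($abbreviated, $expanded) {
    # Collect the string data of every matching text node.
    $result{$xpath} = join ' ', map { $_->data } $doc->findnodes($xpath);
    print "$xpath => $result{$xpath}\n";
}
```

Both expressions select the same two text nodes, '%PN1%' and '%DIMMM%', in document order; the 'id' attribute is ignored by both.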