Using an XML parser is generally a fairly simple task. Consider this code which extracts the data as you've described:
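#!/usr/bin/env perl -l
# The -l switch appends a newline to every print.

use strict;
use warnings;

use XML::LibXML;

my $xml_file = 'pm_1149767_xml_parse.xml';
my $parser = XML::LibXML::->new();
my $doc = $parser->load_xml(location => $xml_file);

# Capture whatever sits between a pair of % characters.
my $re = qr{%([^%]+)%};

# Visit the text of every <span>, at any depth; attributes are not searched.
for ($doc->findnodes('//span/text()')) {
    print $1 while /$re/g;
}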
Opening and reading a file line-by-line is probably an equivalent amount of code. However, that doesn't take into account <span> elements spread over multiple lines. You show an ideal situation of:
<span ...>%var%</span>
However, what about the equally valid XML:
<span ...>
    %var%
</span>
The XML parser already has the code to handle both forms. There's little point in attempting to reinvent this wheel; in fact, your chances of getting it completely right (before you've pulled out all of your hair) are slim to none.
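To make that concrete, here's a minimal sketch in core Perl of a line-by-line approach tripping over the multi-line form. The sample lines and the `$naive` pattern are made up for illustration; they're not taken from your code.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample: the same kind of placeholder written on one line,
# then spread across three lines.
my @lines = (
    '<span color="#231f20">%PN1%</span>',
    '<span color="#231f20">',
    '    %PN2%',
    '</span>',
);

# A naive per-line pattern that expects the whole element on a single line.
my $naive = qr{<span[^>]*>%([^%]+)%</span>};

my @found;
for my $line (@lines) {
    push @found, $1 if $line =~ $naive;
}

# Only PN1 is found; PN2 is silently missed.
print "Found: @found\n";
```

You could slurp the whole file and loosen the pattern to get PN2 back, but then attributes, comments and nesting each need their own special handling, which is the wheel the parser has already built.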
I've indicated 'pm_1149767_xml_parse.xml' in the code above. That's an XML file I've dummied up which contains your <span> elements at different levels of the XML hierarchy as well as a number of edge cases. Here it is:
<root>
    <A>
        <span color="#231f20" whatever="%DoNotMatch%" textOverprint="false">%PN1%</span>
        <span color="#231f20" whatever="%DoNotMatch%" textOverprint="false">
            %PN2%
        </span>
    </A>
    <B>
        <C>
            <span color="#231f20" textOverprint="false">%DIMMM%%DIMINCH%</span>
            <span color="#231f20" textOverprint="false">
                %DIMMM%
                %DIMINCH%
            </span>
            <span color="#231f20" textOverprint="false">%DIMMM%garbage%DIMINCH%</span>
            <span color="#231f20" textOverprint="false">%DIMMM%%%DIMINCH%</span>
            <span color="#231f20" textOverprint="false">%DIMMM%%%%DIMINCH%</span>
        </C>
    </B>
</root>
Here's the output from the script I've shown:
PN1
PN2
DIMMM
DIMINCH
DIMMM
DIMINCH
DIMMM
DIMINCH
DIMMM
DIMINCH
DIMMM
DIMINCH
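If the "garbage" and doubled or tripled % cases look surprising, this short sketch runs the same pattern over those text values directly. It uses only core Perl; the `@texts` and `@results` names are mine, and the strings are copied from the sample file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Same pattern as in the script.
my $re = qr{%([^%]+)%};

# Text values taken from the edge-case <span> elements in the sample file.
my @texts = (
    '%DIMMM%%DIMINCH%',
    '%DIMMM%garbage%DIMINCH%',
    '%DIMMM%%%DIMINCH%',
    '%DIMMM%%%%DIMINCH%',
);

my @results;
for my $text (@texts) {
    # In list context, /g returns every capture in one go.
    my @names = $text =~ /$re/g;
    push @results, "@names";
    print "$text => @names\n";
}
```

Every line reports DIMMM and DIMINCH only. Each match resumes after the closing % of the previous one, so "garbage" never has an opening % of its own, and `[^%]+` needs at least one non-% character, so an empty %% pair can't match.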
It's possible you'll need more information than that for your report. In the spoiler below, you'll find a more involved for loop and more verbose output.
The XML parser I've used is XML::LibXML. I like this one because it's both handy for small demo scripts, such as I have here, and also suited to full-blown, commercial applications, where I've used it often. There are lots of others available on CPAN: pick one that suits you.
You'll probably also want to look at "XML Path Language (XPath) 3.1". That's a lengthy, W3C specification: I rarely need to reference more than the "3.3.5 Abbreviated Syntax" section.
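As a quick illustration of that abbreviated syntax: `//span/text()` is shorthand for `/descendant-or-self::node()/child::span/child::text()`. The sketch below runs both forms and compares the node counts. It assumes XML::LibXML is installed, and the inline document is a made-up, cut-down sample rather than the file above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical two-element sample, inlined so the sketch is self-contained.
my $xml = <<'XML';
<root>
    <A><span>%PN1%</span></A>
    <B><C><span>%PN2%</span></C></B>
</root>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Abbreviated form, as used in the script.
my @short = $doc->findnodes('//span/text()');

# The same path with every axis spelled out.
my @long = $doc->findnodes(
    '/descendant-or-self::node()/child::span/child::text()'
);

printf "abbreviated: %d nodes, unabbreviated: %d nodes\n",
    scalar @short, scalar @long;
```

Both expressions select the same text nodes; the abbreviated form is just much easier to read and type.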
— Ken