Trouble capturing multiple groupings in regex

reverendphil has asked for the wisdom of the Perl Monks concerning the following question:

Re: Trouble capturing multiple groupings in regex by stevieb (Canon) on Dec 09, 2015 at 14:37 UTC
Here's one way to do it. It builds an array containing all the matches for each line of input:

    use warnings;
    use strict;

    while (<DATA>){
        my @matches;
        while (/%([^%]+)%/g){
            push @matches, $1;
        }
        print join ' ', @matches;
        print "\n";
    }

    __DATA__
    <span color="#231f20" textOverprint="false">%PN1%</span>
    <span color="#231f20" textOverprint="false">%DIMMM%%DIMINCH%</span>

Output:

    PN1
    DIMMM DIMINCH
Re^2: Trouble capturing multiple groupings in regex by reverendphil (Initiate) on Dec 09, 2015 at 15:07 UTC
I've updated my post. I should've included examples of the %variable% tag appearing in places where it is to be skipped, which is why the bounding < and > in my original matching pattern matter. I have a main bounding area delimited by the >< chars, and within that a set of fields bounded by %'s.
Re^3: Trouble capturing multiple groupings in regex by Corion (Patriarch) on Dec 09, 2015 at 15:14 UTC
After thinking a bit more about this, the following approach using look-around assertions works:

    use warnings;
    use strict;

    while (<DATA>){
        my @matches;
        @matches = (/(?<=[%>])%([^%]+)%(?=[%<])/g);
        print join ' ', @matches;
        print "\n";
    }

    __DATA__
    <span color="#231f20" someattr="%do_not_match%" textOverprint="false">%PN1%</span>
    <span color="#231f20" someattr="%do_not_match%" textOverprint="false">%DIMMM%%DIMINCH%</span>

Output:

    PN1
    DIMMM DIMINCH
Re^4: Trouble capturing multiple groupings in regex by reverendphil (Initiate) on Dec 09, 2015 at 15:24 UTC
Re^5: Trouble capturing multiple groupings in regex by AnomalousMonk (Archbishop) on Dec 09, 2015 at 16:27 UTC
Re: Trouble capturing multiple groupings in regex (skip) by tye (Sage) on Dec 09, 2015 at 15:48 UTC
Usually you do this type of thing because you want to replace the values, which I'd do like:

    s{(<[^>]+>)|%([^%]+)%}{ $1 || $replace{$2} // "%$2%" }ge;

To just fetch the names, I'd do:

    my @matches = grep defined, $data =~ m{<[^>]+>|%([^%]+)%}g;

Avoiding complex constructs that are so easy to get wrong.

- tye
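A minimal runnable sketch of the skip idea above, reading the `|` in both patterns as plain alternation (tags match the first branch and are passed over). The `%replace` contents, the sub names `field_names` and `fill_fields`, and the sample line are invented for illustration; only the two patterns come from the post.

```perl
use strict;
use warnings;

# Hypothetical replacement table; keys and values are made up for this sketch.
my %replace = (PN1 => 'A-100', DIMMM => '25');

# Return the %name% fields that appear outside of <...> tags.
sub field_names {
    my ($data) = @_;
    # A tag matches the first alternative and leaves $1 undefined,
    # so grep drops it from the result list.
    return grep { defined } $data =~ m{<[^>]+>|%([^%]+)%}g;
}

# Replace known fields, copy tags through untouched, keep unknown fields as-is.
sub fill_fields {
    my ($data) = @_;
    $data =~ s{(<[^>]+>)|%([^%]+)%}{ $1 || $replace{$2} // "%$2%" }ge;
    return $data;
}

my $line = '<span someattr="%do_not_match%">%PN1%%DIMINCH%</span>';
print join(' ', field_names($line)), "\n";
print fill_fields($line), "\n";
```

Because the whole tag is consumed by the first alternative, a `%...%` inside an attribute value is never seen by the second one. Note that `//` needs Perl 5.10 or later.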
Re: Trouble capturing multiple groupings in regex by Corion (Patriarch) on Dec 09, 2015 at 14:34 UTC
The problem is that repeating a capture group overwrites its "inner" capture on every iteration, so you won't be able to get more than one result per match from something like `/(foo)*/g`. How much of the input do you control? Would it be feasible to just match any word characters between `%...%`?

    my @matches = ($data =~ m/(?:%(\w+)%)/g );
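To make the overwriting concrete, here is a small sketch contrasting a repeated group with a `/g` match in list context. The sub names and the sample string are invented for the example.

```perl
use strict;
use warnings;

# A repeated group keeps only its last iteration in $1.
sub last_repeated_capture {
    my ($text) = @_;
    return $text =~ /^(?:%(\w+)%)+$/ ? $1 : undef;
}

# A /g match in list context returns the capture from every match instead.
sub all_captures {
    my ($text) = @_;
    return $text =~ /%(\w+)%/g;
}

my $sample = '%DIMMM%%DIMINCH%';    # sample value invented for this sketch
print last_repeated_capture($sample), "\n";
print join(' ', all_captures($sample)), "\n";
```

The first line printed holds only the final field name, while the second holds both, which is why the replies here all lean on `/g` rather than on a quantified group.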
Re: Trouble capturing multiple groupings in regex by AnomalousMonk (Archbishop) on Dec 09, 2015 at 16:18 UTC
Another approach, with some attempt to make parsing more tolerant of variations in format:

    c:\@Work\Perl>perl -wMstrict -le
    "use Data::Dump qw(dd);
     ;;
     my $start = qr{ > \s* }xms;
     my $more  = qr{ \G (?<! \A) }xms;
     my $post  = qr{ \s* <? }xms;
     ;;
     for my $s (
       '<span c=\"#12\" foo=\"%DoNotMatch%\" bozz=\"false\">%PN1%</span>',
       '<span c=\"#98\" bar=\"false\" zot=\"%NoNoNo%\"> %DIMMM% %DIMINCH% </span>',
       ) {
       print qq{'$s'};
       my @matches = $s =~ m{ (?: $more | $start) % ([^%]+) % $post }xmsg;
       dd \@matches;
       }
    "
    '<span c="#12" foo="%DoNotMatch%" bozz="false">%PN1%</span>'
    ["PN1"]
    '<span c="#98" bar="false" zot="%NoNoNo%"> %DIMMM% %DIMINCH% </span>'
    ["DIMMM", "DIMINCH"]

Please see perlre, perlretut, and perlrequick.

Caveat: any "pure regex" approach to parsing XML is fragile, probably very fragile.

Give a man a fish: `<%-{-{-{-<`
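For anyone not on a Windows shell, the same idea can be sketched as a plain script that uses only core Perl. The sub name `span_fields` is invented here, and `Data::Dump` is swapped for a simple `join`; the three `qr` pieces and the sample strings are those from the one-liner, with `|` read as alternation.

```perl
use strict;
use warnings;

my $start = qr{ > \s* }xms;         # a field may begin right after a tag closes
my $more  = qr{ \G (?<! \A) }xms;   # or exactly where the previous match ended
my $post  = qr{ \s* <? }xms;        # swallow trailing space and an opening '<'

# Return the %name% fields found in element content, skipping attribute values.
sub span_fields {
    my ($s) = @_;
    return $s =~ m{ (?: $more | $start) % ([^%]+) % $post }xmsg;
}

for my $s (
    '<span c="#12" foo="%DoNotMatch%" bozz="false">%PN1%</span>',
    '<span c="#98" bar="false" zot="%NoNoNo%"> %DIMMM% %DIMINCH% </span>',
) {
    print join(', ', span_fields($s)), "\n";
}
```

The `\G (?<! \A)` piece is what lets a second field follow the first one directly: it only matches where the previous match stopped, and never at the very start of the string.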
Re: Trouble capturing multiple groupings in regex by Ratazong (Monsignor) on Dec 09, 2015 at 14:43 UTC
The following works for me:

    my $input1 ="<span color=\"#231f20\" textOverprint=\"false\">%PN1%</span>";
    my $input2 ="<span color=\"#231f20\" textOverprint=\"false\">%DIMMM%%DIMINCH%</span>";

    my @matches = $input1 =~ m/%([^%]+)%/g ;
    print @matches,"\n";
    @matches = $input2 =~ m/%([^%]+)%/g ;
    print $matches[0],"___",$matches[1],"\n";

HTH, Rata
Re: Trouble capturing multiple groupings in regex by reverendphil (Initiate) on Dec 09, 2015 at 16:49 UTC
Thanks everyone. As expected, the actual data I got to look at is formatted differently than the sample data I was working from, and unless I can find where they store these objects in that other format, I'm going to have to go the route of properly parsing the XML as people suggested. I was looking for a 'quick' method of just pulling out a mostly accurate listing of the variables used in each of these templates, to spot check for consistency in naming and usage. I may have overestimated my ability to 'quickly do this', considering the documents I'm actually working with are not a match for the samples I had and are not entirely understood at this point. I really do appreciate the help, and this still might come in handy when I can find the appropriately formatted documents, but if I want to build this report I'll probably have to spend a bit more time and parse the XML once I better understand how it's built.
Re^2: Trouble capturing multiple groupings in regex by kcott (Archbishop) on Dec 09, 2015 at 20:56 UTC
Using an XML parser is generally a fairly simple task. Consider this code which extracts the data as you've described:

    #!/usr/bin/env perl -l

    use strict;
    use warnings;

    use XML::LibXML;

    my $xml_file = 'pm_1149767_xml_parse.xml';

    my $parser = XML::LibXML::->new();
    my $doc = $parser->load_xml(location => $xml_file);

    my $re = qr{%([^%]+)%};

    for ($doc->findnodes('//span/text()')) {
        print $1 while /$re/g;
    }

Opening and reading a file line-by-line is probably an equivalent amount of code. However, that doesn't take into account `<span>` elements spread over multiple lines. You show an ideal situation of:

    <span ...>%var%</span>

However, what about the equally valid XML:

    <span ...>
        %var%
    </span>

The XML parser already has the code to handle this. There's little point in attempting to reinvent this wheel; in fact, your chances of getting it completely right (before you've pulled out all of your hair) are small to none.

I've indicated '`pm_1149767_xml_parse.xml`' in the code above. That's an XML file I've dummied up which contains your `<span>` elements at different levels of the XML hierarchy as well as a number of edge cases. Here it is:

    <root>
        <A>
            <span color="#231f20" whatever="%DoNotMatch%" textOverprint="false">%PN1%</span>
            <span color="#231f20" whatever="%DoNotMatch%" textOverprint="false">
                %PN2%
            </span>
        </A>
        <B>
            <C>
                <span color="#231f20" textOverprint="false">%DIMMM%%DIMINCH%</span>
                <span color="#231f20" textOverprint="false">
                    %DIMMM%
                    %DIMINCH%
                </span>
                <span color="#231f20" textOverprint="false">%DIMMM%garbage%DIMINCH%</span>
                <span color="#231f20" textOverprint="false">%DIMMM%%%DIMINCH%</span>
                <span color="#231f20" textOverprint="false">%DIMMM%%%%DIMINCH%</span>
            </C>
        </B>
    </root>

Here's the output from the script I've shown:

    PN1
    PN2
    DIMMM
    DIMINCH
    DIMMM
    DIMINCH
    DIMMM
    DIMINCH
    DIMMM
    DIMINCH
    DIMMM
    DIMINCH

It's possible you'll need more information than that for your report.
Here's a more involved `for` loop with more verbose output. With this `for` loop:

    for my $context ($doc->findnodes('//span')) {
        print $context;
        for my $text ($context->findnodes('text()')) {
            print $text;
            while ($text =~ /$re/g) {
                print $1;
            }
        }
    }

You'll get this output:

    <span color="#231f20" whatever="%DoNotMatch%" textOverprint="false">%PN1%</span>
    %PN1%
    PN1
    <span color="#231f20" whatever="%DoNotMatch%" textOverprint="false">
                %PN2%
            </span>

                %PN2%

    PN2
    <span color="#231f20" textOverprint="false">%DIMMM%%DIMINCH%</span>
    %DIMMM%%DIMINCH%
    DIMMM
    DIMINCH
    <span color="#231f20" textOverprint="false">
                    %DIMMM%
                    %DIMINCH%
                </span>

                    %DIMMM%
                    %DIMINCH%

    DIMMM
    DIMINCH
    <span color="#231f20" textOverprint="false">%DIMMM%garbage%DIMINCH%</span>
    %DIMMM%garbage%DIMINCH%
    DIMMM
    DIMINCH
    <span color="#231f20" textOverprint="false">%DIMMM%%%DIMINCH%</span>
    %DIMMM%%%DIMINCH%
    DIMMM
    DIMINCH
    <span color="#231f20" textOverprint="false">%DIMMM%%%%DIMINCH%</span>
    %DIMMM%%%%DIMINCH%
    DIMMM
    DIMINCH

The XML parser I've used is XML::LibXML. I like this one because it's both handy for small demo scripts, such as I have here, and also suited to full-blown, commercial applications, where I've used it often. There are lots of others available on CPAN: pick one that suits you.

You'll probably also want to look at "XML Path Language (XPath) 3.1". That's a lengthy W3C specification: I rarely need to reference more than the "3.3.5 Abbreviated Syntax" section.

- Ken