reverendphil has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks. Been so long I can't even remember my old username/email, but I find myself dusting off my swiss army knife to analyze some data again, and I'm *so close* to good, but stuck.

I'm trying to parse out what are essentially variable-data placeholders in a file, delimited by %'s and sitting outside of some XML tags, so you might see somexml....>%avariable%<somexml...

The problem arises when the user puts multiple fields in a given location, such as >%first%%second%<, and once I noticed that issue and adjusted my regex, the best I could get was capturing the second variable and skipping the first. I'm trying *not* to capture the bounding characters, just the text within. Here's a sample of a portion of data that will be parsed:

<span color="#231f20" whatever="%DoNotMatch%" textOverprint="false">%PN1%</span> <span color="#231f20" textOverprint="false">%DIMMM%%DIMINCH%</span>

I'm attempting to pull PN1, DIMMM, and DIMINCH from this text block. Here's the closest I've gotten:

    my @matches = ($data =~ m/>(?:%([^%]+)%)+</g );

In this scenario, I'm getting PN1, DIMINCH. It's matching the full >%DIMMM%%DIMINCH%< string, but only capturing the second portion. I'm unable to figure out how to repeat the delimiting characters as well as the match target itself, without capturing the delimiting characters. Any help would be appreciated.

edit: Based on replies, here's some more info. I'm parsing all of the lines within a file. I can't guarantee line breaks, so the whole sample could sit within a single 'line', and I'm currently slurping the file into one string. Additionally, there are other instances of %blah% within the XML, so I can't just match on that string; I do need the bounding >% and %< overall to avoid matching those pieces.

Re: Trouble capturing multiple groupings in regex
by stevieb (Canon) on Dec 09, 2015 at 14:37 UTC

    Here's one way to do it. It creates a new array which contains all matches for each line of input:

    use warnings;
    use strict;

    while (<DATA>){
        my @matches;
        while (/%([^%]+)%/g){
            push @matches, $1;
        }
        print join ' ', @matches;
        print "\n";
    }

    __DATA__
    <span color="#231f20" textOverprint="false">%PN1%</span>
    <span color="#231f20" textOverprint="false">%DIMMM%%DIMINCH%</span>
    __END__
    PN1
    DIMMM DIMINCH

      I've updated my post. I should've included examples of the %variable% tag being found in places where it is to be skipped, highlighting the importance of my bounding < and > in the original matching pattern. I have a main bounding area delimited by the >< chars, and a set of fields within that bounded by %'s.

        After thinking a bit more about this, the following approach using look-around assertion works:

        use warnings;
        use strict;

        while (<DATA>){
            my @matches;
            @matches = (/(?<=[%>])%([^%]+)%(?=[%<])/g);
            print join ' ', @matches;
            print "\n";
        }

        __DATA__
        <span color="#231f20" someattr="%do_not_match%" textOverprint="false">%PN1%</span>
        <span color="#231f20" someattr="%do_not_match%" textOverprint="false">%DIMMM%%DIMINCH%</span>
        __END__
        PN1
        DIMMM DIMINCH
Re: Trouble capturing multiple groupings in regex (skip)
by tye (Sage) on Dec 09, 2015 at 15:48 UTC

    Usually you do this type of thing because you want to replace the values, which I'd do like:

    s{(<[^>]+>)|%([^%]+)%}{ $1 || $replace{$2} // "%$2%" }ge;

    To just fetch the names, I'd do:

    my @matches = grep defined, $data =~ m{<[^>]+>|%([^%]+)%}g;

    Avoiding complex constructs that are so easy to get wrong.

    - tye        
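
    A minimal sketch of both forms in action, assuming a made-up sample string and a hypothetical %replace hash (neither comes from the original data). Tags are consumed by the first alternative, so any %...% inside an attribute is never seen by the second one:

```perl
use strict;
use warnings;

# Made-up sample, modelled on the data in the question.
my $data = '<span whatever="%DoNotMatch%">%PN1%</span> <span>%DIMMM%%DIMINCH%</span>';

# Whole tags match the first alternative and capture nothing (undef),
# so grep keeps only the names captured by the second alternative.
my @names = grep { defined } $data =~ m{<[^>]+>|%([^%]+)%}g;
print "@names\n";

# Hypothetical replacement values; DIMINCH is deliberately left out.
my %replace = ( PN1 => 'A-100', DIMMM => '25' );

# Tags are copied through via $1; unknown placeholders are put back unchanged.
( my $filled = $data ) =~ s{(<[^>]+>)|%([^%]+)%}{ $1 || $replace{$2} // "%$2%" }ge;
print "$filled\n";
```

    Note that this still treats the input as text rather than XML, so a literal > inside an attribute value would end the "tag" early.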

Re: Trouble capturing multiple groupings in regex
by Corion (Patriarch) on Dec 09, 2015 at 14:34 UTC

    The problem is that the repeated capturing overwrites the "inner" capture group so you won't be able to get more than one result from something like /(foo)*/g.
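
    A minimal sketch of that behaviour, using a made-up sample string. The second statement shows one possible workaround (an assumption on my part, not something from the original post): capture the whole bounded run first, then split it in a second pass.

```perl
use strict;
use warnings;

# Made-up sample, modelled on the data in the question.
my $data = '<span a="%DoNotMatch%">%PN1%</span> <span>%DIMMM%%DIMINCH%</span>';

# A quantified capture group keeps only the value from its last iteration.
my @last_only = $data =~ m/>(?:%([^%]+)%)+</g;
print "@last_only\n";    # PN1 DIMINCH

# Workaround sketch: capture each >...< run whole, then extract every
# %name% from that run with a second, simpler match.
my @all = map { /%([^%]+)%/g } $data =~ m/>((?:%[^%]+%)+)</g;
print "@all\n";          # PN1 DIMMM DIMINCH
```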

    How much of the input do you control? Would it be feasible to just match any letters between %...% ?

    my @matches = ($data =~ m/(?:%(\w+)%)/g );
Re: Trouble capturing multiple groupings in regex
by AnomalousMonk (Archbishop) on Dec 09, 2015 at 16:18 UTC

    Another approach, with some attempt to make parsing more tolerant of variations in format:

    c:\@Work\Perl>perl -wMstrict -le
    "use Data::Dump qw(dd);
     ;;
     my $start = qr{ > \s* }xms;
     my $more  = qr{ \G (?<! \A) }xms;
     my $post  = qr{ \s* <? }xms;
     ;;
     for my $s (
       '<span c=\"#12\" foo=\"%DoNotMatch%\" bozz=\"false\">%PN1%</span>',
       '<span c=\"#98\" bar=\"false\" zot=\"%NoNoNo%\"> %DIMMM% %DIMINCH% </span>',
       ) {
       print qq{'$s'};
       my @matches = $s =~ m{ (?: $more | $start) % ([^%]+) % $post }xmsg;
       dd \@matches;
       }
    "
    '<span c="#12" foo="%DoNotMatch%" bozz="false">%PN1%</span>'
    ["PN1"]
    '<span c="#98" bar="false" zot="%NoNoNo%"> %DIMMM% %DIMINCH% </span>'
    ["DIMMM", "DIMINCH"]

    Please see perlre, perlretut, and perlrequick. Caveat: any "pure regex" approach to parsing XML is fragile, probably very fragile.
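
    To make the caveat concrete, here is a minimal sketch with a made-up sample: markup that has been commented out still matches a >%...%< pattern, whereas an XML parser would skip the comment. The two-pass pattern is only an illustration, not code from this thread.

```perl
use strict;
use warnings;

# Made-up sample: the first span is commented out and should be ignored.
my $data = '<!-- <span>%OLD%</span> --><span>%PN1%</span>';

# Capture each >%...%< run, then pull the names out of that run.
my @names = map { /%([^%]+)%/g } $data =~ m/>((?:%[^%]+%)+)</g;
print "@names\n";    # OLD PN1 (OLD comes from inside the comment)
```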


    Give a man a fish:  <%-{-{-{-<

Re: Trouble capturing multiple groupings in regex
by Ratazong (Monsignor) on Dec 09, 2015 at 14:43 UTC

    The following works for me:

    my $input1 ="<span color=\"#231f20\" textOverprint=\"false\">%PN1%</span>";
    my $input2 ="<span color=\"#231f20\" textOverprint=\"false\">%DIMMM%%DIMINCH%</span>";
    my @matches = $input1 =~ m/%([^%]+)%/g ;
    print @matches,"\n";
    @matches = $input2 =~ m/%([^%]+)%/g ;
    # $matches[0] rather than @matches[0]: a one-element slice warns under 'use warnings'
    print $matches[0],"___",$matches[1],"\n";

    HTH, Rata

Re: Trouble capturing multiple groupings in regex
by reverendphil (Initiate) on Dec 09, 2015 at 16:49 UTC

    Thanks everyone. As expected, the actual data I got to look at is formatted differently from the sample data I had, and unless I can find where they store these objects in that other format, I'm going to have to go the route of properly parsing the XML, as people suggested. I was looking for a 'quick' way to pull out a mostly accurate list of the variables used in each of these templates, to spot-check for consistency in naming and usage. I may have overestimated my ability to 'quickly do this', considering the documents I'm actually working with don't match the samples I had and aren't entirely understood at this point. I really do appreciate the help, and this still might come in handy when I can find the appropriately formatted documents, but if I want to build this report I'll probably have to spend a bit more time and parse the XML once I better understand how it's built.

      Using an XML parser is generally a fairly simple task. Consider this code which extracts the data as you've described:

      #!/usr/bin/env perl -l
      use strict;
      use warnings;

      use XML::LibXML;

      my $xml_file = 'pm_1149767_xml_parse.xml';
      my $parser = XML::LibXML::->new();
      my $doc = $parser->load_xml(location => $xml_file);

      my $re = qr{%([^%]+)%};

      for ($doc->findnodes('//span/text()')) {
          print $1 while /$re/g;
      }

      Opening and reading a file line-by-line is probably an equivalent amount of code. However, that doesn't take into account <span> elements spread over multiple lines. You show an ideal situation of:

      <span ...>%var%</span>

      However, what about the equally valid XML:

      <span ...>
          %var%
      </span>

      The XML parser already has the code to do this. There's little point in attempting to reinvent this wheel; in fact, your chances of getting it completely right (before you've pulled out all of your hair) are small to none.

      I've indicated 'pm_1149767_xml_parse.xml' in the code above. That's an XML file I've dummied up which contains your <span> elements at different levels of the XML hierarchy as well as a number of edge cases. Here it is:

      <root>
          <A>
              <span color="#231f20" whatever="%DoNotMatch%" textOverprint="false">%PN1%</span>
              <span color="#231f20" whatever="%DoNotMatch%" textOverprint="false">
                  %PN2%
              </span>
          </A>
          <B>
              <C>
                  <span color="#231f20" textOverprint="false">%DIMMM%%DIMINCH%</span>
                  <span color="#231f20" textOverprint="false">
                      %DIMMM%
                      %DIMINCH%
                  </span>
                  <span color="#231f20" textOverprint="false">%DIMMM%garbage%DIMINCH%</span>
                  <span color="#231f20" textOverprint="false">%DIMMM%%%DIMINCH%</span>
                  <span color="#231f20" textOverprint="false">%DIMMM%%%%DIMINCH%</span>
              </C>
          </B>
      </root>

      Here's the output from the script I've shown:

      PN1
      PN2
      DIMMM
      DIMINCH
      DIMMM
      DIMINCH
      DIMMM
      DIMINCH
      DIMMM
      DIMINCH
      DIMMM
      DIMINCH

      It's possible you'll need more information than that for your report. Here's a more involved for loop that produces more verbose output.

      With this for loop:

      for my $context ($doc->findnodes('//span')) {
          print $context;
          for my $text ($context->findnodes('text()')) {
              print $text;
              while ($text =~ /$re/g) {
                  print $1;
              }
          }
      }

      You'll get this output:

      <span color="#231f20" whatever="%DoNotMatch%" textOverprint="false">%PN1%</span>
      %PN1%
      PN1
      <span color="#231f20" whatever="%DoNotMatch%" textOverprint="false">
          %PN2%
      </span>
          %PN2%
      PN2
      <span color="#231f20" textOverprint="false">%DIMMM%%DIMINCH%</span>
      %DIMMM%%DIMINCH%
      DIMMM
      DIMINCH
      <span color="#231f20" textOverprint="false">
          %DIMMM%
          %DIMINCH%
      </span>
          %DIMMM%
          %DIMINCH%
      DIMMM
      DIMINCH
      <span color="#231f20" textOverprint="false">%DIMMM%garbage%DIMINCH%</span>
      %DIMMM%garbage%DIMINCH%
      DIMMM
      DIMINCH
      <span color="#231f20" textOverprint="false">%DIMMM%%%DIMINCH%</span>
      %DIMMM%%%DIMINCH%
      DIMMM
      DIMINCH
      <span color="#231f20" textOverprint="false">%DIMMM%%%%DIMINCH%</span>
      %DIMMM%%%%DIMINCH%
      DIMMM
      DIMINCH

      The XML parser I've used is XML::LibXML. I like this one because it's handy for small demo scripts, such as the one here, and also suited to full-blown commercial applications, where I've used it often. There are lots of others available on CPAN: pick one that suits you.

      You'll probably also want to look at "XML Path Language (XPath) 3.1". That's a lengthy, W3C specification: I rarely need to reference more than the "3.3.5 Abbreviated Syntax" section.

      — Ken