ngbabu has asked for the wisdom of the Perl Monks concerning the following question:

I am matching the processing instructions <?CLG.MDFO and <?CLG.MDFC, but I am not able to match nested pairs. Please see my code and advise me.
while($file=~m/<\?CLG\.MDFO[^>]+?ID="O(.+?)"[^>]+?IDREF="C(.+?)"[^>]+?>(.+?)<\?CLG\.MDFC[^>]+?ID="C\1"[^>]+?IDREF="O\2"[^>]+?>/msgi)
{
    my $mdfomdfc=$&;
    my $ln = line($`);
    if($mdfomdfc=~m/<\?CLG\.MDFO[^>]+>\n?<([A-Z.]+)[ ]?.+?>\n?(.+?)\n?<\/(\1)>\n?(<\?.+?\?>)?\n?<\?CLG\.MDFC[^>]+>/msgi)
    {
        my $stag = $1;
        my $etag = $3;
        if($stag ne $etag) {
            print FOUT "<p><font color=\"red\">$path:</font> <font color=\"green\">Warning: ".$warn++."</font><font color=\"darkteal\">: Line: $ln </font>\&#160;<font color=\"blue\">Check: MDFO tag is placed before &lt;$stag tag, the MDFC tag should close after the same &lt;$stag tag.</font></p>";
        }
    }
}
XML Code:
<?CLG.MDFO ID="O001001M004000" IDREF="C001001M004000" ACTION="REPLACED" LEVEL="STRUCTURE" COMMAND="EXPLICIT" ACTIVE.DOC="306D0754" ACTIVE.LOC="AR:1;PT:4" MOD.LEVEL="1" PASSIVE.LOC="AR:5"?>
<ARTICLE IDENTIFIER="005">
<TI.ART>Article 5</TI.ART>
<STI.ART>Recovery of costs</STI.ART>
<?CLG.MDFO ID="O002001M003000" IDREF="C002001M003000" ACTION="REPLACED" LEVEL="STRUCTURE" COMMAND="EXPLICIT" ACTIVE.DOC="308D0162" ACTIVE.LOC="AR:1;PT:3" MOD.LEVEL="1" PASSIVE.LOC="AR:5;PA:1"?>
<PARAG123 IDENTIFIER="005.001">
<NO.PARAG>1.</NO.PARAG>
<ALINEA>All costs resulting from issuing the accompanying documents pursuant to Article 2(2) shall be borne by the food business operator responsible for the consignment or its representative.</ALINEA>
</PARAG123>
<?CLG.MDFC ID="C002001M003000" IDREF="O002001M003000"?>
<PARAG IDENTIFIER="005.002">
<NO.PARAG>2.</NO.PARAG>
<ALINEA>All costs related to official measures taken by the competent authorities as regards non-compliant consignments shall be borne by the food business operator responsible for the consignment or its representative.</ALINEA>
</PARAG>
</ARTICLE>
<?no_smark?>
<?CLG.MDFC ID="C001001M004000" IDREF="O001001M004000"?>

Replies are listed 'Best First'.
Re: Not able to Matching nested item
by Anonymous Monk on May 31, 2008 at 09:04 UTC
    Use a parser, like XML::Parser
Re: Not able to Matching nested item
by pc88mxer (Vicar) on May 31, 2008 at 14:16 UTC
    Using a real XML parser is a better way to go.

    In your code, I think the problem is the use of [^>]+?> in a couple of places. Try using .*?> instead.

    Update: the above might be the issue, but after further reflection I am not completely sure that will fix it. But read on...

    What you are trying to do is similar to matching nested parentheses, and that's notoriously difficult to do with regular expressions. Moreover, if you run into a CLG.MDFO PI, you need to look at the next element tag, and if you run into a CLG.MDFC PI, you need to parse the preceding element tag.
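
    A cut-down illustration of the difficulty, as a sketch only: the sample below uses made-up short IDs (O1/C1, O2/C2) and bare element names, and reuses the pairing regex from the question. Because a global match resumes after the end of the previous match, the outer pair swallows the inner one, and the inner pair is never reported.

```perl
use strict;
use warnings;

# Made-up, shortened sample: an inner MDFO/MDFC pair nested in an outer one.
my $file = join "\n",
    '<?CLG.MDFO ID="O1" IDREF="C1"?>',
    '<ARTICLE>',
    '<?CLG.MDFO ID="O2" IDREF="C2"?>',
    '<PARAG>text</PARAG>',
    '<?CLG.MDFC ID="C2" IDREF="O2"?>',
    '</ARTICLE>',
    '<?CLG.MDFC ID="C1" IDREF="O1"?>';

# The pairing regex from the question, unchanged apart from being compiled
# once with qr{}.
my $pair = qr{<\?CLG\.MDFO[^>]+?ID="O(.+?)"[^>]+?IDREF="C(.+?)"[^>]+?>(.+?)<\?CLG\.MDFC[^>]+?ID="C\1"[^>]+?IDREF="O\2"[^>]+?>}msi;

# Collect the ID suffix of every pair that the global match visits.
my @found;
while ($file =~ /$pair/g) {
    push @found, $1;
}

# The first match runs from the outer MDFO to the outer MDFC, so the
# search resumes after the inner pair and never sees it.
print "pairs visited: @found\n";
```

    Only the outer pair is visited here, which is why tracking the open PIs explicitly is easier than stretching the regex.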

    I would try this approach, which uses simpler regexes and maintains a stack of the parsed PIs:

    my @stack;
    # non-greedy, so each iteration stops at the next PI
    while ($file =~ m{\G(.*?)<\?(.*?)\?>}gms) {
        my $pre = $1;    # text between the previous match and this PI
        my $pi  = $2;    # contents of the PI itself
        my @args = split(' ', $pi);    # hopefully this always works
        my $pi_cmd = uc($args[0]);
        if ($pi_cmd eq 'CLG.MDFO') {
            # parse next element tag; /c keeps pos() if there is none
            if ($file =~ m{\G\s*<\s*([^>\s/]+)[^>]*>}gc) {
                my $element = $1;
                push(@stack, $element);
            }
        }
        elsif ($pi_cmd eq 'CLG.MDFC') {
            # parse previous element tag (normally a closing tag)
            if ($pre =~ m{<\s*/?\s*([^>\s/]+)[^>]*>\s*\z}) {
                my $element = $1;
                my $open = pop(@stack);
                unless (defined $open && $element eq $open) {
                    # ...emit mismatch warning ...
                }
            }
        }
    }
    if (@stack) {
        # ...emit unterminated CLG.MDFO warning...
    }
    Update: To handle this case:
    </ARTICLE> <?no_smark?> <?CLG.MDFC ID="C001001M00 ...
    you can modify the above code as follows:
    my $pre_element;
    my @stack;
    while ($file =~ m{...regex for a pi...}) {
        my $pre = $1;
        ...
        if ($pi_cmd eq 'CLG.MDFO') {
            ...same as above...
        } else {
            # remember the last element tag seen before any other PI
            if ($pre =~ m{...regex for element tag...\z}) {
                $pre_element = $1;
            }
            if ($pi_cmd eq 'CLG.MDFC') {
                unless ($pre_element eq pop(@stack)) {
                    ...
                }
            }
        }
    }
    So encountering <? no_smark ?> will set $pre_element for the following CLG.MDFC pi.
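
    For anyone who wants to try the idea end to end, here is one possible self-contained sketch of the stack approach with the $pre_element handling folded in. The function name check_mdf_nesting, the warning texts, and the shortened sample documents are all assumptions made for illustration; the element-tag regexes are one guess at the elided patterns, not necessarily the ones intended above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of warning strings for $xml.
sub check_mdf_nesting {
    my ($xml) = @_;
    my (@stack, @warnings, $pre_element);

    # Walk the document one processing instruction at a time.
    while ($xml =~ m{\G(.*?)<\?(.*?)\?>}gs) {
        my ($pre, $pi) = ($1, $2);
        my ($cmd) = split ' ', $pi;
        $cmd = uc(defined $cmd ? $cmd : '');

        if ($cmd eq 'CLG.MDFO') {
            # Remember the element that opens right after the MDFO PI.
            # /c leaves pos() alone if no element tag follows.
            if ($xml =~ m{\G\s*<\s*([^>\s/]+)[^>]*>}gc) {
                push @stack, $1;
            }
        }
        else {
            # Any other PI: note the element tag that ends just before it.
            if ($pre =~ m{<\s*/?\s*([^>\s/]+)[^>]*>\s*\z}) {
                $pre_element = $1;
            }
            if ($cmd eq 'CLG.MDFC') {
                my $open = pop @stack;
                my $seen = defined $pre_element ? $pre_element : '(none)';
                if (!defined $open) {
                    push @warnings, "CLG.MDFC without a matching CLG.MDFO";
                }
                elsif ($seen ne $open) {
                    push @warnings,
                        "CLG.MDFO placed before <$open> but CLG.MDFC placed after <$seen>";
                }
            }
        }
    }
    push @warnings, "unterminated CLG.MDFO before <$_>" for @stack;
    return @warnings;
}

# Shortened, made-up samples modelled on the XML in the question.
my $good = <<'XML';
<?CLG.MDFO ID="O1" IDREF="C1"?>
<ARTICLE IDENTIFIER="005">
<?CLG.MDFO ID="O2" IDREF="C2"?>
<PARAG123 IDENTIFIER="005.001"><ALINEA>text</ALINEA></PARAG123>
<?CLG.MDFC ID="C2" IDREF="O2"?>
<PARAG IDENTIFIER="005.002"><ALINEA>text</ALINEA></PARAG>
</ARTICLE>
<?no_smark?>
<?CLG.MDFC ID="C1" IDREF="O1"?>
XML

my $bad = <<'XML';
<?CLG.MDFO ID="O1" IDREF="C1"?>
<PARAG><ALINEA>text</ALINEA>
<?CLG.MDFC ID="C1" IDREF="O1"?>
</PARAG>
XML

print "good: ", scalar(my @g = check_mdf_nesting($good)), " warning(s)\n";
print "bad:  $_\n" for check_mdf_nesting($bad);
```

    Like the outline above, this sketch pushes nothing when no element tag follows a CLG.MDFO, so such input would throw the pairing off; a real checker would probably want to handle that case explicitly.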
Re: Not able to Matching nested item
by pc88mxer (Vicar) on May 31, 2008 at 18:26 UTC
    I didn't have anything better to do on this balmy Saturday afternoon, so I coded up a solution that uses XML::Parser.