kdolan has asked for the wisdom of the Perl Monks concerning the following question:

Hi! I have a problem and Perl Monks was highly recommended as a community that might be able to help. I'm sorry it is not an actual Perl problem but I'm hoping you will help regardless. My problem is that I need a single (one!) regular expression that I and everyone I know cannot seem to come up with. So here goes...
Sample XML:

<para0 id="p00001">
  <title>title</title>
  <warning>
    <?warningcaution MOD_A ### ID_1?>
    <para>moda id1 warning</para>
  </warning>
  <para>This is a paragraph.</para>
  <subpara1>
    <para>This is a paragraph.</para>
  </subpara1>
  <subpara1 id="sp00001">
    <title>title</title>
    <warning>
      <?warningcaution MOD_B ### ID_2?>
      <para>modb id2 warning</para>
    </warning>
    <para>This is a paragraph.</para>
  </subpara1>
  <warning>
    <?warningcaution MOD_C ### ID_3?>
    <para>modc id3 warning</para>
  </warning>
  <caution>
    <?warningcaution MOD_D ### ID_4?>
    <para>modd id4 warning</para>
  </caution>
</para0>

The regex needs to return the first warning element directly parented by a para0 element (no grandchildren allowed). A para0 element may parent, in order: applic, title, capgrp, warning, ...(other elements)..., subpara1. Each of these child elements is optional, and each may parent other elements. That said, the applic, title, and capgrp elements should never themselves be an ancestor of a warning element. The elements after the child warning elements (e.g., other elements, subpara1) may be an ancestor of a warning element.

In the sample above, my expected response is the warning element for MOD_A/ID_1. If warning MOD_A/ID_1 is removed, my expected response is no match.

The regex that I've come up with so far is shown below. I intended it to match a para0 followed by an optional applic, followed by an optional title, followed by an optional capgrp, followed by a warning. It returns my expected response for the sample above but if I remove the para0 child warning element, it matches on the warning MOD_B/ID_2 which is not what I want.

<para0[^>]*>\s*(?:<applic[^>]*>.*?</applic\s*>)?\s*(?:<title[^>]*>.*?</title\s*>)?\s*(?:<capgrp[^>]*>.*?</capgrp\s*>)?\s*(<warning[^>]*>\s*.*?\s*</warning\s*>)
Help!
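
The behaviour described above can be reproduced with a short sketch. It is written in Python rather than Perl purely for illustration, and it assumes "dot matches newline" mode (re.DOTALL, the counterpart of Perl's /s); the names SAMPLE, PATTERN, WARNING_A and first_warning are made up for this example. The likely cause of the unwanted match is that the lazy .*? inside the optional title group is allowed to stretch to a later </title>, here the one inside the second subpara1.

```python
import re

# The sample document from the post (whitespace differs, content does not).
SAMPLE = """
<para0 id="p00001">
  <title>title</title>
  <warning> <?warningcaution MOD_A ### ID_1?> <para>moda id1 warning</para> </warning>
  <para>This is a paragraph.</para>
  <subpara1> <para>This is a paragraph.</para> </subpara1>
  <subpara1 id="sp00001">
    <title>title</title>
    <warning> <?warningcaution MOD_B ### ID_2?> <para>modb id2 warning</para> </warning>
    <para>This is a paragraph.</para>
  </subpara1>
  <warning> <?warningcaution MOD_C ### ID_3?> <para>modc id3 warning</para> </warning>
  <caution> <?warningcaution MOD_D ### ID_4?> <para>modd id4 warning</para> </caution>
</para0>
"""

# The pattern from the post; re.DOTALL lets "." cross line breaks.
PATTERN = re.compile(
    r"<para0[^>]*>\s*(?:<applic[^>]*>.*?</applic\s*>)?\s*"
    r"(?:<title[^>]*>.*?</title\s*>)?\s*"
    r"(?:<capgrp[^>]*>.*?</capgrp\s*>)?\s*"
    r"(<warning[^>]*>\s*.*?\s*</warning\s*>)",
    re.DOTALL,
)

WARNING_A = ("<warning> <?warningcaution MOD_A ### ID_1?> "
             "<para>moda id1 warning</para> </warning>")


def first_warning(xml):
    """Return the captured warning element, or None if nothing matches."""
    match = PATTERN.search(xml)
    return match.group(1) if match else None


with_a = first_warning(SAMPLE)                            # full sample
without_a = first_warning(SAMPLE.replace(WARNING_A, ""))  # MOD_A removed

print(with_a)
print(without_a)
```

With the full sample the capture is the MOD_A warning. With that warning removed, the title group backtracks, swallows everything up to the nested </title>, and the capture becomes the MOD_B warning.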

Replies are listed 'Best First'.
Re: regex - need first child of parent
by talexb (Chancellor) on Mar 11, 2009 at 15:52 UTC

    Does the solution have to be a regex? I would have thought this would be easier to accomplish by using something like XML::Twig to slurp up the XML, and then walking through the resulting data structure and picking out the appropriate node.
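
    As a rough illustration of the "parse, then walk" idea, here is a minimal sketch using Python's xml.etree.ElementTree as a stand-in for XML::Twig (it is not XML::Twig's API). It assumes the intended rule is "skip any leading applic, title and capgrp children, then accept the next child only if it is a warning"; LEADING and leading_warning are hypothetical names.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<para0 id="p00001">
  <title>title</title>
  <warning> <?warningcaution MOD_A ### ID_1?> <para>moda id1 warning</para> </warning>
  <para>This is a paragraph.</para>
  <subpara1> <para>This is a paragraph.</para> </subpara1>
  <subpara1 id="sp00001">
    <title>title</title>
    <warning> <?warningcaution MOD_B ### ID_2?> <para>modb id2 warning</para> </warning>
    <para>This is a paragraph.</para>
  </subpara1>
  <warning> <?warningcaution MOD_C ### ID_3?> <para>modc id3 warning</para> </warning>
  <caution> <?warningcaution MOD_D ### ID_4?> <para>modd id4 warning</para> </caution>
</para0>
"""

# Children that may legitimately precede the warning (assumed from the question).
LEADING = ("applic", "title", "capgrp")


def leading_warning(para0):
    """Return the warning that directly follows the leading children, else None."""
    for child in para0:          # iterates direct children only, never grandchildren
        if child.tag in LEADING:
            continue
        return child if child.tag == "warning" else None
    return None


root = ET.fromstring(SAMPLE.strip())   # root is the para0 element

hit = leading_warning(root)
print(hit.find("para").text)

# Simulate the "MOD_A removed" case from the question.
root.remove(hit)
after = leading_warning(root)
print(after)
```

    Note that ElementTree's default parser drops the processing instructions, so the sketch identifies the warning by its para text. A parser that discards or rewrites anything will not reproduce the input byte for byte.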

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: regex - need first child of parent
by kennethk (Abbot) on Mar 11, 2009 at 16:21 UTC
    As multiple monks smarter than I have stated, you should really be using one of the many technologies widely and freely available to parse XML. Really. No, really. That having been said, assuming there is some good reason to do this that escapes me and my brethren, the following regex will return the first warning child element of a para0 element. I've included all of your listed child tags and those you mentioned in your post.

    /<para0[^>]*?>
     (?: \s* (?: <title    .*? <\/title>
               | <para     .*? <\/para>
               | <applic   .*? <\/applic>
               | <capgrp   .*? <\/capgrp>
               | <subpara1 .*? <\/subpara1>
               | <caution  .*? <\/caution>
             )
     )*?
     \s* (<warning .*? <\/warning>)
     (?: \s* (?: <title    .*? <\/title>
               | <para     .*? <\/para>
               | <applic   .*? <\/applic>
               | <capgrp   .*? <\/capgrp>
               | <subpara1 .*? <\/subpara1>
               | <caution  .*? <\/caution>
               | <warning  .*? <\/warning>
             )
     )*?
     \s* <\/para0>
    /sx

    The code works as follows:

    1. Find an element starting with <para0>, which may have attributes
    2. Non-grouping match any number of title, para, applic, capgrp, subpara1, or caution tags
    3. Match and capture your warning tag
    4. Non-grouping match any number of title, para, applic, capgrp, subpara1, caution or warning tags
    5. Close the search with the closing </para0> tag

    Note that your entire XML must be in a single string (not an array) and should be executed with the /s modifier.

    Update: Some additional notes. This assumes that your XML is well-formed. It assumes you have at least one warning element in your para0 element. And most importantly, if there are 1st generation tags which are not accounted for, .*? can jump to unexpected locations, meaning you will not get what you expected. Compare that to ikegami's solution, which will just work.

    Update 2: At shmem's suggestion, I added an /x modifier and reformatted the regex to make it (maybe) easier to follow.
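
    For anyone who wants to experiment with the pattern outside Perl, here is a sketch of the same regex in Python, where re.VERBOSE and re.DOTALL play the roles of /x and /s. The names SAMPLE, FIRST_CHILD_WARNING, WARNING_A and find_warning are invented for the example, and the sample is the one from the original question.

```python
import re

SAMPLE = """
<para0 id="p00001">
  <title>title</title>
  <warning> <?warningcaution MOD_A ### ID_1?> <para>moda id1 warning</para> </warning>
  <para>This is a paragraph.</para>
  <subpara1> <para>This is a paragraph.</para> </subpara1>
  <subpara1 id="sp00001">
    <title>title</title>
    <warning> <?warningcaution MOD_B ### ID_2?> <para>modb id2 warning</para> </warning>
    <para>This is a paragraph.</para>
  </subpara1>
  <warning> <?warningcaution MOD_C ### ID_3?> <para>modc id3 warning</para> </warning>
  <caution> <?warningcaution MOD_D ### ID_4?> <para>modd id4 warning</para> </caution>
</para0>
"""

FIRST_CHILD_WARNING = re.compile(
    r"""
    <para0[^>]*?>                          # step 1: opening para0, attributes allowed
    (?: \s* (?: <title    .*? </title>     # step 2: skip whole non-warning children
              | <para     .*? </para>
              | <applic   .*? </applic>
              | <capgrp   .*? </capgrp>
              | <subpara1 .*? </subpara1>
              | <caution  .*? </caution>
            )
    )*?
    \s* ( <warning .*? </warning> )        # step 3: capture the warning
    (?: \s* (?: <title    .*? </title>     # step 4: skip the remaining children
              | <para     .*? </para>
              | <applic   .*? </applic>
              | <capgrp   .*? </capgrp>
              | <subpara1 .*? </subpara1>
              | <caution  .*? </caution>
              | <warning  .*? </warning>
            )
    )*?
    \s* </para0>                           # step 5: closing para0
    """,
    re.DOTALL | re.VERBOSE,
)

WARNING_A = ("<warning> <?warningcaution MOD_A ### ID_1?> "
             "<para>moda id1 warning</para> </warning>")


def find_warning(xml):
    """Return the captured warning element, or None if nothing matches."""
    match = FIRST_CHILD_WARNING.search(xml)
    return match.group(1) if match else None


with_a = find_warning(SAMPLE)
without_a = find_warning(SAMPLE.replace(WARNING_A, ""))

print(with_a)
print(without_a)
```

    On the full sample this captures the MOD_A warning. With that warning removed, the pattern steps over the two subpara1 elements as whole units and captures the MOD_C warning, which is also a direct child of para0, so it never reports "no match" for this document.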

      THANK YOU!

      For the record, as a developer that manipulates XML a lot!!!, I EMPHATICALLY AGREE regex is not the way to go and I'm not happy that I need to rely on regex to solve my problem. Unfortunately, I am in a situation where portions of both XML and SGML documents need to be modified and the modifications must be restricted to those specifically made (e.g., entity references cannot be resolved). The current implementation uses string-based manipulations to ensure this and regexes to retrieve values and identify replacement sections. I simply need to retrieve one more thing and therefore need a regex to do so.

      My hope would be that at some time I'd be able to find a solution that allows us to use XPath, etc. yet still be able to re-generate the *exact* same XML/SGML originally parsed.

      THANKS AGAIN! You are a life-saver!

Re: regex - need first child of parent
by afoken (Chancellor) on Mar 11, 2009 at 15:54 UTC

    Try parsing the XML instead, using one of the XML modules at CPAN. RegExps don't work too well with hierarchical data formats.

    Alexander

    -- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: regex - need first child of parent
by ikegami (Patriarch) on Mar 11, 2009 at 16:08 UTC

    The regex needs to return the first warning element directly parented by a para0 element (no grandchildren allowed).

    As an XPath, that would be

    /descendant::para0/child::warning[1]

    In shorthand:

    //para0/warning[1]
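
    To try the shorthand expression quickly, here is a small sketch using Python's xml.etree.ElementTree, which implements only a subset of XPath but does cover this path when it is written relative to the root as .//para0/warning[1]. The wrapping <doc> element is a hypothetical addition so that para0 is a descendant of the context node; SAMPLE is the document from the question.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<para0 id="p00001">
  <title>title</title>
  <warning> <?warningcaution MOD_A ### ID_1?> <para>moda id1 warning</para> </warning>
  <para>This is a paragraph.</para>
  <subpara1> <para>This is a paragraph.</para> </subpara1>
  <subpara1 id="sp00001">
    <title>title</title>
    <warning> <?warningcaution MOD_B ### ID_2?> <para>modb id2 warning</para> </warning>
    <para>This is a paragraph.</para>
  </subpara1>
  <warning> <?warningcaution MOD_C ### ID_3?> <para>modc id3 warning</para> </warning>
  <caution> <?warningcaution MOD_D ### ID_4?> <para>modd id4 warning</para> </caution>
</para0>
"""

# Hypothetical wrapper element: ".//para0" only looks below the context node.
doc = ET.fromstring("<doc>" + SAMPLE + "</doc>")

# First warning among the direct warning children of any para0.
hit = doc.find(".//para0/warning[1]")
text = hit.find("para").text

print(text)
```

    The para text is used to identify the hit because ElementTree's default parser discards the processing instructions.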
Re: regex - need first child of parent
by ig (Vicar) on Mar 12, 2009 at 17:52 UTC

    Other than "don't do it with REs", I have a couple of thoughts...

    You can "fix" your RE using the highly experimental extended pattern "(?>pattern)". This prevents your patterns matching the non-warning tags spanning into other tags. The modified pattern is:

    <para0[^>]*>\s*(?><applic[^>]*>.*?</applic\s*>)?\s*(?><title[^>]*>.*?</title\s*>)?\s*(?><capgrp[^>]*>.*?</capgrp\s*>)?\s*(<warning[^>]*>\s*.*?\s*</warning\s*>)

    I don't understand your criteria: you say you want to match "the first warning element directly parented by a para0 element". I see two such elements in your sample data:

    <warning> <?warningcaution MOD_A ### ID_1?> <para>moda id1 warning</para> </warning>

    and

    <warning> <?warningcaution MOD_C ### ID_3?> <para>modc id3 warning</para> </warning>

    So, I don't understand why you say "If warning MOD_A/ID_1 is removed, my expected response is no match." instead of matching the second warning directly parented by para0. Either I am misunderstanding something or you have a contradiction in your requirements.

    Another RE you might consider is the following:

    <para0[^>]*?>(?>\s*<([^ >]*?).*?>.*?</\1(\s[^>]*)?>)*?\s*(<warning>.*?</warning>)

    This assumes that none of the tags directly under the para0 tag nest within themselves, but it does not need to be modified if the set of possible tags changes. Note that this pattern, as written, has three capture buffers (the tag name, the optional tail of the closing tag, and the warning), so your warning will be in $3 if there is a match.