in reply to regex - need first child of parent

As multiple monks smarter than I have stated, you should really be using one of the many technologies widely and freely available to parse XML. Really. No, really. That having been said, assuming there is some good reason to do this that escapes me and my brethren, the following regex will return the first warning child element of a para0 element. I've included all of your listed child tags and those you mentioned in your post.

/<para0[^>]*?>
    (?: \s*
        (?: <title    .*? <\/title>
          | <para     .*? <\/para>
          | <applic   .*? <\/applic>
          | <capgrp   .*? <\/capgrp>
          | <subpara1 .*? <\/subpara1>
          | <caution  .*? <\/caution>
        )
    )*?
    \s* (<warning .*? <\/warning>)
    (?: \s*
        (?: <title    .*? <\/title>
          | <para     .*? <\/para>
          | <applic   .*? <\/applic>
          | <capgrp   .*? <\/capgrp>
          | <subpara1 .*? <\/subpara1>
          | <caution  .*? <\/caution>
          | <warning  .*? <\/warning>
        )
    )*?
    \s* <\/para0>
/sx

The code works as follows:

  1. Find an element starting with <para0>, which may have attributes
  2. Non-capturing group: lazily match any number of title, para, applic, capgrp, subpara1, or caution elements
  3. Match and capture your warning element
  4. Non-capturing group: lazily match any number of title, para, applic, capgrp, subpara1, caution, or warning elements
  5. Close the search with the closing </para0> tag
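
To make the five steps concrete, here is a minimal sketch that runs the pattern against a small document. The sample XML and the variable names are my own assumptions, not something from your post, and I have switched the delimiters to qr{} only so the slashes need no escaping.

```perl
use strict;
use warnings;

# Hypothetical sample input; only the element names come from the thread.
my $xml = <<'XML';
<para0 id="p1">
  <title>Engine start</title>
  <caution>Hot surfaces.</caution>
  <warning>First warning.</warning>
  <para>Some text.</para>
  <warning>Second warning.</warning>
</para0>
XML

# Same pattern as above, written with qr{} delimiters.
my $first_warning_re = qr{
    <para0[^>]*?>                       # 1. opening tag, attributes allowed
    (?: \s* (?: <title    .*? </title>  # 2. skip listed non-warning children
              | <para     .*? </para>
              | <applic   .*? </applic>
              | <capgrp   .*? </capgrp>
              | <subpara1 .*? </subpara1>
              | <caution  .*? </caution>
            )
    )*?
    \s* ( <warning .*? </warning> )     # 3. capture the first warning
    (?: \s* (?: <title    .*? </title>  # 4. skip the remaining children
              | <para     .*? </para>
              | <applic   .*? </applic>
              | <capgrp   .*? </capgrp>
              | <subpara1 .*? </subpara1>
              | <caution  .*? </caution>
              | <warning  .*? </warning>
            )
    )*?
    \s* </para0>                        # 5. closing tag of the parent
}sx;

# In list context the match returns the captured group.
my ($first) = $xml =~ $first_warning_re;
print defined $first ? "$first\n" : "no match\n";
```

With this sample the script prints <warning>First warning.</warning>; the second warning is consumed by step 4 and never captured.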

Note that your entire XML document must be in a single string (not an array of lines), and the regex must be run with the /s modifier so that . also matches newlines.
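
A rough sketch of why both conditions matter, assuming a warning element that spans two lines. The in-memory filehandle stands in for your real file, and the alternation() helper is a hypothetical shortcut of mine that builds the same alternations as the regex above.

```perl
use strict;
use warnings;

# Hypothetical input; an in-memory filehandle stands in for a real file.
my $data = "<para0>\n<title>T</title>\n"
         . "<warning>Line one\nline two</warning>\n</para0>\n";

open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
my $xml = do { local $/; <$fh> };   # undefined $/ slurps everything into one string
close $fh;

# Hypothetical helper: builds "<tag.*?</tag>|..." for the given element names.
sub alternation { return join '|', map { "<$_.*?</$_>" } @_ }
my @listed = qw(title para applic capgrp subpara1 caution);
my $before = alternation(@listed);
my $after  = alternation(@listed, 'warning');

# Pattern body as a string so it can be compiled with and without /s.
my $body = "<para0[^>]*?> (?: \\s* (?: $before ) )*? \\s* (<warning.*?</warning>)"
         . " (?: \\s* (?: $after ) )*? \\s* </para0>";

my ($with_s)    = $xml =~ qr/$body/sx;   # whole string, . matches newlines
my ($without_s) = $xml =~ qr/$body/x;    # whole string, . stops at newlines

# Applying the pattern to one line at a time, as an array of lines would force.
my @lines     = split /^/m, $xml;
my $line_hits = grep { $_ =~ qr/$body/sx } @lines;

print "with /s:    ", (defined $with_s    ? "matched" : "no match"), "\n";
print "without /s: ", (defined $without_s ? "matched" : "no match"), "\n";
print "per-line matches: $line_hits\n";
```

Only the slurped string combined with /s finds the warning here; without /s the .*? cannot cross the newline inside the element, and no single line contains both the opening and closing para0 tags.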

Update: Some additional notes. This assumes that your XML is well-formed and that your para0 element contains at least one warning element. Most importantly, if para0 has direct children (first-generation elements) that the alternations do not account for, .*? can jump to unexpected locations, so you will not get what you expected. Compare that to ikegami's solution, which will just work.
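
To illustrate that last caveat, here is a small sketch with a direct child, note, that the regex does not list. The note element and the sample data are assumptions of mine, and alternation() is a hypothetical helper that builds the same alternations as the regex above.

```perl
use strict;
use warnings;

# Hypothetical helper: builds "<tag.*?</tag>|..." for the given element names.
sub alternation { return join '|', map { "<$_.*?</$_>" } @_ }
my @listed = qw(title para applic capgrp subpara1 caution);
my $before = alternation(@listed);
my $after  = alternation(@listed, 'warning');

my $re = qr{
    <para0[^>]*?>
    (?: \s* (?: $before ) )*?
    \s* ( <warning .*? </warning> )
    (?: \s* (?: $after ) )*?
    \s* </para0>
}sx;

# Hypothetical input: note is a direct child that the alternations do not cover.
my $xml = '<para0><warning>first</warning><note>n</note>'
        . '<warning>second</warning></para0>';

my ($captured) = $xml =~ $re;
print "$captured\n";
```

This prints <warning>first</warning><note>n</note><warning>second</warning>. Nothing in the trailing group can consume the note element, so the engine backtracks and stretches the captured .*? to the second closing warning tag in order to reach the closing para0 tag.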

Update 2: At shmem's suggestion, I added an /x modifier and reformatted the regex to make it (maybe) easier to follow.

Re^2: regex - need first child of parent
by kdolan (Initiate) on Mar 11, 2009 at 18:37 UTC
    THANK YOU!

    For the record, as a developer who manipulates XML a lot, I EMPHATICALLY AGREE that regex is not the way to go, and I'm not happy that I need to rely on regex to solve my problem. Unfortunately, I am in a situation where portions of both XML and SGML documents need to be modified, and the changes must be restricted to those explicitly intended (e.g., entity references cannot be resolved). The current implementation uses string-based manipulations to ensure this and regexes to retrieve values and identify replacement sections. I simply need to retrieve one more thing and therefore need a regex to do so.

    My hope would be that at some time I'd be able to find a solution that allows us to use XPath, etc. yet still be able to re-generate the *exact* same XML/SGML originally parsed.

    THANKS AGAIN! You are a life-saver!