Hi, I'm writing some rather large regexes and I've run into an interesting situation. Through testing I've figured out how to accomplish what I want, but I still don't understand why the regex behaves the way it does, so I'm looking for that answer. I am evaluating the data with regexes compiled with //msgx. Here is some test data I created for this example:
int A
Number of Flaps: 6
IP Address: 2.2.2.2
int B
Number of Flaps: 8
IP Address 5.5.5.5
int C
Number of Flaps: 9
IP Address 9.9.9.9
Now I need the data for each 'int' to be optional; that is, a line doesn't have to exist at all, in which case the array returned by evaluating the regex will contain an undef value in its place. In other words, an interface doesn't necessarily have a line giving the number of flaps, and doesn't necessarily have an IP address. I therefore came up with the following regex:
^int\s+(\w+)\s* (?:^\s*Number\sof\sFlaps:\s(\d+)\s*)? (?:^\s*IP\sAddress:?\s(\S*)\s*)?
However, this captures neither the flaps nor the IP address. Interestingly, to 'fix' the regular expression so that it does match, I merely need to remove the caret that anchors each optional component to the beginning of a line, like so:
^int\s+(\w+)\s* (?:\s*Number\sof\sFlaps:\s(\d+)\s*)? (?:\s*IP\sAddress:?\s(\S*)\s*)?
That works. With that, I find all three interfaces along with their flaps and IP addresses. Another way I can make the regex find everything in this example (although it doesn't work for my real data, in which the components are optional) is to remove the (?: )? clustering and force the components to match:
^int\s+(\w+)\s* ^\s*Number\sof\sFlaps:\s(\d+)\s* ^\s*IP\sAddress:?\s(\S*)\s*
Not really relevant to my situation, but interesting while I am trying to understand the problem.
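
The three variants above can be compared side by side. The following is only a sketch under one loud assumption: that the "Number of Flaps" and "IP Address" lines are indented. The post does not show leading whitespace, so the indentation in `$data` is a guess, and the names `%pattern` and `scan` are made up for this illustration. Under that assumption, the greedy `\s*` after the interface name consumes the newline and the indentation, so the following caret is tested in the middle of a line.

```perl
use strict;
use warnings;

# ASSUMPTION: the detail lines are indented. The post does not show this.
my $data = join '',
    "int A\n", "  Number of Flaps: 6\n", "  IP Address: 2.2.2.2\n",
    "int B\n", "  Number of Flaps: 8\n", "  IP Address 5.5.5.5\n",
    "int C\n", "  Number of Flaps: 9\n", "  IP Address 9.9.9.9\n";

# The three patterns from the post, unchanged apart from layout (/x).
my %pattern = (
    anchored_optional => qr/
        ^int\s+(\w+)\s*
        (?:^\s*Number\sof\sFlaps:\s(\d+)\s*)?
        (?:^\s*IP\sAddress:?\s(\S*)\s*)?
    /msx,
    unanchored_optional => qr/
        ^int\s+(\w+)\s*
        (?:\s*Number\sof\sFlaps:\s(\d+)\s*)?
        (?:\s*IP\sAddress:?\s(\S*)\s*)?
    /msx,
    anchored_required => qr/
        ^int\s+(\w+)\s*
        ^\s*Number\sof\sFlaps:\s(\d+)\s*
        ^\s*IP\sAddress:?\s(\S*)\s*
    /msx,
);

# Apply one pattern repeatedly (like //g) and collect the three
# captures of every match as one row.
sub scan {
    my ($re, $text) = @_;
    my @rows;
    while ($text =~ /$re/g) {
        push @rows, [ $1, $2, $3 ];
    }
    return @rows;
}

for my $name (sort keys %pattern) {
    print "$name:\n";
    for my $row (scan($pattern{$name}, $data)) {
        print "  ", join(", ", map { defined $_ ? $_ : "undef" } @$row), "\n";
    }
}
```

With indented input, `anchored_optional` yields the interface name followed by two undef values for each record, while the other two patterns yield all three fields. In the required version the engine has to backtrack into the first `\s*` until the caret sits right after a newline. In the optional version nothing forces that, because skipping both groups already produces a successful match.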

So the question is, what is it about the caret in my original regex that breaks the matching and makes the optional components not greedy? What do I mean by "not greedy"? Well, consider the following regex with a zero-width positive look-ahead assertion at the end to "pull down" the optional components and try to force them to match:

^int\s+(\w+)\s* (?:^\s*Number\sof\sFlaps:\s(\d+)\s*)? (?:^\s*IP\sAddress:?\s(\S*)\s*)? (?=^int|\Z)
That works too. It is the same regular expression as the original, with only the zero-width positive look-ahead assertion added. It surprises and stupefies me that it doesn't work without the look-ahead, because I would expect the optional components to match whenever they can. It's not as if I had used (?: )?? to make them non-greedy, but they seem to behave that way.
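
One possible reading of this behaviour, again assuming indented detail lines (not confirmed by the post): greedy only means "try to match first". It does not mean "backtrack into earlier parts of the pattern to make matching possible". Without the look-ahead, the overall match succeeds as soon as both optional groups are skipped, so the engine never revisits the `\s*` that swallowed the indentation. The look-ahead makes that early result fail, which forces the backtracking. The sketch below prints what each pattern actually consumed. `probe` and `$line_by_line` are hypothetical names, and the `\h`-based variant is just one way to keep the carets, not necessarily the right fit for the real data.

```perl
use strict;
use warnings;
use 5.010;    # \h (horizontal whitespace) needs Perl 5.10 or later

# ASSUMPTION: one record whose detail lines are indented.
my $data = "int A\n  Number of Flaps: 6\n  IP Address: 2.2.2.2\n";

# The original pattern from the post.
my $body = qr/
    ^int\s+(\w+)\s*
    (?:^\s*Number\sof\sFlaps:\s(\d+)\s*)?
    (?:^\s*IP\sAddress:?\s(\S*)\s*)?
/msx;

# The same pattern followed by the look-ahead from the post.
my $with_lookahead = qr/ $body (?= ^int | \Z ) /msx;

# Hypothetical variant: every piece consumes at most its own line
# ending, so each caret is tested at a real line start.
my $line_by_line = qr/
    ^int\h+(\w+)\h*\n?
    (?:^\h*Number\sof\sFlaps:\s(\d+)\h*\n?)?
    (?:^\h*IP\sAddress:?\s(\S*)\h*\n?)?
/msx;

# Return the text the pattern consumed plus the two optional captures.
sub probe {
    my ($re, $text) = @_;
    return unless $text =~ $re;
    return ( substr($text, $-[0], $+[0] - $-[0]), $2, $3 );
}

for my $case ( [ original     => $body ],
               [ lookahead    => $with_lookahead ],
               [ line_by_line => $line_by_line ] ) {
    my ($name, $re) = @$case;
    my ($matched, $flaps, $ip) = probe($re, $data);
    (my $shown = $matched) =~ s/\n/\\n/g;    # make newlines visible
    printf "%-12s consumed '%s' flaps=%s ip=%s\n",
        $name, $shown, $flaps // 'undef', $ip // 'undef';
}
```

On this input the original pattern consumes only `int A`, the newline, and the two spaces of indentation, leaving both captures undef. The other two patterns consume the whole record and capture 6 and 2.2.2.2.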

I would greatly appreciate it if anyone could clue me in.
Happy Holidays!

- Scott


In reply to Regex confusion by scottb
