
I wish I could remember who originally said this: regular expressions may be greedy, but they are not into deferred gratification.

Let's consider what happens when we apply this regex --

^int\s+(\w+)\s*
(?:^\s*Number\sof\sFlaps:\s(\d+)\s*)?
(?:^\s*IP\sAddress:?\s(\S*)\s*)?

-- to this string:

int A
    Number of Flaps: 6
    IP Address: 2.2.2.2

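The \s* in the first line of the regex eats up any spaces at the end of the first line of input text, plus the newline, plus the indent at the start of the second line. That leaves the match position in the middle of the second line, where ^ cannot match, so neither of the optional groups can match at that point. Since the rest of the regex is inside (?: ... )? brackets, the regex can still succeed by matching each group zero times, with no backtracking into the \s*, and that's exactly what it does -- without matching the rest of the input string.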
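Here is a small script that reproduces this, as a sketch rather than your actual code. It assumes the regex was compiled with the /m and /x flags (the excerpt above doesn't show them), and the variable names are mine.

```perl
use strict;
use warnings;

# Assumption: /m (so ^ matches at line starts) and /x (so the line breaks
# inside the pattern are ignored). The flags are not shown in the thread excerpt.
my $original = qr{
    ^int\s+(\w+)\s*
    (?:^\s*Number\sof\sFlaps:\s(\d+)\s*)?
    (?:^\s*IP\sAddress:?\s(\S*)\s*)?
}mx;

my $text = "int A\n    Number of Flaps: 6\n    IP Address: 2.2.2.2\n";

my @caps     = $text =~ $original;   # list context returns ($1, $2, $3)
my $consumed = @caps ? $+[0] : 0;    # offset at which the match ended

print join(', ', map { defined $_ ? "'$_'" : 'undef' } @caps), "\n";
print "characters consumed: $consumed\n";
# prints:
#   'A', undef, undef
#   characters consumed: 10
```

The ten characters are "int A", the newline, and the four-space indent: the match stops right in front of "Number".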

Now let's remove the question marks at the end of the second and third lines of the regex. This forces the regex engine to match the text inside the brackets if the match is to succeed. This in turn forces the regex engine to backtrack when it tries to match that first \s*, so that the \s* matches any trailing space at the end of the first line of text, and one newline, and nothing more. This leaves the ^ and the rest of the second and third lines of the regex to match successfully.
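To see the forced backtracking in action, here is the same kind of sketch with the question marks removed. Again the /m and /x flags and the variable names are my assumptions. The second test string is a made-up record with no detail lines, to show why simply dropping the question marks isn't a fix.

```perl
use strict;
use warnings;

# Same pattern as before, but both groups are now mandatory.
# Assumption: /m and /x, as in the earlier sketch.
my $required = qr{
    ^int\s+(\w+)\s*
    (?:^\s*Number\sof\sFlaps:\s(\d+)\s*)
    (?:^\s*IP\sAddress:?\s(\S*)\s*)
}mx;

my $full  = "int A\n    Number of Flaps: 6\n    IP Address: 2.2.2.2\n";
my $short = "int B\n";   # hypothetical record with neither detail line

my @full_caps  = $full  =~ $required;
my @short_caps = $short =~ $required;

print "full:  ", (@full_caps  ? join(', ', @full_caps) : 'no match'), "\n";
print "short: ", (@short_caps ? join(', ', @short_caps) : 'no match'), "\n";
# prints:
#   full:  A, 6, 2.2.2.2
#   short: no match
```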

To solve your problem -- to make the last two lines of data optional -- you need to add newlines to the regex, like this:

^int\s+(\w+)\s*\n
(?:^\s*Number\sof\sFlaps:\s(\d+)\s*\n)?
(?:^\s*IP\sAddress:?\s(\S*)\s*\n)?

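In each case, the mandatory \n stops the preceding \s* from keeping the newline and the indenting whitespace at the start of the next line: the \s* has to give them back so that the \n can match. The match position is then at the very start of the next line, where ^ can match, so the text inside the brackets can match without the engine having to backtrack into an earlier part of the regex.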
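A quick way to check the version with the added newlines is to run it against records with and without the optional lines. This is only a sketch: the /m and /x flags, the helper name captures_for, and the two shorter test records are my assumptions, not part of your data. Note that every line, including the last, must end in a newline for this regex to pick it up.

```perl
use strict;
use warnings;

# Assumption: /m and /x, as elsewhere in this thread's examples.
my $fixed = qr{
    ^int\s+(\w+)\s*\n
    (?:^\s*Number\sof\sFlaps:\s(\d+)\s*\n)?
    (?:^\s*IP\sAddress:?\s(\S*)\s*\n)?
}mx;

# Hypothetical helper: returns the three captures as printable strings.
sub captures_for {
    my ($text) = @_;
    my @caps = $text =~ $fixed;
    return map { defined $_ ? $_ : 'undef' } @caps;
}

my @all     = captures_for("int A\n    Number of Flaps: 6\n    IP Address: 2.2.2.2\n");
my @ip_only = captures_for("int B\n    IP Address: 3.3.3.3\n");
my @bare    = captures_for("int C\n");

print join(', ', @all),     "\n";   # A, 6, 2.2.2.2
print join(', ', @ip_only), "\n";   # B, undef, 3.3.3.3
print join(', ', @bare),    "\n";   # C, undef, undef
```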

Like most things related to regexes, this is hard to explain. The principle I want to get across is that every term followed by a star will match greedily (i.e. as many times as possible); and, if Perl can make the regex match without backtracking, it will. This is what was happening with your original version of the regex. The regex engine is prepared to backtrack in order to make the whole regex match, but not in order to match optional items. What the added newlines do in our example is to make sure that the regex engine has reached a point in the input text where the optional brackets can match without the need for backtracking.
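The same principle shows up in a much smaller example, which may be easier to hold in your head. This is just an illustration I've made up, not anything from your data: a greedy a* followed by an optional or a mandatory (a).

```perl
use strict;
use warnings;

my $input = "aaa";

# Optional trailing group: a* takes everything, the group matches zero
# times, and the engine has no reason to backtrack.
my @optional = $input =~ /(a*)(a)?/;

# Mandatory trailing group: the match would fail unless a* gives back
# one character, so the engine backtracks.
my @mandatory = $input =~ /(a*)(a)/;

print join(', ', map { defined $_ ? "'$_'" : 'undef' } @optional),  "\n";
print join(', ', map { defined $_ ? "'$_'" : 'undef' } @mandatory), "\n";
# prints:
#   'aaa', undef
#   'aa', 'a'
```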

I don't feel I've explained this very well, so feel free to come back with questions about it -- but I'll be off the Net for a few days, so I can't reply promptly.

Merry Christmas to all.

Markus

In reply to Re: Regex confusion by MarkusLaker
in thread Regex confusion by scottb
