
Neither backtracking nor captures is really a problem that needs fixing. Features like (?>) were added to extend the sorts of things regular expressions can do.
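As a small illustration of what (?>) changes, here is a minimal sketch comparing an ordinary quantifier with an atomic group. The sample string and variable names are mine, chosen only for this example.

```perl
use strict;
use warnings;

my $text = 'aaab';

# Ordinary quantifier: a+ first grabs 'aaa', then gives one 'a' back
# so that the trailing 'ab' can match.
my $with_backtracking = $text =~ /a+ab/ ? 1 : 0;

# Atomic group: (?>a+) never gives back what it matched,
# so the trailing 'ab' has nothing left to match against.
my $with_atomic_group = $text =~ /(?>a+)ab/ ? 1 : 0;

print "a+ab:     $with_backtracking\n";
print "(?>a+)ab: $with_atomic_group\n";
```

The first pattern matches and the second does not, which is the point: the atomic group lets you say "do not backtrack here" when that is what you actually mean.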

To take your example, regular expressions are not the reason the wrong thing was matched. Your expression allowed that interpretation (or it would have with some minor changes). The problem is that you are using an expression that does not properly describe the case you claim to be looking for.

To take another example that I think shows why these features are more useful, let's match a US telephone number. A telephone number in the US can take many forms:

445-7890
445 7890
4457890
713 445-7890
(713) 445-7890
713-445-7890
7134457890
713 445 7890

And that leaves out adding a 1 or 0 for long distance and extensions, which people often give as part of the number.

Matching this set of formats requires optional characters, which (if you are doing captures) require backtracking. (Not really, but the implementation gets hairier if we discuss that part.)

So to match a phone number, we would need:

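  m{ (
       # optional area code: (713), 713-, 713 followed by spaces, or bare 713
       (?: \( \d\d\d \) \s* | \d\d\d (?: - | \s* ) ) ?
       # seven-digit number with an optional dash or spaces in the middle
       \d\d\d (?: - | \s* ) \d\d\d\d
     )
   }x;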
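To check a pattern like this against the formats listed earlier, a minimal test harness might look like the following. This is a sketch, not a definitive validator: the name $phone_re is hypothetical, the grouping is written so the parentheses balance, and the \A and \z anchors are my addition so that the whole string has to be a phone number.

```perl
use strict;
use warnings;

# Hypothetical name; anchored (my assumption) so the entire string must match.
my $phone_re = qr{
    \A
    (
        (?: \( \d\d\d \) \s* | \d\d\d (?: - | \s* ) ) ?   # optional area code
        \d\d\d (?: - | \s* ) \d\d\d\d                      # seven-digit number
    )
    \z
}x;

# The eight formats from the list above.
my @samples = (
    '445-7890', '445 7890', '4457890', '713 445-7890',
    '(713) 445-7890', '713-445-7890', '7134457890', '713 445 7890',
);

for my $number (@samples) {
    my $result = $number =~ $phone_re ? 'matches' : 'no match';
    printf "%-16s %s\n", $number, $result;
}
```

Note that '4457890' only matches because the engine backtracks: it first tries 445 as an area code, runs out of digits, and then retries with the area code skipped.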

Obviously, this appears somewhat complicated and there is quite a bit of possibility for confusion. In this case, however, the problem is not the regex, it's the fact that the phone number format is specified fairly sloppily.

In fact, I have found the features you are questioning most useful when I'm dealing with real-world data. Unlike the stuff (insert pompous tone) I generate, the real world is messy and inconsistent.

One of the nastiest problems I ever tried to solve was extracting tables of information from text files generated by people at various companies. You have no idea how many weird variations people can come up with that a person can interpret but that are almost unparseable by computer. Without many of these features, we would not have gotten as far as we did.

G. Wade

In reply to Re^3: Perl regex in real life by gwadej
in thread Perl regex in real life by RezaRob
