
Neither backtracking nor captures are really a problem needing to be fixed.

Notice that I never said captures were a "problem". They are an important feature that backtracking tries to address in a way that doesn't lose backwards compatibility with NFA machines.

    * (713) 445-7890
    * 713-445-7890
    * 7134457890
    * 713 445 7890
Matching this set of expressions requires optional characters which (if you are doing captures) requires backtracking. (Not really, but the implementation gets hairier if we discuss that part.) So to match a phone number, we would need:
m{ ( (?: \( \d\d\d \) \s* ) | \d\d\d (?: -? | \s* ) ) ? \d\d\d (?: - | \s* ) \d\d\d\d ) }x;

I don't see any backtracking here. The "pointer" only ever moves forward. The simplistic C code example that I posted also has the star operator in its regex, but that's not what causes it to backtrack (i.e. literally go back and erase a previous match). A conditional like /\)*/ isn't really backtracking in its own right; it either matches or doesn't match at all. However, to get the parentheses in your example _precisely_ right, backreferences are actually needed, which is interesting.
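To make the point about the parentheses concrete, here is a minimal sketch of one way to tie the closing parenthesis to the opening one. It assumes Perl 5's conditional construct (?(1)...), which refers back to whether capture group 1 took part in the match. The pattern and the name $phone are my own illustration, not the regex quoted above, and the two malformed inputs at the end of the list are made up for the test.

```perl
use strict;
use warnings;

# Hypothetical pattern: group 1 captures an optional "(", and the
# conditional (?(1)\)) demands a ")" only when group 1 matched.
my $phone = qr{
    \A
    (\()?        # optional opening parenthesis, captured as group 1
    \d{3}        # area code
    (?(1)\))     # closing parenthesis, required only if group 1 matched
    [-\s]?       # optional separator
    \d{3}
    [-\s]?       # optional separator
    \d{4}
    \z
}x;

my @candidates = (
    '(713) 445-7890', '713-445-7890', '7134457890', '713 445 7890',
    '(713 445-7890',  '713) 445-7890',   # unbalanced parentheses
);

for my $candidate (@candidates) {
    my $verdict = $candidate =~ $phone ? 'match' : 'no match';
    printf "%-16s %s\n", $candidate, $verdict;
}
```

The four formats from the list should match, and the two unbalanced ones should be rejected, because the pattern only accepts a ")" when it has already seen the "(".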

One of the nastiest problems I ever tried to solve was to extract tables of information from text files generated by people at various companies. You have no idea how many weird variations that people can come up with that a person can interpret, but are almost unparseable by computer. Without many of these features, we would not have gotten as far as we did.

Fair enough, and you're right. I can see that in the real world one sometimes just has to get a job done this way, and perhaps quickly. However, I should point out that the long-term solution in these cases is a more intelligent system. Solving these with regex is just looking at the problem the wrong way. Think of what Google News does, or Google Translate for that matter. They don't, and can't, use regex for these technologies. You need some level of machine intelligence that views the problem much more abstractly.

Reza.

In reply to Re^4: Perl regex in real life by RezaRob
in thread Perl regex in real life by RezaRob
