Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
For those who have read some of my past posts, I freely confess to hastily posting bad regexes. I thought I was decent with them, but then I started rereading Mastering Regular Expressions. Yikes! I still have a lot to learn.

This "Discussion" is primarily aimed at those new to regexes, so my apologies if I seem pedantic.

I've seen a number of regexes posted to Perlmonks that use a dot star (.*) combination to slurp up characters. While it is sometimes not possible to avoid this, Perlmonks should try to avoid it like the Inquisition.

Dot star is often used as a "catch all" in regular expressions. Often you'll see a regex like the following:

$myvar =~ /"(.*)"/;
The intent of this is to capture whatever is inside of parentheses to $1 (this is called backreferencing). However, this fails if $myvar is something like (yes, I know it's a ridiculous example):
$myvar = qq(Tom said "hi. It's me," and Sally replied "get lost, Ovid + is more my style. Doncha think?"); $myvar =~ /"(.*)"/; print $1;
We might expect the final line to print hi. It's me, but it won't. This is because the star quantifier is greedy. It will attempt the larget match that can possibly satisfy the regular expression. What would be printed is
hi. It's me," and Sally replied "get lost, Ovid is more my style. Do +ncha think?
To solve this, many programmers will add a question mark after the quantifier (.*?) which makes the quantifier match as little as possible. This is called lazy or non-greedy matching. Merely adding that little question mark will cause the code to print the hi. It's me, that we were looking for:
$myvar =~ /"(.*?)"/;
So we're all set, right? Wrong. This certainly works better than the greedy dot-star combination. It keeps trying to match until it finds the first quote mark and then tries to get the smallest match that satisfies the regex. Sounds fine, right? Well, no.

There are two problems with both the greedy and lazy version of the dot star: imprecision and tracking.

The imprecision is obvious. The example above shows the imprecision with the greedy version of .*, but the lazy version is more subtle. What happens if you were trying to extract questions in quotes without the trailing question mark? You might think that something like /"(.*?)\?"/ would do the trick. Unfortunately, we get the same result as above because the lazy matching doesn't guarantee the smallest match. It guarantees the smallest match from the first place that a match could possibly begin.

Tracking is another problem with both of them. With the greedy version of dot star, the dot star gobbles up the entire string. The regex engine is then forced to backtrack from the end of the string to find the longest possible match that will satisfy the regex. With the lazy version, the the regex engine is forced to stop after every match to see if the rest of the regex will match (kind of like a lookahead, but with subtle differences. See Mastering Regular Expressions, Second Edition, page 226 for the mechanics of this). With a one-shot regex, these issues are not usually much of a performance hit (though it can be nasty on a complicated regex with no possible match), but if you are forced to iterate over the regex, these "tracking" issues can have quite a performance hit on your program.

The solution is to use a negated character class:

$myvar =~ /"([^"]*)"/;
What's going on there? When a caret "^" is the first character in a character class, it's telling the regex to match anything except what is in the character class. In this case, it is telling it to keep matching anything (including newline) that is not a quote. If there is a possible match, there is no tracking, there's no ambiguity, there's just a straight match. The above regex will match the first quote, capture everything that's not a quote to $1, and then match the end quote. It's fast and precise.

Unfortunately, it's a little more complicated with the "questions in quotes" example. Here's one solution:

$myvar =~ /"((?:[^?"]|\?[^"])*)\?"/;
Which breaks out to:
$myvar =~ /" # First quote ( # Capture text to $1 (?: # Non-backreferencing parentheses [^?"] # Anything that's not a question mark or quote | # or \?[^"] # A question mark not followed by a quote (to a +llow embedded question marks) )* # Zero or more ) # End capture \?"/x; # Followed by a question mark and quote
Yes, it's more work. Yes, it's harder to read. But it works and doesn't have the problems of the dot star.

A potential pitfall: Unless you are using the /s modifier, the .* does not match a newline (\n). A negated character class will happily match a newline if you don't include it. If your target text has them, be sure to include them in the negated character class, if appropriate.

If Monks would like to see more stuff like this, let me know.


In reply to Death to Dot Star! by Ovid

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (7)
As of 2024-03-28 15:12 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found