Joy Conner has asked for the wisdom of the Perl Monks concerning the following question:

I am having trouble understanding the following pattern:
/"([^"]*)"/
This matches a double quote, the contents of a string, and a closing quote mark, but I don't understand how it works. I'm confused by the caret (^).

I thought that if a (^) occurs as the first character of a character class, the character class is negated.

Will someone clarify this?

update (broquaint): added formatting

Re: Pattern Matching Question
by Fletch (Bishop) on Sep 10, 2003 at 14:43 UTC

    Right, it means:

    • "
    • zero or more of anything but " (which is stored in $1)
    • another "

    Of course you probably want to use something like Text::Balanced instead.
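For anyone curious what the Text::Balanced route might look like, here is a minimal sketch using its extract_delimited function. The sample text is made up for illustration. As far as I know, extract_delimited only looks for the delimited string at the start of the text (after optional leading whitespace) and treats a backslash as the escape character by default, so this is not a drop-in replacement for a regex that searches anywhere in the string.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Illustrative input: a quoted string containing an escaped quote.
my $text = '"Hello \" world!" I am Roger';

# In list context, extract_delimited returns the delimited substring
# (delimiters included) followed by the rest of the text.
my ($extracted, $remainder) = extract_delimited($text, '"');

print "extracted: $extracted\n";
print "remainder: $remainder\n";
```

Unlike the bare pattern, this keeps going past the backslash-escaped quote inside the string.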

Re: Pattern Matching Question
by Abigail-II (Bishop) on Sep 10, 2003 at 14:58 UTC
    I thought that if a (^) occurs as the first character of a character class, the character class is negated.

    Exactly. So, /[^"]/ matches any character that isn't a double quote. /[^"]*/ matches zero or more characters that are not double quotes, and /"([^"]*)"/ matches a double quote (the starting delimiter), zero or more characters that aren't a double quote (the content), and then a double quote (the ending delimiter). The parens capture the content.
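A minimal sketch of that breakdown, for anyone who wants to try it. The helper name first_quoted and the sample string are made up for illustration; only the pattern itself comes from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between the first pair of
# double quotes, or undef when the string has no quoted part.
sub first_quoted {
    my ($text) = @_;
    return $text =~ /"([^"]*)"/ ? $1 : undef;
}

my $line = 'name="Joy" city="Oslo"';

# In list context with /g, the capture is returned once per match.
my @all = $line =~ /"([^"]*)"/g;

print first_quoted($line), "\n";   # first quoted value only
print join(', ', @all), "\n";      # every quoted value
```

Because [^"] can never consume a double quote, each match stops at the nearest closing quote.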

    Abigail

Re: Pattern Matching Question
by antirice (Priest) on Sep 10, 2003 at 19:49 UTC

    This is an excellent time to learn about a module called YAPE::Regex::Explain. With it, you can do the following:

    #!/usr/bin/perl -w
    use strict;
    use YAPE::Regex::Explain;

    my $regex_i_dont_understand = q~"([^"]*)"~;
    print YAPE::Regex::Explain->new($regex_i_dont_understand)->explain;

    __DATA__
    output:

    (?-imsx:"([^"]*)")

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      "                        '"'
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        [^"]*                    any character except: '"' (0 or more
                                 times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
      "                        '"'
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

    Pretty nifty, eh? Now whenever you don't understand the way a particular regular expression works, just set $regex_i_dont_understand to it and it will explain it piece by piece.

    Hope this helps.

    antirice    
    The first rule of Perl club is - use Perl
    The ith rule of Perl club is - follow rule i - 1 for i > 1

Re: Pattern Matching Question
by sweetblood (Prior) on Sep 10, 2003 at 14:55 UTC
    It looks like it is intended to take everything between the double quotes by capturing what is NOT (^) a double quote. There could be problems with this approach, though: for instance, there may be a double quote inside the string that is not intended to be a closing quote, as in the string "supplied on 5.25" disk". There is probably a better way to extract the string from between the double quotes; e.g. /^"(.*)"$/ might do it if the entire string is wrapped in double quotes.
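A quick sketch of the difference on that exact string. The variable names are made up for illustration, and the anchored pattern assumes the whole field is wrapped in double quotes.

```perl
use strict;
use warnings;

my $field = '"supplied on 5.25" disk"';

# The negated class stops at the first inner quote.
my ($negated) = $field =~ /"([^"]*)"/;

# Anchoring both ends lets the greedy .* run to the last quote.
my ($anchored) = $field =~ /^"(.*)"$/;

print "negated class: $negated\n";    # supplied on 5.25
print "anchored:      $anchored\n";   # supplied on 5.25" disk
```

The anchored version only makes sense for one field at a time; on a line holding several quoted fields it would swallow everything between the first and last quote.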

      In English, I can see the string "supplied on 5.25" disk" as valid. But since I'm of a literal mind, the disk and closing quote aren't part of the string.

      So I'd propose an example of what you are talking about as something like "supplied on 5.25\" disk" which is a valid Perl string...

      Not that it really matters, just a slight nitpick.

        The problem is that we don't always get to deal with data we create. In my current work, I deal with POS data from many sources, and lots of it. So this, in a way, is a real-life example. Some of the data I deal with comes in a quote-delimited fashion, and I often see things like this.
        You are right: if I were creating these data, I would likely use double quotes only to delineate between alpha and numeric data. Of course, that's only if I were limited to using a delimiter that might appear in my data.
        Unfortunately, most commercially available parsing tools will parse these data incorrectly. Like you, they view it as "supplied on 5.25" and the rest just hits the bit bucket. This is where perl comes in handy to prep data that has issues like this.

        Thank goodness and LW for perl!

Re: Pattern Matching Question
by zby (Vicar) on Sep 10, 2003 at 14:56 UTC
    I thought that if a (^) occurs as the first character of a character class, the character class is negated.
    And you were right. [^"]* matches a run of characters other than the double quote, so the whole pattern matches a run of non-quote characters enclosed in double quotes.
Re: Pattern Matching Question
by Zaxo (Archbishop) on Sep 10, 2003 at 15:02 UTC

    The expression [^"] is a character class, the caret meaning, as you say, negation - 'anything but what follows'. The * after that means match zero or more of them, greedily. The parentheses capture what's matched in $1. The enclosing quotes are matched literally. The result is that everything between the first and second quote is captured.

    Another way would be to use a non-greedy expression in the capture, /"(.*?)"/

    After Compline,
    Zaxo

      Another way would be to use a non-greedy expression in the capture, /"(.*?)"/

      Uhm, not quite. You'd have to use /"(.*?)"/s, because without /s the dot does not match a newline, while [^"] does. Furthermore, if you would embed the regex in a larger one, "[^"]*" would never match a double quote inside the other ones, while ".*?" may.
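Both points can be checked with a short sketch. The sample strings and variable names are made up for illustration; the trailing " = " stands in for "a larger regex".

```perl
use strict;
use warnings;

# Point 1: a quoted string that spans two lines.
my $multiline = qq{"line one\nline two"};

my ($negated)    = $multiline =~ /"([^"]*)"/;   # [^"] also matches "\n"
my ($lazy_plain) = $multiline =~ /"(.*?)"/;     # . stops at "\n": no match
my ($lazy_s)     = $multiline =~ /"(.*?)"/s;    # /s lets . match "\n"

print(defined($lazy_plain) ? "plain .*? matched\n" : "plain .*? failed\n");

# Point 2: the same two captures embedded in a larger pattern.
my $embedded = '"a" x "b" = 1';

my ($class_cap) = $embedded =~ /"([^"]*)" = /;  # captures: b
my ($lazy_cap)  = $embedded =~ /"(.*?)" = /;    # captures: a" x "b

print "negated class: $class_cap\n";
print "lazy dot:      $lazy_cap\n";
```

The lazy version is only minimal for a given starting position; when the rest of the pattern fails, it happily stretches across inner quotes to make the overall match succeed.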

      Abigail

Re: Pattern Matching Question
by dsb (Chaplain) on Sep 10, 2003 at 17:51 UTC
    The ^ inside the [] as the first character means anything that is NOT a double quote. The * outside the [] means to match as many non double quotes as possible.

    Hello again, everyone. Been a while :)




    Amel
    This is my cool %SIG
Re: Pattern Matching Question
by Roger (Parson) on Sep 10, 2003 at 23:21 UTC
    Just an alternative of double-quote matching - the following regular expression will match anything wrapped inside a double quote, including the escaped double-quotes.
    $_ = '"Hello \" world!" I am Roger';
    ($str) = /("(?:\\"|.)*?")/x;
    print "$str\n";
    Note that (?: ... ) groups without capturing, i.e. it tells perl not to remember the inner pattern, which makes it a bit more efficient.
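A commonly used variant, sketched here as an alternative rather than as a correction, treats a backslash followed by any character as one escaped pair. That way an escaped backslash right before the closing quote is not mistaken for an escaped quote. The helper name quoted_body is made up for illustration, and the escapes are left in place rather than decoded.

```perl
use strict;
use warnings;

# Hypothetical helper: return the body of the first double-quoted string,
# treating backslash + any character as an escaped pair.
sub quoted_body {
    my ($text) = @_;
    return $text =~ /"((?:[^"\\]|\\.)*)"/s ? $1 : undef;
}

# Escaped quote inside the string.
print quoted_body('"Hello \" world!" I am Roger'), "\n";

# Escaped backslash just before the closing quote
# (the single-quoted literal holds two backslashes).
print quoted_body('"C:\\\\" and "x"'), "\n";
```

The two alternatives inside the group cannot match at the same position, so there is no ambiguity for the regex engine to backtrack over.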
Re: Pattern Matching Question
by idsfa (Vicar) on Sep 11, 2003 at 06:03 UTC
Re: Pattern Matching Question
by bl0rf (Pilgrim) on Sep 11, 2003 at 01:01 UTC
    much ado about nothing