rhymejerky has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a line that looks like var1='1' var2="2" var3="3" and so on. The ' and " are interchangeable as long as the opening quote and the closing quote are the same: '1' and "2" are valid, but "1' and '2" are not. Is there a way to check this line and say whether it is valid? I think the bottom line is to figure out what my opening quote is, so I know what to look for as the closing quote. I thought about using $1 and $2, but that doesn't seem to work. I see something similar in http://www.perlmonks.org/?node_id=698062, but that one uses only one kind of quote. Any help would be appreciated. Thanks, R

Replies are listed 'Best First'.
Re: matching a line with ' and "
by CountZero (Bishop) on Jul 18, 2008 at 05:40 UTC
    This will check if you have a valid expression of the form ='something' or ="something else":
    m/=(?:"[^"']+")|(?:'[^"']+')/;

    As you see we check for both valid possibilities (hence the use of |). This means that we do not have to save what was the first quote.

    Here is a full explanation (with thanks to YAPE::Regex::Explain):

    The regular expression:

    (?-imsx:=(?:"[^"']+")|(?:'[^"']+'))

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      =                        '='
    ----------------------------------------------------------------------
      (?:                      group, but do not capture:
    ----------------------------------------------------------------------
        "                        '"'
    ----------------------------------------------------------------------
        [^"']+                   any character except: '"', ''' (1 or
                                 more times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
        "                        '"'
    ----------------------------------------------------------------------
      )                        end of grouping
    ----------------------------------------------------------------------
     |                        OR
    ----------------------------------------------------------------------
      (?:                      group, but do not capture:
    ----------------------------------------------------------------------
        '                        '\''
    ----------------------------------------------------------------------
        [^"']+                   any character except: '"', ''' (1 or
                                 more times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
        '                        '\''
    ----------------------------------------------------------------------
      )                        end of grouping
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

    Please note this does not work with forms such as var1='"1"' or similar.
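
    A side note on grouping, as a minimal sketch rather than a correction of the original post: in the pattern above the | sits at the top level of the regex, so the leading = belongs only to the double-quoted branch, and a single-quoted string with no = in front of it also matches. Putting both branches inside one group ties the = to each of them. The variable names $posted and $grouped and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Pattern as posted: '=' is part of the first alternative only.
my $posted  = qr/=(?:"[^"']+")|(?:'[^"']+')/;
# Hypothetical variant: one group, so '=' is required before either quote style.
my $grouped = qr/=(?:"[^"']+"|'[^"']+')/;

for my $text (q{var1='1'}, q{var2="2"}, q{just 'text'}, q{var3="3'}) {
    printf "%-12s posted: %-3s grouped: %s\n", $text,
        ($text =~ $posted  ? 'yes' : 'no'),
        ($text =~ $grouped ? 'yes' : 'no');
}
```

    The two patterns should only disagree on the third sample, which has a quoted string but no assignment.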

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: matching a line with ' and "
by almut (Canon) on Jul 18, 2008 at 06:50 UTC
    I thought about using $1 and $2, but that doesn't seem to work.

    I'm not saying that this is the preferred solution... but just for general interest: in those cases, you could use \1 and \2 instead, e.g.

    my $s = q(var1='1' var2="2" var3="'3'" var4='"4"');
    while ($s =~ m/(\w+)=(["'])(.*?)\2/g) {
        print "(using quote $2): $1 = $3\n";
    }

    Output:

    (using quote '): var1 = 1
    (using quote "): var2 = 2
    (using quote "): var3 = '3'
    (using quote '): var4 = "4"

    The ugly thing is that with input such as (incorrect according to your spec)

    my $s = q(var1='1" var2="2');

    it would extract the entire '1" var2="2' substring as one single quoted value...

    You'd have to work around that by disallowing some separator (like whitespace) within the quoted string, or some such, to properly group the assignments (e.g. using the char class [^\s] in place of .). That's problematic if you do need to allow spaces in the quoted values, though.
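
    As a rough sketch of that workaround, assuming whitespace is never allowed inside a quoted value: the only change from the loop above is \S*? in place of .*?. The helper name pairs_without_spaces is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: extract name/value pairs, not allowing whitespace
# inside a quoted value (\S*? replaces .*? from the loop above).
sub pairs_without_spaces {
    my ($s) = @_;
    my @pairs;
    while ($s =~ m/(\w+)=(["'])(\S*?)\2/g) {
        push @pairs, [$1, $3];
    }
    return @pairs;
}

for my $s (q(var1='1' var2="2"), q(var1='1" var2="2')) {
    my @pairs = pairs_without_spaces($s);
    print "input: $s, pairs found: ", scalar(@pairs), "\n";
}
```

    With the mismatched input no pair is extracted at all, because the first value can no longer run across the space into the next assignment.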

    Update: fixed/simplified [^\2]+? -> .*? in the regex, because on second thought, when looking at hipowls's suggestion below, it occurred to me that (a) the backref in the character class doesn't actually work :), (b) even if it did, that part of the regex would've been redundant anyway, due to the non-greedy match...
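
    For reference, a sketch of one way to express what [^\2] was presumably aiming at: inside a character class \2 is not treated as a backreference, but a negative lookahead can stand in for "any character other than the opening quote". The helper name parse_pairs is hypothetical, and for this input the result should be the same as with the simpler .*? version.

```perl
use strict;
use warnings;

# Hypothetical helper: (?:(?!\2).)* consumes characters one at a time,
# stopping before anything that equals the opening quote captured in group 2.
sub parse_pairs {
    my ($s) = @_;
    my @pairs;
    while ($s =~ m/(\w+)=(["'])((?:(?!\2).)*)\2/g) {
        push @pairs, "$1 = $3";
    }
    return @pairs;
}

print "$_\n" for parse_pairs(q(var1='1' var2="2" var3="'3'" var4='"4"'));
```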

      No, almut. It might look ugly, but it's the correct output. It might be a typo, but that's what we programmers have to deal with ;-)


      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

        Yes, it's correct from a purely syntactic point of view. Still, it's presumably not what the OP had in mind...

        Anyhow, I just wanted to point out that there's a general disambiguation problem, in case the OP needs to allow arbitrary quoted content, which might contain the very separator that's used to split up the individual key-value expressions. It's unclear from the given spec, however, whether that's the case.

Re: matching a line with ' and "
by gaal (Parson) on Jul 18, 2008 at 05:48 UTC
    Have you tried Text::Balanced? Or, if quotes can't be escaped, something like (untested):
    while ($line =~ s/^\s*([^'"=]+)=//) {    # eat a single var name and its '='
        my $var = $1;
        # eat a single quoted value and its trailing whitespace.
        $line =~ s/^'([^']+)'\s*// ||
        $line =~ s/^"([^"]+)"\s*// ||
            die "invalid";    # or return, whatever you need
    }
    # if we haven't consumed all our input, this line is invalid.
    die "invalid" if length $line;
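
    A minimal sketch of the Text::Balanced route mentioned above, for comparison. It uses extract_delimited, which that module exports on request; the helper name parse_line and its return convention are assumptions made for this example. The empty third argument asks for no prefix to be skipped, so the quote has to follow the = directly.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Hypothetical helper: returns a list of [name, value] pairs,
# or an empty list if the line cannot be consumed completely.
sub parse_line {
    my ($line) = @_;
    my @pairs;
    while ($line =~ s/^\s*(\w+)=//) {
        my $name = $1;
        # List context: (extracted text including its quotes, remainder, ...)
        my ($quoted, $rest) = extract_delimited($line, q{'"}, '');
        return unless defined $quoted && length $quoted;
        push @pairs, [$name, substr($quoted, 1, -1)];    # strip the delimiters
        $line = $rest;
    }
    return $line =~ /\S/ ? () : @pairs;
}

for my $pair (parse_line(q{var1='1' var2="it's"})) {
    print "$pair->[0] = $pair->[1]\n";
}
```

    Note that an input such as var1='1" var2="2' is still accepted as one single-quoted value, since from a purely syntactic point of view it is one.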
Re: matching a line with ' and "
by hipowls (Curate) on Jul 18, 2008 at 08:57 UTC

    You could use a character class and a back reference instead of alternation.

    my $string = qq{var1='1' var2="2" var3="3"};
    while ( $string =~ /(\w+)=(['"])(.*?)\2/g) {
        print "$1 = $3\n";
    }
    or using perl 5.10
    use 5.010_000;
    while ( $string =~ /(?<variable>\w+)=(?<delim>['"])(?<value>.*?)\k<delim>/g) {
        say "$+{variable} = $+{value}";
    }

      Note that when the string includes mixed quoting which is not valid in the OP's terms:

      my $string = qq{var1='1' var2="2" var3="3" var4='4" var5="5' var6="correct"};

      ...hipowls's non-5.10 version (the 5.10 version was not tested) produces "ugly" output in the sense used by almut and skeeve:

      var1 = 1
      var2 = 2
      var3 = 3
      var4 = 4" var5="5
      var6 = correct

      More significantly, if $string is in the form:

      my $string = qq{var1='1' var2="2" var3="3" var4='4" var5="5'

      ...the output becomes:

      var1 = 1
      var2 = 2
      var3 = 3
      var4 = 4" var5=

      ...which is the same as the output produced if var5 is properly quoted, var5='5':

      my $string = qq{var1='1' var2="2" var3="3" var4='4" var5='5'};

      i.e.,

      var1 = 1
      var2 = 2
      var3 = 3
      var4 = 4" var5=

      That sort of ambiguity may be a problem for OP.
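
      One way to sidestep that ambiguity, sketched under the assumption that a value never contains either quote character, is to validate the whole line with a single anchored match before extracting anything. The helper name is_valid_line is hypothetical, and the pattern is only one possible reading of the OP's spec.

```perl
use strict;
use warnings;

# Hypothetical helper; assumes a value never contains ' or ".
# \1 must equal the quote that opened the current value, and every
# assignment must be followed by whitespace or the end of the line.
sub is_valid_line {
    my ($line) = @_;
    return $line =~ /\A\s*(?:\w+=(["'])[^"']*\1(?:\s+|\z))+\z/ ? 1 : 0;
}

for my $line (
    q{var1='1' var2="2" var3="3"},
    q{var1='1' var2="2" var3="3" var4='4" var5="5' var6="correct"},
    q{var1='1' var2="2" var3="3" var4='4" var5='5'},
) {
    printf "%-7s %s\n", (is_valid_line($line) ? 'valid' : 'invalid'), $line;
}
```

      Because quote characters are excluded from the value, a mismatched pair can no longer be absorbed into a longer value, so both mixed-quote lines are rejected instead of being parsed ambiguously.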