cmm7825 has asked for the wisdom of the Perl Monks concerning the following question:

I've got huge log files with about 20 fields that I want to split into variables using a regex. If a field has text, it'll be surrounded by quotes, like so:

"some data here"

In order to pull the data out of the field I'd use a regex like this: /"([^"]+)/

Which gives me:

some data here

That's exactly what I want. The problem is that if there is no data in that field, there is simply a hyphen (-). So I want my regex to match either the double quotes or the hyphen. When I do this I end up with nested parentheses, which throw off my match variables, like this:

(("[^"]+)|(-))

I'm not sure if my post made sense, but basically I want to match the data within the quotes, and if there are no quotes, match the hyphen. Either way I want that data to be in the same match variable. Thanks, Chris

Replies are listed 'Best First'.
Re: Regex: Matching quoted text
by kennethk (Abbot) on Jun 03, 2010 at 21:04 UTC
    After this obligatory warning of the hazards and potential fragility of regular expressions, you can fix your case fairly trivially by moving your hyphen into your capturing parentheses:

    /("[^"]+|-)/

    You can also use non-capturing parentheses ((?:...)) with expressions of the form:

    /((?:"[^"]+)|(?:-))/.

    You might also want to add a closing double quote to your regular expression, so that you don't match incorrectly on input such as "A string" not a string "a string", a la

    /("[^"]+"|-)/

    See perlretut for more information.
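A minimal sketch of the single-group approach above. Note that the capture still includes the surrounding quotes, so this sketch strips them in a second step; the sample fields and the variable names are assumptions made for illustration, not data from the original logs.

```perl
use strict;
use warnings;

# Hypothetical sample fields, not taken from the original logs.
my @fields = ('"some data here"', '-');

my @values;
for my $field (@fields) {
    # One capture group: either a fully quoted string or a bare hyphen.
    if ($field =~ /("[^"]+"|-)/) {
        my $value = $1;
        # The capture includes the quotes, so strip them if present.
        $value =~ s/^"(.*)"$/$1/;
        push @values, $value;
    }
}

print "$_\n" for @values;
```

The trade-off is an extra substitution per field in exchange for a pattern that works on any perl version.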

Re: Regex: Matching quoted text
by JavaFan (Canon) on Jun 03, 2010 at 21:28 UTC
    Assuming you are running 5.10.0, 5.10.1, 5.12.0, or 5.12.1, use
    /(?|"([^"]+)"|(-))/
    That should put the match in $1, regardless of which alternation matched.

    If you are using a pre-5.10 perl, consider this a good reason to upgrade.
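As a rough illustration of the branch-reset construct, the sketch below wraps the pattern in a helper. The name `field_value` and the sample inputs are hypothetical; the pattern itself is the one given above and needs perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;    # the branch reset (?|...) requires perl 5.10 or later

# Hypothetical helper: returns the field content, or undef if nothing matches.
sub field_value {
    my ($text) = @_;
    # Inside (?|...), each alternative numbers its groups from 1 again,
    # so both branches capture into $1.
    return $text =~ /(?|"([^"]+)"|(-))/ ? $1 : undef;
}

print field_value('"some data here"'), "\n";
print field_value('-'), "\n";
print field_value('"0"'), "\n";
```

Because only $1 is ever consulted, a quoted "0" comes back intact rather than being lost to a truthiness test.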

      Ugh... I'm using CentOS on this server, and even at the latest release they only have 5.8 in the repos. Is there any way to do a similar regex without using that?

        Not if you want the match to show up in $1, regardless of which alternation matched. If it was possible beforehand, 5.10 would not have introduced the (?|) construct.
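For what it's worth, if the value does not have to live in $1 itself, a list-context match may be enough on 5.8. This is only a sketch under that assumption; the helper name `field_value_58` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper; avoids (?|...) so it should run on perl 5.8.
sub field_value_58 {
    my ($text) = @_;
    # In list context a successful match returns ($1, $2); at most one of
    # them is defined, and grep keeps only that one.
    my ($value) = grep { defined $_ } $text =~ /"([^"]+)"|(-)/;
    return $value;    # undef when the pattern did not match at all
}

print field_value_58('"some data here"'), "\n";
print field_value_58('-'), "\n";
```

The result ends up in one lexical variable rather than in a match variable, which sidesteps the numbering problem instead of solving it.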
Re: Regex: Matching quoted text
by afoken (Chancellor) on Jun 04, 2010 at 13:54 UTC

    Maybe Text::Balanced could help you? Or one of the log file handling modules available on CPAN?

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Regex: Matching quoted text
by choroba (Cardinal) on Jun 03, 2010 at 21:16 UTC
    If you use /^(?:"([^"]+)"|(-))$/, the matching expression is not always in the same variable, but it is always in $1 || $2 :)

    Update: As tye wrote in his reply, it's $2 || $1.

      If you use /^(?:"([^"]+)"|(-))$/, the matching expression is not always in the same variable, but it is always in $1 || $2 :)
      Eh, no. Consider:
      '"0"' =~ /^(?:"([^"]+)"|(-))$/;
      Then $1 || $2 is actually undefined (because $1 is 0, hence false, and $2 is undefined). If the OP uses 5.10 or later, he could use $1 // $2, but then there's a better solution available as well (see my other post).
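The difference can be checked with a short sketch, assuming the input is the three-character string "0" including its quotes. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my ($one_or_two, $two_or_one, $by_defined);
if ('"0"' =~ /^(?:"([^"]+)"|(-))$/) {
    # $1 is "0" (false) and $2 is undef, so this falls through to undef.
    $one_or_two = $1 || $2;
    # $2 is either "-" (true) or undef, so testing it first is safe.
    $two_or_one = $2 || $1;
    # Testing definedness instead of truth also keeps the "0".
    $by_defined = defined $1 ? $1 : $2;
}

print '$1 || $2:   ', (defined $one_or_two ? $one_or_two : 'undef'), "\n";
print '$2 || $1:   ', (defined $two_or_one ? $two_or_one : 'undef'), "\n";
print 'defined $1: ', (defined $by_defined ? $by_defined : 'undef'), "\n";
```

The explicit defined test is what $1 // $2 does on 5.10 and later, spelled out so that it also runs on older perls.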
Re: Regex: Matching quoted text
by Marshall (Canon) on Jun 06, 2010 at 09:42 UTC
    This regex could be tweaked a bit so that it doesn't "throw" undefined values into @tokens. But this appears to do what you want in a very straightforward way.
    #!/usr/bin/perl -w
    use strict;

    while (<DATA>)
    {
       print "input: $_";
       my @tokens = grep{defined($_)}m/"(.*?)"|(\S+)/g;
       foreach my $tok (@tokens)
       {
          print " $tok\n";
       }
    }
    #or
    while (<DATA>)
    {
       print "input: $_";
       my @tokens = m/"(.*?)"|(\S+)/g;
       foreach my $tok (@tokens)
       {
          next unless defined($tok);
          print " $tok\n";
       }
    }
    =prints
    input: - "adf adf" - "something else"
     -
     adf adf
     -
     something else
    input: "another line" - - "adf"
     another line
     -
     -
     adf
    input: abc = 24 "adf adf"
     abc
     =
     24
     adf adf
    input: -56 ouy "97 lkh wer" - 87 -98
     -56
     ouy
     97 lkh wer
     -
     87
     -98
    =cut
    __DATA__
    - "adf adf" - "something else"
    "another line" - - "adf"
    abc = 24 "adf adf"
    -56 ouy "97 lkh wer" - 87 -98
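One possible tweak that keeps undefined values out of @tokens, sketched under the assumption that perl 5.10 or later is available: wrapping the two alternatives in a branch reset leaves a single capture group, so a list-context /g match returns exactly one value per token. The sample line is the first line of the data above.

```perl
use strict;
use warnings;
use 5.010;    # the branch reset (?|...) requires perl 5.10 or later

my $line = '- "adf adf" - "something else"';

# Both alternatives capture into group 1, so /g in list context
# yields one defined value per token and no undef placeholders.
my @tokens = $line =~ /(?|"(.*?)"|(\S+))/g;

print " $_\n" for @tokens;
```

On 5.8 the grep{defined($_)} filter shown above remains the simpler option.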