Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

Mes Chers Moines,
I have set myself the following task:
  1. given a string which may or may not contain quoted strings within it
  2. count the quotes to make sure that there isn't an odd number of them
  3. if that test is passed, put the quoted strings in one array
  4. put the regular strings, split on whitespace, in another

So far I have something like this:

    $string = 'one two "three" four "five" six';
    $count_quotes = $string =~ tr/"/"/;
    if($count_quotes && ($count_quotes % 2 != 0)){
        # there is an odd number of quotes in the string
        $quotedstringerror = 1; # use this to show user an error message later
    }elsif($count_quotes){
        # there is an even number of quotes in the string
        $quotedstrings = 1;
    }
    if($quotedstrings){
        @quotedstrings = $string =~ /"([^"]*?)"/g; # get them all out with m//
        $string =~ s/"[^"]*?"//g; # remove them from the original with s///
    }
    @unquotedstrings = split(/ +/,$string); # grab remaining regular strings using split
    print "\@quotedstrings: @quotedstrings \n";
    print "\@unquotedstrings: @unquotedstrings \n";

    # output: @quotedstrings: three five
    #         @unquotedstrings: one two four six

This seems to work OK, but could I be doing it in a better way?

Do I really have to use tr/// to count them (because m// doesn't return a number of matches), then use m// to get them out (because s/// doesn't push finds onto an array), then use s/// to clean up after myself?

I keep thinking there's something I've missed.
--

“Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.”
M-J D

Replies are listed 'Best First'.
Re: separated quoted stings from unquoted strings
by steves (Curate) on Feb 15, 2003 at 10:36 UTC

    Text::Balanced can handle all the nuances of quoted or delimited strings, including escapes. It also has built-in diagnostics for unbalanced quotes and the like.

        use strict;
        use Text::Balanced qw(extract_quotelike);

        my $string = 'one two "three" four "five" six';
        my ($q, $r, $p);
        my (@quoted, @unquoted);
        my ($quoted, $unquoted);

        $r = $string;
        while (($q, $r, $p) = extract_quotelike($r, '[^"]*')) {
            die "$@" if ($@ =~ /match/);
            $quoted .= " " if (defined($quoted));
            $unquoted .= " " if (defined($unquoted));
            if (!$q) {
                $unquoted .= $r;
                last;
            }
            $quoted .= $q;
            $unquoted .= $p;
        }
        @quoted = split(/\s+/, $quoted);
        @unquoted = split(/\s+/, $unquoted);
        print "quoted:\n ", join("\n ", @quoted), "\n";
        print "unquoted:\n ", join("\n ", @unquoted), "\n";

    produces:

        quoted:
         "three"
         "five"
        unquoted:
         one
         two
         four
         six

    I've used this package a bit. It's a little clumsy at times and may be overkill here, but it may be useful if you need to handle more than the simple cases. If you're looking only for explicit types of quotes, one of its other functions may be more useful.

Re: separated quoted stings from unquoted strings
by Enlil (Parson) on Feb 15, 2003 at 10:13 UTC
    This works, though not in the same order, and like BrowserUK's solution does not work for embedded quotes. (TMTOWTDI).
        use strict;
        use warnings;
        my $string = 'one two "three" four "five" six';
        my (@quoted,@unquoted);
        my $length = 0;
        while ( $string =~ /\G\s*(?:("[^"]+")|([^"\s]+))(?:\s+)?/g ) {
            if ( $1 ) { push @quoted,$1 }
            else { push @unquoted,$2 }
            $length = pos $string;
        }
        unless ( length($string) == $length ) {
            print "Quotes did not match"
        }
        else {
            local $\ ="\n";
            print q(@quoted:), join ",", @quoted;
            print q(@unquoted:),join ",",@unquoted;
        }
    And just for fun, here's another solution:
        use strict;
        use warnings;
        my $string = ' one two "three" four "five" six';
        my (@quoted,@unquoted);
        $\ ="\n";
        1 while ( $string =~ s/"([^"]+)"/push @quoted,$1;''/ge );
        if ( $string =~ /"/ ) {
            print q(A " was found so failure) and exit;
        }
        push @unquoted, split " ", $string;
        print q(@quoted:), join ",", @quoted;
        print q(@unquoted:),join ",",@unquoted;
    I doubt either solution is the best one. I think the best solution depends on the strings and on what you mean by a quoted string (if there can be escaped quotes inside the string, neither solution presented here will work). Well, good luck on your quest for knowledge.

    -enlil

Re: separated quoted stings from unquoted strings
by Coruscate (Sexton) on Feb 15, 2003 at 10:00 UTC

    Most of your question has been answered. As for having to use tr//, you can use m// instead:

    ++$counted_quotes while $string =~ /"/g;

    Not sure which one you'd want though, the tr// or that longer m//. One thing that I noticed immediately about your script is a lack of the strict pragma. You might want to add that :)
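    For comparison, here is a minimal sketch of three ways to count the quotes: tr// with an empty replacement list, the list-assignment counting idiom, and the m//g loop shown above. The sub name count_quotes_three_ways is made up for this illustration and is not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the quote count computed three different ways.
sub count_quotes_three_ways {
    my ($string) = @_;

    # tr/// with an empty replacement list counts without changing the string.
    my $by_tr = ($string =~ tr/"//);

    # m//g in list context returns every match; assigning that list to ()
    # and reading the assignment in scalar context gives the match count.
    my $by_list = () = $string =~ /"/g;

    # The loop form: one increment per successful match.
    my $by_loop = 0;
    ++$by_loop while $string =~ /"/g;

    return ($by_tr, $by_list, $by_loop);
}

my @counts = count_quotes_three_ways('one two "three" four "five" six');
print "@counts\n";    # prints "4 4 4"
```

    All three agree, so the choice is mostly about taste. tr// is the shortest when you are counting a single character.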


    Update: I actually posted a reply to answer this, only to find I had replied to a different node, lol. Anyway, to copy it over (okay, I modified the regex in this version to get rid of the terrible horrible abomination, I mean the ".*?"):

        my $string = 'first-item "second item" '
                   . 'third-item "fourth item"';
        my %items;

        push @{$items{ $1 =~ /"/ ? 'quoted' : 'unquoted' }}, $1
            while $string =~ /("[^"]+"|\S+)/g;

        print 'Quoted: ', join(', ', @{$items{'quoted'}}), "\n";
        print 'Unquoted: ', join(', ', @{$items{'unquoted'}}), "\n";


    Credits: theorbtwo for "terrible horrible abomination", which he just posted in the CB, saying he got to use it in a sentence. I couldn't let him be the only one to use it :)


    If the above content is missing any vital points or you feel that any of the information is misleading, incorrect or irrelevant, please feel free to downvote the post. At the same time, reply to this node or /msg me to tell me what is wrong with the post, so that I may update the node to the best of my ability. If you do not inform me as to why the post deserved a downvote, your vote does not have any significance and will be disregarded.

Re: separated quoted stings from unquoted strings
by BrowserUk (Patriarch) on Feb 15, 2003 at 05:28 UTC

    Update: Enlil++ points out that the original solution below isn't one. I completely missed capturing the first unquoted word in the example string, which translates to missing any and all unquoted strings that are not followed by a quoted string or the end-of-string! Whoops.

    The addition of * at the end of the second regex fixes the problem (I think?).

        $s = 'one seven eight two "three" four "five" "nine" ten eleven six twelve';

        @q = $s =~ m[" ( [^"]+ ) "]gx;
        print @q;
        # three five nine

        @u = $s =~ m[(\S+)\s* (?: (?: " [^"]+ " \s*) | $ )* ]gx;
        print @u;
        # one seven eight two four ten eleven six twelve

    Provided your quoted strings don't contain embedded quotes, this should work once you've checked for balanced quotes.

        $string = 'one two "three" four "five" six';
        @quoted = $string =~ m[" ( [^"]+ ) "]gx;
        @unquoted = $string =~ m[(\S+)\s* (?: (?: " [^"]+ " \s*) | $ )]gx;
        print @quoted;
        print @unquoted;

        three five
        two four six


    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: separated quoted stings from unquoted strings
by Dr. Mu (Hermit) on Feb 16, 2003 at 00:00 UTC
    Text::ParseWords is also a useful tool here, viz:
        use strict;
        use Text::ParseWords;

        my $string = 'one "two and a half" "three" four "five" "six"';
        my @words = &parse_line('\s+', 1, $string);
        my $error = length($string) && @words == 0;
        my @unquoted = grep {!m/"/} @words;
        my @quoted = grep {s/^"(.*)"$/$1/} @words;
        print join('|', @unquoted)."\n";
        print join('|', @quoted)."\n";
        print $error;

    The second argument to parse_line tells it to retain the quotes it finds. The function returns an empty list in the case of unbalanced quotes, so we can use that to detect an error. Also note the order of the two greps. The substitution in the second one operates directly on the elements of @words, because grep aliases $_ to each element rather than copying it (which I have just learned while concocting this example). So it cannot come before the one above it.
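    The aliasing behaviour of grep can be seen in isolation with a small sketch. The word list below is made up for the illustration and is not the output of parse_line.

```perl
use strict;
use warnings;

# Made-up input: one element still carries its surrounding quotes.
my @words = ('one', '"two and a half"', 'four');

# grep aliases $_ to each element, so the s/// rewrites @words itself.
# Only the elements where the substitution succeeded are returned.
my @quoted = grep { s/^"(.*)"$/$1/ } @words;

print join('|', @quoted), "\n";   # prints "two and a half"
print join('|', @words),  "\n";   # prints "one|two and a half|four"
```

    After the second grep the quotes are gone from @words, which is why a test for a quote character has to run first.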
Re: separated quoted stings from unquoted strings
by hv (Prior) on Feb 17, 2003 at 11:27 UTC

    I'd be inclined to do it like this:

        my $string = 'one two "three" four "five" six';
        my(@quot, @unquot);
        while ($string =~ m{
            \G
            (?:
                ([^\s"]+)       # $1: unquoted string, never empty
            |
                " ([^"]*) "     # $2: quoted string, maybe empty
            )
            \s*                 # consume the separating whitespace
        }xgc) {
            if (defined $1) {
                push @unquot, $1;
            } else {
                push @quot, $2;
            }
        }
        unless (pos($string) == length($string)) {
            warn "Couldn't parse string - maybe the quotes are mismatched\n";
        }
        print "quoted strings: @quot\n";
        print "unquoted strings: @unquot\n";

    This keeps the parsing in one place, so you can easily (for example) extend the matching to allow backslash-escaped quotes inside the quoted strings.
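    One way that extension might look, as a sketch only: the quoted branch accepts either a character that is neither a quote nor a backslash, or a backslash followed by any character. The sub name split_quoted, its return convention, and the choice to strip the escaping backslashes afterwards are all assumptions made for this example, not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (\@quoted, \@unquoted) on success,
# or an empty list if the string could not be parsed to the end.
sub split_quoted {
    my ($string) = @_;
    my (@quot, @unquot);
    pos($string) = 0;    # so pos() is defined even if nothing matches
    while ($string =~ m{
        \G
        (?:
            ([^\s"]+)                # group 1: unquoted string, never empty
        |
            " ((?:[^"\\]|\\.)*) "    # group 2: quoted string, backslash escapes allowed
        )
        \s*                          # consume the separating whitespace
    }xgcs) {
        if (defined $1) {
            push @unquot, $1;
        } else {
            (my $text = $2) =~ s/\\(.)/$1/gs;    # drop the escaping backslashes
            push @quot, $text;
        }
    }
    return unless pos($string) == length($string);
    return (\@quot, \@unquot);
}

my ($quot, $unquot) = split_quoted('one "two \"2\" half" three');
print "quoted strings: @$quot\n";      # prints: quoted strings: two "2" half
print "unquoted strings: @$unquot\n";  # prints: unquoted strings: one three
```

    Because everything still goes through the one regex, the mismatch check on pos() keeps working unchanged: an unterminated quote leaves pos() short of the end of the string.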

    Hugo