pg has asked for the wisdom of the Perl Monks concerning the following question:

I made a tool for myself to process JCL, yes, that good old mainframe stuff. One thing that cost me quite some time was splitting strings on blanks. The tricky part is that if a blank is within a pair of single quotes, you should not split there. For example, a b c d should be split into a, b, c, and d, but a 'b c' d should be split into a, 'b c', and d. I did it, but I had to write my own version of split. Are there any suggestions? Is there any built-in Perl feature I can use? Maybe the Perl split function should be extended in the future to accept one more parameter, for example a ref to a list of escape characters.

Re: split on spaces, except those within quotes?
by Kanji (Parson) on Nov 12, 2002 at 02:31 UTC

    No doubt a fancy regex will do the trick, but simpler in my mind would be Text::ParseWords' parse_line()...

    use Text::ParseWords;
    my @chunks = parse_line(' ', 0, $line);

    One thing to note, however, is parse_line() makes no distinction between single and double quotes, which may or may not work for you.
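    As a rough sketch of how that call behaves, the snippet below runs parse_line() on the sample string from the question. The variable names and the '\s+' delimiter are illustrative choices, not part of the original suggestion.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);   # ships with Perl

my $line = "a 'b c' d";   # sample input from the question

# $keep (second argument) false: quotes are stripped from the fields.
my @stripped = parse_line('\s+', 0, $line);

# $keep true: the quotes stay attached to the quoted field.
my @kept = parse_line('\s+', 1, $line);

print join('|', @stripped), "\n";
print join('|', @kept),     "\n";
```

    With a true $keep argument the quotes survive, which is closer to the output the question asks for.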

        --k.


Re: split on spaces, except those within quotes?
by cLive ;-) (Prior) on Nov 12, 2002 at 02:40 UTC
    I thought there must be a quick solution, but I could only think of this:
    #!/usr/bin/perl -w
    use strict;

    my $string = "a 'b c d' e f 'g h'";
    my $tmp    = '';
    my @result = ();

    for (split /\s+/, $string) {
        if ($tmp ne '') {
            # inside a quoted run: keep appending until a word ends with a quote
            $tmp .= " $_";
            if (/'$/) {
                push @result, $tmp;
                $tmp = '';
            }
        }
        elsif (/^'/ && !/^'.*'$/) {
            # opening quote with no closing quote on the same word
            $tmp = $_;
        }
        else {
            push @result, $_;
        }
    }

    print join "\n", @result;

    Of course, you could use the DBD::CSV module, setting the field separator to 'space' and the quote character to 'single quote' (spelled for clarity, not for actual use :)...

    But my guess is that would be slower than a purpose designed parser for this very specific case.

    .02

    cLive ;-)

Re: split on spaces, except those within quotes?
by BrowserUk (Patriarch) on Nov 12, 2002 at 03:27 UTC

    I probably deserve hate mail for this one but...

    #! perl -sw
    use strict;

    sub tokenize ($) {
        local $_ = shift;
        s/(('|").*?\2)/ ($£ = $1) =~ s!\s+!\cA!g; $£ /ge; #!"
        grep{s/\cA/ /g, $_}split/\s+/;
    }

    my @bits = tokenize q/a "b c d" e f 'g h' ijk "l m n " op 'q r s t' u'v w'x yz/;

    local $,='|';
    print @bits,$/;

    __END__
    c:\test>212174
    a|"b c d"|e|f|'g h'|ijk|"l m n "|op|'q r s t'|u'v w'x|yz|

    Nah! You're thinking of Simon Templar, originally played (on UKTV) by Roger Moore and later by Ian Ogilvy

      Well, I'm astonished that worked. The last (and first) time I tried to do a re-entrant regex I triggered some sort of malloc error. I just thought that doing regexes while inside a regex was disallowed or something. Odd.

      __SIG__ use B; printf "You are here %08x\n", unpack "L!", unpack "P4", pack "L!", B::svref_2object(sub{})->OUTSIDE;

        I assume you're referring to this:

        s/(('|").*?\2)/ ($£ = $1) =~ s!\s+!\cA!g; $£ /ge; #!"

        That isn't re-entrant. The right side of a substitution counts as a string (which, in this case, is eval'ed because of the /e); only the left side counts as a regex.

      This looks way cool. However, it is not coming through in the web browser as something usable. It has odd characters, A with symbols over top in several places, and what looks like a currency symbol in a couple of places. Does anybody have access to the original correct formula?

        I believe that line is supposed to be s/(('|").*?\2)/ ($£ = $1) =~ s!\s+!\cA!g; $£ /ge;. AFAICT, the nonstandard $£ is just supposed to be a scratch variable, so you can replace it with e.g. $a (assuming there's no sort in the call stack) or a lexical of your choosing.

        However, note BrowserUk's words: "I probably deserve hate mail for this one but..." - see e.g. Regexp::Common::delimited or Text::Balanced.
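        For anyone who cannot get the non-ASCII variable through their browser, here is a minimal sketch of the same idea with a lexical scratch variable. The names $line, $quoted and $word are my own, and map is used in place of grep; this is an untuned illustration, not BrowserUk's original code.

```perl
use strict;
use warnings;

# Same approach as the tokenize() above, with lexicals replacing the scratch variable.
sub tokenize {
    my ($line) = @_;

    # Hide whitespace inside '...' or "..." behind a placeholder (\cA)
    # so that the split below cannot see it.
    $line =~ s{(('|").*?\2)}{
        (my $quoted = $1) =~ s/\s+/\cA/g;
        $quoted;
    }ge;

    # Split on the remaining whitespace, then turn each placeholder back into a space.
    return map { (my $word = $_) =~ s/\cA/ /g; $word } split /\s+/, $line;
}

my @bits = tokenize(q/a "b c d" e f 'g h' u'v w'x yz/);
print join('|', @bits), "\n";
```

        Like the original, this collapses each run of whitespace inside quotes to a single space and assumes the input never contains a literal \cA.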

      Why did you use grep instead of map?

      —John

        Good point John. map works just as well and is a better fit. Thanks.


        Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
        Pick up your cloud down the end and "Yes" if you get allocated a grey one they are a bit damp under foot, but someone has to get them.
        Get used to the wings fast cos its an 8 hour day...unless the Govenor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
        Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

Re: split on spaces, except those within quotes?
by jryan (Vicar) on Nov 12, 2002 at 05:11 UTC

    Here's a fancy regex for Kanji :)

    Note that I have 2 different versions available below; one that takes into account backslashed quotes within quotes, and another that doesn't.

    # Use this one if you'd like to account for backslashed quotes
    my @matches = $string =~
    /
    ((?:
        ' (?: (?>[^\\']+) | \\ . )* '   # quoted run, backslash escapes allowed inside
      | \\ .                            # backslashed character outside quotes
      | [^\s'\\]*                       # ordinary unquoted characters
    )+)
    /gx;

    # This one does not take backslashed quotes into account
    #/
    #((?:
    #    ' [^']* '
    #  | [^\s']*
    #)+)
    #/gx;

    # because of the [^\s']*, empty strings ('') end up interleaved with the real matches
    @matches = grep { length } @matches;

    Update: Fixed paste error.

Re: split on spaces, except those within quotes?
by rob_au (Abbot) on Nov 12, 2002 at 03:58 UTC
    While an answer has already been provided, with very good suggestions from Kanji and cLive ;-), it would be remiss for the module Text::xSV written by our very own tilly not to be mentioned. This module provides an excellent interface for reading character separated data where quoted data may include character separators.

    Additional information can also be found in the thread starting here.

     

    perl -e 'print+unpack("N",pack("B32","00000000000000000000000111011101")),"\n"'

Re: split on spaces, except those within quotes?
by Aristotle (Chancellor) on Nov 15, 2002 at 16:14 UTC
    Here's a rather simple regex I came up with for mp3uncue:

    my @words = /"?((?<!")\S+(?<!")|[^"]+)"?\s*/g;

    This form is limited in that it neither accounts for backslashed quotes nor pays attention to single quotes, but it hardly backtracks and is pretty tidy. For simple tasks, it is nice to have.
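    A quick way to see what that pattern returns is to run it on a double-quoted sample. The input string and the variable $line below are assumptions made for illustration only.

```perl
use strict;
use warnings;

my $line = 'a "b c" d';   # illustrative input, double quotes only

# The regex from above, bound to $line instead of $_.
my @words = $line =~ /"?((?<!")\S+(?<!")|[^"]+)"?\s*/g;

print join('|', @words), "\n";
```

    Note that the capture excludes the surrounding double quotes, so the quoted field comes back without them.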

    Makeshifts last the longest.

Re: split on spaces, except those within quotes?
by shaq the foo (Initiate) on Nov 12, 2002 at 18:17 UTC
    I would use a negative look ahead:

    my @arr = split /\s(?!\w+')/, $string;