DeusVult has asked for the wisdom of the Perl Monks concerning the following question:

I am doing a little parsing, and I want to do a multi-pass split on a piece of data. The first split will be on whitespace, on the idea that anything separated by whitespace is guaranteed to be a separate token. However, tokens may also be not separated by whitespace (ooh, the abuse to grammar). For example, if my input were:

"x:= y + z;" <-- gotta love the pascal reference

Then the tokens would be x, :=, y, +, z, and ;. (Btw, I'm not actually parsing pascal, but I just wanted to use something most people would probably recognize/understand).

But if I split on whitespace, the contents of my array would be:

( "x:=", "y", "+", "z;" )

So I need to run a second split on each element of the array, in such a fashion that it will split the letters/numbers from the punctuation (or at least I think that's what I need to do, but I'm not dedicated to the idea philosophically--if someone comes up with a one pass solution, I'll be perfectly happy to use it).

A few wrinkles
  1. Since this is a parse, I need to preserve the order. "x" must come before ":=", but both must stay before "y".
  2. I need to be able to specify the specific types of punctuation I want to split on. So hello(there) must become "hello", "(", "there", ")", but hello/there/local/bin must remain a single token. If anyone knows how I can pull this off, I will be highly impressed.

Thank you all yet again.

Some people drink from the fountain of knowledge, others just gargle.

Replies are listed 'Best First'.
Re: Splitting inside an array
by kschwab (Vicar) on Feb 05, 2001 at 22:04 UTC
    split() supports this type of thing. From the docs:

    If the PATTERN contains parentheses, additional array elements are created from each matching substring in the delimiter.

    split(/([,-])/, "1-10,20", 3); produces the list value

        (1, '-', 10, ',', 20)

    So...for your example "x := y + z;":

    $_ = "x:= y + z;";
    for (split(/(:=|\+|;)/)) { print "[$_]\n"; }
    Produces:
    [x]
    [:=]
    [ y ]
    [+]
    [ z]
    [;]
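The output above still carries the surrounding whitespace inside some tokens. A minimal sketch of one way around that, assuming you split on whitespace first and then on the captured operators within each chunk. The sub name tokenize and the operator list (:=, +, ;, parentheses) are illustrative assumptions, not something from the split docs:

```perl
use strict;
use warnings;

# Two-pass split: whitespace first, then captured operators inside each chunk.
# The operator list here is an assumption chosen to fit the examples in this thread.
sub tokenize {
    my ($line) = @_;
    return grep { length }                # drop the empty field left when a chunk starts with an operator
           map  { split /(:=|[+;()])/ }   # the capture group keeps each operator as its own element
           split ' ', $line;              # split on runs of whitespace, ignoring leading blanks
}

print "[$_]\n" for tokenize("x:= y + z;");
```

Because "/" is not in the operator list, something like hello/there/local/bin stays in one piece, while hello(there) comes apart into four tokens.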
Re: Splitting inside an array
by jeroenes (Priest) on Feb 05, 2001 at 22:07 UTC
Re: Splitting inside an array
by chipmunk (Parson) on Feb 05, 2001 at 22:11 UTC
    Here's a very simple approach, that does something like what you want:
        my @tokens = split /\s*([:=;()+]+)\s*/, $string;
    However, it really sounds like you want to be using a tokenizer/parser, that starts at the beginning of the string and matches one token at a time, rather than trying to handle the whole string at once. Have you considered Parse::RecDescent, for example?
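For what "matches one token at a time" might look like without pulling in a module, here is a rough sketch using \G with the /gc modifiers, so each match has to start exactly where the previous one ended. This is not Parse::RecDescent. The sub name tokenize and the two token classes (operators, and words that may contain slashes) are assumptions made up for illustration:

```perl
use strict;
use warnings;

# Hand-rolled scanner: every match is anchored with \G at the current position,
# and /c keeps that position when a match attempt fails.
# The token classes are assumptions chosen to fit the examples in this thread.
sub tokenize {
    my ($src) = @_;
    my @tokens;
    while (1) {
        $src =~ /\G\s+/gc;                       # skip any whitespace
        last if $src =~ /\G\z/gc;                # stop at end of input
        if ($src =~ /\G(:=|[+;()])/gc) {         # operators and punctuation
            push @tokens, $1;
        }
        elsif ($src =~ /\G([\w\/]+)/gc) {        # words, with "/" allowed inside
            push @tokens, $1;
        }
        else {
            die "Unexpected character at offset " . (pos($src) || 0) . "\n";
        }
    }
    return @tokens;
}

print "[$_]\n" for tokenize("x:= y + z;");
```

Unlike a split, this complains about input it does not recognize instead of quietly folding it into a neighbouring token, which is usually what you want once the grammar grows.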
Re: Splitting inside an array
by kilinrax (Deacon) on Feb 05, 2001 at 22:10 UTC
    This looks like a job for zero-width lookahead and lookbehind assertions:
    #!/usr/bin/perl -w
    use strict;
    my $str = "x:= y + z;";
    my @tokens = split /\s+|(?<=\w)(?=(?:;|:=))/, $str;
    print join "\n", @tokens;
    I don't know if you find commented regular expressions helpful, but just in case you do:
    my @tokens = split /
        \s+           # whitespace
        |             # or
        (?<=\w)       # after a word character
        (?=           # before a .....
            (?:       # non-capturing parentheses (to prevent adding matches to the list returned)
                ;     # a semicolon
                |     # or
                :=    # colon equals; add more operators to split on after here
            )
        )
    /x, $str;
Re: Splitting inside an array
by Anonymous Monk on Feb 06, 2001 at 02:20 UTC
    Well, you can build up a regexp that recognizes all of your tokens... For example,
    $word = '\w+'; $path = "(?:/$word)+";   # double quotes, so that $word is interpolated
    and so forth, then have
    $pattern = "(?:$path|$word|$number|$operator)";
    or some such. To grab all the tokens off your data string, you could just do
    my @tokenList = ();
    while ($data =~ /($pattern)/g) { push(@tokenList, $1); }
    which will parse the string left-to-right (maintaining the order you require) pulling off individual tokens and storing them in an array. You don't even need to worry about the whitespace, because the while(//g) {} will ignore whitespace (if it's not part of your token pattern) and just grab off the tokens...

    Hope this helps,
    CJW
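Filled out into something runnable, the idea above might look roughly like this. It uses qr// rather than strings for the building blocks, and the definitions of $number and $operator are guesses, since the post leaves them unspecified; the sample $data string is made up as well:

```perl
use strict;
use warnings;

# Building blocks as compiled regexes. $number and $operator are assumed
# definitions for illustration; the post does not say what they should be.
my $word     = qr/\w+/;
my $path     = qr/(?:\/$word)+/;
my $number   = qr/\d+(?:\.\d+)?/;
my $operator = qr/:=|[+;()]/;

# Alternation order matters: the first alternative that matches wins,
# so $number is tried before the more general $word.
my $pattern  = qr/$path|$number|$word|$operator/;

my $data   = "x:= y + 42; /usr/local/bin";
my @tokens = $data =~ /($pattern)/g;   # in list context, //g returns every capture

print join(" | ", @tokens), "\n";
```

Two things to watch. The //g loop skips anything the pattern does not match, not just whitespace, so a stray character disappears silently rather than raising an error. And $path as written only matches pieces that start with a slash, so hello/there/local/bin would come out as hello followed by /there/local/bin unless the path pattern is adjusted.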

Re: Splitting inside an array
by Fastolfe (Vicar) on Feb 06, 2001 at 02:26 UTC
    I might approach this like:
    @parsed = split(/\s+|\b/, $input);
    # or
    @parsed = map { split /\b/ } split(/\s+/, $input);
      $ perl -lwe '$_="x:= y + z;"; print "|$_|" for split /\s*\b\s*/'
      |x|
      |:=|
      |y|
      |+|
      |z|
      |;|

        Is that not what was wanted?

        Sorry, I thought you were quoting my post (I used /\s*\b\s*/ first, but then changed it). The only problem with the code there is that I saw it handling this strangely:

        a := "test"; # a|:= "|test|";
        In short, any consecutive "token" characters, even separated by spaces, were caught up as the same thing, and I figured that was undesirable. The solution then seemed to be to start from breaking on spaces, and then break upon word boundaries. That seemed to get the best result, but still won't catch things like a trailing "; as separate items.

        There's always going to be special cases like this though when using a regex to do the work of a real parser...

Re: Splitting inside an array
by lemming (Priest) on Feb 05, 2001 at 22:11 UTC
    Building on your example, this will split on whitespace and some symbols:
    # the hyphen goes first in the class so it is a literal, not the range "+" to "/"
    my @array = grep { defined && length } split(/\s+|(:=)|([-+\/\\])/, $var);
    There should be a cleaner version that doesn't need the grep, but with my split it gives some null fields.
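One possible grep-free variant, offered as a sketch rather than a drop-in: instead of splitting on separators, match the tokens themselves in list context. It assumes the same symbol set as the split above (:=, +, -, / and backslash), and the sample string in $var is made up:

```perl
use strict;
use warnings;

my $var = 'x:= y + z - a/b';

# Match tokens directly, so no empty or undefined fields can appear.
my @array = $var =~ m{
      :=                            # the two-character operator
    | [-+/\\]                       # a single-character operator
    | (?: (?! := ) [^\s+/\\-] )+    # anything else, up to whitespace or an operator
}gx;

print "[$_]\n" for @array;
```

The negative lookahead in the last branch is what lets a lone ":" or "=" stay inside an ordinary token while still stopping in front of ":=".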