shemp has asked for the wisdom of the Perl Monks concerning the following question:

I am working on a regex to split up a string of text with the following considerations:
  1. split on whitespace
  2. split on a digit/letter boundary
  3. split on a letter/digit boundary
This works:
my $string = "A BC 1 23DEF45 6";
my @parts = split /(?:\s+|(?<=\d)(?=[A-Z])|(?<=[A-Z])(?=\d))/i, $string;
foreach my $part (@parts) {
    print "part = $part\n";
}
And yields the result of:
part = A
part = BC
part = 1
part = 23
part = DEF
part = 45
part = 6

Which is exactly what I want.

But I was wondering whether I could shorten the regex (leaving the /\s+/ part alone) so that it says: look behind for either a digit or a letter, and if you find one, look ahead for the other kind of character.
I don't know if this is doable, but I have many uses for such a thing.
Thanks much.

Replies are listed 'Best First'.
•Re: Regex - unordered lookaround syntax
by merlyn (Sage) on Apr 28, 2003 at 22:57 UTC
    Sounds to me like you just want this, and have overcomplicated it tremendously:
    my $string = "A BC 1 23DEF45 6";
    my @parts = $string =~ /([a-zA-Z]+|\d+)/g;
    Split is wrong when it's easier to talk about what you want to keep rather than what you want to throw away. For that, use a m//g in a list context. It looks like what you wanted to keep was any run of digits, or any run of letters. Hence, mine.
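
A minimal sketch comparing the two approaches on the sample string from the question. The second input, $messy, is an invented example (not from the thread) that shows the two are not interchangeable once punctuation appears: split keeps whatever it is not told to throw away, while the match keeps only runs of letters or digits.

```perl
use strict;
use warnings;

my $string = "A BC 1 23DEF45 6";

# The original approach: describe what to throw away.
my @by_split = split /(?:\s+|(?<=\d)(?=[A-Z])|(?<=[A-Z])(?=\d))/i, $string;

# The suggested approach: describe what to keep.
my @by_match = $string =~ /([a-zA-Z]+|\d+)/g;

print "split: @by_split\n";
print "match: @by_match\n";

# Hypothetical input with punctuation, to show where the two differ.
my $messy = "AB-12, CD34";
my @split_messy = split /(?:\s+|(?<=\d)(?=[A-Z])|(?<=[A-Z])(?=\d))/i, $messy;
my @match_messy = $messy =~ /([a-zA-Z]+|\d+)/g;

print "split: ", join('|', @split_messy), "\n";
print "match: ", join('|', @match_messy), "\n";
```

On the sample string both lists hold the same seven parts. On the punctuated string the split version keeps "AB-12," as one piece, while the match version drops the hyphen and comma.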

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      "Split is wrong when it's easier to talk about what you want to keep rather than what you want to throw away." --merlyn

      I had always understood when I needed split, and when I needed match. But my brain kept these two concepts completely separate for a while. Then I had one of those eureka moments when something explained the magical syntax split // to split the string into solo characters. What a weird special-case, I had thought before. Now it seems so logical.

      It makes sense that if s//-/g would insert dashes between each character, and m//g would happily return an array of nothings for each character, that split // should return the array of each character between all those nothings.
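
A small sketch of the three operations side by side, assuming a fresh script and the sample word "abc" (an invented input). An explicit empty group (?:) stands in for the bare empty pattern in the substitution and the match, because a bare // there reuses the last successfully matched pattern if one exists; split // has no such quirk. Note that the substitution also places a dash at both ends of the string, not only between characters.

```perl
use strict;
use warnings;

my $word = "abc";

# Substitute at every empty match: before, between, and after the characters.
(my $dashed = $word) =~ s/(?:)/-/g;

# Collect every empty match in list context: one per position, 0 through 3.
my @nothings = $word =~ /(?:)/g;

# Split on the empty pattern: the characters between those empty matches.
my @chars = split //, $word;

print "dashed:   $dashed\n";
print "nothings: ", scalar(@nothings), " empty strings\n";
print "chars:    @chars\n";
```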

      --
      [ e d @ h a l l e y . c c ]

        But have you ever stopped to wonder why you don't get an infinite loop at the first empty space? (This is explained in perlre, but virtually nobody understands the explanation.)
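
One way to watch this happen, as a rough sketch on an invented two-character string: record pos() after each successful empty match. Perl will not accept a second zero-length match at the position where the previous match ended, so the search moves on one position each time and the loop terminates.

```perl
use strict;
use warnings;

my $text = "ab";
my @positions;

# Each iteration matches the empty string; pos() reports where it matched.
while ($text =~ /(?:)/g) {
    push @positions, pos($text);
}

print "empty matches at: @positions\n";
```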
      I think you are correct. I need to leave right now, but that looks much better.
      You always help me with what I consider to be pretty bizarre problems.
      Thanks much.

      BTW: screw intel
Re: Regex - unordered lookaround syntax
by runrig (Abbot) on Apr 28, 2003 at 23:27 UTC
    This loses on your golf requirement, but it's more correct in locale-specific environments:
    my $str = " abc def123abc def";
    print "[$_]\n" for split /
        \s+ |
        (?<=[[:alpha:]])(?=[[:digit:]]) |
        (?<=[[:digit:]])(?=[[:alpha:]])
    /ix, $str;
    I think doing what you ask is somewhat doable, but it would just be more of a mess. And turning the problem inside-out as merlyn suggests is a much better answer anyway :-)
Re: Regex - unordered lookaround syntax
by The Mad Hatter (Priest) on Apr 28, 2003 at 22:55 UTC
    I am nowhere near being a regex guru and therefore can't answer your question, but I think I've read somewhere that using the /i modifier slows things down considerably. Apparently, using [A-Za-z] is faster than [A-Z] with the /i modifier. Just wanted to point this out in case the code is being run where performance really matters.

      Nah, take a look at that via use re 'debug'. You'll see that /(?i:[A-Z])/ is /[A-Z]/i is /[A-Za-z]/. All three are interpreted identically, and if you run the actual example through it you'll see it's the same. I didn't know the answer prior to running these through re 'debug', so I'm suggesting that great debug tools like this should be used more often, especially when making assertions regarding relative performance. In this case all you win by spelling out [A-Za-z] is some source code obfuscation, since I hold that it's easier to read either (?i:[A-Z]) (which is really nice because it restricts the effect of /i to just that section) or a tacked-on /i. Having to be extra specific just makes it easier to type another bug.
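
A rough sketch of how one might check this locally. It only confirms that the three spellings accept and reject the same handful of ASCII sample characters (the samples are invented for illustration); it does not measure speed or cover non-ASCII input. To see the compiled form of a pattern, something like perl -Mre=debug -e 'qr/[A-Z]/i' prints it to STDERR, and that output varies between Perl versions.

```perl
use strict;
use warnings;

# The three spellings discussed above.
my @patterns = (qr/(?i:[A-Z])/, qr/[A-Z]/i, qr/[A-Za-z]/);

# ASCII samples only: three letters, then a digit, an underscore, a space.
my @samples = ('a', 'Z', 'm', '5', '_', ' ');

my @results;
for my $re (@patterns) {
    # Build a string of 1s and 0s: which samples this pattern matches.
    push @results, join '', map { /^$re$/ ? 1 : 0 } @samples;
}

print "$_\n" for @results;
```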

        I think the assertion came from Jeff Friedl in this case (Mastering Regular Expressions, 1st ed, not sure about second). He says that the i modifier can be up to 20 times slower than a case specific match, if memory serves. I believe this was during Perl 5.4 though and right before a major regex overhaul. Nice to know it's no longer true.
Re: Regex - unordered lookaround syntax
by aquarium (Curate) on Apr 29, 2003 at 01:48 UTC
    I think this may be one of those times when a traversal of the string would be better than a regex... a lot simpler than that regex, and you can code some other cases to split on as well, without getting a headache from the regex.
    Chris
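
A minimal sketch of what such a traversal might look like. The helper names char_class and split_by_traversal are hypothetical, and treating every other character as neither a letter nor a digit (so it never triggers a split) is an assumption chosen to mirror the behavior of the original split.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one character.
sub char_class {
    my ($ch) = @_;
    return 'digit'  if $ch =~ /\d/;
    return 'letter' if $ch =~ /[A-Za-z]/;
    return 'space'  if $ch =~ /\s/;
    return 'other';
}

# Hypothetical helper: walk the string, cutting at whitespace and at
# letter/digit boundaries in either order.
sub split_by_traversal {
    my ($text) = @_;
    my @parts;
    my $current = '';
    my $prev    = 'space';

    for my $ch (split //, $text) {
        my $class = char_class($ch);
        if ($class eq 'space') {
            # Whitespace ends the current part and is discarded.
            push @parts, $current if length $current;
            $current = '';
        }
        else {
            # Cut only when a letter meets a digit or a digit meets a letter.
            if (length $current
                && $class ne $prev
                && $class ne 'other'
                && $prev ne 'other') {
                push @parts, $current;
                $current = '';
            }
            $current .= $ch;
        }
        $prev = $class;
    }
    push @parts, $current if length $current;
    return @parts;
}

my @parts = split_by_traversal("A BC 1 23DEF45 6");
print "part = $_\n" for @parts;
```

Adding another case to split on would mean adding a class to char_class and a condition in the loop, rather than another alternation in the regex.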