Determining what part of a regex matched.

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a question that's making me feel like a newbie, but I'm stumped and now seeking wisdom.

I'm trying to transform a lot of text using a stream of tokens. To do so I have to tokenize a string using a single regular expression, built by concatenating smaller regular expressions, in order to strip the first/next token from the string and pass it to a handler. Here is a simplified version of what I mean...

$regex = '\w+|\d+|\s+|.*?';
$text = 'The world is foo 2!';

while ($text=~s/^$regex//) {
    print "token: $1\n";
}

What I'm stumped on is the *best way* to determine what part of the regular expression the current token matched -- thereby telling me the type of token and which handler I should pass it to.

I'm trying to refrain from using ?{ } to set a key I can use to call the handler. I've had my share of issues with scripts that have used that construct. Doing a second match seems rather inefficient also. I'm hoping there is another solution that I'm just not grasping. Can anyone offer any suggestions that may lift the veil of ignorance from my eyes?


Re: Determing what part of a regex matched. by Enlil (Parson) on Mar 06, 2003 at 01:09 UTC
(Original code, since struck out:)

$regex = '(\w+)|(\d+)|(\s+)|([^\w\d\s]+)';
$text = 'The world is foo 2!';

while ( $text=~s/^$regex// ) {
    print '\w+',$/ if $1;
    print '\d+',$/ if $2;
    print '\s+',$/ if $3;
    print '[^\w\d\s]+',$/ if $4;
}

(Updated code:)

$regex = '(\w+)|(\d+)|(\s+)|([^\w\d\s]+)';
$text = 'The world is foo 2!';

while ( $text=~s/^$regex// ) {
    print '\w+',$/ if defined $1;
    print '\d+',$/ if defined $2;
    print '\s+',$/ if defined $3;
    print '[^\w\d\s]+',$/ if defined $4;
}

You could do something like the above, where `$1 .. $4` tell you which set of parens matched. Note that I changed the `.*?` to `[^\w\d\s]+`, as I assumed this is what you meant: `.*?` will try to match nothing first, which always succeeds, so the s/// replaces nothing with nothing. So your code was probably looping forever. Also, the $1 was not printing because there were no capturing parens in your regular expression.

update: struck out the old code and updated it as per Abigail-II's suggestion below.

-enlil
Re: Determing what part of a regex matched. by Abigail-II (Bishop) on Mar 06, 2003 at 01:15 UTC
You'd have to use `defined $1` etc.; otherwise, it'll fail to deal with `" 0 "`.

Abigail
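A minimal sketch of why the truthiness test goes wrong, assuming a three-alternative regex similar to the one above. The helper name `first_token_type` and its flag argument are made up for this illustration. The string `"0"` is a perfectly good match but is false in boolean context, so only `defined` notices it.

```perl
use strict;
use warnings;

# Hypothetical helper: classify the first token of $text, testing the
# capture groups either by truthiness or with defined().
sub first_token_type {
    my ($text, $use_defined) = @_;
    return unless $text =~ /^(?:([A-Za-z_]+)|(\d+)|(\s+))/;

    my @groups = ($1, $2, $3);
    my @names  = ('word', 'number', 'space');
    for my $i (0 .. $#groups) {
        my $hit = $use_defined ? defined($groups[$i]) : $groups[$i];
        return $names[$i] if $hit;
    }
    return 'none';    # the regex matched, but the test failed to see which group
}

print first_token_type('0 apples', 0), "\n";    # truthiness test
print first_token_type('0 apples', 1), "\n";    # defined() test
```

With the truthiness test the `"0"` token falls through every branch; with `defined` it is reported as a number.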
Re: Re: Determing what part of a regex matched. by Anonymous Monk on Mar 06, 2003 at 03:37 UTC
Yes, you're right. Shame on me for writing a quick example without running it first. An interesting thought. Wouldn't it be kind of brittle, though? If the regex introduced another (), all of the numbers would have to be shifted. My regexes are more complex than these and occasionally have a few parens in them.
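On the renumbering worry: Perl 5.10 and later (released after this thread) offer named captures, which sidestep the problem because `%+` only contains the named groups that actually matched. The following is a sketch under that assumption. The group names, the optional decimal part in `number`, and the helper `next_token` are invented for the example.

```perl
use strict;
use warnings;
use 5.010;    # named captures and %+ need Perl 5.10 or later

# One named group per token type. Extra parens inside an alternative
# do not disturb the names, so nothing has to be renumbered.
my $token_re = qr/
      (?<number> \d+ (?:\.\d+)? )
    | (?<word>   [A-Za-z_]+ )
    | (?<space>  \s+ )
    | (?<other>  [^\w\s]+ )
/x;

# Hypothetical helper: strip the next token from the front of $$text_ref
# and return (type, token), or an empty list when nothing matches.
sub next_token {
    my ($text_ref) = @_;
    return unless $$text_ref =~ s/^(?:$token_re)//;
    my ($type) = keys %+;    # only the group that matched is present
    return ($type, $+{$type});
}

my $text = 'The world is foo 2!';
while (my ($type, $token) = next_token(\$text)) {
    print "$type: '$token'\n";
}
```

The type name can then be used directly as a key into a hash of handlers. Note the `(?:...)` around the interpolated pattern, so that `^` applies to every alternative and not just the first.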
Re: Determing what part of a regex matched. by Abigail-II (Bishop) on Mar 06, 2003 at 01:03 UTC
I'd use separate regexes. But if you insist on using a single regex, `(?{ })` is the way to go. I don't understand this "I want to do X. I know it can be done using Y or Z, but I don't want to do either Y or Z. How do I do X?". What if, like in this case, both Y and Z are good ways to do X?

Abigail
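For reference, one way the single-regex `(?{ })` approach could look. This is only a sketch: it relies on `$^R`, which perlvar describes as the result of the last successful `(?{ code })` assertion, and the helper name `next_token` and the four token classes are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return (type, token) for the token at the current
# pos() of $$text_ref, or an empty list if nothing matches there.
sub next_token {
    my ($text_ref) = @_;
    return unless $$text_ref =~ m/
        \G (                              # capture group 1 is the token text
              [A-Za-z_]+  (?{ 'word'   })
            | \d+         (?{ 'number' })
            | \s+         (?{ 'space'  })
            | [^\w\s]+    (?{ 'other'  })
        )
    /xgc;
    return ($^R, $1);    # value of the code block that ran last
}

my $text = 'The world is foo 2!';
while (my ($type, $token) = next_token(\$text)) {
    print "$type: '$token'\n";
}
```

Each code block sits at the very end of its alternative, so it only runs once that alternative has consumed its token. The pattern is a literal with no interpolated variables, which avoids the need for `use re 'eval'`.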
Re: Re: Determing what part of a regex matched. by Anonymous Monk on Mar 06, 2003 at 03:31 UTC
Thanks everyone for your responses. To clarify, I didn't say that I will not use Y or Z. If these are my only two choices I'll pick one and go with it. I was hoping (and desperately seeking) that there is a happy medium that I'm missing, because both have their issues. As I understand it, using multiple regexes to evaluate one string is less efficient than one. Besides my having had some bad experiences with ?{ }, having it marked as "highly experimental, and may be changed or deleted without notice" in the Perl docs doesn't inspire confidence.
Re: Determing what part of a regex matched. by Abigail-II (Bishop) on Mar 06, 2003 at 06:57 UTC
"As I understand it, using multiple regexes to evaluate one string is less efficient than one."

Well, it would be comparing multiple simple regexes that won't backtrack versus a single more complex one that will often backtrack. So, while it might be less efficient, it won't be as bad as you think it is. Besides, do you really have to worry about this? Are you doing the parsing in a tight loop? Did you benchmark the two alternatives? You didn't show the code of both ways; did you actually try them? Is the rest of your program finished, so that peephole optimizations are now being called for?

Abigail
Re: Determing what part of a regex matched. by dws (Chancellor) on Mar 06, 2003 at 05:54 UTC
Destructive tokenization via s/// can get expensive on large strings. An alternative that you might find worth checking out is to use m//gc (with /c, a failed match leaves pos() where it was instead of resetting it) along with \G, which anchors the match at the position where the last successful match ended. See perlop for an example.
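A small sketch of what /c changes, assuming a toy input string. The helper name `pos_after_failed_match` is invented for the demonstration: it matches a word, then deliberately attempts a match that fails, and reports where pos() ends up.

```perl
use strict;
use warnings;

# Hypothetical helper: match a word, then attempt a number match that
# must fail (a space comes next), with or without the /c modifier.
sub pos_after_failed_match {
    my ($keep_pos) = @_;
    my $text = 'foo 42';

    $text =~ /\G[A-Za-z_]+/gc or die "expected a word\n";    # pos() is now 3

    my $ok = $keep_pos
        ? scalar($text =~ /\G\d+/gc)    # fails, pos() is kept
        : scalar($text =~ /\G\d+/g);    # fails, pos() is reset
    die "match unexpectedly succeeded\n" if $ok;

    return pos($text);
}

for my $keep_pos (1, 0) {
    my $pos = pos_after_failed_match($keep_pos);
    printf "%-11s pos = %s\n",
        ($keep_pos ? 'with /c:' : 'without /c:'),
        (defined $pos ? $pos : 'undef');
}
```

Keeping pos() after a failure is what lets a tokenizer try one pattern after another at the same offset without copying or modifying the string.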
Re: Determing what part of a regex matched. by tall_man (Parson) on Mar 06, 2003 at 01:02 UTC
The `$+` is handy for this sort of thing. See perlvar. (You need capturing parentheses on each alternative if you do it that way, though.) Here's an example I did once:

push(@wds, $+) while $ln =~ m/(\d[.@]\d) | (\{[^}]+\}) | (\d) | (\w+\[[^\]]*\]) | (\w+) | ([\S])/xg;
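`$+` gives the text of the group that matched. Its companion `$#-` (also documented in perlvar, as the last matched subgroup) gives that group's number, which is closer to what was asked. A sketch, assuming one capturing group per alternative and no nested capturing parens. The `@types` table and the helper `classify` are made up for the example.

```perl
use strict;
use warnings;

# Index in this table = capture group number (there is no group 0).
my @types = (undef, 'word', 'number', 'space', 'other');

# Hypothetical helper: return (type, token) for the first token of $text.
sub classify {
    my ($text) = @_;
    return unless $text =~ /^(?:([A-Za-z_]+)|(\d+)|(\s+)|([^\w\s]+))/;
    my $group = $#-;    # number of the last group that took part in the match
    return ($types[$group], $+);
}

for my $sample ('foo bar', '42 foo', '  x', '!?') {
    my ($type, $token) = classify($sample);
    print "$type: '$token'\n";
}
```

This still depends on group numbering, so it shares the brittleness raised earlier in the thread: an extra capturing group inside an alternative would shift the table.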
Re: Determing what part of a regex matched. by Abigail-II (Bishop) on Mar 06, 2003 at 01:06 UTC
Why so complicated? If you want to capture what was matched, just put a single set of parens around the entire regex. But I didn't get the impression that that was the question. I think the question was, what part of the regex matched.

Abigail
Re: Determing what part of a regex matched. by blakem (Monsignor) on Mar 07, 2003 at 10:25 UTC
Here is how I would tokenize it... note that \d is a subset of \w, so any tokenizer that uses both is probably broken.

#!/usr/bin/perl -wT
use strict;

my $text = 'The world is foo 2!';
my (@words,@numbers,@spaces,@others);
while((pos($text)||0) < length($text)) {
    if ($text =~ /\G([a-zA-Z_]+)/gc) {
        push @words, $1; # or call whatever handler you want
    } elsif ($text =~ /\G(\d+)/gc) {
        push @numbers, $1;
    } elsif ($text =~ /\G(\s+)/gc) {
        push @spaces, $1;
    } elsif ($text =~ /\G([^\w\s]+)/gc) {
        push @others, $1;
    } else {
        warn "tokenizer is broken\n";
        last; # pos() cannot advance from here, so stop instead of looping forever
    }
}
print "W: @words\n";
print "N: @numbers\n";
print "S: @spaces\n";
print "O: @others\n";

__END__
W: The world is foo
N: 2
S:        
O: !

-Blake
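Since the original goal was to hand each token to a handler, the if/elsif chain above could also be driven by a table. This is a sketch of that variation, not blakem's code: the `@rules` list, the `tokenize` function and the handler hash are all hypothetical names.

```perl
use strict;
use warnings;

# Ordered list of [type, pattern]. Order matters: the first rule that
# matches at the current position wins.
my @rules = (
    [ word   => qr/[A-Za-z_]+/ ],
    [ number => qr/\d+/        ],
    [ space  => qr/\s+/        ],
    [ other  => qr/[^\w\s]+/   ],
);

# Hypothetical driver: call $handlers->{type}->(token) for each token.
# Token types without a handler are skipped.
sub tokenize {
    my ($text, $handlers) = @_;
    pos($text) = 0;
    TOKEN: while (pos($text) < length($text)) {
        for my $rule (@rules) {
            my ($type, $re) = @$rule;
            if ($text =~ /\G($re)/gc) {
                $handlers->{$type}->($1) if $handlers->{$type};
                next TOKEN;
            }
        }
        die "no rule matches at offset " . pos($text) . "\n";
    }
    return;
}

my %seen;
tokenize('The world is foo 2!', {
    word   => sub { push @{ $seen{word}   }, $_[0] },
    number => sub { push @{ $seen{number} }, $_[0] },
    other  => sub { push @{ $seen{other}  }, $_[0] },
});
print "$_: @{ $seen{$_} }\n" for sort keys %seen;
```

Adding a token type then means adding one rule and one handler, and parens inside an individual pattern cannot affect the others because each rule is matched on its own.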