Re: Determing what part of a regex matched.
by Enlil (Parson) on Mar 06, 2003 at 01:09 UTC
# Original version (struck out): tests truthiness, so a "0" token is missed
$regex = '(\w+)|(\d+)|(\s+)|([^\w\d\s]+)';
$text = 'The world is foo 2!';
while ( $text=~s/^$regex// ) {
    print '\w+',$/ if $1;
    print '\d+',$/ if $2;
    print '\s+',$/ if $3;
    print '[^\w\d\s]+',$/ if $4;
}
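# Updated version: test definedness, so a "0" token still counts as a match
$regex = '(\w+)|(\d+)|(\s+)|([^\w\d\s]+)';
$text = 'The world is foo 2!';
# (?: ) groups the alternation so that ^ anchors every alternative, not just the first
while ( $text=~s/^(?:$regex)// ) {
    print '\w+',$/ if defined $1;
    print '\d+',$/ if defined $2;
    print '\s+',$/ if defined $3;
    print '[^\w\d\s]+',$/ if defined $4;
}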
You could do something like the above, where $1 .. $4 indicate which set of parens matched. Note that I changed the .*? to [^\w\d\s]+, as I assumed this is what you meant: .*? will try to match nothing first, which always succeeds, and the s/// then replaces that empty match with nothing. So your code was probably looping forever.
Also, the $1 was not printing because there were no capturing parens in your regular expression.
update: struck out the old code and updated it as per Abigail-II's suggestion below.
-enlil
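To make the empty-match point concrete, here is a minimal sketch. The pattern is only a guess at what the original looked like (the original code is not shown in this thread), and the iteration cap is there purely so the demo terminates.

```perl
use strict;
use warnings;

# Hypothetical pattern: a \w+ alternative followed by a .*? catch-all.
my $text  = 'foo 2!';
my $count = 0;

# .*? is non-greedy, so it matches the empty string first. s/// then
# "succeeds" on every pass without consuming anything from $text.
while ( $text =~ s/^(?:\w+|.*?)// ) {
    last if ++$count >= 5;    # safety cap; without it this never ends
}

print "iterations: $count, remaining text: '$text'\n";
```

The first pass strips the leading word; every later pass removes nothing, so only the cap ends the loop.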
You'd have to use defined $1 etc.; otherwise
it'll fail to deal with " 0 ".
Abigail
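A small sketch of why the truth test fails on " 0 ". The two-alternative pattern and the variable names are made up for the demo.

```perl
use strict;
use warnings;

my $text = '0';
$text =~ /(\w+)|(\s+)/ or die "no match";

# $1 holds the string "0": defined, but false in boolean context.
my $one_is_true    = $1 ? 1 : 0;
my $one_is_defined = defined $1 ? 1 : 0;

# $2 did not take part in the match, so it is undef.
my $two_is_defined = defined $2 ? 1 : 0;

print "true: $one_is_true, defined: $one_is_defined, \$2 defined: $two_is_defined\n";
```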
Yes, you're right. Shame on me for writing a quick example without running it first.
An interesting thought. Wouldn't it be kind of brittle, though? If the regex introduced another (), all of the numbers would have to be shifted. My regexes are more complex than these and occasionally have a few parens in them.
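One way around the renumbering problem, on Perl 5.10 or later (so newer than this thread), is named captures: the %+ hash only contains the names of groups that actually took part in the match. This is a sketch under that version assumption; the group names are invented for the example.

```perl
use strict;
use warnings;

my $text = 'The world is foo 2!';
my @kinds;

# Each alternative gets a name instead of a number, so adding more
# parens elsewhere in the pattern does not shift anything.
while ( $text =~ /\G(?: (?<word>[A-Za-z_]+)
                      | (?<number>\d+)
                      | (?<space>\s+)
                      | (?<other>[^\w\s]+) )/xgc ) {
    my ($kind) = keys %+;    # exactly one named group is defined per match
    push @kinds, $kind;
}

print join( ' ', @kinds ), "\n";
```

Inner grouping that should not capture can still use (?: ), so only the alternatives you care about show up in %+.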
Re: Determing what part of a regex matched.
by Abigail-II (Bishop) on Mar 06, 2003 at 01:03 UTC
I'd use separate regexes. But if you insist on using a single
regex, (?{ }) is the way to go. I don't understand
this "I want to do X. I know it can be done using Y or Z, but I
don't want to use either Y or Z. How do I do X?". What if, like in
this case, both Y and Z are good ways to do X?
Abigail
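For reference, a rough sketch of what the (?{ }) approach could look like. This is not code from the thread: the variable name $kind and the character classes are assumptions, and a package variable is used because lexicals inside code blocks behaved unreliably on older perls.

```perl
use strict;
use warnings;

our $kind;    # package variable set from inside the regex
my $text = 'The world is foo 2!';
my @kinds;

# Each alternative ends with a code block that records which one matched.
while ( $text =~ /\G(?: [A-Za-z_]+ (?{ $kind = 'word'   })
                      | \d+        (?{ $kind = 'number' })
                      | \s+        (?{ $kind = 'space'  })
                      | [^\w\s]+   (?{ $kind = 'other'  }) )/xgc ) {
    push @kinds, $kind;
}

print join( ' ', @kinds ), "\n";
```

No capturing parens are needed here, so nothing has to be renumbered when the pattern grows. Note that a pattern interpolated from a variable would additionally need use re 'eval'.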
Thanks everyone for your responses.
To clarify, I didn't say that I will not use Y or Z. If these are my only two choices, I'll pick one and go with it. I was hoping (and desperately seeking) that there is a happy medium that I'm missing, because both have their issues. As I understand it, using multiple regexes to evaluate one string is less efficient than using one. Besides my having had some bad experiences with (?{ }), its being marked as "highly experimental, and may be changed or deleted without notice" in the perldocs doesn't inspire confidence.
"As I understand it, using multiple regexes to evaluate one string is less efficient than using one."
Well, it would be comparing multiple simple regexes that won't
backtrack versus a single more complex one that will often
backtrack. So, while it might be less efficient, it won't be
as bad as you think it is. Besides, do you really have to
worry about this? Are you doing the parsing in a tight loop?
Did you benchmark the two alternatives? You didn't show the
code for both ways; did you actually try them? Is the rest of
your program finished and peephole optimizations are now being
called for?
Abigail
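For anyone who wants to measure rather than guess, here is a minimal sketch using the core Benchmark module. The two tokenizers and the test string are stand-ins made up for the comparison, not the code under discussion, and the numbers will vary with perl version and input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $text = 'The world is foo 2! ' x 100;

# One alternation, scanned with \G and /gc. Returns the token count.
sub single_regex {
    my $t = $text;
    my $n = 0;
    $n++ while $t =~ /\G(?:([A-Za-z_]+)|(\d+)|(\s+)|([^\w\s]+))/gc;
    return $n;
}

# Four simple regexes tried in turn at the current position.
sub separate_regexes {
    my $t = $text;
    my $n = 0;
    while ( ( pos($t) || 0 ) < length $t ) {
        if    ( $t =~ /\G[A-Za-z_]+/gc ) { $n++ }
        elsif ( $t =~ /\G\d+/gc )        { $n++ }
        elsif ( $t =~ /\G\s+/gc )        { $n++ }
        elsif ( $t =~ /\G[^\w\s]+/gc )   { $n++ }
        else                             { last }    # should not happen
    }
    return $n;
}

# Run each for about one CPU second and print a comparison table.
cmpthese( -1, {
    single   => \&single_regex,
    separate => \&separate_regexes,
} );
```

Both subs should report the same token count for the same input, which is worth checking before trusting any timing.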
Re: Determing what part of a regex matched.
by dws (Chancellor) on Mar 06, 2003 at 05:54 UTC
Destructive tokenization via s/// can get expensive on large strings. An alternative that you might find worth checking out is to use m//gc (the /c keeps the current match position when a match fails, instead of resetting it) along with \G, which anchors to the position where the last successful match ended. See perlop for an example.
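A short sketch of what /c changes, using a made-up string. It records pos() after a successful match, after a failed match with /c, and after a failed match without it.

```perl
use strict;
use warnings;

my $text = 'foo 42';

$text =~ /\G[a-z]+/gc;              # matches "foo"
my $after_word = pos $text;         # 3

$text =~ /\G\d+/gc;                 # fails: the next character is a space
my $after_fail_c = pos $text;       # still 3, because of /c

$text =~ /\G\s+/gc;                 # matches the space
$text =~ /\G\d+/gc;                 # matches "42"
my $at_end = pos $text;             # 6

$text =~ /\G[a-z]+/g;               # fails, and without /c pos() is reset
my $after_fail_no_c = pos $text;    # undef

print "after word: $after_word, after failed /gc: $after_fail_c, at end: $at_end\n";
```

This is what lets a tokenizer try several patterns in turn at the same position without losing its place.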
Re: Determing what part of a regex matched.
by tall_man (Parson) on Mar 06, 2003 at 01:02 UTC
The $+ variable is handy for this sort of thing. See perlvar. (You need capturing parentheses on each alternative if you do it that way, though.)
Here's an example I did once:
push(@wds, $+) while
$ln =~ m/(\d[.@]\d) |
(\{[^}]+\}) |
(\d) |
(\w+\[[^\]]*\]) |
(\w+) |
([\S])/xg;
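$+ gives the text of the last group that matched, but not which group it was. A possible companion, sketched here with invented labels, is $#- (see @- in perlvar), which is the index of the last matched group. With exactly one capturing group per alternative, that index identifies the alternative.

```perl
use strict;
use warnings;

# Labels for groups 1..4; index 0 is unused.
my @names = ( undef, 'word', 'number', 'space', 'other' );
my $text  = 'The world is foo 2!';
my ( @kinds, @tokens );

while ( $text =~ /\G(?:([A-Za-z_]+)|(\d+)|(\s+)|([^\w\s]+))/gc ) {
    my $group = $#-;    # index of the last group that matched
    my $token = $+;     # text captured by that group
    push @kinds,  $names[$group];
    push @tokens, $token;
}

print join( ' ', @kinds ), "\n";
```

This still depends on group numbering, so any extra grouping inside an alternative should be non-capturing, i.e. (?: ).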
Re: Determing what part of a regex matched.
by blakem (Monsignor) on Mar 07, 2003 at 10:25 UTC
Here is how I would tokenize it... note that \d is a subset of \w, so any tokenizer that uses both is probably broken.
#!/usr/bin/perl -wT
use strict;
my $text = 'The world is foo 2!';
my (@words,@numbers,@spaces,@others);
# Loop until pos() reaches the end of the string; /gc keeps pos() on a failed match
while ((pos($text)||0) != length($text)) {
    if ($text =~ /\G([a-zA-Z_]+)/gc) {
        push @words, $1; # or call whatever handler you want
    } elsif ($text =~ /\G(\d+)/gc) {
        push @numbers, $1;
    } elsif ($text =~ /\G(\s+)/gc) {
        push @spaces, $1;
    } elsif ($text =~ /\G([^\w\s]+)/gc) {
        push @others, $1;
    } else {
        warn "tokenizer is broken\n";
        last; # nothing matched, so pos() cannot advance; stop instead of looping forever
    }
}
print "W: @words\n";
print "N: @numbers\n";
print "S: @spaces\n";
print "O: @others\n";
__END__
W: The world is foo
N: 2
S:
O: !
-Blake