Re: Need Set-like Regex for Perl and Java (golf OK)
by diotalevi (Canon) on Jun 08, 2004 at 22:37 UTC
|
You don't have to be especially clever to write something maintainable and cross-platform. If you remove the white-space you can probably use this in Java or POSIX or whatever regex engine you care.
/ A000
| A1[01][0-9]
| A12[0-3]
| A[23][0-9]{2}
| A4[0-4][0-9]
| A45[0-9]
| A999
| B000
| B789
| B79[0-9]
| B8[0-7][0-9]
| B88[0-8]
/x
| [reply] [d/l] |
|
|
If this were a Perl-only solution, or if the regex rarely changed, I would code it with a simple naive list, and then pass it once through Regex::Presuf for a little more efficiency. That module then figures out what common prefixes or suffixes can be coded more cleverly.
-- [ e d @ h a l l e y . c c ]
| [reply] |
|
|
Oh that's an even better idea. I did that module's work manually to come up with that regex.
| [reply] |
Re: Need Set-like Regex for Perl and Java (golf OK)
by Enlil (Parson) on Jun 08, 2004 at 22:41 UTC
|
If you don't need a single regex to do all the work then something like the following might prove useful: use strict;
use warnings;
my $string = 'A000 A123 A567 B789 B888 B999 B000 A998 A999';
my $regex = qr/((A|B)(\d{3}))/;
while ( $string =~ /$regex/g ) {
if ( $2 eq 'A' ) {
if ( $3 == 000 or
($3 >= 123 and $3 <= 456) or $3 == 999 ) {
print "$1 valid string\n"; }
else { print "$1 rejected invalid range\n" }
}
elsif ( $2 eq 'B' ) {
if ( $3 == 000 or ($3 >= 789 and $3 <= 888) ) {
print "$1 valid string\n";
}
else { print "$1 rejected invalid range\n" }
}
else { print "$3 seems valid to me\n" }
}
update:
something else along the same lines: while ( $string =~ /((?:A|B)\d{3})/g ) {
if ( $1 eq 'A000' or
( $1 ge 'A123' and $1 le 'A456' ) or
( $1 ge 'A999' and $1 le 'B000' ) or
( $1 ge 'B789' and $1 le 'B888' )
) {
print "$1 matched\n";
}
else { print "$1 did not match\n" }
or if you are going only for regex only solutions according to this the following regex should be valid in both:
/(A(?:000|(?:1(?:2[0-3]|[01]\d))|[23]\d{2}|4(?:[0-4]\d|5[0-6])|999)|B(
+?:000|789|79\d|(?:8[0-7]\d|88[0-8]+)))/
-enlil | [reply] [d/l] [select] |
Re: Need Set-like Regex for Perl and Java (golf OK)
by Zaxo (Archbishop) on Jun 09, 2004 at 05:06 UTC
|
I'm not much in favor of regexes to solve problems like this in one swell foop. So long as the number of elements is sane, I'd build a hash and check for existence of a key. A regex is used to recognise ranged data and parse the boundary elements:
#!/usr/bin/perl
my $set = [qw/A000 A123-A456 A999-B000 B789-B888/];
my %elements;
for (@$set) {
/-/ or $elements{$_} = undef or next;
if ( /([[:alpha:]]\d{3})-([[:alpha:]]\d{3})/ ) {
my $start = $1;
do { $elements{$start} = undef }
while $start++ ne $2;
}
}
while (<DATA>) {
chomp;
print "$_ is ",
(exists $elements{$_} ? '' : 'not '),
'in the set.',
$/;
}
__DATA__
A000
A001
A122
A123
A124
A154
A320
A455
A456
A457
A530
A779
A932
A998
A999
B000
B001
B123
B666
B788
B789
B790
B820
B887
B888
B889
B900
That's a typical trade of memory for speed. The loop populating the %elements hash makes use of magical string increment to roll A over to B.
I should call attention to the difference between
do { $elements{$start} = undef } while $start++ ne $2;
and
$elements{$start} = undef while $start++ ne $2;
The latter will test and increment $start before any processing is done, while the do {} while cond; construction does the first pass unconditionally.
There is also a tricky bit of logic in the single element handling. . . . or next; is used to go to the next iteration unconditionally because the preceeding expression is always false.
| [reply] [d/l] |
|
|
You could initialize your %elements has quite a bit more easily:
my @set = ('A000','A123'..'A456','A999'..'B000','B789'..'B888');
my %elements;
@elements{@set} = (undef)x@set;
The PerlMonk tr/// Advocate
| [reply] [d/l] |
Re: Need Set-like Regex for Perl and Java (golf OK)
by BrowserUk (Patriarch) on Jun 09, 2004 at 05:48 UTC
|
There should be no problem coding this in Java as it only use the most elementary regex and a long string. It may even be a bit quicker than a large regex alternation.
#! perl -slw
use strict;
my $input = 'A000,A123-A456,A999-B000,B789-B888';
my $set = join'', map{
m[ ( [A-Z] \d{3} ) (?:- ( [A-Z] \d{3} ) )?]x;
$1 .. $2||$1
} split ',', $input;
for my $testVal ( 'A100', 'A234', 'B000',
'B789 does match', 'B788 but not this' ){
## The test
if( $testVal =~ m[([A-Z]\d{3})] and 1+index( $set, $1 ) ) {
print "'$testVal' contains a match for '$input'";
}
else {
print "No match for '$testVal'";
}
}
__END__
P:\test>362578
No match for 'A100'
'A234' contains a match for 'A000,A123-A456,A999-B000,B789-B888'
'B000' contains a match for 'A000,A123-A456,A999-B000,B789-B888'
'B789 does match' contains a match for 'A000,A123-A456,A999-B000,B789-
+B888'
No match for 'B788 but not this'
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
| [reply] [d/l] |
Re: Need Set-like Regex for Perl and Java (golf OK)
by BrowserUk (Patriarch) on Jun 08, 2004 at 22:57 UTC
|
I'm way out of date with Java, but it seems much more likely that it will have a class available that would been sufficiently equivalent to perls hashes for this purpose (Set, Map or Bag spring to mind), than it will have equivalent features in it's regex engine to perl's (?{...}) (??{...}) or (?()..|...) extensions.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
| [reply] [d/l] |
Re: Need Set-like Regex for Perl and Java (golf OK)
by davidj (Priest) on Jun 08, 2004 at 22:42 UTC
|
I'm a little fuzzy on exactly what it is you want to match. In the example set you gave
[A000,A123-A456,A999-B000,B789-B888]
are you looking to match A000 and A456-A999, but NOT A999-B000? If you want the second one (A456-A999) matched, do you want it matched as A456 and A999 or as A456-A999?
davidj | [reply] [d/l] |
|
|
| [reply] [d/l] |
Regex is possible but I prefer a hash.
by Anonymous Monk on Jun 09, 2004 at 23:31 UTC
|
What does "golf OK" mean?
I would not accomplish this via a regexp because using a hash is much easier. It is not an overly complex regex, but it takes patience and attention to detail to maintain. Assuming we want to match A000, A123-A456, A999-B000, B789-B888 inclusively, I would run the following code with:
prompt> matchrange.pl | more
#!/usr/bin/perl
use warnings;
use strict;
foreach ('A000' .. 'B999') {
print "$_ ";
if (/^(?:
A(?:000|12[3-9]|1[3-9]\d|[2-3]\d\d|4[0-4]\d|45[0-6]|999)
|
B(?:000|789|79\d|8[0-7]\d|88[0-8])
)$/x) {
print "match\n";
} else {
print "does not match\n";
}
}
The code could be optimized more by adding grouping non-remembering parentheses for A around the 100 section and "factoring" out the 1 and around the 400 section and "factoring" out the 4 and for B around the 700s and 800s. I decided not to do this because it would hamper maintainability and comprehensibility.
Furthermore, each alternative should be inserted left to right from largest range covered to smallest range. Hence, for A the 000 and 999 should come last; whereas the 200-399 range should come first. However, this makes the regex harder to read. I decided to keep the alternates in numerical order.
I would use a hash if the dataset were not too big (large being relative to the amount of RAM you have):
my %match;
my @num = ('A000','A123'..'A456','A999','B000','B789'..'B888');
@match{ @num } = ();
foreach my $key('A000' .. 'B999') {
print "$key ";
if ( exists $match{$key} ) {
print "match\n";
} else {
print "does not match\n";
}
}
| [reply] [d/l] [select] |
|
|
What does "golf OK" mean?
It means it's OK to come up with a golfed solution.
-QM
--
Quantum Mechanics: The dreams stuff is made of
| [reply] |
|
|
The hash is a nice solution if the number and size of the ranges aren't to big, but it could consume a substantial amount of memory.
My "big string" method will be substantially slower, but it would use considerably less memory than a hash.
Also, Java's hash equivalent aren't nearly so convenient as perl's, but the big string would translate directly.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
| [reply] |