Re^2: Dynamic Regular Expression

Replies are listed 'Best First'.

Re^3: Dynamic Regular Expression
by Marshall (Canon) on Jan 06, 2021 at 23:12 UTC

Since you are expressing interest, below is some code to demo the idea. I didn't worry about writing "the best" regex, I just typed in something that I knew would work on the first shot.

An advantage of this type of approach is that you can focus on the capturing regex and make sure that it really is capturing what you want. Also, the number of "valid terms" (size of the hash) doesn't matter performance-wise. If you want case-insensitive matching, just put the upper (or lower) case token in the hash and force uc() or lc() before doing the lookup.
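
As a rough sketch of the case-insensitive variant described above: the keys are stored in upper case once, and each candidate is upper-cased before the lookup. The names %valid and is_valid_token and the mixed-case token list are made up for this illustration, not part of the demo code.

```perl
use strict;
use warnings;

# Hypothetical token list; every key is folded to upper case once, up front.
my %valid = map { uc($_) => 1 } qw(K1 asc Geb N1);

# True if $tok is a known token, ignoring case.
sub is_valid_token {
    my ($tok) = @_;
    return exists $valid{ uc($tok) };
}

for my $tok (qw(k1 ASC geb2)) {
    printf "%-5s %s\n", $tok, is_valid_token($tok) ? 'valid' : 'not valid';
}
```

The capturing regex stays exactly as it was; only the lookup changes.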

As I said before, dynamic regex is a cool technique. I have one program that tests whether B is "close enough" to A to be considered a "match". I analyze A and build a regex with anywhere from 3 to a dozen or more terms in it, depending upon some complicated rules. The matching rules were so complex that I couldn't figure out a single general-purpose regex that would work for any A, but for a specific A, I could write a program that builds exactly the regex needed for that particular A value. This worked out far better than the algorithms that generate a number between 0 and 1 as a measure of "closeness". But writing, testing, and benchmarking code like that is non-trivial. That is an example of "making something very hard, possible".

I think that you will find that the hash lookup is so fast that its cost doesn't matter.

Update: I guess all of this is a long-winded way of saying that I don't think that a dynamic regex is the right tool for your job. That is different from saying that it won't work. It will work. But it injects a lot of complexity into the code, which means more ways to go wrong and code that is harder to test.
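
For comparison, here is roughly what the dynamic-regex version of this job could look like. This is only a sketch under my own assumptions: @tokens, $alt, $re and first_valid_pair are names invented for the illustration, and the lookbehind is my attempt to keep "XK1_foo" from matching as K1.

```perl
use strict;
use warnings;

# Assumed token list, same values as the hash in the demo code.
my @tokens = qw(K1 ASC GEB N1);

# Build the alternation at run time. quotemeta() protects against tokens
# containing regex metacharacters; sorting longest-first keeps a short
# token from shadowing a longer one that begins with it.
my $alt = join '|',
          map  { quotemeta($_) }
          sort { length($b) <=> length($a) } @tokens;

# The lookbehind stops a token from matching in the middle of a longer word.
my $re = qr/(?<![A-Za-z0-9])($alt)_([A-Za-z0-9]+)/;

# Returns ($tok, $value) for the first valid pair, or an empty list.
sub first_valid_pair {
    my ($line) = @_;
    my @pair = $line =~ $re;
    return @pair;
}

my ($tok, $value) = first_valid_pair(' 32  K1_XXX bbb  _xxx  ASC_ccc');
print "Valid TOKEN/VALUE pairs: $tok $value\n" if defined $value;
```

Note that this is not quite the same program: it finds the first valid pair anywhere in the line, while the hash version looks only at the first XXX_YYY of any kind and then accepts or rejects it. Small differences like that, plus the quoting, ordering and word-boundary details, are the extra complexity I mean.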

use strict;
use warnings;

# init the "valid hash"
# in a real program, K1 etc. would probably
# come from some configuration file

my %hash = map { $_ => 1 } qw(K1 ASC GEB N1);

# I use a simple fixed regex to pick up first XXX_YYY
# in the line. If there are multiple of these things
# per line, then there are a couple of ways to handle
# that. I'd have to see some actual real data to make
# a recommendation. Note that \w includes the underscore,
# so I did something simple to exclude the _ character.

# To check if match worked, you can put whole regex
# line in the "if" clause, or just check "definedness"
# like I do below. If $tok is defined in the hash,
# then it is a "good" one. 

while (my $line = <DATA>)
{
   my ($tok,$value) = $line =~ m/([A-Za-z0-9]+)_([A-Za-z0-9]+)/;
   
   if ( defined $value and defined $hash{$tok})
   {
      print "Valid TOKEN/VALUE pairs: $tok $value\n";
   }
}   

=prints
Valid TOKEN/VALUE pairs: K1 XXX
Valid TOKEN/VALUE pairs: GEB ZZZ
=cut

__DATA__
 32  K1_XXX bbb  _xxx  ASC_ccc
STUFF_YYY;
   M1_bbb;
GEB2_NO
GEB_ZZZ
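
If there can be several XXX_YYY pairs per line, one possible way to handle it (without having seen real data, so treat it as a sketch) is to run the same fixed regex with /g in a while loop and check each capture against the hash. The names %valid and valid_pairs are invented for this illustration.

```perl
use strict;
use warnings;

# Assumed token list, same values as the demo code above.
my %valid = map { $_ => 1 } qw(K1 ASC GEB N1);

# Returns every valid [token, value] pair found in the line, in order.
sub valid_pairs {
    my ($line) = @_;
    my @found;
    # With /g in scalar context, each iteration resumes after the last match.
    while ( $line =~ /([A-Za-z0-9]+)_([A-Za-z0-9]+)/g ) {
        push @found, [ $1, $2 ] if exists $valid{$1};
    }
    return @found;
}

for my $pair ( valid_pairs(' 32  K1_XXX bbb  _xxx  ASC_ccc') ) {
    print "Valid TOKEN/VALUE pairs: $pair->[0] $pair->[1]\n";
}
```

On the first line of the sample data this reports both K1 XXX and ASC ccc, where the single-match version above reports only K1 XXX.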