Pratikh has asked for the wisdom of the Perl Monks concerning the following question:

I am new to Perl and need some help. I have to create a regular expression like

^K1_(\w*\d*)$|^ASC_(\w*\d*)$|^GEB_(\w*\d*)$|N1_(\w*\d*)$.
In this expression, the user will provide the nodes (K1, ASC, GEB, N1, etc.) in a .txt file. The nodes provided by the user may change; sometimes there may be 4, 5, 6, or more. After the regex is formed I have to compare it with a nodeVal and delete the respective nodes. The nodeVal is defined as follows:
$nodeVal =~ s/^\s+|\s+$//g; # remove leading and trailing whitespace
$nodeVal =~ tr/"//d;        # remove quotation marks

Replies are listed 'Best First'.
Re: Dynamic Regular Expression (updated)
by AnomalousMonk (Archbishop) on Dec 09, 2020 at 22:45 UTC

    See also haukex's excellent article Building Regex Alternations Dynamically.

    \w*\d*
    Also note: The quoted sequence appears several times in the OPed regex example. Be aware that \d is a subset of the \w character class. This means that \w* will match any number of letters, _ (underscore) and digits and leave nothing for \d* to match. This will do no harm in the posted regex, but is a small point that should be borne in mind.

    Update: Of perhaps greater significance, also note that \w*\d* can match nothing at all. (Originally "will"; s/will/can/ per suggestion of GrandFather.) (Update: Actually, I think it's even more accurate to say that the pattern \w*\d* will always match something, even if it's only the empty string. Of course, other parts of the pattern may fail to match, preventing an overall match, but that's another story. :)
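
Both points can be checked with a minimal sketch. The two-group pattern and the sample strings below are made up for illustration and are not part of the original post:

```perl
use strict;
use warnings;

# Split the OP's (\w*\d*) into two groups to see which part consumes what.
my ($word, $digits) = 'S1_42' =~ /^(\w*)(\d*)$/;
print "word='$word' digits='$digits'\n";

# Both quantifiers may match zero characters, so a bare prefix still matches.
my $matched  = 'GEB_' =~ /^GEB_(\w*\d*)$/ ? 1 : 0;
my $captured = $1;
print "matched=$matched captured='$captured'\n";
```

Here \w* takes the whole of 'S1_42' and leaves \d* empty, and 'GEB_' matches with an empty capture. If at least one character must follow the underscore, (\w+) would express that directly.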


    Give a man a fish:  <%-{-{-{-<

      Going forward, I will keep this in mind. Thank you.
Re: Dynamic Regular Expression
by GrandFather (Saint) on Dec 09, 2020 at 22:21 UTC

    What have you tried? How did it fail? What specific detail are you having trouble with? See "I know what I mean. Why don't you?". As a starting point you may find it helps to refactor your expression as:

    use strict;
    use warnings;

    my @opts = qw(K1 ASC GEB N1);
    my $alts = join '|', @opts;
    my $re = qr{^(?:$alts)_(\w*\d*)$};

    print "Matched $1\n" if 'GEB_S1' =~ $re;
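
The refactoring above can be extended to take the node names from the user's .txt file. The following is only a sketch under assumptions: the file is taken to hold one node name per line, and an in-memory string ($file_contents, a hypothetical stand-in) replaces the real file so the example runs on its own. Applying quotemeta is an addition that guards against regex metacharacters in user-supplied names.

```perl
use strict;
use warnings;

# Stand-in for the user's .txt file: one node name per line (assumed layout).
my $file_contents = "K1\nASC\n\nGEB\nN1\n";
open my $fh, '<', \$file_contents or die "Cannot open node list: $!";

my @opts;
while (my $line = <$fh>) {
    $line =~ s/^\s+|\s+$//g;    # trim, as the OP does for $nodeVal
    next unless length $line;   # skip blank lines
    push @opts, $line;
}
close $fh;

# quotemeta escapes any regex metacharacters in the names.
my $alts = join '|', map { quotemeta } @opts;
my $re   = qr{^(?:$alts)_(\w*\d*)$};

for my $nodeVal (qw(GEB_S1 N1_7 XYZ_1)) {
    if ($nodeVal =~ $re) {
        print "$nodeVal matches, captured '$1'\n";
    }
    else {
        print "$nodeVal does not match\n";
    }
}
```

With a real file, the open line would name the file instead of \$file_contents; the rest stays the same however many nodes the file lists.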
    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
      Thank you for the code snippet and I have added it to my main program and it is working as expected. Thank you for your help.
Re: Dynamic Regular Expression
by jwkrahn (Abbot) on Dec 10, 2020 at 02:19 UTC
    ^K1_(\w*\d*)$|^ASC_(\w*\d*)$|^GEB_(\w*\d*)$|N1_(\w*\d*)$.

    In your example K1, ASC and GEB are anchored at the beginning of the string while N1 is not. How do you determine which strings should match at the beginning of the string?

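    Your pattern ends with a . which matches any character (other than a newline), but it comes after the $ end-of-string anchor, where there is nothing left for it to match, so that alternative can never succeed!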
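
Both observations above can be reproduced with a short sketch. The test strings are invented for illustration, and the pattern is built from a single-quoted string so that Perl does not try to interpolate anything in it:

```perl
use strict;
use warnings;

# The OP's pattern, copied verbatim (including the trailing dot).
my $posted_src = '^K1_(\w*\d*)$|^ASC_(\w*\d*)$|^GEB_(\w*\d*)$|N1_(\w*\d*)$.';
my $posted     = qr/$posted_src/;

# The same pattern with the trailing dot dropped; N1 is still unanchored.
(my $no_dot_src = $posted_src) =~ s/\.\z//;
my $no_dot = qr/$no_dot_src/;

my $n1_posted  = 'N1_abc'  =~ $posted ? 1 : 0;  # the dot has nothing to match
my $n1_no_dot  = 'N1_abc'  =~ $no_dot ? 1 : 0;
my $xn1_no_dot = 'XN1_abc' =~ $no_dot ? 1 : 0;  # N1 may start anywhere
my $xk1_no_dot = 'XK1_abc' =~ $no_dot ? 1 : 0;  # K1 is held at the start by ^

print "posted, N1_abc:      $n1_posted\n";
print "no dot, N1_abc:      $n1_no_dot\n";
print "no dot, XN1_abc:     $xn1_no_dot\n";
print "no dot, XK1_abc:     $xk1_no_dot\n";
```
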

      Sorry, it was my mistake. N1 is also anchored. Thank you for noticing the mistake and thank you for your help.
Re: Dynamic Regular Expression
by LanX (Saint) on Dec 09, 2020 at 22:11 UTC
    This might get you started... (demo in "reply")
    18> my @nodes = qw(K1 ASC GEB N1)
    $res[7] = [ 'K1', 'ASC', 'GEB', 'N1' ]

    19> my $regex = join '|', map { $_.'_(\w*\d*)'} @nodes
    $res[8] = 'K1_(\\w*\\d*)|ASC_(\\w*\\d*)|GEB_(\\w*\\d*)|N1_(\\w*\\d*)'

    20> map { s/^$regex$//r } qw/drivel K1_bar42 ASC_foo666 whatever/
    $res[9] = [ 'drivel', '', '', 'whatever' ]

    21>
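
One caveat when adapting the session above: in s/^$regex$//, the ^ binds only to the first alternative and the $ only to the last, because alternation has the lowest precedence. A minimal sketch of the difference, using an extra made-up input string and a non-capturing group (an addition, not part of the original demo):

```perl
use strict;
use warnings;

my @nodes = qw(K1 ASC GEB N1);
my $regex = join '|', map { $_ . '_(\w*\d*)' } @nodes;

my @input = ('drivel', 'K1_bar42', 'ASC_foo666', 'xASC_foo666!');

# Ungrouped: ASC_... can match in the middle of a string.
my @ungrouped = map { s/^$regex$//r } @input;

# Grouped: both anchors apply to every alternative.
my @grouped = map { s/^(?:$regex)$//r } @input;

# To delete whole entries rather than blank them, filter with grep.
my @kept = grep { !/^(?:$regex)$/ } @input;

print join(',', map { "'$_'" } @ungrouped), "\n";
print join(',', map { "'$_'" } @grouped), "\n";
print join(',', map { "'$_'" } @kept), "\n";
```

The ungrouped substitution turns 'xASC_foo666!' into 'x!', while the grouped one leaves it untouched. The /r modifier needs Perl 5.14 or later.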

    edit

    > I am new to Perl, ...

    see perlintro

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      Thank you for your help.
Re: Dynamic Regular Expression
by Marshall (Canon) on Dec 12, 2020 at 18:51 UTC
    This post jumped immediately to a dynamic regex solution to your problem. I have used dynamic regex before and it is a cool technique! However a hash table lookup could also be applicable?

    For this you would put the ASC_, K1_ etc. tokens as keys in a hash table. Use a fixed regex to extract the XXX_YYY patterns. Then look up whether XXX_ exists in the hash table or not. Coding depends upon what your input data looks like.

    Just a thought for you...

      I have not used the hash table lookup technique but it seems to be a good solution. Thank you for your suggestion.
        Since you are expressing interest, below is some code to demo the idea. I didn't worry about writing "the best" regex, I just typed in something that I knew would work on the first shot.

        An advantage of this type of approach is that you can focus on the capturing regex and make sure that it really is capturing what you want. Also the number of "valid terms" (size of the hash) doesn't matter performance wise. If you want case insensitive matching, just put the upper (or lower) case token in the hash and force uc() or lc() before doing the lookup.
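
The case-insensitive variant mentioned above could look roughly like this. It is a sketch with invented input lines; the names %valid and @accepted are illustrative only:

```perl
use strict;
use warnings;

# Store the valid tokens in one case (upper case here).
my %valid = map { uc($_) => 1 } qw(K1 ASC GEB N1);

my @accepted;
for my $line ('geb_zzz', 'Asc_ccc', 'stuff_yyy') {
    my ($tok, $value) = $line =~ m/([A-Za-z0-9]+)_([A-Za-z0-9]+)/;
    next unless defined $tok;    # no XXX_YYY in this line
    # Force the same case before the lookup.
    push @accepted, "$tok $value" if exists $valid{ uc $tok };
}

print "Valid TOKEN/VALUE pairs: $_\n" for @accepted;
```

Only the lookup key is upper-cased, so the token is still reported in whatever case the input used.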

        As I said before, dynamic regex is a cool technique. I have one program that tests whether B is "close enough" to A to be considered a "match". I analyze A and build a regex with anywhere from 3 to a dozen+ terms in it depending upon some complicated rules. The matching rules were so complex that I couldn't figure out a single general purpose regex that would work for any A, but with a specific A, I could write a program to control exactly a specific regex for that particular A value. This worked out far better than these algorithms that generate a number 0-1 as a measure of "closeness". But writing, testing, benchmarking code like that is non-trivial. That is an example of "making something very hard, possible".

        I think that you will find that the hash lookup is so fast that it doesn't matter.

        Update: I guess all of this is a long winded way of saying that I don't think that a dynamic regex is the right tool for your job. That is different than saying that it won't work. It will work. But that injects a lot of complexity into the code which means more ways to go wrong and harder to test.

        use strict;
        use warnings;

        # init the "valid hash"
        # in real program, probably K1 etc probably
        # come from some configuration file
        my %hash = map{$_ => 1}qw (K1 ASC GEB N1);

        # I use a simple fixed regex to pick up first XXX_YYY
        # in the line. If there are multiple of these things
        # per line, then there are a couple of ways to handle
        # that. I'd have to see some actual real data to make
        # a recommendation. Note that \w includes the underscore,
        # so I did something simple to exclude the _ character.
        # To check if match worked, you can put whole regex
        # line in the "if" clause, or just check "definedness"
        # like I do below. If $tok is defined in the hash,
        # then it is a "good" one.

        while (my $line = <DATA>)
        {
            my ($tok,$value) = $line =~ m/([A-Za-z0-9]+)_([A-Za-z0-9]+)/;
            if ( defined $value and defined $hash{$tok})
            {
                print "Valid TOKEN/VALUE pairs: $tok $value\n";
            }
        }

        =prints
        Valid TOKEN/VALUE pairs: K1 XXX
        Valid TOKEN/VALUE pairs: GEB ZZZ
        =cut

        __DATA__
        32 K1_XXX bbb _xxx ASC_ccc
        STUFF_YYY; M1_bbb; GEB2_NO
        GEB_ZZZ