A tidier regex ?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: A tidier regex ? by ikegami (Patriarch) on Sep 14, 2010 at 14:09 UTC
What's the goal of the regex: validation, finding a match, or extracting data from a line? For extraction, the following should do: `/\b((?:CDC|DDCSMR|DDCRMR)\w*)/` For matching, it simplifies to `/\b(?:CDC|DDCSMR|DDCRMR)/`
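To make the extraction/matching distinction concrete, here is a minimal sketch using the two patterns above. The input line is made up for illustration and is not from the original question.

```perl
use strict;
use warnings;

# Hypothetical input line, invented for this example.
my $line = 'order DDCSMR42A received';

# Extraction: the capture group returns the whole token,
# i.e. the prefix plus any word characters that follow it.
my ($token) = $line =~ /\b((?:CDC|DDCSMR|DDCRMR)\w*)/;
print "extracted: $token\n";    # extracted: DDCSMR42A

# Matching: only asks whether one of the prefixes starts a word somewhere.
print "matched\n" if $line =~ /\b(?:CDC|DDCSMR|DDCRMR)/;
```

Neither pattern is anchored, so both accept the token anywhere in the line, which suits searching better than validating.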
Re^2: A tidier regex ? by Anonymous Monk on Sep 14, 2010 at 14:18 UTC
It's for validation. These don't cater for the underscores. These answers are opening up insight into Perl... I never realised it was so powerful.
Re^3: A tidier regex ? by moritz (Cardinal) on Sep 14, 2010 at 14:43 UTC
If you are validating, you should also anchor your regex to the start and end of the string (and maybe allow leading and trailing whitespace). Perl 6 - links to (nearly) everything that is Perl 6.
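A minimal sketch of what anchoring buys you, assuming the prefix-plus-`\w*` pattern from earlier in the thread stands in for the real validation rule. The variable names and sample inputs are hypothetical.

```perl
use strict;
use warnings;

# Unanchored: succeeds if the token occurs anywhere in the string.
my $loose = qr/\b(?:CDC|DDCSMR|DDCRMR)\w*/;

# Anchored: the whole string must be the token, apart from optional
# leading and trailing whitespace.
my $anchored = qr/\A\s*(?:CDC|DDCSMR|DDCRMR)\w*\s*\z/;

for my $input ('DDCSMR1', '  DDCSMR1  ', 'junk DDCSMR1 junk') {
    printf "%-22s loose=%-3s anchored=%s\n", "'$input'",
        ($input =~ $loose    ? 'yes' : 'no'),
        ($input =~ $anchored ? 'yes' : 'no');
}
```

Only the anchored pattern rejects the third input, where the token is surrounded by other text.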
Re^3: A tidier regex ? by ikegami (Patriarch) on Sep 14, 2010 at 14:39 UTC
`\w` does match underscore.
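A quick way to check this for yourself (a small sketch; the sample entry is borrowed from elsewhere in the thread):

```perl
use strict;
use warnings;

# For ASCII input, \w is equivalent to [A-Za-z0-9_], so it includes '_'.
print "underscore matches \\w\n" if '_' =~ /\A\w\z/;

# Consequently \w* runs straight through the underscores of a CDC entry.
my ($token) = 'CDC_ASD_ERTY' =~ /\b((?:CDC|DDCSMR|DDCRMR)\w*)/;
print "$token\n";    # CDC_ASD_ERTY
```

Note that `\w*` accepts the underscores but does not require exactly two of them; that needs a more specific pattern.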
Re^4: A tidier regex ? by Anonymous Monk on Sep 14, 2010 at 14:58 UTC
Re^5: A tidier regex ? by ikegami (Patriarch) on Sep 14, 2010 at 17:21 UTC
Re: A tidier regex ? by AnomalousMonk (Archbishop) on Sep 14, 2010 at 20:02 UTC
Here's an approach I like because it tends to encourage clear definition and easy maintenance of regexes. It can be quite a bit more verbose than a 'one-liner' approach, but can pay dividends when it's 3AM and you're trying to figure out what went wrong.

The code assumes the following 'requirements' as they have emerged (by inference, implication or suggestion) in discussion: 'Any number of...' means 'one or more' (hence the `+` quantifier); An 'alphanumeric character' is an upper case alpha or a digit; The regex is to be used for entry validation, not for sub-string extraction; Any amount of whitespace may precede or follow the entry.

`>perl -wMstrict -le "my $body = qr{ [[:upper:]\d]+ }xms; my $cdc = qr{ CDC (?: _ $body){2} }xms; my $smr = qr{ DDCSMR $body }xms; my $rmr = qr{ DDCRMR $body }xms; my $valid = qr{ \A \s* (?: $cdc | $smr | $rmr) \s* \z }xms; while (<>) { chomp; last unless $_; printf qq{'$_' %svalid \n}, m{$valid} ? '' : 'IN'; } "`

Sample session, each input followed by the program's output: `CDC_ ' CDC_ ' INvalid CDC_1_ANSND ' CDC_1_ANSND' valid CDC_ASD_ERTY ' CDC_ASD_ERTY ' valid XYZZY 'XYZZY' INvalid DDCRMRA ' DDCRMRA ' valid DDCSMR 'DDCSMR' INvalid`

Note that the two regexes `my $smr = qr{ DDCSMR $body }xms;` `my $rmr = qr{ DDCRMR $body }xms;` could be combined to a single regex, e.g. (untested) `my $ddc = qr{ DDC [SR] MR $body }xms;` and the validation regex would simplify somewhat to `my $valid = qr{ \A \s* (?: $cdc | $ddc) \s* \z }xms;`
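Since the combined `$ddc` form is marked as untested in the post, here is a minimal script sketch that exercises it against the same six entries as the session above. It reads from a hard-coded list rather than STDIN, which is an assumption made to keep the example self-contained.

```perl
use strict;
use warnings;

# Building blocks, as in the post.
my $body = qr{ [[:upper:]\d]+ }xms;          # one or more upper-case letters or digits
my $cdc  = qr{ CDC (?: _ $body){2} }xms;     # CDC_<body>_<body>
my $ddc  = qr{ DDC [SR] MR $body }xms;       # DDCSMR<body> or DDCRMR<body>

# Whole-entry validation, allowing surrounding whitespace.
my $valid = qr{ \A \s* (?: $cdc | $ddc) \s* \z }xms;

# Same entries as the sample session; whitespace placement is illustrative.
my @entries = ('CDC_', 'CDC_1_ANSND', ' CDC_ASD_ERTY ', 'XYZZY', 'DDCRMRA', 'DDCSMR');

for my $entry (@entries) {
    printf "'%s' %svalid\n", $entry, $entry =~ $valid ? '' : 'IN';
}
```

With these inputs the combined form gives the same valid/INvalid verdicts as the separate `$smr` and `$rmr` regexes did in the session.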
Re: A tidier regex ? by BrowserUk (Patriarch) on Sep 14, 2010 at 13:09 UTC
Would you consider this 'better'? `$s =~ m[CDC(?:_[A-Z0-9]+){2}(?:,\s+DDC[A-Z0-9]+){2}] and print 'Matched';` The question is, do you need to be quite so specific? That is, if you reduced the regex to say: `m[CDC\w+(?:,\s+\w+){2}]`, is there the possibility that it could falsely match something else that will appear somewhere in the file? It's obviously not so thorough, but it may be good enough given your knowledge of what will be in the file. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP an inspiration; A true Folk's Guy
Re^2: A tidier regex ? by dasgar (Priest) on Sep 14, 2010 at 13:24 UTC
BrowserUk, I was thinking along the same lines on the regex, but you beat me to it and you've got better regexes than what I was coming up with. Assuming that the OP is concerned about the formatting (the two underscores in the first 'word' and the starting strings for the next two 'words'), I think a slight modification of your second regex could work to meet that need. Something like: `m[CDC_\w+?_\w+(?:,\s+DDC(?:S|R)MR\w+){2}]` Of course, I haven't tested that, so I wouldn't be surprised if someone was able to point out problem(s) in that regex.
Re^3: A tidier regex ? by Anonymous Monk on Sep 14, 2010 at 13:43 UTC
I don't think I explained the issue very well. The user will input a single word, which could be either CDC__ or DDC(SMR or RMR)*. So if my limited knowledge of regex is correct, the above doesn't match. I tested it. Hence my reply clarifying the issue. Sorry for any confusion caused.
Re^2: A tidier regex ? by Anonymous Monk on Sep 14, 2010 at 13:32 UTC
Since this is checking user input in the form of a prompt (rather than input from a file), the suggestions you have made would be sufficient. However, I'm not sure whether your suggestion `m[CDC\w+(?:,\s+\w+){2}]` would cover it. How is the DDC.... catered for? My regex skills are at the beginner level, hence my question. Thanks.
Re^3: A tidier regex ? by BrowserUk (Patriarch) on Sep 14, 2010 at 13:50 UTC
"How is the DDC.... catered for?" It's allowed for by the `\w+`, which matches `[A-Za-z0-9_]`, but it is not verified. So, for example, it would also match `'CDC_1, ABC, ABC'`, if there was any possibility of that appearing in your data. And that is where you will have to apply your knowledge of your data to decide just how specific you have to be to ensure you only match the data you want to match. You might for instance know that there will be lines similar to `CDC_..., ABC..., XYZ...` that you mustn't match, in which case you need to be more specific. Maybe `m[CDC\w+(?:,\s+DDC\w+){2}]` would satisfy. But if the data is coming from users typing, who are apt to transpose and omit stuff, then maybe you should stick with a fully specified regex. Say `m[CDC(?:_[A-Z0-9]+){2}(?:,\s+DDC[SR]MR[A-Z0-9]+){2}]` Only you can know your full requirements.
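To see the trade-off side by side, here is a small sketch that runs the three patterns from this post against a few lines. The sample lines and variable names are invented for illustration, apart from `'CDC_1, ABC, ABC'`, which is the example given above.

```perl
use strict;
use warnings;

# The three levels of specificity discussed above.
my $loose   = qr/CDC\w+(?:,\s+\w+){2}/;
my $partial = qr/CDC\w+(?:,\s+DDC\w+){2}/;
my $full    = qr/CDC(?:_[A-Z0-9]+){2}(?:,\s+DDC[SR]MR[A-Z0-9]+){2}/;

# Hypothetical sample lines.
my @lines = (
    'CDC_1_A, DDCSMR1, DDCRMR2',   # well formed
    'CDC_1, DDCX, DDCY',           # right prefixes, wrong detail
    'CDC_1, ABC, ABC',             # only the first word looks right
);

for my $line (@lines) {
    printf "%-28s loose=%-3s partial=%-3s full=%s\n", "'$line'",
        ($line =~ $loose   ? 'yes' : 'no'),
        ($line =~ $partial ? 'yes' : 'no'),
        ($line =~ $full    ? 'yes' : 'no');
}
```

The loose pattern accepts all three lines, the partial one rejects only the last, and the fully specified one accepts only the first.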
Re: A tidier regex ? by JavaFan (Canon) on Sep 14, 2010 at 13:51 UTC
I'd write that as: `/\b (?: CDC_ [A-Z0-9]* _ [A-Z0-9]* | DDC[SR]MR [A-Z0-9]* ) \b/x` Although factoring out the `[A-Z0-9]` may be considered. Now, if you know the data cannot contain lowercase letters (or at least, you know you won't get any more matches if you'd first uppercased the string you match against), you could replace the `[A-Z0-9]` with `[[:alnum:]]`.
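One way the factoring-out might look, as a sketch: the repeated class goes into a `qr//` object that is interpolated into the main pattern. The names `$an` and `$re` are hypothetical, and the `*` quantifier is kept from the pattern above.

```perl
use strict;
use warnings;

# The repeated piece, defined once: zero or more upper-case letters or digits.
my $an = qr/[A-Z0-9]*/;

# Same structure as the pattern above, with $an interpolated.
my $re = qr/\b (?: CDC_ $an _ $an | DDC[SR]MR $an ) \b/x;

for my $entry ('CDC_1_ANSND', 'DDCRMRA', 'XYZZY') {
    printf "%-12s %s\n", $entry, $entry =~ $re ? 'match' : 'no match';
}
```

Because of the `*`, this version also accepts entries with empty fields, such as a bare `DDCSMR`; switch to `+` if at least one character is required.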
Re: A tidier regex ? by Anonymous Monk on Sep 14, 2010 at 12:53 UTC
Sorry, the code should have been: `#!/usr/bin/perl use warnings; use strict; while(<>) { if(m/\bCDC_([A-Z])+_([A-Z0-9])+\b|\b(DDCSMR|DDCRMR)([A-Z0-9])+\b/) {print "Found a match $_\n";} }`