Re: A tidier regex ?
by ikegami (Patriarch) on Sep 14, 2010 at 14:09 UTC
What's the goal of the regex: validation, finding a match, or extracting data from a line?
For extraction, the following should do:
/\b((?:CDC|DDCSMR|DDCRMR)\w*)/
For matching, it simplifies to
/\b(?:CDC|DDCSMR|DDCRMR)/
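To illustrate the difference between the two patterns above, here is a minimal sketch. The sample line is hypothetical and not taken from the original question.

```perl
use strict;
use warnings;

# Hypothetical sample line, invented for illustration.
my $line = 'id=DDCSMR42X status=ok';

# Matching: only asks whether one of the keywords starts at a word boundary.
my $found = $line =~ /\b(?:CDC|DDCSMR|DDCRMR)/ ? 1 : 0;

# Extraction: captures the keyword plus any word characters that follow it.
my ($token) = $line =~ /\b((?:CDC|DDCSMR|DDCRMR)\w*)/;

print "found=$found token=$token\n";
```

Neither pattern checks what surrounds the token, so both would also accept a line with other text around it.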
It's for validation. These don't cater for the (underscores). These answers are opening up insight into Perl... I never realised it was so powerful.
If you are validating, you should also anchor your regex to start and end of the string (and maybe allow leading and trailing whitespace).
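A rough sketch of what anchoring could look like, assuming the loose keyword pattern from earlier in the thread. The helper name is_valid_entry and the sample entries are made up for illustration.

```perl
use strict;
use warnings;

# Sketch only: anchors the loose keyword pattern with \A and \z and
# tolerates surrounding whitespace. is_valid_entry is a hypothetical name.
sub is_valid_entry {
    my ($entry) = @_;
    return $entry =~ /\A\s*(?:CDC|DDCSMR|DDCRMR)\w*\s*\z/ ? 1 : 0;
}

# Hypothetical entries: only the first is a single keyword-led word.
for my $entry (' CDC_1_A ', 'xx CDC_1_A', 'CDC_1 A') {
    printf "'%s' => %d\n", $entry, is_valid_entry($entry);
}
```

Without the anchors, the second and third entries would be accepted as well, because the pattern would be free to match anywhere inside the string.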
\w does match underscore.
Re: A tidier regex ?
by AnomalousMonk (Archbishop) on Sep 14, 2010 at 20:02 UTC
Here's an approach I like because it tends to encourage clear definition and easy maintenance of regexes. It can be quite a bit more verbose than a 'one-liner' approach, but can pay dividends when it's 3AM and you're trying to figure out what went wrong.
The code assumes the following 'requirements' as they have emerged (by inference, implication or suggestion) in discussion:
- 'Any number of...' means 'one or more' (hence the + quantifier);
- An 'alphanumeric character' is an upper case alpha or a digit;
- The regex is to be used for entry validation, not for sub-string extraction;
- Any amount of whitespace may precede or follow the entry.
>perl -wMstrict -le
"my $body = qr{ [[:upper:]\d]+ }xms;
my $cdc = qr{ CDC (?: _ $body){2} }xms;
my $smr = qr{ DDCSMR $body }xms;
my $rmr = qr{ DDCRMR $body }xms;
my $valid = qr{ \A \s* (?: $cdc | $smr | $rmr) \s* \z }xms;
while (<>) {
chomp;
last unless $_;
printf qq{'$_' %svalid \n}, m{$valid} ? '' : 'IN';
}
"
CDC_
' CDC_ ' INvalid
CDC_1_ANSND
' CDC_1_ANSND' valid
CDC_ASD_ERTY
' CDC_ASD_ERTY ' valid
XYZZY
'XYZZY' INvalid
DDCRMRA
' DDCRMRA ' valid
DDCSMR
'DDCSMR' INvalid
Note that the two regexes
my $smr = qr{ DDCSMR $body }xms;
my $rmr = qr{ DDCRMR $body }xms;
could be combined to a single regex, e.g. (untested)
my $ddc = qr{ DDC [SR] MR $body }xms;
and the validation regex would simplify somewhat to
my $valid = qr{ \A \s* (?: $cdc | $ddc) \s* \z }xms;
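Since the combined form is marked untested, here is a small script-style sketch that runs it against entries like those in the session above. The entries are hard-coded instead of read from STDIN, and the surrounding whitespace in them is an assumption.

```perl
use strict;
use warnings;

my $body  = qr{ [[:upper:]\d]+ }xms;
my $cdc   = qr{ CDC (?: _ $body){2} }xms;
my $ddc   = qr{ DDC [SR] MR $body }xms;     # combined SMR/RMR form
my $valid = qr{ \A \s* (?: $cdc | $ddc) \s* \z }xms;

# Entries modelled on the session above; exact whitespace is assumed.
for my $entry ('CDC_', ' CDC_1_ANSND', 'CDC_ASD_ERTY ', 'XYZZY', 'DDCRMRA', 'DDCSMR') {
    printf "'%s' %svalid\n", $entry, $entry =~ $valid ? '' : 'IN';
}
```

On these entries the combined regex should give the same valid/INvalid verdicts as the two-regex version.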
Re: A tidier regex ?
by BrowserUk (Patriarch) on Sep 14, 2010 at 13:09 UTC
$s =~ m[CDC(?:_[A-Z0-9]+){2}(?:,\s+DDC[A-Z0-9]+){2}] and print 'Matched';
The question is, do you need to be quite so specific?
That is, if you reduced the regex to, say, m[CDC\w+(?:,\s+\w+){2}], is there a possibility that it could falsely match something else that appears somewhere in the file?
It's obviously not so thorough, but it may be good enough given your knowledge of what will be in the file.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
BrowserUk, I was thinking along the same lines on the regex, but you beat me to it and you've got better regexes than what I was coming up with.
Assuming that the OP is concerned about the formatting (the two underscores in the first 'word' and the starting strings for the next two 'words'), I think a slight modification of your second regex could meet that need. Something like:
m[CDC_\w+?_\w+(?:,\s+DDC(?:S|R)MR\w+){2}]
Of course, I haven't tested that, so I wouldn't be surprised if someone were able to point out problems in that regex.
I don't think I explained the issue very well. The user will input a single word, which could be either
CDC_*_* or
DDC(SMR or RMR)*
So if my limited knowledge of regex is correct, the above doesn't match. I tested it, hence my reply clarifying the issue. Sorry for any confusion caused.
Since this is checking user input at a prompt (rather than input from a file), the suggestions you have made would be sufficient. However, I'm not sure whether this suggestion would cover it:
m[CDC\w+(?:,\s+\w+){2}]
How is the DDC.... catered for? My regex skills are at the beginner level, hence my question. Thanks.
How is the DDC.... catered for?
It's allowed for by the \w+, which matches [A-Za-z0-9_], but it is not verified. So, for example, it would also match 'CDC_1, ABC, ABC', if there were any possibility of that appearing in your data.
And that is where you will have to apply your knowledge of your data to decide just how specific you have to be to ensure you only match that data you want to match.
You might for instance know that there will be lines similar to CDC_..., ABC..., XYZ... that you mustn't match, in which case, you need to be more specific. Maybe m[CDC\w+(?:,\s+DDC\w+){2}] would satisfy.
But if the data is coming from users' typing (and users are apt to transpose and omit characters), then maybe you should stick with a fully specified regex. Say:
m[CDC(?:_[A-Z0-9]+){2}(?:,\s+DDC[SR]MR[A-Z0-9]+){2}]
Only you can know your full requirements.
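As a rough illustration of that trade-off, this sketch runs the loosest and the fully specified regexes against two lines. The second line is a hypothetical well-formed example, not real data.

```perl
use strict;
use warnings;

my $loose  = qr/CDC\w+(?:,\s+\w+){2}/;
my $strict = qr/CDC(?:_[A-Z0-9]+){2}(?:,\s+DDC[SR]MR[A-Z0-9]+){2}/;

# First line is the counter-example mentioned above; second is invented.
for my $line ('CDC_1, ABC, ABC', 'CDC_1_A, DDCSMR1, DDCRMR2') {
    printf "%-28s loose=%d strict=%d\n",
        "'$line'",
        ($line =~ $loose  ? 1 : 0),
        ($line =~ $strict ? 1 : 0);
}
```

The counter-example should be accepted only by the loose regex, while the well-formed line should be accepted by both.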
Re: A tidier regex ?
by JavaFan (Canon) on Sep 14, 2010 at 13:51 UTC
/\b (?: CDC_ [A-Z0-9]* _ [A-Z0-9]* |
DDC[SR]MR [A-Z0-9]* ) \b/x
You might also consider factoring out the [A-Z0-9]*. And if you know the data cannot contain lowercase letters (or at least, you know you won't get any more matches if you'd first uppercased the string you match against), you could replace the [A-Z0-9]* with [[:alnum:]]*.
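One way the factoring might look, as a sketch: the repeated class goes into a qr// object that is interpolated into the main pattern. The variable names and sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical names: $alnum holds the repeated sub-pattern.
my $alnum = qr/[A-Z0-9]*/;
my $re    = qr/\b (?: CDC_ $alnum _ $alnum | DDC[SR]MR $alnum ) \b/x;

# Invented sample strings.
for my $str ('CDC_A1_B2', 'DDCSMR', 'XYZZY') {
    printf "'%s' %s\n", $str, $str =~ $re ? 'matches' : 'does not match';
}
```

Because the quantifier is * rather than +, a bare 'DDCSMR' with nothing after it still matches here.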
Re: A tidier regex ?
by Anonymous Monk on Sep 14, 2010 at 12:53 UTC
Sorry, the code should have been:
#!/usr/bin/perl
use warnings;
use strict;
while (<>) {
    if (m/\bCDC_([A-Z])+_([A-Z0-9])+\b|\b(DDCSMR|DDCRMR)([A-Z0-9])+\b/) {
        print "Found a match $_\n";
    }
}