Pattern Matching

davidas has asked for the wisdom of the Perl Monks concerning the following question:

I thought I had a reasonable handle on regexes, but occasionally a problem shatters my confidence. I am trying to match Roman numerals in the range i to xxxix. The numeral may be preceded by a left bracket and may be followed by (i) a period, (ii) a right bracket, or (iii) a right bracket followed by a period. The entire pattern is always terminated with a space.
It all appears to work OK except when the string being searched comprises just a single space, in which case it is (incorrectly, IMHO) matched.
The part of the pattern to the left of the cluster that contains the matched space (i.e. the Roman numeral cluster) has a quantifier of {1,1}, so I really don't understand why the space should be matched when there are no valid characters in the string before it.
Any help would be greatly appreciated.
This code outputs Matched ' ' in ' '



use strict;
#use re 'debug';

{
  my ($rv, $linestr, $pattern);
  
  $linestr = ' ';
  $pattern = '^\({0,1}(((ix)|(iv))|(x{0,3}((ix)|(iv)))|(x{0,3}(v{0,1}i{0,3}))){1,1}((\)\. )|(\) )|(\. )|( )){1,1}';

 if( $linestr =~ m/$pattern/i)
  {
    print ("Matched '$&' in '$linestr'\n");
  }
  else
  {
      print ("Not matched\n");
  }
}

Re: Pattern Matching by AnomalousMonk (Archbishop) on Mar 17, 2017 at 03:49 UTC
huck has shown why your regex (correctly!) matches a single space. Here's the approach I would take to a solution. (Note that I am sure there are CPAN modules to do all this much better!) Pay particular attention to adding test cases to the no-match section of tests. I would also add some mixed-case tests to the all-match section.

c:\@Work\Perl\monks>perl -wMstrict -le
"use Test::More 'no_plan';
 use Test::NoWarnings;
 ;;
 my $rx_1_3  = qr{ (?i) i{1,3} }xms;
 my $rx_1_9  = qr{ (?i) (?: $rx_1_3 | iv | v $rx_1_3? | ix) }xms;
 my $rx_1_39 = qr{ (?i) (?: $rx_1_9 | x{1,3} $rx_1_9?) }xms;
 ;;
 my $pat = qr{ [(]? \b $rx_1_39 \b (?: [.] | [)][.]?) [ ] }xms;
 ;;
 use constant ROMAN_1_39 => qw(
       i     ii     iii     iv     v     vi     vii     viii     ix
   x   xi    xii    xiii    xiv    xv    xvi    xvii    xviii    xix
   xx  xxi   xxii   xxiii   xxiv   xxv   xxvi   xxvii   xxviii   xxix
   xxx xxxi  xxxii  xxxiii  xxxiv  xxxv  xxxvi  xxxvii  xxxviii  xxxix
   );
 ;;
 note 'perl version: ', $];
 ;;
 my $test_regex = qr{ \A $pat \z }xms;
 note 'test regex: ', $test_regex;
 ;;
 note 'ALL must match';
 for my $roman (ROMAN_1_39, map uc, ROMAN_1_39) {
   for my $pre ('', '(') {
     for my $post (qw/. ) )./) {
       my $rs = qq{$pre$roman$post };
       ok $rs =~ $test_regex, qq{'$rs'};
       }
     }
   }
 ;;
 note 'NONE shall pass!';
 for my $nomatch (ROMAN_1_39, ' ',
   qw(iiii ixxxix xxxixi ixxxixi etc),
   ) {
   ok $nomatch !~ $test_regex, qq{'$nomatch'};
   }
 ;;
 done_testing;
 "

(I won't pollute these sacred spaces with the rather tedious output.)

Give a man a fish: `<%-{-{-{-<`
Re^2: Pattern Matching by huck (Prior) on Mar 17, 2017 at 04:10 UTC
I liked `[.]`, I'll have to remember that one!
Re^3: Pattern Matching by AnomalousMonk (Archbishop) on Mar 17, 2017 at 05:45 UTC
IIRC, this is from TheDamian's regex PBPs, which I religiously observe (the others, not so much). Things like `[(] [.] [ ]` are visually useful, especially `[ ]`, because what the heck does `\ ` (a backslash followed by a space) mean anyway in `/x` context, which we ought always to use?

Give a man a fish: `<%-{-{-{-<`
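
To make the `/x` point concrete, here is a minimal sketch of how whitespace in a pattern behaves under that flag. The variable names are made up for the illustration and do not come from the thread.

```perl
use strict;
use warnings;

# Under /x, unescaped whitespace inside the pattern is ignored,
# so this is the same as /\Aiv\z/ and does not require a space.
my $ignored = qr{ \A iv   \z }x;

# A literal space has to be spelled out. A one-character class is
# hard to overlook...
my $bracket = qr{ \A iv [ ] \z }x;

# ...whereas backslash-space does the same job but is easy to misread.
my $escaped = qr{ \A iv \  \z }x;

for my $str ('iv', 'iv ') {
    printf "%-6s ignored=%d bracket=%d escaped=%d\n",
        "'$str'",
        ($str =~ $ignored ? 1 : 0),
        ($str =~ $bracket ? 1 : 0),
        ($str =~ $escaped ? 1 : 0);
}
```

Both explicit forms accept `'iv '` and reject `'iv'`, while the pattern with only unescaped spaces does the opposite.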
Re^4: Pattern Matching by vrk (Chaplain) on Mar 17, 2017 at 10:20 UTC
Re^5: Pattern Matching by AnomalousMonk (Archbishop) on Mar 19, 2017 at 00:07 UTC
Re^2: Pattern Matching by davidas (Initiate) on Mar 17, 2017 at 21:41 UTC
Thanks. I'll have a good play about with this. I didn't go the CPAN module route because of the requirement to match leading and trailing brackets, periods and spaces, which I thought would be more specific to my particular requirement - ironically, that's not what caused the problem, though!
Re^3: Pattern Matching by AnomalousMonk (Archbishop) on Mar 19, 2017 at 01:06 UTC
... the requirement to match leading and trailing brackets, period and spaces ...

I had in mind using a CPAN module only as a source for a regex for dependably recognizing the Roman-numeric part of your string, something along the lines of what Regexp::Common provides. Unfortunately, this module does not seem to support Roman numerals. OK, then maybe use the Roman-to-decimal conversion functions of Roman or Text::Roman (but I've not used either of these modules and so can't recommend them) or some such to test for the 1 .. 39 range of a Roman sequence extracted with a simple `[ivxIVX]+` capture. The advantage of using such a module is that it is, one presumes, well-tested. (These modules both provide an `isroman()` function that would, one would hope, reject something like ixixixix, but I haven't checked this.) But if you have to do all that, maybe it's better to hand-craft (and test!) your own `[i-xxxix]` regex...

Give a man a fish: `<%-{-{-{-<`
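
The "loose capture, then validate the range" idea can be sketched without any CPAN dependency by checking the capture against a generated table of the 39 valid numerals. This is only an illustration under stated assumptions: the helper `roman_label_value` and the hash `%value_of` are hypothetical names, and the table lookup stands in for the module conversion functions mentioned above, which are not exercised here.

```perl
use strict;
use warnings;

# Build the numerals for 1 .. 39 from a tens table and a units table.
my @units = ('', qw(i ii iii iv v vi vii viii ix));
my @tens  = ('', qw(x xx xxx));

my %value_of;
for my $t (0 .. 3) {
    for my $u (0 .. 9) {
        my $n = 10 * $t + $u;
        next if $n == 0;                      # there is no numeral for zero
        $value_of{ $tens[$t] . $units[$u] } = $n;
    }
}

# Hypothetical helper: returns 1 .. 39 for a valid label such as
# '(xiv). ', and undef for anything else.
sub roman_label_value {
    my ($str) = @_;

    # Loose capture: optional '(', a run of numeral characters, an
    # optional '.', ')' or ').' and then the mandatory space.
    my ($roman) = $str =~ m{ \A [(]? ([ivxIVX]+) (?: [.] | [)][.]? )? [ ] }x
        or return;

    # Strict check: the run must be one of the 39 known numerals.
    return $value_of{ lc $roman };
}

for my $str ('(xiv). ', 'XXXIX ', 'ixixixix. ', ' ') {
    my $value = roman_label_value($str);
    printf "%-14s %s\n", "'$str'", defined $value ? $value : 'rejected';
}
```

The design choice is to keep the regex simple and push the "is this really a numeral in range" question into a lookup, which is easy to test exhaustively.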
Re: Pattern Matching by huck (Prior) on Mar 17, 2017 at 00:12 UTC
to line things up

^\({0,1}
(
 ((ix)|(iv))
|(x{0,3}((ix)|(iv)))
|(x{0,3}(v{0,1}i{0,3}))
){1,1}(
 (\)\. )
|(\) )
|(\. )
|( )
){1,1}

`|(x{0,3}(v{0,1}i{0,3}))` matches nothing (the zeros) 1 times

Had to look hard

Edit: see also http://stackoverflow.com/questions/267399/how-do-you-match-only-valid-roman-numerals-with-a-regular-expression
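
A short sketch of that point: the third alternative is made entirely of zero-minimum quantifiers, so it is satisfied by the empty string and the `{1,1}` around the cluster is met without consuming anything. The repair shown (a lookahead that demands one numeral character) is just one possible approach and an assumption of this example, not something proposed in the thread; `matches_fixed` is a hypothetical helper name.

```perl
use strict;
use warnings;

# The third alternative of the original pattern, taken on its own.
# Every quantifier in it has a minimum of zero.
my $all_optional = qr/x{0,3}(?:v{0,1}i{0,3})/i;

# Even anchored at both ends, it is satisfied by the empty string.
print "third alternative matches ''\n" if '' =~ /\A$all_optional\z/;

# One possible repair (an assumption, not taken from the thread):
# a lookahead insists on at least one numeral character before the
# optional pieces get a chance to match nothing.
my $numeral = qr/(?=[ivx])x{0,3}(?:ix|iv|v?i{0,3})/i;
my $fixed   = qr/^\(?$numeral(?:\)\. |\) |\. | )/;

# Hypothetical helper, named here only for the demonstration.
sub matches_fixed {
    my ($str) = @_;
    return $str =~ $fixed ? 1 : 0;
}

for my $str (' ', 'iv. ', '(xxxix) ', 'xiv ', 'iiii ') {
    printf "%-12s %s\n", "'$str'",
        matches_fixed($str) ? 'matched' : 'not matched';
}
```

With the lookahead in place, the single-space string is no longer accepted, while labels such as `'iv. '` and `'(xxxix) '` still are.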
Re^2: Pattern Matching by davidas (Initiate) on Mar 17, 2017 at 21:23 UTC
Duh... yes, of course it does. Thank you for pointing out the nearly obvious.