bdalzell has asked for the wisdom of the Perl Monks concerning the following question:
I am trying to write a regexp match which will pick up a string of 6 to 9 digits in the middle of a longer string.
Here is a typical example. It is a line from an online dog show catalog. The top line is the string; the two comments are there to provide an explanation.
A ||118|AVIANN GILDED WILD HONEY. HM 75081701. 02-04-97
#|--1---|---2--------------------| |---3-----| |--4----|
#explanation of fields above
The areas of interest within the line are:
(1) stuff to be discarded
(2) the dog's name
(3) the dog's registration number: usually 2 alpha characters, a space and 8 digits, but it could be 2 alpha characters and 6 digits
(4) the date of birth, with a 2-digit year
The original catalog entry has some unneeded information associated with 3 sets of tabs at the beginning, but I go through the line with a substitution and change the tabs to pipes (|) because they are easier to get rid of in a regexp (since I can see them).
The dog's name may contain non-alpha characters such as hyphens, single quotes, ampersands and periods, so just looking for \w does not work.
The registration number typically follows the 2 letter, space, 8 digit formula, but sometimes there are typos and there are more or fewer than 8 digits, or it is a foreign registration number. Another typo is having two hyphens (--) between the alphabetic and the numeric part (I am not as worried about this last case).
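The registration-number variants described above could be covered by a pattern along these lines. This is only a sketch: `find_reg_number` is a hypothetical helper, and it assumes the prefix is always exactly two ASCII letters followed by optional spaces or hyphens, which may not hold for every foreign number.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): find a registration
# number anywhere in a line. Assumes two letters, then optional spaces
# or hyphens, then 6 to 9 digits.
sub find_reg_number {
    my ($line) = @_;
    return $line =~ /\b([A-Za-z]{2})[\s-]*(\d{6,9})\b/ ? ($1, $2) : ();
}

for my $sample ('HM 75081701.', 'HP--09090901.', 'HM 750817.') {
    my ($prefix, $digits) = find_reg_number($sample);
    print defined $prefix
        ? "$sample => $prefix / $digits\n"
        : "$sample => no match\n";
}
```

The trailing `\b` makes a run of 10 or more digits fail rather than match its first 9, which seems the safer behaviour when typos are expected.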
The date of birth is pretty constant in form.
Here is a regexp that works fine if the only error is having 7 digits rather than 8 in the registration number. It also accepts some of the non-standard foreign registration numbers and the double-hyphen (--) typo in the registration number, which looks like this:
HP--09090901.
if( $line=~m/(\|.*)(\|)(.*)(\w{2}.*\d{7,8}\w*).(\s\d\d[\W|-]{1}\d\d[\W|-]{1}\d\d)/ ){
I would like it to accept a range of 6 to 9 digits, but when I substitute {6,9} for {7,8} it does not recognize the input line.
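For comparison, here is a minimal sketch of a whole-line match that allows 6 to 9 digits. It is not the regexp above with one number changed: `parse_entry` is a hypothetical helper, it uses the /x modifier so the pattern can be commented, and it assumes the name is followed by a period, the registration number by a period, and the date of birth ends the line.

```perl
use strict;
use warnings;

# Hypothetical helper: split one catalog line into name, registration
# number and date of birth. Returns an empty list when nothing matches.
sub parse_entry {
    my ($line) = @_;
    return unless $line =~ m{
        ^ .* \|                          # (1) everything up to the last pipe, discarded
        (.+?) \.? \s+                    # (2) name, any characters, minus a trailing period
        ( [A-Za-z]{2} [\s-]* \d{6,9} )   # (3) two letters, optional space or hyphens, 6 to 9 digits
        \. \s+                           #     period and whitespace after the number
        ( \d\d \D \d\d \D \d\d ) \s* \z  # (4) date of birth with any non-digit separator
    }x;
    return ($1, $2, $3);
}

my $line = 'A ||118|AVIANN GILDED WILD HONEY. HM 75081701. 02-04-97';
my ($name, $reg, $dob) = parse_entry($line);
print "name=$name\nreg=$reg\ndob=$dob\n" if defined $name;
```

Because the name is matched with a non-greedy `.+?`, hyphens, single quotes, ampersands and internal periods in the name are accepted without listing them.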
This web page:
http://www.grymoire.com/Unix/Regular.html
suggested:
There is a special pattern you can use to specify the minimum and maximum number of repeats. This is done by putting
those two numbers between "\{" and "\}". The backslashes deserve a special discussion. Normally a backslash turns off
the special meaning for a character. A period is matched by a "\." and an asterisk is matched by a "\*".
But \d\{6,9\} does not work for me. I suspect that maybe it is not implemented in Perl.
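The quoted passage describes the basic regular expression syntax used by tools such as sed and grep. In Perl the braces of a {min,max} quantifier are written without backslashes, and a backslashed brace is a literal character. A small sketch showing the difference (the sample strings are made up for illustration):

```perl
use strict;
use warnings;

my $reg = 'HM 75081701.';

# Unescaped braces form the {min,max} quantifier in Perl.
my $plain = $reg =~ /\d{6,9}/ ? 1 : 0;

# Escaped braces are literal characters, so this pattern looks for a
# digit followed by the text "{6,9}".
my $escaped = $reg     =~ /\d\{6,9\}/ ? 1 : 0;
my $literal = '7{6,9}' =~ /\d\{6,9\}/ ? 1 : 0;

print "plain=$plain escaped=$escaped literal=$literal\n";
# prints: plain=1 escaped=0 literal=1
```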
I am running perl v5.10.0 built for i486-linux-gnu-thread-multi under Ubuntu Karmic. This is the perl that is a standard installation on Ubuntu Karmic.
Replies are listed 'Best First'.

Re: regexp identify variable number of digits within a sentence
by toolic (Bishop) on Aug 31, 2010 at 23:51 UTC

Re: regexp identify variable number of digits within a sentence
by perlpie (Beadle) on Sep 01, 2010 at 02:44 UTC

Re: regexp identify variable number of digits within a sentence
by repellent (Priest) on Sep 01, 2010 at 02:13 UTC

Re: regexp identify variable number of digits within a sentence
by ww (Archbishop) on Sep 01, 2010 at 02:42 UTC

Re: regexp identify variable number of digits within a sentence
by Marshall (Canon) on Sep 01, 2010 at 06:10 UTC

Re: regexp identify variable number of digits within a sentence
by aquarium (Curate) on Sep 01, 2010 at 00:51 UTC