regexp identify variable number of digits within a sentence

bdalzell has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to write a regexp match which will pick up a string of 6 to 9 digits in the middle of a longer string.

Here is a typical example, a line from an online dog show catalog. The top line is the string; the two comment lines below it are there to provide an explanation.

  A ||118|AVIANN GILDED WILD HONEY. HM 75081701. 02-04-97
 #|--1---|---2--------------------| |---3-----|  |--4----|
 #explanation of fields above

The areas of interest within the line are:

(1) stuff to be discarded

(2) the dog's name

(3) the dog's registration number - usually it is 2 alpha characters, a space, and 8 digits, but it could be 2 alpha characters and 6 digits

(4) the date of birth - 2 digit year

The original catalog entry has some un-needed information associated with 3 sets of tabs at the beginning but I go through the line with a substitution and change the tabs to pipes (|) because it is easier to get rid of them in a regexp (since I can see them).

The dog's name may contain non-alpha characters such as hyphens, single quotes, ampersands and periods, so just looking for \w does not work.

The registration number typically follows the 2 letter, space, 8 digit formula, but sometimes there are typos and there are more or fewer than 8 digits, or it is a foreign registration number. Another typo is having two hyphens (--) between the alphabetic and the numeric part (I am not as worried about the last case).

The date of birth is pretty constant in form.

Here is a regexp that works fine if the only error is having 7 digits rather than 8 digits in the registration number. It also accepts some of the non-standard foreign registration numbers and the double-hyphen (--) typo in a reg number that looks like this:

HP--09090901.

if(
$line=~m/(\|.*)(\|)(.*)(\w{2}.*\d{7,8}\w*).(\s\d\d[\W|-]{1}\d\d[\W|-]{1}\d\d)/
){

I would like it to accept a range of 6 to 9 digits but when I try to substitute {6,9} it does not recognize the input line.

This web page:
http://www.grymoire.com/Unix/Regular.html
suggested:
There is a special pattern you can use to specify the minimum and maximum number of repeats. This is done by putting those two numbers between "\{" and "\}". The backslashes deserve a special discussion. Normally a backslash turns off the special meaning for a character. A period is matched by a "\." and an asterisk is matched by a "\*".

But \d\{6,9\} does not work for me. I suspect that maybe it is not implemented in perl.

I am running perl v5.10.0 built for i486-linux-gnu-thread-multi under Ubuntu Karmic. This is the perl that is a standard installation on Ubuntu Karmic.

Replies are listed 'Best First'.
Re: regexp identify variable number of digits within a sentence by toolic (Bishop) on Aug 31, 2010 at 23:51 UTC
I don't really understand your requirements, but maybe this slightly different approach will help:

    use strict;
    use warnings;

    my $s = ' A ||118|AVIANN GILDED WILD HONEY. HM 75081701. 02-04-97';
    if ($s =~ / ^ .* [|] (.*) [.] ([^.]+) [.] ([^.]+) $ /x) {
        my $name = $1;
        my $num = $2;
        my $date = $3;
        print "Name = $name\n";
        print "Num = $num\n";
        print "Date = $date\n";
    }
    __END__
    Name = AVIANN GILDED WILD HONEY
    Num = HM 75081701
    Date = 02-04-97

Can you show a few more lines of your actual input?

Update: The above now accounts for a period in the name. Here is my original code:

    if ($s =~ / .* [|] ([^.]+) [.] ([^.]+) [.] (.*) /x) {
Re: regexp identify variable number of digits within a sentence by perlpie (Beadle) on Sep 01, 2010 at 02:44 UTC
When crafting regular expressions, it helps to build off of the more stable parts of the input. In this case, it sounds like there is a ton of variability in the initial and middle parts of the input, but not so much at the end. I started from the end.

    perlpie$ perl -e '
    print "A ||118|AVIANN GILDED WILD HONEY. HM 75081701. 02-04-97" =~
      /.+?\.\s+(.+)\.\s+[-\d]+$/
      ? "id is [$1]\n" : "id not found\n";
    '
    id is [HM 75081701]

From the end, there's the `$` to anchor at the tail, then the `\s+[-\d]+` to gobble up the date of birth, then `\.` to get the period, then the capture, which I've made very liberal, `(.+)`, and then the period and whitespace which precede it, `\.\s+`, and at the very start a non-greedy `.+?` to slowly move through the input until the rest of the pattern can match.

Now you could fool that pretty easily. Just stick a dot space in the dog's name and the bit preceding the match will match too soon and you'll get too much in the ID.

    perlpie$ perl -e '
    print "A ||118|AVIAN. GILDED WILD HONEY. HM 75081701. 02-04-97" =~
      /.+?\.\s+(.+)\.\s+[-\d]+$/
      ? "id is [$1]\n" : "id not found\n";
    '
    id is [GILDED WILD HONEY. HM 75081701]

So, we need to be a bit more rigid about the ID.

    perlpie$ perl -e '
    print "A ||118|AVIAN. GILDED WILD HONEY. HM 75081701. 02-04-97" =~
      /.+?\.\s+(\w\w(?:\s+|--)\d+)\.\s+[-\d]+$/
      ? "id is [$1]\n" : "id not found\n";
    '
    id is [HM 75081701]

That works. We're now using `\w\w(?:\s+|--)\d+`, which matches two word characters followed by (non-capturing parens to contain the alternation) either whitespace or a double-dash, followed by digits. You can try it with hyphens and the wrong number of digits and a decoy id in the dog's name:

    perlpie$ perl -e '
    print "A ||118|AVIAN. GILDED WILD HONEY. HM 1234567. HM--75081. 02-04-97" =~
      /.+?\.\s+(\w\w(?:\s+|--)\d+)\.\s+[-\d]+$/
      ? "id is [$1]\n" : "id not found\n";
    '
    id is [HM--75081]

If you wanted, you could get even more strict about the ID. You could also clean up the id to replace double-hyphens with spaces before you do anything else with it. For more info on perl regular expressions, you'll want to check out perlre.
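As a rough sketch of the "more strict" and "clean up" suggestions, the ID part can be limited to 6 to 9 digits and the double-hyphen typo normalised to a single space. The sub name `extract_id` is hypothetical, and the letters-only prefix and the 6 to 9 range are assumptions taken from the original question rather than anything perlpie posted.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the registration
# number with a "--" typo normalised to one space, or nothing on failure.
sub extract_id {
    my ($line) = @_;

    # ". " then two letters, then whitespace or "--", then 6 to 9 digits,
    # then ". " and a date-like tail anchored at the end of the line
    return unless $line =~ /\.\s+([A-Za-z]{2}(?:\s+|--)\d{6,9})\.\s+[-\d]+\s*$/;

    my $id = $1;
    $id =~ s/--|\s+/ /;    # "HP--09090901" becomes "HP 09090901"
    return $id;
}

print extract_id('A ||118|AVIANN GILDED WILD HONEY. HM 75081701. 02-04-97'), "\n";
print extract_id('A ||118|AVIAN. GILDED WILD HONEY. HP--09090901. 02-04-97'), "\n";
```

With this stricter pattern the 5-digit decoy `HM--75081` from the last test above would be rejected rather than accepted, so loosen the range if such entries need to be kept.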
Re: regexp identify variable number of digits within a sentence by repellent (Priest) on Sep 01, 2010 at 02:13 UTC
Help is in perlrequick, perlretut, and perlre. This ought to cover the test cases you mentioned:

    my @lines = (
        " A \t\t118\tAVIANN GILDED WILD HONEY. HM 123456. 02-04-97 \n",
        " A \t\t118\tAVIANN-GILDED ... 'WILD' & HONEY. HP--09090901. 02-04-97 \n",
    );

    for my $line (@lines)
    {
        my ($name, $reg, $dob) = ($line =~ /
            ^.+\t                               # ignore everything till the last tab
            (.+?)                               # capture dog name
            \s+
            ([[:alpha:]]{2} [-\s]{1,2} \d+)\.?  # capture dog registration
            \s+
            (\d{2}-\d{2}-\d{2})                 # capture DOB
            \s*$                                # anchor the end of regexp
        /x);

        print "Name : $name\n",
              "Reg : $reg\n",
              "DOB : $dob\n\n";
    }
    __END__
    Name : AVIANN GILDED WILD HONEY.
    Reg : HM 123456
    DOB : 02-04-97

    Name : AVIANN-GILDED ... 'WILD' & HONEY.
    Reg : HP--09090901
    DOB : 02-04-97
Re: regexp identify variable number of digits within a sentence by ww (Archbishop) on Sep 01, 2010 at 02:42 UTC
TIMTOWTDI

    if ( $dog =~ /[^|]*\|[^|]*\|[^|]*\|([A-Z ]*)\.\s([A-Z]{2})\s(\d{6,8})\.\s(\d{2}-\d{2}-\d{2})/ ) {
        my $name = $1;
        my $id = $2 . $3;
        my $dob = $4;

Three repeats of "anything not a pipe followed by a pipe"; capture the name; period; space; capture the two leading letters in the ID; space; capture 6 to 8 digits (see explanation above of unix vs perl handling of the numeric quantifier range); period, space; capture dob.

I'm confused: at one point you state that the id may contain 6 to 8 (inclusive) digits but later write that you wish to allow 6-9 digits. Is that because sometimes a 9-digit id is "legal"?

Arguably ugly; also arguably "(painfully) clear."

Dealing with the permissible alternate characters in the name and with alternate (typo) hyphens in the id (hint: "quantifiers" and "alternation") left as an exercise for the OP.
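One possible take on that exercise, offered as a sketch rather than as ww's own solution: widen the name's character class to the punctuation the OP listed, use alternation for the space-or-double-hyphen separator, and change the quantifier range to `{6,9}`. The sub name `parse_entry` and the returned hash keys are hypothetical, and the pattern assumes upper-case names and exactly three pipes before the name, as in the sample line.

```perl
use strict;
use warnings;

# Hypothetical sub: returns a hash reference on success, nothing on failure.
sub parse_entry {
    my ($dog) = @_;
    return unless $dog =~ /
        ^ (?: [^|]* \| ){3}          # three "not-a-pipe, then pipe" groups
        ( [A-Z '&.-]+? )             # name: capitals, space and the listed punctuation
        \. \s
        ( [A-Z]{2} ) (?: \s | -- )   # two letters, then a space or the "--" typo
        ( \d{6,9} )                  # 6 to 9 digits
        \. \s
        ( \d{2}-\d{2}-\d{2} )        # date of birth
    /x;
    return { name => $1, id => "$2 $3", dob => $4 };
}

for my $dog (
    'A ||118|AVIANN GILDED WILD HONEY. HM 75081701. 02-04-97',
    q{A ||118|AVIANN-GILDED ... 'WILD' & HONEY. HP--09090901. 02-04-97},
) {
    my $e = parse_entry($dog) or next;
    print "$e->{name} / $e->{id} / $e->{dob}\n";
}
```

The name group is non-greedy so that it stops at the first period-space that is followed by something shaped like a registration number, which lets periods appear inside the name itself.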
Re: regexp identify variable number of digits within a sentence by Marshall (Canon) on Sep 01, 2010 at 06:10 UTC
I think it is fine to have a bunch of small steps rather than a single super regex. One thing to consider is that these scraped webpage "fixer" regexes need to be modified all the time. You'll come across more goofy stuff further down the road, so try to be flexible. Performance is usually not a factor at all.

My approach is below. If split could work starting from right to left instead of the other way round, I'd use it! But alas it doesn't do that! Below I worked from right to left and used the \S (non-space) and \s (space) Perl short-cuts.

I think it is fine to use a combination of "fixing" and regex "splitting". In the case as explained so far, it is certainly possible to do everything in one regex. But you've already seen the value in changing the tabs to | characters so that you can see them and print them. Sometimes these intermediate steps work out to be real handy for debugging!

    #!/usr/bin/perl -w
    use strict;

    my @tests = (
        'A ||118|AVIANN GILDED WILD HONEY. HM 75081701. 02-04-97',
        'A ||118|||AVIANN GILDED &^$WILD HONEY HP--09090901. 02-04-97',
    );

    foreach my $test (@tests)
    {
        $test =~ s/^.*\|//;   #remove beginning til last |

        #"fix" possible typo in the registration number
        # HP--09090901. becomes HP 09090901.
        $test =~ s/(\w+)[-]+([\d.]+\s+\S+)$/$1 $2/;

        my ($name,$number,$date) = $test =~ m/^(.*)\s+(\S+\s+\S+)\s+(\S+)$/;
        $number =~ s/\.$//;   #fix possible typo trailing '.'

        print "\n name=$name\n number=$number\n date=$date\n";
    }
    __END__
     name=AVIANN GILDED WILD HONEY.
     number=HM 75081701
     date=02-04-97

     name=AVIANN GILDED &^$WILD HONEY
     number=HP 09090901
     date=02-04-97
Re: regexp identify variable number of digits within a sentence by aquarium (Curate) on Sep 01, 2010 at 00:51 UTC
Standard Unix basic regular expressions (as used by grep and sed) use the backslashed curly braces notation to indicate repeat ranges. Perl regexes don't use the backslashes. In other words, use "m/A{3,4}/" in Perl to match a run of 3 or 4 consecutive uppercase A's (unanchored, it will also match inside a longer run).

Just a note about the way the question is posed, which I've seen a lot. Don't close yourself off from the possibility of a non-regex solution being a better fit for the problem. Parsing can be done in many ways, especially so in Perl. unpack and closure functions come to mind as possibilities.

About the year in one of those fields being only two digits: in my opinion you should convert to a four-digit year, and maybe into a full date that is easy to handle later, e.g. 19700627. Making the transition early saves headaches later, if you're going to have to store the date in a DB or if you're ever going to have to calculate with it or even display it.

the hardest line to type correctly is: stty erase ^H
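A small sketch of the difference between the two notations in Perl. The sample strings are made up for illustration: the first is a 7-digit registration number like the OP's typo case, the second contains the literal text `{6,9}`.

```perl
use strict;
use warnings;

my $reg = 'HM 7508170';    # made-up 7-digit registration number

# Perl syntax: bare braces form the range quantifier
print "bare braces match\n" if $reg =~ /\d{6,9}/;

# In Perl, backslashed braces are literal '{' and '}' characters, so this
# pattern looks for a digit followed by the text "{6,9}" and fails here
print "backslashed braces match\n" if $reg =~ /\d\{6,9\}/;

# The backslashed form only matches when that literal text is present
print "literal text matches\n" if 'x1{6,9}' =~ /\d\{6,9\}/;
```

Run as written, this prints "bare braces match" and "literal text matches", which is consistent with `\d\{6,9\}` silently failing on real catalog lines.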