Regular Expression, Catching Variables

lev has asked for the wisdom of the Perl Monks concerning the following question:

Hello friends,
My experience with regular expressions is growing, but here I need some help.
I have a file with a typical line as follows:

$line = "2006-01-01,Kims,Watson,406,560(centrifuge, refrig.),569,607(dark room),210-211,101(ultracentrifuge),104-105(crystal growth rooms),660(centrifuge, refrig.)";
(It is a list of fields separated by commas: date, building, Prof (group), room or lab (function), ...)
My goal is to extract each field (between commas, outside any parentheses) into separate variables ($1, $2, $3…).

Making progress, slowly, I came to a promising but puzzling point with the regex:
$line =~ m/(\d{4}-\d\d-\d\d),(\w*),(\w*),(\w*,|\w*\(.*?\),?|\w*-\w*,|\w*-\w*\(.*?\),?)*/; #{8} replaces *

This gives me four of the expected 11 variables, the fourth being the last variable of the line. O.K., I understand that the match takes the last of (all|alternative|room|patterns)*, but I would like to have all the fields captured as variables.
When I use {n} in place of the ultimate '*', I get each alternative, ‘room (function)’ field in turn as n = 1 to 8.
Is there a way to catch all variables, i.e. $4 …$11 in one expression (without looping this all through values of n)?

Also, a more minor point: the use of ',?' allows for catching the end variable, which alone is not followed by a comma. In the case here, with 660(...) as the end variable, using ',?' after a 'room(function)' pattern (\w*(...),?) helps, but after a plain 'room' pattern (\w*,?) the match fails. So, in general, not knowing the end variable, how do I account for the commas to ensure that the last variable is not lost?

Below is my test program.

#!/usr/local/bin/perl

use warnings;
use strict;

my $line = "2006-01-01,Kims,common,406,560(centrifuge,refrig.),569b,607(dark room),210-211,101(ultracentrifuge),104-105(crystal growth rooms),660(centrifuge,refrig.)";

$line =~ m/(\d{4}-\d\d-\d\d),(\w*),(\w*),(\w*,|\w*\(.*?\),?|\w*-\w*,|\w*-\w*\(.*?\),?)*/; #{8} replaces *

print "1:$1\n", "2:$2\n","3:$3\n","4:$4\n","5:$5\n","6:$6\n","7:$7\n","8:$8\n","9:$9\n","10:$10\n","11:$11\n";

Thanks for your time and help, lev

Re: Regular Expression, Catching Variables by suaveant (Parson) on Jun 23, 2009 at 15:56 UTC
I would suggest something like Text::CSV for this, but that isn't always an option. Your fields all seem to follow a rather basic pattern, so to make your life easier I would simplify how you are dealing with this. At a possible minor cost to efficiency, you can just parse through the fields one by one and use a much simpler and MUCH easier to maintain regexp, like so:

#!/usr/local/bin/perl

use warnings;
use strict;

my $line = "2006-01-01,Kims,common,406,560(centrifuge,refrig.),569b,607(dark room),210-211,101(ultracentrifuge),104-105(crystal growth rooms),660(centrifuge,refrig.)";

my @fields;
push @fields, $1 while $line =~ /([^,(]+(?:\([^)]*\))?)/g;

my $i = 1;
print join(', ', map { $i++.": $_" } @fields),"\n";

Which outputs:

1: 2006-01-01, 2: Kims, 3: common, 4: 406, 5: 560(centrifuge,refrig.), 6: 569b, 7: 607(dark room), 8: 210-211, 9: 101(ultracentrifuge), 10: 104-105(crystal growth rooms), 11: 660(centrifuge,refrig.)

Your original regexp is working just fine, but you only have 4 sets of capturing parens, so you only get 4 fields. I would strongly suggest using the x modifier in big regexps like that to improve readability, and also creating variables that hold the regexp pieces for any fields whose pattern you can reuse, so you only have to define each segment once. That should also add readability.

- Ant
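A minimal sketch of that last suggestion, using the same field pattern split into named pieces and written with the x modifier. The names $rx_plain, $rx_note and $rx_field are hypothetical, and the sample line is shortened for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical piece names; together they form the same pattern as above.
my $rx_plain = qr{ [^,(]+ }x;                      # text up to a comma or an open paren
my $rx_note  = qr{ \( [^)]* \) }x;                 # a parenthesized note, commas allowed inside
my $rx_field = qr{ $rx_plain (?: $rx_note )? }x;   # a field with an optional note

# Shortened sample line, assumed to follow the format in the question.
my $line = "2006-01-01,Kims,common,406,560(centrifuge,refrig.),569b,607(dark room)";

# One capturing group applied repeatedly with /g returns every field.
my @fields = $line =~ m{ ($rx_field) }xg;

print scalar(@fields), " fields\n";
print "$_\n" for @fields;
```

Each piece is defined once and can be reused in other patterns; a qr// object keeps its own x flag when it is interpolated into a larger regex.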
Re^2: Regular Expression, Catching Variables by Marshall (Canon) on Jun 23, 2009 at 16:54 UTC
Wow! Most excellent! Just a small addition: I think there is a missing ")", which I added below, right there at the tail end: ")/g". I also changed this to put the tokens directly into an array without the need for "while".

#!/usr/bin/perl -w
use strict;

my $line = "2006-01-01,Kims,Watson,406,560(centrifuge, refrig.),569,607(dark room),210-211,101(ultracentrifuge),104-105(crystal growth rooms),660(centrifuge, refrig.)";

my @tokens = $line =~ m/([^,(]+(?:\([^)]*\))?)/g;

foreach my $token (@tokens) {
    print "$token\n";
}
__END__
Prints:
2006-01-01
Kims
Watson
406
560(centrifuge, refrig.)
569
607(dark room)
210-211
101(ultracentrifuge)
104-105(crystal growth rooms)
660(centrifuge, refrig.)

Update: the only other small refinement would be to add () around the match-global to make it super clear that this is list context:

my @tokens = ($line =~ m/([^,(]+(?:\([^)]*\))?)/g);
Re^3: Regular Expression, Catching Variables by suaveant (Parson) on Jun 24, 2009 at 14:08 UTC
You are right: when I copied it I ended up with a space there instead of a paren, and deleted it. No idea what happened, thanks.

- Ant
Re^4: Regular Expression, Catching Variables by Marshall (Canon) on Jun 26, 2009 at 23:15 UTC
Re: Regular Expression, Catching Variables by AnomalousMonk (Archbishop) on Jun 23, 2009 at 19:26 UTC
I certainly endorse using CPAN for complete solutions or for useful methods of attack. However, sometimes you just have to make your own wheel. The following approach decomposes regexes into much more easily understandable and maintainable parts. While it's a lot more typing to begin with, the gain in robustness and maintainability is, I find, well worth the cost.

use warnings;
use strict;

my $rx_comma = qr{ \s* , \s* }xms;

my $rx_date = qr{ \d{4} - \d\d - \d\d }xms;

my $rx_name      = qr{ [[:alpha:]] (?: '? [[:alpha:]]+)? }xms;
my $rx_hyphenate = qr{ - $rx_name }xms;
my $rx_surname   = qr{ $rx_name $rx_hyphenate? }xms;
my $rx_initial   = qr{ [[:alpha:]] \. }xms;
my $rx_givenname = qr{ $rx_initial | $rx_surname }xms;
my $rx_prof      = qr{ $rx_surname (?: $rx_comma (?: \s* $rx_givenname )+ )? }xms;

# avoid polluting namespace with a bunch of common variable names.
my $rx_facility = do {
    my $room     = qr{ \d{3,4} [[:alpha:]]? }xms;
    my $range    = qr{ $room (?: \s* - \s* $room)? }xms;
    my $rooms    = qr{ $range (?: $rx_comma $range)* }xms;
    my $function = qr{ \( [^)]+ \) }xms;

    qr{ $rooms \s* $function }xms;  # final regex
};

$/ = "";  # paragrep mode

while (my $entry = <DATA>) {
    my ($date, $prof, @facilities) =
        $entry =~ m{ $rx_date | $rx_prof | $rx_facility }xmsg;

    print <<EOS;
date: '$date'
prof: '$prof'
facilities: '@{[ join qq{' \n '}, @facilities ]}'
EOS
}
__DATA__
2006-01-01,O'Reilly,Watson B., 406,560(centrifuge,refrig.), 569b,607(dark room),210-211,101(ultracentrifuge), 104-105(crystal growth rooms),660(centrifuge, refrig.)

2007-02-02, Olsen, Alfa-Betty Z. , 102a-102c, 104(media lab) , 101(writer's lounge)

2008-03-04,Peebles, P.J.E., 1000a - 9999z (physical cosmology lab.), 000-001 (computational cosmology lab)

Output:

>perl regex_parse_fields_1.pl
date: '2006-01-01'
prof: 'O'Reilly,Watson B.'
facilities: '406,560(centrifuge,refrig.)'
 '569b,607(dark room)'
 '210-211,101(ultracentrifuge)'
 '104-105(crystal growth rooms)'
 '660(centrifuge, refrig.)'
date: '2007-02-02'
prof: 'Olsen, Alfa-Betty Z.'
facilities: '102a-102c, 104(media lab)'
 '101(writer's lounge)'
date: '2008-03-04'
prof: 'Peebles, P.J.E.'
facilities: '1000a - 9999z (physical cosmology lab.)'
 '000-001 (computational cosmology lab)'
Re: Regular Expression, Catching Variables by locked_user sundialsvc4 (Abbot) on Jun 23, 2009 at 16:19 UTC
In situations such as this one, I definitely favor using a list, instead of discrete variables such as `$1`. Also, in these situations, I recognize the existence of “a problem that has already been solved by someone else,” and I start trolling through CPAN to find a wedge. I'll study not only the code, but also the approach that is suggested by its authors. I do not want to do any “thing that has already been done.” Hair follicles are a precious thing.
Re: Regular Expression, Catching Variables by oko1 (Deacon) on Jun 23, 2009 at 17:32 UTC
Generally, the right tool for regularly-delimited data like this is 'split'. In your case, you'd probably want to use a regex to get rid of the content you don't want (i.e., the parenthesized bits) and then use 'split', e.g.:

$line =~ s/\([^)]+\)//g;
my @results = split /,/, $line;
print "$_: $results[$_]\n" for 0 .. $#results;

Regexes are usually used for data that's more of a challenge (i.e., does not follow any regular pattern). Having said that, and since you've mentioned that you're doing this as a learning experience, here are a couple of suggestions:

Unless you have a specific reason for doing so, try to avoid using the '*' quantifier in captures (parentheses): it's likely to mislead you, either by matching nothing or by matching too much, so that the remaining captures end up empty or undefined.

A useful technique for capturing data followed by some delimiter is to capture a string of what I call "inverted delimiters":

$string = "abc,def;ghi";
$string =~ /^([^,]+),([^;]+);(.+)$/;

I used that technique in the first snippet, to say "replace all '('s followed by one or more non-')'s, followed by a ')'".

Last of all, you need to have a capture (parenthesis set in your regex) for every variable you expect to create. This is, of course, part of the pain of using a regex for a long, complicated line, and one of the reasons to try to automate the whole thing. You have four captures, and therefore only four variables.

Here's another technique that you may find useful for future reference: you can build a regex out of "pieces", each of which represents a field. The "work" part of this technique is in constructing one or more definitions of what a field is.

# Capture a 'non-comma/non-open-paren' string, optionally
# followed by parens (not captured), optionally followed by a comma
my $s = '([^,(]+)(?:\([^)]+\))?,?';

# Regex consists of 11 of these
my $re = $s x 11;
my @out = $line =~ /^$re$/;
print "$_: $out[$_]\n" for 0 .. $#out;

This is not, as you've probably guessed by now, an uncommon problem. :)

-- "Language shapes the way we think, and determines what we can think about." -- B. L. Whorf
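The count of 11 is fixed by hand in that last snippet, and because the regex is anchored at both ends, a line with a different number of fields would not match at all. A minimal sketch, assuming the number of fields may vary from line to line and that no field is empty, derives the count from the line first. The variables $bare and $n are hypothetical, and the sample line is shortened:

```perl
use strict;
use warnings;

# Shortened sample line, assumed to follow the format in the question.
my $line = "2006-01-01,Kims,Watson,406,560(centrifuge, refrig.),569";

# Same field definition as above: captured text, optional uncaptured
# parenthesized part, optional trailing comma.
my $s = '([^,(]+)(?:\([^)]+\))?,?';

# Count the fields on a copy with the parenthesized parts removed,
# so that commas inside parentheses are not counted.
(my $bare = $line) =~ s/\([^)]+\)//g;
my $n = () = $bare =~ /[^,]+/g;

# Build one capture per field and match the whole line.
my $re  = $s x $n;
my @out = $line =~ /^$re$/;
die "line did not match\n" unless @out;

print "$_: $out[$_]\n" for 0 .. $#out;
```

The empty-list assignment in the middle is the usual idiom for counting the matches of a /g pattern in scalar context.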
Re^2: Regular Expression, Catching Variables by ack (Deacon) on Jun 23, 2009 at 19:17 UTC
That's exactly what I was thinking.

The only problem with a split (where, in the OP's case, the character to split on would be the comma) is that some commas must not be split on: in the OP's data, the descriptions of which lab(s) are being used contain commas that are part of the field.

When I've done this sort of thing, I have used a regex to go into the string, find the commas that I wanted to keep (in this case, any that appear between opening and closing parentheses) and change them to some other character, such as a semicolon, so that the string still carries the information but the character doesn't interfere with the splitting.

I use CSV files a lot and split is almost always my friend. I have rarely run into what the OP is encountering, however, where embedded commas must not be split on. Consequently, on those infrequent occasions, I almost always have to "re-invent" a regex to find all of the non-splitting commas and change them to some other meaningful character (e.g., semicolons) before doing the split. The regex always seems to beg for lookahead or lookbehind, and I'm such a novice with regexes that getting it right is a major effort every time. So I'm ashamed that I can't be of help to the OP for that part.

IMHO, oko1's approach using split() is my preferred approach. But, of course, the OP may prefer or need to use regexes for all of it, and I certainly respect that.

ack Albuquerque, NM
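For what it's worth, the comma-protecting step described above can be sketched without lookahead or lookbehind. This is only an illustration under two assumptions: parentheses are never nested, and the data contains no semicolons of its own. The variable names are hypothetical and the sample line is shortened:

```perl
use strict;
use warnings;

# Shortened sample line, assumed to follow the format in the question.
my $line = "2006-01-01,Kims,Watson,406,560(centrifuge, refrig.),569,607(dark room)";

# Work on a copy: inside every (...) group, turn commas into semicolons.
(my $protected = $line) =~ s{ \( ( [^)]* ) \) }{
    (my $inner = $1) =~ tr/,/;/;
    "($inner)"
}xge;

# Now every remaining comma is a real field separator.
my @fields = split /,/, $protected;

# Put the protected commas back (assumes no genuine semicolons in the data).
tr/;/,/ for @fields;

print "$_\n" for @fields;
```

The e modifier runs the replacement part as Perl code, so the translation only touches text that was matched between a pair of parentheses.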
Re: Regular Expression, Catching Variables by cdarke (Prior) on Jun 23, 2009 at 15:54 UTC
Catching all the variables: just assign your RE to an array, or list. For example:

my @captures = $line =~ m/(\d{4}-\d\d-\d\d),(\w*),(\w*),(\w*,|\w*\(.*?\),?|\w*-\w*,|\w*-\w*\(.*?\),?)*/; #{8} replaces *
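One thing worth checking when trying this: in list context a match without /g returns one value per capturing group in the pattern, so a group quantified with '*' still contributes a single element holding its last iteration. A minimal sketch with a hypothetical short string (not the original line) shows the difference from a single group applied with /g:

```perl
use strict;
use warnings;

# Hypothetical short input, not the line from the question.
my $line = "a,1,2,3";

# Two capturing groups, the second one quantified with '*'.
my @captures = $line =~ m/(\w),(\d,?)*/;
print scalar(@captures), " captures: @captures\n";

# One capturing group applied repeatedly with /g.
my @all = $line =~ m/([^,]+)/g;
print scalar(@all), " matches: @all\n";
```

The first list has two elements, 'a' and '3', while the second has all four fields.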