Re: Regular Expression, Catching Variables

Generally, the right tool for regularly-delimited data like this is 'split'. In your case, you'd probably want to use a regex to get rid of the content you don't want (i.e., the parenthesized bits) and then use 'split', e.g.:

$line =~ s/\([^)]+\)//g;
my @results = split /,/, $line;

print "$_: $results[$_]\n" for 0 .. $#results;

Regexes are usually used for data that's more of a challenge (i.e., data that does not follow any regular pattern). Having said that, and since you've mentioned that you're doing this as a learning experience, here are a couple of suggestions:

Unless you have a specific reason for doing so, try to avoid using the '*' quantifier in captures (parentheses): it's likely to mislead you, either by matching nothing or by matching too much, so that the remaining captures end up empty or undefined.
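To make that pitfall concrete, here is a small sketch. The sample strings are made up for illustration and are not the OP's data:

```perl
use strict;
use warnings;

# Hypothetical sample data, not the OP's actual input.
my $line = "alpha,beta,gamma";

# Greedy '.*' in the first capture swallows as much as it can,
# leaving only the final field for the second capture.
my ($greedy_first, $greedy_rest) = $line =~ /^(.*),(.*)$/;
print "greedy:  '$greedy_first' | '$greedy_rest'\n";

# A negated character class with '+' stops at the first comma.
my ($first, $rest) = $line =~ /^([^,]+),(.+)$/;
print "bounded: '$first' | '$rest'\n";

# '*' is also satisfied by nothing at all: this matches an empty field.
my ($empty) = ",beta" =~ /^([^,]*),/;
print "empty:   '$empty'\n";
```

The first match puts "alpha,beta" in the first capture, which is rarely what was intended; the bounded version yields "alpha" and "beta,gamma".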

A useful technique for capturing data followed by some delimiter is to capture a string of what I call "inverted delimiters":

$string = "abc,def;ghi";
$string =~ /^([^,]+),([^;]+);(.+)$/;

I used that technique in the first snippet, to say "replace every '(' followed by one or more non-')' characters, followed by a ')', with nothing".
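As a rough, runnable sketch of both uses of "inverted delimiters": the variable names and the second sample line are my own assumptions, chosen only for illustration.

```perl
use strict;
use warnings;

my $string = "abc,def;ghi";

# In list context a successful match returns its captures in order,
# so they can be assigned straight to named variables.
my ($first, $second, $third) = $string =~ /^([^,]+),([^;]+);(.+)$/;
print "$first | $second | $third\n";

# The same idea used for deletion: '(' , one or more non-')' , ')'.
# Hypothetical sample line; work on a copy so the original survives.
my $line = "one(x,y),two,three(z)";
(my $stripped = $line) =~ s/\([^)]+\)//g;
print "$stripped\n";
```

This prints "abc | def | ghi" and then "one,two,three". Note that the substitution assumes the parentheses are not nested.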

Last of all, you need to have a capture (parenthesis set in your regex) for every variable you expect to create. This is, of course, part of the pain of using a regex for a long, complicated line - and one of the reasons to try to automate the whole thing. You have four captures, and therefore, only four variables.
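A quick illustration of the "one capture per variable" point, using a made-up five-field record (an assumption, not the OP's line):

```perl
use strict;
use warnings;

# Hypothetical five-field record.
my $record = "red,green,blue,cyan,magenta";

# Four capture groups yield exactly four values; the fifth field is ignored.
my @four = $record =~ /^([^,]+),([^,]+),([^,]+),([^,]+)/;
print scalar(@four), " captured: @four\n";

# One capture group with /g in list context returns every match,
# so the number of values follows the data rather than the pattern.
my @all = $record =~ /([^,]+)/g;
print scalar(@all), " captured: @all\n";
```

The second form is one simple way to let the data decide how many values come back, at the cost of not checking the overall shape of the line.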

Here's another technique that you may find useful for future reference: you can build a regex out of "pieces" each of which represents a field. The "work" part of this technique is in constructing one or more definitions of what a field is.

# Capture a 'non-comma/non-open-paren' string, optionally
# followed by parens (not captured), optionally followed by a comma
my $s = '([^,(]+)(?:\([^)]+\))?,?';
# Regex consists of 11 of these
my $re = $s x 11;

my @out = $line =~ /^$re$/;

print "$_: $out[$_]\n" for 0 .. $#out;
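Here is the same "pieces" idea run against a made-up three-field line (the data and the counts 3 and 4 are assumptions for the demo). It also shows one thing to watch for: because the trailing comma in each piece is optional, asking for more pieces than there are fields can still match, with a field split in two by backtracking.

```perl
use strict;
use warnings;

# Hypothetical three-field line; the OP's real data has more fields.
my $line = "Chem 101(Lab A,Lab B),Smith,Room 12";

# Same per-field piece as in the snippet above.
my $s = '([^,(]+)(?:\([^)]+\))?,?';

# Three pieces for three fields: each capture is one whole field.
my $re3 = $s x 3;
my @ok = $line =~ /^$re3$/;
print join(" | ", @ok), "\n";    # Chem 101 | Smith | Room 12

# Four pieces for three fields: the match still succeeds, because the
# comma is optional and backtracking splits "Room 12" into two captures.
my $re4 = $s x 4;
my @bad = $line =~ /^$re4$/;
print join(" | ", @bad), "\n";   # Chem 101 | Smith | Room 1 | 2
```

So the repeat count needs to equal the real number of fields; if that can vary, counting the fields first or using split is probably the safer route.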

This is not, as you've probably guessed by now, an uncommon problem. :)

--
"Language shapes the way we think, and determines what we can think about."
-- B. L. Whorf


Replies are listed 'Best First'.

Re^2: Regular Expression, Catching Variables
by ack (Deacon) on Jun 23, 2009 at 19:17 UTC

That's exactly what I was thinking. The only problem with split (where, in the OP's case, the character to split on would be the comma) is that there are instances where the commas must *not* be split on. In the OP's case, for example, the "which lab(s) are being used" entries for an activity might be separated by commas that the OP doesn't want to split on.

When I've done this sort of thing I have used a regex to go into the string and find the instances of commas that I wanted to keep (e.g., in this case any that appear between opening and closing parentheses) and change them to some other character such as a semi-colon so that it still carries the information but doesn't interfere with the splitting.
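For what it's worth, that "protect the commas, then split" step can be sketched without lookahead or lookbehind. This is only one possible way to do it, under the assumptions that the parentheses are never nested and that semicolons don't otherwise occur in the data; the sample line is invented.

```perl
use strict;
use warnings;

# Hypothetical sample line; the OP's real data may differ.
my $line = "Chem 101,(Lab A,Lab B),Smith";

# Find each parenthesized group and swap its commas for semicolons.
# The /e flag evaluates the replacement as code; the copy-then-tr
# idiom also works on Perls older than 5.14 (which lack tr's /r flag).
$line =~ s{(\([^)]*\))}{ (my $group = $1) =~ tr/,/;/; $group }ge;

# Now every remaining comma is a real field separator.
my @fields = split /,/, $line;
print "$_: $fields[$_]\n" for 0 .. $#fields;
```

With this input the split yields three fields, the middle one being "(Lab A;Lab B)".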

I use CSV files a lot and split is almost always my friend. I have rarely run into the situation the OP is encountering, however, where embedded commas must *not* be split on.

Consequently, on those infrequent occasions, I almost always have to "re-invent" a regex to find all of the non-splitting commas and change them to some other meaningful character (e.g., semi-colons) before doing the split. The regex always seems to beg for lookahead or lookbehind, and I'm such a novice with regexes that it is repeatedly a major effort to get the regex right. So I'm ashamed that I can't be of help to the OP for that part.

IMHO, okol's approach using split() is the one I prefer. But, of course, the OP may prefer or need to use regexes for all of it and I certainly respect that.

ack Albuquerque, NM
