dcorbin has asked for the wisdom of the Perl Monks concerning the following question:

Consider this (overly simplified) issue. I have a string, that contains in it an optional "list" of foos. Each foo has a recognizeable and distinct pattern, and of course there are many different specific foos. I want to see if my list of foos contains a certain specific foos. This problem must be solved with a regex (for other reasons). Example: A foo is a one letter an one number. They are separated by by one or more spaces. The list is preceded by a list name and is separated from the list in the "usual fashion".

Example input: bar a2 b2 c3 d3 d4 a1 f2 f3

I want the regex to be true if my list of foos contains d4 and b2.
$_ = /^\w+		# matches list name
	(?:
	\s+
	[A-Za-z]\d
	)*
$/x
will match a list of foos with it's label.

But, I need a way to match specific foos in any place in any order insided the pattern. Is this possible? How? Thanks.

Replies are listed 'Best First'.
Re: Set theory in regexs
by merlyn (Sage) on Sep 02, 2000 at 20:03 UTC
    I tend to get suspicious when I see stuff like:
    This problem must be solved with a regex (for other reasons).
    Because it means you are artificially restricting the implementation to something that doesn't work well, except when you are toying with the language instead of doing real work with it. Often the case with doing homework, but I won't go that far to accuse you of doing a homework problem using the resources of the monestary. {grin}

    The problem you ask is not solved well by a regex. Certainly, you can see if a string has both "d4" and "b2" in it using two forward lookaheads:

    my $good = $string =~ /^(?=.*?(\s|^)d4(\s|$))(?=.*?(\s|^)b2(\s|$))/;
    But this is not particularly efficient, nor will any solution be efficient if you restrict your implementation to "must be solved with a regex".

    Please explain the need for a regex.

    -- Randal L. Schwartz, Perl hacker

      The need for a regex solution is as you surmise, partially arbirtrary (no, I'm not doing homework, it's been 15+ years since I had any to do). The problem is that this particular problem is only a small portion of something I'm trying to solve as part of a large regex. I have considered other non-regex solutions, and I may end up returning to them, but there will be certain grace to solving the "big problem" with regex if practical.

      As for inefficiency, the question becomes "how inefficient?". While what I hope to do is construct a "very massive" regex, If it can be applied to a 2K string and resolved in under a minute on single processer Intel box, that's probably fast enough for my purposes. (I realize that the question "How inefficient? is difficult to answer without a far better understanding of the entire pattern and the input, but I thought I'd ask anyway :)

      I've never used lookaheads, so I'll have to "do some research" to understand your example.

      Thanks.
(jeffa) Re: Set theory in regexs
by jeffa (Bishop) on Sep 02, 2000 at 19:55 UTC
    I hope I am understanding your question correctly. This little piece of code takes a string delimited by spaces, splits it up into a list, and exits with a true flag if any one element of the list contains exactly one letter followed by exactly one digit.
    use strict; my $string = 'bar a2 b2 c3 d3 d4 a1 f2 f3'; my $flag = 0; foreach(split(/\s+/, $string)) { $flag++, last if /\w\d/; } print "foo's found in list\n" if $flag;
    If you wanted to specify a list of foo's to match against, such as only d4 or b2, you could try this instead . . .
    use strict; my @foos = qw(d4 b2); my $string = 'bar a2 b2 c3 d3 d4 a1 f2 f3'; my $flag = 0; foreach my $cand (split(/\s+/, $string)) { foreach(@foos) { $flag++, last if $cand =~ /$_/; } } print "foo's found in list\n" if $flag;
    Hope this helps, Jeff

    UPDATE: Oops, you wanted d4 AND b2 - luckily for me, merlyn has got you covered . . .

Re: Set theory in regexs
by tye (Sage) on Sep 03, 2000 at 09:44 UTC

    Another alternative for doing this with a regexp is also quite bad:

    /\bd4\b.*\bb2\b|\bb2\b.*\bd4\b/

    But it gets really bad if you are looking for more than two members. Let's say we want to match "sets" containing all of c3, d4, and e5:

    /\bc3\b.*(\bd4\b.*\be5\b|\be5\b.*\bd4\b) |\bd4\b.*(\bc3\b.*\be5\b|\be5\b.*\bc3\b) |\be5\b.*(\bc3\b.*\bd4\b|\bd4\b.*\bc3\b)/x

    This gets so bad so fast that I won't even attempt to automate the generation of these regexen for matching larger numbers of elements [though it wouldn't be hard ;> ].

    If you know that these sets never have duplicates, then a fairly simple regex has hopes of working and being not horribly slow:

    my $want= qr/\b(?:c3|d4|e5)\b/; my $match= qr/$want.*?$want.*?$want/;

    So $want matches any member that is one of the members that we want while $match matches a set that contains 3 members that we want. If sets contain no duplicates and we want 3 (unique) members, then $match matches what we want. If use ".*?" because I think it will be faster. YMMV.

            - tye (but my friends call me "Tye")