in reply to regex matching problem

There are some expression-comment disagreements to fix and improvements to make:
/([\d]+) # match first nonspace ( greedy )
No, that's matching digits.
([\w\s|"|\-|\']+) # capture at word boundary
This is a mess, not a word boundary. The class itself matches the spaces and digits that you want to follow it, so the greedy + overruns and has to backtrack from the $ or @. You could use ([^@\$]+) instead (update: backslashed the $).
\s\s+ # single space followed by multiple spaces
Should be written \s{2,}, but ok.
([\d,\.]+) # capture digit, comma, period
One or more of them, in any order.
\s[^\d]+ # single space, non-digit character
correct, but better as \s\D+
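To make the points above concrete, here is a minimal sketch of the fragments written with /x and comments that agree with the expressions. It is not the original poster's full regex, which the thread only shows in part: the helper name parse_line, the field layout, and the single-space sample line (modeled on the thread's test string) are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the five captured fields, or an empty list
# when the line does not match. The field layout is an assumption.
sub parse_line {
    my ($line) = @_;
    return $line =~ /
        ^(\d+)            # product number: one or more digits
        \s+               # separator whitespace
        ([^@\$]+)         # description: a run that cannot cross the at-sign or dollar sign
        \s+(\d+)          # quantity
        \s*\@\s*          # literal at-sign
        \$\s*([\d,.]+)    # unit price: one or more digits, commas, periods
        \s*=\s*
        \$\s*([\d,.]+)    # line total
    /x;
}

my @fields = parse_line('5422 Texas Home Luggage Tag 6 @ $ 3.95 = $ 23.70');
print join('|', @fields), "\n";
```

The description capture is still greedy, so it backtracks a little to leave room for the quantity, but it can no longer run past the @ the way the original class could.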

Are the strings you're using in your test case the same strings that fail in the actual program?


The PerlMonk tr/// Advocate

Re: Re: regex matching problem
by geektron (Curate) on Jan 20, 2004 at 18:01 UTC
    ok, i guess my comments aren't all proper. :-(

    the original one was whitespaces ...

    the strings, at least on STDOUT, are identical. i just tossed this into the 'real' parser:

    my $string2 = '5422 Texas Home Luggage Tag 6 @ $ 3.95 = $ 23.70';
    print "identical strings \n" if $_ eq $string2;
    and it did, in fact, print "identical strings".
Re: Re: regex matching problem
by geektron (Curate) on Jan 20, 2004 at 18:19 UTC
    ([\w\s|"|\-|\']+) # capture at word boundary
    This is a mess, not a word boundary. You want it to be followed by spaces and digits, which it matches, so it's going to backtrack from the $ or @, so you could do ([^@$]+).

    actually, that mess seems to be necessary. changing it to your suggestion breaks the matching.

    $1 should be set to '5422' in the test string.

      The $ needs to be backslashed, so it isn't parsed as a variable. Then it works. I didn't expect variable interpolation in a character class. Oops.
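A minimal sketch of the backslashed form, assuming a shortened sample string (the variable names $text and $before are made up for illustration):

```perl
use strict;
use warnings;

# Shortened, hypothetical sample; only the part up to the at-sign matters here.
my $text = 'Luggage Tag 6 @ $ 3.95';

# The dollar sign is backslashed inside the class so Perl treats it as a
# literal character instead of the start of a variable name.
my ($before) = $text =~ /^([^@\$]+)/;

print "captured: <", $before, ">\n";
```

The capture stops at the first @ and keeps the trailing space in front of it, so a real parser would likely want to trim the result or anchor on the following fields.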

      The PerlMonk tr/// Advocate
        ah. i'll try it on the upcoming revision of the regex.

        i discovered that "product numbers" in the data files can also look like this: 4444-NC, 3434-43, etc.

        my regex-fu was never that great. it's sure gonna get a workout.
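For the hyphenated product numbers mentioned above, one possible adjustment is to allow an optional hyphen plus an alphanumeric suffix after the digits. This is only a sketch: the helper name product_number is hypothetical, the sample descriptions are invented, and the real data files may contain other formats.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the leading product number, or undef.
# Assumes digits, optionally followed by a hyphen and letters or digits,
# then whitespace before the description.
sub product_number {
    my ($line) = @_;
    my ($id) = $line =~ /^(\d+(?:-[A-Za-z0-9]+)?)\s/;
    return $id;
}

# Product numbers are from the thread; the descriptions are made up.
for my $line ('5422 Texas Home Luggage Tag', '4444-NC Sample Item', '3434-43 Another Item') {
    print product_number($line), "\n";
}
```

The suffix group is non-capturing, so $1 still holds the whole product number and the later captures keep their positions.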