geektron has asked for the wisdom of the Perl Monks concerning the following question:

i inherited a file-parser that *seems* to be broken. (ah, the wonders of a new job.)

i reduced the regex-matching to a minimal test script and commented the regex. it works in the test script but not in the real one.

here's the test case:

    my $string  = ' 5954 Bluebonnet Sun Catcher Horizontal Oval 6"X9" 1 @ $ 39.95 = $ 39.95 ';
    my $string2 = '5422 Texas Home Luggage Tag 6 @ $ 3.95 = $ 23.70';

    print " 1 set to $1\n match \n\n" if (
        $string2 =~
        /([\d]+)              # match first digits ( greedy )
         \s+                  # multiple space
         ([\w\s|"|\-|\']+)    # capture at word boundary
         \s{2,}               # single space followed by multiple spaces
         ([\d,\.]+)           # capture digit, comma, period
         \s\D+                # single space, non-digit character
         ([\d,\.]+)           # capture digit comma period
         \s\D+                # space plus non-digit character
         ([\d,\.]+)           # capture digit comma period
        /ix                   # end regex
    ) ;

and that *works*! when i apply the same regex in the real script ( cut-n-paste between windows) it fails to match. the only differences:
1. no $string. it matches against $_
2. no  if. it just matches, then pushes assignments into an array. basically:
my $prodID = $1; my $prodDescr = $2;
etc.

i've been banging my head on this and getting nowhere FAST. i've looked for hidden chars in the real file ( vi's 'set list' mode ) and found no differences. any clues as to where i should be looking?

added some cleanup. thx Roy Johnson

Replies are listed 'Best First'.
Re: regex matching problem
by Roy Johnson (Monsignor) on Jan 20, 2004 at 17:47 UTC
    There are some expression-comment disagreements to fix and improvements to make:
    /([\d]+) # match first nonspace ( greedy )
    No, that's matching digits.
    ([\w\s|"|\-|\']+) # capture at word boundary
    This is a mess, not a word boundary. You want it to be followed by spaces and digits, which it matches, so it's going to backtrack from the $ or @, so you could do ([^@\$]+) (update: backslashed $).
    \s\s+ # single space followed by multiple spaces
    Should be written \s{2,}, but ok.
    ([\d,\.]+) # capture digit, comma, period
    Any number of them, in any order.
    \s[^\d]+ # single space, non-digit character
    correct, but better as \s\D+
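
A minimal sketch of how the suggestions above could be combined into one pattern. The sample line and its wide column gaps are assumptions (the thread does not preserve the original spacing), and the lazy `+?` on the description is an addition here rather than part of the suggestion; it keeps the column padding out of the capture.

```perl
use strict;
use warnings;

# Hypothetical sample line; the column spacing is assumed, not taken
# from the real data file.
my $line = '5422  Texas Home Luggage Tag      6 @ $  3.95 = $   23.70';

my $order_re = qr/
    (\d+)        # product id: one or more digits
    \s+          # whitespace after the id
    ([^@\$]+?)   # description: lazy run of anything except at-sign and dollar
    \s{2,}       # column gap of two or more whitespace characters
    ([\d,.]+)    # quantity
    \s\D+        # one whitespace, then a run of non-digits
    ([\d,.]+)    # unit price
    \s\D+        # one whitespace, then a run of non-digits
    ([\d,.]+)    # line total
/x;

# In list context a match returns its captures, or an empty list on failure.
my @fields = $line =~ $order_re;

print join('|', @fields), "\n" if @fields;
# prints: 5422|Texas Home Luggage Tag|6|3.95|23.70
```

Inside a character class neither the comma nor the period needs a backslash, so `[\d,.]` is the same class as `[\d,\.]`.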

    Are the strings you're using in your test case the same strings that fail in the actual program?


    The PerlMonk tr/// Advocate
      ok, i guess my comments aren't all proper. :-(

      the original one was whitespaces ...

      the strings, at least on STDOUT, are identical. i just tossed this into the 'real' parser:

      my $string2 = '5422 Texas Home Luggage Tag 6 @ $ 3.95 = $ 23.70';
      print "identical strings \n" if $_ eq $string2;
      and it did, in fact, print "identical strings".
      ([\w\s|"|\-|\']+) # capture at word boundary
      This is a mess, not a word boundary. You want it to be followed by spaces and digits, which it matches, so it's going to backtrack from the $ or @, so you could do ([^@$]+).

      actually, that mess seems to be necessary. changing it to your suggestion breaks the matching.

      $1 should be set to '5422' in the test string.

        The $ needs to be backslashed, so it isn't parsed as a variable. Then it works. I didn't expect variable interpolation in a character class. Oops.
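
A small sketch of the backslashed class in isolation. The sample line is hypothetical; the only point is that with `\$` the class means "any character except @ and $", so the capture stops at the first of either.

```perl
use strict;
use warnings;

# Hypothetical sample line, not taken from the real data file.
my $line = 'Texas Home Luggage Tag  6 @ $ 3.95';

# The backslash keeps the dollar sign literal inside the character class,
# so Perl does not try to treat it as the start of a variable name.
my ($before) = $line =~ /^([^@\$]+)/;

print "[$before]\n";
# prints: [Texas Home Luggage Tag  6 ]
```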

        The PerlMonk tr/// Advocate
Re: regex matching problem
by ysth (Canon) on Jan 20, 2004 at 18:16 UTC
    That regex obviously has some problems (e.g. useless repetition of | characters in a character class) but I'm suspecting a logic error in the "real script" you aren't showing us. Can we see the code that doesn't work to compare to the code you show that you say does work?
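
To illustrate the point about the `|` characters, here is a short sketch (test strings are made up): inside `[...]` a pipe is an ordinary character rather than alternation, so repeating it adds nothing, and the class will accept a literal pipe in the data.

```perl
use strict;
use warnings;

# Class as posted in the thread, anchored so the whole string must fit.
my $as_posted  = qr/^[\w\s|"|\-|\']+$/;
# Same class with the pipe listed once; repeats inside [...] change nothing.
my $one_pipe   = qr/^[\w\s|"\-']+$/;
# Class with no pipe at all, which is probably what was intended.
my $no_pipe    = qr/^[\w\s"'-]+$/;

# Hypothetical test strings.
my $plain = q{Oval 6"X9"};
my $piped = q{Oval|Round};

for my $re ($as_posted, $one_pipe, $no_pipe) {
    printf "%-6s %-6s\n",
        ($plain =~ $re ? 'plain' : '-'),
        ($piped =~ $re ? 'piped' : '-');
}
```

The first two patterns accept both strings; only the third rejects the string containing a literal pipe.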
      i feel like a ....

      i found the problem while trying to throw the stuff into my scratchpad. grrr. after applying some of Roy Johnson's changes, i forgot to add something.

      my testcase had:  \s\D+
      my realcase had:  \s\D
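
A sketch of why the missing `+` matters, using a made-up fragment of an order line. Between the quantity and the price there are several non-digit characters, so a single `\D` cannot bridge the gap.

```perl
use strict;
use warnings;

# Hypothetical fragment: quantity, separator characters, unit price.
my $fragment = '6 @ $ 3.95';

my $one_nondigit  = qr/([\d,.]+)\s\D([\d,.]+)/;    # as in the real script
my $many_nondigit = qr/([\d,.]+)\s\D+([\d,.]+)/;   # as in the test script

my @one  = $fragment =~ $one_nondigit;
my @many = $fragment =~ $many_nondigit;

print 'one  \D : ', (@one  ? "@one"  : 'no match'), "\n";
print 'many \D+: ', (@many ? "@many" : 'no match'), "\n";
# prints:
# one  \D : no match
# many \D+: 6 3.95
```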

      i know there are probably other issues w/ the regex, but for now, it works .... :-\

Re: regex matching problem
by Theo (Priest) on Jan 20, 2004 at 18:25 UTC
    Hi, geektron.
    You don't mention what kind of failure you're getting, and we don't see the code around your 'real script', so this is a shot in the dark.

    It may be that something else is tromping on either $_ or, IMO more likely, the $1, $2 etc, variables with other matches after your regex.
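
A sketch of that failure mode, with made-up input lines. After a failed match, `$1` still holds the value from the last successful match in the same scope, so assigning from it without testing the match can silently reuse old data. Capturing in list context and testing the result avoids this.

```perl
use strict;
use warnings;

# A failed match leaves the capture variables from the last success intact.
'5422 first line' =~ /^(\d+)/;    # succeeds and sets $1
'no digits here'  =~ /^(\d+)/;    # fails, $1 is not reset
my $stale = $1;                   # still the id from the first line

# Safer: take the captures from the match itself and test for success.
my @lines = ('5422  Luggage Tag', 'no product id here');
my @seen;

for my $line (@lines) {
    if (my ($id) = $line =~ /^(\d+)\s/) {
        push @seen, $id;
    }
    else {
        push @seen, 'no match';
    }
}

print "stale: $stale\n";
print "seen:  @seen\n";
```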

    -Theo-
    (so many nodes and so little time ... )