sweetblood has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to come up with a piece of regex to classify data types. I need to classify the data as numeric or non-numeric, and to sub-classify numeric data as numeric with dollar signs, plus signs, negative signs, commas, or decimal points. The idea is that I'll be passing fields through this expression and it should return the field's classification. To further complicate matters, a field containing something like "100$00" would be non-numeric. It would only be numeric if the dollar sign is at the beginning, although "+$100" would be numeric, as would "-$100". I've been playing with this for a couple of days, so I'm really getting frustrated. Here's what I've been trying (unsuccessfully):
/^(\+)?(-)?(\$)?(\d)*(\.)?\d*$/
I realize that this is flawed and that this may not even be the best way to go about it. Also, this would only account for numeric data, not the non-numeric, so I still have to take that into consideration. Whatever method I use must be efficient, as there may be literally billions of fields being passed through this routine.
I really didn't expect this to be a real stumbling block when I projected my time estimates for this project. It seemed so trivial. I guess after 20+ years of coding (not Perl) I've gotten cocky; I should've realized by now that nothing that gets that many hits is going to be trivial.

Replies are listed 'Best First'.
Re: classifying data
by Abigail-II (Bishop) on Jan 19, 2004 at 16:32 UTC
    Well, you could make use of Regexp::Common. For instance, by doing:
    use Regexp::Common;

    $has_dollar_sign = $str =~ s/^([-+]?)\$/$1/;
    if ($str =~ /^$RE{num}{decimal}{-sep => ','}{-keep}$/) {
        $is_numeric        = 1;
        $sign              = $2;
        $has_decimal_point = $5 ? 1 : 0;
        $has_commas        = $str =~ /,/;
    }
    Or you could take its regex and modify it to have an optional leading dollar sign.

    Abigail
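
For readers who would rather not depend on a module, the second suggestion (a single regex with an optional leading dollar sign) might look roughly like the sketch below. It is not taken from Regexp::Common. The pattern, the helper name classify, and the flag names are all hypothetical, and it assumes Perl 5.10 or later for named captures and the // operator. It also assumes the sign must come before the dollar sign and that a number needs at least one integer digit, so ".5" is rejected.

```perl
use strict;
use warnings;

# Hypothetical pattern: sign, then dollar sign, then digits (with or
# without comma grouping), then an optional decimal point and fraction.
my $NUMERIC = qr/
    ^
    (?<sign>   [-+] )?          # optional leading sign
    (?<dollar> \$   )?          # optional dollar sign, after the sign
    (?<int>
        \d{1,3} (?: , \d{3} )+  # comma-grouped thousands
      | \d+                     # or plain digits
    )
    (?: (?<point> \. ) \d* )?   # optional decimal point and fraction
    $
/x;

# Hypothetical helper: returns nothing for a non-numeric field,
# otherwise a hash ref of flags describing the numeric field.
sub classify {
    my ($field) = @_;
    return unless $field =~ $NUMERIC;
    my %m = %+;    # copy the named captures before doing anything else
    return {
        sign    => $m{sign} // '',
        dollar  => defined $m{dollar} ? 1 : 0,
        commas  => index($m{int}, ',') >= 0 ? 1 : 0,
        decimal => defined $m{point} ? 1 : 0,
    };
}

for my $field ('-$1,234.50', '+$100', '100$00', '900,000') {
    my $info = classify($field);
    print "$field: ", ($info ? 'numeric' : 'non-numeric'), "\n";
}
```

Returning a hash ref of flags keeps the match and the sub-classification in one pass, which may matter when billions of fields go through the routine.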

      This could help you get started:

      use warnings;
      use strict;

      while (<DATA>) {
          chomp;
          my $data = $_;
          print "$data: ";
          my $result = ($data =~ /^[+\-]?\$?[0-9.]+$/) ? "numeric" : "non-numeric";
          print "$result\n";
      }

      __DATA__
      1020
      $10.21
      -1023
      +1.024
      beer
      10$25
      -$102.6
      A1027
      1028$
      $
      $-1.029

      BTW, watch out for those asterisk operators in your patterns. You used \d*, and the '*' could legitimately match '', since it matches a digit zero or more times! That pesky asterisk operator can lead to 'zero-width' matches, which can drive you nuts when you are starting out with regular expressions.
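
To make the zero-width point concrete, here is a small sketch that runs the pattern from the original question against a few degenerate fields. The second pattern is only a hypothetical tightened variant that insists on at least one digit; it is an illustration, not a complete classifier.

```perl
use strict;
use warnings;

# The pattern from the original question: every piece is optional.
my $original = qr/^(\+)?(-)?(\$)?(\d)*(\.)?\d*$/;

# Hypothetical tightened variant: at least one digit is required.
my $tightened = qr/^[-+]?\$?\d+(?:\.\d*)?$/;

for my $field ('', '+-$', '.', '100', '-$100.25') {
    printf "%-10s original=%s tightened=%s\n",
        "'$field'",
        ($field =~ $original  ? 'match' : 'no'),
        ($field =~ $tightened ? 'match' : 'no');
}
```

The empty string, a bare "+-$", and a lone "." all satisfy the original pattern because nothing in it is mandatory.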

      Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
        Your regexp will not match 900,000, but it will match .....

        Abigail

Re: classifying data
by flatline (Novice) on Jan 19, 2004 at 19:19 UTC
    This got really complex, I can understand your confusion. Does this work for you? I've tested it as thoroughly as I can think to in a few minutes:
    #!perl
    use strict;

    my @data = qw/ +20 400.00 $5,000.00 -$860.7 $-26.01 90,000,000 blah test 45$99.0 $5+8.2 $,000 /;

    for (@data) {
        if (/^(\$|-|\+|\$-|\$\+|-\$|\+\$|\d)/) {
            if (/^\d((,\d{3})|(\d*)|(\.\d{1,2}))+$/) {
                print "$_ is a numeric!\n";
            }
            elsif (/^(\$|-|\+|\$-|\$\+|-\$|\+\$)\d{1,3}((,\d{3})|(\d*)|(\.\d{1,2}))+$/) {
                print "$_ is a dollar amount!\n";
            }
            else {
                print "$_ is non-numeric!\n";
            }
        }
    }
Re: classifying data
by halley (Prior) on Jan 19, 2004 at 17:38 UTC
    If you're just trying to come up with a numeric/non-numeric classification, one regex *should* be okay for most cases.
    $value = undef;
    $value = (0 + "$1$3") if $thing =~ m/
        ^
        (\-|\+)?            # optional sign: $1
        (\$)?               # optional dollar sign: $2
        (
            \d+             # at least one digit
            (,\d\d\d)*      # zero or more comma groups
            (\.\d*)?        # optional fractional part
          |
            (\.\d+)         # only a fractional part
        )                   # the whole mantissa: $3
        $
    /x;
    print "numeric! value = $value\n" if defined $value;
    I haven't tested this, but it should cover all the basic cases except scientific notation. It assumes commas are thousands-separators and the decimal point is the fraction separator. You might want to be lenient about leading and trailing spaces, or about dollar-before-sign ($-34.00) cases.

    --
    [ e d @ h a l l e y . c c ]

      This seems to accept 123456,123.00.

      Be well.

        Easily fixed, just give the first digit group an explicit count:
        \d{1,3} # at least one digit


        --
        Spring: Forces, Coiled Again!
        Yes, \d+(,\d{3})* will accept "123456,123.00". Perl accepts 123456_123.00 as one number, also. If you wish not to be so accepting, then you may have to deal with more than two choices in the key alternation. The suggested expression \d{1,3}(,\d{3})* would reject "123456123.00", since it lacks commas.

        --
        [ e d @ h a l l e y . c c ]
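
One way to read the "more than two choices in the key alternation" remark is sketched below: the mantissa is either strictly comma-grouped, or has no commas at all, or is a bare fraction. The names $mantissa and $number are hypothetical, and the sign-then-dollar prefix is simply carried over from the expression above.

```perl
use strict;
use warnings;

# Hypothetical three-way mantissa: grouped digits, plain digits, or fraction only.
my $mantissa = qr/
    (?:
        \d{1,3} (?: , \d{3} )+   # strictly grouped: 1,234 or 12,345,678
      | \d+                      # no commas at all: 123456123
    )
    (?: \. \d* )?                # optional fractional part
  | \. \d+                       # fraction only: .5
/x;

my $number = qr/^ [-+]? \$? (?: $mantissa ) $/x;

for my $field ('123456,123.00', '123,456,123.00', '123456123.00', '.5', '1,23') {
    print "$field: ", ($field =~ $number ? 'numeric' : 'non-numeric'), "\n";
}
```

With this split, a field must use commas consistently or not at all, so "123456,123.00" is rejected while both "123,456,123.00" and "123456123.00" are accepted.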

Another method.
by grendelkhan (Sexton) on Jan 19, 2004 at 22:38 UTC
    Here's my take on it, which I now notice looks decidedly similar to at least one other example. Without the possibility of commas as thousands-separators, it's about half the length it is here. Also, note that the plus/minus must always precede the dollar sign; to fix this, make [+\-]?\$? into (([+\-]?\$?)|(\$?[+\-]?)). (Can those parens be eliminated somewhat? I don't know off the top of my head.)
    #!/usr/bin/perl -w

    # bunched up: /^[+\-]?\$?((\d+(\.\d+)?)|(\d{1,3}(,\d\d\d)+(\.\d+)?))$/
    $match = qr/^
        [+\-]?              # optional sign
        \$?                 # optional dollar
        (                   # one version or the other (alternation added; as posted the two were concatenated)
            (               # version without commas
                \d+         # one or more digits
                (\.\d+)?    # optional decimal-plus-more-digits
            )
          |
            (               # version with commas
                \d{1,3}     # one to three digits
                (,\d\d\d)+  # groups of three
                (\.\d+)?    # optional decimal-plus-more-digits
            )
        )
        $
    /x;

    while (<DATA>) {
        chomp;
        print "$_: ".(($_ =~ $match) ? 'num' : 'non-num')."\n";
    }

    __DATA__
    1020
    $10.21
    -1023
    +1.024
    beer
    10$25
    -$102.6
    A1027
    10,000
    15,00,000
    10,000.001
    1028$
    $
    $-1.029
    --grendelkhan
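
On the question of trimming the parentheses in the sign-or-dollar prefix: one possible shortening is (?:[-+]?\$?|\$[-+]), since the first branch already covers every prefix except dollar-then-sign. The sketch below compares the two spellings on a handful of fields. The variable names are hypothetical, and \d+ stands in for the full number pattern to keep the comparison short.

```perl
use strict;
use warnings;

# Prefix as suggested in the post, followed by plain digits.
my $verbose = qr/^(([+\-]?\$?)|(\$?[+\-]?))\d+$/;

# Hypothetical shorter spelling with a single non-capturing group.
my $compact = qr/^(?:[-+]?\$?|\$[-+])\d+$/;

my @fields = ('100', '-100', '$100', '+$100', '-$100',
              '$-100', '$+100', '$$100', '+-100', '-$-100');

for my $field (@fields) {
    my $v = $field =~ $verbose ? 1 : 0;
    my $c = $field =~ $compact ? 1 : 0;
    printf "%-8s verbose=%d compact=%d\n", $field, $v, $c;
}
```

Both forms accept the same prefixes: nothing, a sign, a dollar sign, sign-then-dollar, and dollar-then-sign.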
Re: classifying data
by dominix (Deacon) on Jan 20, 2004 at 10:26 UTC
Re: classifying data
by Anonymous Monk on Jan 20, 2004 at 21:54 UTC

    How about this:

    #!/usr/bin/perl -w
    use strict;

    my @data = qw/ this123isnot -$100 100_00 100$00 10000 +$100 $+100 this is not /;

    foreach (@data) {
        if (/^(?:(?:[-+]?\$?)|(?:\$?[-+]))(?:\d{1,3}(?:,\d{3})*|\d+)(?:\.\d{1,2})?$/) {
            print "$_ : numeric!\n";
        }
        else {
            print "$_ : non-numeric!\n";
        }
    }

    Remember, if you don't need the $1, $2, etc... variables, use (?: ) which are faster since they don't save the result.
    This looks for the beginning of the line, then an optional - or + followed or preceded by a $. Then \d{1,3}(?:,\d{3})* looks for 1-3 digits followed by zero or more groups of a comma and 3 digits, OR \d+ matches one or more digits. Then, for the decimal, we check for the optional (?:\.\d{1,2})?, then the end of the line.
    Also be careful, your original regex will match ^$ as well since everything is optional.
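
Whether non-capturing groups make a measurable difference on a given perl is easy to check with the core Benchmark module. The sketch below is only a rough harness: the sample fields and the slightly simplified patterns are assumptions made for illustration, and the timings will vary by machine and perl version, so no result is claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample fields; real data would give more meaningful timings.
my @fields = ('-$100', '100$00', '10,000.50', 'this123isnot', '+$100');

# The same pattern written with capturing and with non-capturing groups,
# each compiled once with qr// rather than rebuilt per field.
my $capturing     = qr/^([-+]?\$?|\$[-+])(\d{1,3}(,\d{3})*|\d+)(\.\d{1,2})?$/;
my $non_capturing = qr/^(?:[-+]?\$?|\$[-+])(?:\d{1,3}(?:,\d{3})*|\d+)(?:\.\d{1,2})?$/;

# Run each variant 100_000 times over the sample and print a comparison table.
cmpthese(100_000, {
    capturing     => sub { my $n = grep { $_ =~ $capturing }     @fields },
    non_capturing => sub { my $n = grep { $_ =~ $non_capturing } @fields },
});
```

Both patterns accept exactly the same fields; only the bookkeeping for $1, $2, and so on differs.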