classifying data

sweetblood has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to come up with a piece of regex to classify data types. I need to classify the data as numeric or non-numeric, and to sub-classify numeric data by whether it has a dollar sign, plus sign, negative sign, commas, or a decimal point. The idea is that I'll be passing fields through this expression and it should return the field's classification. To further complicate matters, a field containing something like "100$00" would be non-numeric. It would only be numeric if the dollar sign is at the beginning, although "+$100" would be numeric, as would "-$100". I've been playing with this for a couple of days, so I'm really getting frustrated. Here's what I've been trying (unsuccessfully):

/^(\+)?(-)?(\$)?(\d)*(\.)?\d*$/

I realize that this is flawed and that this may not even be the best way to go about it. Also, this would only account for numeric data, not the non-numeric, so I still have to take that into consideration. Whatever method I use must be efficient, as there may be literally billions of fields being passed through this routine.
I really didn't expect this to be a real stumbling block when I projected my time estimates for this project. It seemed so trivial. I guess after 20+ years of coding (not Perl) I've gotten cocky; I should've realized by now that nothing that gets that many hits is going to be trivial.


Replies are listed 'Best First'.
Re: classifying data by Abigail-II (Bishop) on Jan 19, 2004 at 16:32 UTC
Well, you could make use of Regexp::Common. For instance, by doing:

```perl
use Regexp::Common;

$has_dollar_sign = $str =~ s/^([-+]?)\$/$1/;
if ($str =~ /^$RE{num}{decimal}{-sep => ','}{-keep}$/) {
    $is_numeric        = 1;
    $sign              = $2;
    $has_decimal_point = $5 ? 1 : 0;
    $has_commas        = $str =~ /,/
}
```

Or you could take its regex and modify it to have an optional leading dollar sign.

Abigail
Re: Re: classifying data by Art_XIV (Hermit) on Jan 19, 2004 at 16:45 UTC
This could help you get started:

```perl
use warnings;
use strict;

while (<DATA>) {
    chomp;
    my $data = $_;
    print "$data: ";
    my $result = ($data =~ /^[+\-]?\$?[0-9.]+$/) ? "numeric" : "non-numeric";
    print "$result\n";
}

__DATA__
1020
$10.21
-1023
+1.024
beer
10$25
-$102.6
A1027
1028$
$
$-1.029
```

BTW, watch out for those asterisk operators in your patterns. You used `\d*`, and `\d*` can legitimately match `''`, since it matches a digit zero or more times! That pesky asterisk operator can lead to 'zero-width' matches, which can drive you nuts when you are starting out with regular expressions.

Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
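To make the zero-width point concrete, here is a minimal sketch that runs the pattern from the original question against a few inputs. The sample strings are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;

# The pattern from the original question: every piece of it is optional.
my $original = qr/^(\+)?(-)?(\$)?(\d)*(\.)?\d*$/;

# Illustrative inputs (assumed, not taken from the thread).
for my $field ('', '+', '$', '.', '+-$', '100') {
    my $verdict = $field =~ $original ? 'matches' : 'no match';
    print "'$field': $verdict\n";
}
```

Every one of these inputs matches, including the empty string, because nothing in the pattern is required to consume a character.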
Re: classifying data by Abigail-II (Bishop) on Jan 19, 2004 at 16:50 UTC
Your regexp will not match `900,000`, but it will match `....`.

Abigail
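Both claims are easy to check with a short sketch against the `[0-9.]+` pattern from the reply being answered. The inputs `1.2.3` and `$10.21` are extra samples added here for illustration.

```perl
use strict;
use warnings;

# The character-class pattern under discussion.
my $pattern = qr/^[+\-]?\$?[0-9.]+$/;

for my $field ('900,000', '....', '1.2.3', '$10.21') {
    printf "%-8s %s\n", $field, $field =~ $pattern ? 'numeric' : 'non-numeric';
}
```

The comma is not in the character class, so `900,000` fails, while any run of dots and digits passes because the class places no limit on how many dots appear.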
Re: Re: classifying data by Art_XIV (Hermit) on Jan 19, 2004 at 17:40 UTC
Re: classifying data by flatline (Novice) on Jan 19, 2004 at 19:19 UTC
This got really complex; I can understand your confusion. Does this work for you? I've tested it as thoroughly as I can think to in a few minutes:

```perl
#!perl
use strict;

my @data = qw/ +20 400.00 $5,000.00 -$860.7 $-26.01 90,000,000
               blah test 45$99.0 $5+8.2 $,000 /;

for (@data) {
    if (/^(\$|-|\+|\$-|\$\+|-\$|\+\$|\d)/) {
        if (/^\d((,\d{3})|(\d)|(\.\d{1,2}))+$/) {
            print "$_ is a numeric!\n";
        }
        elsif (/^(\$|-|\+|\$-|\$\+|-\$|\+\$)\d{1,3}((,\d{3})|(\d)|(\.\d{1,2}))+$/) {
            print "$_ is a dollar amount!\n";
        }
        else {
            print "$_ is non-numeric!\n";
        }
    }
}
```
Re: classifying data by halley (Prior) on Jan 19, 2004 at 17:38 UTC
If you're just trying to come up with a numeric/non-numeric classification, one regex should be okay for most cases.

```perl
$value = undef;
$value = (0 + "$1$3")
    if $thing =~ m/ ^
                    (\-|\+)?        # optional sign: $1
                    (\$)?           # optional dollar sign: $2
                    (  \d+          # at least one digit
                       (,\d\d\d)*   # zero or more comma groups
                       (\.\d*)?     # optional fractional part
                    |  (\.\d+)      # only a fractional part
                    )               # the whole mantissa: $3
                    $
                  /x;
print "numeric! value = $value\n" if defined $value;
```

I haven't tested this, but it should cover all the basic cases except scientific notation. It assumes commas are thousands separators and the decimal point is the fraction separator. You might want to be lenient about leading and trailing spaces, or dollar-before-sign ($-34.00) cases.

-- `[ e d @ h a l l e y . c c ]`
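One caveat if the numeric value matters and not just the classification: Perl's string-to-number conversion stops at the first comma, so `0 + "5,000"` yields 5. A minimal sketch of one way around that follows; `to_number` is a hypothetical helper name, and it assumes the sign and mantissa have already been captured as in the regex above.

```perl
use strict;
use warnings;

# Hypothetical helper: combine a captured sign and a comma-grouped mantissa
# into a plain number.
sub to_number {
    my ($sign, $mantissa) = @_;
    $sign = '' unless defined $sign;   # the sign capture is undef when absent
    $mantissa =~ tr/,//d;              # drop the thousands separators
    return 0 + "$sign$mantissa";
}

print to_number('-',   '5,000.25'),   "\n";
print to_number(undef, '90,000,000'), "\n";
```

Stripping the commas with `tr` before converting keeps the check (the regex) and the conversion as separate steps.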
Re: Re: classifying data by rir (Vicar) on Jan 19, 2004 at 20:21 UTC
This seems to accept `123456,123.00`.

Be well.
Re: Re: Re: classifying data by paulbort (Hermit) on Jan 19, 2004 at 23:43 UTC
Easily fixed, just give the first digit group an explicit count: `\d{1,3} # one to three digits`

-- Spring: Forces, Coiled Again!
Re: Re: Re: classifying data by halley (Prior) on Jan 20, 2004 at 03:42 UTC
Yes, `\d+(,\d{3})*` will accept "123456,123.00". Perl accepts 123456_123.00 as one number, also. If you wish not to be so accepting, then you may have to deal with more than two choices in the key alternation. The suggested expression `\d{1,3}(,\d{3})*` would reject "123456123.00", since it lacks commas.

-- `[ e d @ h a l l e y . c c ]`
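As a rough sketch of what "more than two choices" might look like, the pattern below uses three alternatives: comma-grouped digits, plain digits, and a bare fraction. The names `$strict_number` and `is_strict_numeric` are hypothetical, and the choice of what to accept is an assumption rather than anything settled in the thread.

```perl
use strict;
use warnings;

my $strict_number = qr/
    ^
    [-+]? \$?                                  # optional sign, optional dollar sign
    (?:
        \d{1,3} (?: , \d{3} )+ (?: \. \d* )?   # comma-grouped, optional fraction
      | \d+ (?: \. \d* )?                      # plain digits, optional fraction
      | \. \d+                                 # fraction only
    )
    $
/x;

# Hypothetical helper wrapping the pattern.
sub is_strict_numeric {
    my ($field) = @_;
    return $field =~ $strict_number ? 1 : 0;
}

for my $field ('123456,123.00', '123456123.00', '123,456.00', '.5') {
    printf "%-14s %s\n", $field, is_strict_numeric($field) ? 'accepted' : 'rejected';
}
```

With this arrangement a number either uses commas consistently or not at all, so "123456,123.00" is rejected while "123456123.00" still passes.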
Another method. by grendelkhan (Sexton) on Jan 19, 2004 at 22:38 UTC
Here's my take on it, which I now notice looks decidedly similar to at least one other example. Without the possibility of commas as thousands-separators, it's about half the length it is here. Also, note that the plus/minus must always precede the dollar sign (to fix this, make `[+\-]?\$?` into `(([+\-]?\$?)|(\$?[+\-]?))`). (Can those parens be eliminated somewhat? I don't know off the top of my head.)

```perl
#!/usr/bin/perl -w
# bunched up: /^[+\-]?\$?(?:(\d+(\.\d+)?)|(\d{1,3}(,\d\d\d)+(\.\d+)?))$/
$match = qr/^
    [+\-]?              # optional sign
    \$?                 # optional dollar
    (?:                 # group the two versions so both stay anchored
      (                 # version without commas
        \d+             # one or more digits
        (\.\d+)?        # optional decimal-plus-more-digits
      )
    |
      (                 # version with commas
        \d{1,3}         # one to three digits
        (,\d\d\d)+      # groups of three
        (\.\d+)?        # optional decimal-plus-more-digits
      )
    )
$/x;

while (<DATA>) {
    chomp;
    print "$_: ".(($_ =~ $match) ? 'num' : 'non-num')."\n";
}

__DATA__
1020
$10.21
-1023
+1.024
beer
10$25
-$102.6
A1027
10,000
15,00,000
10,000.001
1028$
$
$-1.029
```

--grendelkhan
Re: classifying data by dominix (Deacon) on Jan 20, 2004 at 10:26 UTC
If your data gets more complex over time, write a parser with Parse::RecDescent, or with Parse::Yapp if you encounter performance issues.

-- dominix
Re: classifying data by Anonymous Monk on Jan 20, 2004 at 21:54 UTC
How about this:

```perl
#!/usr/bin/perl -w
use strict;

my @data = qw/ this123isnot -$100 100_00 100$00 10000 +$100 $+100
               this is not /;

foreach (@data) {
    if (/^(?:(?:[-+]?\$?)|(?:\$?[-+]))(?:\d{1,3}(?:,\d{3})|\d+)(?:\.\d{1,2})?$/) {
        print "$_ : numeric!\n";
    }
    else {
        print "$_ : non-numeric!\n";
    }
}
```

Remember, if you don't need the $1, $2, etc. variables, use (?: ) groups, which are faster since they don't save the result. This looks for the beginning of the line, then an optional - or + followed by or preceded by a $. Then `\d{1,3}(?:,\d{3})` looks for 1-3 digits followed by a comma and 3 digits, OR `\d+` for one or more digits. Then, for the decimal part, we check for the optional `(?:\.\d{1,2})?`, then the end of the line. Also be careful: your original regex will match the empty string (`^$`) as well, since everything is optional.
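For the sub-classification the original question asks about, one possible approach is to validate with a single non-capturing pattern and then test for each trait separately. This is only a sketch under assumptions: `classify` and the trait labels are hypothetical names, and accepting `$-100` as well as `-$100` is a choice made here, not something the thread settles.

```perl
use strict;
use warnings;

# Validation only, so every group is non-capturing.
my $numeric = qr/
    ^
    (?: [-+]? \$? | \$ [-+] )        # sign before or after an optional dollar sign
    (?: \d{1,3} (?: , \d{3} )+       # comma-grouped digits
      | \d+ )                        # or plain digits
    (?: \. \d+ )?                    # optional decimal part
    $
/x;

# Hypothetical helper: returns 'non-numeric', or 'numeric' followed by the
# traits found in the field.
sub classify {
    my ($field) = @_;
    return 'non-numeric' unless $field =~ $numeric;
    my @traits;
    push @traits, 'dollar'  if $field =~ /\$/;
    push @traits, 'sign'    if $field =~ /[-+]/;
    push @traits, 'commas'  if $field =~ /,/;
    push @traits, 'decimal' if $field =~ /\./;
    return join ' ', 'numeric', @traits;
}

printf "%s: %s\n", $_, classify($_) for '100$00', '+$100', '-$1,234.50', '1000';
```

Because the field has already passed validation, the per-trait checks can be simple single-character tests and do not need to repeat the full pattern.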