Extraction number from Text

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Extraction number from Text by moritz (Cardinal) on Jun 14, 2010 at 08:19 UTC
"Now, how can I use OR i.e. | symbol without having to code them between brackets." I'm not sure I understand your question. You need some form of bracketing construct to group the alternatives, because `|` has a rather loose precedence. If the precedence were tighter than concatenation, `KG|OZ|CL` would be parsed as `K(G|O)(Z|C)L`, which you wouldn't like either. If you want to avoid the capturing (i.e. that `(...)` associates the matched text with $1, $2 etc.), you can use `(?:...)` instead of `(...)`. That also does the grouping, but doesn't capture. See perlretut and perlre for more information.
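To illustrate the point about precedence and capturing, here is a minimal sketch. The sample strings and variable names are made up for the demonstration and are not from the original question.

```perl
use strict;
use warnings;

my $text = '(1001KG)';   # hypothetical sample input

# Without grouping, | splits the whole pattern in two:
# "start, digits, KG" OR "OZ at the end". A bare 'OZ' therefore matches.
my $loose = 'OZ' =~ /^\d+KG|OZ$/ ? 1 : 0;

# Capturing group: the unit is returned as a second capture.
my @capturing = $text =~ /(\d+)(KG|OZ|CL)/;

# Non-capturing group: same grouping, but only the number is returned.
my @grouped = $text =~ /(\d+)(?:KG|OZ|CL)/;

print "loose=$loose capturing=@capturing grouped=@grouped\n";
# prints: loose=1 capturing=1001 KG grouped=1001
```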
Re^2: Extraction number from Text by Anonymous Monk on Jun 14, 2010 at 08:23 UTC
Thanks a lot, Moritz. `(?:...)` did the trick. `my @number = 'HOMEPROUD #9613 WALL TILE WASHER HEAVY DUTY (1001KG)' =~ /\(([-+\d.eE]*\d)(?:KG|OZ|CL|LT|LTR|M\b)\)/ig;`
Re: Extraction number from Text by davido (Cardinal) on Jun 14, 2010 at 08:27 UTC
The `(?:.....)` construct allows parenthetical grouping without capturing. `my @qty = 'GARNIER DEODORANT MINERALS - DRY CARE (50OZ)' =~ m/ \( ([+-\d.eE]\d+) (?:KG|OZ|CL|LT|LTR|M)\b \) /igx;`
I also made a couple of functional changes to your RE, which may or may not be appropriate, but which I suspect are in keeping with what you're after:
I allowed more than one digit, with `\d+`. Your RE captures up to two digits, but only one digit if there is a leading +, -, '.', e, or E. My example will capture however many consecutive digits present themselves. Be aware, however, that if you've got some number written in scientific notation, I think you're still only capturing the exponent, not the mantissa.
I moved the word boundary `\b` outside of the grouping brackets so that it applies to KG, OZ, CL, LT, LTR, or M. Your example made it apply only to M. In other words, KG didn't require a word boundary, but M did, as in `M\b`.
I added the `/x` modifier to the RE so that it could be written in chunks with non-significant whitespace. This makes your code more readable.
Have a look at perlre for a description of both `(?:....)` and the `/x` modifier. Dave
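The effect of where `\b` sits can be checked with a small sketch like the following. The input `12KGS` is an invented example chosen so that the unit is followed by another word character; the patterns are simplified versions of the ones discussed above.

```perl
use strict;
use warnings;

# \b inside the group binds only to the last alternative (M).
my $inside  = qr/ (\d+) (?: KG | OZ | M\b ) /ix;

# \b after the group applies to every alternative.
my $outside = qr/ (\d+) (?: KG | OZ | M ) \b /ix;

my $text = '12KGS';   # hypothetical input: "KG" followed by a letter

my ($in)  = $text =~ $inside;    # KG needs no boundary here, so this matches
my ($out) = $text =~ $outside;   # boundary required after KG, so no match

printf "inside=%s outside=%s\n",
    defined $in  ? $in  : 'no match',
    defined $out ? $out : 'no match';
# prints: inside=12 outside=no match
```

The `/x` modifier is what lets both patterns be written with spaces between the pieces.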
Re^2: Extraction number from Text by moritz (Cardinal) on Jun 14, 2010 at 08:44 UTC
"I allowed more than one \d digit, with \d+." It took me a while to grok it, but the original regex did allow more than one digit, and also captured them. That's because there is a `\d` in the character class, and the character class is quantified with a `*`. However, this also allows more than one dot or more than one e, so it recognizes `Ee.3` as a number. I agree that your regex is much easier to read, but it doesn't allow numbers before the exponent (I guess that's what the `e` in the regex is supposed to mean). Further refinements could use Regexp::Common's number regex, or this regex, which parses numbers according to the JSON number specification: `my $number = qr{ -? (?: 0 | [1-9] [0-9]* ) (?: \. [0-9]+ )? (?: [eE] [+-]? [0-9]+ )? }x;` (It might be a bit too restrictive in some cases for parsing numbers "in the wild", but it is still a good inspiration.)
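As a rough sketch of how a JSON-style number pattern behaves, the block below writes the pattern out following the JSON number grammar and tries it on a few strings. The candidate strings, the variable names, and the `KG|OZ` unit list in the last step are assumptions made for the demonstration.

```perl
use strict;
use warnings;

# Number pattern following the JSON grammar: optional sign, integer part
# without leading zeros, optional fraction, optional exponent.
my $number = qr{
    -?
    (?: 0 | [1-9] [0-9]* )
    (?: \. [0-9]+ )?
    (?: [eE] [+-]? [0-9]+ )?
}x;

# Hypothetical candidates: three well-formed numbers, two malformed ones.
my @candidates = ('1001', '-0.5', '1.5e10', 'Ee.3', '01');

my %valid;
$valid{$_} = /\A $number \z/x ? 1 : 0 for @candidates;

print "$_ => $valid{$_}\n" for @candidates;

# Embedding the pattern in an extraction regex, with a shortened unit list.
my ($qty) = '(1.5e3KG)' =~ / \( ($number) (?: KG | OZ ) \) /ix;
print "qty=$qty\n";   # prints: qty=1.5e3
```

Unlike the character-class approach, this rejects `Ee.3`, and it also rejects `01` and a leading `+`, which is the restrictiveness mentioned above.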
Re^3: Extraction number from Text by davido (Cardinal) on Jun 14, 2010 at 08:50 UTC
Great point with respect to the '*' quantifier for the character class. The OP's example, which uses the '*' quantifier, would of course allow non-numbers to be parsed as numbers. For example, "--eeeeeeeeeee1" would be accepted as a number when it's definitely not (although `perl` would still evaluate that string in numeric context, giving it a value of 0). I didn't attempt to address that issue, but it goes to punctuate your next point, which is that Regexp::Common is a nice resource too. If there's a resource that knows how to parse numbers, why write one's own number parser when it (a) takes more time, and (b) possibly introduces bugs? Regexp::Common is the answer to both 'a' and 'b'. Dave
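The over-acceptance is easy to reproduce. In this sketch the OP's number part is anchored on its own, and `looks_like_number` from the core Scalar::Util module is used as an independent check; that function is my choice for the comparison and was not suggested in the thread.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# The number part of the OP's pattern, anchored to the whole string.
my $loose = qr/\A [-+\d.eE]* \d \z/x;

my $junk = '--eeeeeeeeeee1';

# The quantified character class happily consumes the signs and e's.
my $accepted = $junk =~ $loose ? 1 : 0;

# Perl's own notion of "is this string a number?" disagrees.
my $numeric = looks_like_number($junk) ? 1 : 0;

print "accepted=$accepted numeric=$numeric\n";
# prints: accepted=1 numeric=0
```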
Re^4: Extraction number from Text by JavaFan (Canon) on Jun 14, 2010 at 10:08 UTC
Re^4: Extraction number from Text by proceng (Scribe) on Jun 14, 2010 at 13:51 UTC
Re^2: Extraction number from Text by Anonymous Monk on Jun 14, 2010 at 08:41 UTC
Appreciate your input, Dave. Thanks.