Another regexp question

carric has asked for the wisdom of the Perl Monks concerning the following question:

Re: Another regexp question by Roger (Parson) on Nov 19, 2003 at 04:06 UTC
Another single regular expression, using the `@array = $str =~ /regexp/` idiom: `my $str = 'The rabbits is $10 and the dogs are $20. The phone number is 555-1212.'; my @capture = $str =~ /(rabbits|dogs|\d+-\d+)/g; print "$_\n" for @capture;`

Updated: thanks to davido for pointing out that $10 and $20 were being interpolated in my double-quoted string. I have changed the double quotes to single quotes.

To be more elaborate, I have constructed the following example to demonstrate how to capture into a hash and an array. `use strict; use Data::Dumper; my $str = 'The rabbits is $10 and the dogs are $20. ' . 'The phone number is 555-1212, mobile number 0404-120021'; my $animal = "rabbit|dog"; my %prices = $str =~ m/((?:$animal)s?)\s(?:is|are)\s(\$\d+)/g; my @phone = $str =~ m/(\d+-\d+)/g; print Dumper(\%prices); print Dumper(\@phone);`

And the output is: `$VAR1 = { 'dogs' => '$20', 'rabbits' => '$10' }; $VAR1 = [ '555-1212', '0404-120021' ];`

In general, the complexity of the regular expression grows with the number of requirements and with the complexity of your sentence structure. You will have to pick the one best suited to your data. And of course, if you want to parse natural language, automatically recognise what is an animal, and pick the price out of a complex sentence, it will be a mammoth task indeed. Try picking out the prices from the following sentence :-)

I have a dog and two cats, I will charge you $10 for each cat, but I won't sell the dog to you for $10, I will have to charge you $20 extra.
Re: Re: Another regexp question by wolis (Scribe) on Nov 20, 2003 at 04:06 UTC
Hi there. In asking my question for clarification I believe I have answered it myself, but anyway: can I have some clarification on this wonderful line? `my %prices = $str =~ m/((?:$animal)s?)\s(?:is|are)\s(\$\d+)/g;`

As I understand it, the brackets are used to return the values as $1, $2 .. $9. It looks like (?: .. ) is something special and does not return a numbered variable, so all we get out are the two variables $1 and $2, used respectively as the key and value in the hash?

I have often wanted to check whether a string contains one of multiple sub-strings. I assume I could do it like this: `use strict; my $something = 'This is a bang of a bing thing'; if($something =~ m/((?:bing|bong|bang))/i) { print "Found '$1' in '$something'\n"; }`

Woo hoo! `Found 'bang' in 'This is a bang of a bing thing'`

It returned the first one found, which is fair. I wonder if this could return all matches found, in this case 'bang' and 'bing'?

And could I check whether a string contains anything from a list? `... my @list = qw / bing bong bang /; if($something =~ m/(?:@list)/i) ...` naturally does not work :-(

thanks

`___ /\__\ "What is the world coming to?" \/__/ www.wolispace.com`
Re: Re: Re: Another regexp question by Roger (Parson) on Nov 20, 2003 at 04:32 UTC
"It looks like `(?: .. )` is something special and is not returning a numbered variable so all we get out are two variables $1 and $2 used respectively in the hash as the key and value?"

You bet. ;-) The `?:` inside the brackets tells Perl not to capture the pattern inside them. You can find the documentation on `(?:pattern)` in the perlre documentation.

"And could I check for a string contain anything from a list?"

Well, yes you can. The method I use is to construct the search pattern with a join, as the following example demonstrates: `my $something = 'This is a bang of a bing thing'; my @list = qw /bing bong bang/; # want to search for these` then `my $list = join '|', @list; # construct my pattern` and finally `if($something =~ m/($list)/i) { print "Found '$1' in '$something'\n"; }`

If you want to capture all occurrences of the patterns, you could use the `@array = $str =~ m/pattern/g` idiom, which is the same match with the g modifier added: `my @search = $something =~ m/($list)/ig;`

Or you could do this in a while loop: `while ($something =~ m/($list)/ig) { print "Found '$1' in '$something'\n"; }`

The problem with your code is that an array interpolated into a pattern has its elements joined with spaces, so `m/(@list)/i` looks for the literal text "bing bong bang" in the string, and of course it is not found. `use strict; my $something = 'This is a bang of a bing thing bing bong bang'; my @list = qw / bing bong bang /; if ($something =~ m/(@list)/i) { print "Found '$1' in '$something'\n"; }`

And the output is: `Found 'bing bong bang' in 'This is a bang of a bing thing bing bong bang'`
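One thing the join approach leaves open is what happens when a list item contains regex metacharacters. A minimal sketch, assuming the items are meant to be matched as literal strings (the names `@terms`, `$alternation` and `$re`, and the sample text, are made up for illustration): escape each item with `quotemeta` before joining, then precompile the result with `qr//`.

```perl
use strict;
use warnings;

my $text  = 'This is a bang of a bing thing (a+b)';   # hypothetical sample text
my @terms = ('bing', 'bong', 'bang', '(a+b)');        # hypothetical search terms

# Escape metacharacters so every term is matched literally,
# then join the terms with | to build one alternation.
my $alternation = join '|', map { quotemeta($_) } @terms;
my $re          = qr/($alternation)/i;

# In list context, /g returns every captured match in order of appearance.
my @found = $text =~ /$re/g;

print "Found '$_'\n" for @found;
```

Without the `quotemeta` step, the term `(a+b)` would be read as a group containing a quantifier rather than as literal text.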
Re: Re: Re: Re: Another regexp question by wolis (Scribe) on Nov 21, 2003 at 03:45 UTC
Re: Re: Re: Re: Re: Another regexp question by Roger (Parson) on Nov 21, 2003 at 03:53 UTC
Re: Re: Re: Re: Re: Another regexp question by davido (Cardinal) on Nov 21, 2003 at 04:01 UTC
Re: Another regexp question by pg (Canon) on Nov 19, 2003 at 06:53 UTC
First, regexp is not the right tool for parsing or analyzing natural language. However, if you only expect sentences that conform to a certain predefined pattern or structure, then a regexp (or some simple snippet) would be quite useful for analyzing and parsing them. If all of your sentences are as simple as what you showed us above, I don't even think you need "multiple if's". It is all about understanding and defining the pattern/structure. In your case, you might come up with these rules: the subject is a stream of chars; the object is a stream of chars (a number?); a "sentence" is subject + some form of the verb "be" + object; a "sentence" ends with "and", ".", ", and", etc. Then it is not difficult to come up with a short snippet or a regexp that parses your "sentences" into a collection of "subject-object" pairs.
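The rules above can be sketched as a single pattern. This is only one possible reading of them, tried on the sample sentence used elsewhere in this thread; the variable names and the specific choices (a single word as the subject, `is` or `are` as the verb, a dollar amount as the object) are assumptions for illustration, not pg's code.

```perl
use strict;
use warnings;

my $text = 'The rabbits is $10 and the dogs are $20.';

my $sentence = qr/
    (\w+)                         # subject: a run of word characters
    \s+ (?:is|are)                # some form of the verb "be"
    \s+ (\$\d+)                   # object: a dollar sign followed by digits
    (?= \s*,?\s*and\b | \s*\. )   # sentence ends with "and", ", and" or "."
/x;

# Each match yields (subject, object), so list-context matching
# with the g modifier fills the hash pair by pair.
my %pairs = $text =~ /$sentence/g;

print "$_ => $pairs{$_}\n" for sort keys %pairs;
```

The lookahead checks the sentence terminator without consuming it, so the next match can start right after the previous object.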
Re: Re: Another regexp question by carric (Beadle) on Nov 19, 2003 at 07:26 UTC
This was a very simplified example of what I want to do, but I was sure there was a better way than doing an if() and grabbing parts of the match over and over on $_. I am keenly interested in your reference to "not using regexp to parse natural language". I am a beginner and have no programming background, so my code is all a really ugly hack. A project I was working on is parsing foreclosure ads. You can find them all over the net, but basically you can scrape the ads (which thus far have been one line per ad) and then parse out the relevant info like price, dates, plat book, etc.: everything useful or relevant to the sale. There is no good format for these ads, and it appears each attorney does their own thing. Sometimes you have an address, sometimes you have a description of the property. I have a butt-ugly hack that can do it to some degree, but I know it could probably qualify for all-time worst code ever written. Thank you for your help!!
Re: Re: Re: Another regexp question by Wassercrats (Initiate) on Nov 19, 2003 at 08:03 UTC
You could probably extract certain key words that you specify, no matter where they appear, but other than that, I don't think you would be able to do what you are asking for, especially when it's an ad you're parsing, which would be in ad-English rather than proper English. It wouldn't be good enough to use `my @capture = $str =~ /(rabbits|dogs|\d+-\d+)/g;` if the form might change. You would have to make it case insensitive, allow for non-digits, hyphens and parentheses in the phone number, etc. I don't know what the one-line foreclosure ads tend to look like, but I guess you could have prices that look like telephone numbers (without a $) and encounter other problems. Maybe you could find some less complete solution, such as identifying when an ad contains a single string of numbers (with possible commas or periods in the proper places) that's preceded by a dollar sign. If you were hoping for some kind of "search-by" feature, maybe you had better make it for prices in that format only.
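To make that narrower idea concrete, here is a minimal sketch that pulls out only dollar-prefixed amounts, with optional thousands separators and cents. The ad text is invented for illustration, and the pattern is an assumption about what such prices look like, not something checked against real foreclosure ads.

```perl
use strict;
use warnings;

# Invented one-line ad, for illustration only.
my $ad = 'Sale on 12/04/2003, min. bid $125,000.00, call 555-1212, deposit $5000.';

my $price = qr/
    ( \$                        # a literal dollar sign
      (?: \d{1,3} (?:,\d{3})+   # digits grouped by commas, such as 125,000
        | \d+ )                 # or a plain run of digits, such as 5000
      (?: \.\d{2} )?            # optional cents
    )
/x;

# The date and the phone number carry no dollar sign, so they are skipped.
my @prices = $ad =~ /$price/g;

print "$_\n" for @prices;
```

Requiring the dollar sign is what keeps the phone number and the date out of the results; prices written without one would be missed, which is the trade-off described above.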
Re: Another regexp question by QM (Parson) on Nov 19, 2003 at 04:17 UTC
Difficulties abound here. Show us some code, so we can better direct you. If you can specify the input better, and give examples of input/output, we can write something up. Are these statements expected to appear together? Are they sprinkled throughout unformatted text? -QM -- Quantum Mechanics: The dreams stuff is made of