Re: Another regexp question

First, regexp is not the right tool for parsing/analyzing natural language.

However, if you only expect sentences that conform to a predefined pattern/structure, then a regexp (or a simple snippet) can be quite useful for parsing them. If all of your sentences are as simple as the ones you showed us above, I don't think you even need "multiple if's". It is all about understanding and defining the pattern/structure.

In your case, you might come up with rules like these:

subject is a stream of chars;
object is a stream of chars (number?);
A "sentence" is subject + some form of the verb "to be" + object;
A "sentence" ends with "and", ".", ", and", etc.

Once the rules are written down, it is not difficult to come up with a short snippet or a regexp that parses your "sentences" into a collection of "subject-object" pairs.
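
As a rough illustration, the four rules could be turned into one regexp built from named pieces. This is only a sketch: the input string is made up (I am guessing at the shape of your data), and the names $subject, $be, $object and $end are mine, so adjust each piece to what your sentences really look like.

```perl
use strict;
use warnings;

# Made-up input for illustration; replace with your own data.
my $text = 'rabbits are 12, and dogs are 3-5 and the cat is 1.';

# One small pattern per rule (all four names are hypothetical).
my $subject = qr/[A-Za-z][A-Za-z ]*?/;                   # a stream of chars
my $be      = qr/(?:is|are|was|were)/i;                  # some form of "to be"
my $object  = qr/[\w-]+/;                                # chars, digits, hyphens
my $end     = qr/\s*(?:,\s*and\b|\band\b|[.,]|\z)/;      # ", and", "and", "." ...

# Walk the text and collect [subject, object] pairs in order of appearance.
my @pairs;
while ($text =~ /($subject)\s+$be\s+($object)$end/g) {
    push @pairs, [$1, $2];
}

printf "%s => %s\n", @$_ for @pairs;
```

Keeping each rule in its own qr// piece means that when a rule changes (say, a new terminator shows up) only one line changes, and no chain of if's is needed.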

Re: Re: Another regexp question by carric (Beadle) on Nov 19, 2003 at 07:26 UTC
This was a very simplified example of what I want to do, but I was sure there was a better way than doing an if() and grabbing parts of the match over and over on $_. I am keenly interested in your reference to "not using regexp to parse natural language". I am a beginner and have no programming background, so my code is all a really ugly hack. A project I was working on is parsing foreclosure ads. You can find them all over the net, and basically you can scrape the ads (which thus far have been one line per ad) and then parse out the relevant info like price, dates, plat book, etc.: everything useful/relevant to the sale. There is no good format for these ads, and it appears each attorney does their own thing. Sometimes you have an address, sometimes you have a description of the property. I have a butt-ugly hack that can do it to some degree, but I know it could probably qualify for all-time worst code ever written. Thank you for your help!!
Re: Re: Re: Another regexp question by Wassercrats (Initiate) on Nov 19, 2003 at 08:03 UTC
You could probably extract certain key words that you specify, no matter where they appear, but other than that, I don't think you would be able to do what you are asking for, especially when it's an ad you're parsing, which would be in ad-English rather than proper English. It wouldn't be good enough to use: `my @capture = $str =~ /(rabbits|dogs|\d+-\d+)/g;` if the form might change. You would have to make it case insensitive, allow for non-digits, hyphens and parentheses in the phone number, etc. I don't know what the one-line foreclosure ads tend to look like, but I guess you could have prices that look like telephone numbers (without a $) and encounter other problems. Maybe you could find some less complete solution, such as identifying when an ad contains a single string of numbers (with possible commas or periods in the proper places) that's preceded by a dollar sign. If you were hoping for some kind of "search-by" feature, you had better limit it to prices in that format only.
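
A minimal sketch of that "less complete" approach, assuming prices are written with a leading dollar sign, optional thousands commas and optional cents. The ad text below is invented, and the name $ad is only for illustration; real ads may well break these assumptions.

```perl
use strict;
use warnings;

# Invented one-line ad; single quotes keep the $ from being interpolated.
my $ad = 'Sale on 12/05/2003, plat book 12, page 34. Opening bid $84,500.00. Call 555-1234.';

# Every number that directly follows a dollar sign:
# either comma-grouped (84,500) or plain digits (84500), with optional cents.
my @prices = $ad =~ /\$\s*((?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?)/g;

# Only trust the ad when it contains exactly one such amount.
if (@prices == 1) {
    (my $amount = $prices[0]) =~ tr/,//d;   # drop the thousands separators
    print "price: $amount\n";
} else {
    print "no single unambiguous price found\n";
}
```

The date, the plat book numbers and the phone number are skipped because none of them follows a dollar sign, which is the whole point of restricting the search to that one format.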