
This was a very simplified example of what I want to do, but I was sure there was a better way than doing an if() and grabbing parts of the match over and over on $_.

I am keenly interested in your reference to "not using regexp to parse natural language". I am a beginner and have no programming background, so my code is all a really ugly hack. A project I was working on is parsing foreclosure ads. You can find them all over the net, but basically you can scrape the ads (which thus far have been one line per ad) and then parse out the relevant info like price, dates, plat book, etc. Everything useful/relevant to the sale. There is no good format for these ads, and it appears each attorney does their own thing. Sometimes you have an address, sometimes you have a description of the property. I have a butt-ugly hack that can do it to some degree, but I know it could probably qualify for all-time worst code ever written.

Thank you for your help!!

Re: Re: Re: Another regexp question
by Wassercrats (Initiate) on Nov 19, 2003 at 08:03 UTC
    You could probably extract certain key words that you specify, no matter where they appear, but other than that, I don't think you would be able to do what you are asking for, especially when it's an ad you're parsing, which would be in ad-English rather than proper English. It wouldn't be good enough to use:

    my @capture = $str =~ /(rabbits|dogs|\d+-\d+)/g;

    if the form might change. You would have to make it case insensitive, allow for non-digits such as hyphens and parentheses in the phone number, etc. I don't know what the one-line foreclosure ads tend to look like, but I guess you could have prices that look like telephone numbers (without a $) and encounter other problems.
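    As a rough sketch of what that loosening might look like: the sub name extract_keywords and the phone formats it accepts (an optional area code in parentheses or followed by a separator, then seven digits split by a hyphen, dot, or space) are my own assumptions, not something taken from real ads.

```perl
use strict;
use warnings;

# Hypothetical helper: pull key words and phone-like numbers out of one ad line.
# The accepted phone formats are assumptions made for illustration only.
sub extract_keywords {
    my ($str) = @_;
    return $str =~ /
        ( rabbits | dogs                              # key words, any case (see the i flag)
        | (?: \( \d{3} \) \s* | \d{3} [-.\s] )?       # optional area code: "(555) " or "555-"
          \d{3} [-.\s]? \d{4}                         # seven digits, optional separator
        )
    /gix;
}

my $str = 'RABBITS and Dogs for sale, call (555) 123-4567 or 555.1234';
print join(' | ', extract_keywords($str)), "\n";
# prints: RABBITS | Dogs | (555) 123-4567 | 555.1234
```

    Even this version would happily pick up a seven-digit price written without a dollar sign, which is the kind of problem I mean.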

    Maybe you could find some less complete solution, such as identifying when an ad contains a single string of numbers (with possible commas or periods in the proper places) that's preceded by a dollar sign. If you were hoping for some kind of "search-by" feature, maybe you had better make it for prices in that format only.
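    A minimal sketch of that narrower approach is below. The sub name extract_prices and the sample ad text are made up for illustration, and the pattern assumes US-style amounts: a dollar sign, digits with optional thousands commas, and an optional two-digit cents part.

```perl
use strict;
use warnings;

# Hypothetical helper: return every dollar-prefixed amount in one ad line,
# with the thousands commas stripped. Numbers without a leading $ are ignored.
sub extract_prices {
    my ($ad) = @_;
    my @prices;
    while ($ad =~ /\$\s*((?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?)/g) {
        (my $amount = $1) =~ tr/,//d;   # "125,000.00" becomes "125000.00"
        push @prices, $amount;
    }
    return @prices;
}

# Invented sample line, not a real ad.
my $ad = 'Sale on 12/01/2003, opening bid $125,000.00, Plat Book 12, Page 34, call 555-1234';
print join(', ', extract_prices($ad)), "\n";
# prints: 125000.00
```

    Because the dollar sign is required, the date, the plat book numbers, and the phone number are all skipped, at the cost of missing any price that is written without a $.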