in reply to regex negative lookahead behaviour

I'm parsing snail mail addresses, particularly for horrible beasts like "456 4 1/2 MILE RD".

Meta advice:

I've been involved with two projects over the years that have had to parse names and addresses. In both cases, what we ended up with was a system that got ~98% right automatically, then kicked out the remaining 2% for vetting by a human.
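To make the "kick out the rest" idea concrete, here's a rough sketch in Python rather than anything we actually ran. The pattern, the suffix list, and the `triage` name are all made up for illustration: anything the simple pattern can't handle goes into a review pile instead of being guessed at.

```python
import re

# Hypothetical pattern for "easy" addresses only: house number, street name, suffix.
# The suffix list is illustrative, not a complete set.
SIMPLE = re.compile(
    r"^(?P<number>\d+)\s+(?P<street>[A-Z]+(?: [A-Z]+)*)\s+(?P<suffix>RD|ST|AVE|LN|DR)$"
)

def triage(addresses):
    """Split addresses into (parsed, review): parsed ones matched the simple
    pattern, review ones are left untouched for a human to vet."""
    parsed, review = [], []
    for raw in addresses:
        match = SIMPLE.match(raw.strip().upper())
        if match:
            parsed.append(match.groupdict())
        else:
            review.append(raw)
    return parsed, review

parsed, review = triage(["123 Main St", "456 4 1/2 MILE RD"])
```

Here "123 Main St" parses automatically, while "456 4 1/2 MILE RD" lands in `review` because the pattern refuses to guess where the house number ends.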

It's a diminishing returns problem: If you model the economics, at some point you have to stop pouring your effort into matching the remaining pool of difficult addresses, and let a human being do it.
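A toy model of that trade-off, with every number and name invented for illustration: assume a one-off batch, assume each round of rule-writing costs a fixed amount and fixes a fixed fraction of whatever is still failing, and assume a human costs a flat amount per address. You stop writing rules once a round saves less than it costs.

```python
def rounds_before_handoff(remaining, fix_fraction, dev_cost, review_cost_each):
    """Count how many rounds of rule-writing pay for themselves.

    remaining        -- addresses currently failing automatic parsing
    fix_fraction     -- share of the remaining failures one more round fixes (assumed constant)
    dev_cost         -- cost of one round of rule-writing (assumed constant, must be > 0)
    review_cost_each -- cost of a human vetting one address
    """
    if dev_cost <= 0:
        raise ValueError("dev_cost must be positive")
    rounds = 0
    # Keep going only while the human-review money saved exceeds the round's cost.
    while remaining * fix_fraction * review_cost_each > dev_cost:
        remaining -= remaining * fix_fraction
        rounds += 1
    return rounds, remaining

rounds, left_for_humans = rounds_before_handoff(10000, 0.5, 500, 0.5)
```

With those made-up figures the first three rounds pay off and the fourth doesn't, so 1250 addresses go to a human. Real costs won't be this tidy, but the shape is the same: the savings shrink every round while the effort doesn't.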

Re: Re: regex negative lookahead behaviour
by BazB (Priest) on Jul 20, 2003 at 11:19 UTC

    Unfortunately, dws's approach isn't always possible.

    The system I work on handles significant volumes of addresses, using dedicated (commercial) software to handle the identification and validation of that information.
    There is in-house processing to help smooth things out, but not every problem can be catered for.

    The volumes involved make it impractical for humans to process the problem records (and in fact some of those problem records are a direct result of human input).

    Even if the volumes were low enough for humans to be able to process exceptions, humans can't get it right all of the time.
    This might be because of a lack of information, poorly laid out information, or just human error.

    As shemp describes, even the reference data used in such validation systems isn't perfect, and this sets the upper limit on what you can reasonably expect to achieve.

    I'd say that you can never expect to get things 100% right, and it might end up being cheaper and/or easier to accept a certain error rate.
    Of course, your client may not accept this, but that's a whole other problem :-)

    Cheers.

    BazB


    If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
    That way everyone learns.