Re: Efficient Fuzzy Matching Of An Address

We geo code all addresses then use a radius as the first filter then apply other heuristics to filter after that.

Geo coding raises other issues. For example, some geo coding engines return multiple lat/long pairs (for one address).

We deal with only about 2,500 addresses from 4 data sources with quarterly updates and about a 10% turnover rate. We use human review. We set a strict "similarity" level then generate a list of match candidates. Humans review this list and evaluate as a match or non-match. We then loosen the similarity requirement and generate another candidate list (but filter any non-matches found by the first filter). That candidate list is evaluated by humans. Loosen, (rinse) and repeat.

After matching comes more fun, merging. Put another way, once you have a match that is not exact, how do you reconcile the differences? Given two differences, which is right?

Comment on Re: Efficient Fuzzy Matching Of An Address

Replies are listed 'Best First'.
Re^2: Efficient Fuzzy Matching Of An Address by Limbic~Region (Chancellor) on Aug 20, 2008 at 12:32 UTC
jimX11, I assume that if your geo coding fails, you have a human do review? In other words, someone provides an invalid address due to a typo (for instance). That is exactly what I am doing. I am just trying to make the human's job easier. I want them to pull out the most likely addresses that the input should have been. Cheers - L~R	[reply]
Re^3: Efficient Fuzzy Matching Of An Address by jimX11 (Friar) on Aug 20, 2008 at 20:46 UTC
Yea, our geocoding process involves human review. Properties that fail to geocode are reviewed and most often the address is dirty and needs human cleaning. Some geocoding engines assign what I view as a geo-coding confidence level. Properties that fall below a certain threshold are reviewed also. The candidate lists come from sql statements generated by perl code. Just for kicks, here's an example of the sql (that doesn't use a lat/long filter): insert into match_candidates (row1_id,row2_id,strat_name,status) select t1.row_id, t2.row_id, 'name_address_city_zip_county_units_weak_improved_trim_v5', case when t1.shim_id = t2.shim_id then 'good' else 'bad' end from merge_list t1, merge_list t2 where t1.row_id < t2.row_id -- looking for matches from the AHI-Doug source to some -- other source and ( (t1.source = 'AHI-Doug' or t2.source = 'AHI-Doug') and (t1.sourc +e != t2.source) ) -- no duplicate matches from known phase things -- this can be weakened to show no previous -- matches from _any_ stratagy and ( not exists ( select * from match_candidates m where t1.row_id = m.row1_id and t2.row_id = m.row2_id ) ) -- strategy layers start here (assume at least one) AND -- similar name (weak) t1.name is not null and t2.name is not null and ( regexp_replace(lower(t1.name),'[^a-z ]+','','g') ~ regexp_repla +ce(lower(t2.name),'[^a-z ]+','','g') or regexp_replace(lower(t2.name),'[^a-z ]+','','g') ~ regexp_repla +ce(lower(t1.name),'[^a-z ]+','','g') ) AND -- similar address (weak) t1.address is not null and t2.address is not null and ( regexp_replace(lower(t1.address),'[^a-z ]+','','g') ~ regexp_re +place(lower(t2.address),'[^a-z ]+','','g') or regexp_replace(lower(t2.address),'[^a-z ]+','','g') ~ regexp_re +place(lower(t1.address),'[^a-z ]+','','g') ) AND -- similar city (weak) t1.city is not null and t2.city is not null and ( regexp_replace(lower(t1.city),'[^a-z ]+','','g') ~ regexp_repla +ce(lower(t2.city),'[^a-z ]+','','g') or regexp_replace(lower(t2.city),'[^a-z ]+','','g') ~ regexp_repla +ce(lower(t1.city),'[^a-z ]+','','g') ) [download]	[reply] [d/l]