Re^2: Efficient Fuzzy Matching Of An Address

jimX11,
I assume that if your geo coding fails, you have a human do review? In other words, someone provides an invalid address due to a typo (for instance).

That is exactly what I am doing. I am just trying to make the human's job easier. I want them to pull out the most likely addresses that the input should have been.

Cheers - L~R

Comment on Re^2: Efficient Fuzzy Matching Of An Address

Replies are listed 'Best First'.
Re^3: Efficient Fuzzy Matching Of An Address by jimX11 (Friar) on Aug 20, 2008 at 20:46 UTC
Yea, our geocoding process involves human review. Properties that fail to geocode are reviewed and most often the address is dirty and needs human cleaning. Some geocoding engines assign what I view as a geo-coding confidence level. Properties that fall below a certain threshold are reviewed also. The candidate lists come from sql statements generated by perl code. Just for kicks, here's an example of the sql (that doesn't use a lat/long filter): insert into match_candidates (row1_id,row2_id,strat_name,status) select t1.row_id, t2.row_id, 'name_address_city_zip_county_units_weak_improved_trim_v5', case when t1.shim_id = t2.shim_id then 'good' else 'bad' end from merge_list t1, merge_list t2 where t1.row_id < t2.row_id -- looking for matches from the AHI-Doug source to some -- other source and ( (t1.source = 'AHI-Doug' or t2.source = 'AHI-Doug') and (t1.sourc +e != t2.source) ) -- no duplicate matches from known phase things -- this can be weakened to show no previous -- matches from _any_ stratagy and ( not exists ( select * from match_candidates m where t1.row_id = m.row1_id and t2.row_id = m.row2_id ) ) -- strategy layers start here (assume at least one) AND -- similar name (weak) t1.name is not null and t2.name is not null and ( regexp_replace(lower(t1.name),'[^a-z ]+','','g') ~ regexp_repla +ce(lower(t2.name),'[^a-z ]+','','g') or regexp_replace(lower(t2.name),'[^a-z ]+','','g') ~ regexp_repla +ce(lower(t1.name),'[^a-z ]+','','g') ) AND -- similar address (weak) t1.address is not null and t2.address is not null and ( regexp_replace(lower(t1.address),'[^a-z ]+','','g') ~ regexp_re +place(lower(t2.address),'[^a-z ]+','','g') or regexp_replace(lower(t2.address),'[^a-z ]+','','g') ~ regexp_re +place(lower(t1.address),'[^a-z ]+','','g') ) AND -- similar city (weak) t1.city is not null and t2.city is not null and ( regexp_replace(lower(t1.city),'[^a-z ]+','','g') ~ regexp_repla +ce(lower(t2.city),'[^a-z ]+','','g') or regexp_replace(lower(t2.city),'[^a-z ]+','','g') ~ regexp_repla +ce(lower(t1.city),'[^a-z ]+','','g') ) [download]	[reply] [d/l]

Replies are listed 'Best First'.

Re^3: Efficient Fuzzy Matching Of An Address
by jimX11 (Friar) on Aug 20, 2008 at 20:46 UTC

Yea, our geocoding process involves human review. Properties that fail to geocode are reviewed and most often the address is dirty and needs human cleaning. Some geocoding engines assign what I view as a geo-coding confidence level. Properties that fall below a certain threshold are reviewed also.

The candidate lists come from sql statements generated by perl code.

Just for kicks, here's an example of the sql (that doesn't use a lat/long filter):

insert into match_candidates (row1_id,row2_id,strat_name,status)
select t1.row_id,
       t2.row_id,
       'name_address_city_zip_county_units_weak_improved_trim_v5',
        case
         when t1.shim_id = t2.shim_id then 'good'
         else 'bad'
        end
from merge_list t1, merge_list t2
where t1.row_id < t2.row_id

-- looking for matches from the AHI-Doug source to some
-- other source
and (
      (t1.source = 'AHI-Doug' or t2.source = 'AHI-Doug') and (t1.sourc
+e != t2.source)
)


-- no duplicate matches from known phase things
-- this can be weakened to show no previous
-- matches from _any_ stratagy

and ( not exists (
                    select *
                    from match_candidates m
                    where t1.row_id = m.row1_id
                      and t2.row_id = m.row2_id
                  )
      )

-- strategy layers start here (assume at least one)
AND


-- similar name (weak)
    t1.name is not null
and t2.name is not null
and ( regexp_replace(lower(t1.name),'[^a-z ]+','','g') ~  regexp_repla
+ce(lower(t2.name),'[^a-z ]+','','g')
      or 
      regexp_replace(lower(t2.name),'[^a-z ]+','','g') ~  regexp_repla
+ce(lower(t1.name),'[^a-z ]+','','g')
)

 AND 
-- similar address (weak)
    t1.address is not null
and t2.address is not null
and ( regexp_replace(lower(t1.address),'[^a-z ]+','','g') ~  regexp_re
+place(lower(t2.address),'[^a-z ]+','','g')
      or 
      regexp_replace(lower(t2.address),'[^a-z ]+','','g') ~  regexp_re
+place(lower(t1.address),'[^a-z ]+','','g')
)

 AND 
-- similar city (weak)
    t1.city is not null
and t2.city is not null
and ( regexp_replace(lower(t1.city),'[^a-z ]+','','g') ~  regexp_repla
+ce(lower(t2.city),'[^a-z ]+','','g')
      or 
      regexp_replace(lower(t2.city),'[^a-z ]+','','g') ~  regexp_repla
+ce(lower(t1.city),'[^a-z ]+','','g')
)
[download]

[reply]
[d/l]