downer has asked for the wisdom of the Perl Monks concerning the following question:

I am faced with the following problem: given some snippets of text, specifically queries posed to a search engine, I would like to identify geographic entities within the queries. This includes cities, states, etc. Street names would be OK too, but are not necessary.
I realize that many towns, e.g., Springfield, may be located in several states. At this point I'm not concerned with that; I can resolve it later. I would just like to have code that identifies the geographic term "Springfield" within a query.
Do any monks have suggestions for doing this?

Replies are listed 'Best First'.
Re: Best way to Extract Geographic Entities from Text?
by hossman (Prior) on Jan 15, 2009 at 19:40 UTC

    The Google Maps API provides Geocoding functionality. Given some input (which can be unstructured text), it will return as many possible geographic locations as it can find, along with an accuracy (i.e., confidence) rating for each.

    Some manual analysis of the results from "real world" sample input from your search engine should help you find a sweet spot where you can ignore any response from the geocode service that contains too many locations, or locations with too low an accuracy.
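
A minimal sketch of that filtering step, in Python for illustration. It does not call any geocoding service: the `Candidate` record, the accuracy scale, and both thresholds are assumptions made up for this example, and real responses would have to be mapped into this shape first. The thresholds are exactly the "sweet spot" values that manual analysis of sample queries would have to supply.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical, simplified view of one location returned by a geocoder."""
    name: str
    accuracy: int  # assumed scale: a higher number means a more confident match


def filter_candidates(candidates, max_candidates=3, min_accuracy=4):
    """Keep only trustworthy candidates; both thresholds are placeholders to tune.

    A response listing more than max_candidates locations is treated as too
    ambiguous and dropped entirely. Otherwise, locations whose accuracy is
    below min_accuracy are discarded.
    """
    if len(candidates) > max_candidates:
        return []
    return [c for c in candidates if c.accuracy >= min_accuracy]


# Made-up response for a query mentioning "springfield"
response = [
    Candidate("Springfield, IL", 4),
    Candidate("Springfield, MO", 2),
]
kept = filter_candidates(response)
print([c.name for c in kept])
```

Rejecting the whole response when it lists too many locations is one possible reading of "too many locations"; keeping only the top few would be another.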

Re: Best way to Extract Geographic Entities from Text?
by JavaFan (Canon) on Jan 15, 2009 at 18:20 UTC
    There was a presentation about Geocoding in the UK during the London Perl Workshop 2006.

    You could contact the author to see whether he has released anything.

Re: Best way to Extract Geographic Entities from Text?
by kennethk (Abbot) on Jan 15, 2009 at 18:12 UTC
    States are easier since you have a well-defined list (50 names + 50 postal abbreviations + a handful of alternate abbreviations (Mass, Miss, ...)). For cities or anything smaller, the only thing I can think of is catching capitalized words (proper nouns). You then need some lexical flag to differentiate locations from people's names - perhaps prepositions like on, in, at?
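
A rough sketch of both ideas, in Python for illustration: a lookup against a state list, plus a regular expression that picks up capitalized words following a preposition. The state sets here are a deliberately tiny sample rather than the full list, and the function names and the choice of prepositions are assumptions for this example. Note that the capitalization cue depends on the input preserving case, which lowercased web queries will not.

```python
import re

# Tiny illustrative sample; a real table would hold all 50 names,
# all 50 postal abbreviations, and the common alternates.
STATE_NAMES = {"illinois", "massachusetts", "mississippi", "new york"}
STATE_ABBREVIATIONS = {"IL", "MA", "MS", "NY"}  # matched in upper case only
ALTERNATES = {"mass", "miss", "ill"}            # ambiguous with ordinary words

# A lowercase preposition followed by one or more capitalized words.
PREPOSITION_CUE = re.compile(
    r"\b(?:in|on|at|near)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)"
)


def find_states(text):
    """Return state mentions found by list lookup (two-word names first)."""
    words = re.findall(r"[A-Za-z]+", text)
    found = []
    i = 0
    while i < len(words):
        two = " ".join(words[i:i + 2]).lower()
        one = words[i].lower()
        if i + 1 < len(words) and two in STATE_NAMES:
            found.append(two)
            i += 2
        elif one in STATE_NAMES or one in ALTERNATES or words[i] in STATE_ABBREVIATIONS:
            found.append(words[i])
            i += 1
        else:
            i += 1
    return found


def find_cued_places(text):
    """Return capitalized word runs that directly follow a preposition."""
    return PREPOSITION_CUE.findall(text)


print(find_states("pizza in Chicago IL"))
print(find_cued_places("restaurants in Springfield near Lake Michigan"))
```

Postal abbreviations are only matched in upper case so that ordinary words such as "in" or "or" are not mistaken for Indiana or Oregon; the alternates ("miss", "mass", "ill") remain ambiguous and would need extra context to be trusted.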

    Update: On further consideration, this is definitely a job for AI. I note a number of possibilities with search terms like AI, Bayes, and neural net, though a lot of it is labeled alpha.

Re: Best way to Extract Geographic Entities from Text?
by setebos (Beadle) on Jan 15, 2009 at 18:05 UTC
    If you can't identify the pattern, you won't be able to tell the computer what you want,
    unless it's an AI application that has some idea of how to construct the semantic parser.
    What is the concrete question?

Re: Best way to Extract Geographic Entities from Text?
by eff_i_g (Curate) on Jan 15, 2009 at 22:30 UTC
    Are there any specific formats involved (e.g., "Chicago, IL" or "Chicago, Illinois"), or is this open-ended (e.g., "chicagoland" or "Ill.")?
      Because these are web queries, it is open-ended; that is part of the problem.
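
For the variants that do follow a recognizable form ("IL", "Ill.", "Illinois"), one option is to normalize each token to a canonical postal code before doing anything else. The following Python sketch assumes a hand-built variant table (`STATE_VARIANTS`, shown with only a few sample entries) and a hypothetical helper name; truly open-ended terms such as "chicagoland" are simply not recognized by it and would need a separate gazetteer or a geocoding service.

```python
# Sample entries only; a real table would cover every state and known variant.
STATE_VARIANTS = {
    "il": "IL",
    "ill": "IL",
    "illinois": "IL",
    "ma": "MA",
    "mass": "MA",
    "massachusetts": "MA",
}


def normalize_state(token):
    """Map a state-like token to its postal code, or None if it is unknown.

    Surrounding spaces, periods, and commas are stripped and case is
    ignored, so "Ill." and "ILLINOIS" both resolve to "IL".
    """
    key = token.strip(" .,").lower()
    return STATE_VARIANTS.get(key)


for token in ["IL", "Ill.", "Illinois", "chicagoland"]:
    print(token, "->", normalize_state(token))
```

Ignoring case suits lowercased web queries, at the cost of treating ordinary words like "ill" or "ma" as states, so this works best on a token already suspected to be a state (for example, the word after a comma).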