Details, Details, Details

Jim Cowie

Computing Research Laboratory

New Mexico State University


Gazetteers are a valuable resource for natural language processing of texts containing place references. Bi-lingual gazetteers are needed for cross-language information retrieval, machine translation, information extraction, and cross-document summarization. Correctly recognized place names allow all of these processes to produce better quality output, and can give an end user clues that clarify the source and topic of an automatically processed document.

Bi-lingual gazetteers, however, are scarce. Our approach has been to translate our English gazetteer, largely based on the Tipster gazetteer, by automatic transliteration and then to repair the names that are historically different in a particular language. For all languages, transliteration is only a partial solution, as many factors influence the choice of characters to use in the mapping. Most human translators don't adhere to any particular standard, relying instead on their common sense to perform the mapping.
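The transliterate-then-repair approach can be sketched roughly as follows. This is a minimal illustration, not the system described here: the Latin-to-Cyrillic character table, the exception list, and the function names are all hypothetical and deliberately incomplete.

```python
# Toy Latin -> Cyrillic character table (hypothetical, far from complete).
CHAR_MAP = {
    "a": "а", "b": "б", "d": "д", "e": "е", "i": "и", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "r": "р", "s": "с",
    "t": "т", "u": "у", "v": "в",
}

# Names whose conventional form differs from what transliteration gives.
# In practice this repair list would be built per target language.
EXCEPTIONS = {"Moscow": "Москва"}


def transliterate(name):
    """Map each character through CHAR_MAP, keeping its case.

    Characters missing from the table pass through unchanged, which
    shows why transliteration alone is only a partial solution.
    """
    out = []
    for ch in name:
        mapped = CHAR_MAP.get(ch.lower(), ch)
        out.append(mapped.upper() if ch.isupper() else mapped)
    return "".join(out)


def translate_name(name):
    """Prefer the repaired historical form; otherwise transliterate."""
    return EXCEPTIONS.get(name, transliterate(name))


if __name__ == "__main__":
    for place in ("Madrid", "Moscow"):
        print(place, "->", translate_name(place))
```

In this sketch the exception table is consulted first, so a historically established name always overrides the character mapping.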

Producing bi-lingual gazetteers for N languages potentially requires on the order of N*N sets of translation pairs. It would make more sense to map all gazetteers to a common representation. This is particularly attractive because the semantics of places, at least at this level, are relatively simple.
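The saving from a common representation can be made concrete with a small count. The sketch below assumes that each direct mapping is directional (source language to target language) and that a language is never mapped to itself, which gives N*(N-1) mappings, on the order of N*N; the function names are illustrative only.

```python
def direct_mappings(n_languages):
    """Directed language pairs needed without a common representation."""
    return n_languages * (n_languages - 1)


def pivot_mappings(n_languages):
    """Mappings needed when every gazetteer maps to one common resource."""
    return n_languages


if __name__ == "__main__":
    for n in (2, 5, 10):
        print(n, "languages:", direct_mappings(n), "direct vs",
              pivot_mappings(n), "via a common resource")
```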

A combination of Name, Containing Names, Feature Type, and Co-ordinates would be enough. Given any two national NLP gazetteers mapped to this common resource, we can generate all the bi-lingual resources needed for NLP applications.
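One possible shape for such a common record, and for deriving a bi-lingual name list from two gazetteers that use it, is sketched below. The field names, the matching key (feature type plus co-ordinates rounded to one decimal place), and the sample entries are assumptions made for illustration; a real resource would need a more robust matching rule, since rounding can split nearby points across a boundary.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Place:
    name: str                         # name in the gazetteer's own language
    containing: Tuple[str, ...]       # names of containing places
    feature_type: str                 # e.g. "city", "river"
    coords: Tuple[float, float]       # (latitude, longitude) in degrees


def common_key(place):
    """Language-independent key (hypothetical matching rule)."""
    lat, lon = place.coords
    return (place.feature_type, round(lat, 1), round(lon, 1))


def bilingual_pairs(gazetteer_a, gazetteer_b):
    """Pair up names of entries that share a common key."""
    index_b = {common_key(p): p for p in gazetteer_b}
    pairs = {}
    for p in gazetteer_a:
        match = index_b.get(common_key(p))
        if match is not None:
            pairs[p.name] = match.name
    return pairs


# Illustrative sample entries with approximate co-ordinates.
english = [Place("Seville", ("Spain",), "city", (37.39, -5.98))]
spanish = [Place("Sevilla", ("España",), "city", (37.38, -5.99))]

if __name__ == "__main__":
    print(bilingual_pairs(english, spanish))
```

Because matching relies only on the language-independent fields, the same two functions would serve any pair of gazetteers mapped to this record.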

This talk will look at what is currently available for multi-lingual geographic name processing (sadly, less than you might expect) and suggest a strategy for growing a common resource that would be of use in automatic and semi-automatic processing of all types.