Raytheon BBN has developed a new capability to search text and speech in low-resource languages using English search terms. The tool was developed and demonstrated over four years under the IARPA Machine Translation for English Retrieval of Information in Any Language (MATERIAL) program.
The tool retrieves foreign-language documents and recordings that are relevant to an English query and gives the user English translations of the relevant phrases. This allows users to understand the context and meaning without having to speak the language. To prove its efficacy on languages with little data for machine learning, the system was demonstrated on Kazakh, Pashto, Somali, Swahili and Tagalog. It was also tested on Farsi, Bulgarian, Lithuanian and Georgian.
“The system is designed to be applied to any foreign language,” said John Makhoul, Raytheon BBN program manager. “Low-resource languages present a particular challenge to retrieval and translation technologies because of a lack of data for training systems. Raytheon BBN met that challenge by developing techniques to overcome the issue of low data and applied them to an end-to-end system that exceeded the goals of the program.”
According to Carl Rubino, IARPA MATERIAL program manager, “The tools and techniques developed under the program will boost our ability to find, examine and analyze foreign language content without needing to learn the language. For low-resource languages where expertise is minimal, these new capabilities provide a significant advantage.”