MIT has announced its first database of annotated English text written by non-native speakers, which is expected to improve machine learning by giving computers more comprehensive natural language processing capabilities.
With over 1 billion speakers, English is the dominant language on the Internet. Yet for a huge number of those speakers, it is a second language.
“Most of the people who speak English in the world or produce English text are non-native speakers,” says project leader Yevgeni Berzak. “This characteristic is often overlooked when we study English scientifically or when we do natural language processing for English.”
The endeavour, overseen by MIT’s Center for Brains, Minds and Machines, comprises over 5,000 sentences from exam papers written by ESL (English as a second language) students. One takeaway of the project is the potential for more accurate grammar correction software, a real boon to anyone who finds those squiggly green lines infuriating.
“The decision to annotate both incorrect and corrected sentences makes the material very valuable,” says Joakim Nivre, Professor of Computational Linguistics at Sweden’s Uppsala University. “I can see, for example, how this could be cast as a machine translation task, where the system learns to translate from ESL to English… the availability of syntactic annotation for both sides opens up more diverse technical approaches.”
The team will present their findings at the annual meeting of the Association for Computational Linguistics this month.