Abney & Bird's Grand Challenge: The Human Language Project
Steven Abney and Steven Bird published a provocative paper (.pdf) at ACL 2010 calling on the computational linguistics community to create a "Universal Corpus", an undertaking that they compare in both scale and potential impact to the Human Genome Project. Here is the abstract:
We present a grand challenge to build a corpus that will include all of the world’s languages, in a consistent structure that permits large-scale cross-linguistic processing, enabling the study of universal linguistics. The focal data types, bilingual texts and lexicons, relate each language to one of a set of reference languages. We propose that the ability to train systems to translate into and out of a given language be the yardstick for determining when we have successfully captured a language. We call on the computational linguistics community to begin work on this Universal Corpus, pursuing the many strands of activity described here, as their contribution to the global effort to document the world’s linguistic heritage before more languages fall silent.
Will the community take up this challenge? Will the linguistics and computational linguistics communities succeed in working together on it? It seems to me that neither community could do it alone, and that succeeding will take better communication between the two fields than we have at present.