Crowdsourcing WALS using Linked Data
The World Atlas of Language Structures project (http://wals.info) is one of the landmarks of digital linguistics. It covers 192 features across 2,678 languages. However, the resulting data matrix is very sparse: of the 514,176 possible datapoints (192 × 2,678), only about 68,000, or 13%, are actually filled.
The database is currently hosted at the Max Planck Institute for Evolutionary Anthropology in Leipzig, and while there are regular updates, there are no plans to open the database to the public. This is understandable given the reputation WALS has acquired over the past years and the security issues in providing write access to a database.
At the same time, the WALS team regularly receives requests from people who want to add information about a certain language or a certain feature. These requests normally cannot be honoured, as no processes are in place to accommodate them.
The issues at stake are thus security, quality control, and provenance. These can be taken care of by taking a distributed approach. Scientists who want to contribute datapoints to WALS can do so on their own web space, and the datapoints are subsequently harvested. WALS 'core' would still be the curated version hosted at the MPI, while WALS 'community' would contain additional datapoints. Provenance metadata will help in gauging which datapoints to trust.
In this blog post, I will outline a possible structure and workflow for WALS 'community'.
Distributed hosting of resources is one of the key concepts of the semantic web. By using common description formats and ontologies, the resources become interoperable. The working group on Open Data in Linguistics of the Open Knowledge Foundation recommends using RDF as a standard for interoperable resources and is currently working on the creation of the Linguistic Linked Open Data Cloud (also see the upcoming MLODE workshop).
These efforts can be exploited for WALS 'community'. Building on work by Kingsley Idehen, all you need is a Dropbox account and a text editor. The long version can be found here; the short version is:
- copy the following fragment into an editor of your choice and replace '9A' and 'jbt' with the feature and the language of your datapoint
- save to the 'Public' folder of your Dropbox, adjusting the name to the feature and language of your choice
- review on http://linkeddata.uriburner.com/about/html/https/dl.dropbox.com/u/31481215/wals-jbt-9a.ttl (adjust file name)
- add your filename to https://docs.google.com/spreadsheet/ccc?key=0Apb_EoY8u4imdDRqWkY3QWFQT2hydGY1TFJYV2tIUGc
- your data are now ready to be harvested
You can of course automate this process with appropriate scripting tools.
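As a minimal sketch of such a script, the following Python snippet generates the Turtle fragment below from a feature, language code, value, and creator name. The inputs ('9A', 'jbt', etc.) are the same example values used in the fragment; the function name and output filename scheme are my own illustration, not part of any existing tool.

```python
# Sketch: generate a WALS 'community' datapoint file in Turtle.
# The template mirrors the fragment shown below in this post; the
# references triples (ISBN, Glottolog id) are omitted here since
# they are optional and source-specific.

TEMPLATE = """@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix value: <http://wals.info/value/> .
@prefix lingtyp: <http://www.galoes.org/ontologies/lingtyp.owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<> a lingtyp:Datapoint .
<> rdfs:label "WALS datapoint {feature}-{language}" .
<> lingtyp:hasValue value:f{feature_lc}-{value} .
<> dcterms:creator "{creator}" .
"""

def make_datapoint(feature, language, value, creator):
    """Return the Turtle text for one datapoint."""
    return TEMPLATE.format(
        feature=feature,
        feature_lc=feature.lower(),
        language=language,
        value=value,
        creator=creator,
    )

if __name__ == "__main__":
    ttl = make_datapoint("9A", "jbt", "3", "Sebastian Nordhoff")
    # Save under a predictable name so the harvester can find it,
    # e.g. wals-jbt-9a.ttl in your Dropbox 'Public' folder.
    with open("wals-jbt-9a.ttl", "w") as f:
        f.write(ttl)
```

From there, only the steps involving the Google spreadsheet still need to be done by hand.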
## Paste this into an empty file
## Turtle Content Start ##
## Prefix Declaration Section -- you don't need to touch this
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix datapoint: <http://wals.info/datapoint/> .
@prefix value: <http://wals.info/value/> .
@prefix lingtyp: <http://www.galoes.org/ontologies/lingtyp.owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix glottoref: <http://www.glottolog.org/resource/reference/id/> .
<> a lingtyp:Datapoint .
### start here
### edit the lines below which start with <>
### Replace 9A with the WALS feature you are describing
### Replace jbt with the WALS code of the language you are describing
### Replace f9a-3 with the value of the datapoint
### replace 'Sebastian Nordhoff' with your name
### replace the ISBN if you have a source with ISBN, delete otherwise
### check whether http://glottolog.org/langdoc has the source you are using and replace the id if it does, delete otherwise
### add your dropbox ID to https://docs.google.com/spreadsheet/ccc?key=0Apb_EoY8u4imdDRqWkY3QWFQT2h...
### add your filename to https://docs.google.com/spreadsheet/ccc?key=0Apb_EoY8u4imdDRqWkY3QWFQT2h...
### you can view the Linked Data version of your datapoint on http://linkeddata.uriburner.com/about/html/https/dl.dropbox.com/u/314812...
#editing starts here
<> rdfs:label "WALS datapoint 9A-jbt" .
<> lingtyp:hasValue value:f9a-3 .
<> dcterms:creator "Sebastian Nordhoff" .
<> dcterms:references <urn:isbn:0-521-57021-2> .
<> dcterms:references glottoref:r89561 .
There is an implementation available at http://www.glottotopia.de/cswals . After uploading a spreadsheet, you get back your RDF files in a zip archive, ready for download and extraction at your favourite hosting service. I have tried to limit the need for manual post-editing; the only thing that still has to be changed in the files is the address of your hosting service.
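On the harvesting side, once a file has been fetched, the relevant triples need to be extracted. The sketch below is a naive line-based scan over a datapoint file of the simple shape shown above; it is illustrative only, and a real harvester would use a proper Turtle parser such as rdflib rather than regular expressions.

```python
import re

# Naive extraction of label, value, and creator from a simple
# WALS 'community' datapoint file. This assumes the flat one-triple-
# per-line layout of the template in this post; it is NOT a general
# Turtle parser.

def extract_datapoint(ttl_text):
    """Return a dict with the label, value, and creator, or None
    for any triple that is missing."""
    patterns = {
        "label": r'rdfs:label\s+"([^"]+)"',
        "value": r'lingtyp:hasValue\s+value:(\S+?)\s*\.',
        "creator": r'dcterms:creator\s+"([^"]+)"',
    }
    result = {}
    for key, pat in patterns.items():
        m = re.search(pat, ttl_text)
        result[key] = m.group(1) if m else None
    return result

sample = '''<> rdfs:label "WALS datapoint 9A-jbt" .
<> lingtyp:hasValue value:f9a-3 .
<> dcterms:creator "Sebastian Nordhoff" .'''

print(extract_datapoint(sample))
```

The extracted creator string, together with the hosting location of the file, is exactly the provenance metadata that would let WALS 'community' users gauge which datapoints to trust.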
I am not too sure how to register the content either. datahub.io would be the best place, I guess, but how many pure linguists would register there?