Data provenance and data aggregation
Peter Austin, over at Endangered Languages and Cultures, has initiated a discussion on citation practices (with James McElvenny also participating), and it was prompted (at least partly) by some data I have had a role in processing as part of the LEGO project.
He raises a number of important issues, especially relating to making sure that language documenters (and, potentially, speakers) feel that they are getting appropriate credit for their work. I thought it might be worthwhile to describe here how the problems he identified with some of the data being processed by LEGO arose, and to use this discussion to pose some more general questions relating to data aggregation.
Part of the LEGO project has involved conversion of a large legacy database of wordlists covering at least a couple of thousand languages. This database was created by Timothy Usher and Paul Whitehouse, and the LEGO project was working with a version of the database that can be dated to around 2006. (Timothy Usher has his own newer version of the database, and I can help people contact him if they are interested in learning about it.) Our goal in this conversion process has been (i) to use it to develop and test an interoperable format for wordlist data and (ii) to allow this substantial resource to serve as a useful comparative dataset, both to illustrate the potential power of the format and for more general research.
We had access to the original wordlists in the form of Excel spreadsheets (though I believe this itself was a conversion from a ClarisWorks format) with the characters encoded in a non-Unicode font. The spreadsheet format did not allow detailed encoding of metadata, but, in some cases, an author or author-year citation was given at the top of a column of forms drawn from a specific wordlist.
Clearly, such information is not an ideal citation. A full reference would be good and, even better, would be page numbers (or equivalent) for each form. The lack of such citation, I should emphasize, was not due to the fact that the data collectors were not interested in citation. Rather, the spreadsheet software they were using did not make it straightforward to include full citation information in their database while also keeping the data accessible in easy-to-inspect tables. The fact that there are tools which would allow this is more or less irrelevant: This data collection was done without sufficient resources to include someone with database expertise. Best practice wasn't a feasible option.
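To make the conversion problem concrete, here is a minimal sketch of pulling whatever citation strings exist out of a column-per-language spreadsheet exported to CSV. The layout (a gloss column followed by language columns, with an optional parenthesized author-year citation in the header row) is an assumption for illustration only, not the actual Usher and Whitehouse format:

```python
# Sketch: recover per-column citation strings from a legacy wordlist
# spreadsheet exported to CSV. Layout is hypothetical, for illustration.
import csv
import io

SAMPLE = """gloss,Diyari (Hercus & Austin 2004),Arabana
water,ngapa,
fire,thurla,wadlhu
"""

def extract_citations(csv_text):
    """Return {language column: citation string or None}."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    citations = {}
    for col in header[1:]:  # skip the gloss column
        # Treat a trailing parenthesized author-year string as the citation.
        if "(" in col and col.endswith(")"):
            name, _, cite = col.partition("(")
            citations[name.strip()] = cite.rstrip(")")
        else:
            citations[col] = None
    return citations

print(extract_citations(SAMPLE))
# → {'Diyari': 'Hercus & Austin 2004', 'Arabana': None}
```

The point of the sketch is that a converter can only ever carry forward what the header row happens to contain; a column with no citation yields `None`, and nothing downstream can conjure the missing reference.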
LEGO's approach to the dataset has been: convert the data first in a way that represents the original content and, after that has been done, see if the data can be enriched. Where we have citation information, we include this in our (OLAC) metadata using a provenance tag along the lines of:
<dcterms:provenance>Hercus &amp; Austin, 2004</dcterms:provenance>
In doing this, we have been operating on the principle that conversion of legacy data should precede enriching it, from which we derive the policy of only including as much citation information as was available in the original resource. We then implement this policy using text strings in a citation portion of the metadata.
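As a sketch of what that policy looks like in practice, the snippet below wraps a raw citation string, however partial, in a `dcterms:provenance` element. The element name and the Dublin Core Terms namespace are real; everything else (the function, the bare element rather than a full OLAC record) is illustrative:

```python
# Sketch: emit a dcterms:provenance element carrying only as much
# citation information as the source resource provided.
import xml.etree.ElementTree as ET

DCTERMS = "http://purl.org/dc/terms/"
ET.register_namespace("dcterms", DCTERMS)

def provenance_element(citation):
    """Wrap a raw (possibly partial) citation string in dcterms:provenance."""
    elem = ET.Element(f"{{{DCTERMS}}}provenance")
    elem.text = citation  # no enrichment: the string goes in as-is
    return elem

xml = ET.tostring(provenance_element("Hercus & Austin, 2004"),
                  encoding="unicode")
print(xml)
```

One incidental benefit of generating the record with an XML library rather than string concatenation is that characters like the ampersand in "Hercus & Austin" are escaped automatically, keeping the metadata well-formed.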
What interests me about Peter Austin's concerns at present is what burden we want to put on data conversion/aggregation projects like LEGO with respect to citation when the original resource falls short of best practice (in this case due to technological limitations, not creator carelessness). My attitude has been: let's convert this material now so other people can use it and so that it's in a better format to enrich going forward. I think this is reasonable for these wordlists created by Usher and Whitehouse since, in my interactions with them, it has been clear that they never intended to present their dataset as not making use of other people's materials. Technology was the problem, not people.
At the same time, I can imagine scenarios where a dataset might be so obviously problematic (or even unethical) that a data aggregator might have to refuse it. And, then, there are borderline cases. For instance, what if the Usher and Whitehouse materials lacked any kind of citation whatsoever? Should they not be included in an aggregator at all? Who determines how to deal with borderline cases? When does resource exclusion start to move towards the realm of censorship?
Of course, all of this discussion leaves many issues open since it focuses on aggregation, which is, at least as of now, a relatively minor issue in the field compared to all the individual scholars who need to make detailed citation decisions every time they write a paper. I've also completely left out the issue of copyright, since that's a whole other very large and ugly can of worms.