The Doofer Call: The LOD-LAM star system....

....is not in a galaxy far far away, although I believe it came out of San Francisco so that may not be too much of a stretch. The recent LOD-LAM workshop there on the question of Linked Open Data in Libraries, Archives and Museums seems to have been a very lively and stimulating event and resulted in, amongst other things, a star rating system for linked open cultural metadata.
Mia yesterday posted a question to the MCG list asking for reactions to the scheme, which addresses in particular the issue of rights and rights statements for metadata - both the nature of the licence, which must reach a minimum level of openness, and the publication of that licence/waiver. Specifically she asked whether the fact that even the minimum one-star rating required data to be available for both non-commercial and commercial use was a problem for institutions.

My reply was that I felt it essential, in order for it to count as linked data (so I'm very pleased to see it required for the most basic level of conformance). But here I'd like to expand on that a bit and also start to tease out a distinction that I think has been somewhat ignored: between the use of data to infer, reason, search, analyse, and the re-publication of data.

First, the commercial/non-commercial question. I suppose one could consider that as long as the data isn't behind a paywall or password or some other barrier then it's open, but that's not my view: I think that if it's restricted to a certain group of users then it's not open. Placing requirements on those users (e.g. attribution) is another matter; that's a limitation (and a pain, perhaps) but it's not closing the data off per se, whereas making it NC only is. Since the four star levels in the LOD-LAM scheme all seem to reflect the same belief, that's cool with me.

The commercial use question is a problem that has bedevilled Europeana in recent months, and so it is a very live issue in my mind. The need to restrict the use of the metadata to non-commercial contexts absolutely cripples the API's utility and undermines efforts to create a more powerful, usable, sustainable resource for all, and indeed to drive the creative economy in the way that the European Commission originally envisaged. With a bit of luck and imagination this won't stay a problem for long, because a new data provider agreement will encourage much more permissive licences for the data, and in the meantime a subset of data with open licences (over 3M objects) has been partitioned off and was released this very week as Linked Open Data. Hurrah!

This brings us to the question of how LOD is used and whether we need a more precise understanding of how this might relate to the restrictions (e.g. non-commercial only) and requirements (e.g. giving attribution) that could be attached to data. I see two basic types of usage of someone else's metadata/content: publication e.g. displaying some facts from a 3rd party LOD source in your application; and reasoning with the data, whereby you may use data from 3rd party A to reach data from 3rd party B, but not necessarily republish any of the data from A.

If LOD sources used for reasoning have to be treated in the same way as those used for publication you potentially have a lot more complexity to deal with, because every node involved in a chain of reasoning needs to be checked for conformance with whatever restrictions might apply to the consuming system. When a data source contains data with a mixture of licences you have to check each piece of data individually, which is pretty onerous and will make developers think twice about following any links to that resource. So it's really important that aggregators like Culture Grid and Europeana can apply a single licence to a set of data.

If, on the other hand, licences can be designed that apply only to republication, not to reasoning, then client systems can use LOD without having to check that commercial use is permitted for every step along the way, and without having to give attribution to each source regardless of whether it’s published or not. I'm not sure that Creative Commons licences are really set up to allow for this distinction, although ODC-ODbL might be. Besides, if data is never published to a user interface, who could check whether it had been used in the reasoning process along the way? If my application finds a National Gallery record that references Pieter de Hooch’s ULAN record (just so that we’re all sure we’re talking about the same de Hooch), and I then use that identifier to query, say, the Amsterdam Museum dataset, does ULAN need crediting? Here ULAN is used only to ensure co-reference, of course. What if I used the ULAN record’s statement that he was active in Amsterdam between 1661 and 1684 to query DBpedia and find out what else happened in Amsterdam in the years that he was active there? I still don’t republish any ULAN data, but I use it for reasoning to find the data I do actually publish. At what point am I doing something that requires me to give attribution, or to be bound by restrictions on commercial use? Does the use of ULAN identifiers for co-reference bind a consuming system to the terms of use of ULAN? I guess not, but between this and republishing the ULAN record there’s a spectrum of possible uses.
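To make the difference between the two licensing models a bit more concrete, here is a rough Python sketch of the de Hooch chain. Everything in it is an assumption made for illustration: the Step record, the commercial_ok flag and the two helper functions are hypothetical, and the licence values are invented rather than the real terms of any of these datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    source: str          # dataset consulted at this step of the chain
    commercial_ok: bool  # made-up flag, NOT the dataset's real licence terms
    published: bool      # True only if data from this source reaches the UI


def sources_to_credit(chain, reasoning_counts):
    """Sources whose terms apply: every step, or only the republished ones."""
    return [s.source for s in chain if reasoning_counts or s.published]


def blockers_for_commercial_use(chain, reasoning_counts):
    """Sources that would stop a commercial application using this chain."""
    return [
        s.source
        for s in chain
        if (reasoning_counts or s.published) and not s.commercial_ok
    ]


# National Gallery record -> ULAN identifier -> Amsterdam Museum query.
# Only the Amsterdam Museum data is shown to the user.
chain = [
    Step("National Gallery", commercial_ok=True, published=False),
    Step("ULAN", commercial_ok=False, published=False),
    Step("Amsterdam Museum", commercial_ok=True, published=True),
]

# Licences that bind reasoning as well as republication:
print(sources_to_credit(chain, reasoning_counts=True))
print(blockers_for_commercial_use(chain, reasoning_counts=True))

# Licences that bind republication only:
print(sources_to_credit(chain, reasoning_counts=False))
print(blockers_for_commercial_use(chain, reasoning_counts=False))
```

Under the first model every hop has to be checked and credited, and a single restrictive source in the middle blocks the whole chain; under the second only the source that is actually displayed matters. A real system would also have to cope with mixed licences inside one source, which this sketch ignores.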

Here's an analogy: when writing a book (or a thesis!), if you quote from someone else's work they must be credited - and if it's a big enough chunk you may have to pay them. But if someone's work has merely informed your thinking, perhaps tangentially, and you don't quote them; or if perhaps you started by reading a review paper and end up citing only one of the papers it directed you to, then there's not the same requirement either to seek their permission to use their work or to credit them in the reference list. There's perhaps a good reason to try to do so, because it gives your own work more authority and credibility if you reference sources, but there's not a requirement - in fact it's sometimes hard to find a way to give the credit you wish to someone who's informed your thinking! As with quotations and references, so with licensing data: attributing the source of data you republish is different to giving attribution to something that helped you to get somewhere else; nevertheless, it does your own credibility good to show how you reached your conclusions.

Another analogy: search engines already adopt a practical approach to the question of rights, reasoning and attribution. "Disallow: /" in a robots.txt file amounts to an instruction not to index and search (reason) and therefore not to display content. If this isn't there, then they may crawl your pages, reason with the data they gather, and of course display (publish) it in search results pages. Whilst the content they show there is covered by "fair use" laws in some countries, in others that’s not the case so there has occasionally been controversy about the "publication" part of what they do, and it has been known for some companies to get shirty with Google for listing their content (step forward, Agence France-Presse, for this exemplary foot-shooting). As far as attribution goes, one could argue that this happens through the simple act of linking to the source site. When it comes to the reasoning part of what search engines do, though, there's been no kerfuffle concerning giving attribution for that. No one minds not being credited for their part in the PageRank score of a site they linked to – who pays it any mind at all? – and yet this is absolutely essential to how Google and co. work. To me, this seems akin to the hidden role that linked data sources can play in-between one another.
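For anyone who hasn't looked at how a crawler reads that instruction, here is a minimal sketch using the robots.txt parser in Python's standard library. The may_crawl helper, the two sample files and the example.org URL are placeholders of my own, not anything a particular search engine uses.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt, url, agent="*"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# A site that opts out of crawling altogether.
closed_site = "User-agent: *\nDisallow: /\n"

# A site that only fences off one directory.
mostly_open_site = "User-agent: *\nDisallow: /private/\n"

page = "http://example.org/collection/object/123"  # placeholder URL

print(may_crawl(closed_site, page))       # crawling refused
print(may_crawl(mostly_open_site, page))  # crawling permitted
```

Note that the file only says whether crawling is permitted. It has nothing to say about attribution or commercial use, which is rather the point of the analogy.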

Of course, the “reasoning” problem has quite a different flavour depending upon whether you’re reasoning across distributed data sources or ingesting data into a single system and reasoning there. As Mia noted, the former is not what we tend to see at the moment. All of the good examples I know of digital heritage employing LOD actually use it by ingesting the data and integrating it into the local index, whether that's Dan Pett's nimble PAS work or Europeana's behemoth. But that doesn't mean that it's a good idea for us to build a model that assumes this will always be the case. Right now we're in the earliest stages of the LOD/semweb project really gathering pace - which I believe it finally is. People will do more ambitious things as the data grows, and the current pragmatic paradigm of identifying a data source that could be good for enriching your data and ingesting it into your own store where you can index it and make it actually scale may not stay the predominant one. It makes it hard to go beyond a couple of steps of inference because you can't blindly follow all the links you find in the LOD you ingest and ingest them too – you could end up ingesting the whole of the web of data. As the technology permits and the idea of making more agile steps across the semantic graph beds in I expect we'll see more solutions appear where reasoning is done according to what is found in various linked data sources, not according to what a system designer has pre-selected. As the chains of inference grow longer, the issue of attribution becomes keener, and so in the longer term there will be no escaping the need to be able to reason without giving attribution.

This is the detail we could do with ironing out in licensing LOD, and I’d be pleased to see it discussed in relation to the LOD-LAM star scheme.


Saturday, June 25, 2011
