About Me

Web person at the Imperial War Museum; I've just completed a PhD on digital sustainability in museums (the original motivation for this blog was as my research diary). I post occasionally, usually about museum tech stuff, but I'm prone to stray. I welcome comments if you want to take anything further. These are my opinions and should not be attributed to my employer or anyone else (unless they thought of them too). Twitter: @jottevanger

Saturday, June 25, 2011

The LOD-LAM star system....

....is not in a galaxy far far away, although I believe it came out of San Francisco so that may not be too much of a stretch. The recent LOD-LAM workshop there on the question of Linked Open Data in Libraries, Archives and Museums seems to have been a very lively and stimulating event, and resulted in, amongst other things, a star rating system for linked open cultural metadata.
Mia yesterday posted a question to the MCG list asking for reactions to the scheme, which addresses in particular the issue of rights and rights statements for metadata - both the nature of the licence, which must reach a minimum level of openness, and the publication of that licence/waiver. Specifically she asked whether the fact that even the minimum one-star rating required data to be available for both non-commercial and commercial use was a problem for institutions.

My reply was that I felt such openness to be essential in order for the data to count as linked open data (so I'm very pleased to see it required for the most basic level of conformance). But here I'd like to expand on that a bit and also start to tease out a distinction that I think has been somewhat ignored: between the use of data to infer, reason, search and analyse, and the re-publication of data.

First, the commercial/non-commercial question. I suppose one could consider that as long as the data isn't behind a paywall or password or some other barrier then it's open, but that's not my view: I think that if it's restricted to a certain group of users then it's not open. Placing requirements on those users (e.g. attribution) is another matter; that's a limitation (and a pain, perhaps) but it's not closing the data off per se, whereas making it NC-only is. Since the four star levels in the LOD-LAM scheme all seem to reflect the same belief, that's cool with me.

The commercial use question is a problem that has bedevilled Europeana in recent months, and so it is a very live issue in my mind. The need to restrict the use of the metadata to non-commercial contexts absolutely cripples the API's utility and undermines efforts to create a more powerful, usable, sustainable resource for all, and indeed to drive the creative economy in the way that the European Commission originally envisaged. With a bit of luck and imagination this won't stay a problem for long, because a new data provider agreement will encourage much more permissive licences for the data, and in the meantime a subset of data with open licences (over 3M objects) has been partitioned off and was released this very week as Linked Open Data. Hurrah!

This brings us to the question of how LOD is used and whether we need a more precise understanding of how this might relate to the restrictions (e.g. non-commercial use only) and requirements (e.g. giving attribution) that could be attached to data. I see two basic types of usage of someone else's metadata/content: publication, e.g. displaying some facts from a third-party LOD source in your application; and reasoning with the data, whereby you may use data from third party A to reach data from third party B without necessarily republishing any of the data from A.

If LOD sources used for reasoning have to be treated in the same way as those used for publication you potentially have a lot more complexity to deal with*, because every node involved in a chain of reasoning needs to be checked for conformance with whatever restrictions might apply to the consuming system. When a data source contains data under a mixture of licences, so that you have to check each piece of data individually, this is pretty onerous and will make developers think twice about following any links to that resource, so it's really important that aggregators like Culture Grid and Europeana can apply a single licence to a whole set of data.

If, on the other hand, licences can be designed that apply only to republication, not to reasoning, then client systems can use LOD without having to check that commercial use is permitted for every step along the way, and without having to give attribution to each source regardless of whether its data is published or not. I'm not sure that Creative Commons licences are really set up to allow for this distinction, although ODC-ODbL might be. Besides, if data is never published to a user interface, who could check whether it had been used in the reasoning process along the way? If my application finds a National Gallery record that references Pieter de Hooch's ULAN record (just so that we're all sure we're talking about the same de Hooch), and I then use that identifier to query, say, the Amsterdam Museum dataset, does ULAN need crediting? Here ULAN is used only to ensure co-reference, of course. What if I used the ULAN record's statement that he was active in Amsterdam between 1661 and 1684 to query DBPedia and find out what else happened in Amsterdam in the years that he was active there? I still don't republish any ULAN data, but I use it for reasoning to find the data I do actually publish. At what point am I doing something that requires me to give attribution, or to be bound by restrictions on commercial use? Does the use of ULAN identifiers for co-reference bind a consuming system to the terms of use of ULAN? I guess not, but between this and republishing the ULAN record there's a spectrum of possible uses.
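To make that concrete, here's the sort of chain I mean, sketched in jQuery-flavoured JavaScript. Every specific in it (the ULAN URI, the SPARQL endpoint, the property names) is a made-up placeholder rather than the real Getty or Amsterdam Museum interface:

    // Sketch only: the URIs and endpoint below are hypothetical placeholders.
    // The ULAN URI is used purely for co-reference: it steers the query,
    // but nothing from the ULAN record itself gets republished.
    var ulanUri = "http://example.org/ulan/pieter-de-hooch";
    var sparql =
      "SELECT ?object ?title WHERE { " +
      "  ?object <http://purl.org/dc/terms/creator> <" + ulanUri + "> . " +
      "  ?object <http://purl.org/dc/terms/title> ?title . }";
    $.getJSON("http://example.org/amsterdam-museum/sparql?format=json&query=" +
              encodeURIComponent(sparql),
      function (data) {
        // Only the Amsterdam Museum's data ever reaches the page.
        $.each(data.results.bindings, function (i, row) {
          $("#objects").append("<li>" + row.title.value + "</li>");
        });
      });

The question is whether that single identifier, doing its quiet co-reference work in the middle, is enough to bind the whole page to ULAN's terms of use.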

Here's an analogy: when writing a book (or a thesis!), if one quotes from someone else's work they must be credited - and if it's a big enough chunk you may have to pay them. But if someone's work has merely informed your thinking, perhaps tangentially, and you don't quote them; or if perhaps you started by reading a review paper and ended up citing only one of the papers it directed you to, then there's no requirement either to seek their permission to use their work or to credit them in the reference list. There's perhaps a good reason to try to do so, because it gives your own work more authority and credibility if you reference your sources, but there's no requirement - in fact it's sometimes hard to find a way to give the credit you wish to someone who has informed your thinking! As with quotations and references, so with licensing data: attributing the source of data you republish is different to giving attribution to something that helped you to get somewhere else; nevertheless, it does your own credibility good to show how you reached your conclusions.

Another analogy: search engines already adopt a practical approach to the question of rights, reasoning and attribution. "Disallow: /" in a robots.txt file amounts to an instruction not to index and search (reason) and therefore not to display content. If it isn't there, then they may crawl your pages, reason with the data they gather, and of course display (publish) it in search results pages. Whilst the content they show there is covered by "fair use" laws in some countries, in others that's not the case, so there has occasionally been controversy about the "publication" part of what they do, and it has been known for some companies to get shirty with Google for listing their content (step forward, Agence France-Presse, for this exemplary foot-shooting). As far as attribution goes, one could argue that this happens through the simple act of linking to the source site. When it comes to the reasoning part of what search engines do, though, there's been no kerfuffle about giving attribution for that. No one minds not being credited for their part in the PageRank score of a site they linked to (who pays it any mind at all?), and yet this is absolutely essential to how Google and co. work. To me, this seems akin to the hidden role that linked data sources can play in between one another.
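For reference, the blanket version of that instruction mentioned above (block all well-behaved crawlers from the whole site) is a two-line robots.txt:

    User-agent: *
    Disallow: /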

Of course, the "reasoning" problem has quite a different flavour depending upon whether you're reasoning across distributed data sources or ingesting data into a single system and reasoning there. As Mia noted, the former is not what we tend to see at the moment. All of the good examples I know of digital heritage employing LOD actually use it by ingesting the data and integrating it into the local index, whether that's Dan Pett's nimble PAS work or Europeana's behemoth. But that doesn't mean it's a good idea for us to build a model that assumes this will always be the case. Right now we're in the earliest stages of the LOD/semweb project really gathering pace (which I believe it finally is). People will do more ambitious things as the data grows, and the current pragmatic paradigm (identify a data source that could enrich your own data, ingest it into your own store, index it and make it scale) may not stay the predominant one. That paradigm makes it hard to go beyond a couple of steps of inference, because you can't blindly follow all the links you find in the LOD you ingest and ingest them too: you could end up ingesting the whole of the web of data. As the technology permits and the idea of making more agile steps across the semantic graph beds in, I expect we'll see more solutions appear where reasoning is done according to what is found in various linked data sources, not according to what a system designer has pre-selected. As the chains of inference grow longer, the issue of attribution becomes keener, and so in the longer term there will be no escaping the need to be able to reason without giving attribution.

This is the detail we could do with ironing out in licensing LOD, and I'd be pleased to see it discussed in relation to the LOD-LAM star scheme.

Thursday, June 09, 2011

Hack4Europe London, by your oEmbedded reporter

I spent today at the London edition of Hack4Europe, held at the British Library. It was co-hosted by the BL, Culture Grid/Collections Trust and Europeana, and it was one of four such hackdays around Europe in a single week, all aiming to give the Europeana APIs a good work-out and to uncover new ideas for how they can be used. It was a really fun and interesting day, although for me it ended in gentle humiliation as I couldn't get my laptop to output to the projector. Compared to my previous attempts at days like this it had all gone swimmingly up till then, so to fall at the last hurdle was a bummer! There were lots of very creative, clever (and funny) ideas going around and you should keep your eyes open because some may come to fruition in due course, but right now I'm going to indulge myself and talk mainly about what I attempted, because my presentation was a total #fail. So this is not really much of a report at all. Better luck next time.

oEmbed for Europeana
I took with me a few things I'd worked on already and some ideas of what I wanted to expand. One that I'd got underway involved oEmbed.
If you haven't come across it before, oEmbed is a protocol and lightweight format for accessing metadata about media. I've been playing with it recently, weighing it up against MediaRSS, and it really has its merits. The idea is that you send the URL of a regular HTML page to an oEmbed endpoint and it sends you back all you need to know to embed the main media item on that page. Flickr, YouTube and various other sites offer it, and I'd been playing with it as a means of distributing media from our websites at IWM. Its main advantages are that it's lightweight, usually available as JSON (ideally with callbacks, to avoid cross-domain issues), and most importantly of all, that media from many different sites are presented in the same form. This makes it easier to mix them up. MediaRSS is also cool, holds multiple objects (unlike oEmbed), and is quite widespread.
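To give a flavour of the protocol: the request is just the page URL handed to an endpoint (this is Flickr's, with a made-up photo URL), and what comes back is a flat set of fields defined by the oEmbed spec. Trimmed and approximate, but this is the shape of it:

    http://www.flickr.com/services/oembed/?format=json
        &url=http://www.flickr.com/photos/someuser/1234567890/

    {
      "version": "1.0",
      "type": "photo",
      "title": "A photo title",
      "author_name": "someuser",
      "url": "http://farm1.static.flickr.com/1/1234567890_abcdef.jpg",
      "width": "1024",
      "height": "768"
    }

Whichever site you ask, you get the same fields back, which is exactly what makes mixing sources easy.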
I've made a javascript library that lets you treat MediaRSS and oEmbed the same, so you can mix media from lots of sources as generic media objects, which seemed like a good starting point for taking Europeana content (or for that matter IWM content) and contextualising it with media from elsewhere. The main thing missing was an oEmbed service for Europeana. What you have instead is an OpenSearch feed (available as JSON, but without the ability to return a specific item, and without callbacks) and a richer SRW record for individual items, which is XML only. Neither option is easily mapped to common media attributes, at least not by the casual developer, so before the hackday I knocked together a simple oEmbed service. You send it the URL of an item you like on Europeana, it sends back a JSON representation of the media object (with a callback, if specified), and you're done. (Incidentally, I also made a richer representation using Yahoo! Pipes, which meant that the SRW was available as JSON too.)
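In use it looks roughly like this. The endpoint URL is a placeholder (my service was a quick hack with no permanent home); it's the pattern that matters:

    // Fetch the oEmbed shim for a Europeana record and drop the media
    // into the page. The endpoint URL is a placeholder, not a real address.
    var recordUrl = "http://www.europeana.eu/..."; // any Europeana item page URL
    $.ajax({
      url: "http://example.org/europeana-oembed",
      data: { url: recordUrl, format: "json" },
      dataType: "jsonp", // the JSONP callback avoids cross-domain trouble
      success: function (oembed) {
        $("#media").append('<img src="' + oembed.thumbnail_url +
                           '" alt="' + oembed.title + '"/>');
      }
    });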

Using the oEmbed
With a simple way of dealing with just the core data in the Europeana record, I was then in a position to grab it "client-side" with the magic of jQuery. I'm still in n00b status with this but getting better, so I tried a few things out.
Inline embedding
First, I put simple links to regular Europeana records onto an HTML page, gave them a class name to indicate what they were, and then used jQuery to gather these and get the oEmbed for each. This was used to populate a carousel (too ugly to link to). An alternative also worked fine: adding a class and "title" attribute to other elements. Kind of microformatty. Putting YouTube and Flickr links on the same page then results in a carousel that mixes all of them up.
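The pattern was roughly this; the class name and endpoint are illustrative, and a real version would feed a carousel rather than swapping images in place:

    // Find links marked as Europeana records and replace each one with
    // its media. Class name and oEmbed endpoint are illustrative only.
    $("a.europeanaRecord").each(function () {
      var link = $(this);
      $.ajax({
        url: "http://example.org/europeana-oembed",
        data: { url: link.attr("href"), format: "json" },
        dataType: "jsonp",
        success: function (oembed) {
          link.replaceWith('<img src="' + oembed.thumbnail_url +
                           '" title="' + oembed.title + '"/>');
        }
      });
    });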
Delicious collecting
Then I bookmarked a bunch of Europeana records into Delicious and tagged them with a common tag (in my case, europeanaRecord). I also added my own note to each bookmark so I could say why I picked it. With another basic HTML page (no server-side nonsense for this) I put jQuery to work again to do the following (there's a sketch of the whole routine after this list):

  1. grab the feed for my tag as JSON

  2. submit each link in the feed to my oEmbed service

  3. add to the resulting javascript object (representing a media object) a property to hold the note I put with my bookmark

  4. put all of these onto another pig-ugly page*, and optionally assemble them into a carousel (click the link at the top). When you get half a dozen records or more this is worthwhile. This even uglier experiment shows the note I added in Delicious attached to the item from Europeana, on the fly in your browser.
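Here's the promised sketch of that routine. The Delicious feed pattern and its terse field names (u for the URL, n for the note) are as I remember them, and the username and oEmbed endpoint are placeholders, so treat the details as illustrative:

    // 1. Grab the JSON feed for my tag (JSONP again, so no proxy needed).
    $.getJSON("http://feeds.delicious.com/v2/json/myusername/europeanaRecord?callback=?",
      function (bookmarks) {
        $.each(bookmarks, function (i, bookmark) {
          // 2. Submit each bookmarked link to the oEmbed service.
          $.ajax({
            url: "http://example.org/europeana-oembed",
            data: { url: bookmark.u, format: "json" },
            dataType: "jsonp",
            success: function (mediaObject) {
              // 3. Attach the note I wrote in Delicious to the media object.
              mediaObject.note = bookmark.n;
              // 4. Render it, note and all; a carousel could hang off this too.
              $("#collection").append('<div class="item"><img src="' +
                mediaObject.thumbnail_url + '"/><p>' + mediaObject.note +
                '</p></div>');
            }
          });
        });
      });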
I could have held the bookmarks elsewhere, of course - say in a GoogleDocs spreadsheet, or maybe Zotero - but Delicious is my day-to-day bookmarking application and it's very convenient to collect stuff from a button in my browser toolbar. Adding new tags to put together new collections is easy too.

I suppose what I was doing was test-driving use cases for two enhancements to Europeana's APIs. The broader one was about the things one could do if and when there is a My Europeana API. My Europeana is the user and community part of the service, and at some point one would hope that the things people collect, annotate, tag, upload and so on will be accessible through a read/write API for reuse in other contexts. Whilst waiting for a UGC API, though, I showed myself that one can use something as simple as Delicious to do the collecting and add some basic UGC (tags and notes) to it. The narrower enhancement would be an oEmbed service, and oddly I think it's this narrower one that came out stronger, because it's so easy to see how it can help even duffer coders like me to mix up content from multiple sources.

I didn't
What I didn't manage to do, which I'd hoped to try, was hook up bookmarking somehow with the Mashificator, which would have completed the circle quite nicely, or get round to using the enriched metadata that Europeana has now made available, including good period and date terms, lots of geo data, and multilingual annotations. These would be great for turning a set of Delicious-bookmarked records into a timeline, a map, a word cloud and so on. Perhaps that's next. And finally, it would be pretty trivial to create oEmbed services for various other museum APIs I know, and to make mixing up their collections on your page as easy as this, with just a bit of jQuery and Delicious.

Working with Jan
Earlier in the day I spent some time with Jan Molendijk, Europeana's Technical Director, working on some improvements to a mechanism he's built for curating search results and outputting static HTML pages. It's primarily a tool for Europeana's own staff, but I think we improved the experience of assembling/curating a set, and again I got to stretch my legs a little with jQuery, learning all the time. He decided to use Delicious too, to hold searches, which themselves can be grouped by tags and assembled into super-sets of sets. It was a pleasure and a privilege to work with the driving force behind Europeana's technical team; who better to sit by than the guy responsible for the API?

*actually this one uses the Yahoo! Pipe coz the file using the oEmbed is a bit of a mess still but it does the same thing