About Me

Web person at the Imperial War Museum, having just completed a PhD on digital sustainability in museums (the original motivation for this blog was as my research diary). Posting occasionally, usually about museum tech stuff, but prone to stray. I welcome comments if you want to take anything further. These are my opinions and should not be attributed to my employer or anyone else (unless they thought of them too). Twitter: @jottevanger

Monday, October 18, 2010

Open Culture 2010 ruminations #2: Europeana, UGC and the API, plus a bit of "what are we here for?"

OK here's a relatively quick one: there was lots of discussion at Open Culture 2010 about user-generated content, and (praise be!) lots about the API. Where I see a gap is in the link between these two.
Some background: Jill Cousins, Europeana's Director, outlined the four objectives that drive the project/service/network/dream (take your pick), which go approximately like this:
  1. To Aggregate – bringing everything together in one place, interoperable, rich (or rich enough) and multilingual

  2. To Facilitate – to encourage innovation in digital heritage, to stimulate the digital economy, to bring greater understanding between the people of Europe, to build an amazing network of partners and friends

  3. To Distribute – code, data, content

  4. To Engage – to put the content into forms that engage people, wherever they may be and however they want to use it
Jill pointed out that there are multiple stakeholders with their own agendas, all of whom need serving. They aren’t always in conflict, but it’s our job and Europeana’s job to help to show them how their agendas actually align. We have users (that’s a pretty large and heterogeneous group!), content providers (only slightly less so), policy-makers, ministries, business users.... Having identified the value propositions for these groups it becomes clear where we need to fill some gaps in content, partners, functionality and marketing (the latter really hasn’t started yet).
A key plank in the distribution strategy is the API. For engagement, an emerging social strategy includes opportunities for users to react to and create content themselves, and to channel the content to external social sites (Facebook and the like). Both of these things are too big to go into here, but I think one thing we haven't got covered properly is the overlap between the two. Channelling content to people on third-party sites ticks the "distribution" box, but whilst that in itself may be engaging it is not the same as being "social" or facilitating UGC there. If people have to come to our portal to react, they simply won't. In other words, our content API has to be accompanied by a UGC API – read and write. Even if the "write" part does nothing more than allow favouriting and tagging, it will make it possible to really engage with Europeana from Facebook etc.
What falls out of this is my answer to one of Stefan Gradmann's questions to WP3 (the technical working party). Stefan asked: do we want to recommend that work progresses on authentication/authorisation mechanisms (OAuth/OpenID, Shibboleth etc.) for the Danube release (July 2011)? My answer is a firm "yes". Until that is sorted out we can't have a "social" API to support Europeana's engagement objective off-site, and if such interaction is not possible off-site then we're really not making the most of the "distribution" strand either.
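To make the "read and write" idea a little more concrete, here's a minimal sketch of the sort of call a third-party site might make once an authorisation mechanism is in place. Everything in it is hypothetical – the endpoint, the token and the payload are invented for illustration, since no such Europeana write API exists yet.

    # Hypothetical sketch only: the endpoint, token and payload below are invented
    # to illustrate what a "write" UGC call (tagging an object) might look like
    # once OAuth or similar is in place. No such Europeana API exists yet.
    import requests  # third-party HTTP library

    UGC_ENDPOINT = "https://api.example.europeana.eu/ugc/tags"   # hypothetical
    ACCESS_TOKEN = "user-token-obtained-via-oauth"               # hypothetical

    def tag_object(object_id, tag):
        """Attach a user's tag to a Europeana object from a third-party site."""
        response = requests.post(
            UGC_ENDPOINT,
            headers={"Authorization": "Bearer " + ACCESS_TOKEN},
            json={"object": object_id, "tag": tag},
        )
        response.raise_for_status()  # fail loudly if the write was rejected

    # e.g. a Facebook app letting a user tag an object without leaving Facebook
    tag_object("record/12345/abcdef", "art nouveau")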

Open Culture 2010 ruminations #1: Linked Data

I just came back from the Europeana plenary conference, Open Culture 2010, in Amsterdam. Before the conference I went to meetings of Working Party 1 (Users) and WP3 (Technical), and at all three gatherings I found myself ruminating on a few key areas: the question of Linked Data and the API; how social media and user generated content relate to the distribution model for Europeana; and the future of the project itself. In this first post I'll look at Linked Data and why I think we need to worry less about some things and more about others that aren't getting much attention; and I'll suggest some analogies etc that we might use to help sell the idea a bit.



Linked (Open) Data was a constant refrain at the meetings (OK, not at the WP1 meeting) and the conference, and two things struck me. Firstly, there's still lots of emphasis on creating outbound links and little discussion of the trickier(?) issue of acting as a hub for inbound links, which to my mind is every bit as important. Secondly, there's a lot of worry about persuading content providers that it's the right thing to do. Now the very fact that it was a topic of conversation probably means that there really is a challenge there, and it's worth taking some time, then, to get our ducks in a row so we can lay out very clearly to providers why it is not going to bring the sky crashing down on their heads.
During a brainstorming session on Linked Data, the table I sat with paid quite a lot of attention to this latter issue of selling the idea to institutions. The problem needs teasing apart, though, because it has several strands – some of which I think have been answered already. We were posed the questions "Is your institution technically ready for Linked Data?" and "Does it have a business issue with LD?", but we wondered whether it even matters if an individual institution is technically ready: Europeana's technical ability is the real question, and it can step into the breach for institutions that aren't ready yet. As for the "business issue" question, are such issues about outgoing links or incoming links? Then, for inbound linkage, is it the actual fact of linkage or the metadata at the end of the link that is more likely to be problematic? And what are people's worries about outbound links?
What we resolved it down to in the end was that we expected people would be most worried about (a) their content being “purloined”, and (b) links to poor-quality outside data sources. But how new are these worries? Not new at all, is the answer, and Linked Data really does nothing to make them more likely to be realised, when you think about what we already enable. In fact, there’s a case to be made that not only does LD increase business opportunities but it might also increase organisations’ control over “their” data, and improve the quality of things that are done with it: letting go of your data means people don’t do a snatch-and-grab instead.
Ultimately, I think, Linked Data really doesn’t need a sales effort of its own. If Europeana has won people over to the idea of an API and the letting-go of metadata that it implies, then Linked Data is nothing at all to worry about. What does it add to what the API and HTML pages already do? Two things:
  • A commitment to giving resources a URI (for all intents and purposes, read "stable URL"), which they should have for the HTML representation anyway. In fact, the HTML page could even be at that URI and either contain the necessary data as, say, RDFa in the HTML, or through content negotiation offer it in some purer data format (say, EDM-XML) – see the sketch just after this list.
  • Links to other data sources to say “this concept/thing is the sameAs that concept/thing”. People or machines can then optionally either say “ah, I know what you mean now”, or go to that resource to learn more. Again, links are as old as the Web and, not to labour the point, are kinda implicit in its name.
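To illustrate the content-negotiation option, here's a minimal sketch using plain HTTP Accept headers; the item URI is invented for the purpose, and RDF/XML is just one plausible format, so treat it as an assumption rather than a description of Europeana's actual setup.

    # Minimal sketch of content negotiation at a single stable URI.
    # The URI below is hypothetical; the mechanism (HTTP Accept headers) is standard.
    import requests

    ITEM_URI = "http://data.example.europeana.eu/item/12345/abcdef"  # hypothetical

    # Ask for the human-readable representation...
    html = requests.get(ITEM_URI, headers={"Accept": "text/html"})
    print(html.headers.get("Content-Type"))   # e.g. text/html

    # ...and for a machine-readable one at the same address.
    rdf = requests.get(ITEM_URI, headers={"Accept": "application/rdf+xml"})
    print(rdf.headers.get("Content-Type"))    # e.g. application/rdf+xml (EDM data)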

So really there’s little reason to worry, especially if the API argument has already been put to bed. However I thought it might be an idea to list some ways in which we can translate the idea of LD so it’s less scary to decision-makers.

  • Remember the traditional link exchange? There's nothing new in links, and once upon a time we used to try to arrange link exchanges like a babysitting circle or something. We desperately wanted incoming links, so where's the sense in now saying, "we're comfortable linking out, but don't want people linking in to our data"?
  • Linked data as SEO. Organisations go to great lengths to optimise their sites so they fare well in search engine rankings. In other words, we already encourage Google, Bing and the like to spider, copy and index our entire websites in the name of making them easier to discover. Now, search is fine, but it would be still better to let people use our content in more places (that's what the API is about), and Linked Data acts like SEO for applications that could do that: if other resources link to ours, applications will "visit".
    The other thing here is that we let search engines take our content for analysis, knowing they won't use it for republication. We should also license our content for complete ingestion so that applications indexing it can be as powerful as possible.
  • It's already out there, take control! We let go of our content the moment we put it on the web, and we all know that doing so was not just a good thing, it was the only right thing. But whilst the only way to use it is cut-n-paste, (a) it's not reused and seen nearly as much as it should be, and (b) it's completely out of our control, lacking our branding and "authority", and not feeding people back to us. Paradoxically, if we make it easier to reuse our content our way than it is to cut and paste, we can change this for the better: maintain the link with the rest of our content, keep intellectual ownership, drive people back to us. Helping reuse through linked data and APIs thus potentially gives us more control.
  • Get there first. There is no doubt that if we don't offer our own records of our things in a reusable form online then bit by bit others will do it for us, and not in the way we might like. Wikipedia/DBPedia is filling up with records of artworks great and small, and its entries will therefore end up as the reference URIs for many objects.
  • Your objects as context. Linked data lets us surround things/concepts with context, and equally lets our objects serve as context for other people's things and concepts.

So if I think fears about LD should be something of a non-issue, what do I think are the more important questions we should be worrying about? Basically, it's all about what's at the end of the reference URI and what we can let people do with it. Again, it's really a question as much about the API as it is about Linked Data, but it's a question Europeana needs to bottom out. How we license the use of data we're releasing from the bounds of our sites is going to become a hotter area of debate, I reckon, with issues like:

  • Is Europeana itself technically prepared to offer its contents as resources for use in the LD web? Are we ready to offer stable URIs and, where appropriate, indicate the presence of alternative URIs for objects?
  • What entities will Europeana do this for? Is it just objects (relatively simple because they are frequently unique), or is it for concepts and entities that may have URIs elsewhere?
  • What’s the right licence for simple reuse?
  • Does that licence apply to all data fields?
  • Does it apply to all providers’ data?
  • Does it apply to Europeana-generated enrichments?
  • Who (if anyone) gets the attribution for the data? The provider? Aggregator? Disseminator (Europeana)?
  • Do we need to add legal provisions for static downloads of datasets as opposed to dynamic, API-based use of data?

Just to expand a little on the last item: the current nature of semantic web (or SW-like) applications is that the tricky operation of linking the data in your system to that in another isn't often done on the fly; more often it happens once and the results are ingested and indexed. Doing a SPARQL query over datasets on opposite sides of the Atlantic is a slow business you don't want to repeat for every transaction, and joining more sets than that is something to avoid. The implication is that, if a third party wanted to work with a graph spanning Europeana and their own dataset, it might be much more practical for them to ingest the relevant part of the Europeana dataset and index and query it locally. This is in contrast to the on-the-fly use of the metadata which I suspect most people have in mind for the API. Were we to allow data downloads, we might wish to add certain conditions on what they could do with the data beyond using it for querying.
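To make the contrast concrete, here's a rough sketch of the two approaches: querying a remote SPARQL endpoint for every transaction versus ingesting a downloaded slice of the data and querying it locally. The endpoint URL and dump filename are invented, and rdflib is used just as one example toolkit; the remote call follows the standard SPARQL protocol.

    # Rough sketch: per-transaction queries against a remote SPARQL endpoint
    # versus ingesting a downloaded slice of the data and querying it locally.
    # The endpoint URL and dump filename are hypothetical.
    import requests
    import rdflib

    QUERY = """
    SELECT ?object ?title WHERE {
      ?object <http://purl.org/dc/elements/1.1/title> ?title .
    } LIMIT 10
    """

    # Option 1: hit the remote endpoint on the fly. Slow if repeated for every
    # transaction, and worse still if the query has to join several endpoints.
    remote = requests.get(
        "http://sparql.example.europeana.eu/",           # hypothetical endpoint
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
    )
    print(remote.json()["results"]["bindings"][:3])

    # Option 2: ingest a downloaded dump once, then index and query it locally
    # as often as needed.
    graph = rdflib.Graph()
    graph.parse("europeana-slice.rdf")                   # hypothetical local dump
    for row in graph.query(QUERY):
        print(row.object, row.title)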

In short I think most of the issues around Linked Data and Europeana are just issues around opening the data full stop. LD adds nothing especially problematic beyond what an API throws up, and in fact it's a chance to get some payback for that because it facilitates inbound links. But we need to get our ducks in a row to show organisations that there's little to be worried about and a lot to gain from letting Europeana get on with it.

Sunday, October 03, 2010

Internet Archive and the URL shortener question

There's been a bit of chat lately about the risks of URL shorteners, prompted I think partly by the arrival of Google's goo.gl service. The Guardian covers the basic argument here but it's the obvious thing: what happens to all the short links that get circulated if a link shortener goes tits-up? (The Guardian quotes from Joshua Schachter's bloggage of last year, which has a lot more detail to think about.)

So earlier this week @tmtn tweeted from the Royal Society:
"Penny-drop moment. If bit.ly goes belly up, all the links we've used it for, break."
and there followed a little exchange about what might be done to help - the conclusion being, I think, not a lot. For your own benefit you might export a list of your links as HTML or OPML (as you can from Delicious, which does link shortening now), but for whoever else has your links there's no help.

But it got me thinking about how the Internet Archive might fit in. Schachter mentions archiving the databases of link shortening services, and here's one home for them that really could help. Wouldn't it be cool if your favourite URL shortener hooked up with them so that every link they shortened was pushed into the IA index? It could be done live or after a few weeks' delay, if necessary. The IA could then offer a permified version of the short URL, along the lines of
http://www.archive.org/surl/*/http://bit.ly/aujkzd [non-functioning entirely mythical link]
Knowing just the short URL, it would be easy to find the original target URL (if it still exists!). There's also a nice bit of added value: the Wayback Machine, one of the Internet Archive's greatest pieces of self-preservation, snapshots sites periodically, and the short links (being, hopefully, timestamped) could be tied to these snapshots, so that you could skip to how a page looked when the short link was minted. They might even find that the submission of short links was a guide to popularity they could use in selecting pages to archive.
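The /surl/ address above is entirely imaginary, but the building blocks already exist, so here's a rough sketch of what such a lookup could do today: follow the shortener's redirect to recover the target URL, then ask the Internet Archive's Wayback availability lookup for the snapshot closest to the date the link was (presumably) minted. The short URL and timestamp are placeholders.

    # Rough sketch using pieces that exist today: resolve a short link by following
    # its redirect, then ask the Wayback Machine for the snapshot closest to the
    # date the link was minted. The short URL and timestamp are placeholders.
    import requests

    SHORT_URL = "http://bit.ly/xxxxxxx"   # placeholder short link
    MINTED = "20101003"                   # assumed minting date (YYYYMMDD)

    # Follow the shortener's redirect to recover the original target URL.
    resolved = requests.head(SHORT_URL, allow_redirects=True)
    print("Target:", resolved.url)

    # Ask the Wayback Machine for its closest snapshot to that date.
    lookup = requests.get(
        "https://archive.org/wayback/available",
        params={"url": resolved.url, "timestamp": MINTED},
    )
    closest = lookup.json().get("archived_snapshots", {}).get("closest")
    if closest:
        print("Snapshot from", closest["timestamp"], "at", closest["url"])
    else:
        print("No archived snapshot found")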

OK, so the Internet Archive itself maybe isn't forever, but it's been around a while and looks good for a while longer, it's trusted, and it's neutral. Perhaps Bitly, Google, tr.im, TinyURL and all the rest might think about working with the IA so we can all feel a little more sanguine about the short links we're constantly churning out? It would certainly make me choose one provider over another, which is the sort of competitive differentiator they might take note of.