Well, Mia's interest in what thesauri, word-lists etc. are out there, or could be out there, in machine-friendly form chimed nicely with mine. It had been grating at me for ages that, for example, the NMR object type thesaurus is only available as HTML, not as a web service. There are a bunch of other thesauri in HTML form on the Collections Trust (well, MDA), English Heritage, and FISH sites, so following Mia's recent attempts to prod some of us museum tech types to action on the API front I figured I may as well have a go at turning one of them into a web service. The long and short is I haven't managed, but I have made useful steps, I think, and learnt a fair bit about Dapper, Pipes, and SKOS along the way.
I took the British Museum's material thesaurus, which is hosted by CT here. I went to Dapper and tried to train it well enough to go straight to nice XML, with each of the different relationships getting its own element. There were too many exceptions for that: it stopped learning them after a while and I was going in circles I'd never escape. So I made a simpler Dapp (here) which just puts out the term, the linked terms, and comments. I later had to retrain it to cope with the H page, but since running that page correctly once it's refused to do so again: it shows the results for A instead. Not to worry, add a querystring and it thinks it's a new page.
Anyway, then I had XML but still wanted to get this into nice nodes for the different relationship types between terms (though I wasn't really thinking about SKOS at this point. Doh!). I had high hopes for Pipes. Another doh! I would need to go through each item multiple times, renaming each sub-element according to its contents (e.g. broader terms all start "BT ") and trimming the string contents, and that's where I was scuppered: you can't loop operator modules, which are the ones that would allow renaming. And you can't rename by a rule, or I couldn't find how, and it would probably rely on an operator module anyway. So after a lot of time wasted I thought, sod this, I know how to do this in a minute using XSLT, and how important is it to have this as a web service? Fact is, it's not, or at least not in the form of a simple list - I may as well just have a static file.
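For what it's worth, the rename-by-prefix step that Pipes wouldn't let me loop is only a few lines in a general-purpose language. This is a rough Python sketch rather than the XSLT I actually used. Only the "BT " prefix comes from the thesaurus pages as described above; the other prefixes (NT, RT, UF) and the relation names they map to are my assumptions based on common thesaurus conventions, and `classify` and `group_links` are made-up names.

```python
# Assumed mapping of thesaurus prefixes to relationship names.
# Only "BT" (broader term) is confirmed by the text; the rest are guesses.
PREFIXES = {
    "BT": "broader",
    "NT": "narrower",
    "RT": "related",
    "UF": "altLabel",
}


def classify(raw):
    """Split a linked-term string such as 'BT metal' into (relation, term)."""
    text = raw.strip()
    prefix, _, rest = text.partition(" ")
    rest = rest.strip()
    if prefix in PREFIXES and rest:
        return PREFIXES[prefix], rest
    # Anything without a recognised prefix is kept, trimmed, as "unknown".
    return "unknown", text


def group_links(links):
    """Group a term's linked-term strings by relationship type."""
    grouped = {}
    for raw in links:
        relation, term = classify(raw)
        grouped.setdefault(relation, []).append(term)
    return grouped
```

Calling `group_links` on the list of linked-term strings for one entry gives a dictionary keyed by relationship, which is the "nice nodes" shape I was after.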
So that's what I did. It took more than a minute, though the core code scarcely did. What took longer was digging into SKOS, once it had struck me that it would be the obvious (only) format of choice. It works in a pretty straightforward way, or at least it's easy to do the basics and I didn't need to do more than that. Finding out how to represent it as RDF/XML was not so easy, coz the W3C pages don't show any - they just show Turtle, which isn't that much use to me, really. I needed a full RDF document. XML.com came up with the goods - old, but hopefully valid. So I went ahead and knocked up SKOS RDF for all the letters of the alphabet (bar X - nothing in the list starts with X) and merged them into one RDF file, which I hope is valid. I actually have my doubts, but I do know that with this file I can navigate around terms in a way that would be useful to me, so that's good enough for me. It's here. I think it would be useful to put a web service on top of this now (perhaps Pipes can come in useful at last) so that it's really an API. Feel free! Oh, go on then, here's a first pass. It won't render as RSS and (consequently?) the "Run Pipe" screen shows nowt, but in debug mode you see results, and you can get them as e.g. JSON and PHP.
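To give an idea of the basic SKOS RDF/XML shape, here's a minimal Python sketch using only the standard library. It isn't the XSLT I used and I'm not claiming it matches my file exactly. The two namespace URIs are the standard RDF and SKOS ones; the base URI, the way term URIs are built from labels, and the function names are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/materials/"  # hypothetical base URI for terms

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("skos", SKOS_NS)


def term_uri(label):
    """Build a URI for a term from its label (an assumed scheme)."""
    return BASE + label.strip().lower().replace(" ", "-")


def build_skos(terms):
    """Turn {label: {relation: [labels]}} into an rdf:RDF element.

    Relations are expected to be SKOS property names that point at other
    concepts, such as "broader", "narrower" or "related".
    """
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for label, relations in terms.items():
        concept = ET.SubElement(
            root, f"{{{SKOS_NS}}}Concept", {f"{{{RDF_NS}}}about": term_uri(label)}
        )
        pref = ET.SubElement(concept, f"{{{SKOS_NS}}}prefLabel")
        pref.text = label
        for relation, targets in relations.items():
            for target in targets:
                ET.SubElement(
                    concept,
                    f"{{{SKOS_NS}}}{relation}",
                    {f"{{{RDF_NS}}}resource": term_uri(target)},
                )
    return root


def merge(roots):
    """Merge several rdf:RDF elements (one per letter, say) into one."""
    merged = ET.Element(f"{{{RDF_NS}}}RDF")
    for root in roots:
        merged.extend(list(root))
    return merged
```

Something like `ET.tostring(merge([a_root, b_root]), encoding="unicode")` would then give the combined document as a string. A real validator would still be the thing to trust over my hopes.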
Next up, there are a bunch of thesauri on those sites that I'd like to do a similar thing with, though some are going to be more fiddly. Others may be easier to dapp, but actually I reckon going to SKOS and taking it from there is a better bet, as long as the content owners aren't too pissy about me playing with their stuff. Actually, what would be most useful is probably to play with some of the word/term lists, e.g. the RCHME Archaeological Periods List.
I could get into this.