About Me
- Jeremy
- Web person at the Imperial War Museum, just completed PhD about digital sustainability in museums (the original motivation for this blog was as my research diary). Posting occasionally, and usually museum tech stuff but prone to stray. I welcome comments if you want to take anything further. These are my opinions and should not be attributed to my employer or anyone else (unless they thought of them too). Twitter: @jottevanger
Tuesday, May 12, 2009
Do the Europeana survey for purely selfish reasons
Please use Europeana.eu, the European digital library with 4 million digital objects, and join a survey to win an iPod Touch.
Friday, May 08, 2009
Solr: lessons in love
Solr is...
...a wrapper around the Lucene search engine, a mature open-source product from the Apache Foundation, which gives it an HTTP REST interface - in other words, it's got an out-of-the-box web API. So why do we not hear more about it in the world of the digital museum? I've been vaguely aware of it for a long time, and quite possibly there are many museums out there using it, but if so I've not heard of them. We've been talking for so long about the need to get our stuff out there as APIs and yet there's an easy solution right there. OK, perhaps the API isn't quite what you might specify a discovery API to be, but then again maybe it is. It certainly enables pretty sophisticated queries (though see below) and one virtue is that the parameters (though not the field names) would be the same across many installations, so if every museum used Solr we'd have the start of a uniform search API. A good start.
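To make the "out-of-the-box web API" point concrete, here is a minimal sketch of building a search URL. The host, port and the field names (`title`, `id`) are assumptions for illustration; the parameter names `q`, `rows`, `start` and `fl` are the standard Solr ones, which is exactly the part that would be uniform across installations.

```python
from urllib.parse import urlencode

def solr_select_url(base_url, query, rows=10, start=0, fields=None):
    """Build a URL for Solr's standard /select search handler."""
    params = {"q": query, "rows": rows, "start": start}
    if fields:
        # fl limits which stored fields come back in each document
        params["fl"] = ",".join(fields)
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical local install and field names
url = solr_select_url("http://localhost:8983/solr", "title:roman",
                      rows=5, fields=["id", "title"])
print(url)
```

Only the field names in the query would need to change from one museum's index to the next.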
Installation and configuration
Dead easy. Read the tutorial. Download the nightly (it comes with examples), though it's not as easy to find as it should be. Being the nightly, it needs the JDK rather than the JRE to run it, and on my machine I had to fiddle about a bit because I have a couple of versions of Java running, so I couldn't rely on the environment variables to start Solr with the right one. This just means putting the full path to the JDK java EXE into your command prompt if you're running the lightweight servlet container, Jetty, that it comes with. This is the easiest way to get going. Anyway I wrote a little BAT file for the desktop to make all this easier and stop fannying about with the CMD window each time I wanted to start Solr.
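The BAT file itself isn't reproduced here, but the idea can be sketched as follows: name the JDK's java executable explicitly instead of trusting the environment variables, and run the bundled Jetty from the example directory. Both paths are hypothetical; `java -jar start.jar` is the start command from the Solr tutorial.

```python
def jetty_launch(java_exe, solr_example_dir):
    """Return the command and working directory for starting Solr's bundled Jetty."""
    # Naming the full path to java.exe picks the JDK we want,
    # regardless of what JAVA_HOME or PATH point at.
    return [java_exe, "-jar", "start.jar"], solr_example_dir

# Hypothetical paths; adjust to the local JDK and Solr download
command, cwd = jetty_launch(r"C:\Java\jdk1.6.0\bin\java.exe", r"C:\solr\example")
print(command, cwd)

# To actually start Solr:
#   import subprocess
#   subprocess.Popen(command, cwd=cwd)
```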
The other thing to remember with Jetty is that it's run as part of your user session. Now, when you're using Remote Desktop and you log off you see a message to the effect that your programmes will keep running whilst you've logged off. Well for one server this seemed to be true, but when I tried to get Jetty going on the live web server (albeit going via another, intermediate, RDP) it stopped when I disconnected. I thought I'd use Tomcat instead, since that was already running (for a mapping app, ArcIMS), and by following the instructions I had it going in minutes. Now that may seem unremarkable, but I've installed so many apps over the years, and pretty much anything oriented at developers (and much more besides) pretty much always requires extra configuration, undocumented OS-specific tweaks, additional drivers or whatever. With Solr, it's pretty much download, unpack, run - well it was for me. Bloody marvellous.
Replication
Well this couldn't be much easier, and so far no problems in the test environment. Use the lines in the sample config file to specify the master and slave servers, put the same config on both of them (of course) and the slave will pick it up and start polling the master for data.
Data loads/updates
The downside to this seems to be the time it can take to do a full re-indexing, but then that's my fault, really, because I've not done what's necessary to do "delta" updates i.e. just the changes. It can take a couple of hours to index 3000 records from our collections database - these have some additional data from one-to-many relationships, which slows things down.
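For what it's worth, once delta queries are defined in the data config, triggering a changes-only import is just another URL. This sketch assumes the import is done with Solr's DataImportHandler registered at `/dataimport` (the conventional path, but set in solrconfig.xml, so treat it as an assumption), and the host is hypothetical.

```python
from urllib.parse import urlencode

def dataimport_url(base_url, command="delta-import", commit=True):
    """Build a DataImportHandler URL; command is 'full-import' or 'delta-import'."""
    params = {"command": command, "commit": "true" if commit else "false"}
    return f"{base_url}/dataimport?{urlencode(params)}"

# Hypothetical local install
delta_url = dataimport_url("http://localhost:8983/solr")
full_url = dataimport_url("http://localhost:8983/solr", command="full-import")
print(delta_url)
print(full_url)
```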
Data modelling and the denormalised life
Before indexing and updates, though, comes modelling. I've only done one thing so far, which is grab records from a relational database, and after a straightforward basic version (the object record) I played around with adding in related records from subject, date, and site tables. Here I found a problem, which was that the denormalised nature of Solr was, well, more denormalised than was healthy for me. I still haven't quite got my head around whether this is a necessary corollary of a flattened index, or a design limitation that could be overcome. You get to group your multiple records into a sub-element, but instead of a sub-element for each related record, you get a sub-element for each related column. Basically I wanted a repeating element for, say, "subject", and in that element further elements for ID, subject name, hierarchy level. Instead I get an element containing ID elements, one with subject names etc. Like this I cannot confidently link ID and subject name. My work-around was a pipe-separated concatenated field that I split up as needed, but that's really not ideal.
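The pipe-separated work-around looks roughly like this on the consuming side. It's a sketch only: the field order (ID, subject name, hierarchy level) and the sample values are assumptions for illustration, and it relies on subject names never containing a pipe.

```python
def split_subjects(values, sep="|"):
    """Turn concatenated 'id|name|level' strings back into linked records."""
    subjects = []
    for value in values:
        parts = value.split(sep)
        if len(parts) != 3:
            # Skip anything malformed rather than guess which part is which
            continue
        subject_id, name, level = parts
        subjects.append({"id": subject_id, "name": name, "level": level})
    return subjects

# Hypothetical values of a multi-valued field on one object record
raw = ["12|Roman London|2", "40|Trade|1", "broken value"]
print(split_subjects(raw))
```

Each ID now stays attached to its own subject name, which is what the separate per-column elements couldn't guarantee.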
The other thing I've not yet tried to deal with is bringing in other types of record (or "documents" in the Solr vocab). For instance, full subject records searchable in their own right. Probably they belong in the same index, but it's possible to run a multi-core instance that might be the proper way to handle this. Dunno. Better find out soon though.
Incidentally one of the nice things I've found with this denormalised way of life, with the whole thing set up for speedy search rather than efficient storage or integrity (you have to assume that this is assured by the original data source) is that some of the nightmare data modelling problems I'd struggled with - duplicate images, messy joins - don't really matter much here. Go for speed and clean up some of the crap with a bit of XSLT afterwards.
Oh, and a small JDBC tip. I've not used JDBC before (I needed it to connect to SQL Server) but installation was a breeze once I'd figured out which version of the connection string I needed for SQL Server 2000. I needed to drop the jar file into the Solr WAR directory, if I recall correctly - that's where it looks first for any jars - so whilst there may be more "proper" solutions this was effective and easy.
Oops, wrong geo data! (pt. 1)
I already talked about one problem, the lack of a structure to hold related multi-value records usefully. One other problem I had with data modelling was with lat/long data. First problem: it wasn't lat/long data in the lat/long fields. Arse. No wonder it seemed a bit short on decimal points - it was OSGB. I brought in the right data from another database (lat/longs are not in Multi Mimsy at the moment). Job done.
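A sanity check along these lines would have caught it sooner. This is a sketch with made-up sample values: OSGB grid references are eastings and northings in metres, so they fall well outside the range any latitude or longitude can take.

```python
def looks_like_lat_long(lat, lon):
    """True if the pair is at least within the valid range for decimal degrees."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

# Illustrative values: central London in degrees, then an OSGB-style pair in metres
print(looks_like_lat_long(51.5074, -0.1278))
print(looks_like_lat_long(181000, 530000))
```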
Data typing problems - floats and doubles; oops, wrong geo data! (pt. 2)
....or not. To float or not to float? I'm not very good with databases or, indeed, data types. I knew I needed a Trie field in order to do range queries on it for the geo data. Clearly an integer field would not do, either, these being numbers with lots of digits to the right of a decimal point (up to 6, I think). A float was my first port of call. Turned out not to work, though, so I tried a double. This worked, but I think I need to change the precision value so that I can use finer-grained query values. Do look at trie fields if you're going to use Solr for range queries, they're pretty damn quick. Sjoerd at Europeana gave me the tip and it works for them.
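The range queries in question use the standard `field:[low TO high]` syntax. A minimal sketch of a bounding-box query follows; the field names `latitude` and `longitude` and the coordinates are assumptions, not necessarily what's in my schema.

```python
def bounding_box_query(lat_field, lon_field, south, north, west, east):
    """Combine two inclusive range clauses into one bounding-box query."""
    return (f"{lat_field}:[{south} TO {north}] AND "
            f"{lon_field}:[{west} TO {east}]")

# Hypothetical field names, rough box around central London
box = bounding_box_query("latitude", "longitude", 51.4, 51.6, -0.2, 0.1)
print(box)
```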
wt=xslt and KML
One nice thing with Solr, being a REST-y HTTP wrapper for Lucene, is that you can do it all in the querystring. One such thing is to specify the transform and get your results out of Solr as you want them, rather than having to pull them into some other environment and do it there. So whilst I was at the in-laws over bank holiday weekend I could RDP to the web server, set up Tomcat and write a quick transform that could be called from the querystring to return KML instead of plain Solr XML. It was at this point that I realised about the geo data problems, but once they were resolved the wt=xslt method was sweet. Though you can't use your favoured XSLT engine - it's Saxon, for better or worse.
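In querystring terms that means adding `wt=xslt` plus a `tr` parameter naming the stylesheet. A sketch, assuming a stylesheet called `kml.xsl` sitting in Solr's conf/xslt directory and a `borough` field; both names and the host are hypothetical.

```python
from urllib.parse import urlencode

def solr_xslt_url(base_url, query, stylesheet):
    """Build a search URL whose results are transformed server-side by XSLT."""
    # wt selects the response writer; tr names the stylesheet to apply
    params = {"q": query, "wt": "xslt", "tr": stylesheet}
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical host, field and stylesheet names
kml_url = solr_xslt_url("http://localhost:8983/solr", "borough:Camden", "kml.xsl")
print(kml_url)
```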
Other limitations
This is based on my limited knowledge and experience and so subject to being completely wrong. However...
- It's not an RDB. No sub-selects, awkward ways of joining data
- I've found that indexing seems to get inconsistent results. It might be that there have been small differences between the data config files each time with big results, but I'm pretty sure that it's just cocking up sometimes. Maybe the machine I'm running it on or the database server is over-worked, but sometimes I get 2k records rather than 3k, and also I may find that a search for a particular borough returns no results even though I can see records in there with that borough in the text AND the borough field. Something's wrong there.
- flaky query order. I got no results with a query along the lines of "text contains keyword, and latitude is between val1 and val2", whereas if I did "latitude is between val1 and val2, and text contains keyword" I got loads. Various other fields were sensitive to being first in the query. I can't find this documented as being intentional. Some situations were helped by putting brackets around the bits and pieces, but some weren't. I'd recommend lots of brackets and judicious use of both quotation marks and "+" signs.
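By "lots of brackets" and "+" signs I mean something like the following sketch, which wraps every clause in brackets and marks it as required so that, in principle, the order of the clauses shouldn't matter. The field names are assumptions, and this is no guarantee it cures the ordering problem described above.

```python
def required_clauses(*clauses):
    """Bracket each clause and prefix it with + so every one must match."""
    return " ".join(f"+({clause})" for clause in clauses)

# Hypothetical field names; the same two clauses in either order
q1 = required_clauses("text:bridge", "latitude:[51.4 TO 51.6]")
q2 = required_clauses("latitude:[51.4 TO 51.6]", "text:bridge")
print(q1)
print(q2)
```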
So OK, it's not been totally wrinkle free and I'm sure I've forgotten or mixed up stuff, but on the whole Solr has been a great experience. I have a lot still to find out and plenty to fix before my current test dataset is good, but I'm confident that this will end up at the heart of the discovery and web service parts of our planned collections online delivery system. Check it out.
Thursday, May 07, 2009
A new Head of Communications
Hopefully before that point we will have started or even completed the process of finding a replacement for Mia, who we lost to the Science Museum all those months ago. The vacant post is being boosted to "Digital Museum Manager", to make up for the fact that we have no manager responsible for web and digital media since October, for reasons it would be imprudent to expand upon here. We need someone at that level to take on the planning, policy and strategic work that the HoC will be too busy to deal with, given that he's covering the whole of communications (internal and external), but we also have to have a developer to fill the gap that Mia left so this will be a pretty hands-on post, with probably more time coding than managing. We'll have to see if this proves sufficient, since even when we were fully staffed we were short-staffed.
Looking at Mr Robbin's profile it is good to see that internal communications are part of his skill-set. I think it's broadly felt at all levels here that MOL needs to work on this area in order to strengthen us as a corpus of colleagues with a commonly understood direction, and it will be interesting to see how our internal comms evolve in the coming months. Between now and July there's a lot that needs doing, so we'll have to muddle on in the meantime, but overall an interesting time ahead.
Tuesday, May 05, 2009
CFP for VALA2010
VALA promotes the use and understanding of information and communication
technologies across the Galleries, Libraries, Archives and Museum sectors.
The CFP is here but the deadline is nearly up (although the conference isn't until Feb 2010)
Museums Association digital events
The only conference in that area that I recall the MA running was, ooh, 2006 or so, but there are two more coming up. In June we have World wide wonder: museums on the web (NOT to be confused with the long-standing MCG-run UK Museums on the Web conference that I presume will take place later that month). There are some great people lined up for that, with perspectives ranging from academic to managerial to dirty-hands coder to strategic.
Then on September 18th is "Go digital: New trends in electronic media", which looks like it draws upon the sources interviewed for the MP special (including the director of public programmes here, David Spence). In contrast to June, it looks like it's going to be focussed on off-line media.
Monday, May 04, 2009
A dawning realisation?
Nowadays everyone I talk to questions the metrics they use. More than that, people seem keener to dig into what they may mean in terms of value. Seb Chan is amongst those in our sector who are exploring how to make better, and better use of, measurements, and closer to home, Dylan Edgar's work with the London Hub dug into similar issues.
Last week in a catch-up with the director of my division we touched on his own objective of "improving the website". In itself it's encouraging that the objective is there, as part of the reorganisation we are currently experiencing, but "improving the website" is a pretty broad ambition. I think it's a subject that we'll revisit in more depth soon, but it was clear that our director was as aware as we web types were that when you lift up that rock you'll find a tangled mess of questions. Before you talk about "improving" you need to identify what you consider to be valuable, and to disentangle theoretical "improvements" from impact, preparedness, experimentation etc. Obviously a set of measurements that to some degree reflects these valued qualities is a sine qua non for managing their realisation, and so here's a reference to provoke a little more thought on the subject that I won't dig into here, but has had me rethinking my own attitudes to web stats and the whole evaluation problem: Douglas W Hubbard, 2007, How to measure anything : finding the value of "intangibles" in business. *
In any case I find it encouraging that in this discussion and others with senior colleagues there seems to be a dawning awareness that we have a complex, multidimensional environment to deal with, wherein the varieties of "success" may be as varied as between all the departments within a museum. I'm not sure that it would always have been the case that the higher echelons were aware of the perils of trying to evaluate our digital programmes, although perhaps any senior manager worth their salt will have long ago twigged that a website is not "improved" merely by adding pages, Flash splashes and video - evaluating the more familiar physical museum is no easier, after all, and nor is improving it. We do need to have that conversation about what we mean by "website" with senior management, though. Is it only geeks that see this as only a part of our digital presence?
When it comes to the use of web stats of various sorts, there have always been lots of complaints about them, but I suspect that in this discussion too we are seeing greater recognition that it's not about visitors versus hits. Maybe it's not even enough to focus on "impact" since the heart of the matter arguably lies a level deeper than that: the first step is figuring out what impact itself means in the context of the museum's mission, and in this networked environment in the mission of the meta-museum that we must realise we are a part of.
Rhetorical question for the day, then: Is there a mission for the meta-museum, and do we measure up to it?
*I hope to post about this book properly, eventually, but don't wait for that, try to check out the book which, for all its flaws of repetition, is full of useful ideas and tools.
From the library: Renaissance and metrics
It's not a brilliant piece, to be honest; it's limited by reference to online publications and ends up muddling the question of what data are gathered with that of what is made available on public websites. Everitt was writing in advance of a review being conducted for the MLA (review FAQs) by an advisory group led by Sara Selwood, Phase 1 of which was to be completed last autumn so as to inform the business plan for the years ahead [note to self: track down other Selwood refs on data collection in cultural heritage]. Because of this it's quite likely that Everitt's findings were out of date before they were even accepted for publication. All the same there are some interesting points within the paper. For example, despite the declared intention of Renaissance to standardise methods of evaluating impact, Everitt finds notable variability in how this is actually undertaken. Two Public Service Agreement targets are applied to Renaissance, and measurements against these seem to be uniform, but beyond this and the headline figures there is less consistency; likewise the approaches to making their data, analysis and reports public vary greatly. I discovered that the MLA also offer a set of Data Collection Guidelines and templates, which I now need to digest. Presumably this 2008 manual (PDF) is the replacement for the 2006 version that Everitt was referring to, and here's a page on the MLA site about the results to 2006.
I look forward to seeing whatever parts of the Selwood-led review are published. The overall direction of Renaissance is up for grabs, it would seem, which could have a big impact in the Museum of London, for one. I will be especially interested, though, in the data collection strand, and in how they suggest we evaluate impact.
ICHIM and DISH
I hadn't twigged that the 2007 ICHIM was in fact the last of that long-running series of bi-annual conferences, which ran, amazingly, from 1991. April's issue of Curator starts off with an interview with David Bearman on the ICHIM's history, why it ended, and what next. Let's not forget that dbear and Jennifer Trant also run the universally adored and enormous Museums and the Web conferences, but ICHIM covered somewhat different territory and arguably there's a space that needs filling now...
...which is why it was timely that on the same day I found that interview, I also read about DISH2009:
"Digital Strategies for Heritage (DISH) is a new bi-annual international
conference on digital heritage and the opportunities it offers to cultural
organisations."
DISH 2009 takes place in Rotterdam December 8-10th, and the CFP is up. It looks interesting: taking a step back to look at strategic questions of innovation, collaboration, management etc.