About Me
- Jeremy
- Web person at the Imperial War Museum; just completed a PhD on digital sustainability in museums (the original motivation for this blog was as my research diary). Posting occasionally, usually on museum tech stuff but prone to straying. I welcome comments if you want to take anything further. These are my opinions and should not be attributed to my employer or anyone else (unless they thought of them too). Twitter: @jottevanger
Monday, November 07, 2011
New IWM websites pt.III: the, um, website
Why a new website?
For the last 8 (I think) years, IWM has used BoxUK’s Amaxus CMS to run its websites. Naturally the sites were getting creaky and the CMS itself has been superseded; IWM itself has changed, and so has how the web works – both technically and in terms of how web users behave and the language of interaction they understand. A clean sweep was in order, which meant a variety of strands of work. This much was clear when Carolyn Royston (our head) and Wendy Orr (our Digital Projects Manager, and so lead on the website project) outlined their ambitions to me when I started at IWM in May 2010.
Choosing the platform
Research for the new sites had begun some time before that May but planning really kicked off in June. For me, the first key deliverable was the selection of a technical solution, but although I had a fair idea which way I would go I wanted to know a variety of other things first. How can you choose a CMS without knowing the functional specification, and how can you really know that without settling the information architecture to some degree, and the ways that people will find content and interact with the site? Decisions on whether we’d be supporting a separate mobile site, for instance (we don’t, at least not for now), and on our plans for legacy sites could all have an impact. But of course you can only work out so much of this beforehand, and most questions seem to lead to others in a Gordian knot, so in the end you have to assess the situation as best you can, put together your own set of technical priorities, and make your selection as something of a leap of faith. I had the benefit of advice from various knowledgeable people in the sector who told us of their experiences with various CMSs, in particular IMA’s Drupal mage Rob Stein and the V&A’s Richard Morgan and Rich Barrett-Small, and we also had demos of a couple of commercial CMSs. Most importantly, though, we had Monique Szpak, whose role in this project (and my learning process at IWM) really needs a blog post of its own. Her experience with various open source products including Drupal was key, and after we identified that as our preferred solution she built us a proof-of-concept late last year to confirm that Drupal was likely to be able to do what we needed, and to assess the likelihood that Drupal 7, which at that point was still in alpha, would be ready when we needed it. With this information we took an informed gamble that it would be, and the choice was made.
Development
As I already said, we started development work even before finally settling on Drupal, as a pilot project, and this continued whilst we were developing our plans for content, IA and design. There were a number of things we knew we wanted, even if the functionality was still hazy – with Monique’s help we’ve instituted agile practices which positively encourage trial, error, testing and improvement. This change, in fact, together with the development environment we’ve gradually (& painfully) pieced together and the implementation of tools like Jira and Subversion, has been fundamental to making this project work, and it would have been impossible without Monique. Whilst she worked on prototyping more functionality, I did some groundwork on indexing shop and external sites. Then in the spring Toby Bettridge joined us, fresh from working on the Drupal part of the V&A’s new site. He and Monique worked very closely (with the help of Skype) and long before the design work was complete we had basic versions of the taxonomy, events, multi-index search and collections functionality done, amongst other things.
Although I’ve been paying attention to what they do, my hands-on involvement in Drupal development has been pretty much nil and I still understand the CMS far less than I’d like, so anything I say about Drupal development here needs to be read with that in mind! I do get, though, that one picks modules carefully, develops new ones with reserve, and never hacks core. We started developing with Drupal 7 before it was released, and even when it was there were (and remain) quite a lot of modules that weren’t ready to use. We thought the gamble was worthwhile, though, and forged ahead. In time we did incorporate some of them, although unfortunately we still don’t have some of the things promised by e.g. Workbench. Along the way Monique and Toby also did some vital module development of their own, notably a custom collections search module (using Search Api Solr Search), media embedding for authors using IWM’s oEmbed service, entity lists (old-style Drupal nodes had lists, but new-style entities didn’t), and some administrative tools.
My role in development? I’ve often felt somewhat awkward, if I’m honest, about my fit, because having elected to go with an entire technology stack and various development practices that were new to me, I often found I couldn’t really contribute practically, even where I understood some things well. For instance, although I have plenty of experience with Solr, my practical contribution to integrating it with Drupal was negligible; likewise, even if I knew what was required to fix some HTML/JS/CSS at the front end, I couldn’t implement it in an unfamiliar environment for fear of messing up Drupal or making some Subversion faux pas. I think I’ve made just one (successful) check-in of Drupal code. I concentrated instead on sorting out development and live hosting, working on getting the collections data right, filling the holes in the spec as we noticed them, and so on. I spent a good while working out how the media streaming worked and how to embed that in our pages, using the DAMS’ web service to build a lightweight SOAP-free alternative (an oEmbed service) that could serve both our websites and potentially third parties. When everything calms down, though, I need to properly get to grips with the codebase.
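For anyone unfamiliar with oEmbed: the service is essentially an HTTP endpoint that takes the URL of a piece of media and returns a small chunk of JSON (or XML) describing how to embed it. A rough sketch of the sort of JSON such a service returns; the URL and values below are purely illustrative, not our actual endpoint:
    GET /oembed?url=http://media.example.iwm.org.uk/item/12345&format=json

    {
      "version": "1.0",
      "type": "video",
      "title": "Example film from the collections",
      "provider_name": "IWM",
      "width": 640,
      "height": 360,
      "html": "<iframe src=\"http://media.example.iwm.org.uk/player/12345\" width=\"640\" height=\"360\"></iframe>"
    }
Drupal (or a third-party site) only needs to know that one URL pattern; the html field is what gets dropped into the page.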
Information architecture, discovery & URLs
Working out the logic for a site that’s going to function for several years is not easy. One can change that logic if necessary, but you really need to know how likely that is to happen in order to give your designers some parameters to work within – how flexible do menus have to be? How directed will the user be, and how much should any one piece of content be located in a specific part of the site? As I said earlier, the brand structure and the 5 IWM branches were a big factor in how we had to organise content, since we needed to make things readily discoverable whatever the user’s journey, but without making them context-free and confusing. Another pair of conflicting priorities was the wish to avoid having too many top-level menu items and the wish to keep the site fairly flat, without forcing too many clicks to find content.
Sites like the V&A, which relaunched earlier this year, have taken adventurous routes to delivering masses of content to users (or users to content) – in the V&A’s case, centring around search and introducing a sort of machine-learning to categorise content and indeed to identify what categories might exist. Brave stuff, and a great solution to the huge volume of content they have there.
At IWM we played for a while with the idea of a taxonomy driven site, wondering if we could use a set of taxonomies as facets onto different aspects of the site that would let users cut across a traditional hierarchical organisation of content. We’ve kind of gone with a watered-down version of that, wherein the structure of the content is fairly obvious and on the whole quite flat but we’ve used controlled terms and free tagging to help make things more discoverable to users coming from other angles. This is pretty conventional and at the moment of limited power, but in due course we will make greater efforts to align our taxonomies (in particular our history taxonomy) with the controlled terminology used in our collections. This was too much to do in this phase, but when that happens we should be able to make ever-better connections between our collections and pages like our “Collections In Context” history pages, learning resources, galleries and perhaps events. A learning-focused vocabulary will do the same, but right now our e-learning resources are pretty much non-existent.
Perhaps more important than taxonomy at the moment is search, which has been a key way of integrating content that lies outside our main site. We’ve elected to run 4 separate Solr indexes for this, and to keep them separate owing to the distinctive nature of their content. We have the Drupal index itself; collections data; an index of products extracted from our Cybertill e-shop; and a crawl (using Nutch) of a number of IWM sites that are outside of Drupal, such as blogs and the “Their Past, Your Future” learning resource. The last one needs a lot more work but as a quick-and-dirty way of ensuring that those legacy sites weren’t left out in the cold it works. And yes, a Google custom search engine would have been an alternative but then it would not have worked in the same way as the other searches, with deep integration into Drupal and the ability to treat the results as entities and reuse them elsewhere.
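In practice each index simply has its own URL that Drupal (or anything else) can query, and the site picks whichever is appropriate for the search in hand. Something like the following, though the hostname and core names here are invented for illustration:
    http://solr.example.org:8983/solr/drupal/select?q=somme
    http://solr.example.org:8983/solr/collections/select?q=somme&fq=media_type:photograph
    http://solr.example.org:8983/solr/shop/select?q=poster
    http://solr.example.org:8983/solr/legacy/select?q=%22their+past+your+future%22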
One obvious change with the new site is that, well, it’s one site. Previously we used a morass of subdomains for somewhat independent branch sites and even for the collections-pages-that-weren’t-collections-search (collections search had its own domain, no less). I for one found it pretty confusing. With the rebrand making the “IWM-ness” of all of our branches more prominent we were able to do the same on the website. I had been through a similar exercise at the Museum of London, and though it was not an identical situation some conundrums and dilemmas were shared by both. How to make non-branch-specific content and information easy for all users to access in the same place as branch-specific content, and how to make sure that people are well aware that the latter pertains only to one physical site? How to cross-promote? Like MoL we had no specific digital brand, nor a mother brand to distinguish particular projects or sites from cross-organisational activities. I hope we found a solution that works for our users and not just for IWM itself, but time (and more user-testing) will tell.
The expectation that we’d move content around and that the organisation of material on the site, as seen by users in menus etc, would not be forever, prompted me to seek a URL structure that was a little more abstract. I didn’t want URL components necessarily to be the same as top-level menu items, which might disappear, but to relate to more stable concepts of what IWM does and offers whilst remaining meaningful. That doesn’t mean permanent URLs but hopefully relatively long-lasting and predictable ones. In one area – collections – we do aim for the URLs to be “permanent”, though (whatever that means). What I tried to do was put what I imagined to be the most stable aspects towards the left-hand end of the URL path, things like “corporate” and “visits”, because I envisaged these as being more stable than even branch names (we might get more branches, or rename them again). I also wanted to be able to put non-branch content under these. The result is that we don’t have branch names at the top of a hierarchy but reappearing in a few places – visits/iwm-london as well as events/iwm-london and others. It may seem messy but I hope it’s reasonably predictable all the same, and it means we never need a catch-all URL to cope with the miscellany that we hadn’t foreseen would ever exist outside branches.
Design
We appointed the Bureau for Visual Affairs, who were responsible for the design of the National Maritime Museum’s new website, to do the same for us. Judge for yourself how they’ve done, although, good or bad, the credit or blame is not all theirs, even when it comes to aesthetics. Design and content go hand in hand, and in some places we’re still working to improve the latter to make the best of the former. Under the covers, too, the HTML that’s spat onto the page is the result of BVA’s HTML coders’ fine work at one end, Drupal at the other, and the best efforts of our devs to bridge the gap. And sometimes the gap was pretty big.
The theming process was one area where our plans went somewhat awry. We had two experienced Drupal developers on our team, but as there was plenty for them to do in back-end development we were planning on the theming being handled by whoever we appointed as designers. BVA, however, are not a Drupal house, but their design was what got us all excited, so we reached an arrangement with a third company to subcontract to BVA to do this part of the work. Having done it once, this is not something I would recommend - at least not unless you can make it very clear who answers to whom and where the line lies between development work, theming, and HTML development (and who pays whom for what). We ended up some weeks behind but got back on track with the help of Ed Conolly of http://www.inetdigital.co.uk, who moonlighted as a themer for a few weeks and helped put a spring back in everyone’s step. Bravo Ed!
Content
Early in our content planning we decided what we’d migrate from the old sites (not a lot), what we’d need to keep going (a small set of microsites) and, broadly speaking, what we’d want to add to the new site. Killing off content doesn’t usually sit too well with me; I’m a conservationist and archivist by inclination. My instinct is that it’s sure to be useful to someone to have pretty much everything we’ve ever done remain available, but that’s nonsense really and, far from helping people, could end up confusing them, not to mention sucking up resources for maintenance that would be much better spent on creating new content of real worth. We did have an awful lot of pages that related to old exhibitions and so on, and we were very keen to disentangle ourselves as fully as possible from our old content management system, Amaxus 3. In the end we have kept three or four microsites from that era. Other content needed substantial alterations to bring it up to date and suit it to the new site structure.
However, beyond the core, practical information about visits etc., we wanted to do something that would directly serve the core purpose of the IWM: to tell the stories of conflict through the material we hold; and we wanted to do it in a rich, immersive way. BVA came up with a solution that looked lovely, although we went through a few iterations in order to make it easier to create the content and to draw parts of it from the collections middleware. We wanted HTML that could be generated almost automatically, which opens up other potential uses for the template. This took away some of the visual sophistication with which BVA won our hearts, and I suspect that they were a little unhappy to see this go, but this is a site that we want to add to frequently and without having to use HTML developers to do it, so I think we found a happy medium. Our “Collections in Context” (or simply, “history”) section contains over 100 articles at present, using images, audio and video to tell stories spanning from the First World War to the present conflicts in which the UK is involved. They were written by one of IWM’s historians and carefully worked into the CMS by our team in a close collaboration that we hope to turn into a rolling programme of content creation, perhaps reflecting current events or notable anniversaries. I hope in due course we can extend the use of the format to other parts of the site and other voices, perhaps enabling its use as a tool for our website visitors. The people who deserve a shout-out for writing, editing, and/or inputting the hundreds of content pages that make up the new site are New Media’s Jesse Alter and Janice Phillips together with Maggie Hills, who has joined us for a busy few months.
BVA brought a couple of other bits of bling to the site, with the aim of a more engrossing, immersive experience. First amongst these is the “visual browse”, a slideshow mechanism that underlies many of our pages and is brought to the fore by clicking a tab at the top left. We can make any number of these and surface them on the pages where they are relevant – for instance, each branch has its own visual browse.
Is it any good?
When I stand back from whatever details might be preoccupying me on a given day I’m really pleased with the overall effect of what we’ve done, but of course I am not a typical user and what will count will be the feedback we get from our users. But for me, I’m especially pleased with the history pages and the way that our collections are now used there and in the search pages. I am also pretty pleased with the balance we’ve found between the individual branches (essentially, the needs of the physical visitor) and the cross-branch/non-branch activities and content, but because this is necessarily a compromise I expect that it will not work for everyone.
I have reservations too. I think the lack of a mother brand is a problem, and I think we need to make the home page work harder to offer a powerful message of what IWM as a whole is. The lack of fly-out menus is galling to me, although the ones for branches work well. It means more of a leap into the unknown and more clicks to find what you’re after. Our lovely, lovely history content is hard to find. Mobile performance is not that great – the whole site is too wide to load full-width with the text legible, and a ton of stuff loads onto the page that is not important for the mobile user. It functions OK, but it’s far from an optimised experience.
So, my opinions aside, there’s plenty to do over the coming months. But it will feel mighty good to have this milestone out of the way: November 8th – switchover day.
Friday, May 08, 2009
Solr: lessons in love
Solr is...
...a wrapper around the Lucene search engine, a mature open source product from the Apache Foundation, which gives it an HTTP REST interface - in other words, it's got an out-of-the-box web API. So why do we not hear more about it in the world of the digital museum? I've been vaguely aware of it for a long time, and quite possibly there are many museums out there using it, but if so I've not heard of them. We've been talking for so long about the need to get our stuff out there as APIs and yet there's an easy solution right there. OK, perhaps the API isn't quite what you might specify a discovery API to be, but then again maybe it is. It certainly enables pretty sophisticated queries (though see below) and one virtue is that the parameters (though not the field names) would be the same across many installations, so if every museum used Solr we'd have the start of a uniform search API. A good start.
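To give a flavour of that API: a search is nothing more than an HTTP GET against the select handler, with the query, filters, facets and response format all in the querystring. The field names below are whatever your schema defines, so treat them as illustrative:
    http://localhost:8983/solr/select?q=spitfire&fq=borough:Hackney&facet=true&facet.field=object_type&rows=20&wt=xml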
Installation and configuration
Dead easy. Read the tutorial. Download the nightly (it comes with examples), though it's not as easy to find as it should be. Being the nightly, it needs the JDK rather than the JRE to run it, and on my machine I had to fiddle about a bit because I have a couple of versions of Java running, so I couldn't rely on the environment variables to start Solr with the right one. This just means putting the full path to the JDK java EXE into your command prompt if you're running the lightweight servlet container, Jetty, that it comes with. This is the easiest way to get going. Anyway, I wrote a little BAT file for the desktop to make all this easier and stop fannying about with the CMD window each time I wanted to start Solr.
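The BAT file amounts to little more than the following (the JDK and Solr paths are obviously specific to my machine, so adjust to taste):
    @echo off
    rem Start Solr's bundled Jetty with an explicit JDK rather than whatever is first on the PATH
    cd /d C:\apache-solr-nightly\example
    "C:\Program Files\Java\jdk1.6.0_13\bin\java.exe" -jar start.jar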
The other thing to remember with Jetty is that it's run as part of your user session. Now, when you're using Remote Desktop and you log off you see a message to the effect that your programmes will keep running whilst you've logged off. Well, for one server this seemed to be true, but when I tried to get Jetty going on the live web server (albeit going via another, intermediate, RDP) it stopped when I disconnected. I thought I'd use Tomcat instead, since that was already running (for a mapping app, ArcIMS), and by following the instructions I had it going in minutes. Now that may seem unremarkable, but I've installed so many apps over the years, and pretty much anything oriented at developers (and much else besides) always seems to require extra configuration, undocumented OS-specific tweaks, additional drivers or whatever. With Solr, it's pretty much download, unpack, run - well, it was for me. Bloody marvellous.
Replication
Well this couldn't be much easier, and so far no problems in the test environment. Use the lines in the sample config file to specify the master and slave servers, put the same config on both of them (of course) and the slave will pick it up and start polling the master for data.
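For reference, the lines in question configure the ReplicationHandler in solrconfig.xml, ending up as something roughly like this; the hostname, polling interval and conf files are examples, and the sample config in the distribution has the authoritative syntax:
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
      <lst name="slave">
        <str name="masterUrl">http://solr-master:8983/solr/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>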
Data loads/updates
The downside here seems to be the time it can take to do a full re-indexing, but then that's my fault, really, because I've not done what's necessary to do "delta" updates, i.e. to index just the changes. It can take a couple of hours to index 3,000 records from our collections database - these pull in some additional data from one-to-many relationships, which slows things down.
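For the record, the delta updates would be done in the DataImportHandler's data config, by telling it how to find the rows that have changed since the last run. A sketch of the shape of it, with invented table and column names and assuming a last_modified column to key off:
    <entity name="object"
            pk="id"
            query="SELECT id, title, description FROM objects"
            deltaQuery="SELECT id FROM objects WHERE last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, title, description FROM objects WHERE id = '${dataimporter.delta.id}'">
    </entity>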
Data modelling and the denormalised life
Before indexing and updates, though, comes modelling. I've only done one thing so far, which is grab records from a relational database, and after a straightforward basic version (the object record) I played around with adding in related records from subject, date, and site tables. Here I found a problem, which was that the denormalised nature of Solr was, well, more denormalised than was healthy for me. I still haven't quite got my head around whether this is a necessary corollary of a flattened index, or a design limitation that could be overcome. You get to group your multiple records into a sub-element, but instead of a sub-element for each related record, you get a sub-element for each related column. Basically I wanted a repeating element for, say, "subject", and in that element further elements for ID, subject name and hierarchy level. Instead I get one element containing all the IDs, another containing all the subject names, and so on. Like this I cannot confidently link an ID to its subject name. My work-around was a pipe-separated concatenated field that I split up as needed, but that's really not ideal.
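To make the work-around concrete (the field name and values are invented for the example): each related subject goes into a multiValued field as a single pipe-delimited value, something like "123|Aircraft|2", and whatever consumes the results splits it back apart:
    // each value of the multiValued subject_concat field looks like "123|Aircraft|2"
    var value = "123|Aircraft|2";
    var parts = value.split("|");
    var subject = { id: parts[0], name: parts[1], level: parts[2] };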
The other thing I've not yet tried to deal with is bringing in other types of record (or "documents" in the Solr vocab). For instance, full subject records searchable in their own right. Probably they belong in the same index, but it's possible to run a multi-core instance that might be the proper way to handle this. Dunno. Better find out soon though.
Incidentally, one of the nice things I've found with this denormalised way of life, with the whole thing set up for speedy search rather than efficient storage or integrity (you have to assume that this is assured by the original data source), is that some of the nightmare data modelling problems I'd struggled with - duplicate images, messy joins - don't really matter much here. Go for speed and clean up some of the crap with a bit of XSLT afterwards.
Oh, and a small JDBC tip. I've not used JDBC before (I needed it to connect to SQL Server) but installation was a breeze once I'd figured out which version of the connection string I needed for SQL Server 2000. I needed to drop the jar file into the Solr WAR directory, if I recall correctly - that's where it looks first for any jars - so whilst there may be more "proper" solutions this was effective and easy.
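For anyone hitting the same wall: if you're pulling data in with the DataImportHandler, the dataSource definition for the old SQL Server 2000 driver ends up looking something like this (server, database and credentials invented); note that the driver class and URL prefix are different from the later 2005 driver:
    <dataSource type="JdbcDataSource"
                driver="com.microsoft.jdbc.sqlserver.SQLServerDriver"
                url="jdbc:microsoft:sqlserver://dbserver:1433;DatabaseName=collections;SelectMethod=cursor"
                user="solr_read"
                password="********"/>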
Oops, wrong geo data! (pt. 1)
I already talked about one problem, the lack of a structure to hold related multi-value records usefully. One other problem I had with data modelling was with lat/long data. First problem: it wasn't lat/long data in the lat/long fields. Arse. No wonder it seemed a bit short on decimal points - it was OSGB. I brought in the right data from another database (lat/longs are not in Multi Mimsy at the moment). Job done.
Data typing problems - floats and doubles; oops, wrong geo data! (pt. 2)
...or not. To float or not to float? I'm not very good with databases or, indeed, data types. I knew I needed a Trie field in order to do range queries on the geo data. Clearly an integer field would not do, either, these being numbers with lots of digits to the right of the decimal point (up to 6, I think). A float was my first port of call. It turned out not to work, though, so I tried a double. This worked, but I think I need to change the precision value so that I can use finer-grained query values. Do look at trie fields if you're going to use Solr for range queries; they're pretty damn quick. Sjoerd at Europeana gave me the tip and it works for them.
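In schema terms it comes out something like the following; the exact class names have shifted between nightlies, so check the example schema that ships with your build:
    <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0"/>
    <field name="latitude"  type="tdouble" indexed="true" stored="true"/>
    <field name="longitude" type="tdouble" indexed="true" stored="true"/>
A range query is then just latitude:[51.45 TO 51.55], and a smaller precisionStep indexes more terms per value, which speeds up range queries at the cost of a slightly larger index.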
wt=xslt and KML
One nice thing with Solr, being a REST-y HTTP wrapper for Lucene, is that you can do it all in the querystring. One such thing is specifying the transform, so you get your results out of Solr in the form you want rather than having to pull them into some other environment and transform them there. So whilst I was at the in-laws over the bank holiday weekend I could RDP to the web server, set up Tomcat and write a quick transform that could be called from the querystring to return KML instead of plain Solr XML. It was at this point that I noticed the geo data problems, but once they were resolved the wt=xslt method was sweet. Though you can't use your favoured XSLT engine - it's Saxon, for better or worse.
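The mechanics, for anyone wanting to try it: put the stylesheet in the core's conf/xslt directory and name it in the tr parameter. The query and file name below are examples rather than my actual ones:
    http://localhost:8983/solr/select?q=borough:Hackney&rows=500&wt=xslt&tr=kml.xsl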
Other limitations
This is based on my limited knowledge and experience and so subject to being completely wrong. However...
- It's not an RDB. No sub-selects, awkward ways of joining data
- I've found that indexing seems to give inconsistent results. It might be that there have been small differences between the data config files each time with big consequences, but I'm pretty sure that it's just cocking up sometimes. Maybe the machine I'm running it on or the database server is over-worked, but sometimes I get 2k records rather than 3k, and I may also find that a search for a particular borough returns no results even though I can see records in there with that borough in the text AND the borough field. Something's wrong there.
- flaky query order. I got no results with a query along the lines of "text contains keyword, and latitude is between val1 and val2", whereas if I did "latitude is between val1 and val2, and text contains keyword" I got loads. Various other fields were sensitive to being first in the query. I can't find this documented as being intentional. Some situations were helped by putting brackets around the bits and pieces, but some weren't. I'd recommend lots of brackets and judicious use of both quotation marks and "+" signs - see the example below.
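By way of example, the defensive version of that lat/keyword query looks something like this (the field names are from my schema, so illustrative only):
    q=+(latitude:[51.4 TO 51.6]) +(text:"shoreditch")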
So OK, it's not been totally wrinkle free and I'm sure I've forgotten or mixed up stuff, but on the whole Solr has been a great experience. I have a lot still to find out and plenty to fix before my current test dataset is good, but I'm confident that this will end up at the heart of the discovery and web service parts of our planned collections online delivery system. Check it out.
Friday, March 27, 2009
Evaluate this
My problem: to create/declare javascript variable names dynamically. I have a loop in this little SVG experiment I'm doing with the Raphael javascript SVG library (another post to come on this) where I want to make a "set" of SVG elements out of each item in an array of unknown length. I also need to attach an onclick function to each set. For creating the set, putting items into it, and attaching the event handler I need a variable name to be made on the fly. This is how to declare it:
eval("var r" +i +" = dynamically named variable'");If i is currently 2, this creates a string variable with the name r2 and the value "dynamically named variable". To get the variable value you have to use the eval() method again, thus:
alert(eval('r'+i));Probably old hat, this stuff, but I'm not too proud to show my ignorance, at least when I've just reduced it slightly!
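That said, an alternative which avoids eval altogether is to keep the values in an array (or plain object) keyed by i. A rough sketch, rather than the actual code from my experiment:
    // the same idea without eval: use the loop counter as an array index
    var sets = [];
    for (var i = 0; i < 3; i++) {
      sets[i] = "dynamically named variable " + i;  // stand-in for the Raphael set
    }
    alert(sets[2]);  // what eval('r' + 2) would have retrieved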
Wednesday, September 24, 2008
A quick test of SemanticProxy: what, did you expect it to be perfect?
Take this page for example: Shakespeare’s first theatre uncovered. Paste the URL into the box on the demo page. If you look at the entities SemanticProxy identifies, some are impressively accurate. For example, it spots Jack Lohman, Jeff Kelly and William Shakespeare as people; identifies currency, facilities and companies reasonably well; finds phone numbers etc.
On the down side, quirks include that it considers the Museum of London a "facility" not a company or organisation; designates Chelsea Old Church (mentioned only in the navigation) both a person and organisation; thinks Taryn Nixon is the director of Tower Theater Company (though the text says "Taryn Nixon, Director of Museum of London Archaeology"); and calls Shakespeare a Hackney planning officer!
SemanticProxy looks very impressive, still, but this quick test does at least illustrate what a fiendish problem these guys are trying to tackle. The team point out that it's a beta: "No guarantees, no promises", they say, and I hope they stick at it and that I get to play with it properly some time soon!
Friday, June 06, 2008
Small API update
- fixed a bug on the geo thing. For some reason an imbecilic code error wasn't breaking the script on my machine, but did on the web server. Now fixed.
- a CDWALite-lite output for individual object records (example). There's more to add, glitches to fix, and ideally a better solution to the URL, but it's a start. Next thing is a search interface but that depends upon agreement within the Museum. A good solution may be to combine CDWALite and OpenSearch-style RSS, with the records enabling users to find the data end-point, as well as the HTML rendering. In due course I'll probably add link tags to HTML record pages to point at data like this, or I may do it with some POSH.
- the photoLondon website data now has a basic API, which I'll put on the live site next week. It returns basic person details, and search parameters include: surname, forename, keyword, birth year, death year, "alive in" year, gender, country of origin, photographic occupation and non-photographic occupation. I'll work on the search result format soon, as well as the person details.
Wednesday, June 04, 2008
MoL APIs live (but very alpha)
To start you off, here are links to one request from each service.
- events API. This won't work forever as the events will expire. Uses the format that Upcoming outputs with xCal, DC and geo extensions
- publications API
- the geothingy converter whatsit
Tuesday, June 03, 2008
The browser on the server
Incidentally, one of the things that's nice about JavaScript is that it's all focussed around the DOM, unlike many other languages, and I'd think it's pretty forgiving too (depending on the engine). All of which makes it a good candidate for screen scraping. And one thing I've not yet found is a way to tie this into .Net nicely
- http://en.wikipedia.org/wiki/Server-side_JavaScript This lists loads of products for various environments that will run JS server-side. Looks like Mozilla's Rhino and SpiderMonkey products are the basis for most of them.
- http://ejohn.org/blog/server-side-javascript-with-jaxer/. This must be the first stop for me. How cool is that? Here's jaxer itself: http://www.aptana.com/jaxer/. Yes, it is based on Mozilla. Damn, it's for Apache, but it's not a deal-breaker.
- from the same source but older: http://ejohn.org/blog/bringing-the-browser-to-the-server/ Look in the comments too, plenty of starting points there
- I also turned up some of what I take to be the original Netscape JavaScript documentation, which talks a lot about server-side javascript but in relation to the old Netscape Enterprise Server