The Doofer Call: Solr: lessons in love

OK, maybe this will turn out to be the promised foundational intro to Solr/Lucene and the reasons why I've found myself seeing them as some kinda saviour. Or it may just be a few of the lessons I've learned thus far in my experiments.

Solr is...

...a wrapper around the Lucene search engine, a mature open-source product from the Apache Foundation. Solr gives Lucene an HTTP REST interface - in other words, it's got an out-of-the-box web API. So why do we not hear more about it in the world of the digital museum? I've been vaguely aware of it for a long time, and quite possibly there are many museums out there using it, but if so I've not heard of them. We've been talking for so long about the need to get our stuff out there as APIs and yet there's an easy solution right there. OK, perhaps the API isn't quite what you might specify a discovery API to be, but then again maybe it is. It certainly enables pretty sophisticated queries (though see below) and one virtue is that the parameters (though not the field names) would be the same across many installations, so if every museum used Solr we'd have the start of a uniform search API. A good start.
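To make the "same parameters everywhere" point concrete, here's a minimal sketch of building a search URL. The q, start, rows and fl parameters are standard Solr; the host and the field names (text, id, title) are made up for illustration, and 8983 is simply the port the bundled Jetty example listens on.

```python
from urllib.parse import urlencode

# Hypothetical base URL; 8983 is the port used by the bundled Jetty example.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_search_url(query, start=0, rows=10, fields=None, base=SOLR_SELECT):
    """Build a Solr search URL from the standard request parameters."""
    params = [("q", query), ("start", start), ("rows", rows)]
    if fields:
        # fl limits which stored fields come back for each document
        params.append(("fl", ",".join(fields)))
    return base + "?" + urlencode(params)


# Field names here are assumptions; every installation defines its own.
print(build_search_url("text:teapot", rows=5, fields=["id", "title"]))
```

Only the field names would change from one museum's installation to the next; the parameter names stay put.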

Installation and configuration
Dead easy. Read the tutorial. Download the nightly (it comes with examples), though it's not as easy to find as it should be. Being the nightly, it needs the JDK rather than the JRE to run, and on my machine I had to fiddle about a bit because I have a couple of versions of Java running, so I couldn't rely on the environment variables to start Solr with the right one. This just means putting the full path to the JDK java EXE into your command prompt if you're running the lightweight servlet container, Jetty, that it comes with. This is the easiest way to get going. Anyway I wrote a little BAT file for the desktop to make all this easier and stop fannying about with the CMD window each time I wanted to start Solr.
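For what it's worth, the same trick as my BAT file can be sketched in Python. This is not the BAT file itself, just an illustration of the idea: call java by its full path and launch start.jar from the example directory. Both paths in the commented-out call are hypothetical.

```python
import subprocess


def jetty_command(java_exe, jar="start.jar"):
    """Return the argument list that launches the bundled Jetty with a specific java."""
    return [java_exe, "-jar", jar]


def start_solr(java_exe, example_dir):
    """Start Jetty from the Solr example directory, where start.jar lives."""
    return subprocess.Popen(jetty_command(java_exe), cwd=example_dir)


# Hypothetical paths; substitute your own JDK and Solr locations.
# start_solr(r"C:\Program Files\Java\jdk1.6.0_13\bin\java.exe", r"C:\solr\example")
```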

The other thing to remember with Jetty is that it's run as part of your user session. Now, when you're using Remote Desktop and you log off you see a message to the effect that your programmes will keep running whilst you're logged off. Well for one server this seemed to be true, but when I tried to get Jetty going on the live web server (albeit going via another, intermediate, RDP) it stopped when I disconnected. I thought I'd use Tomcat instead, since that was already running (for a mapping app, ArcIMS), and by following the instructions I had it going in minutes. Now that may seem unremarkable, but I've installed so many apps over the years, and pretty much anything oriented at developers (and much more besides) pretty much always requires extra configuration, undocumented OS-specific tweaks, additional drivers or whatever. With Solr, it's pretty much download, unpack, run - well it was for me. Bloody marvellous.

Replication

Well this couldn't be much easier, and so far no problems in the test environment. Use the lines in the sample config file to specify the master and slave servers, put the same config on both of them (of course) and the slave will pick it up and start polling the master for data.

Data loads/updates

The downside here seems to be the time it can take to do a full re-indexing, but then that's my fault, really, because I've not done what's necessary to do "delta" updates, i.e. just the changes. It can take a couple of hours to index 3000 records from our collections database - these have some additional data from one-to-many relationships, which slows things down.
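If the indexing is done through Solr's DataImportHandler (an assumption here, though the JDBC connection and data config files point that way), the difference between a full rebuild and a delta update comes down to the command parameter on the handler URL. A rough sketch, with a hypothetical host and the conventional /dataimport handler path, which may be registered under a different name in your solrconfig.xml:

```python
from urllib.parse import urlencode

# Assumed handler location; the path is whatever solrconfig.xml registers.
DATAIMPORT = "http://localhost:8983/solr/dataimport"


def dataimport_url(command, base=DATAIMPORT, **options):
    """Build a DataImportHandler URL for a command such as full-import or delta-import."""
    params = {"command": command}
    params.update(options)
    return base + "?" + urlencode(params)


print(dataimport_url("full-import"))
print(dataimport_url("delta-import"))
```

A delta import only works once the data config says how to find changed rows, which is exactly the bit I've not done yet.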

Data modelling and the denormalised life

Before indexing and updates, though, comes modelling. I've only done one thing so far, which is grab records from a relational database, and after a straightforward basic version (the object record) I played around with adding in related records from subject, date, and site tables. Here I found a problem, which was that the denormalised nature of Solr was, well, more denormalised than was healthy for me. I still haven't quite got my head around whether this is a necessary corollary of a flattened index, or a design limitation that could be overcome. You get to group your multiple records into a sub-element, but instead of a sub-element for each related record, you get a sub-element for each related column. Basically I wanted a repeating element for, say, "subject", and in that element further elements for ID, subject name, hierarchy level. Instead I get one element containing all the ID elements, another containing all the subject names, and so on. Like this I cannot confidently link ID and subject name. My work-around was a pipe-separated concatenated field that I split up as needed, but that's really not ideal.
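The splitting side of that work-around might look something like the sketch below. The field order (ID, name, hierarchy level) and the sample values are assumptions for illustration; the point is just that each concatenated value keeps one related record's columns together.

```python
from typing import NamedTuple


class Subject(NamedTuple):
    id: str
    name: str
    level: int


def parse_subject(value, sep="|"):
    """Split one concatenated value such as '1042|Ceramics|2' into its parts."""
    subject_id, name, level = value.split(sep)
    return Subject(subject_id, name, int(level))


def parse_subjects(values):
    """Parse every value of a multi-valued field into Subject records."""
    return [parse_subject(v) for v in values]


# Made-up sample values in the assumed ID|name|level order.
subjects = parse_subjects(["1042|Ceramics|2", "1077|Teapots|3"])
print(subjects[1].name)  # prints "Teapots"
```

It falls over if a subject name ever contains a pipe, which is one reason it's not ideal.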

The other thing I've not yet tried to deal with is bringing in other types of record (or "documents" in the Solr vocab). For instance, full subject records searchable in their own right. Probably they belong in the same index, but it's possible to run a multi-core instance that might be the proper way to handle this. Dunno. Better find out soon though.

Incidentally one of the nice things I've found with this denormalised way of life, with the whole thing set up for speedy search rather than efficient storage or integrity (you have to assume that this is assured by the original data source) is that some of the nightmare data modelling problems I'd struggled with - duplicate images, messy joins - don't really matter much here. Go for speed and clean up some of the crap with a bit of XSLT afterwards.

Oh, and a small JDBC tip. I've not used JDBC before (I needed it to connect to SQL Server) but installation was a breeze once I'd figured out which version of the connection string I needed for SQL Server 2000. I needed to drop the jar file into the Solr WAR directory, if I recall correctly - that's where it looks first for any jars - so whilst there may be more "proper" solutions this was effective and easy.

Oops, wrong geo data! (pt. 1)

I already talked about one problem, the lack of a structure to hold related multi-value records usefully. One other problem I had with data modelling was with lat/long data. First problem: it wasn't lat/long data in the lat/long fields. Arse. No wonder it seemed a bit short on decimal points - it was OSGB. I brought in the right data from another database (lat/longs are not in Multi Mimsy at the moment). Job done.

Data typing problems - floats and doubles; oops, wrong geo data! (pt. 2)

....or not. To float or not to float? I'm not very good with databases or, indeed, data types. I knew I needed a Trie field in order to do range queries on the geo data. Clearly an integer field would not do, either, these being numbers with lots of digits to the right of the decimal point (up to 6, I think). A float was my first port of call. Turned out not to work, though, so I tried a double. This worked, but I think I need to change the precision value so that I can use finer-grained query values. Do look at Trie fields if you're going to use Solr for range queries, they're pretty damn quick. Sjoerd at Europeana gave me the tip and it works for them.
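For the record, a range query on the geo data is just the standard field:[low TO high] syntax, once for latitude and once for longitude. A minimal sketch of a bounding-box query, assuming fields called latitude and longitude (yours may be named differently) and some made-up coordinates:

```python
def range_clause(field, low, high):
    """Inclusive Solr range clause, e.g. latitude:[51.4 TO 51.6]."""
    return "%s:[%s TO %s]" % (field, low, high)


def bounding_box_query(south, west, north, east,
                       lat_field="latitude", lon_field="longitude"):
    """Require both the latitude and the longitude to fall inside the box."""
    return "+%s +%s" % (range_clause(lat_field, south, north),
                        range_clause(lon_field, west, east))


print(bounding_box_query(51.4, -0.2, 51.6, 0.1))
# prints: +latitude:[51.4 TO 51.6] +longitude:[-0.2 TO 0.1]
```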

wt=xslt and KML

One nice thing with Solr, being a REST-y HTTP wrapper for Lucene, is that you can do it all in the querystring. One such thing is to specify the transform and get your results out of Solr as you want them, rather than having to pull them into some other environment and do it there. So whilst I was at the in-laws over bank holiday weekend I could RDP to the web server, set up Tomcat and write a quick transform that could be called from the querystring to return KML instead of plain Solr XML. It was at this point that I discovered the geo data problems, but once they were resolved the wt=xslt method was sweet. Though you can't use your favoured XSLT engine - it's Saxon, for better or worse.
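In querystring terms that means adding wt=xslt plus a tr parameter naming the stylesheet, which Solr looks for in the xslt folder of its conf directory. A small sketch, where the host, the borough field and the kml.xsl file name are all stand-ins for whatever you actually have:

```python
from urllib.parse import urlencode


def xslt_search_url(query, stylesheet, base="http://localhost:8983/solr/select"):
    """Build a search URL whose results are transformed server-side by an XSLT file."""
    params = [("q", query), ("wt", "xslt"), ("tr", stylesheet)]
    return base + "?" + urlencode(params)


# Hypothetical field and stylesheet names.
print(xslt_search_url("borough:Camden", "kml.xsl"))
```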

Other limitations

This is based on my limited knowledge and experience and so subject to being completely wrong. However...

It's not an RDB. No sub-selects, and awkward ways of joining data.
Indexing seems to get inconsistent results. It might be that there have been small differences between the data config files each time with big results, but I'm pretty sure that it's just cocking up sometimes. Maybe the machine I'm running or the database server are over-worked, but sometimes I get 2k records rather than 3k, and also I may find that a search for a particular borough returns no results even though I can see records in there with that borough in the text AND the borough field. Something's wrong there.
Flaky query order. I got no results with a query along the lines of "text contains keyword, and latitude is between val1 and val2", whereas if I did "latitude is between val1 and val2, and text contains keyword" I got loads. Various other fields were sensitive to being first in the query. I can't find this documented as being intentional. Some situations were helped by putting brackets around the bits and pieces, but some weren't. I'd recommend lots of brackets and judicious use of both quotation marks and "+" signs.
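By "lots of brackets" I mean something like the sketch below: each clause wrapped in its own brackets and marked as required with "+", and phrases put in quotation marks. The field names and values are hypothetical, and I can't promise this cures the ordering problem, only that it makes the query say explicitly what you mean.

```python
def quoted(field, phrase):
    """Phrase clause with the value in double quotes, e.g. text:"tea pot"."""
    return '%s:"%s"' % (field, phrase.replace('"', '\\"'))


def require_all(*clauses):
    """Wrap each clause in brackets and mark it as required with '+'."""
    return " ".join("+(%s)" % c for c in clauses)


# Hypothetical fields and values.
q = require_all(quoted("text", "tea pot"), "latitude:[51.4 TO 51.6]")
print(q)
# prints: +(text:"tea pot") +(latitude:[51.4 TO 51.6])
```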

So OK, it's not been totally wrinkle-free and I'm sure I've forgotten or mixed up stuff, but on the whole Solr has been a great experience. I have a lot still to find out and plenty to fix before my current test dataset is good, but I'm confident that this will end up at the heart of the discovery and web service parts of our planned collections online delivery system. Check it out.

Friday, May 08, 2009