Mike Ellis and Dan Zambonini recently unveiled hoard.it, which Mike showed me a draft of some time ago but which looks like it's come a long way. Hoard.it is not unlike Dapper in that you can teach it to screen-scrape sites, but instead of making an API onto those sites it ingests the data once scraped and then offers it up through an API of its own (not sure if this is live yet). It can screen-scrape by spidering specified sites. I don't know if their other plans are implemented yet, but they also hope to let you scrape a given page via a bookmarklet, and to give the app the ability to match a given page against templates in its memory. What they're showing is narrower than what the application is capable of, because it's aimed specifically at gathering museum collections data and displays it all appropriately, whereas of course the screen-scraping that hoard.it can do has many more uses than that. It's cool, and in typical style it's an example of "just get it done" tech designed to get something useful underway, even if it's but a stepping stone to something else.
Making templates for a screen-scraper is one way to gather data from the "surface" of the web. Another way to achieve something similar is to embed some sort of standardised "meaning" in the HTML. Microformats are one such route, as are various other flavours of and approaches to Plain Old Semantic HTML (POSH). Early last year I put aside my own effort to test this idea out for indicating and gathering museum objects. I called it Salad Bowl, but Gathery has since taken its place as my favoured moniker. Nothing else has changed in the last year, though.
Gathery is a test of two things, really: firstly, the idea of using microformat-like POSH for museum objects; and secondly, the dream I've long had of being able to gather things I liked as I explored the web. Technically there are three elements, I suppose: the POSH; a bookmarklet to identify any number of objects on a page and pass your selected one to the application; and the application that takes that data, processes it (including looking for better quality data at a "home" URL - say, an OAI-PMH source) and lets the user do stuff like tag and describe it, and ultimately feed it back out. It's functional but not complete (I laid it aside because I was unsure which direction to go with the POSH, not to mention being quite busy), but in many ways it's similar to what Dan and Mike are doing with hoard.it. When you boil it down, the differences come down to the screen-scraping as opposed to the POSH approach.
So I've been trying to draw out the pros and cons of hoard.it and Gathery (the latter will probably never come to anything), but essentially it's a comparison of screen-scraping and POSH (though a pro for one isn't automatically a con for the other). Bear in mind that at the back of my mind I have a set of questions about how we move towards a semantic web for museums, as well as how to achieve the things I've always wanted as a user, namely an easy way to search and gather material from collections all round the web. Obviously my comparison is based on a good knowledge of what I built and pretty thin knowledge of hoard.it, and since I've not publicised Gathery it's not easy for you to test the comparison (though do ask for the URL if you're interested). But think of my pros and cons as relating to screen-scraping and POSH in general, let me know if you think they cover the important aspects, and tell me where I'm wrong and what I'm missing.
Gathery (or microformat/POSH approaches)
Pros:
- m-object, the POSH I drafted and tested with Gathery, has two levels: a content indicator (which points to the "home" URL where the best content resides, and includes a GUID and an institutional identifier) and a content carrier
- the content carrier is optional, so content need not reside on the page: the author can choose to do no more than indicate that an object's record exists somewhere else
- data marked up by authors against explicitly chosen standards should be less fragile than screen-scraped data
- Gathery is focussed around user behaviour and goes where users go, including outside museum sites. If someone embeds a pointer to an object on the Museum of London site in a blog or Wikipedia, it can be submitted to Gathery, which will go to the "home" URL to look for the fuller record
- content owners get to decide the relationship between their data and the fields in POSH of whatever sort (a microformat, the m-object quasi-format I dreamt up, RDFa, etc.)
- the full data content can take any number of richer forms beyond the m-object snippet on the page itself
Cons:
- POSH or microformats require explicit choices and action by website owners, whether they are museums or private individuals
- the m-object content carrier part is inflexible [whereas screen-scraping is in some respects as versatile as the scraper's designer wishes]
- it is the content creator, not the gatherer, who decides how the data aligns with standard fields (see also pros!)
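To make the two-level idea concrete, here is a minimal sketch of how a gathering application might pull the content-indicator fields out of m-object-style POSH using nothing but Python's standard library. The class names (`home-url`, `guid`, `institution`, `title`) and the snippet itself are my illustrative guesses at what such markup might look like, not the actual m-object draft:

```python
from html.parser import HTMLParser

# Hypothetical m-object-style markup: class names are illustrative
# guesses, not the real draft format.
SNIPPET = """
<div class="m-object">
  <a class="home-url" href="http://example-museum.org/objects/1234">
    <span class="guid">urn:example:object:1234</span>
  </a>
  <span class="institution">Example Museum</span>
  <span class="title">Roman oil lamp</span>
</div>
"""

class MObjectParser(HTMLParser):
    """Collect the content-indicator fields from m-object-style POSH."""
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._capture = None  # class whose text we're waiting for

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = attrs.get("class", "").split()
        if "home-url" in classes and "href" in attrs:
            self.fields["home_url"] = attrs["href"]
        for cls in ("guid", "institution", "title"):
            if cls in classes:
                self._capture = cls

    def handle_data(self, data):
        if self._capture and data.strip():
            self.fields[self._capture] = data.strip()
            self._capture = None

parser = MObjectParser()
parser.feed(SNIPPET)
print(parser.fields["home_url"])  # the "home" URL the app would then fetch
print(parser.fields["guid"])
```

The point of the exercise: once the indicator is found, the application throws the page away and follows `home_url` for the authoritative record, which is exactly what a scraper cannot do.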
Hoard.it (or screen-scraping approaches)
Pros:
- scraping of (well-structured) content requires only that the template be built: nothing extra is required of the site owner
- it is adaptable to the available data and to the preferred data model of the operator (the ingesting application) and, to an extent, of the template creator
- clumsy, semantically overloaded HTML is avoided
- hoard.it includes a spider (though of course this is just as possible for a POSH-based application)
- when the bookmarklet is available (if it isn't already), users should be able to gather data from wherever they find it and apply an existing or new template to it
Cons:
- screen-scraping by definition depends on content at the surface of the web, i.e. on web pages: all the content you wish to grab needs to be there
- data is rarely structured on screen in an unambiguous and subtle way, whereas data structured for machine-to-machine communication or indexing is; using the latter where possible would therefore be better
- the scraper template designer tries to fit what they find to the data model they have, whilst the data's owner may have other ideas about the best fit
- fragile (part 1) - a change of HTML breaks the template and may render previously gathered data for an item unworkable, breaking the logical link between versions of an item
- fragile (part 2) - a change of location breaks the knowledge of an item because there is no concept of a home URL or a unique identifier
- if users are declaring their own templates, there are no common standards for the data they are ingesting
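By way of contrast, here is a screen-scraping template reduced to its essentials: a hand-built mapping from field names to patterns anchored on the surrounding HTML. The page, field names and patterns are all invented for illustration (hoard.it's real templates will look nothing like this), but the fragility of part 1 above falls out immediately:

```python
import re

# An imagined object page: the scraper knows nothing about the site
# except a hand-built template of landmarks around each field.
PAGE = """
<h1 class="obj-title">Roman oil lamp</h1>
<p>Date: <em>1st century AD</em></p>
"""

# Hand-built template: field name -> regex anchored on surrounding HTML.
TEMPLATE = {
    "title": r'<h1 class="obj-title">(.*?)</h1>',
    "date": r"Date: <em>(.*?)</em>",
}

def scrape(page, template):
    """Apply each field pattern; fields with no match come back as None."""
    return {field: (m.group(1) if (m := re.search(pat, page)) else None)
            for field, pat in template.items()}

print(scrape(PAGE, TEMPLATE))

# A cosmetic redesign silently breaks the template ("fragile, part 1"):
REDESIGNED = PAGE.replace('class="obj-title"', 'class="title"')
print(scrape(REDESIGNED, TEMPLATE))  # title is now None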
This is all a bit of a muddle, and perhaps it's unwise to mix up discussion of two particular (and very alpha) applications with discussion of the two broad approaches they take to similar problems, but for me it's a good way of teasing out the merits of each. I also think there's scope to combine them: for example, whilst hoard.it might continue with a gunslinging, screen-scraping, get-it-done approach, it could also advocate that museum web techs use some minimalist POSH (a bit like COinS) to uniquely identify their objects and give them a "home URL", ideally an end-point with properly structured data that it could also ingest (CDWA Lite or whatever). It could demonstrate the merit of this relatively easily. In this way, something that doesn't require too much from authors could add an extra dimension to an application that requires nothing from them at all, other than nice, regular HTML.
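As a rough sketch of what that hybrid might look like: the page carries a single minimal, COinS-like span giving the object its GUID and "home" URL, which a scraper could read alongside its normal template so that redesigns and relocations no longer sever an item's identity. The markup and attribute names here are entirely hypothetical:

```python
import re

# Hypothetical minimal POSH: one span identifying the object, while
# everything else on the page is scraped as usual.
PAGE = """
<span class="obj-id" title="urn:example:object:1234"
      data-home="http://example-museum.org/objects/1234"></span>
<h1>Roman oil lamp</h1>
"""

def identity(page):
    """Pull the stable identifier out of the minimal span, if present."""
    m = re.search(r'class="obj-id" title="([^"]+)"\s+data-home="([^"]+)"',
                  page)
    return {"guid": m.group(1), "home_url": m.group(2)} if m else None

print(identity(PAGE))
```

Even if the rest of the template breaks, the GUID survives a redesign and the home URL survives a move, which is exactly the part screen-scraping alone can't give you.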