About Me

Web person at the Imperial War Museum, just completed PhD about digital sustainability in museums (the original motivation for this blog was as my research diary). Posting occasionally, and usually museum tech stuff but prone to stray. I welcome comments if you want to take anything further. These are my opinions and should not be attributed to my employer or anyone else (unless they thought of them too). Twitter: @jottevanger

Thursday, May 29, 2008

Screen-scraping and POSH

I hesitate to put this post in thing-versus-another-thing terms, and I won't. For one thing, I think the two alternatives I'll discuss are not in opposition but complement one another. But there are strengths and weaknesses to each.

Mike Ellis and Dan Zambonini recently unveiled hoard.it, which Mike showed me a draft of some time ago but which looks like it's come a long way. Hoard.it is not unlike Dapper in that you can teach it to screen-scrape sites, but instead of making an API onto those sites it will ingest the data and then offer it up through an API once scraped (not sure if this is live yet). It can screen-scrape by spidering specified sites. I don't know if their other plans are yet implemented but they also hope to let you scrape a given page via a bookmarklet, and to give the app the ability to match up a given page to templates in its memory. What they're showing is more targeted than what the application is capable of, because it's aimed specifically at gathering museum collections data and displays it all appropriately, whereas of course the screen-scraping that hoard.it can do has many more uses than that. It's cool, and in typical style it's an example of "just get it done" tech designed to get something useful underway, even if it's but a stepping stone to something else.

Making templates for a screen-scraper is one way to gather data from the "surface" of the web. Another way to achieve something similar is to embed some sort of standardised "meaning" in the HTML. Microformats are one such route, as are various other flavours and approaches to Plain Old Semantic HTML (POSH). Early last year I put aside my effort to test this idea out for indicating and gathering museum objects. I called it Salad Bowl but Gathery has taken its place as my favoured moniker. Nothing else has changed in the last year, though.

Gathery is a test of two things, really: firstly, the idea of using microformat-like POSH for museum objects; and secondly, the dream I've long had of being able to gather things I liked as I explored the web. Technically there are three elements, I suppose: the POSH; a bookmarklet to identify any number of objects on a page and pass your selected one to the application; and the application that takes that data, processes it (including looking for better quality data at a "home" URL - say, an OAI-PMH source) and lets the user do stuff like tag and describe it, and ultimately feed it back out. It's functional but not complete (I laid it aside because I was unsure which direction to go with the POSH, not to mention being quite busy), but in many ways it's similar to what Dan and Mike are doing with hoard.it. When you boil it down, the differences come down to the screen-scraping as opposed to the POSH approach.

So I've been trying to draw out the pros and cons of hoard.it and Gathery (the latter will probably never come to anything), but essentially it's a comparison of screen-scraping and POSH (though a pro for one isn't automatically a con for the other). Bear in mind that I have at the back of my mind a set of questions relating to how we move towards a semantic web for museums, as well as how to achieve the things I've always wanted as a user, namely an easy way to search and gather material from collections all round the web. Obviously my comparison is based on a good knowledge of what I built and pretty thin knowledge about hoard.it, and since I've not publicised Gathery it's not easy for you to test the comparison (though do ask for the URL if you're interested). But take my pros and cons as relating to screen-scraping and POSH and let me know if you think they cover the important aspects, and tell me where I'm wrong and what I'm missing.

Gathery (or microformat/POSH approaches)
Pros:

  • m-object, the POSH I drafted and tested with Gathery, has two levels: content indicator (points to the "home" URL where the best content resides, includes a GUID and an institutional identifier); and content carrier.
  • content carrier is optional, so it is not necessary for content to reside on the page: the author can choose to do no more than indicate that an object's record exists somewhere else
  • markup in which authors use explicitly chosen standards should be less fragile than screen-scraping
  • Gathery is focussed around user behaviour and goes where they go, including outside museum sites. If someone embeds in a blog or Wikipedia a pointer to an object on the Museum Of London site, it can be submitted to Gathery which will go to the "home" URL to look for the fuller record
  • content owners get to decide the relationship between their data and the fields in POSH of whatever sort (a microformat, the m-object quasi-format I dreamt up, RDFa etc)
  • data content can be in any number of forms other than the m-object snippet on the page itself

Cons:

  • POSH or microformats require explicit choices and action by website owners, whether they are museums or private individuals etc.
  • the m-object content carrier part is inflexible [whereas screen-scraping is in some respects as versatile as the scraper's designer wishes]
  • content creators have decided how to align their data with standard fields, as opposed to the gatherer (see also pros!)

Hoard.it (or screen-scraping approaches)
Pros:

  • scraping of (well-structured) content requires only that the template be built: nothing extra is required of the site owner
  • it is adaptable to fit the available data and the preferred data model of the operator (the ingesting application), and to an extent the template creator
  • clumsy, semantically overloaded HTML is avoided
  • hoard.it includes a spider (though of course this is just as possible for a POSH-based application)
  • when the bookmarklet is available (if it's not already) then hopefully users will be able to gather data from wherever, and apply an existing or new template to it

Cons:

  • screen-scraping by definition depends on content at the surface of the web i.e. on web pages. All the content you wish to grab needs to be there
  • data is rarely structured on screen in an unambiguous and subtle way. Data that is structured for machine-to-machine communication or indexing is. Using this where possible would therefore be better
  • the scraper template designer tries to fit what they find to the data model they have, whilst the data's owner may have other ideas about the best fit
  • fragile (part 1) - a change of HTML breaks the template and may make previous data models for an item unworkable, breaking the logical link between versions of an item
  • fragile (part 2) - a change of location breaks the knowledge of an item because there is no concept of a home URL or a unique identifier.
  • if users are declaring their own templates, there are no common standards for the data they are ingesting
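The "fragile (part 1)" point can be illustrated with a minimal sketch of a scraper template in Python. This is not how hoard.it defines its templates (I don't know the details of that); the template format, the class names and the sample pages are all assumptions invented for the illustration. The template simply says which tag and class each field's text is expected to live in.

```python
from html.parser import HTMLParser

# Hypothetical template: field name -> (tag, class) where its text lives.
TEMPLATE = {
    "title": ("h1", "object-title"),
    "date": ("span", "object-date"),
}


class TemplateScraper(HTMLParser):
    """Pulls out the text of each element that the template points at."""

    def __init__(self, template):
        super().__init__()
        self.template = template
        self.record = {}
        self._field = None  # field whose text we are waiting for

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for field, (want_tag, want_class) in self.template.items():
            if tag == want_tag and want_class in classes:
                self._field = field

    def handle_data(self, data):
        if self._field and data.strip():
            # Keep the first match for each field.
            self.record.setdefault(self._field, data.strip())
            self._field = None

    def handle_endtag(self, tag):
        self._field = None


def scrape(html, template=TEMPLATE):
    scraper = TemplateScraper(template)
    scraper.feed(html)
    return scraper.record


# The same record before and after a (hypothetical) site redesign
# that renames the class on the heading.
OLD_PAGE = ('<h1 class="object-title">Bronze brooch</h1>'
            '<span class="object-date">AD 50-100</span>')
NEW_PAGE = ('<h1 class="heading">Bronze brooch</h1>'
            '<span class="object-date">AD 50-100</span>')

print(scrape(OLD_PAGE))
print(scrape(NEW_PAGE))
```

With the old page the template finds both fields; after the redesign the title silently drops out, and nothing on the page tells the scraper that the two versions describe the same item.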

This is all a bit of a muddle, and perhaps it's unwise to mix up discussion of two particular (and very alpha) applications with discussion of the two broad approaches they take to similar problems, but for me it's a good way of teasing out the merits of each. I also think that there's scope to combine them - for example, whilst hoard.it might continue with a gunslinging, screen-scraping, get-it-done approach, it could also advocate to museum web techs that they use some minimalist POSH (a bit like COinS) to uniquely identify their objects and give them a "home URL", ideally an end-point with proper structured data that it could also ingest (CDWALite or whatever). It could demonstrate the merit of this relatively easily. In this way something that didn't require too much from authors could add an extra dimension to an application that requires nothing from them at all, other than nice and regular HTML.

2 comments:

Mike said...

Hey Jeremy,

Great post - sums up the two approaches very well. Ditto, like you, I'm not about to "defend" the hoard.it approach and "attack" the Gathery approach. Again, like you, I think the two should be able to work very harmoniously together.

The vision I have for hoard.it (and I think DZ is the same, but because it's alphatastic, haven't got round to building it yet) is that hoard.it should take the best of both approaches. On the one hand, we all know that adding "semantic" tags to existing data is hard, involves buy-in, time, agreement, etc. On the other, we also all recognise that this is a more accurate approach...

I see the hoard.it scraper ultimately working cascade-style on the data (non-complete list coming up - it could be tiered as much as needed):

1. Looks for existence of a programmatically available API

if none then...

2. Looks for POSH stuff, referenced (eek) by some kind of standard REL tag, "schema" or some other mechanism

if none then...

3. Looks for META stuff in the HEAD

if none then..

[insert other - increasingly less "accurate" pointers here]

...

finally:

if none then..

n. default back to the (x)html and a "known" template for the page and use that to scrape the data

We jumped straight to "n" because - yes, you got it - we wanted to "just do it"..
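The cascade above could be sketched roughly like this in Python. It is only an illustration under stated assumptions: the function names are hypothetical, the API and POSH steps are left as stand-ins that find nothing, and the final "known template" step is reduced to grabbing the first `<h1>`. The one thing it is meant to show is the ordering, where each tier is tried only if the more accurate ones above it came back empty.

```python
import re
from html.parser import HTMLParser


class MetaParser(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"]] = attrs["content"]


def from_api(url, html):
    # Stand-in for step 1: a real version would look for an API.
    return None


def from_posh(url, html):
    # Stand-in for step 2: a real version would look for POSH markup.
    return None


def from_meta(url, html):
    # Step 3: META stuff in the HEAD.
    parser = MetaParser()
    parser.feed(html)
    return parser.meta or None


def from_template(url, html):
    # Step n, heavily simplified: a "known" template that takes the first h1.
    match = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.S)
    return {"title": match.group(1).strip()} if match else None


# Most accurate first; more tiers could be slotted in as needed.
CASCADE = [
    ("api", from_api),
    ("posh", from_posh),
    ("meta", from_meta),
    ("template", from_template),
]


def extract(url, html, cascade=CASCADE):
    """Returns (tier name, record) from the first tier that finds anything."""
    for name, strategy in cascade:
        record = strategy(url, html)
        if record:
            return name, record
    return None, {}


page = ('<html><head><meta name="DC.title" content="Bronze brooch"></head>'
        '<body><h1>Brooch</h1></body></html>')
print(extract("http://museum.example/objects/456", page))
```

Returning the tier name alongside the record is a design choice of the sketch, so that whatever stores the data can remember how trustworthy its source was.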

NOW...

This got me thinking that there are several interesting directions in which this could go, plus some questions:

1. Direction: provide some means by which website owners could attach the "template relationship" between a *group* (important point) of pages and the content on those pages. I'm thinking a REL tag - a GRDDL approach but not so damn heavy!

2. Question: could the hoard.it application *generate* this relationship in the same way the site owner could add it? Yes...and that'd be cool...

3. Question: could we use the power of the crowd to help *define* these relationships? Yes...and that'd be super cool...

4. Question: stepping away from the "ideal" (you know how much I love to spend time in reality and not ideality...) - how much do templates actually change? Really? Not much...

5. Question: where does *storing* the data (as opposed to spidering and munging on the fly) fit into the equation?...Frankly, I don't know...

So...

Where does it go? I'm not sure we know :-) but it's gonna be interesting, either which way.

Something is itching my brain about the way in which content could be exposed and granularised on-page, and I'm not sure I've quite scratched that itch yet...

cheers!

Mike

Jeremy said...

Hi Mike,
Sorry it's taken so long to reply to your thoughtful comment. I saw you were going to be away and thought that would give me a window to write a decent response. But I didn't.

Anyway, I guess you could say my vision for Gathery was not dissimilar, with the exception being that it was also to be a test application of the POSH idea, wherein essentially a single template is already defined. hoard.it is perhaps a better solution in the wider scheme of things, given, as you say, the reality of no standardised data on web pages at present.

Gathery has built into it the facility to infer from minimal data the place to find more data, which is perhaps something that you might consider doing with hoard.it too. This is perhaps more oriented at the serendipitous-gathering-by-bookmarklet approach than the organised spidering approach. Obviously it can be done in your current screen-scraper template style, although some sort of convention would still help. The actual home of the fuller data might be the DC in the head of an HTML page, a body HTML in a record page (for which a hoard.it template exists), or a CDWALite or OAI record out there somewhere. Just an idea of how the two approaches might come still closer together.

For "direction", I'm glad you agree that it would be sensible to have some way to indicate in the head where to go to see how the HTML relates to some established set of fields. But I don't think you should consider too lightly taking a novel approach to defining this relationship. If microformats, eRDF and other approaches are looking at GRDDL to do the job it's perhaps worth exploring, unless the only application you have in mind is hoard.it (which would be a shame, I think). GRDDL may be tricky but why shouldn't hoard.it do the work of writing it, as per your question 2? (but poor Dan!) Users define the template, from which GRDDL is inferred.

As for where the data is stored, my model was to: (a) compare the source URL to known domains for the museum the object is identified with; (b) gather and store what is on the page; (c) if the page points somewhere else, store that instead; (d) always choose data from the object's owner's known domain in preference to other sources; (e) always choose structured machine-facing data in preference to scraped data (but subject to (d)); (f) store not just the extracted data but a blob of XML for the master data, in case there are fields in it that aren't mapped in Gathery; (g) save all URLs where the object has been found.
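Rules (d), (e) and (g) are the easiest to show in code. What follows is a minimal Python sketch rather than Gathery's actual code: the `Sighting` class, the function names and the example domains are all made up for the illustration. Each place an object has been seen is a "sighting"; the master record is chosen by preferring the owner's domain first and structured data second, and every URL is kept.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Sighting:
    """One place an object was found (hypothetical structure)."""
    url: str
    structured: bool  # True for machine-facing data, False for scraped HTML
    data: dict


def on_owner_domain(url, owner_domains):
    # Rule (a): compare the source URL with the owner's known domains.
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in owner_domains)


def choose_master(sightings, owner_domains):
    # Tuples compare left to right, so rule (d), the owner's domain,
    # outranks rule (e), structured over scraped.
    return max(
        sightings,
        key=lambda s: (on_owner_domain(s.url, owner_domains), s.structured),
    )


def store(sightings, owner_domains):
    master = choose_master(sightings, owner_domains)
    return {
        "master": master.data,
        "master_url": master.url,
        "seen_at": [s.url for s in sightings],  # rule (g): keep every URL
    }


sightings = [
    Sighting("http://blog.example/post", False, {"title": "a brooch"}),
    Sighting("http://aggregator.example/oai?id=456", True, {"title": "Brooch"}),
    Sighting("http://www.museum.example/objects/456", False,
             {"title": "Bronze brooch"}),
]
print(store(sightings, owner_domains=["museum.example"]))
```

In this example the scraped page on the museum's own domain wins over the structured record held by a third party, which is the "(but subject to (d))" part of rule (e).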

The problem scratching away at you is perhaps related to the problem that got me onto the Gathery approach in the first place: on any given page there's no guarantee that only one object will be present, nor that all the pertinent information about that object will be on it. Whilst it asks quite a bit of the author, the POSH approach meant that you could stick as many items as you liked on a page, in any ad hoc arrangement you please, and not need to include all the data there to know it would be picked up, relying instead on a "home" URL for the objects.

Perhaps the concept of the home URL may be something to consider importing to hoard.it too. I think this is at least as important as pointing to a transformation of whatever sort. Obviously it's not something to depend upon but it's a way that machines could tell your spider where to look for better quality records, with stable URLs!

Here's an example of what a data end-point might be like, by the way. This is good too. And it would be nice to see authority records about subjects, people, places having home URLs too...