SPARQL – What’s up with that?

The title of this post is intended to convey a touch of bewilderment through use of a phrase from the Cliff Clavin school of observational comedy.

Linked data and SPARQL

In the linked data world, SPARQL (SPARQL Protocol and RDF Query Language) is touted as the preferred method for querying structured RDF data. In recent years several high profile institutions have worked very hard to structure and transform their data into appropriate formats for linked data discovery and sharing, and as part of this, many have produced RDF triple (or quadruple) stores accessible via SPARQL endpoints – usually a web interface where anyone can type and run a SPARQL query in order to retrieve some of that rich linked data goodness.

This is admirable, but I have to admit to having had little success getting something out of SPARQL endpoints that I would consider useful. Every time I try to use a SPARQL facility I find I do better by scraping data from search results in the main interface. I have also increasingly become aware that I am not the only one to find it difficult.

RDF stores are different to relational databases; they are not so amenable to searching over the values of a particular field, nor are they as flexible as full-text search engines like Solr. Instead they record facts relating entities to other entities. So it is important that, as consumers of the data, we know what kinds of questions make sense and how to ask them in a way that yields useful results without straining the SPARQL endpoint unduly. If these are not the kinds of questions we want to ask, then we might need to question the application of SPARQL as the de facto way of accessing RDF triple stores.
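
To make that concrete, here is a minimal sketch of the kind of question a triple store is built to answer. The vocabulary (ex:createdBy, ex:dateOfCreation) and the Shakespeare URI are entirely hypothetical; at a real endpoint you would need to know the actual ontology and identifiers in use, which is precisely the difficulty discussed below.

    PREFIX ex: <http://example.org/vocab/>

    # Facts relating entities to entities: works created by a known person
    SELECT ?work ?date
    WHERE {
      ?work ex:createdBy <http://example.org/id/person/shakespeare> .
      # Optionally pull out a literal-valued fact about each work
      OPTIONAL { ?work ex:dateOfCreation ?date }
    }
    LIMIT 50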

I’d like to point out that my aim here is not to complain or to disparage SPARQL in general or anybody’s data in particular; I think it is fantastic that so many institutions with large archives are making efforts to open up their data in ways that are considered best practice for the web, and for good reasons. However, if SPARQL endpoints turn out to be flawed or inadequately realised, they will not get used, and both the opportunity to use the data and the work to produce it will be wasted.

Problems with SPARQL endpoints

These are the problems I have commonly experienced:

  • No documentation of available vocabularies.
  • No example queries.
  • No access to unique identifiers so we can search for something specific.
  • Slowness and timeouts due to writing inefficient queries (usually without using unique ids or URIs).
  • Limits on the number of records which can be returned (due to performance limits).

Paraphrasing Juliette Culver’s list of SPARQL Stumbling Blocks on the Pelagios blog, here are some of the problems she experienced:

  • No query examples for the given endpoint.
  • No summary of the data or the ontologies used to represent it.
  • Limited results or query timeouts.
  • SPARQL endpoints are not optimised for full-text searching or keyword search.
  • No link from records in the main interface to the RDF/JSON for the record. (This is mentioned in relation to the British Museum, who provide a very useful search interface to their collection, but don’t appear to link it to the structured data formats available through their SPARQL endpoint.)

Clearly we have experienced similar issues. Note that some of these are due to the nature of RDF and SPARQL, and require a reconception of how to find information. Others are instances of unhelpful presentation; SPARQL endpoints are generally pretty opaque, but this can be alleviated by providing more documentation. With the amount of work it takes to prepare the data, I am surprised by how few providers accompany their endpoints with a clear list of the ontologies they use to represent their data, and at least a handful of example queries. This takes a few minutes but is invaluable to anybody attempting to use the service.
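
In the absence of documentation, one fallback is to ask the endpoint itself which predicates it uses. The query below is a common exploratory idiom (assuming the endpoint supports SPARQL 1.1 aggregates), though counting over the whole store may itself be slow or hit a result limit, which rather proves the point.

    # Survey the predicates (and hence the ontologies) actually used in the store
    SELECT ?predicate (COUNT(*) AS ?uses)
    WHERE { ?s ?predicate ?o }
    GROUP BY ?predicate
    ORDER BY DESC(?uses)
    LIMIT 50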

Nature provides the best example I have seen of a SPARQL endpoint, with a rich set of meaningful example queries. Note also the use of AJAX to give (minimal) feedback while a query is running, and to keep the query visible on the results page.

Confusion about Linked Data

A blog post by Andrew Beeken of the JISC CLOCK project reports dissatisfaction with SPARQL endpoints and linked data, and provoked responses from other users of linked data:

“What is simple in SQL is complex in SPARQL (or at least what I wanted to do was) … You see an announcement about Linked Data and don’t know whether to expect a SPARQL endpoint, or lots of embedded RDF.” Chris Keene

“SPARQL seems most useful for our use context as a tool to describe an entity rather than as a means of discovery.” Ed Chamberlain

Chris’ point gives another perspective on linked data in general – what does it mean to provide linked (or should that be linkable?) data, and how do we use it? Embedded RDF (RDFa) is good in that it tends to provide structured data in context, enriching a term in a webpage in a way that is invisible by default but that people can consume if they choose to. Ed’s comment points to a basic fact about RDF as a data storage medium: it is a method of representing facts about entities which are named in an esoteric way; it is not structured in a way that is ideal for the freer keyword searching or SQL-style queries that we are used to.
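
Ed’s observation can be made concrete. Given a URI (a hypothetical one below), a DESCRIBE query hands back whatever triples the store holds about that entity – which is exactly description rather than discovery; it is no help at all if all you have is the string “William Shakespeare”.

    # Describe a known entity; of no use if all we know is a name string
    DESCRIBE <http://example.org/id/person/shakespeare>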

Owen Wilson suggests section 6.3 of the Linked Data Book, which describes three architectural patterns for consuming linked data. It looks worth a read to get one thinking about linked data in the right way.

Unique identifiers

“My native English, now I must forego” Richard II, Act 1, Scene 3

One of the tenets of linked data is that each object has a unique identifier. If we are looking for “William Shakespeare” we must use the URI or other identifier that represents him in the given scheme, rather than the string “William Shakespeare”. It is thus also necessary that we have an easy way to discover the unique identifiers used in the data, so that we can ask questions about a specific entity without forming a fuzzy, complex and resource-consuming query.

The British Museum publicises its controlled terms, that is, the limited vocabulary it uses in describing its collection, along with authority files, which provide the canonical versions of terms and names (standardised in terms of spelling, capitalisation and so on), and thesauri, which map synonymous terms to their canonical equivalents. These terms are used in describing object types, place names and so on, supporting consistency in the collections data. They are all available via the page British Museum controlled terms and the BM object names thesaurus. Armed with knowledge of what words are used in particular fields to categorise or describe entities in the data, and similarly with a list of ids or canonical names for things, we can then start to form structured queries that will yield results.
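
To illustrate why identifiers matter, here are two separate sketch queries (with a hypothetical URI). The first trawls every literal in the store for a string match – slow, at the mercy of spelling and capitalisation, and exactly the kind of query that causes timeouts. The second asks about a known identifier and is cheap and unambiguous.

    # Fuzzy: scan all literals for a string match (expensive)
    SELECT DISTINCT ?s
    WHERE {
      ?s ?p ?o .
      FILTER(isLiteral(?o) && REGEX(STR(?o), "shakespeare", "i"))
    }
    LIMIT 10

    # Precise: ask about a known identifier (cheap)
    SELECT ?p ?o
    WHERE {
      <http://example.org/id/person/shakespeare> ?p ?o .
    }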

Shakespeare and British Museum

I have looked in particular at the British Museum’s SPARQL endpoint as an example, because the BM is a project partner and because it holds several items germane to Will’s World. To start with, the endpoint gives some context: a basic template query is included in the search box, which can be run immediately and which implicitly documents all the relevant ontologies by pulling them in to define namespaces. There is a Help link which gives some idea of how data is represented and can be accessed and referenced using URIs. All of this is good and I found it easy to get started with the endpoint.

However, before long I came up against the problem I’ve had with other endpoints, namely that it is difficult to perform a keyword search, or at least a multi-stage search in order to (a) resolve a unique identifier for a keyword or named thing and then (b) retrieve information about or related to that thing. In this case I found a way to achieve what I needed by supplementing my use of the SPARQL endpoint with keyword searches of the excellent Collection database search – and, with some help from the technical staff at the BM to resolve a couple of mistakes in my queries, I can now harvest metadata about objects related to the person “William Shakespeare”.

It is reassuring to find out I am not alone in having difficulty retrieving and using SPARQL data. I followed Owen Stephens’ blog post about the British Museum’s endpoint with interest. Owen found the CIDOC CRM data model hard to query due to its (rich, but thereby counter-intuitive) multi-level structure. Additionally, he encountered the common issue that it is very difficult to perform a search for data containing or “related to” a particular entity which, to start with, is represented merely by a string literal such as “William Shakespeare”:

The difficulty of exploring the British Museum data from a simple textual string became a real frustration as I explored the data – it made me realise that while the Linked Data/RDF concept of using URIs and not literals is something I understand and agree with, as people all we know is textual strings that describe things, so to make the data more immediately usable, supporting textual searches (e.g. via a solr index over the literals in the data) might be a good idea.

Admittedly RDF representations and SPARQL are not really intended to provide a “search interface” in the sense to which most users are accustomed. But from the user’s perspective, there must be an easy way to start identifying objects about which we want to ask questions, and this tends to start with performing some kind of keyword search. It is then necessary to identify the ids representing the resulting objects or records which are of interest. With the BM data this involves mapping a database id, which can be retrieved from the object URL, to the internal id used in the collections.

So what are the right questions?

Structured data requires a structured query – fair enough. However, what sort of useful or meaningful query can we formulate when the data, the schema used to represent it, and the identifiers used within it are all specified internally? In order to construct an access point into the data, it is helpful to have not just a common language but a common (or at least public) identifier scheme: canonical ways of referencing the entities in the data, such as “Shakespeare” or “the Rosetta Stone”. Without knowing the appropriate URI or the exact textual form (is it “Rosetta Stone”, “The Rosetta Stone”, “the Rosetta stone”? would we get more results for “Rosetta”?) it is nigh on impossible to ask a SPARQL endpoint questions about the entity.
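
Without a published identifier, the best one can do is enumerate guesses at the textual form – and even then an exact match can be defeated by language tags, punctuation or case. A sketch (using SPARQL 1.1 VALUES, and assuming, perhaps wrongly, that rdfs:label is the labelling property in use):

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # Guessing at the exact form of a label, for want of a canonical identifier
    SELECT ?thing ?label
    WHERE {
      VALUES ?label { "Rosetta Stone" "The Rosetta Stone" "the Rosetta stone" }
      ?thing rdfs:label ?label .
    }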

So how is one supposed to use a SPARQL endpoint? It is not a good medium for asking general questions or performing wide-ranging searches of the data. Instead it seems like a good way to link up records from different informational silos (BM, BL, NLS, RSC…) that share a common identifier scheme. If we know the canonical name of a work (“Macbeth”) or the ISBN of a particular edition, then we can start to link up these disparate sources of data on such entities.
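
For instance, if two catalogues both record the ISBN of an edition, a SPARQL 1.1 federated query can join them on that shared identifier. Everything here is hypothetical – the endpoint URLs, the dummy ISBN, and the assumption that both use the bibo:isbn property – but it shows the sort of linking that a common identifier scheme makes possible.

    PREFIX bibo: <http://purl.org/ontology/bibo/>

    # Join two hypothetical catalogues on a shared identifier (a dummy ISBN)
    SELECT ?recordA ?recordB
    WHERE {
      SERVICE <http://example.org/libraryA/sparql> {
        ?recordA bibo:isbn "9781234567890" .
      }
      SERVICE <http://example.org/libraryB/sparql> {
        ?recordB bibo:isbn "9781234567890" .
      }
    }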

But the variety of translations, the plurality of editions (which will only increase) and other degrees of freedom make it hard to perform an exhaustive analysis or use of the data. In the case of the BM, which may hold unique objects we don’t yet know we want to see, the way to find them is through keyword search. It seems that only by going first through a search interface or other secondary resource can we identify the items we want to know about and how to refer to them.

What we have in common between different sources is the language or ontologies used to describe the schema (foaf, dc, etc) – but this is syntax rather than semantics; structure rather than content. To echo Ed Chamberlain’s comment, we have access to how data is described, but not so much to the data itself.
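
A sketch of what those shared vocabularies buy us: knowing that foaf is in use lets us ask for every person and their name, but it does not tell us which of those name strings – or which ?person URI – denotes the Shakespeare we are actually after.

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    # The shared vocabulary tells us how people are described...
    SELECT ?person ?name
    WHERE {
      ?person a foaf:Person ;
              foaf:name ?name .
    }
    LIMIT 100
    # ...but not which ?person (or which ?name string) is the one we want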

 

British Museum data

The approach we will use to harvest British Museum metadata related to Shakespeare is outlined below. It is essentially the same approach that Owen Stephens found workable in his post on SPARQL, and involves reference to a secondary authority (the BM collection search interface) to establish identifiers.

  1. Conduct a search for “Shakespeare” in the collections interface.
  2. Extract an object id from each result. The Rosetta Stone has the id 117631.
  3. Find the corresponding collection id from SPARQL with this query:
    SELECT * WHERE { 
       ?s <http://www.w3.org/2002/07/owl#sameAs> 
          <http://collection.britishmuseum.org/id/codex/117631> 
    }
  4. The result should be a link to metadata describing the object, and the object’s collection id (in this case YCA62958) can be extracted for use in further searches.
    http://collection.britishmuseum.org/id/object/YCA62958
  5. If there is a result, retrieve metadata about the object from the URL: http://collection.britishmuseum.org/description/object/YCA62958.rdf (or .html, .json, .xml)
  6. If there is no result, scrape metadata from the object’s description page in the collections interface. There is plenty of metadata available, but it is far less structured than RDF, being distributed through HTML.

This last step looks like it will be quite common as many of the Shakespeare-related results are portraits or book frontispieces which have no collection id. I am not sure whether this is an omission, or because they are part of another object, in which case it will require further querying to resolve the source object (if that is what we want to describe).

Another difficulty is that although Owen found a person-institution URI for Mozart, I cannot find one for Shakespeare. There is a rudimentary biography but little else, so we do not have a “Shakespeare” identifier for use in SPARQL searches.

Conclusion

Ultimately I am still finding it non-trivial and a bit hacky to identify, and ask questions about, the real Shakespeare through a SPARQL endpoint.

In summary:

  • SPARQL endpoint providers could provide more documentation and examples.
  • RDF stores allow us to ask structural questions, but semantic questions are much harder without knowing some URIs.
  • It is often necessary to make use of a secondary resource or authority in order to identify the entities we wish to ask about.

6 Responses to “SPARQL – What’s up with that?”

  1. The British Museum EndPoint is a beta release intended to generate feedback and provide information for a planned production version. As a beta it is running on limited resources, and there are a number of issues and problems, some of which are identified in the blog post above, but many of which are not.

    However, we have received a lot of feedback and are close to providing a production version with few technical limitations. The schema will also be very different from the one on the beta, and is designed to help with UI representation. We will also be releasing a cookbook so that we can transfer our knowledge to the community for more rapid development against the service.

    An intermediate version of the new schema was released at http://annotate.oldman.me.uk/semantic/embed.html (happy to provide a PDF), but this only provides a snapshot of the evolution of the schema (perhaps helpful for some of the issues identified by Owen Stephens), and the final diagram will be released with the production system (or before). A full cookbook will follow.

    An indication of the uses to which we intend to put the EndPoint schema can be found at http://www.researchspace.org/project-updates/developmentscreenshotsandlatestdesigns, which has both screenshots and designs we are currently implementing.

    But ultimately I agree that documentation is everything for an implementation of this nature.

  2. Owen says:

    Excellent post – and glad my own experience with the British Museum data was of interest. You mention the BM object names thesaurus – I’m not sure if you saw I wrote a scraper to match terms from the object thesaurus to the British Museum URIs – this is available at https://scraperwiki.com/scrapers/british_museum_object_thesaurus/ if it’s of any help.

    You also noted you couldn’t find a URI for Shakespeare – I think it is http://collection.britishmuseum.org/id/person-institution/46005 – a general SPARQL query on this gives about 400 triples – http://bit.ly/Ppzj0x. Hope this helps.

  3. Owen says:

    Thanks for this update Dominic – really pleased that this is moving forward :)

  4. neil.mayo says:

    Thanks Dominic, I think the beta endpoint is an excellent start and more usable than others I’ve seen. I look forward to the production version, which sounds promising. ‘Cookbook’ style documentation is a great idea. The data annotation tool looks cool too – I am sure this will simplify further annotation of British Museum holdings.

    It’s interesting to see the development of the endpoint through beta and feedback, as it seems that the producers and consumers of linked data are still not always sure how to approach one another from opposite sides of the data!

  5. neil.mayo says:

    Owen, thanks again for your comments – I had looked at your scraper but forgot to reference it. I’ll try and share any useful scraping tools that get written during the project via scraperwiki.

    Thanks also for the Shakespeare URI; not sure why I couldn’t find this but it will certainly prove useful!

  6. I’ve recently found how useful CONSTRUCT is in the context of SPARQL queries. For example:

    CONSTRUCT WHERE
    {
      # The predicate and object of this first pattern were lost from the original
      # comment; reconstructed here as a link from ?s to the Shakespeare URI
      # mentioned in the comment above
      ?s ?anchor <http://collection.britishmuseum.org/id/person-institution/46005> .
      ?s ?p1 ?o1 .
      ?o1 ?p2 ?o2
    }
    LIMIT 2000

    gives me usefully structured RDF to two levels about objects which depict Shakespeare. You start to see entries from the BM thesauri and lots of usefully connected information.

    The problem with the standard SELECT operation in SPARQL is that you get this stupid non-standard XML response format which is neither use nor ornament.

    Richard

