One of the challenges in aggregating metadata is unifying the different schemata of the data sources so that the data can be searched through a single common schema. There are advantages to maintaining the original schema (field names) of each data source, so that it can be searched in the same way as in its original context (i.e. under the same field or fields). However, this complicates searches of the aggregated metadata when each data source represents similar information under a differently-named field, because a search for values across a range of essentially synonymous but differently-named fields would be more complex to specify. It therefore makes sense to map similar fields onto a single field, within a schema specific to the collection of aggregated data.
This post is about creating a schema for Will’s World and mapping other metadata into it. Despite the technical detail, it may illuminate some common stumbling blocks encountered when mapping disparate information to a single representation.
We had to develop a schema for Will’s World (WW) to represent the metadata which is being imported into the text database. This schema has several requirements:
Once the schema is designed, it is necessary to define how the metadata we collect from disparate sources will fit into the (probably more restrictive) schema that we have elected to use in our representations. This involves establishing a clear mapping from the source schemata to the target schema. Additionally, we need more than one schema for Will’s World:
This post concentrates on the metadata schema, which is the most important and elaborate. We have defined a list of fields with namespaced identifiers, and for each field we define several properties:
For the purposes of supporting text search, we need to answer a couple of questions for each field:
The schema for representing metadata about services is much simpler and records identifying metadata such as id, name, description, URL, provenance and usage/licensing information, and more specific metadata about how to use the service.
The next stage is to collect, transform and ingest the data into Solr. This is a multi-stage process:
Due to size and memory restrictions, any of the later stages, particularly the transformations and ingest, may involve splitting the documents into multiple (well-formed) documents.
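As an illustration only (not the project's actual tooling), splitting an oversized Solr import document into smaller, still well-formed batches might be sketched like this, assuming the standard Solr XML update format of <doc> elements inside an <add> element:

```python
import xml.etree.ElementTree as ET

def split_add_docs(add_xml, batch_size):
    """Split a large Solr <add> document into several smaller,
    well-formed <add> documents, each holding at most batch_size
    <doc> elements, so each can be ingested separately."""
    root = ET.fromstring(add_xml)
    docs = list(root)  # the child <doc> elements
    batches = []
    for i in range(0, len(docs), batch_size):
        batch = ET.Element("add")
        batch.extend(docs[i:i + batch_size])
        batches.append(ET.tostring(batch, encoding="unicode"))
    return batches
```

Each returned string parses as an independent Solr update document, so memory use during ingest is bounded by the batch size rather than the full export.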
We have to take a variety of approaches to collecting the data, depending on what forms it is available in. Only a relatively small proportion of providers actually make their data available in easy-to-collect structured formats; in other cases the data is present directly in HTML pages, or only as the result of searches through an HTML web interface.
Depending on the source of the data, it may be processed via scripts which construct structured data from unstructured or semi-structured data through a process of parsing and recombination, or by an XSLT script which performs a simpler and more direct mapping of existing fields to the fields of the schema. The latter may involve many-to-one mappings in addition to direct one-to-one mappings. One-to-many mappings should, however, be avoided: they would produce data duplication and would likely indicate that the schema was not well defined.
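In script form, a many-to-one mapping can be as simple as a lookup table from source field names to target schema fields. The sketch below is illustrative, not the project's actual code; the source field names echo the National Library of Scotland example discussed later, and the concatenation policy for values landing on the same target field is an assumption:

```python
# Hypothetical many-to-one field mapping: several source fields
# feed the same target field. Names are illustrative.
FIELD_MAP = {
    "summary": "dc:description",
    "commentary": "dc:description",
    "maintitle": "dc:title",
}

def map_record(source):
    """Map a flat source record (a dict) onto target schema fields,
    concatenating values that land on the same target field
    (many-to-one) and dropping unmapped source fields."""
    target = {}
    for src_field, value in source.items():
        tgt = FIELD_MAP.get(src_field)
        if tgt is None:
            continue  # source field has no place in the target schema
        target.setdefault(tgt, []).append(value)
    return {field: " ".join(values) for field, values in target.items()}
```

Because each source field maps to at most one target field, the table itself enforces the "no one-to-many mappings" rule.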
In designing a metadata schema which is intended to unify metadata from sources with different concerns, we inevitably encounter certain issues which demand to be resolved in a consistent way.
Naturally we need an identifying field so that every record can be distinguished and uniquely identified. We have two options for this – generate unique id fields (the easiest way would be to ask Solr to impose them), or map external values (from each source’s data) into an id field. The first option requires us to maintain unique identifiers upon updates, when we re-index data. The second option runs the risk of clashes. Identifying tuples or values sourced externally will vary a great deal and there is always the danger of non-uniqueness unless we are very careful about how we define them. It might be best to let Solr create a unique field value for indexing, and record an alternative ‘primary’ identifier from the identifying metadata of each particular data source. This might be considered a unique identifier within the records of the particular source.
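If we do take the second option and derive identifiers from each source's own metadata, one way to reduce the risk of clashes is to prefix the source's 'primary' identifier with a source name, falling back to a content hash when the source supplies no usable identifier. This is a minimal sketch under those assumptions, not the convention actually adopted for WW:

```python
import hashlib

def record_id(source_name, local_id=None, record_text=""):
    """Build a collection-wide identifier by prefixing the source's
    own ('primary') identifier with a source name. If the source has
    no usable identifier, fall back to a hash of the record content,
    which is at least deterministic across re-indexing runs."""
    if local_id:
        return "%s:%s" % (source_name, local_id)
    digest = hashlib.sha1(record_text.encode("utf-8")).hexdigest()[:12]
    return "%s:%s" % (source_name, digest)
```

The prefix guarantees uniqueness across sources as long as each source's own identifiers are unique within that source; the hash fallback trades that guarantee for determinism.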
If the anticipated metadata informing a required field are not present in the source data, a default or “unknown” value must be given. This can be achieved by explicitly producing a default value from a script, or by providing a template in an XSLT transform for each required field, which matches when the source field is not present in a record. Here is an example enforcing the presence of a required dc:title field for a Solr import doc:
<xsl:if test="not(dc:title)">
  <field name="dc:title">Unknown</field>
</xsl:if>
What should we do when the source data that we want to insert as a field value incorporates tag structure of its own? We would like to maintain XML substructure where available as part of the value of the field, rather than discarding it. However the Solr import schema does not accept substructure to its “field” elements, so we must enclose them in CDATA tags. This is not easy using XSLT – unless we are willing to write templates to match each of the elements we want to use as subelements in the defined fields of the schema, and encode them appropriately. The solution is to pass the node which contains substructure to a target which wraps it in a CDATA section in the output like this:
<!-- Copy the current element and subelements to the output in a CDATA section -->
<xsl:template name="xmlsection">
  <!-- Use the current node by default -->
  <xsl:param name="node" select="."/>
  <!-- Emit a literal "<![CDATA[" opening marker -->
  <xsl:text disable-output-escaping="yes"><![CDATA[<![CDATA[]]></xsl:text>
  <xsl:copy-of select="$node"/>
  <!-- Emit a literal "]]>" closing marker, split across two CDATA sections -->
  <xsl:text disable-output-escaping="yes"><![CDATA[]]]]><![CDATA[>]]></xsl:text>
</xsl:template>
As indicated earlier, there are several cases where we wish to map many source fields from a provider into one Will’s World field as subelements. An example of matching multiple fields at the same level which will contribute to the content of a single field is in National Library of Scotland data, where we want to match and combine the XML elements “summary”, “commentarytitle” and “commentary” into a single “dc:description” field. There are a few options for performing this aggregation; if we know that one of the source fields (say, “summary”) will always be available, we can match on it and then pull in values from the other fields as necessary using xsl:copy-of with a relative path in the select attribute. However, if the “summary” field is missing, the template will not match and we will lose the values of the other fields.
If we cannot be sure that any one of these fields will be available, that is, they are all optional, then we need to match every possible combination. Doing this via a number of similar templates is unnecessarily verbose and will also lead to duplicate output fields, as every one of the fields which is present will match and cause the output field to be written. The solution is to provide an OR-ed list of the possible combinations of these elements, so it will only match once and produce one instance of the output field. In the following simplified example, we want to add elements f1 and f2, which occur at the same level, and either of which may or may not be present, to a single output element – and only do it once. We therefore need to check for the presence of both elements, or of either element without the other:
<!-- Match f1 and/or f2 -->
<xsl:template match="f1[../f2]|f1[not(../f2)]|f2[not(../f1)]">
  <combined-field>
    <xsl:copy-of select="../f1" />
    <xsl:copy-of select="../f2" />
  </combined-field>
</xsl:template>
Note that the XPath expression will become increasingly complicated if we are combining more than two potentially absent elements. In such a situation, it would be desirable to find other ways to do this, to avoid a combinatorial explosion of match expressions.
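One such alternative is to step outside XSLT: a small script can visit the parent record exactly once and combine whichever optional children happen to be present, so no combinatorial match expression is needed. A minimal sketch, using the NLS field names from the example above and a space-joining policy that is my assumption rather than the project's:

```python
import xml.etree.ElementTree as ET

# Optional source elements to be combined, per the NLS example.
OPTIONAL_FIELDS = ["summary", "commentarytitle", "commentary"]

def combine_optional(record):
    """Visit the parent record once and combine whichever of the
    optional child elements are present into one output field.
    Returns None if none of them is present."""
    parts = [record.findtext(name) for name in OPTIONAL_FIELDS]
    parts = [p for p in parts if p]
    if not parts:
        return None
    combined = ET.Element("field", {"name": "dc:description"})
    combined.text = " ".join(parts)
    return combined
```

Because the parent is processed exactly once, the output field is emitted at most once regardless of which subset of children exists, which is precisely what the OR-ed XPath expression had to engineer.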
We scrape data at the granularity of a page. Sometimes this yields metadata at different levels. For example, the BBC’s “Shakespeare’s Restless World” is a radio programme produced in conjunction with the British Museum, exploring an event or period in history through a particular object from the museum’s collections. The pages describing each show tend to contain metadata about the show, and also metadata about the object around which the show is centred. These can either be mapped to a description of a single entity representing “the programme about the object”, or the metadata can be recorded independently for both the programme and the object.
In this case we conceptualise the object as the resource being described; metadata about the programme (title, description, URL to a podcast) will be recorded in a record describing the programme, while supplementary detail (background/object facts, quotes) will be incorporated into the text description (using appropriate tags) and also distributed where possible into a record that describes the object. A page listing objects by play is scraped separately in order to supply associated plays to each programme/object.
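The splitting step can be sketched as a function from one scraped page's metadata to two linked records. Everything here is illustrative: the input keys are what a scraper might plausibly extract, and the output field names stand in for (rather than reproduce) the WW schema:

```python
def split_page_metadata(page):
    """From one scraped page's metadata (a dict), produce two linked
    records: one describing the programme, one describing the object
    it centres on. Field and key names are illustrative only."""
    programme = {
        "dc:title": page.get("title"),
        "dc:description": page.get("description"),
        "dc:identifier": page.get("podcast_url"),
    }
    obj = {
        "dc:title": page.get("object_name"),
        "dc:description": page.get("object_facts"),
        "dc:relation": page.get("podcast_url"),  # link back to the programme
    }
    return programme, obj
```

The shared URL value is what ties the object record back to the programme record after they are indexed separately.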
Aggregating a wide range of metadata by mapping its values to the fields of a single common target schema renders it searchable in a unified interface. The mapping effort does however throw up a number of issues, some of which have been described in this post along with potential solutions. In particular:
I found that developing the schema was most easily done in a spreadsheet or table format, so I wrote scripts to transform that specification (at least partially) into (a) an XML schema and (b) an HTML layout of the same. To generate the schema, we emit elements with minOccurs and maxOccurs attributes: if the field is required, minOccurs is 1, otherwise 0; if the field is repeatable, maxOccurs is “unbounded”, otherwise 1. Here is the script for generating a schema file from a CSV description of the WW fields.
Disclaimer: this script is quick and dirty, non-validating, and WW-specific. Use it at your own risk; or better, adapt it!
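For illustration only (the actual WW script may differ), a minimal sketch of such a CSV-to-schema conversion, assuming the CSV has columns name, required and repeatable with yes/no values:

```python
import csv
import io

def csv_to_xsd_elements(csv_text):
    """Turn a CSV field specification (assumed columns: name,
    required, repeatable) into xs:element declarations following
    the rules above: required -> minOccurs="1" else "0";
    repeatable -> maxOccurs="unbounded" else "1"."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        min_occurs = "1" if row["required"].strip().lower() == "yes" else "0"
        max_occurs = ("unbounded"
                      if row["repeatable"].strip().lower() == "yes" else "1")
        lines.append(
            '<xs:element name="%s" type="xs:string" '
            'minOccurs="%s" maxOccurs="%s"/>'
            % (row["name"], min_occurs, max_occurs)
        )
    return "\n".join(lines)
```

A real version would need to wrap these declarations in a complete xs:schema document and handle field types other than xs:string.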