Loading the British National Bibliography into an RDF Database

This is the second in a series of posts (1, 2, 3, 4) providing background and tutorial material about the British National Bibliography. The tutorials were written as part of some freelance work I did for the British Library at the end of 2012. The material was used as input to creating the new documentation for their Linked Data platform but hasn’t been otherwise published. They are now published here with permission of the BL.

Note: while I’ve attempted to fix up these instructions to account with changes to the software and how the data is published, there may still be some errors. If there are then please leave a comment or drop me an email and I’ll endeavour to fix.

The British National Bibliography (BNB) is a bibliographic database that contains data on a wide range of books and serial publications published in the UK and Ireland since the 1950s. The database is published under a public domain license and is available for access online or as a bulk download.

This tutorial provides developers with guidance on how to download the BNB data and load it into an RDF database, or “triple store” for local processing. The tutorial covers:

  • An overview of the different formats available
  • How to download the BNB data
  • Instructions for loading the data into two different open source triple stores

The instructions given in this tutorial are for users of Ubuntu. Where necessary pointers to instructions for other operating systems are provided. It is assumed that the reader is confident in downloading and installing software packages and working with the command-line.

Bulk Access to the BNB

While the BNB is available for online access as Linked Data and via a SPARQL endpoint there are a number of reasons why working with the dataset locally might be useful, e.g:

  • Analysis of the data might require custom indexing or processing
  • Using a local triple store might offer more performance or functionality
  • Re-publishing the dataset as part of aggregating data from a number of data providers
  • The full dataset provides additional data which is not included in the Linked Data.

To support these and other use cases the BNB is available for bulk download, allowing developers the flexibilty to process the data in a variety of ways.

The BNB is actually available in two different packages. Both provide exports of the data in RDF but differ in both the file formats used and the structure of the data.

BNB Basic

The BNB Basic dataset is provided as an export in RDF/XML format. The individual files are available for download from the BL website.

This version provides the most basic export of the BNB data. Each record is mapped to a simple RDF/XML description that uses terms from several schemas including Dublin Core, SKOS, and Bibliographic Ontology.

As its provides a fairly raw version of the data, BNB Basic is likely to be most useful when the data is going to undergo further local conversion or analysis.

Linked Open BNB

The Linked Open BNB offers a much more structured view of the BNB data.

This version of the BNB has been modelled according to Linked Data principles:

  • Every resource, e.g. author, book, category, has been given a unique URI
  • Data has been modelled using a wider range of standard vocabularies, including the Bibliographic Ontology, Event Ontology and FOAF.
  • Where possible the data has been linked to other datasets, including LCSH and Geonames

It is this version of the data that is used to provide both the SPARQL endpoint and the Linked Data views, e.g. of The Hobbit.

This package provides the best option for mirroring or aggregating the BNB data because its contents matches that of the online versions. The additional structure to the dataset may also make it easier to work with in some cases. For example lists of unique authors or locations can be easily extracted from the data.

Downloading The Data

Both the BNB Basic and Linked Open BNB are available for download from the BL website

Each dataset is split over multiple zipped files. The BNB Basic is published in RDF/XML format while the Linked Open BNB is published as ntriples. The individual data files can be downloaded from CKAN although this can be time consuming to do manually.

The rest of this tutorial will assume that the packages have been downloaded to ~data/bl

Unpacking the files is a simple matter of unzipping them:

cd ~/data/bl
unzip *.zip
#Remove original zip files
rm *.zip

The rest of this tutorial provides guidance on how to load and index the BNB data in two different open source triple stores.

Using the BNB with Fuseki

Apache Jena is an Open Source project that provides access to a number of tools and Java libraries for working with RDF data. One component of the project is the Fuseki SPARQL server.

Fuseki provides support for indexing and querying RDF data using the SPARQL protocol and query language.

The Fuseki documentation provides a full guide for installing and administering a local Fuseki server. The following sections provide a short tutorial on using Fuseki to work with the BNB data.

Installation

Firstly, if Java is not already installed then download the correct version for your operating system.

Once Java has been installed, download the latest binary distribution of Fuseki. At the time of writing this is Jena Fuseki 1.1.0.

The steps to download and unzip Fuseki are as follows:

#Make directory
mkdir -p ~/tools
cd ~/tools

#Download latest version using wget (or manually download)
wget http://www.apache.org/dist/jena/binaries/jena-fuseki-1.1.0-distribution.zip

#Unzip
unzip jena-fuseki-1.1.0-distribution.zip

Change the download URL and local path as required. Then ensure that the fuseki-server script is executable:

cd jena-fuseki-1.1.0
chmod +x fuseki-server

To test whether Fuseki is installed correctly, run the following (on Windows systems use fuseki-server.bat):

./fuseki-server --mem /ds

This will start Fuseki with a empty read-only in-memory database. Visiting http://localhost:3030/ in your browser should show the basic Fuseki server page. Use Ctrl-C to shutdown the server once the installation test is completed.

Loading the BNB Data into Fuseki

While Fuseki provides an API for loading RDF data into a running instance, for bulk loading it is more efficient to index the data separately. The manually created indexes can then be deployed by a Fuseki instance.

Fuseki is bundled with the TDB triple store. The TDB data loader can be run as follows:

java -cp fuseki-server.jar tdb.tdbloader --loc /path/to/indexes file.nt

This command would create TDB indexes in the /path/to/indexes directory and load the file.nt into it.

To index all of the Linked Open BNB run the following command, adjusting paths as required:

java -Xms1024M -cp fuseki-server.jar tdb.tdbloader --loc ~/data/indexes/bluk-bnb ~/data/bl/BNB*

This will process each of the data files and may take several hours to complete depending on the hardware being used.

Once the loader has completed the final step is to generate a statistics file for the TDB optimiser. Without this file SPARQL queries will be very slow. The file should be generated into a temporary location and then copied into the index directory:

java -Xms1024M -cp fuseki-server.jar tdb.stats --loc ~/data/indexes/bluk-bnb >/tmp/stats.opt
mv /tmp/stats.opt ~/data/indexes/bluk-bnb

Running Fuseki

Once the data load has completed Fuseki can be started and instructed to use the indexes as follows:

./fuseki-server --loc ~/data/indexes/bluk-bnb /bluk-bnb

The --loc parameter instructs Fuseki to use the TDB indexes from a specific directory. The second parameter tells Fuseki where to mount the index in the web application. Using a mount point of /bluk-bnb the SPARQL endpoint for the dataset would then be found at:

http://localhost:3030/bluk-bnb/query

To select the dataset and work with it in the admin interface visit the Fuseki control panel:

http://localhost:3030/control-panel.tpl

Fuseki has a basic SPARQL interface for testing out SPARQL queries, e.g. the following will return 10 triples from the data:

SELECT ?s ?p ?o WHERE {
  ?s ?p ?o
}

For more information on using and administering the server read the Fuseki documentation.

Using the BNB with 4Store

Like Fuseki, 4Store is an Open Source project that provides a SPARQL based server for managing RDF data. 4Store is written in C and has been proven to scale to very large datasets across multiple systems. It offers a similar level of SPARQL support as Fuseki so is good alternative for working with RDF in a production setting.

As the 4Store download page explains, the project has been packaged for a number of different operating systems.

Installation

As 4Store is available as an Ubuntu package installation is quite simple:

sudo apt-get install 4store

This will install a number of command-line tools for working with the 4Store server. 4Store works differently to Fuseki in that there are separate server processes for managing the data and serving the SPARQL interface.

The following command will create a 4Store database called bluk_bnb:

#ensure /var/lib/4store exists
sudo mkdir -p /var/lib/4store

sudo 4s-backend-setup bluk_bnb

By default 4Store puts all of its indexes in /var/lib/4store. In order to have more control over where the indexes are kept it is currently necessary to build 4store manually. The build configuration can be altered to instruct 4Store to use an alternate location.

Once a database has been created, start a 4Store backend to manage it:

sudo 4s-backend bluk_bnb

This process must to be running before data can be imported, or queried from the database.

Once the database is running a SPARQL interface can then be started to provide access to its contents. The following command will start a SPARQL server on port 8000:

sudo 4s-httpd -p 8000 bluk_bnb

To check whether the server is running correctly visit:

http://localhost:8000/status/

It is not possible to run a bulk import into 4Store while the SPARQL process is running. So after confirming that 4Store is running successfully, kill the httpd process before continuing:

sudo pkill '^4s-httpd'

Loading the Data

4Store ships with a command-line tool for importing data called 4s-import. It can be used to perform bulk imports of data once the database process has been started.

To bulk import the Linked Open BNB, run the following command adjusting paths as necessary:

4s-import bluk_bnb --format ntriples ~/data/bl/BNB*

Once the import is complete, restart the SPARQL server:

sudo 4s-httpd -p 8000 bluk_bnb

Testing the Data Load

4Store offers a simple SPARQL form for submitting queries against a dataset. Assuming that the SPARQL server is running on port 8000 this can be found at:

http://localhost:8000/test/

Alternatively 4Store provides a command-line tool for submitting queries:

 4s-query bluk_bnb 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'

Summary

The BNB dataset is not just available for use as Linked Data or via a SPARQL endpoint. The underlying data can be downloaded for local analysis or indexing.

To support this type of usage the British Library have made available two versions of the BNB. A “basic” version that uses a simple record-oriented data model and the “Linked Open BNB” which offers a more structured dataset.

This tutorial has reviewed how to access both of these datasets and how to download and index the data using two different open source triple stores: Fuseki and 4Store.

The BNB data could also be processed in other ways, e.g. to load into a standard relational database or into a document store like CouchDB.

The basic version of the BNB offers a raw version of the data that supports this type of usage, while the richer Linked Data version supports a variety of aggregation and mirroring use cases.

Posted in Benefits of open data, Innovation, Posts from feeds Tagged with: , ,