SPARQL and Shine

Did you know that you can extract information held on Wikipedia as a structured dataset? DBpedia makes this possible. I wondered if I could write a query to extract details of disused stations in Scotland and map them.

About DBpedia

DBpedia is a project that aims to make Wikipedia content available in a structured way. It uses the Resource Description Framework (RDF) to represent the extracted information. The data can be accessed using SPARQL, an SQL-like query language for RDF. The Open Data Publishing Platform that is under development by the Scottish Government will use the same technology to make statistical data more accessible. So I’m very interested in finding out more about RDF and SPARQL.

Getting the data

DBpedia data can be accessed by writing a SPARQL query. A query to access disused stations in Edinburgh is below:

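PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT * WHERE {
# Required: the page is in the category and has a label
?subject <http://purl.org/dc/terms/subject>
<http://dbpedia.org/resource/Category:Disused_railway_stations_in_Edinburgh>.
?subject rdfs:label ?name.
# Optional: the station is still returned if any of these are missing
OPTIONAL {?subject geo:lat ?lat.}
OPTIONAL {?subject geo:long ?long.}
OPTIONAL {?subject <http://www.w3.org/ns/prov#wasDerivedFrom> ?link.}
OPTIONAL {?subject foaf:depiction ?image.}
OPTIONAL {?subject <http://dbpedia.org/ontology/district> ?district.}
OPTIONAL {?subject dbpedia-owl:openingYear ?openingYear.}
# Keep only the English-language label
FILTER (langMatches(lang(?name), "EN"))
}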

The query finds all Wikipedia pages in the category Disused railway stations in Edinburgh and binds each station’s label to a variable called “name”. If a required variable has no value for a station, that station is left out of the results altogether. So the other variables are specified as “optional”, which means the record is still returned when they can’t be found: latitude, longitude, link to the Wikipedia page, link to the image on the Wikipedia page, district (usually the local authority), and the year the station was opened. The filter keeps only the English-language version of each name. The query can be found and submitted here: Disused Edinburgh Stations SPARQL.
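
To make the required versus optional distinction concrete, here is a rough Python analogy. It is only a sketch of the idea, not how a SPARQL engine actually evaluates a query. The function name `match` and the stations “Station A” and “Station B” are made-up placeholders.

```python
def match(stations, required, optional):
    """Keep stations that have every required field; absent optional ones become None."""
    rows = []
    for station in stations:
        if not all(key in station for key in required):
            continue  # a required pattern failed, so the whole record is dropped
        row = {key: station[key] for key in required}
        for key in optional:
            row[key] = station.get(key)  # None stands in for an unbound variable
        rows.append(row)
    return rows


# Placeholder data, not real DBpedia output.
stations = [
    {"name": "Station A", "openingYear": 1870},
    {"name": "Station B"},  # no opening year recorded
]

# Treating openingYear as required loses Station B ...
strict = match(stations, required=["name", "openingYear"], optional=[])
# ... while treating it as optional keeps both stations.
lenient = match(stations, required=["name"], optional=["openingYear"])

print(len(strict), len(lenient))
```

With `openingYear` required only one station survives. With it optional both are returned, and the second simply has no opening year.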

The query looks a bit complicated, but it’s a lot easier once you work out what all the variables refer to. You can find out which properties are available by inspecting the DBpedia page for a particular Wikipedia page. Here is the DBpedia page for Princes Street Station.

At the above link to the SPARQL query there are options to get the results as JSON or XML. I find it tricky to work with data in these formats, but there is a place where you can run queries and get the data as a CSV file: Virtuoso SPARQL query editor.
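
If you would rather convert the JSON yourself, a minimal Python sketch might look like the one below. It assumes the results follow the standard SPARQL JSON results layout (`head.vars` and `results.bindings`). It also assumes the public endpoint at https://dbpedia.org/sparql accepts `query` and `format` parameters, so check this before relying on it. The function names and the sample data are my own placeholders, and nothing here contacts the network.

```python
import csv
import io
import urllib.parse

# Assumption: the public DBpedia endpoint, taking "query" and "format" parameters.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_query_url(query, endpoint=DBPEDIA_ENDPOINT):
    """Return a GET URL asking the endpoint for results in SPARQL JSON format."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urllib.parse.urlencode(params)


def results_to_csv(results):
    """Flatten a SPARQL JSON results document into CSV text."""
    columns = results["head"]["vars"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for binding in results["results"]["bindings"]:
        # Unbound OPTIONAL variables are absent from a binding, so use "" for them.
        writer.writerow([binding.get(col, {}).get("value", "") for col in columns])
    return buffer.getvalue()


# Hand-written sample in the SPARQL JSON results shape (not real DBpedia output).
sample = {
    "head": {"vars": ["name", "lat", "long"]},
    "results": {"bindings": [
        {"name": {"type": "literal", "value": "Station A"},
         "lat": {"type": "literal", "value": "55.95"},
         "long": {"type": "literal", "value": "-3.2"}},
        {"name": {"type": "literal", "value": "Station B"}},
    ]},
}

csv_text = results_to_csv(sample)
print(csv_text)
```

To use it for real you would fetch `build_query_url(query)` with something like `urllib.request.urlopen`, parse the response with `json.load`, and pass the result to `results_to_csv`.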

My final query for all disused stations in Scotland and the resulting output can be found at: Disused Stations SPARQL.

Note about data quality from Wikipedia

I mentioned above that variables need to be specified as optional if you still want the record when their values are missing. This is necessary because Wikipedia is a work in progress. A lot of the stations don’t have details such as an opening year, some don’t have their own page (e.g. Hawick station appears under the Waverley line page), and some don’t even appear on Wikipedia (e.g. Leith Walk railway station). So my map will be incomplete, although I could manually add missing data if I could be bothered. This demonstrates the potential of using DBpedia to find where information is missing on Wikipedia, with the intention of filling in the gaps.
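
One quick way to see where the gaps are is to count the empty cells in each column of the CSV output. The Python sketch below does that. The column names match the variables in my query, but the function name `missing_counts` and the sample rows are invented for illustration.

```python
import csv
import io


def missing_counts(csv_text):
    """Return (number of rows, {column: number of empty cells}) for CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = reader.fieldnames or []
    counts = {field: 0 for field in fields}
    total = 0
    for row in reader:
        total += 1
        for field in fields:
            # Short rows yield None for trailing columns, so treat None as empty too.
            if not (row[field] or "").strip():
                counts[field] += 1
    return total, counts


# Made-up rows standing in for the query output.
sample_csv = (
    "name,lat,long,openingYear\n"
    "Station A,55.95,-3.20,1870\n"
    "Station B,,,\n"
    "Station C,55.90,-3.10,\n"
)

total, counts = missing_counts(sample_csv)
for field, count in counts.items():
    print(f"{field}: {count} of {total} missing")
```

Columns with a high count, such as the opening year in this sample, are the ones most worth filling in on Wikipedia.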

Mapping the data

I decided to map the data using Google Fusion Tables because it’s quick and easy. I imported the CSV as a new Fusion table, mapped the locations using latitude and longitude, and created some custom info windows. I decided it would be nice to include photos from Wikipedia if they existed, but didn’t want to have large empty info windows when there were no photos. Luckily there is a way to do this: see Using Dynamic Templating. Below is the code I used. It also provides a link to the Wikipedia page and shows the year the station opened if that information is available.

{template .contents}
<div style="width: 300px; {if $data.value.image}height: 200px;{/if} overflow: auto">
<div style="text-align: center">
<a href="{$data.value.link}">
<b> {$data.value.name} </b>
</a>
</div>
<br>
{if $data.value.image}
<img src="{$data.value.image}" width="170">
{else}
<i>No image available</i>
{/if}
<br>
{if $data.value.openingYear}
Opened in {$data.formatted.openingYear}
{/if}
</div>
{/template}

Map of Disused Stations in Scotland

My final map is shown below. There are a lot of stations missing because the necessary information doesn’t exist on Wikipedia. I could add the missing information manually. Alternatively I could re-run my query at a later date, when hopefully more of the information has been added to Wikipedia. If I really knew what I was doing then I could get the map to run directly from a SPARQL query rather than a CSV. Someone has done just that for European capital cities, and this is where the benefits of accessing data via SPARQL really start to become apparent.