Understanding our new API: part 1

Hi, I’m James, the technical team lead for the API workstream at TfL Online. I’m writing this post because there has been some talk in the press about our new API, and, as Phil mentioned in his last post, we are preparing to release our third major update to beta.tfl.gov.uk. This update (drop 3) will add significant new functionality that we hope you will find intuitive and helpful. I thought it might be useful to start by giving you a peek at some of the work that goes on here, behind the scenes.

To give you some background, TfL was formed by the merger of several companies in 2000.  This legacy means that we have had to deal with a degree of disparity and inconsistency in our public offerings.

As a result, one of the key challenges we have faced is understanding how the many disparate data sources available from within TfL and from its suppliers relate to each other: how they flow around the organisation, feed into our systems, and can be used to provide you with a more accurate, low-latency data model. We’ve put a lot of work into establishing a new canonical model that allows us to relate these data sources, for the very first time, by using industry and government standard specifications and descriptors such as NaPTAN, TransXChange and GeoJSON, in order to make the site easier to use. This will be music to the ears of a number of developers, who have so far had to struggle to make sense of some of our open data.
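
For the technically curious, here’s a flavour of what relating data sources on shared identifiers can look like. The Python sketch below uses entirely illustrative values (the ATCO code, stop name and coordinates are made up, not real TfL data) to show how a stop point keyed by its NaPTAN identifier might be expressed as a GeoJSON feature, so that timetable, route and real-time sources can all be joined on the same code:

```python
import json

# Hypothetical example: a stop point keyed by its NaPTAN ATCO code,
# expressed as a GeoJSON Feature so that different data sources can
# be related via the same identifier.
stop_point = {
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [-0.1405, 51.5014],  # GeoJSON order: [longitude, latitude]
    },
    "properties": {
        "atcoCode": "490000001A",     # NaPTAN identifier (illustrative value)
        "commonName": "Example Stop",
        "modes": ["bus"],
    },
}

print(json.dumps(stop_point, indent=2))
```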

Underpinning the website, and soon to be available as a public beta, is our new RESTful API. Our aim is to present a one-stop shop for all publicly shared TfL data: the same data that we use ourselves to present the website to you.

Getting Technical

Apologies, but it’s about to get a bit more technical. Why? Because I’d like to share the finer details of this exciting work with the techies out there, while hopefully conveying to the rest of you some of the scale and complexity of the challenge we’ve been tasked with…

Behind the API lies a set of background services that continuously extract, transform and load ‘reference’ and ‘real-time’ data from source into our model. The 50+ load tasks we have built so far can read files, call out to APIs, and consume streams (and push services) to process and load data in many different formats and structures, every minute of every day, seven days a week.
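
In rough outline, every load task follows the same extract–transform–load shape. The Python sketch below is purely illustrative; the function names, feed URL and fixed-interval scheduling are assumptions for the purpose of the example, not our actual implementation:

```python
import time
import urllib.request

def extract(url: str) -> bytes:
    """Fetch a raw feed: a file, an API response or a stream chunk."""
    with urllib.request.urlopen(url) as response:
        return response.read()

def transform(raw: bytes) -> list[dict]:
    """Parse the source format and map it onto the canonical model.
    Stubbed here; real tasks must handle many formats and structures."""
    return [{"payload": raw[:64]}]

def load(entities: list[dict]) -> None:
    """Write the canonical entities into the data store (stubbed)."""
    for entity in entities:
        print("loaded:", entity)

def run_load_task(url: str, interval_seconds: int = 60) -> None:
    """One recurring load task: extract, transform, load, sleep, repeat."""
    while True:
        load(transform(extract(url)))
        time.sleep(interval_seconds)

# run_load_task("https://example.org/feed.xml")  # hypothetical feed URL
```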

There is a mixture of slow-churn reference data (lines, modes of transport, fares etc.), medium-longevity data that only changes from time to time (bus routes, timetables, station facilities etc.), and real-time, ephemeral data (live arrivals, status and disruptions etc.). Each entity has its own lifetime, and each both co-exists with and relates to the other entities in the model. Therefore, one of the most important working parts is how we cache and expire data appropriately for the best performance and accuracy.
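
One simple way to picture this is a cache with a different time-to-live for each entity type. The sketch below is illustrative only; the lifetimes shown are assumptions chosen to make the point, not our production settings:

```python
import time

# Illustrative lifetimes per entity type; not our production settings.
TTL_SECONDS = {
    "line": 24 * 60 * 60,   # slow-churn reference data
    "timetable": 60 * 60,   # medium-longevity data
    "arrival": 30,          # real-time, ephemeral data
}

class ExpiringCache:
    """A minimal cache in which each entry expires per its entity type."""

    def __init__(self) -> None:
        self._store = {}  # key -> (value, expiry timestamp)

    def put(self, entity_type: str, key: str, value) -> None:
        self._store[key] = (value, time.time() + TTL_SECONDS[entity_type])

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if time.time() > expiry:  # stale: evict and report a miss
            del self._store[key]
            return None
        return value

cache = ExpiringCache()
cache.put("arrival", "arrival:490000001A", {"due_in_seconds": 90})
print(cache.get("arrival:490000001A"))
```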

So what happens when the underlying data changes?  We have implemented a notification system that allows us to subscribe to events on the model, so that when a process updates data, it enqueues a broadcast to subscribers to that event.  The dequeuing service then invokes a callback to the cache to invalidate that data.
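
In outline, the pattern looks something like the Python sketch below; the queue wiring, event names and keys are illustrative assumptions rather than our actual code:

```python
import queue
import threading

events = queue.Queue()  # updated processes enqueue broadcasts here
subscribers = {}        # event name -> list of callbacks

def subscribe(event: str, callback) -> None:
    subscribers.setdefault(event, []).append(callback)

def publish(event: str, key: str) -> None:
    """Called by a load process after it updates data in the model."""
    events.put((event, key))

def dispatch_forever() -> None:
    """The dequeuing service: deliver each event to its subscribers."""
    while True:
        event, key = events.get()
        for callback in subscribers.get(event, []):
            callback(key)
        events.task_done()

# The cache is the principal subscriber: it invalidates updated data.
def invalidate(key: str) -> None:
    print("invalidating cache entry:", key)

subscribe("timetable.updated", invalidate)
threading.Thread(target=dispatch_forever, daemon=True).start()

publish("timetable.updated", "timetable:route-24")  # hypothetical event/key
events.join()  # wait until the broadcast has been delivered
```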

Our cache is the principal subscriber to events, but the developers among you might be interested to learn that we are planning to expose endpoints in the API to allow you to subscribe to a subset of these events, typically using a webhook. We will then fire a request to your resource, over HTTP, as events occur. More information will be made available when we have a public beta of the API ready for you to try.
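
On your side, a webhook subscriber is simply an HTTP resource that accepts our POSTs. The sketch below, using Python’s standard library, shows the general shape; the payload fields are placeholders, since the real event format will be documented with the public beta:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    """Receives event notifications pushed to us over HTTP."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))  # hypothetical payload shape
        print("received event:", event.get("type"), event.get("entity"))
        self.send_response(204)  # acknowledge with no body
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), WebhookHandler).serve_forever()
```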

The next article in this series will focus on geolocation. Drop 3 brings this to the API in the form of “Nearby” functionality via the Places endpoint.
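
To give you a flavour of the sort of call this will enable, here is a quick sketch; the endpoint URL and parameter names are entirely illustrative and may well differ when drop 3 lands:

```python
import urllib.parse
import urllib.request

# Entirely illustrative: the real endpoint and parameter names may differ.
params = urllib.parse.urlencode({"lat": 51.5014, "lon": -0.1405, "radius": 200})
url = f"https://beta.tfl.gov.uk/api/Places/Nearby?{params}"
with urllib.request.urlopen(url) as response:
    print(response.read()[:200])
```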

Watch this space…
