Addresses: what to derive from published sources, what to produce through crowdsourcing

We’ve done a lot of counting in the last few days, trying to measure the size of the problem of creating an open dataset of addresses for the UK. What we have not calculated yet, though, is:

– how many of those addresses we can source from some open, published sources,
– how many we can infer from other addresses through _computation_, and
– how many we need actual humans to leave their sofa and check for us.

Let’s spend a few words on each category.

## Published sources of full addresses

There are a few. They were reviewed by Open Addresses recently. The most substantial in size and quality is without doubt Land Registry’s “Price Paid Data” (LRPP), which we have already used for [the “How many addresses in one town?” post](http://sociam-olaf.tumblr.com/post/124663267575/how-many-addresses-in-one-town). Unfortunately, LRPP and many other sources have one or both of the following two licensing issues:

– Their licence is “share-alike”, so the data is open but “not too open”, depending on [your definition](http://opendefinition.org/). The share-alike constraint may force users to publish any data they derive from it, which is obviously not always possible, e.g. in business applications. The biggest data source of this kind is OpenStreetMap (OSM): someone wrote [an entire paper](http://spatiallaw.blogspot.co.uk/2014/10/the-odbl-and-openstreetmap-analysis-and.html) on the limits that OSM’s ODbL licence places on reuse.
– Even when they are “officially open data”, they often are _derived_ from PAF or some other commercial address product and are not safe to re-use. LRPP is one of those: it was [confirmed using a Freedom Of Information request](https://www.whatdotheyknow.com/request/price_paid_dataset_followup) that addresses for new property sales are checked against OS’ AddressBase when they are recorded. Legally, one could then say that LRPP’s addresses are “derived” from AddressBase (definitely, we owe AddressBase the quality of LRPP’s addresses). To this day, Price Paid is still advertised as open data [on data.gov.uk](http://data.gov.uk/dataset/land-registry-monthly-price-paid-data), because the Open Government Licence (OGL) does not exclude the possibility that the data includes third party IP.

For the sake of our academic exercise, we will make the strong assumption that these issues are minor and that we can actually use LRPP and OSM as input without worrying about the licensing details. Note, though, that to be cautious we will **not** publish the dataset that is the result of our work. We may change our mind about that in the future…

## Inferred addresses

This is explained extensively in a blog post I wrote while I worked at Open Addresses, [here](https://alpha.openaddressesuk.org/blog/2015/02/12/inference).
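
To give a flavour of what “inference” can mean here, below is a minimal sketch of one simple rule. It is my own illustration and not necessarily the method described in the linked post. It assumes that houses on a street are numbered odd on one side and even on the other, so that if numbers 1 and 7 are known on the same street and postcode, 3 and 5 probably exist too. The function name `infer_missing_numbers`, the tuple layout and the sample street and postcode are all made up for the example.

```python
from collections import defaultdict


def infer_missing_numbers(known):
    """Guess missing house numbers from known ones.

    `known` is an iterable of (street, postcode, house_number) tuples.
    Returns a sorted list of inferred tuples in the same layout.
    Assumption: odd and even numbers sit on opposite sides of the street,
    and there are no gaps between two known numbers on the same side.
    """
    groups = defaultdict(set)
    for street, postcode, number in known:
        # Group by street, postcode and parity (odd or even side).
        groups[(street, postcode, number % 2)].add(number)

    inferred = []
    for (street, postcode, _parity), numbers in groups.items():
        ordered = sorted(numbers)
        for low, high in zip(ordered, ordered[1:]):
            # Fill the gap between two consecutive known numbers.
            for candidate in range(low + 2, high, 2):
                inferred.append((street, postcode, candidate))
    return sorted(inferred)


# Made-up sample data, not taken from any real dataset.
known = [
    ("High Street", "AB1 2CD", 1),
    ("High Street", "AB1 2CD", 7),
    ("High Street", "AB1 2CD", 2),
    ("High Street", "AB1 2CD", 4),
]
print(infer_missing_numbers(known))
```

With the sample data the sketch suggests numbers 3 and 5, and nothing on the even side because 2 and 4 are already adjacent. Real streets break the assumption all the time (skipped numbers, named houses, flats), so anything inferred this way is a candidate to verify rather than a confirmed address.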

## Crowdsourced addresses

That’s the interesting part from a social machine research perspective. Can we get all of the addresses we’re missing from the two categories above thanks to human contributors in some shape or form?

You may wonder: why not get all of the addresses that way? That would be an enormous challenge, and there simply is no need for it. Some of the open datasets we’ve used every day for this research – such as Ordnance Survey’s “Open Names” – have no licensing issues and provide a structure against which the human contributions can be collected.

[John Murray](https://twitter.com/murraydata) – a collaborator of mine when I was at Open Addresses, a geospatial expert and the academic to whom I owe most of what I understand of GIS today – once told me that, thanks to OS’ latest open data, the problem of creating an address dataset today can be seen as the task of decorating a Christmas tree, where house numbers and postcodes are like the baubles and other decorations hanging on the tree. Before those datasets, we did not even have the tree!

In the following blog posts we will attempt to quantify the volume of addresses we need from each of the three categories.