Southampton addresses coverage through open sources

You’ll remember that the objective of my research this year is, sooner or later, to run a crowdsourcing exercise aimed at producing the set of addresses I am missing, while testing some of the typical dynamics of such an initiative: incentives for participants, task type and size, etc.

The last few posts focused on the size of the exercise. Obviously, I can’t plan where to aim my crowdsourcing magic if I don’t know what the picture looks like.

After yesterday’s discussion of which addresses I can source from published sources, which I can guess and which I need to produce myself, it is now useful to quantify the size of those sets.

I was being ironic a few days ago when I discussed how a few places – like my town of Berkhamsted – are much less likely than others to offer me volunteers or, more generally, people willing to accept the small compensations typical of micro-task crowdsourcing to do the actual job. I need to shift my focus to some other places.

Southampton could be a good target, not only because I’m doing my PhD there and have privileged access to my fellow students to promote the initiative, but also because, in general, the student community may be more receptive to the challenge. Let’s see what the problem looks like there.

First of all, what do we call “Southampton”? Besides the city itself, it is a set of 86 other villages, settlements etc.

https://gist.github.com/giacecco/eff1ef4ea942e33ff90b

How many addresses total?

Let’s then use the Office for National Statistics’ “National Statistics Addresses Lookup” (NSAL) trick to estimate the number of addresses: we get 237,702.

https://gist.github.com/giacecco/44eff5f4c9bc1727b40e

How many addresses published in open data sources?

Then, let’s see how many addresses we can already find in Land Registry’s “Price Paid” (LRPP). Remember that, if we wanted to produce this data for real, we would not be able to use this dataset and many others. We get 91,933 addresses, about 39% of the total. Intuitively, that already sounds very good.

https://gist.github.com/giacecco/6b603c0dd170570b98e6
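The 39% figure is simple arithmetic on the two counts quoted so far. A minimal sketch of the check, with variable names that are mine and not taken from the gists:

```python
# Counts quoted in this post.
nsal_estimate = 237_702  # addresses estimated through the NSAL trick
lrpp_found = 91_933      # addresses found in Land Registry's "Price Paid"

# Share of the estimated total that LRPP already covers.
coverage = lrpp_found / nsal_estimate
print(f"LRPP coverage: {coverage:.1%}")
```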

(If I were ready, I would also have imported addresses here from another open dataset, Companies House’s “Free Data Product”, but we’ll talk about that another day.)

How many addresses can be inferred?

I am re-using here some work that I originally did for Open Addresses. We’re going to use the most basic form of address inference that is explained in this blog post of mine.

There are 74,713 addresses in LRPP’s records for Southampton for which we know at least one house number.

https://gist.github.com/giacecco/a97c5527184efc4c5152

7,002 street + postcode pairs are suitable for our algorithm, as each gives us at least two house numbers.

https://gist.github.com/giacecco/f4f545cfd364865a231d

Note that one street can be covered by more than one postcode. We need to treat the two or more “segments” a street is divided into by postcodes as distinct inference opportunities.
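To make the idea of “segments” concrete, here is a rough sketch of how records could be grouped by street + postcode and filtered down to the pairs with at least two house numbers. The sample records, the postcodes in them and the function name are all made up for illustration; the actual gists may do this differently.

```python
from collections import defaultdict

# Hypothetical sample records: (house_number, street, postcode).
records = [
    (2, "HIGH STREET", "SO14 2AA"),
    (8, "HIGH STREET", "SO14 2AA"),
    (15, "HIGH STREET", "SO14 2BB"),
    (21, "HIGH STREET", "SO14 2BB"),
    (4, "MILL LANE", "SO15 1CC"),
]


def inference_opportunities(records):
    """Group house numbers by (street, postcode) and keep only the
    segments with at least two distinct house numbers."""
    segments = defaultdict(set)
    for number, street, postcode in records:
        segments[(street, postcode)].add(number)
    return {
        key: sorted(numbers)
        for key, numbers in segments.items()
        if len(numbers) >= 2
    }


opportunities = inference_opportunities(records)
for (street, postcode), numbers in opportunities.items():
    print(street, postcode, numbers)
```

In this toy data the same street yields two separate opportunities, one per postcode, while the street with a single known house number is dropped.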

Finally, let’s apply the algorithm to each of those opportunities. We generate 190,285 new addresses. That’s a lot! Possibly too many…

https://gist.github.com/giacecco/c10aa1a290673e3c5614
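For readers who have not seen the linked post, the sketch below shows one plausible reading of “the most basic form of address inference”. It is an assumption on my part, not a copy of the code in the gist: within one street + postcode segment, every house number lying between the lowest and highest known number on the same side of the street (same parity) is guessed to exist.

```python
def infer_house_numbers(known):
    """Return the house numbers NOT in `known` that lie between the lowest
    and highest known number of the same parity (assumed inference rule)."""
    known = set(known)
    inferred = set()
    for parity in (0, 1):  # even side, then odd side
        same_side = [n for n in known if n % 2 == parity]
        if len(same_side) >= 2:
            inferred.update(range(min(same_side), max(same_side) + 1, 2))
    return sorted(inferred - known)


print(infer_house_numbers([2, 8]))           # gaps on the even side
print(infer_house_numbers([1, 7, 10]))       # only the odd side has two numbers
print(len(infer_house_numbers([2, 200])))    # two distant numbers, many guesses
```

The last call shows why a rule like this can overshoot: two known numbers far apart are enough to generate dozens of guessed addresses, whether or not the properties exist.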

So, to recap:

we expected 237,702 addresses, from the number of UPRNs
we got 91,933 out of Land Registry’s “Price Paid”
we inferred (or “guessed”) 190,285 from those 91,933

… and in the end we got many more addresses than we were expecting: about 19% too many. This is both good and bad news.
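The 19% comes straight from the three numbers in the recap. A minimal sketch of the arithmetic, assuming the inferred addresses are all new, that is, none of them duplicates an address already in LRPP:

```python
# Figures from the recap above.
expected = 237_702  # estimated from the number of UPRNs
from_lrpp = 91_933  # found in Land Registry's "Price Paid"
inferred = 190_285  # guessed by the inference algorithm

total = from_lrpp + inferred
surplus = total / expected - 1  # fraction above the expected count

print(f"Total addresses: {total:,}")
print(f"Surplus over expectation: {surplus:.1%}")
```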

The bad news is that the inference algorithm we used is likely too optimistic, as it suggests the existence of properties that probably aren’t there. This is a known weakness of the method, but in past experiments, on smaller input datasets, it never looked like it could become a problem of this magnitude.

Moreover, remember that we know nothing about any house names associated with those inferred addresses.

The good news is that the LRPP must be well spread over the territory, as it enables plenty of opportunities for inference.

… and the most interesting news is that perhaps we should think of the crowdsourcing exercise I need to plan not as an address creation one, but as an inferred address deletion one! Very, very intriguing.

The full code for this blog post and more is here.
