Spin Up the TIGER Geocoder, Quickly

Geocoding is expensive. Translating ‘520 Broadway, NYC’ into ‘40.7226687,-73.9987579’ comes with caveats. You’re up against network endpoints, daily request limits, licensing, and potentially a sizable bill, depending on your needs. When your collection of addresses is on the order of millions or hundreds of millions of rows, open-source solutions to the geocoding stack become essential. We’re extremely excited to see what companies like Mapzen and CartoDB have in store for free and improved access to mapping and geocoding, and we’re hoping that we can help, too.

To this end, we are releasing two tools so that anyone can stand up a geocoder backed by the U.S. Census Bureau’s TIGER dataset.

When it comes to public data, geographic awareness of the things described in the rows and columns is essential. The EPA publishes data on the accidental release of toxic substances by industrial sites. But to analyze the impact on, say, cancer rates or aquifers, we need to convert string-based addresses to latitude and longitude so that we can connect pollution to neighboring rivers and communities.

There is already a lot of open-source energy focused on making use of the TIGER dataset. Thanks to the work of the community, there is an excellent extension to PostGIS that not only auto-populates your PostgreSQL database with the TIGER shapefiles, but also adds a number of valuable tools to your toolchain, including functions for geocoding, reverse geocoding, and address normalization.
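To give a feel for those functions, here is a minimal sketch of the SQL you might send to a database where the extension and the TIGER data are already loaded. The extension's `geocode`, `reverse_geocode`, `normalize_address`, and `pprint_addy` functions are real. The Python helper names are hypothetical, and the `%s` placeholders assume a DB-API driver such as psycopg2, which is not imported here so the snippet runs without a database.

```python
# Sketch only: each helper returns a (query, params) pair that a DB-API driver
# such as psycopg2 could run with cursor.execute(query, params).
# Assumes the postgis_tiger_geocoder extension and TIGER data are loaded.
# The helper names below are hypothetical, chosen for illustration.

def geocode_query(address, max_results=1):
    """Forward geocoding: address string -> rating, longitude, latitude."""
    sql = (
        "SELECT g.rating, ST_X(g.geomout) AS lon, ST_Y(g.geomout) AS lat, "
        "pprint_addy(g.addy) AS matched_address "
        "FROM geocode(%s, %s) AS g"
    )
    return sql, (address, max_results)


def reverse_geocode_query(lon, lat):
    """Reverse geocoding: a NAD83 (SRID 4269) point -> first matched address."""
    sql = (
        "SELECT pprint_addy(r.addy[1]) AS address "
        "FROM reverse_geocode(ST_SetSRID(ST_Point(%s, %s), 4269)) AS r"
    )
    return sql, (lon, lat)


def normalize_query(address):
    """Address normalization: free text -> standardized, pretty-printed form."""
    sql = "SELECT pprint_addy(normalize_address(%s)) AS normalized"
    return sql, (address,)


if __name__ == "__main__":
    query, params = geocode_query("520 Broadway, New York, NY")
    print(query)
    print(params)
```

Each row returned by `geocode` carries a `rating`, where lower values indicate a closer match, so it is worth filtering or sorting on that column when processing addresses in bulk.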

With multiple teams working in parallel on different projects, we needed a way to get the PostGIS extension and the TIGER dataset running quickly and seamlessly on different servers and for different use cases. It was therefore essential that we develop a provisioning tool to enable this flexibility. Today we’re happy to share with you two repos:

An Ansible role for the TIGER geocoder extension for PostGIS

This is for those who already use the server-provisioning software Ansible, and want to wrap this role into their existing codebase. Visit that repo here.

A fully developed script for spinning up the TIGER geocoder

This repo employs an Ansible playbook for those who want to start from square one. You can follow the step-by-step directions to load the whole country, or a particular state or territory, either locally or on an Amazon instance. You don’t need prior knowledge of how the underlying tools work. You can see that repo here.

The hope is that with more access will come better open technology, and by extension, a progression away from sealed fortresses of proprietary geo tools and data. Things are certainly improving, with a growing community churning out potential solutions to what remains one of the more difficult data engineering eggs to crack. We hope our contribution will help others on their way towards making sense of the ‘where’ in open data.