Chicago builds ETL toolkit for open data
21 Jan 2015 Marcel
Data officials in Chicago are churning out open datasets faster than ever by using technology rather than developers to get the job done.
About a year ago, the city government embedded Pentaho Data Integration (PDI), a graphical extract-transform-load (ETL) tool with pre-built and custom components to process big data, into its OpenData ETL Utility Kit. The kit provides several utilities and a framework to help governments extract data from a database and upload it to an open data portal using automated ETL processes.
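In outline, the kind of automated extract-transform-load pass the kit standardizes looks like the following. This is a minimal Python sketch with a hypothetical source table and a stubbed portal upload, not the kit's actual PDI jobs:

```python
import sqlite3

def extract(conn):
    """Pull rows from a source table (hypothetical schema)."""
    return conn.execute("SELECT name, value FROM readings").fetchall()

def transform(rows):
    """Normalize field names and types for the portal."""
    return [{"name": name.strip().title(), "value": float(value)}
            for name, value in rows]

def load(records):
    """Stand-in for an upload to an open data portal's API."""
    return len(records)  # pretend every record was accepted

# Demo with an in-memory database standing in for a city system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (name TEXT, value TEXT)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("oak street ", "7.1"), ("north ave", "6.4")])
uploaded = load(transform(extract(conn)))
print(uploaded)  # number of records pushed
```

The point of the kit is that this extract-transform-load shape stays fixed while only the query and portal settings change per dataset.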
Before working with PDI, city workers updated datasets manually, said Jon Levy, open data program manager at the Chicago Department of Innovation and Technology. “Most of it was custom-written Java code and that just became difficult to maintain, difficult to enhance because it required a very complicated skill set,” Levy said.
That also meant Java developers were spending time on updates rather than writing applications that could help city workers and residents, added Tom Schenk, the city’s chief data officer.
“What’s different now is we have a framework that can be easily used by a lot of people,” Schenk said. “I could also give that tool to a number of users around the city of Chicago and they’d be able to program ETLs that are going to be easier for them to understand, easier for them to create. It allows us to be more nimble.”
For example, the city tapped into an application programming interface (API) that monitors water quality at Lake Michigan beaches and uses an ETL process created with PDI to push the information out hourly.
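A pull-and-reshape step of that kind might look like the following Python sketch. The field names and payload here are made up for illustration and are not the actual beach-monitoring API:

```python
from datetime import datetime, timezone

def to_portal_rows(api_payload):
    """Reshape raw API records into flat rows for an open data portal."""
    rows = []
    for rec in api_payload["results"]:
        rows.append({
            "beach": rec["site"],
            "turbidity": round(float(rec["turbidity_ntu"]), 1),
            "retrieved_at": datetime.now(timezone.utc).isoformat(),
        })
    return rows

# Hypothetical hourly payload from the monitoring API.
payload = {"results": [
    {"site": "Montrose", "turbidity_ntu": "1.234"},
    {"site": "Ohio Street", "turbidity_ntu": "2.567"},
]}
rows = to_portal_rows(payload)
print(rows[0]["beach"], rows[0]["turbidity"])
```

In the city's setup this step runs on a schedule, so a reading lands on the portal within the hour it is taken.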
More recently, an organization used the city’s open data to create a map-based app showing where people need permits to park in Chicago, Schenk said. Real estate agents now can show what residential parking looks like in a given area – a task that would have taken considerable manual effort before, Levy added.
The two officials also use the information to perform advanced analytics and to merge data to develop predictive models. And every time something is uploaded to the portal, they get email notices. Other features include a log to track what’s happening in the portal and a tool that lets users monitor ETL run times over time to diagnose problems.
Chicago’s toolkit is free to download. To set it up, IT managers must configure about six options to orient it and direct it to their portal. From then on, every time an ETL is needed, a base template exists.
“Probably the most complex thing of all is to write the code to extract data from a server,” Schenk said. “At that point, you work with a database administrator, who gives you the SQL code or whatever it might be, you extract it from a server and then you configure about four specific things, such as what dataset is this called.”
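The pattern Schenk describes, a shared base template plus a handful of per-dataset settings, can be sketched in Python like this; all of the option names and values are hypothetical, not the toolkit's real configuration keys:

```python
# One-time "orientation" defaults, set when the toolkit is installed.
BASE_CONFIG = {
    "portal_url": "https://data.example.gov",   # hypothetical portal
    "notify_email": "opendata@example.gov",
    "log_dir": "/var/log/etl",
}

def make_etl_config(sql, dataset_name, dataset_id, schedule):
    """Each new ETL only supplies a few dataset-specific settings."""
    config = dict(BASE_CONFIG)
    config.update({
        "sql": sql,               # extract query provided by the DBA
        "dataset_name": dataset_name,
        "dataset_id": dataset_id,
        "schedule": schedule,     # e.g. "hourly" or "daily"
    })
    return config

cfg = make_etl_config("SELECT * FROM permits", "Parking Permits",
                      "abcd-1234", "daily")
print(cfg["dataset_name"], cfg["portal_url"])
```

Because the portal address and notification settings live in the base configuration, standing up a new dataset reduces to filling in a few fields rather than writing custom code.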
At the end, you can publish the dataset.
“The whole workflow is not about custom development, it’s about entering simple questions, simple procedures, and that allows you to get online and running,” Schenk added.
The result is that datasets can be opened and updated in an afternoon because the toolkit removes the need to wait on staff members. The data portal can also be extended without having to add more people, said Schenk, noting that automation is the key to successful open data programs without large staffs.
Chicago had all the infrastructure in place when it started. The toolkit is compatible with Windows, Apple and Linux operating systems, making integration easy, Schenk said.
He hopes officials at other city governments will use the technology to further their own open data programs and report back on how the ETL tool can be tweaked, so that needed features get added.
“What we’re looking forward to is getting feedback from other cities to see how the ETL can be used and how it can be tweaked,” Schenk said.