Finding First Names in Open Data
The other day I read a plaintive email from a man with two kids who was asking a Sarasota County Commissioner for advice on looking for a job. I hadn’t hacked the man’s Gmail account – I was just checking out the Sarasota County Commissioner email database. All emails sent to and from Sarasota County Commissioners are a matter of public record in Florida, so the County’s website makes it very easy to search through them. The Sarasota commissioners must be well aware that every hasty reply will end up online, but did the job seeker know that his full name, his description of his job situation and his personal email address would be online for me to see?
This started me thinking about Personally Identifiable Information (PII) in government open datasets. PII is defined by the Department of Homeland Security as “any information that permits the identity of an individual to be directly or indirectly inferred.” The DHS then distinguishes between plain PII and “sensitive” PII, which includes Social Security numbers and information on a person’s religious affiliation, sexual orientation or ethnicity. How much PII is in government open datasets? I took a look at all the data in one portal, the New York State Open New York site.
I combed through all 1,383 datasets on the open.ny.gov site, looking for any set that included information on individual people. For the sake of this investigation I defined PII as any one of the following:
1) The name of a specific person who is not an elected official
2) The personal email of a specific person
3) The personal phone number of a specific person
4) The home address of a specific person
I found PII in 69 of the 1,383 datasets on open.ny.gov, or about 5%. This PII ranged from the names of all CEOs of active corporations registered in New York, to the jockeys and drivers involved in horse deaths or breakdowns, to the names of all building managers for any structure with an oil boiler.
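The count above came from reading through the datasets by hand. For anyone who wants to repeat the exercise on another portal, a rough first pass could be automated by scanning column names for the four criteria. The sketch below is only an illustration under stated assumptions: the keyword patterns, the function names and the sample column lists are all hypothetical, and none of them reflect how the 69 datasets were actually identified. The last line simply reruns the arithmetic on the figures reported here, which comes out just under 5%.

```python
import re

# Hypothetical keyword patterns, one per PII criterion listed above.
# They are illustrative guesses, not the rules behind the manual count.
PII_PATTERNS = {
    "name": re.compile(r"\b(first|last|full|owner|manager|contact) ?name\b"),
    "email": re.compile(r"\be ?mail\b"),
    "phone": re.compile(r"\b(phone|telephone)\b"),
    "home_address": re.compile(r"\b(home|residential|mailing) address\b"),
}


def normalize(column):
    """Lowercase a column name and turn punctuation/underscores into spaces."""
    return re.sub(r"[^a-z0-9]+", " ", column.lower()).strip()


def pii_criteria(columns):
    """Return the sorted criteria whose pattern matches any column name."""
    hits = set()
    for column in columns:
        text = normalize(column)
        for criterion, pattern in PII_PATTERNS.items():
            if pattern.search(text):
                hits.add(criterion)
    return sorted(hits)


def pii_share(flagged, total):
    """Percentage of datasets flagged as containing PII."""
    return 100.0 * flagged / total


# Made-up column lists, for illustration only.
boiler_columns = ["Building ID", "Manager_Name", "Contact E-mail"]
rainfall_columns = ["Station", "Date", "Rainfall (in)"]

print(pii_criteria(boiler_columns))
print(pii_criteria(rainfall_columns))
print(round(pii_share(69, 1383), 2))
```

A keyword match like this would only narrow the field. A column called "name" may hold a business rather than a person, and nothing in a column header says whether the person named is an elected official, so every flagged dataset would still need a manual look.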
The most accessed datasets on the open.ny.gov website are disproportionately likely to involve PII. Three of the ten most accessed datasets, or 30%, contain information related to specific people, compared with about 5% of the portal as a whole.
By definition, open data must be usable for secondary purposes other than the reason for which it was collected. There are very good reasons why the government should maintain a list of unlicensed plumbers who have had disciplinary action taken against them – who knows what further pipe havoc these plumbers could wreak? Yet should these unlicensed plumbers have their mistakes held against them in other aspects of their lives? What happens when this data is combined with the other information that exists online about these individuals? The New York State Open Data Handbook acknowledges that some of the datasets may contain PII and states that “even if there are no legal impediments to publishing the data, releasing the data may have unintended or undesirable effects.” It is time to consider these unintended effects.