Indiana Jones and the Raiders of Open Data

I decided to write a positive blog post about Open Data; it’s an area I believe has a lot of potential and value. The only problem is that once I started, I began to think about its complexities: it’s really not as simple as it sounds.

The Open Definition provides a very detailed definition of the requirements for ‘open’ data and content. For many of its proponents, it’s almost like the quest for the holy grail. As Professor Henry Jones put it in the movie Indiana Jones and the Last Crusade, “the quest for the grail is not archaeology, it’s a race against evil.” There is almost a religious belief in the value of Open Data. While that may be overstating things a bit, it is a quest for value and understanding. Open data is the underpinning of open information and knowledge, which is what open data becomes when it’s useful, actionable and used. The key features that this definition posits for openness are:

Availability and accessibility
Reusability and redistribution
Universal participation

Why should data be open? – “It’s a leap of faith.” – Dr Henry Jones

There are three reasons for open data:

Transparency
Releasing social & commercial value
Supporting participation and engagement

The participation argument is particularly important when looking at government data. By opening up government data, citizens can be much more informed and involved in decision-making. This is more than simple transparency of what local and national governments are doing: it’s about creating a full “read/write” society, not just about knowing what is happening in the process of governance but being able to actively contribute to it.

What types of data should be considered for open access?

Broadly speaking, data can be regarded as either Core or Context:

Core – The actual data from processes and activities (e.g. sales and production data)
Context – Supporting data that allows you to make sense of your core data. An example would be information that allows you to understand the effect that weather or cultural events in a specific area have on sales (e.g. selling more umbrellas when it’s raining). Another example is one of my wife’s favourite iPhone apps, Citymapper, which provides A to B trip planning using real-time departure information for all modes of public transport, along with constantly updated information on the weather, alerts and any travel disruptions.
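To make the Core and Context split concrete, here is a minimal sketch of the umbrella example. The dates, sales figures and weather labels are all invented for illustration, and the function name average_sales_by_weather is my own. The point is only that the core data says little on its own until it is lined up with the context data.

```python
from statistics import mean

# Core data: hypothetical daily umbrella sales (units sold), invented for illustration.
sales = {
    "2015-03-01": 12,
    "2015-03-02": 41,
    "2015-03-03": 9,
    "2015-03-04": 38,
}

# Context data: hypothetical weather observations for the same days.
weather = {
    "2015-03-01": "dry",
    "2015-03-02": "rain",
    "2015-03-03": "dry",
    "2015-03-04": "rain",
}


def average_sales_by_weather(sales, weather):
    """Group the core sales figures by the weather recorded on the same day."""
    grouped = {}
    for day, units in sales.items():
        condition = weather.get(day)
        if condition is None:
            continue  # no context available for this day, so skip it
        grouped.setdefault(condition, []).append(units)
    return {condition: mean(units) for condition, units in grouped.items()}


averages = average_sales_by_weather(sales, weather)
print(averages)
```

With these made-up numbers the rainy days show a much higher average than the dry ones, which is exactly the kind of pattern the core data cannot reveal by itself.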

Social media data is an interesting, and increasingly important, data source that can fit into both of these categories. Companies know the transactions they have had with customers but often don’t know who the individual customers actually are or what motivated their buying choices. So this is a mixture of both additional core and context data.

One of the organisations I have been talking with in the UK is the Digital Catapult. They are working with large and small companies as well as government agencies to encourage the “creation of secure environments that allow UK organisations to safely mix their closed data and open it up to data innovators.” This is key: open does not mean giving up all security or even opening up all data.

Whose Data is it Anyway? – “Fortune and glory, kid. Fortune and glory.” – Indiana Jones

Companies and public sector agencies tend to treat data as a resource they own. However, it’s not always as simple as that. For example, as a healthcare patient, have you been properly informed about, and have you consented to, the storage and sharing of your personal medical data? This is a growing issue. The Internet of Things Privacy Index 2015 suggests that 80% of British consumers are concerned about personal information being collected by smart devices and 73% believe they should own and control that data. I believe most people are comfortable sharing their data if they believe there is a mutual value exchange – that there is something in it for them as well.

As an example, people are generally okay with the idea that a paramedic attending an accident should have access to their data, as it could directly affect their health (or life), but a hospital selling that same data to a manufacturer of products targeted at their medical condition is less acceptable to most people. The problem is compounded by the growing number of ways in which companies and governments collect personal data without our knowledge or consent. Perhaps the solution would be to allow the end-user more control over who can access and use their data, but this would be a major cultural (as well as legal and financial) shift for companies and governments.

“Look at this. It’s worthless — ten dollars from a vendor in the street. But I take it, I bury it in the sand for a thousand years, it becomes priceless.” – Indiana Jones

One of my favourite quotes is “the truth is a three-edged sword. Your truth, my truth and the real truth.” This is very relevant when looking at data; essentially, data is not value-neutral. So it is important not only to understand the provenance of your data but also to understand its inherent value and relevance. Thus, as in data archaeology, it is vitally important that the data is intelligible. As Indiana Jones put it, “Archaeology is the search for fact… not truth. If it’s truth you’re looking for, Dr. Tyree’s philosophy class is right down the hall.” In Open Data, fact is a vital prerequisite for truth. It’s a classic GIGO case: “garbage in, garbage out.”

It is also important to understand the difference between looking for patterns in the data and looking for data that supports a conclusion already made. The latter is known as confirmation bias: the tendency to seek out, or give more weight to, information that confirms one’s preconceptions, leading to statistical errors. This leads to a key issue: the accuracy of the information that you create from your data. The classic problem of correlation vs. causation needs to be understood and accounted for.
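A small sketch may help show how easily correlation can mislead. The two series below are invented numbers with no connection to each other beyond the fact that both drift upwards over time. Comparing the raw values gives a strong correlation; comparing the period-to-period changes, one common way of removing a shared trend, makes it disappear. The helper names pearson and differences are my own, and this illustrates the idea rather than offering a complete statistical test.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def differences(xs):
    """Change from each period to the next, which strips out a steady trend."""
    return [later - earlier for earlier, later in zip(xs, xs[1:])]


# Two hypothetical, unrelated measurements taken over ten periods.
# Both simply grow over time; the values are invented for illustration.
series_a = [1, 0, 3, 2, 5, 4, 7, 6, 9, 8]
series_b = [1, 3, 3, 5, 9, 11, 11, 13, 17, 19]

level_r = pearson(series_a, series_b)
change_r = pearson(differences(series_a), differences(series_b))

print("correlation of raw values:", round(level_r, 2))
print("correlation of changes:", round(change_r, 2))
```

The raw values correlate at above 0.9, yet the changes do not correlate at all. The shared upward drift, not any real relationship, produces the first number.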

“Sallah, I said no camels. That’s five camels. Can’t you count?” – Indiana Jones

One of my favourite recent articles is from Wired magazine, entitled “Eat cheese… and risk strangulation”. In this article, Tyler Vigen illustrates a number of spurious correlations – a key danger in data analysis. Another example of potentially spurious correlation is illustrated by “Six degrees of Kevin Bacon”, which is based on the “six degrees of separation” concept: the idea that any two people on Earth are six or fewer acquaintance links apart. If we assume that all contacts between people have equal weight, there is a significant risk of spurious correlation. This could lead to examples of reputation fate sharing, where you are affected by something in which you were not involved. How do you know whether that person you chatted with about Wedgwood pottery on Facebook yesterday is a criminal, terrorist or other ne’er-do-well?
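The “degrees of separation” idea can be sketched as a breadth-first search over a contact graph. The graph below is entirely hypothetical, as are the names in it and the function degrees_of_separation. Note that the search treats every link as equal: a single chat about pottery counts exactly as much as a lifelong friendship, which is the assumption that creates the risk described above.

```python
from collections import deque

# Hypothetical contact graph; the people and links are invented for illustration.
contacts = {
    "you": ["pottery_fan"],
    "pottery_fan": ["you", "stranger"],
    "stranger": ["pottery_fan", "suspect"],
    "suspect": ["stranger"],
}


def degrees_of_separation(graph, start, target):
    """Number of links on the shortest path from start to target, or None."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        for neighbour in graph.get(person, []):
            if neighbour == target:
                return distance + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None  # no chain of contacts connects the two people


hops = degrees_of_separation(contacts, "you", "suspect")
print(hops)
```

Here “you” end up three links from “suspect” purely because of one online conversation. The number says nothing about involvement, only about who happened to talk to whom.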

Another aspect of this is the problem of understanding the tone in which a comment is made. For example, people have been banned from travelling because of jokes they made on Twitter. The nuances of culture and humour must be important parts of any social media folksonomy.

“It’s not something to be taken lightly. No one knows its secrets. It’s like nothing you’ve ever gone after before.” – Indiana Jones

So, am I still keen on Open Data? I’d have to say yes, I am. But I recognise that, like all big data activities, it has significant challenges. It is important to go into it with your eyes open and address all these issues carefully and thoroughly. Or to put it another way, make sure you open your eyes before you open your data.

Also by Mateen Greenway

Mateen Greenway is the HP Enterprise Services chief technologist for the Europe, Middle East and Africa (EMEA) Public Sector, where he is responsible for overall strategy, technical direction and innovation, and for leading the senior client-facing technical leaders and key account chief technologists in this major industry.
