Data standards and inclusion in the network society

This note is based on a presentation given at the ‘Inclusion in the Network Society Roundtable’ hosted by IT For Change in Bangalore, 29th September – 1st October. It develops a number of points that time did not allow for in the talk itself.

The original presentation responded to themes from earlier talks, including Nishant Shah’s insightful challenge to the notion of the ‘network’ as the organising concept of our time, Mark Graham and Christopher Foster’s empirical look at inequalities of information production and use, and Roberto Bissio’s analysis of the ‘Data Revolution’ and trends in big data, and so may make most sense read in the light of those talks.

The hidden regulatory power of data standards

Lawrence Lessig has famously declared that ‘Code is Law’ (Lessig 2006), drawing our attention to the regulatory function played by digital systems in our modern world. As we think about Inclusion in the Network Society, I want to also draw attention to the role of data and data standards in structuring the systems that affect inclusion and exclusion, and, for a short while, to surface these increasingly important elements of our digital infrastructure for scrutiny.

I was drawn to explore data standards and their political salience through my work on open data policies around the world. The standard narrative of open data suggests that open data is simply about transferring datasets from inside the state to open publication on the Web. All that changes is where the dataset sits, and who can use it. Yet, in practice this is not the case at all (Davies and Frank 2013). Datasets rarely exist as simple atomic things. Instead, opening data often involves the construction of new datasets: standardised, reshaped and filtered. Legitimate privacy concerns mean that much state data, based as it is on personal records, cannot be made wholesale open; instead, only aggregated extracts can be published. And in presenting data to the world, governments may take disparate datasets and seek to render them using common external-facing standards.

Watching open data operate in practice, I have observed a growth in efforts at standardisation. Unlike past standardisation projects, which took place in the committees of standards organisations, this new wave of standardisation has a more ad-hoc nature to it: taking place across the Web, and often at global scale. Standards not only configure the data that gets published, they also configure the tools available to work with it, and in the process change the costs and benefits a government faces when deciding whether or not to adopt a particular standard. Standards both enable certain uses of data, and constrain what can be expressed, what can be known from data, who can be known, and who can know.

Data structures and standards have an interesting property when compared to the other modes of regulation in a network society. Lessig identifies four modes of regulation:

  • Law
  • Markets
  • Norms
  • Architecture

And data standards sit alongside code as part of the ‘architecture’ mode. Whilst our network society can make law, markets and norms more visible and contestable, by default code, data and standards become an embedded part of the background, rarely subjected to scrutiny, and rarely open to being shaped by those whom they affect.

Considering the regulations faced in day-to-day life, when acting we have a choice between:

  1. Taking action within the current ‘rules’ of the game (being an active consumer; voting; creating something); or
  2. Changing the rules.

The malleability of the rules depends on both the level at which they are set, and how visible they are to us as subjects of critique. Law operates, more or less, at the level of the nation state. Markets, although with many invisible aspects, are increasingly made visible to us through media technologies: and alternative models of production and consumption offer opportunities to change within the system, and to some extent to change the rules of the system. Cultural norms are similarly surfaced and challenged by communications networks. Architecture, by contrast, is far less visible. It takes what Bowker and Star (Bowker 2000) call an ‘infrastructural inversion’ to get us to see it. We often only become aware of these infrastructures when they break down.

In scrutinising the impacts of infrastructures on inequality and equality, we could look at architectures and policies at all levels of the Internet stack. Questions of net neutrality, for example, are well explored in terms of their potential to consolidate the power of the powerful at the expense of new participants in the network. However, my focus is on the emerging network of architectures for data on the Web: sitting at the content layer, and built out of data standards, shared categories and identifiers. Michael Gurstein has drawn attention to the fact that open data is likely to be more empowering to those who have the access, skills and capacity to get things out of the data (Gurstein 2011); but I want to draw attention to the way it also empowers those who have the ability to get things into the data, and most importantly, those who can get things baked into the standards through which data will be exchanged.

The field-level tussle

To give a concrete example, let me turn to a project I’m currently working on: the Open Contracting Data Standard. The World Wide Web Foundation is leading work to prototype a data standard that will help deliver on government pledges to the Open Contracting Partnership that all stages of procurement, land and extractives contracts will be made open: published in detail online. Because a proliferation of disparate contracting datasets would be hard to assess and hard to analyse, a standard is needed to specify how such data should be released.

We’ve gone about building the standard from three directions. Firstly, looking at the data that states already publish (there is little point in building a standard that can only represent data which doesn’t actually exist). Secondly, looking at detailed user stories of how different communities, from corruption monitors to small firms competing for government contracts, want to make use of contracting data. Needless to say, there is not perfect alignment between these. Thirdly, looking at existing standards and tools, and seeking opportunities to re-use rather than invent approaches to representing data: identifying the best ways to maximise the number of people who will be able to access and use the data, without sacrificing too much expressiveness or extensibility.

Things that might seem like obscure technical choices, such as how to represent location in the data, are ultimately socio-technical tussles, with significant potential consequences. To follow the example through: users often want to know the locations related to a contract, but which location? The jurisdiction of the procuring body? The location where goods and services are to be delivered? What about when there are multiple goods and services going to different locations? In designing a standard, decisions have to be made as to whether to allow location to be attached anywhere in the data, or whether to restrict the ways it can be expressed. If the locations of contracts, organisations, line-items and so on can all be expressed, what is the chance that different governments will actually capture this in their databases, and that tools will develop that can cope with location data at all these levels? Design choices are constrained by the existing database infrastructures of the state: but they also feed back into creating those infrastructures in the future. If only the owners of existing systems drive the standardisation process, new data needs will fail to be accommodated in the standard. But if it is based on ideals, rather than the realities of currently held data, it will struggle to see adoption.
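
To make this tussle tangible, the sketch below (written as a Python dictionary, mirroring a JSON-style record) shows location attached at two different levels of a single contracting record: to the procuring body, and to individual line items. The field names are invented for illustration and are not drawn from the draft standard itself; the point is simply how quickly the structural choices multiply.

    # A minimal sketch of the design question: where may 'location' appear?
    # All field names here are illustrative, not the Open Contracting schema.
    contract_release = {
        "buyer": {
            "name": "Ministry of Health",
            "location": {"region": "Karnataka"},    # jurisdiction of the procuring body
        },
        "items": [
            {"description": "Vaccine fridges",
             "location": {"district": "Mysore"}},   # delivery point for one line item
            {"description": "Cold-chain training",
             "location": {"district": "Hubli"}},    # another item, another place
        ],
        # Alternatively, a standard could require a single top-level location,
        # which is simpler for tools but loses the distinctions above.
    }

    def locations(release):
        """Collect every location mentioned, wherever it is attached."""
        found = [release.get("buyer", {}).get("location")]
        found += [item.get("location") for item in release.get("items", [])]
        return [loc for loc in found if loc]

    print(locations(contract_release))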

Flattening the network society

If readers will permit a further excavation of the infrastructural layers of data standards, even beneath the level of the fields, objects and properties within a dataset there are substantial questions and tensions to explore about the different data models being used on the Web, and their implications for inclusion and empowerment.

In the early days of open data, a lot of emphasis was placed on Linked Open Data (LOD), represented through the RDF (Resource Description Framework) model. This is a graph-based data model: essentially a way of representing data built for the network. A truly network society data model. In LOD, everything is either a node or a link. In fact, even nodes are identified using links. Theoretically, data can be distributed right across the Web: the power to describe different things devolved to disparate actors and data systems through identifiers layered on top of the domain name system. Yet, in practice, LOD is computationally and cognitively expensive to work with. Although graph-based data models underlie many of the most popular services on the Web (not least the social graphs of Facebook and Twitter), these run on centralised infrastructures, and a distributed Web of data has struggled to catch on.
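
For readers who have not met RDF, the toy sketch below (in Python, for want of a neutral notation) captures the core idea: data as a set of (subject, predicate, object) triples, in which both things and the links between them are identified by URLs, so that descriptions published by different actors can simply be merged. The example.org identifiers are invented for illustration.

    # A toy illustration of the RDF graph model: each fact is a
    # (subject, predicate, object) triple, and identifiers are URLs.
    # The example.org URLs are invented; rdfs:label is a real vocabulary term.
    triples = [
        ("http://example.org/contract/42",
         "http://example.org/def/awardedTo",
         "http://example.org/org/acme"),
        ("http://example.org/org/acme",
         "http://www.w3.org/2000/01/rdf-schema#label",
         "ACME Ltd"),
    ]

    def describe(subject, graph):
        """Everything the graph says about one node, looked up purely by its URL."""
        return [(p, o) for s, p, o in graph if s == subject]

    # Triples from different publishers can be concatenated and queried together,
    # because URLs act as globally unique identifiers.
    print(describe("http://example.org/org/acme", triples))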

Similarly, XML (eXtensible Markup Language), once seen as the future data interchange language of the Web, with its strong validation and schema languages and its ability to mix documents and data by adding structured, meaningful markup to text, has lost its place in programmers’ hearts to JSON (JavaScript Object Notation): a schema-free, simple model, easy to author by hand, and capturing data in a simple tree structure (a tree structure is like the numbered paragraphs in a very detailed report, with sections 1, 1.1, 1.1.1 and so on). However, early this year even JSON appeared to be losing ground, as Jeni Tennison, a member of the World Wide Web Consortium’s (W3C) Technical Architecture Group, and formerly one of the foremost advocates of both XML and Linked Data, declared this to be the year of CSV (Tennison 2014): Comma Separated Values, or simple flat tabular data.
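
To make the report analogy concrete, here is a small invented example of the tree structure JSON encourages, with nesting doing the work of the section numbers:

    import json

    # Nesting plays the role of numbered sections in a report.
    report = {
        "title": "Annual procurement review",         # section 1
        "sections": [
            {"heading": "Spending by ministry",        # section 1.1
             "subsections": [
                 {"heading": "Health"},                # section 1.1.1
                 {"heading": "Education"},             # section 1.1.2
             ]},
        ],
    }

    print(json.dumps(report, indent=2))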

CSV wins out as a ‘standard’ because it is so simple. Just about any system can parse CSV. It is a lowest common denominator for data exchange, increasingly required because with open data it is not just government engineers exchanging data with other government engineers who share the same background and training, but governments exchanging data with all sorts of firms, citizens and less technically skilled users. Flat data is also computationally cheaper to process: not least in large quantities, where datasets with millions of rows can be split apart and processed using map-reduce functions that send off bits of the data to different computers to process rapidly in parallel, re-assembling the results when done. Paradoxically, the opening of data potentially drives us towards (over?)-simplified expressions of messy social realities, including more people in the community of potential data users, whilst struggling to accommodate the more complex, networked data structures we might need to more inclusively describe our world.
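
As a rough sketch of why flat data is computationally convenient: because each row stands alone, a large CSV can be split into chunks, each chunk processed separately (here a local process pool stands in for a cluster of machines), and the partial results re-assembled at the end. The column names are invented for illustration.

    import csv
    import io
    from multiprocessing import Pool

    # A toy CSV: in practice this might be millions of rows.
    CSV_TEXT = "contract_id,amount\nC1,1000\nC2,250\nC3,4000\n"

    def chunk_total(rows):
        """The 'map' step: total up one chunk of rows independently."""
        return sum(float(row["amount"]) for row in rows)

    if __name__ == "__main__":
        rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
        chunks = [rows[i::2] for i in range(2)]       # naive split into two chunks
        with Pool(2) as pool:
            partials = pool.map(chunk_total, chunks)  # process chunks in parallel
        print(sum(partials))                          # the 'reduce' step: 5250.0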

Standardisation around flat data also has some more basic inclusion risks. The currently proposed best practices for CSV on the Web are based on defining the meaning of columns by column position. Column A, for example, will always be a ‘contract ID’, column B the amount it is for, and so on. This is much more brittle than the extensible tree structures of JSON, where a user can drop in new fields and data if they require, and much less able to accommodate multilingualism than formats such as XML. The choices made in the next year over CSV on the Web standardisation potentially have big consequences for how knowledge and data come to be represented on the Web. Most often, choices of data serialisations and standards are made as pragmatic ones: seeking paths of least resistance given what is known about developer and user skills. Yet these choices can embed important knowledge politics, and there is a need to explore further how far the architects of open data on the Web are consciously seeking to combine that politics with pragmatism.
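
A small, invented example of the brittleness concern: when meaning is fixed by column position, inserting a new column silently shifts the meaning of every column after it, whereas a record keyed by field names is unaffected by fields it does not expect.

    import csv
    import io
    import json

    original = "C1,1000\n"      # position 0 = contract ID, position 1 = amount
    extended = "C1,GBP,1000\n"  # a currency column inserted in the middle

    for text in (original, extended):
        row = next(csv.reader(io.StringIO(text)))
        print("amount read by position:", row[1])     # wrong once the columns shift

    record = json.loads('{"id": "C1", "currency": "GBP", "amount": 1000}')
    print("amount read by name:", record["amount"])   # unaffected by the extra field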

Shaping systems and infrastructures: how open data creates opportunities and challenges

If a standard is successful, then systems start adapting to fit with it. This can be seen in another global open data project: the International Aid Transparency Initiative (IATI). IATI provides a standard for talking about aid projects, and now, after five years of development, governments are starting to adapt their internal systems to be based around the standard: shifting their patterns of data collection in order to better generate IATI data, and in the process potentially scrapping fields they used to collect. The standard may even come to trump local business requirements for data, as the costs of using it fall, and the costs of customisation are comparatively high.

The point here is that the invisible infrastructure of data standards starts to shape the rules of the game, determining the data that states collect, what can be easily known, and what kinds of policies can be made and measured. This impact of standardisation is, of course, not new at all (Bowker and Star’s work (Bowker 2000) explores the 100-year-old International Classification of Diseases and its impacts on health policy making across the world), but open data standardisation brings a number of new elements to the picture:

  • Increased intergovernmental cooperation at the technical level;
  • Increased interaction with boundary partners in the generation of standards – often private sector partners;
  • Increased embeddedness of standards within the wider emerging Web of data.

Open data creates space for a different dynamic to standardisation, and opportunities for more inclusive approaches to standard setting. Yet it also risks leaving us with new standards created through formally open and pseudo-meritocratic, but in practice relatively closed, processes, which do not incorporate a critical politics of inclusion, and which generate lowest-common-denominator standards that restrict rather than enhance our ability to see the state, and to imagine and pursue a wide range of future social policy choices.

To return to Michael Gurstein’s critique of open data: he has suggested we need to talk not just about ‘open data’, but about ‘open and inclusive data’. The same goes for data standards, where we need to find methods for building not just open standards, but open and inclusive standards for our networked futures.

References

Bowker, Geoffrey C. 2000. “Biodiversity Datadiversity.” Social Studies of Science: 643-683. http://epl.scu.edu/~gbowker/biodivdatadiv.pdf.

Davies, Tim, and Mark Frank. 2013. “‘There’s No Such Thing as Raw Data’: Exploring the Socio-Technical Life of a Government Dataset.” In Proceedings of Web Science 2013. Paris, France.

Gurstein, Michael. 2011. “Open Data: Empowering the Empowered or Effective Data Use for Everyone?” First Monday 16 (2). http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3316/2764.

Lessig, Lawrence. 2006. Code: Version 2.0. Basic Books. http://codev2.cc/.

Tennison, Jeni. 2014. “2014: The Year of CSV.” Open Data Institute. http://theodi.org/blog/2014-the-year-of-csv.
