The limited promise of (open data) research automation
[Summary: adding to the network of discussion around open data measurement]
Earlier this week the 2nd Edition of the Open Data Barometer launched: a project I’ve worked on over the last two years, helping design the original framework of indicators, setting up the data collection tools, and writing up the report (with a fun learning curve this year to try and make it an interactive web-first report as much as possible). Putting together a study like the Barometer is a long and, at times, torturous process. So I was really interested to see that this week the Open Data Institute also released a report on “Benchmarking Open Data Automatically”. Could there be ways around the costs and time commitments of survey-based techniques to understand the open data landscape?
Unfortunately, at least so far as the report explores, it seems not.
Below I offer a critical reading and reflection, addressing both some substantive issues for consideration in taking forward efforts around automated open data measurement, and raising some general issues about the problems of approaching research with an automation lens.
Credit to the authors: they do not at any point overstate the claim for automation, and provide a fair overview of its potential. However, the methodology the report uses to identify suggested indicators and datasets (§4) is unstated and unclear, and, given that it is framed as an output of the developing-country-focussed Partnership for Open Data (POD), critical consideration of the global applicability of its suggested models appears absent.
Context and environment: qualitative concerns
The report looks at the potential for automated assessment in four areas, building on the Common Assessment Methods framework for open data which covers context/environment, data supply, use & impact.
The suggested automation approaches for Context/Environment variables either involve the secondary use of global indicator data, or propose the use of text-mining approaches on legislative or communicative texts from government. Both of these are problematic.
- Firstly, the report does not appear to consider that the secondary data it wants to draw upon is far from automated itself, but instead is the product of in-depth qualitative research processes, each with their own dynamics and biases. By framing use of this data as a strategy of automation, rather than a strategy of secondary data use, it glosses over the important critical attention that needs to be given when drawing on indicators from elsewhere, and over questions of whether these proposed indicators are adequate to capture the constructs being sought. In the Open Data Barometer model, primary survey data is paired with secondary indicator data (a) to address the gaps in each; and (b) to ensure each component of the ‘Readiness’ Barometer sub-index is based on a plurality of data sources, to limit the potential effects of researcher bias.
- Secondly, the idea that it will be possible to effectively infer meaningful information about open data policies from automated reading of legal texts is highly questionable. Not only are legal texts often inaccessible online, but the range of languages and legal cultures any analysis would have to deal with raises the question of whether it would not be more efficient in any case to work with bilingual scholars than with computational techniques. The report fails to cite any examples of studies that have proven the value of such automated text-mining approaches in an international comparative context. Furthermore, automated techniques bring a big risk of imposing normative assumptions (particularly, in this case, Western, Anglo-Saxon legal tradition assumptions) about law, policy making, and what constitutes effective policy. When I carried out my initial policy analysis of open data in six countries through a qualitative reading of policies, it became clear that it was important to look at the function, not just the form of words in policy documents – and to understand the different ways policy operates in different countries. Should one country be assessed as having a better environment for achieving the promised impacts of open data because it uses the words ‘open data’ in laws, than one which talks about ‘free and digitally available public sector information’? I would be extremely cautious about such possibilities.
One of the weaknesses of the Open Data Barometer as it currently stands is that it assesses only at the national level, whereas in many federal states particular policies or datasets that it is looking for from a national government are devolved to states and sub-national jurisdictions. Working with the complexity of the real world generally requires greater resource for critical analysis and research, not more automation.
Data and use: the meta-data dilemma
The report also disappoints when it comes to automated assessment of both data publication and dataset re-use. It concludes that whilst “automated metrics exist for metadata that stems from data portals such as CKAN, Socrata, OpenDataSoft or DataPress” such approaches are “however, subject to the existence of high-quality metadata in a consistent, standardised and complete format.”
What it doesn’t flesh out is that this makes it particularly challenging to get at research constructs that can represent the range of data published, or the broad usability of data, as opposed to constructs that in fact primarily pick out the quality of meta-data. It also perpetuates the idea that open data involves cataloguing data in data portals, rather than making data broadly accessible, published so that citizens can access it alongside government information, and embedding a culture of data publication across government, rather than treating open data as some centralised responsibility.
Such meta-data-centric metrics are also inherently open to gaming. If a metric, for example, is based on the percentage of datasets, discovered through a meta-data catalogue, that conform to some automated tests for quality (e.g. the UK Government’s ‘openness score’), then the easiest way to up your score is to drop from the catalogue datasets that don’t meet the quality bar. Sensitivity to gaming is vital to any analysis of metrics: particularly those seeking to create behaviour change. If it’s possible to change the metric scores by changing behaviours other than those the metric aims to target, then there need to be feedback loops to control for this. These often exist just as part of the processes of analysis in qualitative research, but need to be explicitly planned for in any automated methods.
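To make the gaming problem concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `stars` rating, the quality bar of 3, the dataset names and the `flag_possible_gaming` check are illustrative assumptions, not the UK Government's actual scoring method or anything proposed in the report.

```python
def openness_score(datasets, quality_bar=3):
    """Share of catalogued datasets whose (hypothetical) star rating meets the bar."""
    if not datasets:
        return 0.0
    passing = sum(1 for d in datasets if d["stars"] >= quality_bar)
    return passing / len(datasets)


def flag_possible_gaming(previous, current, quality_bar=3):
    """Crude feedback check: the score went up while the catalogue shrank."""
    score_rose = openness_score(current, quality_bar) > openness_score(previous, quality_bar)
    return score_rose and len(current) < len(previous)


# An invented catalogue: two datasets meet the bar, two do not.
catalogue = [
    {"name": "budget", "stars": 4},
    {"name": "bus-timetables", "stars": 3},
    {"name": "crime-reports-pdf", "stars": 1},
    {"name": "land-registry-scans", "stars": 0},
]

before = openness_score(catalogue)

# "Improve" the score without improving any data: delist whatever fails the bar.
gamed_catalogue = [d for d in catalogue if d["stars"] >= 3]
after = openness_score(gamed_catalogue)

print(before, after, flag_possible_gaming(catalogue, gamed_catalogue))
```

The score doubles even though no dataset got any better and citizens now have access to less. A check like the one above only catches the crudest case; a human analyst would still need to ask why datasets disappeared.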
When it comes to data use assessment, I suspect that much more promising (albeit resource-intensive) are the bottom-up methods for assessing the value of open datasets for particular use cases, being explored by some of my Web Science DTC colleagues at the University of Southampton.
Impact: researcher support
The report notes that “measuring impact with an automated approach is inherently difficult”. The Open Data Barometer uses, as a proxy for impact, the peer-reviewed ‘perceptions of impact’ amongst informed researchers, based on being able to cite media or academic literature that connects open data and specific impacts. Whilst the expert judgement can’t be automated, it would be useful to consider where automation could support the researcher in locating relevant sources.
For example, right now researchers have to go out and track down news stories, and there are some inter-coder reliability issues that are likely to arise as a result of different search strategies and approaches. But the same technologies proposed for text-analysis of legal documents, rather than producing an output metric, could produce a corpus of sources for human analysis – helping dig deeper into sources of the stories of impact. A similar approach may be relevant to researcher assessments of datasets and data uses as well: using online traces to support researchers to locate and profile dataset re-use, but recognising the need for assessments that are sensitive to different contexts and cultures (e.g. for some datasets, searching on GitHub can turn up all sorts of open source projects that are re-using that data, but this only works in countries and cultural communities where GitHub is the platform of choice for source code sharing).
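As a rough illustration of automation that supports rather than replaces the researcher, the sketch below filters a set of articles into a corpus for human review and records why each one was selected, instead of emitting a score. The `Article` type, the term lists and the sample articles are all invented for illustration; in practice the terms would need to be chosen per language and context by someone who knows how open data is talked about locally.

```python
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    text: str
    source: str


def build_review_corpus(articles, dataset_terms, impact_terms):
    """Select articles mentioning both a data term and an impact term.

    Returns candidates for a human researcher to read, with the matched
    terms attached so the selection can be audited. It makes no judgement
    about whether an impact actually occurred.
    """
    corpus = []
    for article in articles:
        body = f"{article.title} {article.text}".lower()
        matched_data = [t for t in dataset_terms if t.lower() in body]
        matched_impact = [t for t in impact_terms if t.lower() in body]
        if matched_data and matched_impact:
            corpus.append({
                "article": article,
                "dataset_terms": matched_data,
                "impact_terms": matched_impact,
            })
    return corpus


# Invented sample data.
articles = [
    Article("Open data helps cut bus delays",
            "Developers used open data on timetables to reduce waiting times.",
            "local paper"),
    Article("Minister opens new bridge",
            "The bridge was opened on Monday.",
            "national paper"),
    Article("Budget figures published",
            "Public sector information on spending is now freely available; "
            "campaigners say it exposed waste.",
            "blog"),
]

# Include locally used phrasings, not only the words 'open data'.
dataset_terms = ["open data", "public sector information"]
impact_terms = ["reduce", "exposed", "saved"]

corpus = build_review_corpus(articles, dataset_terms, impact_terms)
for item in corpus:
    print(item["article"].title, item["dataset_terms"], item["impact_terms"])
```

Sharing one such candidate list between coders might also reduce the variation that comes from each researcher running their own searches, though the term lists themselves then become a source of bias that needs critical attention.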
An automatic no?
Should we reject all automated methods in open data research? Clearly not: projects like the Open Data Monitor, working in the European context of a single market and substantial cross-government standardisation already, and concepts like Ulrich Atz’s Tau of Open Data, have the potential to offer very useful insights as part of a broader research framework. And the kinds of secondary data considered in the automation report can be useful heuristics in understanding the open data landscape, providing they are read critically.
However, rather than automated measurement, tools that help researchers locate sources, collate data, and make judgements may be a much more appropriate place to focus attention. Big data methods rooted in working with volume and variety of sources, rather than meta-data driven methods that require standardisation prior to research, are likely to be much more relevant to the diverse world one encounters when exploring open data in a development context.