From ‘metadata mountain’ to molehill
If you want to know where Open Data is headed, look to science. Not only is there a big drive for data sharing, but it’s the envelope-pushing, super-technical kind rather than the kind that rests on the laurels of spreadsheet portals. The future is bright and it’s semantic, linked and interoperable – even if some bright folk have concerns about where that might lead.
I for one welcome our new Semantic Web overlords, but there is a massive hurdle in the way of this progress, something I’ll call the ‘metadata mountain’.
In the science community this mountain or bottleneck (depending on your preferred metaphor) happens when the authors of a paper that summarises several years of work are suddenly asked to find all the metadata from all their lab books, spreadsheets and Post-It Notes and put them in a super-interoperable, machine-readable format. The problem is exacerbated by how many groups and experts collaborate on science these days and how often they move labs (or lose their USB drives).
Official Statistics producers are very much like scientific researchers. We have many collaborative groups of people with their own complex areas of expertise that all feed into the pipeline that culminates in the publication of data and analyses.
Open Data is the new kid on the block and it’s trying to muscle in on well-established, effective data production pipelines, often as a publishing exercise sitting at the very end of them.
Cut me and it’s Open Data all the way through
A retrospective attempt at piecing together semantically viable metadata is an insurmountable task. Trying to bring together a comprehensive Open Data story for such a large body of work at the end of the production line is like trying to squeeze the words inside a stick of rock after you’ve rolled it.
The only way for any large process to reach the summit of Open Data is to have every contributor store, tag and share the deep metadata that’s relevant to their manipulation of the data. This applies to scientists in the field, in the lab and when writing the stats software, and it applies to the rainbow of official statistics teams too.
For example, survey questions should be linked to the statistics they’ll produce (e.g. via a Linked Data URI for ‘date of birth’), and statistical bulletins should document the statistical methods and the software used to analyse the data. Check out STATO, the Linked Data resource for statistical methods, and the EDAM ontology, a budding Linked Data resource for software.
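To make that a bit more concrete, here is a rough sketch of what recording metadata at each stage might look like. Everything in it is an assumption for illustration: the `StepMetadata` and `DatasetRecord` classes are made up, and the URIs are example.org placeholders rather than real STATO or EDAM identifiers. The point is only that each team adds its own small record as it goes, so nobody has to reconstruct the whole story at publication time.

```python
import json
from dataclasses import dataclass, field, asdict

# Placeholder URIs for illustration only. Real identifiers would come from
# published vocabularies (e.g. STATO for methods, EDAM for software).
DATE_OF_BIRTH = "http://example.org/concept/date-of-birth"
MEAN_METHOD = "http://example.org/method/arithmetic-mean"
STATS_SOFTWARE = "http://example.org/software/stats-package"


@dataclass
class StepMetadata:
    """Metadata recorded by one contributor at one stage of the pipeline."""
    stage: str
    contributor: str
    concept_uris: list = field(default_factory=list)
    method_uris: list = field(default_factory=list)
    software_uris: list = field(default_factory=list)


@dataclass
class DatasetRecord:
    """Accumulates step metadata as the data moves through the pipeline."""
    title: str
    steps: list = field(default_factory=list)

    def add_step(self, step):
        self.steps.append(step)

    def all_uris(self):
        # Every distinct URI referenced anywhere in the pipeline, sorted.
        uris = set()
        for step in self.steps:
            uris.update(step.concept_uris, step.method_uris, step.software_uris)
        return sorted(uris)

    def to_json(self):
        # asdict() recurses into the nested StepMetadata records.
        return json.dumps(asdict(self), indent=2)


record = DatasetRecord(title="Hypothetical age statistics bulletin")

# The survey team tags the question with the concept it measures.
record.add_step(StepMetadata(
    stage="survey design",
    contributor="survey team",
    concept_uris=[DATE_OF_BIRTH],
))

# The analysts later add the method and software they used.
record.add_step(StepMetadata(
    stage="analysis",
    contributor="methods team",
    concept_uris=[DATE_OF_BIRTH],
    method_uris=[MEAN_METHOD],
    software_uris=[STATS_SOFTWARE],
))

print(record.to_json())
```

A real implementation would more likely use an established format such as JSON-LD or RDF rather than ad hoc classes, but the shape of the idea is the same: small, linked annotations added where the knowledge lives.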
This might seem like a major burden of work but we expect it of scientists because we want to raise the reproducibility and robustness of their work so as to increase secondary value (re-use) and to improve trust in their findings. Those goals are very relevant to official statistics producers and eventually there will be calls for bringing these benchmarks into official data and methods.
More than that, this is how Open Data principles will most likely bring efficiency improvements within public institutions, as our internal procedures become transparent to ourselves and to the wider community.
By spreading these tasks throughout the pipeline we reduce the size of the mountain at any one point.
What will it take? Training and awareness-raising across all parts of our organisations is a must. But what about department-wide Open Data officers, a bit like our Data Protection officers, or even team-specific Open Data experts, like our First Aiders or Fire Wardens? Openness is a lens for viewing how we do our business, just like health and safety or risk management.
Please comment below (or over on Twitter @bobbledavidson) with your ideas about where you think Open Data should fit within an organisation, what that would take and where it would lead.