A while ago I asked the question: “What is a Dataset?”. The idea was to look at how different data communities were using the term to see if there were any common themes. This week I’ve been considering how UPRNs can be a part of open data, a question made more difficult due to complex licensing issues.
One aspect of the discussion is the idea of “derived data”. Anyone who has worked with open data in the UK will have come across this term in relation to licensing of Ordnance Survey and Royal Mail data. But, as we’ll see shortly, the term is actually in wider use. I’ve realised though that, like “dataset”, this is another term which hasn’t been well defined. So I thought I’d explore what definitions are available and whether we can bring any clarity.
I think there are several reasons why having a clearer definition and understanding of what constitutes “derived data” is important:
- When using data published under different licenses it’s important to understand what the implications are of reusing and mixing together datasets. While open data licenses create few issues, mixing together open and shared data can create additional complexities due to non-open licensing terms. For further reading, see: “IPR and licensing issues in Derived Data” (Korn et al, 2007) and “Data as IP and Data License Agreements” (Practical Law, 2013).
- Understanding how data is derived is useful in understanding the provenance of a dataset and ensuring that sources are correctly attributed.
- In the EU, at least, there are many open questions relating to the creation of services that use multiple data sources. As a community we should be trying to answer these questions to identify best practices, even if ultimately they might only be resolved through a legal process.
On that basis: what is derived data?
Definitions of derived data from the statistics community
The OECD Glossary of Statistical Terms defines “derived data element” as:
A derived data element is a data element derived from other data elements using a mathematical, logical, or other type of transformation, e.g. arithmetic formula, composition, aggregation.
This same definition is used in the data.gov.uk glossary, which has some comments.
The OECD definition of “derived statistics” also provides some examples of derivation, e.g. creating population-per-square-mile statistics from primary observations (e.g. population counts, geographical areas).
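To make the OECD example concrete, here is a minimal Python sketch of deriving a population-per-square-mile statistic from primary observations. The region names and figures are made up purely for illustration, and `population_density` is a hypothetical helper rather than anything defined by the OECD.

```python
# Hypothetical primary observations: population counts and areas.
# The regions and numbers are invented for illustration only.
observations = [
    {"region": "A", "population": 120_000, "area_sq_miles": 40.0},
    {"region": "B", "population": 45_000, "area_sq_miles": 90.0},
]


def population_density(rows):
    """Derive population-per-square-mile from the primary observations."""
    return {
        row["region"]: row["population"] / row["area_sq_miles"]
        for row in rows
    }


# The derived data element: one value per region, computed by an
# arithmetic formula over two primary data elements.
density = population_density(observations)
print(density)
```

The derived values exist in neither source column on their own; they only appear once the transformation is applied, which is the essence of the definition above.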
Staying in the statistical domain, this britannica.com article on censuses explains that (emphasis added):
there are two broad types of resulting data: direct data, the answers to specific questions on the schedule; and derived data, the facts discovered by classifying and interrelating the answers to various questions. Direct information, in turn, is of two sorts: items such as name, address, and the like, used primarily to guide the enumeration process itself; and items such as birthplace, marital status, and occupation, used directly for the compilation of census tables. From the second class of direct data, derived information is obtained, such as total population, rural-urban distribution, and family composition
I think this clearly indicates the basic idea that derived data is obtained when you apply a process or transformation to one or more source datasets.
What this basic definition doesn’t address is whether there are any important differences between categories of data processing, e.g. does validating some data against a dataset yield derived data, or does the process have to be more transformative? We’ll come back to this later.
Legal definitions of derived data
The Open Database Licence (ODbL), which is now used by OpenStreetMap, defines a “Derivative Database” as:
…a database based upon the Database, and includes any translation, adaptation, arrangement, modification, or any other alteration of the Database or of a Substantial part of the Contents. This includes, but is not limited to, Extracting or Re-utilising the whole or a Substantial part of the Contents in a new Database.
This itemises some additional types of process: extracting portions of a dataset also creates a derivative, not just transformation or statistical calculation.
However, as noted in the legal summary for the Creative Commons No Derivatives licence, simply changing the format of a work doesn’t create a derivative. So, in their opinion at least, this type of transformation doesn’t yield a derived work. In the full legal code they don’t use the term “derived data”, largely because the licences can be applied to a wide range of different types of works. Instead they define “Adapted Material”:
…material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor.
The Ordnance Survey User Derived Dataset Contract (copy provided by Owen Boswarva), which allows others to create products using OS data, defines “User Derived Datasets” as:
datasets which you have created or obtained containing in part only or utilising in whole or in part Licensed Data in their creation together with additional information not obtained from any Licensed Data which is a fundamental component of the purpose of your Product and/or Service.
The definition stresses that the datasets consist of some geographical data, e.g. points or polygons, plus some additional data elements.
The Ordnance Survey derived data exemptions documentation has this to say about derived data:
data and or hard copy information created by you using (to a greater or lesser degree) data products supplied and licensed by OS, see our Intellectual Property (IP) policy.
For the avoidance of doubt, if you make a direct copy of a product supplied by OS – that copy is not derived data.
Their licensing jargon page just defines the term as:
…any data that you create using Ordnance Survey mapping data as a source
Unfortunately none of these definitions really provides any useful detail, which is no doubt part of the problem that everyone has with understanding OS policy and licensing terms. As my recent post highlights, the OS do have some pretty clear ideas of when and how derived data is created.
The practice note on “Data as IP and Data License Agreements” published by Practical Law provides a great summary of a range of IP issues relating to data and includes a discussion of derived data. Interestingly they highlight that it may be useful to consider not just data generated by processing a dataset but other data that may be generated through the interactions of a data publisher (or service provider) and a data consumer. (See “Original versus Derived Data”, page 7).
This leads them to define the following cases for when derived data might be generated:
- Processing the licensed data to create new data that is either:
  - sufficiently different from the original data that the original data cannot be identified from analysis, processing or reverse engineering the derived data; or
  - “a modification, enhancement, translation or other derivation of the original data but from which the original data may be traced.”
- Monitoring the licensee’s use of a provider’s service (commonly referred to as usage data).
From a general intellectual property stance I can see why usage data should be included here, but I would suggest that this category of derived data is quite different to what is understood by the (open) data community.
What I find helpful about this summary is that it starts to bring some clarity around the different types of processes that yield derived data.
The best existing approach to this that I’ve seen can be found in: “Discussion draft: IPR, liability and other issues in regard to Derived Data”. The document aims to clarify, or at least start a discussion around, what is considered to be derived data in the geographical and spatial data domain. They identify a number of different examples, including:
- Transforming the spatial projection of a dataset, e.g. to/from Mercator
- Aggregating data about a region to summarise to an administrative area
- Layering together different datasets
- Inferring new geographical entities from existing features, e.g. road centre lines derived from road edges
In my opinion these types of illustrative examples are a much better way of trying to identify when and how derived data is created. For most re-users it’s easier to relate to an example than to legal definitions.
Another nice example is the OpenStreetMap guidance on what they consider to be “trivial transformations” which don’t trigger the creation of derived works.
An expanded definition of derived data
With the above in mind, can we create a better definition of derived data by focusing on the types of processes and transformations that are carried out?
Firstly I’d suggest that the following types of process do not create derived data:
- Using a dataset – stating the obvious really, but simply using a dataset doesn’t trigger creating a derivative. OpenStreetMap calls this a “Produced Work”.
- Copying – again, I think this should be well understood, but I mention it for completeness. This is distribution, not derivation.
- Changing the format – E.g. converting a JSON file to XML. The information content remains the same, only the format is changed. This is supported by the Creative Commons definitions of remixing/reuse.
- Packaging (or repackaging) – E.g. taking a CSV file and re-publishing it as a data package. This would also include taking several CSV files from different publishers and creating a single data package from them. I believe this is best understood as a “Collected Work” or “Compilation” as the original datasets remain intact.
- Validation – checking whether field(s) in dataset A are correct according to field(s) in dataset B, so long as dataset A is not corrected as a result. This is a stance that OpenStreetMap seems to agree with.
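Two of the processes above, changing the format and validation, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the records, field names and placeholder postcodes are invented, and `json_to_xml`, `xml_to_records` and `validate` are hypothetical helpers. The point is that the information content survives the format change intact, and that validation only reports discrepancies without touching dataset A.

```python
import json
import xml.etree.ElementTree as ET

# Invented records with placeholder postcodes, for illustration only.
records_json = (
    '[{"id": "1", "postcode": "BA1 1AA"},'
    ' {"id": "2", "postcode": "BA2 2BB"}]'
)


def json_to_xml(text):
    """Change the format only: each JSON field becomes an XML element."""
    root = ET.Element("records")
    for record in json.loads(text):
        item = ET.SubElement(root, "record")
        for key, value in record.items():
            ET.SubElement(item, key).text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_records(text):
    """Read the XML back into a list of dicts to compare content."""
    return [
        {field.tag: field.text for field in item}
        for item in ET.fromstring(text)
    ]


def validate(dataset_a, dataset_b, field):
    """Return ids in A whose field disagrees with B. A is never modified."""
    reference = {row["id"]: row[field] for row in dataset_b}
    return [
        row["id"] for row in dataset_a
        if reference.get(row["id"]) != row[field]
    ]


dataset_a = json.loads(records_json)
dataset_b = [
    {"id": "1", "postcode": "BA1 1AA"},
    {"id": "2", "postcode": "BA2 9ZZ"},
]

round_tripped = xml_to_records(json_to_xml(records_json))
mismatches = validate(dataset_a, dataset_b, "postcode")
print(round_tripped == dataset_a, mismatches)
```

If `validate` went on to overwrite the mismatching values in dataset A, it would become the "Correcting" process, which I'd class as creating derived data.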
This leaves us with a number of other processes which do create derived data:
- Extracting – extracting portions of a dataset, e.g. extracting some fields from a CSV file.
- Restructuring – changing the schema or internal layout of a database, e.g. parsing out existing data to create new fields such as breaking down an address into its constituent parts
- Annotation – enhancing an existing dataset to include new fields, e.g. adding UPRNs to a dataset that contains addresses
- Summarising or Analysing – e.g. creating statistical summaries of fields in a dataset, such as the population statistics examples given by the OECD. Whether the original dataset can be reconstructed from the derived dataset will depend on the type of analysis being carried out, and how much of the original dataset is also included in the derived dataset.
- Correcting – validating dataset A against dataset B, and then correcting dataset A with data from dataset B where there are discrepancies
- Inferencing – applying reasoning, heuristics, etc to generate entirely new data based on one or more datasets as input.
- Model Generation – I couldn’t think of a better name for this, but I’m thinking of scenarios such as sharing a neural network that has used some datasets as a training set. I think this is different to inferencing.
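To illustrate a few of these categories, here is a rough Python sketch of extracting, annotating, correcting and summarising a tiny address dataset. Everything in it is an assumption made for the example: the addresses, the placeholder postcodes, the made-up identifiers standing in for UPRNs, and the helper function names. It isn't meant to reflect how any real address product works.

```python
from collections import Counter

# Invented source dataset; postcodes are placeholders.
addresses = [
    {"id": 1, "address": "1 High Street", "town": "Bath", "postcode": "BA1 1AA"},
    {"id": 2, "address": "2 Low Road", "town": "Bath", "postcode": "BA2 2BB"},
    {"id": 3, "address": "3 Mill Lane", "town": "Bristol", "postcode": "BS1 0XX"},
]

# Hypothetical reference dataset. The "uprn" values are made-up
# placeholders, not real identifiers.
reference = {
    "1 High Street": {"uprn": "example-1", "postcode": "BA1 1AA"},
    "2 Low Road": {"uprn": "example-2", "postcode": "BA2 2BB"},
    "3 Mill Lane": {"uprn": "example-3", "postcode": "BS1 9ZZ"},
}


def extract(rows, fields):
    """Extracting: keep only some fields from each row."""
    return [{field: row[field] for field in fields} for row in rows]


def annotate(rows, lookup):
    """Annotation: add a new field taken from another dataset."""
    return [
        {**row, "uprn": lookup.get(row["address"], {}).get("uprn")}
        for row in rows
    ]


def correct(rows, lookup):
    """Correcting: replace values in A with values from B on mismatch."""
    corrected = []
    for row in rows:
        match = lookup.get(row["address"])
        if match and match["postcode"] != row["postcode"]:
            row = {**row, "postcode": match["postcode"]}
        corrected.append(row)
    return corrected


def summarise(rows):
    """Summarising: count addresses per town."""
    return dict(Counter(row["town"] for row in rows))


extracted = extract(addresses, ["id", "town"])
annotated = annotate(addresses, reference)
corrected = correct(addresses, reference)
summary = summarise(addresses)
print(extracted, annotated, corrected, summary, sep="\n")
```

In each case the output contains, or depends on, content from a source dataset, which is why I'd treat all four as derivation. Note that the summary alone can't be used to reconstruct the original addresses, whereas the annotated and corrected datasets include them wholesale.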
What do you think of this? Does it capture the main categories of deriving data? If you have comments on this then please let me know by leaving a comment here or pinging me on twitter.