Data is the new buzzword used across disciplines, and data science has already been coined one of the key skills of the 21st century. Never before has mankind produced more data than today, in business as well as in the scientific realm. In order to derive meaning and knowledge from this data, we need to process it and feed it into algorithms and scientific experiments. With more and more data available, how can we find what we are looking for? How can we reproduce results? How can we verify experiments? This little page aims to shed some light on current research in the area of dynamic data citation.
Citation is a fundamental principle of scientific work. Research is a collaborative endeavour which does not succeed in isolation; all the knowledge we have today is based on previous work. For this reason scientists need to indicate which parts of their work are based on principles and discoveries from peer researchers. The principle of citation is very well established in the scientific communities and forms the basis of the peer review process. As more and more disciplines become "digitised", more and more scientific results are based on data and its processing. With the increasing complexity of scientific experiments, the value of the data increases as well. More sophisticated experiments are more expensive, and some experiments cannot be executed a second time, making their results and the produced data sets unique. It is therefore a logical consequence to preserve the data and share it with other scientists, who might reuse the data sets in completely new contexts. Data citation provides tools and mechanisms which allow identifying data sets in a unique and persistent way.
Currently many data sets are stored on Web servers and referenced by standard URLs which point to a file system path. As server or file system locations can change, URLs are not suitable for reliable long-term referencing. Thus the concept of persistent identifiers has been developed. A persistent identifier links a unique URI to a landing page providing metadata about the data set, or to the data itself. The relationship between the URI and the actual location of the data can be updated, hence the system is resilient to location changes. The identifier can be resolved and thus always points users to the appropriate location, and the persistence of the link is preserved.
Reproducible research is based upon the accessibility of data. In addition to the experimental setup, descriptions of the methods and knowledge about the execution environment, the data needs to be available as well. Only if the very same data can be used to rerun a scientific experiment can the correctness of the results be verified. Data citation supports scientists by providing identifiers which can be used for referencing data sets.
Yet it is not sufficient to provide file dumps of potentially huge data sets, as researchers often work with highly specific subsets. Many experiments require a particular view of the data, so the knowledge of how to create such subsets needs to be preserved as well. Storing each and every revision of all data sets does not scale, hence a more flexible approach is needed.
Research data is also highly dynamic, as the execution of an experiment may require the adjustment of parameters in order to improve the results. As a consequence, the data produced in an experiment is constantly evolving: new records are created, existing ones are updated and erroneous ones may get deleted. Researchers require the possibility of referencing any previous version or iteration of a specific subset.
Dynamic data citation provides tools and methods which allow referencing a specific state of a subset, which was derived from an evolving data source.
In order to tackle the requirements mentioned above, several core principles need to be applied. First of all, our approach for dynamic data is based upon versioned and timestamped data. We have noticed that the term versioned is a source of confusion: we explicitly do not require each and every update of the data source to be persisted in terms of snapshots. In fact, this is what we aim to avoid. By versioning we mean that every change to the data source is recorded and no data is ever deleted. Furthermore, our approach is based on some form of query language which researchers can use to create data sets. Our approach is query centric and utilises the same query mechanism as the researcher originally did: we persist the queries in a query store and re-execute them on demand, considering only the data appropriate to the time frame of the original execution. In terms of persistent identification, we are agnostic about which system is to be used.
The Research Data Alliance (RDA) is a large community of data scientists which provides a structure in the form of interest groups. Each interest group comprises several working groups, which focus on specific problems. The RDA Working Group on Data Citation (WG-DC) aims to bring together a group of experts to discuss the issues, requirements, advantages and shortcomings of existing approaches for efficiently citing subsets of data. The WG-DC focuses on a narrow field where we can contribute significantly and provide prototypes and reference implementations for dynamic data citation. The WG-DC has developed a set of recommendations for how research data can be made citable. The document describing these recommendations is still evolving and is available for inspection and feedback at the RDA group page.
The recommendations have been structured into three sections, in order to align with the phases of dynamic data citation. They are designed to be agnostic about the technology stack used: they are generic suggestions for how evolving data can be made citable. The first phase covers preparing the data storage and the query store. The second phase describes how data sets and subsets can be persistently identified. The third phase provides guidelines for handling change in the technological infrastructure.
In order to retrieve a specific record the way it was at a specified point in time, we need to keep previous versions available. This does not necessarily mean that we have to keep all revisions of a record.
For understanding when a change was introduced into a data set, we have to annotate each version with a timestamp. The granularity of the timestamp depends on the frequency of updates of the data source.
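The idea of versioned, timestamped records can be sketched in a few lines. The following is a minimal illustration, not the WG-DC's prescribed schema: the table layout and the column names (`value`, `valid_from`, `valid_to`) are assumptions chosen for clarity. Each change closes the validity interval of the current version and inserts a new one, so no data is ever deleted.

```python
import sqlite3

# Illustrative sketch of a versioned, timestamped record store.
# Schema and column names are assumptions, not a prescribed design.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE records (
        key        TEXT NOT NULL,
        value      TEXT NOT NULL,
        valid_from TEXT NOT NULL,  -- timestamp when this version became current
        valid_to   TEXT            -- NULL while the version is still current
    )""")

def upsert(key, value, ts):
    """Record a change: close the current version, then add the new one."""
    conn.execute(
        "UPDATE records SET valid_to = ? WHERE key = ? AND valid_to IS NULL",
        (ts, key))
    conn.execute(
        "INSERT INTO records (key, value, valid_from) VALUES (?, ?, ?)",
        (key, value, ts))

def as_of(key, ts):
    """Retrieve the value of a record as it was at a given point in time."""
    row = conn.execute(
        "SELECT value FROM records WHERE key = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?)",
        (key, ts, ts)).fetchone()
    return row[0] if row else None

upsert("r1", "original", "2015-01-01T00:00:00")
upsert("r1", "corrected", "2015-06-01T00:00:00")
print(as_of("r1", "2015-03-01T00:00:00"))  # original
print(as_of("r1", "2015-07-01T00:00:00"))  # corrected
```

Note that only the changed record is stored twice; unchanged records exist once, which is why this approach scales better than full snapshots.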
The Query Store is the central component responsible for storing query metadata. When a subset should be referenceable and persistent, the information about how it was created is stored in the Query Store for later retrieval.
A PID by definition references exactly one unique object. For this reason, we need to ensure that we can detect identical queries. Only if a query was previously unknown to the system is a new PID assigned.
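Detecting identical queries can be done by normalising each query and comparing checksums. The sketch below is a simplified illustration: the normalisation rules (lower-casing, collapsing whitespace) and the `pid_for` helper are assumptions; a real system would normalise the query's parse tree rather than its text.

```python
import hashlib

# Illustrative duplicate detection: normalise the query text and hash it.
# Real normalisation would operate on the parsed query, not the raw string.
def normalise(query: str) -> str:
    return " ".join(query.lower().split())

def query_fingerprint(query: str) -> str:
    return hashlib.sha256(normalise(query).encode("utf-8")).hexdigest()

known_queries = {}  # fingerprint -> PID

def pid_for(query: str, mint_pid) -> str:
    """Return the existing PID for an equivalent query, or mint a new one."""
    fp = query_fingerprint(query)
    if fp not in known_queries:
        known_queries[fp] = mint_pid()
    return known_queries[fp]

# Two textually different but equivalent queries share one PID:
counter = iter(range(1000))
mint = lambda: f"pid-{next(counter)}"
p1 = pid_for("SELECT * FROM songs  WHERE year = 2010", mint)
p2 = pid_for("select * from songs where year = 2010", mint)
print(p1 == p2)  # True
```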
The way records are sorted in a data set can have a large impact on downstream processing. For this reason, a defined, stable sort order needs to be available.
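A common way to obtain a defined sort order, sketched here under the assumption that each record carries a unique key, is to append that key as the final sort criterion so that ties cannot be reshuffled between executions:

```python
# Illustrative sketch: deterministic ordering via a unique-key tiebreaker.
rows = [
    {"id": 3, "artist": "B", "year": 2010},
    {"id": 1, "artist": "A", "year": 2010},
    {"id": 2, "artist": "A", "year": 2009},
]

def stable_order(rows, criteria, unique_key="id"):
    # Appending the unique key breaks ties, so rows that compare equal on
    # the user's criteria always come back in the same order.
    return sorted(rows, key=lambda r: tuple(r[c] for c in (*criteria, unique_key)))

ordered = stable_order(rows, ["artist"])
print([r["id"] for r in ordered])  # [1, 2, 3]
```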
Reproducibility is based on trust. Therefore, mechanisms for verifying that a subset retrieved by re-executing the query against the versioned data is exactly as it originally was are essential.
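Such verification can be implemented by hashing the result set at citation time, storing the digest, and comparing it after every re-execution. The serialisation below (one comma-joined line per row) is an illustrative assumption; any canonical, order-preserving serialisation works.

```python
import hashlib

# Illustrative sketch: verify a re-executed result set against a stored digest.
def result_checksum(rows):
    payload = "\n".join(",".join(map(str, row)) for row in rows)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

original = [(1, "A", 2010), (2, "B", 2009)]
stored_digest = result_checksum(original)       # persisted in the query store

re_executed = [(1, "A", 2010), (2, "B", 2009)]  # query re-run against versioned data
print(result_checksum(re_executed) == stored_digest)  # True
```

This only works together with a defined sort order: if the rows could come back shuffled, identical subsets would produce different digests.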
For mapping a query against a specific state of the data store, the query store needs to record the exact date and time of the query execution.
As our approach is query centric, we assign PIDs to queries instead of to data exports. When the PID is resolved, the query is re-executed and retrieves the data as it originally was.
The Query Store persists all metadata and properties needed for re-executing a query. This includes the PID, original and normalised query, query and result set checksum, timestamp, super set PID, data set description and other information.
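One possible shape for such a query store entry is sketched below. The field names mirror the metadata listed above but are illustrative assumptions, not a prescribed schema, and the checksum values are placeholders.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch of a query store entry; field names are assumptions.
@dataclass
class QueryStoreEntry:
    pid: str                      # persistent identifier assigned to the query
    original_query: str           # query exactly as the researcher issued it
    normalised_query: str         # canonical form used for duplicate detection
    query_checksum: str           # hash of the normalised query
    result_checksum: str          # hash of the result set, for verification
    execution_timestamp: str      # when the query was executed
    superset_pid: Optional[str] = None  # PID of the parent data set
    description: str = ""         # free-text description of the subset

entry = QueryStoreEntry(
    pid="pid-0001",
    original_query="SELECT * FROM songs WHERE year = 2010",
    normalised_query="select * from songs where year = 2010",
    query_checksum="placeholder-query-digest",
    result_checksum="placeholder-result-digest",
    execution_timestamp="2015-06-01T12:00:00Z",
    superset_pid="pid-0000",
    description="All songs released in 2010",
)
print(entry.pid)
```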
As the Query Store already persists additional metadata, such as authors, creators, PIDs and timestamps, the system can automatically provide a recommended citation text for the user. They can then copy the text and paste it into publications.
When users enter the PID of a subset they should be presented with a human readable landing page of the retrieved data. The landing page also provides links to the super set and displays metadata.
For facilitating automation, the system should provide machine readable content and actionable resources which can be consumed by non-human actors. This allows machines to interact with data sets in an automated fashion.
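A machine-readable counterpart of the landing page might look like the JSON payload below. This is a hypothetical sketch: the field names, the example PIDs and the `example.org` download URL are all assumptions, not part of any specification.

```python
import json

# Hypothetical machine-readable landing-page payload; all names and URLs
# here are illustrative assumptions.
def landing_page_json(pid):
    payload = {
        "pid": pid,
        "superset_pid": "pid-0000",
        "query": "SELECT * FROM songs WHERE year = 2010",
        "execution_timestamp": "2015-06-01T12:00:00Z",
        "result_checksum": "placeholder-result-digest",
        "download": f"https://example.org/resolve/{pid}/download",
    }
    return json.dumps(payload, indent=2)

print(landing_page_json("pid-0001"))
```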
As technology advances, change is introduced into existing systems, which eventually requires migrating the data, the query store and the access facilities to a new system.
After the data and systems have been migrated to a new platform, we need to ensure that the data sets and subsets are consistent with the legacy system. Thus the success of the migration needs to be verified.
Data citation of dynamic data allows identifying, retrieving and citing the precise data set with minimal storage overhead, by only storing the versioned data and the queries used for creating the data set. In many environments data versioning is considered a best practice. Data sets can be re-created on demand. The approach allows retrieving the data both as it existed at a given point in time and as the current view on it, by re-executing the same query with the stored or the current timestamp, thus benefiting from all corrections made since the query was originally issued. This allows tracing changes of data sets over time and comparing their effects on the result set.
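The stored-versus-current view can be demonstrated with a small temporal query. The schema and data below are illustrative assumptions: record 2 is "corrected" from year 2010 to 2009, and the same query returns different results depending on the timestamp it is executed with.

```python
import sqlite3

# Illustrative sketch: re-execute a stored query either with its original
# timestamp (the cited state) or with the current time (the corrected state).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE songs (
    id INTEGER, year INTEGER, valid_from TEXT, valid_to TEXT)""")
conn.executemany("INSERT INTO songs VALUES (?, ?, ?, ?)", [
    (1, 2010, "2015-01-01", None),          # unchanged record
    (2, 2010, "2015-01-01", "2015-06-01"),  # later corrected...
    (2, 2009, "2015-06-01", None),          # ...to year 2009
])

def songs_from_2010(as_of):
    return [r[0] for r in conn.execute(
        "SELECT id FROM songs WHERE year = 2010 AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?) ORDER BY id",
        (as_of, as_of))]

print(songs_from_2010("2015-03-01"))  # [1, 2]  data as originally cited
print(songs_from_2010("2015-12-01"))  # [1]     current, corrected view
```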
The query stored as the basis for identifying the data set provides valuable provenance information on the way the specific data set was constructed, making it semantically more explicit than a mere data export.
Metadata such as checksums support the verification of the correctness and authenticity of data sets retrieved. This enhances trust.
The recommendations are applicable across different types of data representation and data characteristics (big or small data; static or highly dynamic; identifying single values or the entire data set). If data is migrated to new representations, the queries can also be migrated, ensuring stability across changing technologies. Distributed data sources can be managed by relying on the local timestamps at each node, avoiding the need for expensive synchronisation in loosely coupled systems.
In the following section we present the prototype for the dynamic data citation of CSV files. This prototype is one of the first tangible outcomes of the RDA WG on Dynamic Data Citation. The system allows users to upload their CSV files and migrates them into a database scheme supporting dynamic data citation. Users can utilise the interface provided by the prototype for filtering and sorting specific subsets of CSV files and reference these subsets persistently with unique identifiers. While users are creating subsets, the system traces the query parameters and stores each query in the query store. As all data is versioned, the system can re-execute any query and retrieve the data exactly as it was at any given point in time.
Furthermore, CSV data sets can be updated. The system automatically detects changed records and maintains a versioned history of all records. Data is never deleted and changes are traceable. Each data set and each subset is stored with proper authenticity metadata which allows verifying the correctness of subsets and whole data sets. The interface allows users to retrieve their data again as CSV files and thus enables reproducibility for scientific experiments which use CSV data as exchange format. See the videos for more details.
This screen cast demonstrates how the CSV data of the Million Song Dataset can be made citable. After the upload is completed, the user selects a unique key and the system automatically migrates the data into a relational database management system. Timestamps and additional metadata are automatically added.
Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.
The second screen cast shows how an individual subset can be created and rendered citable. You can see the landing page and the metadata it provides.
The third screen cast demonstrates how a PID assigned to a super set (the database table) and an individual subset can be resolved to the corresponding landing pages. The landing pages are dynamically generated and can also serve machine clients via an API. The downloads of the data sets are created on demand by re-executing the query which was stored in the first place. The PIDs are currently created in an ARK fashion and are not globally resolvable yet. The next iteration of the prototype will implement the RDA recommendation for PIDs.
This screen cast shows how existing data can be updated. In the demo some records are altered and one is deleted. The system automatically detects the changes in the CSV file and allows accessing previous and current versions. Obviously this is only a small demo; in practice, changes could be introduced with much higher frequency.
Several initiatives are active in the area of data citation. This is a growing list of current efforts in the area of data citation. Please let me know if your approach should be added.
The RDA Working Group on Data Citation (WG-DC) aims to bring together a group of experts to discuss the issues, requirements, advantages and shortcomings of existing approaches for efficiently citing subsets of data. The WG-DC focuses on a narrow field where we can contribute significantly and provide prototypes and reference implementations. Do you have an interesting data citation use case? Please get in touch!
The Force 11 group has published the Joint Declaration of Data Citation Principles. The group has identified eight core principles for data citation: importance of data, credit and attribution, evidence, unique identification, access, persistence, specificity and verifiability, and, last but not least, interoperability.
The Committee on Data for Science and Technology (CODATA) provides best practices for citing data and published the Out of Cite, Out of Mind report on the current state of practice, policy, and technology for the citation of data.
The DataCite organisation actively promotes data citation and aims to increase data sharing and data citation acceptance. DataCite provides DOI persistent identifiers and related services.