Version 1 of the DCC checklist for appraising research data
By Angus Whyte, Published: 31 October 2014
Please cite as: DCC (2014). 'Five steps to decide what data to keep: a checklist for appraising research data v.1'. Edinburgh: Digital Curation Centre. Available online: /resources/how-guides
This work is licensed under Creative Commons Attribution BY 2.5 Scotland, except Section 4, which is adapted under licence CC-BY-NC-SA from UK Data Archive (2013) Data management costing tool and checklist. Available at: http://www.data-archive.ac.uk/create-manage/planning-for-sharing/costing
This guide aims to help UK Higher Education Institutions aid their researchers in making informed choices about what research data to keep. The content complements other DCC guides: How to Appraise & Select Research Data for Curation,[1] and How to Develop Research Data Management Services. [2] The guide will be relevant to researchers making decisions on a project-by-project basis, or formulating departmental guidelines. It assumes that decisions on particular datasets will normally be made by researchers with advice from the appropriate staff (e.g. academic liaison librarians) taking into account any institutional policy on Research Data Management (RDM) and guidance available within their own domain. As such, the guide should also be relevant to staff with responsibility for defining such policy in a Higher Education Institution, a Professional or Learned Society or similar disciplinary body.
The guide assumes that part way through their research the Principal Investigator, or other researcher responsible for data management, will want to choose what data to keep, informed by commitments already made to share or retain data (e.g. in a Data Management Plan). The unit of appraisal is a ‘data collection’, and this may include different files carrying different access permissions and/or licence conditions.
The text also assumes that the institution will provide the following capabilities:
No assumption is made about how either of the above capabilities will be provided; for example, they might be repository or managed storage services, distinct from or integrated with a publications repository or a CRIS (Current Research Information System). In either case the capability could be provided in-house, or outsourced e.g. through Janet Cloud Services.[3] The guide may be adapted to reflect local services and guidance on selecting external repositories for data deposition. [4] DCC can provide help with this customisation to institutions’ needs and visual design.[5]
Angus Whyte, Digital Curation Centre
As a researcher you will probably select from the data available at various points in the research cycle. You will select from the data sources available to work with at the outset of your study, select from the data assembled for analysis and then select analysed data to make further statements about what has been found, some of which may be included in a publication.
With more digital technologies being used in research there is a growing need to make further choices about what to keep for the long-term, selecting what data to make available or to dispose of. The best time to do this data appraisal is well before the end of the project, or periodically if it’s a longitudinal or reference data collection.
This guide aims to help you make what may be quite difficult choices around what data to keep in order to meet your own purposes and satisfy your institution and external funders. You may have a number of choices about who will look after your data:
The choice may be straightforward if you have an established data management facility in your domain,[6] or even within your research group or department. Your research funder may recommend a data centre or self-deposit archive. For example, the UK Data Service offers social scientists the ReShare archive (reshare.ukdataservice.ac.uk). When choosing, it is important to consider factors such as whether the repository:
A forthcoming DCC guide offers further help in selecting external repositories. Your institution may offer Research Data Management support to help you deal with these issues and get the most out of the investment put into your research. This could involve:
The guide takes you through the following five steps:
This guide draws mainly on the existing DCC guide "How to Appraise and Select Research Data for Curation",[7] the NERC Data Value Checklist [8], and the University of Bristol Research Data Evaluation Guide [9]. Section 4 is adapted from the UK Data Service’s Data management costing tool and checklist.
This guide uses a broad definition of research data: "representations of observations, objects, or other entities used as evidence of phenomena for the purposes of research or scholarship". [10] When selecting what data to keep you will need to consider which of the following broad types [11] is suitable for reuse:
The guide also uses the term data collection for any collection of objects that would be needed to access and interpret the above, e.g. a notebook, protocol, software or set of instrument calibrations, whether wholly or partly digital. In some cases there may be a justification for keeping the software used to manipulate and analyse the data generated. That justification may be as strong as for keeping the data itself, e.g. if the software would be required to enable the results to be reproduced. In other cases it may be even stronger, e.g. if the results could be reproduced just from the algorithm that was used to carry out an analysis.
A single research project could easily produce a number of ‘data collections’, each matching a different potential use. Just as research may be written up for several audiences and publishers, data collections could be produced and deposited in different repositories. Each data collection may itself comprise various digital files that need different access permissions and/or licence conditions. You will need to plan how to package these up into collections, taking account of your intended repository’s terms and conditions for depositing data. If you will need to deposit different kinds of outputs with different services, you may need to plan to organise them in multiple data collections.
The phrase 'long-term' is used in this guide to mean ‘beyond the end of your research project’. Or, if the research data is contributing to a reference collection or longitudinal study, its long-term value could be assessed periodically, e.g., every 3 years. Specific guidance on how long data should be retained may be available in your institution or funder’s Research Data Management Policy. The DCC also provides guidance on funding body policies.[12]
Other sources that can help you assess what to keep, if they exist for your project, include a Data Management Plan produced when the research was conceived, which may identify possible long-term uses of the data.[13] There may also be a Pathway to Impact statement holding some ideas about longer-term objectives for research outputs.
Consider the purpose or ‘reuse case’ that the data could serve beyond the research context in which it was created or collected. Any one of the following 7 reasons could justify retaining data for long-term access. Many, though not all, involve making it accessible beyond your research group, at least once you have had the opportunity for first use.
Reasons 1 and 2 in particular overlap with funding bodies’ policy aims, as these typically focus on ensuring the integrity of published research findings and on maximising return on the investment in data. The funders’ main concern is to preserve the ‘data behind the graph’ [15], but the onus is on researchers and those who directly support them to translate policy into meaningful guidelines. Contractual and other legal obligations may also come into play. We return to these under ‘What data must be kept’ below.
To help your decision making you could match up the reuse purposes most relevant to your research against types of data likely to be needed, as in Table 1 below [16].
| Reuse case | Preservation guideline |
| --- | --- |
| Further publications | Referenced data with additional documentation |
| Learning & teaching | Samples of source & assembled data with analysis scripts |
| Verification | Referenced data plus analysis scripts |
| Further analysis | All source data plus software used to collect |

Table 1. Example preservation guideline
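As a minimal sketch, the matching exercise in Table 1 could be recorded as a simple lookup. The reuse cases and guidelines below are copied from the table; the names `PRESERVATION_GUIDELINE` and `data_to_keep` are hypothetical and used only for illustration.

```python
# Lookup built from Table 1: reuse case -> example preservation guideline.
# The dictionary and function names are illustrative assumptions, not part of the guide.
PRESERVATION_GUIDELINE = {
    "Further publications": "Referenced data with additional documentation",
    "Learning & teaching": "Samples of source & assembled data with analysis scripts",
    "Verification": "Referenced data plus analysis scripts",
    "Further analysis": "All source data plus software used to collect",
}


def data_to_keep(reuse_cases):
    """Return the Table 1 guideline for each reuse case that appears in the table.

    Reuse cases not listed in Table 1 are skipped rather than guessed at.
    """
    return {
        case: PRESERVATION_GUIDELINE[case]
        for case in reuse_cases
        if case in PRESERVATION_GUIDELINE
    }


# Example: a project that expects its data to be used for verification and teaching.
selected = data_to_keep(["Verification", "Learning & teaching"])
for case, guideline in selected.items():
    print(f"{case}: {guideline}")
```

A project with reuse cases outside Table 1 would need its own entries; the table is an example rather than a complete list.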
Generally the decision on what ‘must’ be kept will depend on the data creator’s priorities, i.e. on how valuable the data is for the purposes identified above, considering the costs of preparing it for long-term use. But the decision will also need to account for legal, regulatory or policy compliance issues. At the point of deciding what to keep these mostly concern whether data should be publicly available or have restricted access, on what terms and conditions it should be accessible, and ensuring that risks of non-compliance are addressed.
In this step, consider the basic questions below to help identify these. If you are unsure whether risks are best addressed by keeping the data or disposing of it and, in either case, how securely this needs to be done, seek further advice from your institution’s Research Data Management service or similar support staff (e.g. a Records Manager).
UK Research Council research data policy principles emphasise that data with "...acknowledged long-term value" should be retained.[17] Journals, learned societies and professional associations are active in defining what this means in individual disciplines. Decisions on what ‘must’ be kept will need to take account of any relevant funder or institutional policies.[18] But what exactly counts as data of "acknowledged long-term value" will be grounded in the creators’ in-depth knowledge of that data and of what is likely to be of value. So the most basic indicators that you must keep it are if you answer ‘yes’ to either of these questions:
A ‘yes’ here will indicate you should keep the ‘data behind the graph’ (discussed above). Step 3 below gives more help on working out anything else that may be of ‘long-term value’.
The main questions here are:
Legal regulations covering Freedom of Information and Environmental Information require research data to be made available on request, if the research is complete and data relating to it is still available. This implies that any data that is kept should be clearly identified according to an information security classification scheme, e.g. public access/internal/confidential/secret.[19] For general guidance on any exemptions that may apply to the data, consult the Information Commissioner’s Office (ico.gov.uk) and Scottish Information Commissioner (www.itspublicknowledge.info) websites.
These are likely if the research has public policy implications, it involves a commercial partner, or has potential spin-off applications.
The Data Protection Act defines personal data and sets out criteria for deciding how long it should be kept, how it must be stored and requirements for disposal. If you answer ‘yes’ to all of the questions below, the next step is to follow guidance available from the Information Commissioner’s Office, including how to anonymise data if needed.[20] The UK Data Archive also offers guidance and provides a Secure Lab, allowing personal data to be used in academic research under strict controls.[21]
If your response is ‘no’ to any of the above you may be able to get help to resolve any issues from your institution’s Research Data Management service or Records Manager.
Bearing in mind the potential reuse purposes you identified earlier, consider the criteria and questions below to help decide which data should be retained and for what reason. As a general rule the data should be kept if you have already identified a compliance reason, or you can answer ‘yes’ to at least one of the questions under any two of the headings (criteria) below.
Tick any of the criteria that you expect the data to rate highly on, as far as this can be estimated. You can weight criteria differently according to the long-term aims the data needs to meet and how certain you are about its value in relation to those aims.
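The general rule above can be sketched as a short function, purely as an illustration. The function name `should_keep` and the placeholder headings in the example are hypothetical; substitute the actual criteria headings and questions from this step. The sketch does not model the optional weighting of criteria.

```python
def should_keep(compliance_reason, answers):
    """Apply the general rule from this step.

    compliance_reason: True if a compliance reason for keeping the data
        has already been identified.
    answers: dict mapping each criterion heading to a list of booleans,
        one per question under that heading (True means 'yes').
    """
    # A heading counts as met if at least one of its questions is answered 'yes'.
    headings_met = [heading for heading, replies in answers.items() if any(replies)]
    # Keep the data if there is a compliance reason, or if any two headings are met.
    return bool(compliance_reason) or len(headings_met) >= 2


# Placeholder headings for illustration only.
example_answers = {
    "Criterion A": [True, False],
    "Criterion B": [False, False],
    "Criterion C": [False, True],
}
print(should_keep(False, example_answers))
```

Here two of the three placeholder headings have at least one ‘yes’, so the rule suggests keeping the data even without a compliance reason.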
This step helps consider the economic case for keeping the data. It is important to consider the data management cost impact on your research commitments and your organisation’s budgetary constraints. If you have recently done that and can give an unequivocal ‘yes’ to each of the following questions you can skip this step.
You can use this section to estimate any shortfall in the time or other costs budgeted for data management.[22] Any costs that have already been incurred will count on the ‘value’ side of the economic case for keeping the data, while any shortfall will count against it. Your institution’s Research Data Management Service, Research Office, Library or IT service may be able to advise on how to meet commitments for data in the ‘must keep’ category.
Use the headings below, or any cost categories used in the Data Management Plan, to estimate how much has been spent on staff time, equipment/ hardware, or software and service charges, and how much still needs to be spent in these categories.
The table is only for your own purposes (figures do not need to be disclosed to anyone who would not otherwise have access to them). It should serve two purposes: firstly, to help identify the value accumulated in your data and, secondly, to identify any areas where you may need to seek external help to avoid the risk that this value cannot be realised.
| | Spend to date | Needed to complete | Budgeted | Likely shortfall? |
| --- | --- | --- | --- | --- |
| **Creation, collection & cleaning** | | | | |
| Creating a suitable consent form and obtaining consent for data sharing | | | | |
| Data transfer or transcription from sites, media or instruments | | | | |
| Description and documentation | | | | |
| Validation, checking or cleaning | | | | |
| Formatting and file organisation | | | | |
| Digitisation of paper or physical objects | | | | |
| **Short-term storage & backup** | | | | |
| Storage space for all working data for duration of project | | | | |
| Backup of all data for duration of project | | | | |
| **Short-term access & security** | | | | |
| Providing access and authentication for external collaborators or participants | | | | |
| Online and physical protection of data from unauthorised access or disclosure | | | | |
| **Team communication & development** | | | | |
| Data management meetings | | | | |
| Online collaboration, virtual research environment | | | | |
| Data management training | | | | |
| **Preservation & long-term access** | | | | |
| Copyright clearance, licensing | | | | |
| Classifying data sensitivity and anonymising personal data (if required) | | | | |
| Preparation for archiving, conversion to open file formats | | | | |
| Metadata for data citation, discovery and reuse | | | | |
| Data deposit charges | | | | |
| Long-term storage costs | | | | |
Staff time (person hours)
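One way to tally the cost table above is sketched below. It assumes, as an interpretation rather than something the guide states, that the likely shortfall for an activity is the spend to date plus the amount needed to complete, minus the amount budgeted, with a floor of zero. The class name `CostLine`, the function `total_shortfall` and the figures in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CostLine:
    """One row of the cost table, in a single unit (e.g. person hours or pounds)."""
    activity: str
    spend_to_date: float
    needed_to_complete: float
    budgeted: float

    @property
    def shortfall(self):
        # Assumed formula: spend to date + needed to complete - budgeted.
        # A surplus on one activity is reported as zero rather than a negative shortfall.
        return max(0.0, self.spend_to_date + self.needed_to_complete - self.budgeted)


def total_shortfall(lines):
    """Sum the per-activity shortfalls; surpluses do not offset shortfalls."""
    return sum(line.shortfall for line in lines)


# Illustrative figures only, in person hours.
lines = [
    CostLine("Description and documentation", 10, 15, 20),
    CostLine("Metadata for data citation, discovery and reuse", 0, 5, 10),
]
for line in lines:
    print(f"{line.activity}: shortfall {line.shortfall}")
print(f"Total shortfall: {total_shortfall(lines)}")
```

Keeping staff time and monetary costs in separate tallies avoids mixing units. Whether a surplus in one category may offset a shortfall in another depends on your funder's and institution's rules, so the sketch treats each activity independently.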