Documenting data
Data documentation comprises any contextual and descriptive information needed to find, assess, understand, and (re)use research data.
Highly structured and machine-readable documentation is called metadata or 'data about the data'.
Why document data?
It is crucial to provide documentation for research data for various reasons:
- You and others can understand or interpret the data later
- You make results independently reproducible
- You avoid incorrect use/misinterpretation of your data
- You demonstrate what you did
- You allow other researchers to learn from your work
- You avoid duplication of efforts
- You enable data to be found and used automatically, so that research can operate at a larger scope, scale and speed
As such, documentation is an essential step in making your data FAIR.
When?
The pitfall in the process of data documentation is procrastination. You can quickly forget why you generated or processed the data the way you did or what certain abbreviations mean.
Best practice is to start gathering meaningful information early on in the research process and maintain this consistently throughout the entire project.
Documentation levels
Data should be documented at multiple levels: study, file, object and individual item level.
At project or study level you register contextual information about a study/project:
- background, aims, and hypotheses
- procedural and methodological information
At file or database level you document the file inventory and relations or the database structure.
At sample or study object level you keep information about the characteristics of samples, study participants or subsets of data. These can be identifiers, sample size, settings, time and location and any sampling biases or limitations.
At item or variable level you document individual data points or variables within a dataset, along with their definitions and relevant details:
- labels, units of measurement, abbreviations
- any relevant codes or categories
Data documentation methods
Data documentation methods vary widely in type and form. In this section we group them according to:
- File organisation
- Readme files
- Notebooks
- Surveys and interviews
- Images
- Variable documentation
- Reference management
- Computational research
File organisation
Filenames and the organization of folders provide important context about a collection of data.
- Also see this knowledge clip on keeping data organized.
The RDM Workspace is a structured SharePoint template designed to store information about a research project. It is included in the Ghent University M365 system and allows efficient collaboration, also with external collaborators, on files, libraries and lists.
- More information about the RDM Workspace.
Maintain a Data List to keep an overview of files that are collected (e.g. interview recordings, transcripts). By listing the data items and their contextual information and adding a unique identifier, you will be able to easily locate the relevant data.
- A data list template can be found via the UK Data Service's Data-level documentation page.
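As an illustration, a data list can be kept as a simple CSV table. The sketch below is a minimal example, not a prescribed template: the column names (title, file, collected), the example items and the "ITEM" identifier prefix are all assumptions made for the illustration.

```python
import csv
import io

# Hypothetical data items; column names are illustrative, not a standard.
DATA_ITEMS = [
    {"title": "Interview recording participant 01", "file": "int01_audio.wav", "collected": "2024-03-04"},
    {"title": "Interview transcript participant 01", "file": "int01_transcript.txt", "collected": "2024-03-11"},
]


def build_data_list(items, prefix="ITEM"):
    """Return a copy of the items with a sequential unique identifier added."""
    return [{"id": f"{prefix}-{n:03d}", **item} for n, item in enumerate(items, start=1)]


def to_csv(rows):
    """Serialise the data list to CSV text, one row per data item."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


data_list = build_data_list(DATA_ITEMS)
print(to_csv(data_list))
```

The identifier is generated rather than typed by hand, so every item gets exactly one and no two items share one.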
The Open Science Framework (OSF) is available through a Ghent University institutional membership, for projects that endorse open and early sharing of research. OSF supports the entire research workflow by enabling researchers to collaborate, document, archive, share, and register research projects and data.
- Getting started on the OSF.
Readme files
A Readme file collects information to help ensure that files can be understood and interpreted correctly. As such, it complements or is an alternative to folder hierarchy and file naming. When creating a Readme file, think about:
- What information will be required? Consider the W-questions: what, when, where, why
- Will you place the Readme files per subfolder or per study?
- Will you use a template to create the Readme file or is automation possible?
- When will you update the Readme files?
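If you opt for a template with some automation, a short script can fill in the recurring fields. The following is a minimal sketch: the template layout, the field names and the example values are assumptions, to be replaced by whatever your project needs.

```python
from datetime import date
from string import Template

# Hypothetical Readme layout built around the W-questions.
README_TEMPLATE = Template("""\
README for: $title
Last updated: $updated

What:  $what
When:  $when
Where: $where
Why:   $why
""")


def render_readme(title, what, when, where, why, updated=None):
    """Fill in the template; the update date defaults to today."""
    if updated is None:
        updated = date.today().isoformat()
    return README_TEMPLATE.substitute(
        title=title, updated=updated, what=what, when=when, where=where, why=why
    )


readme_text = render_readme(
    title="Pilot study",
    what="Reaction time measurements",
    when="March 2024",
    where="Lab A",
    why="Test the measurement protocol",
    updated="2024-03-15",
)
print(readme_text)
```

Because substitute raises an error when a field is missing, an incomplete Readme is noticed when it is generated rather than when someone tries to use it.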
Notebooks
OneNote is a general purpose and extremely flexible note keeping application, integrated in the Ghent University Microsoft 365 environment.
- The Microsoft support site introduces OneNote
An Electronic Lab Notebook or ELN is a fully compliant note keeping tool, suitable for multiple disciplines and it includes an integrated sample management module.
An ELN is a digital replacement for paper lab notebooks. It is indispensable when research has valorisation potential, to avoid discussions about ownership and to meet contractual obligations when performing research in collaboration with third parties.
- More information about the RSpace ELN for Ghent University: https://onderzoektips.ugent.be/en/tips/00002245/
Obsidian is a personal knowledge base and note-taking application that operates on Markdown files. It can be installed from the Obsidian website, but is not supported by the University Services - ICT.
- Obsidian example
Surveys and interviews
Survey platforms allow you to collect data and can, at the same time, hold a lot of documentation about the research. Survey platforms available at Ghent University are:
- Microsoft Forms: M365 kennisplatform
- Qualtrics: https://onderzoektips.ugent.be/en/tips/00002102/
- REDCap: https://hiruz.be/dmu
You can use NVivo to document qualitative data. NVivo is available for Ghent University researchers through Athena.
- For more information about NVivo, see https://onderzoektips.ugent.be/nl/tips/00001699/
Images
You can do basic annotation of images on a personal computer using Tropy: https://tropy.org (not supported by University Services - ICT).
For technical images such as microscopy images, scans or spectra, there are dedicated solutions available. Think about
- LOGS repository https://logs.sciy.com/ (not supported by University Services - ICT)
- Omero for microscopy images https://www.openmicroscopy.org/omero/ (not supported by University Services - ICT)
- XNAT for clinical images https://www.xnat.org/ (not supported by University Services - ICT)
None of the image management tools mentioned here are supported by the University Services - ICT, but some research departments may have experience. Server capacity to host your own installation can be requested via the ICT helpdesk website.
Variable documentation
Information about data items can be recorded in a Codebook. Typically this is a separate file, but some data formats allow you to embed this information within the data file (e.g. the SPSS .sav file format). If you convert files to another format, check that all embedded information is still available afterwards.
- Example Codebook: Sciensano COVID19 Codebook
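A codebook can also be kept in a machine-readable form, so that it can be checked against the dataset. The sketch below assumes a very simple structure (label, unit and codes per variable); the variables and codes shown are invented for the illustration and do not come from an existing codebook.

```python
# Hypothetical codebook: one entry per variable in the dataset.
CODEBOOK = {
    "age": {"label": "Age of participant", "unit": "years", "codes": None},
    "sex": {
        "label": "Sex of participant",
        "unit": None,
        "codes": {1: "female", 2: "male", 9: "unknown"},
    },
    "rtms": {"label": "Reaction time", "unit": "milliseconds", "codes": None},
}


def undocumented_variables(columns, codebook):
    """Return the dataset columns that have no codebook entry."""
    return [name for name in columns if name not in codebook]


def decode(variable, value, codebook):
    """Translate a coded value to its category; uncoded values pass through."""
    codes = codebook[variable]["codes"]
    return codes.get(value, value) if codes else value


print(undocumented_variables(["age", "sex", "bmi"], CODEBOOK))
print(decode("sex", 2, CODEBOOK))
```

Running such a check whenever the dataset changes helps keep the codebook and the data in step.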
Best practices for variable naming
- Use meaningful abbreviations such as “rtms” (reaction time in milliseconds).
- Use variable names that are related to your data collection method, e.g. survey question numbers
- Avoid a simplistic numerical order system
- Be consistent across versions of datasets
- Do not use spaces or special characters
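Two of these practices lend themselves to an automated check. The sketch below assumes that only letters, digits and underscores are acceptable in a variable name; that rule and the example names are assumptions, so adapt them to the conventions of your own software.

```python
import re

# Assumed rule: start with a letter, then letters, digits or underscores only.
ALLOWED = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def invalid_names(names):
    """Return names containing spaces or special characters."""
    return [name for name in names if not ALLOWED.fullmatch(name)]


def missing_in_new_version(old_names, new_names):
    """Return names present in the old dataset version but not in the new one."""
    return sorted(set(old_names) - set(new_names))


print(invalid_names(["rtms", "reaction time", "q12_age", "income€"]))
print(missing_in_new_version(["rtms", "q12_age"], ["rtms", "age"]))
```

The second function flags variables that were renamed or dropped between versions, which is where inconsistencies usually creep in.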
Reference management
Reference software, also known as bibliographic software, allows you to collect, manage and use literature information in a systematic way. More information is available in this research tip: https://onderzoektips.ugent.be/en/tips/00001541
Computational research
GitHub is a collaboration platform built on the Git version control system, which allows you to document changes in computer files, e.g. software code or algorithms. A Ghent University enterprise version is available, which requires a Ghent University login; the public GitHub.com can be used for external collaboration or sharing.
You can collect even more documentation using code notebooks, and use these to build reproducible workflows:
- Jupyter: https://jupyter.org/
- R Markdown: https://rmarkdown.rstudio.com/
Both applications are available as interactive apps on the Ghent University HPC infrastructure. RStudio is also available in the Company Portal for Intune PCs.
Metadata
Metadata are 'data about data', used to describe and annotate data. They are a highly structured, machine-readable form of data documentation.
Metadata are essential to make your data FAIR (Findable, Accessible, Interoperable and Reusable). They allow you to describe your data in such a way that they can be found online, and to indicate unambiguously how the data can be accessed and re-used (access level and license).
Metadata can be embedded within data files, captured in separate files, or recorded via metadata forms of a data management solution e.g. SharePoint or a data repository.
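Capturing metadata in a separate file can be as simple as writing a small "sidecar" file next to the data file. The sketch below is one possible approach: the JSON format, the ".json" suffix convention, the field names and the example values are all assumptions, not a requirement of any particular repository.

```python
import json
import tempfile
from pathlib import Path


def write_sidecar(data_file, metadata):
    """Write the metadata as JSON next to the data file and return its path."""
    sidecar = data_file.with_name(data_file.name + ".json")
    sidecar.write_text(json.dumps(metadata, indent=2, ensure_ascii=False), encoding="utf-8")
    return sidecar


# Demonstration in a temporary folder, using an invented data file.
with tempfile.TemporaryDirectory() as folder:
    data_file = Path(folder) / "survey_wave1.csv"
    data_file.write_text("id,age\n1,34\n", encoding="utf-8")

    sidecar = write_sidecar(
        data_file,
        {
            "title": "Example survey, wave 1",
            "creator": "Example Researcher",
            "access_level": "restricted",
            "license": "CC-BY-4.0",
        },
    )
    loaded = json.loads(sidecar.read_text(encoding="utf-8"))
    print(sidecar.name, loaded["license"])
```

Keeping the sidecar name derived from the data file name makes it obvious which metadata belong to which file.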
Metadata schemas
Typically, you will provide metadata when depositing data in a data repository, according to the metadata standard which is used by that repository. This is done by filling out a form or template according to the submission guidelines of the data repository. Make sure to check the data submission guidelines in time, so you can plan to collect the required information.
Metadata can comprise a fixed set of elements, as defined by a particular metadata standard. Metadata standards for research data can be general, such as the Dublin Core metadata standard, or domain-specific, such as the DDI standard for social, behavioral, economic and health sciences data or the Ecological Metadata Language (EML).
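To give an idea of what a fixed set of elements looks like in practice, the sketch below builds a small record with a few Dublin Core elements (title, creator, date, rights). The wrapping "metadata" element and the example values are assumptions made for the illustration; a repository will prescribe its own container format and required fields.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build an XML record with one Dublin Core element per field."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for element, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


record = dublin_core_record(
    {
        "title": "Example survey dataset",
        "creator": "Example Researcher",
        "date": "2024",
        "rights": "CC-BY-4.0",
    }
)
print(record)
```

Because the element names come from a shared standard, software that knows Dublin Core can read the title or the rights statement without knowing anything else about the project.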
There are a number of resources where you can look for metadata standards:
- Data submission information pages of data repositories
- DataCite metadata schema
- Dublin Core metadata schema
- Metadata Standards Catalog
- DCC, overview of Disciplinary metadata
- Fairsharing.org
More information
- Software available at Ghent University
- J. Riley (2017), Understanding metadata
- Dublin Core Metadata generator
- CESSDA ERIC, Documentation and Metadata
- The Turing way online book
- Using OneNote as a research notebook
- Elixir RDM Guide on metadata in practice