Documenting data

Data documentation comprises any contextual and descriptive information needed to find, assess, understand, and (re)use research data.

Highly structured and machine-readable documentation is called metadata or 'data about the data'.

Why document data?

It is crucial to provide documentation for research data for various reasons:

  • You and others can understand or interpret the data later
  • You make results independently reproducible
  • You avoid incorrect use/misinterpretation of your data
  • You demonstrate what you did
  • You allow other researchers to learn from your work
  • You avoid duplication of efforts
  • You enable machines to find and use the data automatically, so that research can operate at larger scope, scale and speed

As such, documentation is an essential step in making your data FAIR.

When?

The main pitfall in data documentation is procrastination. You can quickly forget why you generated or processed the data the way you did, or what certain abbreviations mean.

Best practice is to start gathering meaningful information early on in the research process and maintain this consistently throughout the entire project.

Documentation levels

Data should be documented at multiple levels: study, file, object and individual item level.

At project or study level you register contextual information about a study/project:

  • background, aims, and hypotheses
  • procedural and methodological information

At file or database level you document the file inventory and relations or the database structure.

At sample or study object level you keep information about the characteristics of samples, study participants or subsets of data. These can be identifiers, sample size, settings, time and location and any sampling biases or limitations.

At item or variable level you document individual data points or variables within a dataset, along with their definitions and relevant details:

  • labels, units of measurement, abbreviations
  • any relevant codes or categories

Data documentation methods

Data documentation methods vary widely in type and form. In this section we group them under the headings below.

File organisation

Filenames and the organisation of folders provide important context about a collection of data.

The RDM Workspace is a structured SharePoint template designed to store information about a research project. It is included in the Ghent University M365 system and allows efficient collaboration, also with external collaborators, on files, libraries and lists.

Maintain a Data List to keep an overview of files that are collected (e.g. interview recordings, transcripts). By listing the data items and their contextual information and adding a unique identifier, you will be able to easily locate the relevant data.
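As an illustration, a Data List can be kept as a simple table with one row per data item. The sketch below is only one possible approach: the column names, the identifier pattern (a prefix plus a running number) and the example files are assumptions, not a prescribed format.

```python
import csv
import io

# Hypothetical columns for a Data List; adapt them to your own project.
FIELDS = ["item_id", "filename", "description", "collected_on"]


def build_data_list(items, prefix="INT"):
    """Give every data item a sequential unique identifier such as INT-001."""
    rows = []
    for number, item in enumerate(items, start=1):
        row = {"item_id": f"{prefix}-{number:03d}"}
        row.update(item)
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialise the Data List as CSV text that can be saved next to the data."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Example items (invented for illustration).
items = [
    {"filename": "interview_01.wav", "description": "Interview recording",
     "collected_on": "2024-03-01"},
    {"filename": "interview_01.docx", "description": "Transcript of interview 1",
     "collected_on": "2024-03-08"},
]

data_list = build_data_list(items)
print(to_csv(data_list))
```

Keeping the identifier in its own column means it can also be used in filenames or notes to point back to the same item.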

The Open Science Framework (OSF) is available through a Ghent University institutional membership, for projects that endorse open and early sharing of research. OSF supports the entire research workflow by enabling researchers to collaborate on, document, archive, share, and register research projects and data.

Readme files

A Readme file collects information that helps ensure files can be understood and interpreted correctly. As such, it complements or is an alternative to folder hierarchy and file naming. When creating a Readme file, think about:

  • What information will be required? Consider the W-questions: what, when, where, why
  • Will you place the Readme files per subfolder or per study?
  • Will you use a template to create the Readme file or is automation possible?
  • When will you update the Readme files?
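If you choose to work from a template, creating Readme files can be partly automated. The following is a minimal sketch, assuming a plain-text template built around the W-questions; the field names and the example values are hypothetical.

```python
from datetime import date
from string import Template

# Hypothetical Readme template based on the W-questions.
README_TEMPLATE = Template("""\
Title: $title
What: $what
When: $when
Where: $where
Why: $why
Last updated: $updated
""")


def render_readme(info, updated=None):
    """Fill in the template; a missing field raises KeyError instead of being skipped."""
    updated = updated or date.today().isoformat()
    return README_TEMPLATE.substitute(info, updated=updated)


# Example values (invented for illustration).
readme_text = render_readme(
    {
        "title": "Interview study",
        "what": "Interview transcripts",
        "when": "March 2024",
        "where": "Ghent",
        "why": "Pilot for the main study",
    },
    updated="2024-03-08",
)
print(readme_text)
```

Because an incomplete set of fields raises an error, the template also works as a checklist of the information that is required.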

Notebooks

OneNote is a general-purpose and very flexible note-keeping application, integrated in the Ghent University Microsoft 365 environment.

An Electronic Lab Notebook (ELN) is a fully compliant note-keeping tool that is suitable for multiple disciplines and includes an integrated sample management module.

An ELN is a digital replacement for paper lab notebooks. It is indispensable where there is valorisation potential, to avoid discussions about ownership and to meet contractual obligations when performing research in collaboration with third parties.

Obsidian is a personal knowledge base and note-taking application that operates on Markdown files. It can be installed from the Obsidian website, but is not supported by the University Services - ICT.

Surveys and interviews

Survey platforms allow you to collect data and can at the same time hold a lot of documentation about the research. Available survey platforms at Ghent University are

You can use NVivo to document qualitative data. NVivo is available for Ghent University researchers through Athena.

Images

You can do basic annotation of images on a personal computer using Tropy: https://tropy.org (not supported by University Services - ICT).

For technical images such as microscopy images, scans or spectra, there are dedicated solutions available. Think about

None of the image management tools mentioned here are supported by the University Services - ICT, but some research departments may have experience. Server capacity to host your own installation can be requested via the ICT helpdesk website.

Variable documentation

Information about data items can be recorded in a Codebook. Typically this is a separate file, but some data formats allow you to embed this information within the data file (e.g. the SPSS .sav file format). If you convert files to another format, check that all embedded information is still available afterwards.
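A Codebook kept as a separate file can also be used to check a dataset, for example after a format conversion. The sketch below assumes a Codebook with one entry per variable; apart from "rtms", the variable names, labels and codes are invented for illustration.

```python
# Hypothetical Codebook: one entry per variable in the dataset.
codebook = [
    {"variable": "rtms", "label": "Reaction time", "unit": "ms", "codes": ""},
    {"variable": "cond", "label": "Experimental condition", "unit": "",
     "codes": "1=control; 2=treatment"},
]


def undocumented_variables(data_columns, codebook):
    """Return the column names that have no entry in the Codebook."""
    documented = {entry["variable"] for entry in codebook}
    return [name for name in data_columns if name not in documented]


# Column names as they might appear in the header of a data file.
columns = ["rtms", "cond", "age"]
missing = undocumented_variables(columns, codebook)
print(missing)  # prints ['age']
```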

Best practices for variable naming

  • Use meaningful abbreviations such as “rtms” (reaction time in milliseconds).
  • Use variable names that are related to your data collection method, e.g. survey question numbers
  • Avoid a simplistic numerical order system
  • Be consistent across versions of datasets
  • Do not use spaces or special characters
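Some of these rules can be checked automatically. The sketch below tests variable names against one possible convention: lowercase letters, digits and underscores, starting with a letter. That convention is an assumption and stricter than the guidance above, which only rules out spaces and special characters.

```python
import re

# Assumed convention: start with a lowercase letter, then letters, digits or underscores.
VALID_NAME = re.compile(r"[a-z][a-z0-9_]*")


def invalid_names(names):
    """Return the variable names that break the assumed naming convention."""
    return [name for name in names if not VALID_NAME.fullmatch(name)]


print(invalid_names(["rtms", "q12_age", "reaction time", "score%"]))
# prints ['reaction time', 'score%']
```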

Reference management

Reference software, also known as bibliographic software, allows you to collect, manage and use literature information in a systematic way. More information is available in this research tip: https://onderzoektips.ugent.be/en/tips/00001541

Computational research

GitHub is a collaboration platform built on the Git version control system. It allows you to document changes in computer files, e.g. software code or algorithms. A Ghent University enterprise version is available, which requires a Ghent University login; the public GitHub.com can be used for external collaboration or sharing.

You can collect even more documentation using code notebooks, and use these to build reproducible workflows.

Both applications are available as interactive apps on the Ghent University HPC infrastructure. RStudio is also available in the Company Portal for Intune PCs.

Metadata

Metadata are 'data about data', used to describe and annotate data. They are a highly structured, machine-readable form of data documentation.

Metadata are essential to make your data FAIR (Findable, Accessible, Interoperable and Reusable). They allow you to describe your data in such a way that they can be found online, and to indicate unambiguously how the data can be accessed and reused (access level and licence).

Metadata can be embedded within data files, captured in separate files, or recorded via metadata forms of a data management solution e.g. SharePoint or a data repository.

Metadata schemas

Typically, you will provide metadata when depositing data in a data repository, according to the metadata standard used by that repository. This is done by filling out a form or template according to the submission guidelines of the data repository. Check these guidelines well in advance, so you can plan to collect the required information.

Metadata can comprise a fixed set of elements, as defined by a particular metadata standard. Metadata standards for research data can be general, such as the Dublin Core metadata standard, or domain-specific, such as the DDI standard for social, behavioral, economic and health sciences data or the Ecological Metadata Language (EML).
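To give an idea of what a fixed set of elements looks like in practice, the sketch below writes a small record using a few of the Dublin Core elements (title, creator, date, rights) with Python's standard library. The wrapping "metadata" element and all values are placeholders; a repository will normally define its own form or file layout.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build an XML record with one Dublin Core element per field."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values, for illustration only.
record_xml = dublin_core_record(
    {
        "title": "Example interview dataset",
        "creator": "Example Researcher",
        "date": "2024-03-08",
        "rights": "CC BY 4.0",
    }
)
print(record_xml)
```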

There are a number of resources where you can look for metadata standards:

More information