Teaching Research Data Management with DataLad: A Multi-year, Multi-domain Effort

Since its first release (0.0.1, March 2015), DataLad has had technical documentation consisting of a design overview and reference documentation. Although any amount of documentation is better than none, existing documentation can still be insufficient if it does not meet the needs of its target audience. Purely technical or reference documentation, for example, can be suboptimal for novices: it may be incomplete, narrowly focused on individual commands, or assume knowledge that readers lack (Segal, 2007; Pawlik et al., 2015), and can thereby discourage potential users or inhibit the adoption of a tool. Although technical documentation is useful for developers, scientists are a central target audience for documentation of the DataLad ecosystem. A considerable part of this audience can be considered technical novices, for whom technical documentation is not ideal. Research also suggests that scientists need documentation that goes beyond reference manuals. In an analysis of user questions in support forums of scientific software packages, Swarts (2019) found that 80% of inquiries focused on operations and tasks, such as the sequence of operations required to achieve a specific goal, rather than on reference lists. Breaking down user questions by purpose, Swarts (2019) further found that users were most interested in descriptions of operations or tasks, followed by insights into the reasons behind an action. Separating documentation types into feature-based (more closely related to reference documentation) and task-based, Swarts (2019) reports twice as many questions seeking explanations for software with feature-based documentation as for software with task-based documentation. This hints at a disconnect between knowing how something should be done and knowing why it should be done this way.
Overall, this highlights that users of scientific software need more than documentation of individual commands: they seek to understand general usage principles and to master complex combinations of features to achieve specific goals. This type of empowerment is what the DataLad Handbook project aimed to achieve by complementing DataLad’s existing technical documentation.

Design Considerations

We identified three types of stakeholders with different needs: researchers, planners and trainers. Researchers need accessible educational content to understand and use the tool; planners, such as principal investigators or funders, need high-level, non-technical information in order to make informed yet efficient decisions on whether the tool fulfills their needs; and trainers need reliable, open access teaching material. Based on this assessment, the following goals for the Handbook’s contents were set:

Applicability for a broad audience: The Handbook should showcase domain-agnostic, real-world RDM applications.

Practical experience: The Handbook should enable code-along usage, with examples presented as code that users can copy, paste, and run on their own computer. To also allow read-only usage, the Handbook should show what the output of a given code execution looks like. For an optimal code-along or read-only experience, the code output should match the current software behavior.

Suitable for technical novices: The Handbook’s language should be accessible. Gradually, by explaining technical jargon and relevant tools or concepts in passing, it should provide readers with a broad set of relevant RDM skills rather than requiring prior knowledge.

Low barrier to entry: The Handbook’s contents should be organized in short, topical units to provide the possibility to re-read or mix and match.

Integrative workflows: The Handbook’s contents should build on one another and link back to content already introduced, to teach how different software features interact.

Empowering independent users: Instead of showcasing only successful code, the Handbook should also explicitly demonstrate common errors, enabling users to troubleshoot problems in their own use cases independently.

The following structure arose from this specification analysis (Wagner et al., 2020):

Introduction: The first part of the Handbook, covering high-level descriptions of the software and its features and detailed installation instructions for all operating systems.

Basics: The second part of the Handbook, written in the form of a continuous, code-along tutorial, set in a domain-agnostic fictional storyline about an RDM application, and covering all stable software features in chapters that build up on one another.

Advanced: The third part of the Handbook covering features beyond the basics in stand-alone chapters, added prior to the second release.

Use cases: The last part of the Handbook, containing short, standalone start-to-end descriptions of real-world use cases, with concise step-by-step instructions, and references to further reading in the Basics part.

Finally, the design and content requirements were accompanied by technical goals: from using expandable details to keep visible "core" text short and making the Handbook available in multiple formats, to developing the Handbook alongside the versioned software and using integration tests to ensure functioning of included code examples. The resulting implementation of the Handbook fulfilled these requirements as follows.

The Technical Backbone

The development environment of the Handbook was chosen to support the declared goals and to maximize the configurability, autonomy, and reusability of the project. It builds entirely on flexible and extensible open source infrastructure: at the highest level, it uses Sphinx as a documentation generator (sphinx-doc.org). Sphinx transforms documents written in reStructuredText, a lightweight markup language, into a variety of output formats, among them HTML, PDF, LaTeX, and EPUB. Initially a by-product of the Python documentation, it has been adopted by the open source community at large; in January 2024, GitHub’s dependency graph reported that it was used by more than 300,000 projects (footnote 2).

Sphinx supports an extension mechanism through which additional functionality can be integrated. Leveraging this mechanism, the Handbook project extended standard Sphinx features with custom admonitions and designs, for example toggle-able boxes for optional details. This is implemented as a Python package alongside the Handbook source code, making the Handbook project a reusable and installable Sphinx extension. Figure 1 provides an overview of the custom-developed design features. A major functional enhancement is provided by a separate Python package, autorunrecord, an additional custom-made Sphinx extension that allows sequential execution of code in a specified environment and embeds a record of the code and its output as code snippets in the documentation (footnote 3). Instructors can further use it to automatically create scripts from selected code blocks, which can then be demonstrated in a remote-controlled terminal in live-coding tutorials.
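The idea of executing commands in sequence and recording them together with their output can be illustrated with a minimal Python sketch. This is not the autorunrecord implementation and does not use its interface; the names `Record`, `run_and_record`, `as_snippet`, and `as_script` are hypothetical, and the sketch assumes that commands are plain shell command lines run in one shared working directory.

```python
import subprocess
import tempfile
from dataclasses import dataclass


@dataclass
class Record:
    """One executed command together with what it printed (hypothetical type)."""
    command: str
    output: str
    returncode: int


def run_and_record(commands, workdir):
    """Run shell commands one after another in the same working directory.

    Because all commands share `workdir`, files created by an earlier
    command are visible to later ones, as in a continuous tutorial.
    """
    records = []
    for command in commands:
        done = subprocess.run(
            command,
            shell=True,                # run the line as a reader would type it
            cwd=workdir,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep error messages in the record
            text=True,
        )
        records.append(Record(command, done.stdout, done.returncode))
    return records


def as_snippet(records):
    """Format records like a terminal session: '$ command', then its output."""
    lines = []
    for rec in records:
        lines.append(f"$ {rec.command}")
        if rec.output:
            lines.append(rec.output.rstrip("\n"))
    return "\n".join(lines)


def as_script(records):
    """Collect only the commands, for example as a script for a live demo."""
    return "\n".join(rec.command for rec in records) + "\n"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        session = run_and_record(["echo hello", "echo world"], tmp)
        print(as_snippet(session))
```

In this sketch, `as_snippet` corresponds to the code-plus-output records embedded in the text, and `as_script` to the scripts instructors can generate for live coding; how autorunrecord actually selects and annotates code blocks is not shown here.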

Fig. 1

Custom admonitions and code blocks used in the Handbook. In each pair of admonitions, the top image corresponds to the web version, and the bottom image corresponds to its PDF rendering. Windows-wits (green), toggle-able in the HTML version, contain information that is only relevant for the Windows operating system (DataLad supports GNU/Linux, macOS, and Windows, but the latter is fundamentally different from the other two, sometimes leading to different behaviour or necessitating workarounds when using DataLad). Find-out-more admonitions (orange), also toggle-able in the HTML version, contain miscellaneous extra information for curious readers. Git user notes (blue) are colored boxes with references to the underlying tools of DataLad, intended for advanced Git users as a comparison or technical explanation. Code blocks show one or more commands and the resulting output, provided using the autorunrecord Sphinx extension. In the web version, a copy button (top right corner) allows readers to copy the relevant commands to the clipboard. Internal annotations allow generating custom scripts from any sequence of code blocks for live-coding demonstrations.

Hosting for the project is provided by Read the Docs (readthedocs.org), a full-featured software documentation deployment platform that integrates with Sphinx. Notably, it supports semantic versioning of documentation, which helps to ensure that users of a past software version can find the corresponding version of the conjointly developed Handbook. Illustrations in the Handbook are based on the undraw project by Katerina Limpitsouni (undraw.co).

The ability of the documentation to sequentially execute code and record its outcomes allows using the Handbook as an integration test for the DataLad software in addition to a user guide. If new software developments in the DataLad core packages break documented workflows, a continuous integration test suite will fail, alerting developers to the fact that their changes break user workflows.
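A minimal sketch of this integration-test idea follows, assuming that each documented step is a shell command paired with the exit code the text expects. The names `WorkflowBroken` and `check_workflow` are hypothetical, and the real Handbook build works through its Sphinx extension rather than a standalone function like this one. Allowing a non-zero expected exit code reflects the design goal of deliberately demonstrating common errors.

```python
import subprocess


class WorkflowBroken(Exception):
    """Raised when a documented command no longer behaves as documented."""


def check_workflow(steps, workdir="."):
    """Run (command, expected_exit_code) pairs in order.

    Stops at the first step whose exit code differs from the documented
    one and raises WorkflowBroken. Returns the number of verified steps.
    """
    for position, (command, expected) in enumerate(steps, start=1):
        done = subprocess.run(
            command,
            shell=True,
            cwd=workdir,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        if done.returncode != expected:
            raise WorkflowBroken(
                f"step {position} exited with {done.returncode}, "
                f"expected {expected}: {command}\n{done.stdout}"
            )
    return len(steps)


if __name__ == "__main__":
    # The second step is a deliberately failing command, documented as such.
    print(check_workflow([("echo ok", 0), ("exit 3", 3)]))
```

When such a check runs in a continuous integration job, an uncaught `WorkflowBroken` ends the process with a non-zero status, which is what marks the job as failed and alerts developers.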

To ensure reusability, such as the adaptation by Brooks et al. (2021), the project is released under a CC-BY-SA 4.0 license. Under its terms, all elements can be reused in original or derived form for all purposes under the condition that the original project is attributed and that derivative work is shared under an identical (“not more restrictive”) license (footnote 4).

Content

As of January 2024, the web and PDF versions of the Handbook were organized into four parts – “Introduction”, “Basics”, “Advanced”, and “Use cases” – which comprised a total of 21 chapters. The “Introduction” part has two different target audiences: first, it provides researchers with detailed installation instructions, a basic general command line tutorial, and an overview of the Handbook. Beyond this, it gives a high-level overview of the software and its capabilities to planners.

The “Basics” part is organized into nine chapters. Following a narrative about a fictional college course on RDM, it teaches different aspects of DataLad functionality and general RDM to researchers in each topical chapter. Broadly, those topics can be summarized as follows: 1) Local version control, 2) Capturing and re-executing process provenance, 3) Data integrity, 4) Collaboration and distributed version control, 5) Configuration, 6) Reproducible data analysis, 7) Computationally reproducible data analysis, 8) Data publication, and 9) Error management.

The “Advanced” part includes independent chapters on advanced DataLad features and workflows, big data projects, DataLad use on computational clusters, DataLad’s internals, and selected DataLad extensions. The latter two parts are accompanied by code demonstrations, slides, executable notebooks, and/or video tutorials that trainers can reuse freely to teach tool use and improve scientific practice. The last part, “Use cases”, targets planners and researchers with short step-by-step instructions that show planners what is possible and help researchers connect their knowledge into larger workflows.

Project and Community Management

Ensuring the longevity of software projects beyond the duration of individual researchers’ contracts requires community building (Koehler Leman et al., 2020). A user-driven alternative to documentation by software developers, “Documentation Crowdsourcing”, has been successfully employed by the NumPy project (Pawlik et al., 2015). The Handbook project extends this concept beyond reference documentation. To achieve this, it is set up to encourage and welcome improvements by external contributors. The project is openly hosted on GitHub. Mirroring processes in larger crowd-sourced documentation projects such as “The Turing Way handbook for reproducible, ethical and collaborative research” (The Turing Way Community, 2022), credit is given for both code-based and non-code-based contributions. Contributors are recognized in the source repository, on the DataLad Website, and as co-authors in both the printed version of the Handbook and its Zenodo releases. As of January 2024, a total of 60 contributors provided input in the form of content, bug fixes, or infrastructure improvements.

Paperback Version

A digest of the Handbook was published via the Kindle Direct Publishing (KDP) print-on-demand service to make the Handbook available as a printed paperback. This met user demand for physical copies of the documentation and required minimal additional technical work, building on the automatically generated LaTeX sources of the Handbook. The printed book’s contents were sub-selected for longevity, graphics and graphical in-text elements were optimized for black-and-white printing, and a dedicated hyperlink index was created.


NEUROINFORMATICS
