Introduction to SciDataFlow
First, make sure you have installed and configured SciDataFlow.
An Important Warning
⚠️Warning: SciDataFlow does not do data versioning. Unlike Git, it does not keep an entire history of data at each commit. Thus, data backup must be managed by separate software. SciDataFlow is still in alpha phase, so it is especially important you backup your data before using SciDataFlow. A tiny, kind reminder: you as a researcher should be doing routine backups already — losing data due to either a computational mishap or hardware failure is always possible.
The user interacts with the Data Manifest through the fast and concurrent command line tool sdf, written in the inimitable Rust language. The sdf tool has a Git-like interface.
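The tool's functionality is organized into subcommands, and the usual --help flag should list them (assuming the standard help convention of Rust command line tools; this is a generic check, not output specific to sdf):
$ sdf --help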
If you know Git, using sdf will be easy. For example, to initialize SciDataFlow for a project you'd use:
$ sdf init
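This creates the project's Data Manifest, a plain YAML file (typically data_manifest.yml) in the project directory, which you can confirm with an ordinary directory listing:
$ ls -a    # the manifest file should now sit alongside your project files
Nothing is tracked yet at this point; files only enter the manifest when you explicitly add them.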
Registering a file in the manifest:
$ sdf add data/population_sizes.tsv
Added 1 file.
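The short hash that sdf status reports for each file (e.g. 3fba1fc3 below) is an abbreviation of the file's content checksum. If you want to double-check it by hand, a standard checksum utility works; this is an independent sanity check, not an sdf command, and it assumes the manifest stores MD5 digests:
$ md5sum data/population_sizes.tsv    # the leading characters should match the abbreviated hash shown by sdf status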
To check whether a file has changed, we'd use sdf status:
$ sdf status
Project data status:
0 files on local and remotes (1 file only local, 0 files only remote), 1 file total.
[data]
population_sizes.tsv current 3fba1fc3 2023-09-01 10:38AM (53 seconds ago)
Now, let's imagine a pipeline runs and changes this file:
$ bash tools/computational_pipeline.sh # changes data
$ sdf status
Project data status:
0 files on local and remotes (1 file only local, 0 files only remote), 1 file total.
[data]
population_sizes.tsv changed 3fba1fc3 → 8cb9d10b 2023-09-01 10:48AM (1 second ago)
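Here tools/computational_pipeline.sh stands in for whatever analysis step rewrites the tracked file; a purely hypothetical stand-in could be as small as:
$ cat tools/computational_pipeline.sh
#!/bin/bash
# hypothetical example: any command that rewrites data/population_sizes.tsv counts as a change
set -euo pipefail
awk 'NR > 1 { sizes[$1] += $2 } END { for (p in sizes) print p "\t" sizes[p] }' \
    raw/sample_counts.tsv > data/population_sizes.tsv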
If these changes are good, we can tell the Data Manifest it should update its record of this version:
$ sdf update data/population_sizes.tsv
$ sdf status
Project data status:
0 files on local and remotes (1 file only local, 0 files only remote), 1 file total.
[data]
population_sizes.tsv current 8cb9d10b 2023-09-01 10:48AM (6 minutes ago)
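In day-to-day use this becomes a short loop, using only the commands shown above: rerun the pipeline, check what changed, and update the manifest once you're happy with the result:
$ bash tools/computational_pipeline.sh
$ sdf status
$ sdf update data/population_sizes.tsv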