Retrieving Data from Remotes

A key feature of SciDataFlow is that it can quickly reunite a project's code repository with its data. Imagine a colleague has a small repository containing the code to lift a recombination map over to a new reference genome, and you'd like to use her methods. However, you also want to check that you can reproduce her pipeline on your system, which first involves re-downloading all the input data (in this case, the original recombination map and liftover files).

First, you'd clone the repository:

$ git clone git@github.com:mclintock/maize_liftover
$ cd maize_liftover/

Then, as long as a data_manifest.yml exists in the root project directory (maize_liftover/ in this example), SciDataFlow is initialized. You can verify this by using:

$ sdf status --remotes
Project data status:
1 file local and tracked by a remote (0 files only local, 0 files only remote), 1 file total.

[data > Zenodo]
 recmap_genome_v1.tsv      deleted, tracked      7ef1d10a            exists on remote
 recmap_genome_v2.tsv      deleted, tracked      e894e742            exists on remote

Now, to retrieve these files, all you'd need to do is:

$ sdf pull 
Downloaded 1 file.
 - population_sizes.tsv
Skipped 0 files. Reasons:

Note that if you run sdf pull again, it will not redownload the file (this is to avoid overwriting the local version, should it have been changed):

$ sdf pull
No files downloaded.
Skipped 1 files. Reasons:
  Remote file is identical to local file: 1 file
   - population_sizes.tsv

If the file has changed locally, you can pull in the remote's version with sdf pull --overwrite. However, pulls are still lazy: a file will not be re-downloaded if its MD5 is the same in the remote and local versions.
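
For example, to replace a locally modified copy of population_sizes.tsv with the remote's version, you'd run:

$ sdf pull --overwrite

(The exact output is omitted here, since it depends on which of your local files differ from their remote copies.)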

Downloads with SciDataFlow are fast and concurrent, thanks to the Tokio Rust Asynchronous Universal download MAnager crate. If your project has a lot of data across multiple remotes, SciDataFlow will pull it all in as quickly as possible.