Retrieving Data from Remotes
A key feature of SciDataFlow is that it can quickly reunite a project's code repository with its data. Imagine a colleague has a small repository containing the code to lift a recombination map over to a new reference genome, and you'd like to use her methods. However, you also want to check that you can reproduce her pipeline on your own system, which first involves re-downloading all the input data (in this case, the original recombination map and liftover files).
First, you'd clone the repository:
$ git clone git@github.com:mclintock/maize_liftover
$ cd maize_liftover/
Then, as long as a data_manifest.yml exists in the root project directory (maize_liftover/ in this example), SciDataFlow is initialized. You can verify this by using:
$ sdf status --remotes
Project data status:
1 file local and tracked by a remote (0 files only local, 0 files only remote), 1 file total.
[data > Zenodo]
recmap_genome_v1.tsv deleted, tracked 7ef1d10a exists on remote
recmap_genome_v2.tsv deleted, tracked e894e742 exists on remote
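Here, deleted, tracked means the manifest knows about both recmap files and that they exist on the Zenodo remote, but there are no local copies yet: cloning retrieved only the code. Since the manifest is an ordinary YAML file in the project root, you can also inspect it directly; this is a minimal sketch with standard shell tools, not something sdf requires:
# the manifest travels with the Git repository, so plain shell tools work on it
$ head data_manifest.yml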
Now, to retrieve these files, all you'd need to do is:
$ sdf pull
Downloaded 1 file.
- population_sizes.tsv
Skipped 0 files. Reasons:
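SciDataFlow decides what to transfer by comparing checksums, so you can also double-check a retrieved file yourself with standard tools. A minimal sketch: the data/ path below is an assumption based on the [data > Zenodo] remote name, and the short hashes in the sdf status output above look like truncated MD5s, so compare the leading characters by eye:
# compute the full MD5 of the local copy; the data/ prefix is hypothetical --
# use whatever path the file actually landed in
$ md5sum data/population_sizes.tsv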
Note that if you run sdf pull again, it will not redownload the file (this is to avoid overwriting the local version, should it have been changed):
$ sdf pull
No files downloaded.
Skipped 1 files. Reasons:
Remote file is identical to local file: 1 file
- population_sizes.tsv
If the file has changed, you can pull in the remote's version with sdf pull --overwrite. However, sdf pull is also lazy; it will not download the file if the MD5s haven't changed between the remote and local versions.
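Putting those two behaviors together, deliberately replacing an edited local copy with the remote's version looks like the sketch below; only sdf pull and sdf pull --overwrite are real commands from above, and the appended line merely stands in for any local edit (the data/ path is, again, hypothetical):
# suppose the local copy was changed (this edit is purely illustrative)
$ echo "scratch note" >> data/population_sizes.tsv
# a plain pull stays lazy and will not clobber the modified local copy
$ sdf pull
# explicitly opt in to overwriting it with the remote's version
$ sdf pull --overwrite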
Downloads with SciDataFlow are fast and concurrent, thanks to the Tokio Rust Asynchronous Universal download MAnager (trauma) crate. If your project has a lot of data spread across multiple remotes, SciDataFlow will pull it all in as quickly as possible.