Retrieving Data from Static URLs
Often we also want to retrieve data from URLs. For example, many genomic resources are available for download from the UCSC or Ensembl websites as static URLs. We want a record of where these files come from in the Data Manifest, so we want to combine a download with an `sdf add`. This is where `sdf get` and `sdf bulk` come in handy.
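For comparison, `sdf get` essentially wraps the manual two-step workflow of downloading a file and then registering it by hand. A rough sketch of that manual approach (using `wget` purely for illustration, and assuming `sdf add` is passed the downloaded filename):
$ wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz
$ sdf add Homo_sapiens.GRCh38.cds.all.fa.gz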
Downloading Data from URLs: `sdf get`
The command `sdf get` does all of this for you. Let's imagine you want to get all human coding sequences; you could do this with:
$ sdf get https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz
⠄ [================> ] 9639693/22716351 (42%) eta 00:00:08
Now the file shows up in the Data Manifest:
$ sdf status --remotes
Project data status:
0 files local and tracked by a remote (0 files only local, 0 files only remote), 1 files total.
[data > Zenodo]
Homo_sapiens.GRCh38.cds.all.fa.gz current, untracked fb59b3ad 2023-09-01 3:13PM (43 seconds ago) not on remote
Note that files downloaded from URLs are not automatically tracked with remotes.
You can do this with `sdf track <FILENAME>` if you want. Then, you can use `sdf push` to upload this same file to Zenodo or FigShare.
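For example, to track and then push the file downloaded above (a minimal sketch, assuming a remote like Zenodo has already been linked, as in the status output shown earlier):
$ sdf track Homo_sapiens.GRCh38.cds.all.fa.gz
$ sdf push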
Bulk Downloading Data with `sdf bulk`
Since modern computational projects may require downloading hundreds or even thousands of annotation files, the `sdf` tool has a simple way to do this: tab-delimited or comma-separated value files (e.g. those with suffixes `.tsv` and `.csv`, respectively). The big-picture idea of SciDataFlow is that it should take mere seconds to pull in all data needed for a large genomics project (or astronomy, or ecology, whatever). Here's an example TSV file full of links:
$ cat human_annotation.tsv
type url
cdna https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
fasta https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.alt.fa.gz
cds https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz
Note that this has a header, and the URLs are in the second column. To get this data, we'd use:
$ sdf bulk human_annotation.tsv --column 2 --header
⠁ [ ] 0/2 (0%) eta 00:00:00
⠉ [====> ] 9071693/78889691 (11%) eta 00:01:22
⠐ [=========> ] 13503693/54514783 (25%) eta 00:00:35
Column indices are one-indexed, and `sdf bulk` assumes no header by default. Note that in this example, only two files were downloaded; this is because `sdf` detected that the CDS file already existed. SciDataFlow tells you this with a little message at the end:
$ sdf bulk human_annotation.tsv --column 2 --header
3 URLs found in 'human_annotation.tsv.'
2 files were downloaded, 2 added to manifest (0 were already registered).
1 files were skipped because they existed (and --overwrite was not specified).
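By contrast, if your file of links had no header and bare URLs in the first column (say, a hypothetical urls.csv with one URL per line), the defaults alone would suffice:
$ sdf bulk urls.csv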
Note that one can also download files from URLs that are in the Data Manifest. Suppose that you clone a repository that has no remotes, but each file entry has a URL set. Those can be retrieved with:
$ sdf pull --urls # if you want to overwrite any local files, use --overwrite
These may or may not be tracked; tracking only indicates whether to also manage them with a remote like Zenodo or FigShare. In cases where the data file can be reliably retrieved from a stable source (e.g. a website like the UCSC Genome Browser or Ensembl), you may not want to duplicate it by also tracking it. If you want to pull in everything, use:
$ sdf pull --all