Workflow

Workflow is a base R6 class representing a bioinformatic workflow run from a UMCCR workflow manager.

A workflow has:

a directory path with all the raw output files (either on GDS, S3 or local filesystem)
a subset of files that are of interest for ingestion
- tibble with full path and basename columns
a set of parsers that can parse and tidy those files
- each parser takes a path and returns a tidy tibble
a list of tidy tibbles (or a tibble with nested tibbles)

Public fields

path: Path to directory with raw workflow results (from GDS, S3, or local filesystem).
wname: Name of workflow (e.g. umccrise, sash).
filesystem: Filesystem of path (gds/s3/local).
regexes: Tibble with file regex and function to parse it.

Methods

Method `new()`

Create a new Workflow object.

Usage

Wf$new(path = NULL, wname = NULL, regexes = NULL)

Arguments

path: Path to directory with raw workflow results.
wname: Name of workflow.
regexes: Tibble with file regex and function to parse it.

Method `print()`

Print details about the Workflow.

Usage

Wf$print(...)

Arguments

...: (ignored).

Method `list_files()`

List all files under given path.

Usage

Wf$list_files(
  path = self$path,
  max_files = 1000,
  ica_token = Sys.getenv("ICA_ACCESS_TOKEN"),
  ...
)

Arguments

path: Path with raw results.
max_files: Max number of files to list (for gds/s3 only).
ica_token: ICA access token (def: $ICA_ACCESS_TOKEN env var).
...: Passed on to gds_list_files_dir function.

Method `list_files_filter_relevant()`

List dracarys files under given path

Usage

Wf$list_files_filter_relevant(
  path = self$path,
  max_files = 1000,
  ica_token = Sys.getenv("ICA_ACCESS_TOKEN"),
  ...
)

Arguments

path: Path with raw results.
max_files: Max number of files to list (for gds/s3 only).
ica_token: ICA access token (def: $ICA_ACCESS_TOKEN env var).
...: Passed on to the gds_list_files_filter_relevant or the s3_list_files_filter_relevant function.

Method `download_files()`

Download files from GDS/S3 to local filesystem.

Usage

Wf$download_files(
  path = self$path,
  outdir,
  ica_token = Sys.getenv("ICA_ACCESS_TOKEN"),
  max_files = 1000,
  dryrun = FALSE,
  recursive = NULL
)

Arguments

path: Path with raw results.
outdir: Path to output directory.
ica_token: ICA access token (def: $ICA_ACCESS_TOKEN env var).
max_files: Max number of files to list.
dryrun: If TRUE, just list the files that will be downloaded (don't download them).
recursive: Should files be returned recursively in and under the specified GDS directory, or only directly in the specified GDS directory (def: TRUE via ICA API).

Method `tidy_files()`

Tidy given files.

Usage

Wf$tidy_files(x)

Arguments

x: Tibble with localpath to file and the function type to parse it.

Method `write()`

Write tidy data.

Usage

Wf$write(x, outdir = NULL, prefix = NULL, format = "tsv", drid = NULL)

Arguments

x: Tibble with tidy data list-column.
outdir: Directory path to output tidy files.
prefix: Prefix of output files.
format: Format of output files.
drid: dracarys ID to use for the dataset (e.g. wfrid.123, prid.456).

Method `clone()`

The objects of this class are cloneable with this method.

Usage

Wf$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

Examples

if (FALSE) { # \dontrun{
regexes <- tibble::tribble(
  ~regex, ~fun,
  "-chord\\.tsv\\.gz$", "UmChordTsvFile",
  "-hrdetect\\.tsv\\.gz$", "UmHrdetectTsvFile",
  "-snv_2015\\.tsv\\.gz$", "UmSigsSnvFile",
  "-snv_2020\\.tsv\\.gz$", "UmSigsSnvFile",
  "-dbs\\.tsv\\.gz$", "UmSigsDbsFile",
  "-indel\\.tsv\\.gz$", "UmSigsIndelFile",
  "-qc_summary\\.tsv\\.gz$", "UmQcSumFile",
)

#---- LOCAL ----#
p1_local <- "~/icav1/g/production/analysis_data"
p <- file.path(p1_local, "SBJ01155/umccrise/202408300c218043/L2101566__L2101565")
um1 <- Wf$new(path = p, wname = "umccrise", regexes = regexes)
um1$list_files(max_files = 10)
um1$list_files_filter_relevant(max_files = 10)

#---- GDS ----#
p1_gds <- "gds://production/analysis_data"
p <- file.path(p1_gds, "SBJ03043/umccrise/20240830ec648f40/L2300064__L2300063")
outdir <- file.path(sub("gds:/", "~/icav1/g", p))
token <- Sys.getenv("ICA_ACCESS_TOKEN")
um2 <- Wf$new(path = p, wname = "umccrise", regexes = regexes)
um2$list_files(max_files = 10)
um2$list_files_filter_relevant(ica_token = token, max_files = 500)
d <- um2$download_files(
  outdir = outdir, ica_token = token,
  max_files = 1000, dryrun = T
)
d_tidy <- um2$tidy_files(d)

#---- S3 ----#
p1_s3 <- "s3://org.umccr.data.oncoanalyser/analysis_data/SBJ05570/sash/202408275fce06c3"
p2_s3 <- "L2401304_L2401303/SBJ05570_MDX240299/cancer_report/cancer_report_tables"
p <- file.path(p1_s3, p2_s3)
outdir <- sub("s3:/", "~/s3", p)
um3 <- Wf$new(path = p, wname = "sash", regexes = regexes)
um3$list_files(max_files = 10)
um3$list_files_filter_relevant(max_files = 50)
d <- um3$download_files(outdir = outdir, regexes = regexes, max_files = 50, dryrun = F)
} # }

Public fields

Methods

Public methods

Method new()

Usage

Arguments

Method print()

Usage

Arguments

Method list_files()

Usage

Arguments

Method list_files_filter_relevant()

Usage

Arguments

Method download_files()

Usage

Arguments

Method tidy_files()

Usage

Arguments

Method write()

Usage

Arguments

Method clone()

Usage

Arguments

Examples

Method `new()`

Method `print()`

Method `list_files()`

Method `list_files_filter_relevant()`

Method `download_files()`

Method `tidy_files()`

Method `write()`

Method `clone()`