{tidywigits} is an R package that parses and tidies outputs from the WiGiTS suite of genome and transcriptome analysis tools for cancer research and diagnostics, created by the Hartwig Medical Foundation.
In short, it traverses through a directory containing results from one or more runs of WiGiTS tools, parses any files it recognises, tidies them up (which includes data reshaping, normalisation, column name cleanup etc.), and writes them to the output format of choice e.g. Apache Parquet, PostgreSQL, Delta table.
{tidywigits} is built on top of R’s R6 encapsulated object-oriented programming implementation, which helps with code organisation. It consists of several base classes like Config, Tool, and Workflow, which we describe below. Each R6 class can contain public and private functions and non-functions (fields).
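As a rough illustration of public and private members in R6, here is a minimal sketch. The Counter class, its fields and its methods are hypothetical names invented for this example and are not part of {tidywigits}.

```r
library(R6)

# Hypothetical class, for illustration only (not part of {tidywigits}).
Counter <- R6Class(
  "Counter",
  public = list(
    # public field
    name = NULL,
    initialize = function(name) {
      self$name <- name
    },
    # public method: updates private state and returns the object invisibly
    add = function(n = 1) {
      private$count <- private$count + n
      invisible(self)
    },
    # public method: read-only access to the private field
    total = function() {
      private$count
    }
  ),
  private = list(
    # private field: not reachable as cnt$count from outside the class
    count = 0
  )
)

cnt <- Counter$new("demo")
cnt$add(2)$add(3) # methods can be chained because add() returns self
cnt$name
cnt$total()
```

Public members are reached with $ on the object, while private members are only visible to the class's own methods through private$.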
Config
A Config object contains functionality for interacting with YAML configuration files that are part of {tidywigits}. These configuration files (under inst/config) specify the schemas, types, patterns and field descriptions for the raw input files and tidy output tbls. See ?Config.
raw
Let’s look at some of the information for the raw PURPLE config, for instance:
library(tidywigits)
library(glue) # glue() is used in the table captions
tool <- "purple"
toolu <- toupper(tool)
conf <- Config$new(tool)
conf
#> #--- Config purple ---#
#>
#> |var |value |
#> |:-----|:------|
#> |tool |purple |
#> |nraw |10 |
#> |ntidy |10 |
You can access the individual fields in the classic R list-like manner, using the $ sign.
Patterns are used to fish out the relevant files from a directory listing. Note that each \ needs to be doubled in R code, since the backslash is itself the escape character in R strings.
conf$get_raw_patterns() |>
knitr::kable(caption = glue("{toolu} raw file patterns."))
name | value |
---|---|
cnvgenetsv | \.purple\.cnv\.gene\.tsv$ |
cnvsomtsv | \.purple\.cnv\.somatic\.tsv$ |
drivercatalog | \.purple\.driver\.catalog\.(germline|somatic)\.tsv$ |
germdeltsv | \.purple\.germline\.deletion\.tsv$ |
purityrange | \.purple\.purity\.range\.tsv$ |
puritytsv | \.purple\.purity\.tsv$ |
qc | \.purple\.qc$ |
somclonality | \.purple\.somatic\.clonality\.tsv$ |
somhist | \.purple\.somatic\.hist\.tsv$ |
version | ^purple\.version$ |
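To see why the backslashes matter, the sketch below matches one of the patterns against a few file names using base R's grepl. The file names are made up for illustration, and this is not necessarily how {tidywigits} performs the matching internally.

```r
# Hypothetical file names, for illustration only.
files <- c(
  "sampleA.purple.qc",
  "sampleA.purple.purity.tsv",
  "sampleA.purple.purity.range.tsv",
  "sampleAxpurplexqc"
)

# "\\." in R source is the regex \. (a literal dot); "$" anchors the end.
escaped <- "\\.purple\\.qc$"
# An unescaped "." matches any single character, so this is too permissive.
unescaped <- ".purple.qc$"

files[grepl(escaped, files)]
files[grepl(unescaped, files)]

# cat() prints the string as the regex engine sees it, with single backslashes.
cat(escaped, "\n")
```

The escaped pattern keeps only the real QC file, whereas the unescaped one also accepts the last, malformed name.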
File descriptions are based on the Hartwig documentation.
conf$get_raw_descriptions() |>
knitr::kable(caption = glue("{toolu} raw file descriptions."))
name | value |
---|---|
cnvgenetsv | Copy number alterations of each gene in the HMF gene panel. |
cnvsomtsv | Copy number profile of all (contiguous) segments of the tumor sample. |
drivercatalog | Significant amplifications and deletions that occur in the HMF gene panel. |
germdeltsv | Germline deletions. |
purityrange | Best fit per purity sorted by score. |
puritytsv | Purity/ploidy fit summary. |
qc | QC metrics. |
somclonality | Clonality peak model data. |
somhist | Somatic variant histogram data. |
version | Version of the tool. |
Versions are used to distinguish changes in schema between individual tool versions. For example, after LINX v1.25, several columns were dropped from the breakends table, which is reflected in the available LINX schemas. For now we use latest as the default version, based on the most recent schema tests; any schema that differs from it is labelled with the version of the tool that generated the file.
conf$get_raw_versions() |>
knitr::kable(caption = glue("{toolu} raw file versions."))
name | value |
---|---|
cnvgenetsv | latest |
cnvsomtsv | latest |
drivercatalog | latest |
germdeltsv | latest |
purityrange | latest |
puritytsv | latest |
qc | latest |
somclonality | latest |
somhist | latest |
version | latest |
The raw schemas specify the column name and type (e.g. character (c), integer (i), float/double (d)) for each input file (just showing a couple below):
(s <- conf$get_raw_schemas_all())
#> # A tibble: 10 × 3
#> name version schema
#> <chr> <chr> <list>
#> 1 cnvgenetsv latest <tibble [18 × 2]>
#> 2 cnvsomtsv latest <tibble [16 × 2]>
#> 3 drivercatalog latest <tibble [17 × 2]>
#> 4 germdeltsv latest <tibble [16 × 2]>
#> 5 purityrange latest <tibble [6 × 2]>
#> 6 puritytsv latest <tibble [25 × 2]>
#> 7 qc latest <tibble [2 × 2]>
#> 8 somclonality latest <tibble [6 × 2]>
#> 9 somhist latest <tibble [3 × 2]>
#> 10 version latest <tibble [2 × 2]>
s |>
dplyr::filter(name == "puritytsv") |>
dplyr::select("schema") |>
tidyr::unnest("schema")
#> # A tibble: 25 × 2
#> field type
#> <chr> <chr>
#> 1 purity d
#> 2 normFactor d
#> 3 score d
#> 4 diploidProportion d
#> 5 ploidy d
#> 6 gender c
#> 7 status c
#> 8 polyclonalProportion d
#> 9 minPurity d
#> 10 maxPurity d
#> # ℹ 15 more rows
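The single-letter type codes can be turned into column types when reading a file. The sketch below does this with base R's read.delim rather than {readr}, to stay dependency-free. The type_map lookup and the TSV content are assumptions made for illustration, not the package's own code.

```r
# A two-column schema in the same field/type layout as the tibbles above.
schema <- data.frame(
  field = c("purity", "gender"),
  type = c("d", "c")
)

# Assumed lookup from single-letter codes to base R column classes.
type_map <- c(c = "character", i = "integer", d = "numeric")
col_classes <- setNames(unname(type_map[schema$type]), schema$field)

# Made-up TSV content, for illustration only.
tsv <- "purity\tgender\n0.8700\tMALE\n"
x <- read.delim(text = tsv, colClasses = col_classes)
str(x)
```

Declaring the types up front avoids type guessing, so a column such as purity is always read as numeric regardless of how the values happen to be formatted.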
tidy
Now let’s look at some of the information in the tidy PURPLE config. The difference between raw and tidy configs is mostly in the column names (they are standardised to lowercase separated by underscores, i.e. snake_case), and some raw files get split into multiple tidy tables (e.g. for normalisation purposes).
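As a rough sketch of the snake_case convention, the hypothetical helper below converts camelCase names mechanically. In {tidywigits} the tidy names come from the configuration files, and some are renamed beyond a mechanical conversion (for example AmberGender becomes gender_amber), so this only illustrates the general idea.

```r
# Hypothetical helper: insert "_" at each lower-to-upper boundary, then lowercase.
to_snake_case <- function(x) {
  tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", x))
}

to_snake_case(c("normFactor", "diploidProportion", "minPurity"))
```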
Tidy descriptions are the same as the raw descriptions for now.
conf$get_tidy_descriptions() |>
knitr::kable(caption = glue("{toolu} tidy file descriptions."))
name | value |
---|---|
cnvgenetsv | Copy number alterations of each gene in the HMF gene panel. |
cnvsomtsv | Copy number profile of all (contiguous) segments of the tumor sample. |
drivercatalog | Significant amplifications and deletions that occur in the HMF gene panel. |
germdeltsv | Germline deletions. |
purityrange | Best fit per purity sorted by score. |
puritytsv | Purity/ploidy fit summary. |
qc | QC metrics. |
somclonality | Clonality peak model data. |
somhist | Somatic variant histogram data. |
version | Version of the tool. |
(s <- conf$get_tidy_schemas_all())
#> # A tibble: 10 × 4
#> name version tbl schema
#> <chr> <chr> <chr> <list>
#> 1 cnvgenetsv latest tbl1 <tibble [18 × 3]>
#> 2 cnvsomtsv latest tbl1 <tibble [16 × 3]>
#> 3 drivercatalog latest tbl1 <tibble [17 × 3]>
#> 4 germdeltsv latest tbl1 <tibble [16 × 3]>
#> 5 purityrange latest tbl1 <tibble [6 × 3]>
#> 6 puritytsv latest tbl1 <tibble [25 × 3]>
#> 7 qc latest tbl1 <tibble [12 × 3]>
#> 8 somclonality latest tbl1 <tibble [6 × 3]>
#> 9 somhist latest tbl1 <tibble [3 × 3]>
#> 10 version latest tbl1 <tibble [2 × 3]>
s |>
dplyr::filter(.data$name == "puritytsv") |>
dplyr::select("schema") |>
tidyr::unnest("schema")
#> # A tibble: 25 × 3
#> field type description
#> <chr> <chr> <chr>
#> 1 purity d purity of tumor in the sample
#> 2 norm_factor d internal factor to convert tumor ratio to cn.
#> 3 score d score of fit (lower is better)
#> 4 diploid_proportion d proportion of cn regions that have 1 (+- 0.2) mi…
#> 5 ploidy d average ploidy of the tumor sample after adjusti…
#> 6 gender c one of male or female
#> 7 status c either pass or one or more warning or fail status
#> 8 polyclonal_proportion d proportion of copy number regions that are more …
#> 9 min_purity d minimum purity with score within 10% of best
#> 10 max_purity d maximum purity with score within 10% of best
#> # ℹ 15 more rows
Tool
Tool is the main organisation class for all file parsers and tidiers. It contains functions for parsing and tidying typical CSV/TSV files (with column names), and TXT files where the column names are missing. Currently it utilises the very simple readr::read_delim function from the {readr} package that reads all the data into memory. See ?Tool.
These simple parsers are used in 80-90% of cases, so in the future we can optimise the parsing if needed with faster packages such as {data.table}, {duckdb-r} or {neo-r-polars}.
We can have different Tool child classes that inherit (or override) functions and fields from the Tool parent class. For example, we can create a Tool object for PURPLE as follows:
- Initialise a Purple object:
ppl_path <- system.file("extdata/purple", package = "tidywigits")
ppl <- Purple$new(path = ppl_path)
# each class comes with a print function
ppl
#> #--- Tool purple ---#
#>
#> |var |value |
#> |:------|:---------------------------------------------------------|
#> |name |purple |
#> |path |/home/runner/work/_temp/Library/tidywigits/extdata/purple |
#> |files |11 |
#> |tidied |FALSE |
- Its Config object is also constructed based on the name supplied; this is used internally to find files of interest and infer their schemas:
ppl$config
#> #--- Config purple ---#
#>
#> |var |value |
#> |:-----|:------|
#> |tool |purple |
#> |nraw |10 |
#> |ntidy |10 |
ppl$config$get_raw_patterns()
#> # A tibble: 10 × 2
#> name value
#> <chr> <chr>
#> 1 cnvgenetsv "\\.purple\\.cnv\\.gene\\.tsv$"
#> 2 cnvsomtsv "\\.purple\\.cnv\\.somatic\\.tsv$"
#> 3 drivercatalog "\\.purple\\.driver\\.catalog\\.(germline|somatic)\\.tsv$"
#> 4 germdeltsv "\\.purple\\.germline\\.deletion\\.tsv$"
#> 5 purityrange "\\.purple\\.purity\\.range\\.tsv$"
#> 6 puritytsv "\\.purple\\.purity\\.tsv$"
#> 7 qc "\\.purple\\.qc$"
#> 8 somclonality "\\.purple\\.somatic\\.clonality\\.tsv$"
#> 9 somhist "\\.purple\\.somatic\\.hist\\.tsv$"
#> 10 version "^purple\\.version$"
ppl$config$get_raw_schema("puritytsv")
#> # A tibble: 25 × 2
#> field type
#> <chr> <chr>
#> 1 purity d
#> 2 normFactor d
#> 3 score d
#> 4 diploidProportion d
#> 5 ploidy d
#> 6 gender c
#> 7 status c
#> 8 polyclonalProportion d
#> 9 minPurity d
#> 10 maxPurity d
#> # ℹ 15 more rows
ppl$config$get_tidy_schema("puritytsv")
#> # A tibble: 25 × 3
#> field type description
#> <chr> <chr> <chr>
#> 1 purity d purity of tumor in the sample
#> 2 norm_factor d internal factor to convert tumor ratio to cn.
#> 3 score d score of fit (lower is better)
#> 4 diploid_proportion d proportion of cn regions that have 1 (+- 0.2) mi…
#> 5 ploidy d average ploidy of the tumor sample after adjusti…
#> 6 gender c one of male or female
#> 7 status c either pass or one or more warning or fail status
#> 8 polyclonal_proportion d proportion of copy number regions that are more …
#> 9 min_purity d minimum purity with score within 10% of best
#> 10 max_purity d maximum purity with score within 10% of best
#> # ℹ 15 more rows
We can list files that can be parsed with list_files:
(lf <- ppl$list_files())
#> # A tibble: 11 × 10
#> tool_parser parser bname size lastmodified path pattern prefix
#> <glue> <chr> <chr> <fs::b> <dttm> <chr> <chr> <glue>
#> 1 purple_cnvgene… cnvge… LXXX… 1.44K 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 2 purple_cnvsomt… cnvso… LXXX… 1.32K 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 3 purple_driverc… drive… LXXX… 819 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 4 purple_driverc… drive… LXXX… 468 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 5 purple_germdel… germd… LXXX… 1.24K 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 6 purple_purityr… purit… LXXX… 484 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 7 purple_purityt… purit… LXXX… 462 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 8 purple_qc qc LXXX… 228 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 9 purple_somclon… somcl… LXXX… 451 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 10 purple_somhist somhi… LXXX… 138 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 11 purple_version versi… purp… 39 2025-07-14 11:38:10 /hom… "^purp… versi…
#> # ℹ 2 more variables: schema <list>, group <glue>
lf |> dplyr::slice(1) |> str()
#> tibble [1 × 10] (S3: tbl_df/tbl/data.frame)
#> $ tool_parser : 'glue' chr "purple_cnvgenetsv"
#> $ parser : chr "cnvgenetsv"
#> $ bname : chr "LXXXXXXX.purple.cnv.gene.tsv"
#> $ size : 'fs_bytes' num 1.44K
#> $ lastmodified: POSIXct[1:1], format: "2025-07-14 11:38:10"
#> $ path : chr "/home/runner/work/_temp/Library/tidywigits/extdata/purple/LXXXXXXX.purple.cnv.gene.tsv"
#> $ pattern : chr "\\.purple\\.cnv\\.gene\\.tsv$"
#> $ prefix : 'glue' chr "LXXXXXXX"
#> $ schema :List of 1
#> ..$ : tibble [18 × 2] (S3: tbl_df/tbl/data.frame)
#> .. ..$ field: chr [1:18] "chromosome" "start" "end" "gene" ...
#> .. ..$ type : chr [1:18] "c" "d" "d" "c" ...
#> $ group : 'glue' chr ""
We can parse and tidy files of interest using the tidy function. Note that tidy is called for its side effect on the object, so its result does not need to be assigned:
# this will create a new field tbls containing the tidy data (and optionally
# the 'raw' parsed data)
ppl$tidy(tidy = TRUE, keep_raw = TRUE)
ppl$tbls
#> # A tibble: 11 × 11
#> tool_parser parser bname size lastmodified path pattern prefix
#> <glue> <chr> <chr> <fs::b> <dttm> <chr> <chr> <glue>
#> 1 purple_cnvgene… cnvge… LXXX… 1.44K 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 2 purple_cnvsomt… cnvso… LXXX… 1.32K 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 3 purple_driverc… drive… LXXX… 819 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 4 purple_driverc… drive… LXXX… 468 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 5 purple_germdel… germd… LXXX… 1.24K 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 6 purple_purityr… purit… LXXX… 484 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 7 purple_purityt… purit… LXXX… 462 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 8 purple_qc qc LXXX… 228 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 9 purple_somclon… somcl… LXXX… 451 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 10 purple_somhist somhi… LXXX… 138 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 11 purple_version versi… purp… 39 2025-07-14 11:38:10 /hom… "^purp… versi…
#> # ℹ 3 more variables: group <glue>, raw <list>, tidy <list>
ppl$tbls$raw[[8]] |> dplyr::glimpse()
#> Rows: 1
#> Columns: 12
#> $ QCStatus <chr> "PASS"
#> $ Method <chr> "NORMAL"
#> $ CopyNumberSegments <chr> "354"
#> $ UnsupportedCopyNumberSegments <chr> "110"
#> $ Purity <chr> "0.8700"
#> $ AmberGender <chr> "MALE"
#> $ CobaltGender <chr> "MALE"
#> $ DeletedGenes <chr> "0"
#> $ Contamination <chr> "0.0"
#> $ GermlineAberrations <chr> "NONE"
#> $ AmberMeanDepth <chr> "73"
#> $ LohPercent <chr> "0.0321"
# the tidy tibbles are nested to allow for more than one tidy tibble per file
ppl$tbls$tidy[[8]][["data"]][[1]] |> dplyr::glimpse()
#> Rows: 1
#> Columns: 12
#> $ qc_status <chr> "PASS"
#> $ method <chr> "NORMAL"
#> $ cn_segments <int> 354
#> $ unsupported_cn_segments <int> 110
#> $ purity <dbl> 0.87
#> $ gender_amber <chr> "MALE"
#> $ gender_cobalt <chr> "MALE"
#> $ deleted_genes <int> 0
#> $ contamination <dbl> 0
#> $ germline_aberrations <chr> "NONE"
#> $ mean_depth_amber <dbl> 73
#> $ loh_percent <dbl> 0.0321
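The reason no assignment is needed is that R6 objects have reference semantics, so a method can store its result on the object itself. The sketch below illustrates this with a hypothetical Tidier class; it is not the {tidywigits} Tool class, and returning invisible(self) is an assumption of the sketch.

```r
library(R6)

# Hypothetical class, for illustration only (not the {tidywigits} Tool class).
Tidier <- R6Class(
  "Tidier",
  public = list(
    raw = NULL,
    tbls = NULL,
    initialize = function(raw) {
      self$raw <- raw
    },
    # Stores the result on the object itself and returns the object invisibly.
    tidy = function() {
      out <- self$raw
      names(out) <- tolower(names(out))
      self$tbls <- out
      invisible(self)
    }
  )
)

obj <- Tidier$new(data.frame(QCStatus = "PASS", Method = "NORMAL"))
is.null(obj$tbls) # nothing has been tidied yet
obj$tidy()        # no assignment needed: the object is modified in place
names(obj$tbls)
```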
We can also focus on a subset of files to tidy using the filter_files function. The include and exclude arguments can specify which parsers to include or exclude in the analysis:
# create new Purple object
ppl2 <- Purple$new(path = ppl_path)
ppl2$files
#> # A tibble: 11 × 10
#> tool_parser parser bname size lastmodified path pattern prefix
#> <glue> <chr> <chr> <fs::b> <dttm> <chr> <chr> <glue>
#> 1 purple_cnvgene… cnvge… LXXX… 1.44K 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 2 purple_cnvsomt… cnvso… LXXX… 1.32K 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 3 purple_driverc… drive… LXXX… 819 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 4 purple_driverc… drive… LXXX… 468 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 5 purple_germdel… germd… LXXX… 1.24K 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 6 purple_purityr… purit… LXXX… 484 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 7 purple_purityt… purit… LXXX… 462 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 8 purple_qc qc LXXX… 228 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 9 purple_somclon… somcl… LXXX… 451 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 10 purple_somhist somhi… LXXX… 138 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 11 purple_version versi… purp… 39 2025-07-14 11:38:10 /hom… "^purp… versi…
#> # ℹ 2 more variables: schema <list>, group <glue>
ppl2$filter_files(include = c("purple_qc", "purple_cnvsomtsv"))
ppl2$files
#> # A tibble: 2 × 10
#> tool_parser parser bname size lastmodified path pattern prefix
#> <glue> <chr> <chr> <fs::b> <dttm> <chr> <chr> <glue>
#> 1 purple_cnvsomtsv cnvso… LXXX… 1.32K 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 2 purple_qc qc LXXX… 228 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> # ℹ 2 more variables: schema <list>, group <glue>
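The include and exclude logic can be sketched with base R on a plain data frame. The filter_parsers helper and the file names below are hypothetical and only mimic the behaviour shown above; they are not the package's implementation.

```r
# Hypothetical helper, for illustration only; not the package's implementation.
filter_parsers <- function(files, include = NULL, exclude = NULL) {
  if (!is.null(include)) {
    files <- files[files$tool_parser %in% include, , drop = FALSE]
  }
  if (!is.null(exclude)) {
    files <- files[!files$tool_parser %in% exclude, , drop = FALSE]
  }
  files
}

# Made-up file listing with the same tool_parser naming as above.
files <- data.frame(
  tool_parser = c("purple_cnvsomtsv", "purple_qc", "purple_version"),
  bname = c("s1.purple.cnv.somatic.tsv", "s1.purple.qc", "purple.version")
)

filter_parsers(files, include = c("purple_qc", "purple_cnvsomtsv"))$tool_parser
filter_parsers(files, exclude = "purple_version")$tool_parser
```

In this toy listing both calls keep the same two rows: one by naming what to keep, the other by naming what to drop.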
After tidying the data of interest, we can write the tidy tibbles to various formats, like Apache Parquet, PostgreSQL, CSV/TSV and R’s RDS. Below we can see that the id specified is added to the written files in an additional nemo_id column. This can be used e.g. to distinguish results from different runs in a data pipeline. When writing to a database like PostgreSQL, another column nemo_pfix is used to distinguish results from the same run from the same tool.
ppl2$tidy() # first need to tidy
outdir1 <- tempdir()
fmt <- "tsv"
ppl2$write(odir = outdir1, format = fmt, id = "run123")
(wfiles <- fs::dir_info(outdir1) |> dplyr::select(1:5))
#> # A tibble: 17 × 5
#> path type size permissions modification_time
#> <fs::path> <fct> <fs::b> <fs::perms> <dttm>
#> 1 …XXXXXX_purple_cnvsomtsv.tsv.gz file 631 rw-r--r-- 2025-07-14 11:39:02
#> 2 …3VFi/LXXXXXXX_purple_qc.tsv.gz file 186 rw-r--r-- 2025-07-14 11:39:02
#> 3 …mp/Rtmphs3VFi/file229926947191 file 4.71K rw-r--r-- 2025-07-14 11:39:01
#> 4 …mp/Rtmphs3VFi/file22993ec683a4 file 4.71K rw-r--r-- 2025-07-14 11:39:01
#> 5 …mp/Rtmphs3VFi/file22993f46862b file 4.71K rw-r--r-- 2025-07-14 11:39:01
#> 6 …mp/Rtmphs3VFi/file229940db5ef8 file 4.71K rw-r--r-- 2025-07-14 11:39:00
#> 7 …mp/Rtmphs3VFi/file2299466fc5ab file 4.71K rw-r--r-- 2025-07-14 11:39:00
#> 8 …mp/Rtmphs3VFi/file229947d54ea2 file 4.71K rw-r--r-- 2025-07-14 11:39:00
#> 9 …mp/Rtmphs3VFi/file22994ee3f90e file 4.71K rw-r--r-- 2025-07-14 11:39:02
#> 10 …mp/Rtmphs3VFi/file2299547d2870 file 4.71K rw-r--r-- 2025-07-14 11:39:01
#> 11 …mp/Rtmphs3VFi/file22995865c599 file 4.71K rw-r--r-- 2025-07-14 11:39:00
#> 12 …mp/Rtmphs3VFi/file229962f9985e file 4.71K rw-r--r-- 2025-07-14 11:39:00
#> 13 …mp/Rtmphs3VFi/file229967124ebe file 4.71K rw-r--r-- 2025-07-14 11:39:00
#> 14 …mp/Rtmphs3VFi/file229973c12047 file 4.71K rw-r--r-- 2025-07-14 11:39:02
#> 15 …mp/Rtmphs3VFi/file229976170c00 file 4.71K rw-r--r-- 2025-07-14 11:39:00
#> 16 /tmp/Rtmphs3VFi/file2299c6faa09 file 4.71K rw-r--r-- 2025-07-14 11:39:01
#> 17 …rmarkdown-str22995a45b6db.html file 1.13K rw-r--r-- 2025-07-14 11:38:59
readr::read_tsv(wfiles$path[2], show_col_types = FALSE)
#> # A tibble: 1 × 13
#> nemo_id qc_status method cn_segments unsupported_cn_segments purity
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 run123 PASS NORMAL 354 110 0.87
#> # ℹ 7 more variables: gender_amber <chr>, gender_cobalt <chr>,
#> # deleted_genes <dbl>, contamination <dbl>, germline_aberrations <chr>,
#> # mean_depth_amber <dbl>, loh_percent <dbl>
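The effect of the id argument can be sketched in base R: prepend an identifier column, write a gzipped TSV, and read it back. The write_with_id helper is hypothetical, and the real write function supports more formats and options than shown here.

```r
# Hypothetical helper: prepend an id column and write a gzipped TSV.
write_with_id <- function(d, path, id) {
  out <- cbind(nemo_id = id, d)
  con <- gzfile(path, open = "w")
  on.exit(close(con))
  write.table(out, con, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(path)
}

# Made-up tidy table, for illustration only.
d <- data.frame(qc_status = "PASS", purity = 0.87)
f <- tempfile(fileext = ".tsv.gz")
write_with_id(d, f, id = "run123")

# read.delim() decompresses gzipped files transparently.
back <- read.delim(f)
names(back)
```

Because the identifier travels with every row, tables written from different runs can later be stacked and still be told apart.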
The nemofy function is a convenient wrapper for the process of filtering, tidying, and writing.
ppl3 <- Purple$new(path = ppl_path)
outdir2 <- file.path(tempdir(), "ppl3") |> fs::dir_create()
ppl3$files
#> # A tibble: 11 × 10
#> tool_parser parser bname size lastmodified path pattern prefix
#> <glue> <chr> <chr> <fs::b> <dttm> <chr> <chr> <glue>
#> 1 purple_cnvgene… cnvge… LXXX… 1.44K 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 2 purple_cnvsomt… cnvso… LXXX… 1.32K 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 3 purple_driverc… drive… LXXX… 819 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 4 purple_driverc… drive… LXXX… 468 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 5 purple_germdel… germd… LXXX… 1.24K 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 6 purple_purityr… purit… LXXX… 484 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 7 purple_purityt… purit… LXXX… 462 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 8 purple_qc qc LXXX… 228 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 9 purple_somclon… somcl… LXXX… 451 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 10 purple_somhist somhi… LXXX… 138 2025-07-14 11:38:10 /hom… "\\.pu… LXXXX…
#> 11 purple_version versi… purp… 39 2025-07-14 11:38:10 /hom… "^purp… versi…
#> # ℹ 2 more variables: schema <list>, group <glue>
ppl3$nemofy(
odir = outdir2,
format = "tsv",
id = "run_ppl3",
exclude = c("purple_cnvgenetsv", "purple_cnvsomtsv", "purple_drivercatalog", "purple_germdeltsv")
)
(wfiles2 <- fs::dir_info(outdir2) |> dplyr::select(1:5))
#> # A tibble: 6 × 5
#> path type size permissions modification_time
#> <fs::path> <fct> <fs:> <fs::perms> <dttm>
#> 1 …XXXXXXX_purple_purityrange.tsv.gz file 204 rw-r--r-- 2025-07-14 11:39:03
#> 2 …/LXXXXXXX_purple_puritytsv.tsv.gz file 303 rw-r--r-- 2025-07-14 11:39:03
#> 3 …Fi/ppl3/LXXXXXXX_purple_qc.tsv.gz file 186 rw-r--r-- 2025-07-14 11:39:03
#> 4 …XXXXXX_purple_somclonality.tsv.gz file 154 rw-r--r-- 2025-07-14 11:39:03
#> 5 …l3/LXXXXXXX_purple_somhist.tsv.gz file 108 rw-r--r-- 2025-07-14 11:39:03
#> 6 …pl3/version_purple_version.tsv.gz file 77 rw-r--r-- 2025-07-14 11:39:03
readr::read_tsv(wfiles2$path[2], show_col_types = FALSE)
#> # A tibble: 1 × 26
#> nemo_id purity norm_factor score diploid_proportion ploidy gender status
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 run_ppl3 0.87 0.528 0.841 0.0036 4.05 MALE NORMAL
#> # ℹ 18 more variables: polyclonal_proportion <dbl>, min_purity <dbl>,
#> # max_purity <dbl>, min_ploidy <dbl>, max_ploidy <dbl>,
#> # min_diploid_proportion <dbl>, max_diploid_proportion <dbl>,
#> # somatic_penalty <dbl>, whole_genome_duplication <lgl>,
#> # ms_indels_per_mb <dbl>, ms_status <chr>, tml <dbl>, tml_status <chr>,
#> # tmb_per_mb <dbl>, tmb_status <chr>, sv_tumor_mutational_burden <dbl>,
#> # run_mode <chr>, targeted <lgl>
Workflow
A Workflow consists of a list of one or more Tools. We can construct a certain Workflow with different Tools, which would allow parsing and writing tidy tables from a variety of bioinformatic tools. See ?Workflow.
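The idea of a workflow holding a list of tools and delegating to each of them can be sketched with two small R6 classes. MiniTool and MiniWorkflow are hypothetical names and the file names are made up; the real Workflow class is considerably richer.

```r
library(R6)

# Hypothetical classes, for illustration only.
MiniTool <- R6Class(
  "MiniTool",
  public = list(
    name = NULL,
    files = NULL,
    initialize = function(name, files) {
      self$name <- name
      self$files <- files
    },
    list_files = function() {
      data.frame(tool = self$name, bname = self$files)
    }
  )
)

MiniWorkflow <- R6Class(
  "MiniWorkflow",
  public = list(
    tools = NULL,
    initialize = function(tools) {
      self$tools <- tools
    },
    # Delegate to each tool and stack the per-tool results.
    list_files = function() {
      do.call(rbind, lapply(self$tools, function(tool) tool$list_files()))
    }
  )
)

wf <- MiniWorkflow$new(list(
  MiniTool$new("purple", c("s1.purple.qc", "purple.version")),
  MiniTool$new("linx", "s1.linx.breakend.tsv")
))
wf$list_files()
```

Because the workflow only relies on each tool exposing list_files, new tools can be added to the list without changing the workflow code.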
The Oncoanalyser Nextflow pipeline uses several tools from WiGiTS, and we can construct an Oncoanalyser class as a Workflow child based on a suite of Tools under the tidywigits R package. Similarly to Tool, a Workflow object contains functions such as filter_files, list_files, tidy, write and nemofy:
oa <- system.file("extdata/purple", package = "tidywigits") |>
Oncoanalyser$new()
outdir3 <- file.path(tempdir(), "oa") |> fs::dir_create()
oa$list_files()
#> # A tibble: 11 × 10
#> parser bname size lastmodified path pattern tool_parser prefix
#> <chr> <chr> <fs::b> <dttm> <chr> <chr> <glue> <glue>
#> 1 cnvgenetsv LXXX… 1.44K 2025-07-14 11:38:10 /hom… "\\.pu… purple_cnv… LXXXX…
#> 2 cnvsomtsv LXXX… 1.32K 2025-07-14 11:38:10 /hom… "\\.pu… purple_cnv… LXXXX…
#> 3 drivercat… LXXX… 819 2025-07-14 11:38:10 /hom… "\\.pu… purple_dri… LXXXX…
#> 4 drivercat… LXXX… 468 2025-07-14 11:38:10 /hom… "\\.pu… purple_dri… LXXXX…
#> 5 germdeltsv LXXX… 1.24K 2025-07-14 11:38:10 /hom… "\\.pu… purple_ger… LXXXX…
#> 6 purityran… LXXX… 484 2025-07-14 11:38:10 /hom… "\\.pu… purple_pur… LXXXX…
#> 7 puritytsv LXXX… 462 2025-07-14 11:38:10 /hom… "\\.pu… purple_pur… LXXXX…
#> 8 qc LXXX… 228 2025-07-14 11:38:10 /hom… "\\.pu… purple_qc LXXXX…
#> 9 somclonal… LXXX… 451 2025-07-14 11:38:10 /hom… "\\.pu… purple_som… LXXXX…
#> 10 somhist LXXX… 138 2025-07-14 11:38:10 /hom… "\\.pu… purple_som… LXXXX…
#> 11 version purp… 39 2025-07-14 11:38:10 /hom… "^purp… purple_ver… versi…
#> # ℹ 2 more variables: schema <list>, group <glue>
x <- oa$nemofy(
odir = outdir3,
format = "tsv",
id = "oa_run1",
exclude = c("cobalt_ratiotsv", "amber_baftsv", "isofox_altsj", "isofox_transdata")
)
(wfiles3 <- fs::dir_info(outdir3) |> dplyr::select(1:5))
#> # A tibble: 11 × 5
#> path type size permissions modification_time
#> <fs::path> <fct> <fs:> <fs::perms> <dttm>
#> 1 …line_purple_drivercatalog.tsv.gz file 363 rw-r--r-- 2025-07-14 11:39:04
#> 2 …XXXXXXX_purple_cnvgenetsv.tsv.gz file 444 rw-r--r-- 2025-07-14 11:39:04
#> 3 …LXXXXXXX_purple_cnvsomtsv.tsv.gz file 633 rw-r--r-- 2025-07-14 11:39:04
#> 4 …XXXX_purple_drivercatalog.tsv.gz file 293 rw-r--r-- 2025-07-14 11:39:04
#> 5 …XXXXXXX_purple_germdeltsv.tsv.gz file 479 rw-r--r-- 2025-07-14 11:39:04
#> 6 …XXXXXX_purple_purityrange.tsv.gz file 202 rw-r--r-- 2025-07-14 11:39:04
#> 7 …LXXXXXXX_purple_puritytsv.tsv.gz file 303 rw-r--r-- 2025-07-14 11:39:04
#> 8 …VFi/oa/LXXXXXXX_purple_qc.tsv.gz file 185 rw-r--r-- 2025-07-14 11:39:04
#> 9 …XXXXX_purple_somclonality.tsv.gz file 153 rw-r--r-- 2025-07-14 11:39:04
#> 10 …a/LXXXXXXX_purple_somhist.tsv.gz file 107 rw-r--r-- 2025-07-14 11:39:04
#> 11 …oa/version_purple_version.tsv.gz file 76 rw-r--r-- 2025-07-14 11:39:04
readr::read_tsv(wfiles3$path[5], show_col_types = FALSE)
#> # A tibble: 10 × 17
#> nemo_id gene chromosome chromosome_band region_start region_end
#> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 oa_run1 ZYG11B chr1 p32.3 52797001 52798000
#> 2 oa_run1 RPL31P12 chr1 p31.1 72300641 72346158
#> 3 oa_run1 AKNAD1 chr1 p13.3 108824397 108829338
#> 4 oa_run1 SLC16A1-AS1 chr1 p13.2 113006001 113012000
#> 5 oa_run1 LRIG2-DT chr1 p13.2 113006001 113012000
#> 6 oa_run1 GABPB2 chr1 q21.3 151073001 151117000
#> 7 oa_run1 RPS29P29 chr1 q21.3 151073001 151117000
#> 8 oa_run1 LCE1E chr1 q21.3 152787001 152798000
#> 9 oa_run1 LCE1D chr1 q21.3 152787001 152798000
#> 10 oa_run1 PTPRVP chr1 q32.1 202171001 202173000
#> # ℹ 11 more variables: depth_window_count <dbl>, exon_start <dbl>,
#> # exon_end <dbl>, detection_method <chr>, germline_status <chr>,
#> # tumor_status <chr>, germline_cn <dbl>, tumor_cn <dbl>, filter <chr>,
#> # cohort_frequency <dbl>, reported <lgl>